feat: add vision transfer backbones and IMF variants

This commit is contained in:
Logic
2026-04-09 14:02:24 +08:00
parent d51b3ecafa
commit ff7c9c1f2a
58 changed files with 2788 additions and 26 deletions

View File

@@ -0,0 +1,92 @@
# LEWM ViT Backbone Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k runs on the 5880 machine.
**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.
**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.
---
### Task 1: Add failing tests for LEWM joint-vision backbone contract
**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
### Task 2: Implement LEWM joint-multiview frozen backbone
**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed
- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
### Task 3: Add minimal agent support for joint visual dim
**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
### Task 4: Add Hydra configs for LEWM backbone training
**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**
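A one-shot compose smoke could look like the sketch below; the `roboimi.vla.conf` config module and the `train_vla` top-level config name are assumptions about the repo layout, so adjust them to the actual Hydra entry point.

```python
# Minimal Hydra compose smoke (a sketch, not the final verification script).
# Assumes the conf tree is importable as roboimi.vla.conf and that a top-level
# config named "train_vla" pulls in the data/agent groups.
from hydra import compose, initialize_config_module

with initialize_config_module(config_module="roboimi.vla.conf", version_base=None):
    cfg = compose(config_name="train_vla", overrides=["agent=lewm_imf_attnres"])
    assert cfg.agent.head.cond_dim == 208
    assert cfg.agent.vision_backbone.joint_output_dim == 192
    assert list(cfg.agent.vision_backbone.fused_camera_names) == ["front", "top", "r_vis"]
```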
### Task 5: Verify focused local tests
**Files:**
- Reuse the above
- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**
### Task 6: Sync to 5880 and remote smoke with real checkpoint
**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
### Task 7: Launch two real 50k runs on the 5880 machine
**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`
- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
### Task 8: Update experiment tracking docs and commit
**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`
- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**

View File

@@ -0,0 +1,81 @@
# ResNet Multitoken IMF Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step and launch four L20 experiments for `n_emb in {256,384}` and `n_layer in {12,16}`.
**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.
**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.
---
### Task 1: Add failing tests for multi-token conditioning
**Files:**
- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`
- [ ] **Step 1: Add a direct agent test**
- Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
- Assert state is paired with each camera token, not concatenated across cameras first (see the test sketch after this task).
- [ ] **Step 2: Add Hydra wiring test**
- Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
- Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
- `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
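A hedged sketch of the Step 1 stub-backbone test; the `make_agent` fixture and the exact `VLAAgent` constructor / `_build_cond` signature are assumptions, not the repo's verified API.

```python
# Sketch only: stub a rank-4 vision backbone and check the cond sequence shape.
import torch
import torch.nn as nn

class StubMultiTokenBackbone(nn.Module):
    """Returns per-camera tokens shaped (B, T, num_cams, D)."""
    def __init__(self, num_cams=3, dim=64):
        super().__init__()
        self.camera_names = ("r_vis", "top", "front")
        self.num_cameras = num_cams
        self.output_dim = dim
        self.tokens_per_step = num_cams  # advertised token count per obs step

    def forward(self, images):
        b, t = next(iter(images.values())).shape[:2]
        return torch.zeros(b, t, self.num_cameras, self.output_dim)

def test_build_cond_emits_one_token_per_camera(make_agent):
    # `make_agent` is a hypothetical fixture that wires the stub into VLAAgent.
    agent = make_agent(vision_backbone=StubMultiTokenBackbone(), obs_dim=16, obs_horizon=2)
    images = {c: torch.zeros(4, 2, 3, 224, 224) for c in ("r_vis", "top", "front")}
    states = torch.zeros(4, 2, 16)
    cond = agent._build_cond(images, states)
    assert cond.shape == (4, 2 * 3, agent.per_step_cond_dim)
```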
### Task 2: Implement multi-token ResNet conditioning path
**Files:**
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`
- [ ] **Step 1: Extend ResNet backbone**
- Add an opt-in flag to return `(B,T,num_cams,D)` camera tokens instead of one concatenated `(B,T,num_cams*D)` token.
- Keep standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
- Support visual features with rank 4 `(B,T,K,D)`.
- Broadcast state to `(B,T,K,D_state)`, concatenate per camera, apply projector per token, then flatten to `(B,T*K,D_cond)`.
- Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
- Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add Hydra config**
- New agent config uses:
- separate ResNet-18 per camera
- standard residual vision trunk (`vision_backbone_mode=resnet`)
- condition projector output dim tied to `${agent.head.n_emb}`
- rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
### Task 3: Verify locally
**Files:**
- Modify only if verification reveals issues
- [ ] **Step 1: Run focused tests and make them pass**
- `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run regression subset**
- `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run local smoke instantiation**
- instantiate the new Hydra config and verify cond shape / sequence length
### Task 4: Launch 4 L20 experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke the new config on remote**
- [ ] **Step 3: Launch runs**
- `(n_emb=256, n_layer=12)`
- `(n_emb=256, n_layer=16)`
- `(n_emb=384, n_layer=12)`
- `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
- rollout episodes `10`
- `pred_horizon=16`
- `num_action_steps=8`
- standard ResNet-18 vision trunk
- three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**

View File

@@ -0,0 +1,78 @@
# SigLIP2 Multiview VLA Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.
**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.
**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.
---
### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning
**Files:**
- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write failing backbone tests**
- Instantiate the new backbone with a stub SigLIP2 vision model.
- Assert raw dataset resize is `None`, eval resize is `(256, 256)`, output shape is `(B, T, 3 * per_view_output_dim)`.
- Assert three views are encoded independently and projected.
- [ ] **Step 2: Run focused tests and verify RED**
- Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
- Expect failure because the backbone/config/projector do not exist yet.
- [ ] **Step 3: Extend agent wiring tests**
- Add a Hydra/instantiate test for a new SigLIP2 IMF config.
- Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim `384`, and head `cond_dim == 384`.
### Task 2: Implement SigLIP2 backbone and optional condition projector
**Files:**
- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`
- [ ] **Step 1: Implement backbone**
- Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
- Normalize `[0,1]` pixels with mean/std `0.5` and encode each view independently.
- Project each 768-d pooled feature to configurable per-view dim and concatenate across cameras.
- [ ] **Step 2: Implement optional condition projector**
- Allow `VLAAgent` to accept `cond_projector`.
- Track `raw_per_step_cond_dim` and projected `per_step_cond_dim` / `global_cond_dim`.
- Apply the projector in `_build_cond()` after visual+state concatenation.
- [ ] **Step 3: Add Hydra configs**
- New agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
- Backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
### Task 3: Verify locally and prepare remote execution
**Files:**
- Modify as needed only if tests/smoke reveal issues
- [ ] **Step 1: Run focused tests and make them pass**
- `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 2: Run a local smoke instantiation**
- Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**
### Task 4: Sync to 5880 and launch experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Stop superseded remote jobs**
- [ ] **Step 2: Sync updated code to remote**
- Prefer `rsync` or `git push/pull` without overwriting unrelated files.
- [ ] **Step 3: Remote smoke test**
- Confirm SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
- Confirm headless rollout path still uses `256x256` eval resize.
- [ ] **Step 4: Launch experiment A**
- `per_view_output_dim=96`, `embed=384`, `layer=12`, `pred=16`, `exec=8`, `steps=50000`.
- [ ] **Step 5: Launch experiment B**
- `per_view_output_dim=192`, same other hyperparameters.
- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.**

View File

@@ -0,0 +1,138 @@
# LEWM ViT Backbone Replacement Design
## Goal
Replace the ResNet visual encoder in the current roboimi VLA policy with the frozen ViT visual encoder (encoder + projector) from the LEWM checkpoint, using only the 192-d embedding of the final CLS token as the visual feature.
## User constraints
- Use the trained checkpoint confirmed in `/home/droid/下载/lewm_sim_transfer_checkpoint_usage.md`
- Use only the visual encoding part: `encoder + projector`
- Keep the weights frozen
- Keep the overall processing scheme of "concatenate visual features with state, then feed into the diffusion transformer"
- Use three views as input: `[r_vis, top, front]`
- Launch two training runs on the 5880 machine: `embed=384/layer=12` and `embed=256/layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- Train each run for `50k` steps
- Use `10` episodes per rollout validation (instead of the previous `5`)
## Trusted existing facts
1. LEWM checkpoint path:
- `/home/droid/le-wm/lewm-sim-transfer/pa1w85md8jop6bvol8oxp/checkpoints/epoch=99-step=47800.ckpt`
2. state_dict prefixes to load:
- `model.encoder.*`
- `model.projector.*`
3. LEWM ViT configuration:
- encoder scale: `tiny`
- hidden size: `192`
- layers: `12`
- attention heads: `3`
- patch size: `14`
- projector: `MLP(192 -> 2048 -> 192)` with `BatchNorm1d + GELU`
4. During LEWM training, the three views are first stitched into a single image and fed into one ViT encoder; the resulting overall visual embedding is **192-d**
## Key design decision
### Chosen design: fuse 3 cameras into one LEWM-style image, output one 192-d visual vector per timestep
Do not treat the LEWM ViT as a per-camera 192-d encoder; instead, follow the original LEWM training scheme:
- Take the three-view image dict `{r_vis, top, front}` as input
- Stitch the views in a fixed order into one fused image
- Run a single frozen ViT + projector
- Obtain one **192-d total visual feature**
### Why this is the right replacement
The **total visual feature dimension** that the current ResNet backbone exposes to the policy head is:
- `64` per camera
- `192` across three cameras
The CLS/projector embedding from the LEWM checkpoint is likewise:
- `192` in total
Therefore, the most natural way to drop-in replace the current ResNet visual encoder is:
- Have the LEWM backbone directly produce one 192-d total visual vector
- After concatenation with the `16-d` state, this still yields a `208-d` condition vector
- Leave the overall interface and semantics of the diffusion head unchanged
## Interface compatibility plan
The existing `VLAAgent` assumes the backbone exposes:
- `camera_names`
- `num_cameras`
- `output_dim` (semantically the per-camera feature dimension)
- `forward(images_dict) -> (B, T, total_visual_dim)`
To stay compatible with the existing agent with minimal changes:
- The new LEWM backbone's `forward()` returns `(B, T, 192)`
- `camera_names = ('r_vis', 'top', 'front')`
- `num_cameras = 3`
- `output_dim = 64`
`VLAAgent` then still computes internally:
- `per_step_cond_dim = output_dim * num_cams + obs_dim = 64*3 + 16 = 208`
which matches the actual `forward()` output of `192 + 16 = 208`.
> In other words, `output_dim` in this backbone is kept as a per-camera placeholder dimension equivalent to the old ResNet's total feature, not the real projector output dimension. It is a compatibility shim that avoids changing the agent's main logic.
## Image preprocessing design
The current roboimi dataset already loads each camera image as:
- `(C, 224, 224)`
- values in `[0, 1]`
The new LEWM backbone will:
1. Take `r_vis`, `top`, `front` in order
2. Concatenate them along the width dimension into a fused image
- `(C, 224, 672)`
3. Apply the same ImageNet normalization as LEWM
- mean `[0.485, 0.456, 0.406]`
- std `[0.229, 0.224, 0.225]`
4. Call `ViTModel(..., interpolate_pos_encoding=True)`
5. Take `last_hidden_state[:, 0]` (the CLS token)
6. Feed it into the frozen projector to obtain `(B*T, 192)`
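A minimal forward sketch of this path, assuming the `transformers` `ViTModel` API and an already-constructed frozen `projector`; class and attribute names here are illustrative rather than the final implementation:

```python
import torch
import torch.nn as nn
from transformers import ViTModel

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

class LEWMViTBackboneSketch(nn.Module):
    """Illustrative only: fuse three views, run the frozen ViT, emit one 192-d vector per timestep."""
    camera_names = ("r_vis", "top", "front")
    num_cameras = 3
    output_dim = 64         # compatibility shim: 64 * 3 == 192
    joint_output_dim = 192

    def __init__(self, encoder: ViTModel, projector: nn.Module):
        super().__init__()
        self.encoder, self.projector = encoder, projector
        for p in self.parameters():
            p.requires_grad_(False)
        self.eval()  # keep the projector's BatchNorm statistics frozen (a train() override would preserve this)

    def forward(self, images: dict) -> torch.Tensor:
        # images[cam]: (B, T, 3, 224, 224), values in [0, 1]
        fused = torch.cat([images[c] for c in self.camera_names], dim=-1)  # (B, T, 3, 224, 672)
        b, t = fused.shape[:2]
        x = fused.flatten(0, 1)                              # (B*T, 3, 224, 672)
        x = (x - IMAGENET_MEAN.to(x)) / IMAGENET_STD.to(x)   # LEWM / ImageNet normalization
        cls = self.encoder(x, interpolate_pos_encoding=True).last_hidden_state[:, 0]
        feat = self.projector(cls)                           # (B*T, 192)
        return feat.view(b, t, self.joint_output_dim)        # (B, T, 192)
```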
## Files to create / modify
### New files
- `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- `tests/test_lewm_vit_backbone.py`
### Modified files
- `roboimi/vla/models/backbones/__init__.py` (if exports are needed)
- `tests/test_imf_vla_agent.py` (add an integration case for the new backbone)
- `roboimi/demos/vla_scripts/train_vla.py` (only if rollout defaults/logging need adjusting; prefer command-line overrides over touching the main logic)
- training/experiment suite docs (add the record for this LEWM ViT training round)
## Testing plan
1. **Unit test: load + forward**
- Use a synthetic checkpoint to verify that the new backbone correctly loads `model.encoder.*` and `model.projector.*`
- Input: 3 cameras, `(B,T,C,224,224)`
- Output: `(B,T,192)`
2. **Agent integration test**
- backbone.output_dim=64, num_cameras=3
- the agent's `_build_cond()` output has a last dimension of `208`
3. **Remote smoke test on 5880**
- Use the real checkpoint
- `max_steps=2`
- Smoke each of the two experiments once
4. **Full run**
- GPU0: `embed=384, layer=12`
- GPU1: `embed=256, layer=12`
- `rollout_num_episodes=10`
## Training launch contract
- host: `100.73.14.65`
- code dir: `/home/droid/roboimi_suite_20260404`
- python: `/home/droid/miniforge3/envs/roboimi/bin/python`
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[r_vis, top, front]`
- agent: new `lewm_imf_attnres`
- max_steps: `50000`
- rollout every `5` epochs
- rollout episodes: `10`
## Risks
1. If the fused-image preprocessing orientation differs from LEWM training (224x672 vs 672x224), it will cause a distribution shift.
2. The current roboimi env must have `transformers` installed; `environment.yml` suggests the local env already has it, but the remote training env needs a smoke check to confirm.
3. Because this is a frozen ViT + projector, the projector's BatchNorm statistics would drift if it stayed in train mode, so the whole module must be frozen and kept in `eval()`.
## Recommended first implementation path
- First implement a standalone `LEWMViTBackbone` class without touching the main logic of the existing `ResNetDiffusionBackbone`.
- Then wire it in through new Hydra backbone/agent configs.
- Prioritize minimal intrusion, a runnable smoke, and remote trainability.

View File

@@ -0,0 +1,32 @@
# ResNet Multitoken IMF Design
**Status:** user-specified architecture, treated as approved on 2026-04-06.
## Goal
Keep a standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from one concatenated multiview token per obs step into three camera-specific condition tokens per obs step.
## Approved architecture
- Vision trunk: standard `resnet18` residual network
- Cameras: `front`, `top`, `r_vis`
- Each camera uses its **own** ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
- Each camera produces one visual token
- For each obs step and each camera:
1. take that camera visual token
2. concatenate robot state
3. project to one condition token
- IMF input should receive **3 condition tokens per obs step**, not one concatenated token
- With `obs_horizon=2`, IMF cond sequence length becomes `2 * 3 = 6`
- IMF head remains on the existing IMF/AttnRes implementation path
- Vision trunk remains standard ResNet; **no AttnRes vision replacement**
## Design choices
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
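The cond-building change can be summarized with the sketch below (assuming a per-token `cond_projector`; names are illustrative, not the final implementation):

```python
import torch

def build_multitoken_cond(visual: torch.Tensor, state: torch.Tensor, cond_projector) -> torch.Tensor:
    """visual: (B, T, K, D_vis) per-camera tokens; state: (B, T, D_state)."""
    b, t, k, _ = visual.shape
    state = state.unsqueeze(2).expand(-1, -1, k, -1)  # pair the state with every camera token
    cond = torch.cat([visual, state], dim=-1)         # (B, T, K, D_vis + D_state)
    cond = cond_projector(cond)                       # per-token projection to the token width (e.g. n_emb)
    return cond.reshape(b, t * k, cond.shape[-1])     # one cond sequence of length T * K

# With obs_horizon=2 and 3 cameras, the IMF head sees a cond sequence of length 6.
```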
## Files expected to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/agent.py`
- new Hydra agent config for the multitoken ResNet IMF variant
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`

View File

@@ -0,0 +1,41 @@
# SigLIP2 Multiview VLA Design
**Status:** user-specified architecture, treated as approved on 2026-04-06
## Goal
Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.
## Approved architecture
- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
- dataset images stay at native `256x256` (no dataset-side resize)
- eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
1. `768 -> 96`
2. `768 -> 192`
- Conditioning pipeline:
1. concatenate 3 projected camera vectors
2. concatenate robot state
3. project concatenated condition to `384`
4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for requested experiments:
- `n_emb=384`
- `n_layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- rollout count for validation: keep current requested behavior on this branch unless explicitly overridden later
## Design decisions
- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
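A minimal sketch of this conditioning flow, assuming the SigLIP vision API named in the plan and a single shared per-view linear projector (all class and parameter names here are illustrative):

```python
import torch
import torch.nn as nn

class SigLIP2MultiviewSketch(nn.Module):
    """Illustrative only: shared frozen encoder, per-view projection, 384-d per-step condition."""
    def __init__(self, encoder: nn.Module, per_view_output_dim: int = 96,
                 num_views: int = 3, obs_dim: int = 16, cond_dim: int = 384):
        super().__init__()
        self.encoder = encoder.eval().requires_grad_(False)   # frozen SigLIP2 vision tower
        self.view_proj = nn.Linear(768, per_view_output_dim)  # trainable per-view projector
        self.cond_proj = nn.Linear(num_views * per_view_output_dim + obs_dim, cond_dim)

    def forward(self, views: list, state: torch.Tensor) -> torch.Tensor:
        feats = []
        for pixels in views:                                          # each: (B, 3, 256, 256) in [0, 1]
            pixels = (pixels - 0.5) / 0.5                             # mean/std 0.5 normalization
            pooled = self.encoder(pixel_values=pixels).pooler_output  # (B, 768) pooled image feature
            feats.append(self.view_proj(pooled))                      # (B, per_view_output_dim)
        cond = torch.cat(feats + [state], dim=-1)                     # projected views + robot state
        return self.cond_proj(cond)                                   # (B, 384) per-step condition
```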
## Files expected to change
- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training

View File

@@ -0,0 +1,69 @@
# Camera Ablation Summary (`pred_horizon=16`, `num_action_steps=8`, ResNet IMF)
- Generated: 2026-04-05
- Common setup: original ResNet vision backbone, `n_emb=384`, `n_layer=12`, `batch_size=80`, `lr=2.5e-4`, `max_steps=50k`, rollout every 5 epochs with 5 episodes, headless eval.
- Metric for comparison: `checkpoints/vla_model_best.pt -> rollout_avg_reward`.
## Leaderboard
| Rank | Cameras | Best avg_reward | Best step | Final loss | Run name |
|---:|---|---:|---:|---:|---|
| 1 | `top + front` | **274.8** | 48124 | 0.0056 | `imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023` |
| 2 | `top` | **271.2** | 43749 | 0.0052 | `imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844` |
| 3 | `r_vis + front` | **244.0** | 21874 | 0.0043 | `imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029` |
| 4 | `r_vis` | **6.4** | 17499 | 0.0047 | `imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844` |
| 5 | `r_vis + top` | **1.2** | 4374 | 0.0047 | `imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844` |
| 6 | `front` | **0.0** | 4374 | 0.0074 | `imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607` |
## Main takeaways
1. **`top` is the most critical single-camera view**: `top only = 271.2`, nearly matching `top + front = 274.8`.
2. **`front` alone is almost useless**: `front only = 0.0`.
3. **`r_vis` alone is also largely ineffective**: `r_vis only = 6.4`.
4. **`r_vis + front` clearly outperforms `front` or `r_vis` alone**, indicating some complementarity between the two views, but it is still well below any normally behaving configuration that includes `top`.
5. **`r_vis + top` performs anomalously badly**: only `1.2`, far below `top only = 271.2`. Simply adding `r_vis` does not guarantee a gain and can even break learning under the current setup.
6. **Training loss and rollout reward clearly disagree**: for example, both `r_vis + top` and `r_vis only` reach low final losses yet poor rewards, so model selection in this batch must use rollout reward rather than loss.
## Horizontal comparison views
### Single-camera comparison
- `top`: **271.2**
- `r_vis`: **6.4**
- `front`: **0.0**
Conclusion: **`top >>> r_vis > front`**.
### Two-camera comparison
- `top + front`: **274.8**
- `r_vis + front`: **244.0**
- `r_vis + top`: **1.2**
Conclusions:
- **The safest two-camera combination is `top + front`**.
- `r_vis + front` works, but is worse than `top + front`.
- `r_vis + top` is nearly non-functional under the current setup.
### Incremental effect of adding a second view
- Adding `front` on top of `top`: `271.2 -> 274.8`, **a very small gain**.
- Adding `r_vis` on top of `front`: `0.0 -> 244.0`, **a large gain**.
- Adding `r_vis` on top of `top`: `271.2 -> 1.2`, **a severe regression**.
## Practical recommendation
Choosing only among these 6 experiments:
- **First choice**: `top + front`
- **Second choice**: `top only`
- If `top` must be excluded: `r_vis + front` is clearly better than `front only` / `r_vis only`
- **Not recommended**: `r_vis + top`
## Note relative to previous 3-camera baseline
The previous 3-camera `[r_vis, top, front]` run reached a best reward of **610.8**.
The best result of these 6 camera ablations (`top + front = 274.8`) therefore shows that:
- in this training batch, **removing any single view falls far short of the previous 3-camera best**;
- but if a view must be dropped, **`top` remains the most essential one to keep**.

View File

@@ -0,0 +1,8 @@
# CHECKLIST
- [x] Confirm remote free GPU
- [x] Create front-only run contract
- [x] Remote smoke test passes
- [x] Launch 50k run on remote GPU0
- [x] Record pid / log / SwanLab
- [x] Report status back to user

View File

@@ -0,0 +1,28 @@
# PLAN
## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only the `front` camera as image conditioning.
## Fixed comparison contract
- Same as the active `top/front` run except image input is reduced to `[front]`
- Agent: `resnet_imf_attnres`
- Vision backbone mode: `resnet`
- `pred_horizon=16`, `num_action_steps=8`
- `n_emb=384`, `n_layer=12`, `n_head=1`, `n_kv_head=1`
- `inference_steps=1`
- `batch_size=80`, `lr=2.5e-4`, cosine, warmup=2000
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[front]` only
- rollout every 5 epochs with 5 episodes, headless
## Resource plan
- Host: `100.119.99.14`
- GPU: `0`
## Important dimension override
- Single-camera per-step cond dim = `64 (visual) + 16 (state) = 80`, so override `agent.head.cond_dim=80` and `agent.num_cams=1`.
## Execution path
1. 2-step smoke test on remote GPU0.
2. If smoke passes, launch 50k main run with SwanLab.
3. Record pid / run_dir / log / URL locally.

View File

@@ -0,0 +1,6 @@
# Notes
- 2026-04-05 09:55:27: remote 2-step smoke passed on `100.119.99.14` GPU0 with `front` only, batch=80, no OOM.
- 2026-04-05 09:56:26: launched main run `imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607`.
- 2026-04-05 09:57:36: confirmed training is stable through step 200, latest loss 0.2830.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/7kdii8oc6tjkcyu5y0lwq

View File

@@ -0,0 +1,51 @@
{
"suite_name": "2026-04-05-front-only-resnet-1cam",
"updated_at": "2026-04-05 09:57:36",
"phase": "running",
"baseline_reference": {
"source_run": "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
"notes": "Same hyperparameters as the active top/front run, but image input is reduced to [front] only."
},
"smoke_test": {
"status": "passed",
"host": "100.119.99.14",
"gpu": 0,
"run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-frontonly-resnet-ph16-ex08-20260405-095509",
"batch_size": 80,
"max_steps": 2,
"note": "2-step remote CUDA smoke passed on L20 GPU0 without OOM."
},
"main_run": {
"status": "running",
"host": "100.119.99.14",
"gpu": 0,
"launch_pid": 158874,
"pid": 158877,
"run_name": "imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607",
"run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607",
"log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607/train_vla.log",
"launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607.launch.log",
"dataset_dir": "/home/droid/sim_dataset/sim_transfer",
"camera_names": [
"front"
],
"pred_horizon": 16,
"num_action_steps": 8,
"head_cond_dim": 80,
"head_n_emb": 384,
"head_n_layer": 12,
"vision_backbone_mode": "resnet",
"pretrained_backbone_weights": null,
"freeze_backbone": false,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 5,
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/7kdii8oc6tjkcyu5y0lwq",
"latest_step": 200,
"latest_loss": 0.283,
"process_running": true
}
}

View File

@@ -0,0 +1,8 @@
# CHECKLIST
- [x] Confirm camera mapping (`right` -> `r_vis`)
- [x] Create front+r_vis run contract
- [x] Remote smoke test passes
- [x] Launch 50k run on remote GPU1
- [x] Record pid / log / SwanLab
- [x] Report status back to user

View File

@@ -0,0 +1,23 @@
# PLAN
## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using `front` + `r_vis` cameras only.
## Fixed comparison contract
- Same hyperparameters as the active top/front and front-only runs
- Agent: `resnet_imf_attnres`
- Vision backbone mode: `resnet`
- `pred_horizon=16`, `num_action_steps=8`
- `n_emb=384`, `n_layer=12`, `n_head=1`, `n_kv_head=1`
- `inference_steps=1`
- `batch_size=80`, `lr=2.5e-4`, cosine warmup 2000
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[r_vis, front]`
- rollout every 5 epochs with 5 episodes, headless
## Important dimension override
- Two-camera per-step cond dim = `64*2 (visual) + 16 (state) = 144`, so set `agent.num_cams=2`, `agent.head.cond_dim=144`.
## Resource plan
- Host: `100.119.99.14`
- GPU: `1`

View File

@@ -0,0 +1,6 @@
# Notes
- 2026-04-05 10:20:09: remote 2-step smoke passed on `100.119.99.14` GPU1 with `r_vis + front`, batch=80, no OOM.
- 2026-04-05 10:20:49: launched main run `imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029`.
- 2026-04-05 10:22:03: confirmed training is stable through step 200, latest loss 0.3321.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/3fyzjfdcbiq7frtbqv6ss

View File

@@ -0,0 +1,55 @@
{
"suite_name": "2026-04-05-front-rvis-resnet-2cam",
"updated_at": "2026-04-05 10:22:03",
"phase": "running",
"interpretation": {
"right_camera_name": "r_vis"
},
"baseline_reference": {
"source_run": "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
"notes": "Same hyperparameters as the active top/front run, replacing top with r_vis."
},
"smoke_test": {
"status": "passed",
"host": "100.119.99.14",
"gpu": 1,
"run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-frontrvis-resnet-ph16-ex08-20260405-102001",
"batch_size": 80,
"max_steps": 2,
"note": "2-step remote CUDA smoke passed on L20 GPU1 without OOM."
},
"main_run": {
"status": "running",
"host": "100.119.99.14",
"gpu": 1,
"launch_pid": 159910,
"pid": 159913,
"run_name": "imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029",
"run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029",
"log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029/train_vla.log",
"launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029.launch.log",
"dataset_dir": "/home/droid/sim_dataset/sim_transfer",
"camera_names": [
"r_vis",
"front"
],
"pred_horizon": 16,
"num_action_steps": 8,
"head_cond_dim": 144,
"head_n_emb": 384,
"head_n_layer": 12,
"vision_backbone_mode": "resnet",
"pretrained_backbone_weights": null,
"freeze_backbone": false,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 5,
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/3fyzjfdcbiq7frtbqv6ss",
"latest_step": 200,
"latest_loss": 0.3321,
"process_running": true
}
}

View File

@@ -0,0 +1,73 @@
{
"date": "2026-04-06",
"branch": "feat-imf-attnres-policy",
"worktree": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy",
"model": "LEWM ViT frozen visual encoder + IMF AttnRes diffusion head",
"checkpoint_path": "/home/droid/le-wm/lewm-sim-transfer/pa1w85md8jop6bvol8oxp/checkpoints/epoch=99-step=47800.ckpt",
"visual_contract": {
"input_camera_names": ["r_vis", "top", "front"],
"fused_camera_names": ["front", "top", "r_vis"],
"joint_output_dim": 192,
"freeze_backbone": true,
"dataset_image_resize_shape": null,
"eval_image_resize_shape": [256, 256],
"fused_short_side_resize": 224
},
"training_contract": {
"pred_horizon": 16,
"num_action_steps": 8,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 10,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"scheduler_type": "cosine",
"warmup_steps": 2000,
"min_lr": 1e-06,
"weight_decay": 1e-05,
"grad_clip": 1.0
},
"verification": {
"local_tests": "38 passed",
"remote_dataset_shape": [2, 3, 256, 256],
"remote_eval_prepared_shape": [3, 256, 256],
"remote_smoke_run": {
"run_name": "smoke-lewm-imf-rawpath-emb384-20260406-002002",
"result": "passed",
"details": "2-step train + checkpoint-triggered 1-episode headless rollout succeeded with corrected raw256 path"
}
},
"superseded_runs": [
{
"run_name": "lewm-vit-imf-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260405-201914",
"reason": "stopped due to incorrect early per-camera 224 resize"
},
{
"run_name": "lewm-vit-imf-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260405-201914",
"reason": "stopped due to incorrect early per-camera 224 resize"
}
],
"full_runs": [
{
"host": "100.73.14.65",
"gpu": 0,
"run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124",
"pid": 1058589,
"log_path": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124.launch.log",
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/y5tzgqe0u966w9ak41i31",
"head_n_emb": 384,
"head_n_layer": 12
},
{
"host": "100.73.14.65",
"gpu": 1,
"run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124",
"pid": 1058590,
"log_path": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124.launch.log",
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/2esr9y7t2dgesstgrn5i6",
"head_n_emb": 256,
"head_n_layer": 12
}
]
}

View File

@@ -0,0 +1,25 @@
# 2026-04-06 LEWM ViT Transfer Notes
## Root-cause fix
The first LEWM runs were stopped because the data path still resized each camera view to `224x224` **before** multiview fusion. That preserved the final tensor shape but broke the original LEWM geometry.
Corrected path now is:
- **Training dataset**: keep stored per-view `256x256` images (`data.image_resize_shape=null` at launch; dataset instantiate override is `None` for LEWM)
- **Eval rollout input**: resize live MuJoCo `480x640` camera images to `256x256` per view
- **Backbone**: fuse `front, top, r_vis` on the LEWM axis, then resize fused short side to `224`
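A small sketch of the corrected eval-time ordering (per-view resize to 256, fuse, then resize the fused image's short side to 224); the fusion order follows the manifest, the width axis is assumed as the LEWM fusion axis, and the raw frame size follows the MuJoCo renders mentioned above.

```python
import cv2
import numpy as np

def prepare_lewm_eval_input(raw_views: dict, fusion_order=("front", "top", "r_vis"), short_side=224):
    """raw_views: camera name -> (480, 640, 3) uint8 MuJoCo frame (illustrative shapes)."""
    views = [cv2.resize(raw_views[c], (256, 256), interpolation=cv2.INTER_LINEAR)
             for c in fusion_order]                    # per-view eval resize to 256x256
    fused = np.concatenate(views, axis=1)              # fuse along the width axis: (256, 768, 3)
    h, w = fused.shape[:2]
    scale = short_side / min(h, w)
    return cv2.resize(fused, (round(w * scale), round(h * scale)),
                      interpolation=cv2.INTER_LINEAR)  # (224, 672, 3), resized only after fusion
```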
## Verification
- Local tests passed (`38 passed` across the focused suite)
- Remote check:
- dataset sample image shape: `(2, 3, 256, 256)`
- eval-prepared live frame shape: `(3, 256, 256)`
- Remote smoke passed with real checkpoint:
- `smoke-lewm-imf-rawpath-emb384-20260406-002002`
## Current runs
- `lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124`
- `lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124`

View File

@@ -0,0 +1,19 @@
{
"status": "running",
"updated_at": "2026-04-06T00:22:10+08:00",
"remote_host": "100.73.14.65",
"runs": [
{
"run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124",
"pid": 1058589,
"gpu": 0,
"state": "running"
},
{
"run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124",
"pid": 1058590,
"gpu": 1,
"state": "running"
}
]
}

View File

@@ -0,0 +1,7 @@
# CHECKLIST
- [x] Create run contract
- [x] Remote smoke test passes
- [x] Launch 50k main run
- [x] Record pid / log / SwanLab
- [x] Report status back to user

View File

@@ -0,0 +1,12 @@
# PLAN
## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only the `r_vis` camera as image conditioning.
## Fixed comparison contract
- same hyperparameters as the active top/front run
- cameras: ['r_vis']
- num_cams=1
- head.cond_dim=80
- host: 100.119.99.14
- gpu: 3

View File

@@ -0,0 +1,6 @@
# Notes
- 2026-04-05 12:58:22: smoke passed for ['r_vis'] on 100.119.99.14 GPU3.
- 2026-04-05 12:59:24: launched main run `imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844`.
- 2026-04-05 13:01:20: latest confirmed progress step=400, loss=0.1165.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/qnuh7vln9mqomxxldyecq

View File

@@ -0,0 +1,47 @@
{
"suite_name": "2026-04-05-rvis-only-resnet-1cam",
"updated_at": "2026-04-05 13:01:20",
"phase": "running",
"smoke_test": {
"status": "passed",
"host": "100.119.99.14",
"gpu": 3,
"run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-rvisonly-resnet-ph16-ex08-20260405-125812",
"batch_size": 80,
"max_steps": 2,
"note": "2-step remote CUDA smoke passed without OOM."
},
"main_run": {
"status": "running",
"host": "100.119.99.14",
"gpu": 3,
"launch_pid": 164812,
"pid": 164816,
"run_name": "imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844",
"run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844",
"log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844/train_vla.log",
"launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844.launch.log",
"dataset_dir": "/home/droid/sim_dataset/sim_transfer",
"camera_names": [
"r_vis"
],
"pred_horizon": 16,
"num_action_steps": 8,
"head_cond_dim": 80,
"head_n_emb": 384,
"head_n_layer": 12,
"vision_backbone_mode": "resnet",
"pretrained_backbone_weights": null,
"freeze_backbone": false,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 5,
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/qnuh7vln9mqomxxldyecq",
"latest_step": 400,
"latest_loss": 0.1165,
"process_running": true
}
}

View File

@@ -0,0 +1,7 @@
# CHECKLIST
- [x] Create run contract
- [x] Remote smoke test passes
- [x] Launch 50k main run
- [x] Record pid / log / SwanLab
- [x] Report status back to user

View File

@@ -0,0 +1,12 @@
# PLAN
## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only the `r_vis` and `top` cameras as image conditioning.
## Fixed comparison contract
- same hyperparameters as the active top/front run
- cameras: ['r_vis', 'top']
- num_cams=2
- head.cond_dim=144
- host: 100.119.99.14
- gpu: 2

View File

@@ -0,0 +1,6 @@
# Notes
- 2026-04-05 12:58:22: smoke passed for ['r_vis', 'top'] on 100.119.99.14 GPU2.
- 2026-04-05 12:59:24: launched main run `imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844`.
- 2026-04-05 13:01:20: latest confirmed progress step=200, loss=0.2845.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/umsm6402eb81et7wx7z4a

View File

@@ -0,0 +1,48 @@
{
"suite_name": "2026-04-05-rvistop-resnet-2cam",
"updated_at": "2026-04-05 13:01:20",
"phase": "running",
"smoke_test": {
"status": "passed",
"host": "100.119.99.14",
"gpu": 2,
"run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-rvistop-resnet-ph16-ex08-20260405-125812",
"batch_size": 80,
"max_steps": 2,
"note": "2-step remote CUDA smoke passed without OOM."
},
"main_run": {
"status": "running",
"host": "100.119.99.14",
"gpu": 2,
"launch_pid": 164745,
"pid": 164749,
"run_name": "imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844",
"run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844",
"log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844/train_vla.log",
"launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844.launch.log",
"dataset_dir": "/home/droid/sim_dataset/sim_transfer",
"camera_names": [
"r_vis",
"top"
],
"pred_horizon": 16,
"num_action_steps": 8,
"head_cond_dim": 144,
"head_n_emb": 384,
"head_n_layer": 12,
"vision_backbone_mode": "resnet",
"pretrained_backbone_weights": null,
"freeze_backbone": false,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 5,
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/umsm6402eb81et7wx7z4a",
"latest_step": 200,
"latest_loss": 0.2845,
"process_running": true
}
}

View File

@@ -0,0 +1,8 @@
# CHECKLIST
- [x] Confirm baseline hyperparameters from trusted prior run
- [x] Confirm local GPU availability
- [x] Smoke test with `top/front` cameras only
- [x] Launch 50k run
- [x] Record pid / run dir / log path / SwanLab URL
- [x] Report status back to user

View File

@@ -0,0 +1,30 @@
# PLAN
## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone (no full-AttnRes vision replacement), using only `top` and `front` cameras as image conditioning.
## Fixed comparison contract
- Agent: `resnet_imf_attnres`
- Vision backbone mode: `resnet`
- `pred_horizon=16`
- `num_action_steps=8`
- `n_emb=384`, `n_layer=12`, `n_head=1`, `n_kv_head=1`
- `inference_steps=1`
- `batch_size=80`, `lr=2.5e-4`, cosine scheduler, warmup 2000
- dataset: `/home/droid/project/diana_sim/sim_transfer`
- cameras: `[top, front]` only
- training budget: `max_steps=50000`
- rollout validation: every 5 epochs, 5 episodes, headless
## Resource plan
- Host: local
- GPU: RTX 5090 (GPU 0)
## Execution path
1. Run a short 2-step smoke test on GPU with the exact 2-camera config.
2. If smoke passes, launch the 50k main run with durable log redirection.
3. Record run name, pid, log path, and SwanLab URL into suite status.
## Fallbacks
- If batch 80 OOMs, fall back to batch 64 with scaled lr 2.0e-4.
- If dataloader startup is unstable, reduce num_workers from 12 to 8.

View File

@@ -0,0 +1,5 @@
# Notes
- 2026-04-05 08:50:04: 2-step smoke test passed locally on RTX 5090 with `top/front` cameras, batch=80, no OOM.
- 2026-04-05 08:50:42: launched main run `imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023` on local GPU0.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/vi77mn5dwd19z4nttxab8

View File

@@ -0,0 +1,51 @@
{
"suite_name": "2026-04-05-top-front-resnet-2cam",
"updated_at": "2026-04-05 08:52:12",
"phase": "running",
"baseline_reference": {
"source_run": "imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223",
"best_rollout_avg_reward": 610.8,
"best_step": 21874,
"notes": "Same IMF baseline as Phase-1 best, but switch cameras from [r_vis, top, front] to [top, front] and keep the original ResNet vision backbone."
},
"smoke_test": {
"status": "passed",
"run_dir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/smoke-topfront-resnet-ph16-ex08-20260405-085000",
"batch_size": 80,
"num_workers": 4,
"max_steps": 2,
"note": "2-step local CUDA smoke passed without OOM using top/front only."
},
"main_run": {
"status": "running",
"host": "local",
"gpu": 0,
"pid": 1693348,
"run_name": "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
"run_dir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
"log_path": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023/train_vla.log",
"launch_log": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/experiment_suites/2026-04-05-top-front-resnet-2cam/launch_logs/imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023.launch.log",
"dataset_dir": "/home/droid/project/diana_sim/sim_transfer",
"camera_names": [
"top",
"front"
],
"pred_horizon": 16,
"num_action_steps": 8,
"head_n_emb": 384,
"head_n_layer": 12,
"vision_backbone_mode": "resnet",
"pretrained_backbone_weights": null,
"freeze_backbone": false,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 5,
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/vi77mn5dwd19z4nttxab8",
"latest_step": 500,
"latest_loss": 0.0978,
"process_running": true
}
}

View File

@@ -0,0 +1,7 @@
# CHECKLIST
- [x] Create run contract
- [x] Remote smoke test passes
- [x] Launch 50k main run
- [x] Record pid / log / SwanLab
- [x] Report status back to user

View File

@@ -0,0 +1,12 @@
# PLAN
## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only the `top` camera as image conditioning.
## Fixed comparison contract
- same hyperparameters as the active top/front run
- cameras: ['top']
- num_cams=1
- head.cond_dim=80
- host: 100.119.99.14
- gpu: 4

View File

@@ -0,0 +1,6 @@
# Notes
- 2026-04-05 12:58:22: smoke passed for ['top'] on 100.119.99.14 GPU4.
- 2026-04-05 12:59:24: launched main run `imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844`.
- 2026-04-05 13:01:20: latest confirmed progress step=400, loss=0.1233.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/egzo29l3z9ftsaunhf025

View File

@@ -0,0 +1,47 @@
{
"suite_name": "2026-04-05-top-only-resnet-1cam",
"updated_at": "2026-04-05 13:01:20",
"phase": "running",
"smoke_test": {
"status": "passed",
"host": "100.119.99.14",
"gpu": 4,
"run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-toponly-resnet-ph16-ex08-20260405-125812",
"batch_size": 80,
"max_steps": 2,
"note": "2-step remote CUDA smoke passed without OOM."
},
"main_run": {
"status": "running",
"host": "100.119.99.14",
"gpu": 4,
"launch_pid": 164808,
"pid": 164813,
"run_name": "imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844",
"run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844",
"log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844/train_vla.log",
"launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844.launch.log",
"dataset_dir": "/home/droid/sim_dataset/sim_transfer",
"camera_names": [
"top"
],
"pred_horizon": 16,
"num_action_steps": 8,
"head_cond_dim": 80,
"head_n_emb": 384,
"head_n_layer": 12,
"vision_backbone_mode": "resnet",
"pretrained_backbone_weights": null,
"freeze_backbone": false,
"batch_size": 80,
"lr": 0.00025,
"num_workers": 12,
"max_steps": 50000,
"rollout_val_freq_epochs": 5,
"rollout_num_episodes": 5,
"swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/egzo29l3z9ftsaunhf025",
"latest_step": 400,
"latest_loss": 0.1233,
"process_running": true
}
}

View File

@@ -106,7 +106,11 @@ def load_checkpoint(
return agent, stats
def prepare_observation(obs: Dict, camera_names: list) -> Dict:
def prepare_observation(
obs: Dict,
camera_names: list,
image_resize_shape: Optional[tuple[int, int]] = (224, 224),
) -> Dict:
"""
Convert the environment observation into the agent's input format.
@@ -117,14 +121,13 @@ def prepare_observation(obs: Dict, camera_names: list) -> Dict:
Returns:
Observation dict in agent format
"""
import cv2
# Convert images: numpy -> tensor, HWC -> CHW
images = {}
for cam_name in camera_names:
img = obs['images'][cam_name]
# Resize to 224x224, consistent with training
img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
if image_resize_shape is not None:
import cv2
img = cv2.resize(img, tuple(image_resize_shape), interpolation=cv2.INTER_LINEAR)
img = rearrange(img, 'h w c -> c h w')
img = torch.from_numpy(img / 255.0).float()
images[cam_name] = img
@@ -668,6 +671,8 @@ def _run_eval(cfg: DictConfig):
agent_cfg=cfg.agent,
device=device
)
vision_encoder = getattr(agent, 'vision_encoder', None)
image_resize_shape = getattr(vision_encoder, 'eval_image_resize_shape', (224, 224))
# Reset the agent's queues
agent.reset()
@@ -725,7 +730,11 @@ def _run_eval(cfg: DictConfig):
video_recorder.write(video_frame)
# Prepare the observation for the agent
observation = prepare_observation(obs, camera_names)
observation = prepare_observation(
obs,
camera_names,
image_resize_shape=image_resize_shape,
)
end_preprocess = time.perf_counter()
# Select an action (the agent manages its queues internally)

View File

@@ -380,7 +380,14 @@ def _run_training(cfg: DictConfig):
# =========================================================================
log.info("📦 加载数据集...")
try:
dataset = instantiate(cfg.data)
dataset_image_resize_shape = cfg.data.get('image_resize_shape', (224, 224))
vision_backbone_cfg = cfg.agent.get('vision_backbone', None)
if vision_backbone_cfg is not None and 'dataset_image_resize_shape' in vision_backbone_cfg:
dataset_image_resize_shape = vision_backbone_cfg.get('dataset_image_resize_shape')
dataset = instantiate(
cfg.data,
image_resize_shape=dataset_image_resize_shape,
)
log.info(f"✅ 数据集加载成功。总样本数: {len(dataset)}")
except Exception as e:
log.error(f"❌ 数据集加载失败: {e}")

View File

@@ -27,6 +27,7 @@ class VLAAgent(nn.Module):
normalization_type='min_max', # Normalization type: 'gaussian' or 'min_max'
num_action_steps=8, # How many action steps are actually executed per inference
head_type='unet', # Policy head type: 'unet' or 'transformer'
cond_projector=None, # Optional: project the visual+state condition to the head's expected dimension
):
super().__init__()
# Save parameters
@@ -74,15 +75,32 @@ class VLAAgent(nn.Module):
self.vision_encoder = vision_backbone
if self.camera_names is not None:
self.vision_encoder.camera_names = self.camera_names
self.condition_tokens_per_step = int(getattr(self.vision_encoder, 'tokens_per_step', 1))
joint_vision_dim = getattr(self.vision_encoder, 'joint_output_dim', None)
if joint_vision_dim is not None:
per_token_vision_dim = int(joint_vision_dim)
self.condition_tokens_per_step = 1
else:
single_cam_feat_dim = self.vision_encoder.output_dim
# global_cond_dim: total flattened dimension (used by the UNet)
total_vision_dim = single_cam_feat_dim * num_cams * obs_horizon
total_prop_dim = obs_dim * obs_horizon
self.global_cond_dim = total_vision_dim + total_prop_dim
if self.condition_tokens_per_step > 1:
per_token_vision_dim = int(single_cam_feat_dim)
else:
per_token_vision_dim = int(single_cam_feat_dim) * int(num_cams)
# per_step_cond_dim: per-step condition dimension (used by the Transformer)
# Note: not multiplied by obs_horizon here, because the Transformer consumes a sequence
self.per_step_cond_dim = single_cam_feat_dim * num_cams + obs_dim
self.condition_sequence_length = self.obs_horizon * self.condition_tokens_per_step
self.raw_per_step_cond_dim = per_token_vision_dim + obs_dim
if cond_projector is None:
self.cond_projector = None
self.per_step_cond_dim = self.raw_per_step_cond_dim
else:
if isinstance(cond_projector, nn.Module):
self.cond_projector = cond_projector
else:
self.cond_projector = cond_projector(input_dim=self.raw_per_step_cond_dim)
self.per_step_cond_dim = self._projector_output_dim(self.cond_projector, self.raw_per_step_cond_dim)
# global_cond_dim: total flattened dimension (used by the UNet)
self.global_cond_dim = self.per_step_cond_dim * self.condition_sequence_length
self.noise_scheduler = DDPMScheduler(
num_train_timesteps=diffusion_steps,
@@ -111,7 +129,7 @@ class VLAAgent(nn.Module):
input_dim=action_dim,
output_dim=action_dim,
horizon=pred_horizon,
n_obs_steps=obs_horizon,
n_obs_steps=self.condition_sequence_length,
cond_dim=self.per_step_cond_dim # per-step condition dimension
)
else: # 'unet' (default)
@@ -143,6 +161,20 @@ class VLAAgent(nn.Module):
return tuple(self._move_to_device(v, device) for v in data)
return data
@staticmethod
def _projector_output_dim(projector: nn.Module, fallback: int) -> int:
output_dim = getattr(projector, 'output_dim', None)
if output_dim is not None:
return int(output_dim)
out_features = getattr(projector, 'out_features', None)
if out_features is not None:
return int(out_features)
linear = getattr(projector, 'linear', None)
linear_out_features = getattr(linear, 'out_features', None)
if linear_out_features is not None:
return int(linear_out_features)
return int(fallback)
def _order_images(self, images: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""按显式配置的相机顺序返回图像字典。"""
if self.camera_names is None:
@@ -165,7 +197,43 @@ class VLAAgent(nn.Module):
ordered_images = self._order_images(images)
visual_features = self.vision_encoder(ordered_images)
state_features = self.state_encoder(states)
if visual_features.ndim == 4:
batch_size, obs_steps, token_count, _ = visual_features.shape
if obs_steps != state_features.shape[1]:
raise RuntimeError(
f"观测时间维不匹配: visual={obs_steps}, state={state_features.shape[1]}"
)
if token_count != self.condition_tokens_per_step:
raise RuntimeError(
f"条件token数量不匹配: got {token_count}, expected {self.condition_tokens_per_step}"
)
state_features = state_features.unsqueeze(2).expand(-1, -1, token_count, -1)
cond = torch.cat([visual_features, state_features], dim=-1)
if cond.shape[-1] != self.raw_per_step_cond_dim:
raise RuntimeError(
f"原始条件维度不匹配: got {cond.shape[-1]}, expected {self.raw_per_step_cond_dim}"
)
if self.cond_projector is not None:
cond = self.cond_projector(cond)
if cond.shape[-1] != self.per_step_cond_dim:
raise RuntimeError(
f"条件维度不匹配: got {cond.shape[-1]}, expected {self.per_step_cond_dim}"
)
cond = cond.reshape(batch_size, obs_steps * token_count, self.per_step_cond_dim)
expected_length = self.condition_sequence_length
if cond.shape[1] != expected_length:
raise RuntimeError(
f"条件序列长度不匹配: got {cond.shape[1]}, expected {expected_length}"
)
return cond
cond = torch.cat([visual_features, state_features], dim=-1)
if cond.shape[-1] != self.raw_per_step_cond_dim:
raise RuntimeError(
f"原始条件维度不匹配: got {cond.shape[-1]}, expected {self.raw_per_step_cond_dim}"
)
if self.cond_projector is not None:
cond = self.cond_projector(cond)
if cond.shape[-1] != self.per_step_cond_dim:
raise RuntimeError(
f"条件维度不匹配: got {cond.shape[-1]}, expected {self.per_step_cond_dim}"

View File

@@ -0,0 +1,41 @@
# @package agent
defaults:
- /backbone@vision_backbone: lewm_vit_diffusion
- /modules@state_encoder: identity_state_encoder
- /modules@action_encoder: identity_action_encoder
- /head: imf_transformer1d
- _self_
_target_: roboimi.vla.agent_imf.IMFVLAAgent
action_dim: 16
obs_dim: 16
normalization_type: "min_max"
pred_horizon: 16
obs_horizon: 2
num_action_steps: 8
camera_names: ${data.camera_names}
num_cams: 3
vision_backbone:
num_cameras: ${agent.num_cams}
camera_names: ${agent.camera_names}
fused_camera_names: [front, top, r_vis]
diffusion_steps: 100
inference_steps: 1
head_type: "transformer"
head:
input_dim: ${agent.action_dim}
output_dim: ${agent.action_dim}
horizon: ${agent.pred_horizon}
n_obs_steps: ${agent.obs_horizon}
cond_dim: 208
causal_attn: false
time_as_cond: true
obs_as_cond: true
n_cond_layers: 0
backbone_type: attnres_full
n_head: 1
n_kv_head: 1

View File

@@ -0,0 +1,48 @@
# @package agent
defaults:
- /backbone@vision_backbone: resnet_diffusion
- /modules@state_encoder: identity_state_encoder
- /modules@action_encoder: identity_action_encoder
- /modules@cond_projector: linear_condition_projector
- /head: imf_transformer1d
- _self_
_target_: roboimi.vla.agent_imf.IMFVLAAgent
action_dim: 16
obs_dim: 16
normalization_type: "min_max"
pred_horizon: 16
obs_horizon: 2
num_action_steps: 8
camera_names: ${data.camera_names}
num_cams: ${len:${agent.camera_names}}
vision_backbone:
num_cameras: ${agent.num_cams}
camera_names: ${agent.camera_names}
vision_backbone: "resnet18"
vision_backbone_mode: "resnet"
freeze_backbone: false
use_separate_rgb_encoder_per_camera: true
output_tokens_per_camera: true
cond_projector:
output_dim: ${agent.head.n_emb}
diffusion_steps: 100
inference_steps: 1
head_type: "transformer"
head:
input_dim: ${agent.action_dim}
output_dim: ${agent.action_dim}
horizon: ${agent.pred_horizon}
cond_dim: ${agent.head.n_emb}
causal_attn: false
time_as_cond: true
obs_as_cond: true
n_cond_layers: 0
backbone_type: attnres_full
n_head: 1
n_kv_head: 1

View File

@@ -0,0 +1,44 @@
# @package agent
defaults:
- /backbone@vision_backbone: siglip2_diffusion
- /modules@state_encoder: identity_state_encoder
- /modules@action_encoder: identity_action_encoder
- /modules@cond_projector: linear_condition_projector
- /head: imf_transformer1d
- _self_
_target_: roboimi.vla.agent_imf.IMFVLAAgent
action_dim: 16
obs_dim: 16
normalization_type: "min_max"
pred_horizon: 16
obs_horizon: 2
num_action_steps: 8
camera_names: ${data.camera_names}
num_cams: ${len:${agent.camera_names}}
vision_backbone:
num_cameras: ${agent.num_cams}
camera_names: ${agent.camera_names}
cond_projector:
output_dim: ${agent.head.cond_dim}
diffusion_steps: 100
inference_steps: 1
head_type: "transformer"
head:
input_dim: ${agent.action_dim}
output_dim: ${agent.action_dim}
horizon: ${agent.pred_horizon}
n_obs_steps: ${agent.obs_horizon}
cond_dim: 384
causal_attn: false
time_as_cond: true
obs_as_cond: true
n_cond_layers: 0
backbone_type: attnres_full
n_head: 1
n_kv_head: 1

View File

@@ -0,0 +1,16 @@
_target_: roboimi.vla.models.backbones.lewm_vit_backbone.LEWMViTBackbone
# LEWM checkpoint path; override this on the target machine.
checkpoint_path: null
# Input camera contract for roboimi; internal LEWM fusion order stays front/top/r_vis.
num_cameras: 3
camera_names: [r_vis, top, front]
fused_camera_names: [front, top, r_vis]
freeze_backbone: true
joint_output_dim: 192
output_dim: 192
image_size: 224
dataset_image_resize_shape: null
eval_image_resize_shape: [256, 256]
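A sketch of overriding `checkpoint_path` when composing this config programmatically; the config directory, top-level config name, and checkpoint path below are placeholders, not values from this commit:

```python
from hydra import compose, initialize_config_dir
from hydra.utils import instantiate

# Paths and config_name are assumptions; adjust to the local checkout.
with initialize_config_dir(config_dir="/abs/path/to/roboimi/vla/conf", version_base=None):
    cfg = compose(
        config_name="config",
        overrides=[
            "agent=lewm_imf_attnres",
            "agent.vision_backbone.checkpoint_path=/data/ckpts/lewm.ckpt",
        ],
    )
backbone = instantiate(cfg.agent.vision_backbone)
```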

View File

@@ -31,6 +31,8 @@ spatial_softmax_num_keypoints: 32 # number of Spatial Softmax keypoints
# false: shared encoder (all cameras share one ResNet; fewer parameters, limited capacity; recommended)
# true: separate encoders (one ResNet per camera; more parameters, larger capacity)
use_separate_rgb_encoder_per_camera: true
# false: concatenate all camera features into one condition token; true: emit one token per camera
output_tokens_per_camera: false
num_cameras: 3 # number of cameras
# ====================

View File

@@ -0,0 +1,10 @@
_target_: roboimi.vla.models.backbones.siglip2_diffusion_backbone.SigLIP2DiffusionBackbone
model_name: google/siglip2-base-patch16-256
camera_names: [r_vis, top, front]
num_cameras: 3
per_view_output_dim: 96
freeze_backbone: true
dataset_image_resize_shape: null
eval_image_resize_shape: [256, 256]

View File

@@ -19,3 +19,6 @@ camera_names:
- r_vis # robot-view camera
- top # top camera
- front # front camera
# Per-view pre-resize size; null keeps the dataset's native resolution
image_resize_shape: [224, 224]

View File

@@ -0,0 +1,5 @@
_target_: roboimi.vla.modules.projectors.LinearConditionProjector
_partial_: true
output_dim: 384
bias: true
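A short sketch of how `_partial_: true` is consumed: Hydra returns a partially applied constructor and the caller supplies `input_dim` once the raw per-step condition width is known (the exact call site in the agent is an assumption here, and 304 is just an illustrative width):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "_target_": "roboimi.vla.modules.projectors.LinearConditionProjector",
    "_partial_": True,
    "output_dim": 384,
    "bias": True,
})
projector_factory = instantiate(cfg)                # functools.partial over the projector class
cond_projector = projector_factory(input_dim=304)   # e.g. 3 cameras * 96 + obs_dim 16
```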

View File

@@ -1,7 +1,7 @@
import torch
import h5py
from torch.utils.data import Dataset
from typing import List, Dict, Union
from typing import List, Dict, Union, Optional, Sequence
from pathlib import Path
from collections import OrderedDict
@@ -22,6 +22,7 @@ class SimpleRobotDataset(Dataset):
obs_horizon: int = 2,
pred_horizon: int = 8,
camera_names: List[str] = None,
image_resize_shape: Optional[Sequence[int]] = (224, 224),
max_open_files: int = 64,
):
"""
@@ -30,6 +31,7 @@ class SimpleRobotDataset(Dataset):
obs_horizon: how many past frames to observe
pred_horizon: how many future action frames to predict
camera_names: list of camera names, e.g. ["r_vis", "top", "front"]
image_resize_shape: image resize size (W, H); None keeps the original resolution
max_open_files: maximum number of HDF5 file handles cached per worker
HDF5 file layout:
@@ -40,6 +42,10 @@ class SimpleRobotDataset(Dataset):
self.obs_horizon = obs_horizon
self.pred_horizon = pred_horizon
self.camera_names = camera_names or []
self.image_resize_shape = (
tuple(int(v) for v in image_resize_shape)
if image_resize_shape is not None else None
)
self.max_open_files = max(1, int(max_open_files))
self._file_cache: "OrderedDict[str, h5py.File]" = OrderedDict()
@@ -123,9 +129,9 @@ class SimpleRobotDataset(Dataset):
h5_path = f'observations/images/{cam_name}'
if h5_path in f:
img = f[h5_path][meta["frame_idx"]]
# Resize the image to 224x224 to reduce memory and I/O load
if self.image_resize_shape is not None:
import cv2
img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
img = cv2.resize(img, self.image_resize_shape, interpolation=cv2.INTER_LINEAR)
# Convert to float and normalize to [0, 1]
img = torch.from_numpy(img).float() / 255.0
frame[f"observation.{cam_name}"] = img.permute(2, 0, 1) # HWC -> CHW

View File

@@ -1,4 +1,15 @@
# Backbone models
from .resnet_diffusion import ResNetDiffusionBackbone
__all__ = ["LEWMViTBackbone", "ResNetBackbone", "ResNetDiffusionBackbone", "SigLIP2DiffusionBackbone"]
__all__ = ["ResNetBackbone", "ResNetDiffusionBackbone"]
def __getattr__(name):
if name == "LEWMViTBackbone":
from .lewm_vit_backbone import LEWMViTBackbone
return LEWMViTBackbone
if name == "SigLIP2DiffusionBackbone":
from .siglip2_diffusion_backbone import SigLIP2DiffusionBackbone
return SigLIP2DiffusionBackbone
if name in {"ResNetBackbone", "ResNetDiffusionBackbone"}:
from .resnet_diffusion import ResNetDiffusionBackbone
return ResNetDiffusionBackbone
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
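The module-level `__getattr__` above (PEP 562) keeps heavy backbone imports lazy; attribute access is what triggers the import:

```python
from roboimi.vla.models import backbones

# Only at this point is lewm_vit_backbone (and transformers) actually imported.
backbone_cls = backbones.LEWMViTBackbone
```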

View File

@@ -0,0 +1,230 @@
from __future__ import annotations
from pathlib import Path
from typing import Any, Dict, Mapping, Sequence
import torch
import torch.nn as nn
import torch.nn.functional as F
from roboimi.vla.core.interfaces import VLABackbone
class _LEWMProjector(nn.Module):
"""LEWM projector MLP: 192 -> 2048 -> 192 with BatchNorm1d + GELU."""
def __init__(self, input_dim: int = 192, hidden_dim: int = 2048, output_dim: int = 192) -> None:
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.GELU(),
nn.Linear(hidden_dim, output_dim),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
class LEWMViTBackbone(VLABackbone):
"""Frozen LEWM joint-multiview ViT backbone.
The backbone fuses the three camera views into a single LEWM-style image,
runs a ViT-tiny encoder plus the LEWM projector, and returns one joint
192-d embedding per timestep.
"""
def __init__(
self,
checkpoint_path: str | Path | None = None,
*,
checkpoint: Mapping[str, Any] | None = None,
camera_names: Sequence[str] = ("r_vis", "top", "front"),
fused_camera_names: Sequence[str] = ("front", "top", "r_vis"),
num_cameras: int | None = None,
dataset_image_resize_shape: Sequence[int] | None = None,
eval_image_resize_shape: Sequence[int] | None = (256, 256),
freeze_backbone: bool = True,
joint_output_dim: int = 192,
image_size: int = 224,
output_dim: int = 192,
) -> None:
super().__init__()
self.camera_names = tuple(camera_names)
self.fused_camera_names = tuple(fused_camera_names)
self.num_cameras = int(num_cameras) if num_cameras is not None else len(self.camera_names)
self.freeze_backbone = bool(freeze_backbone)
self.joint_output_dim = int(joint_output_dim)
self.image_size = int(image_size)
self._output_dim = int(output_dim)
self.dataset_image_resize_shape = (
tuple(int(v) for v in dataset_image_resize_shape)
if dataset_image_resize_shape is not None else None
)
self.eval_image_resize_shape = (
tuple(int(v) for v in eval_image_resize_shape)
if eval_image_resize_shape is not None else None
)
if self.num_cameras != len(self.camera_names):
raise ValueError(
f"num_cameras({self.num_cameras}) must match len(camera_names)({len(self.camera_names)})"
)
if set(self.fused_camera_names) != set(self.camera_names):
raise ValueError(
"fused_camera_names must contain the same cameras as camera_names. "
f"got camera_names={list(self.camera_names)}, fused_camera_names={list(self.fused_camera_names)}"
)
self.encoder = self._build_encoder(self.image_size)
self.projector = _LEWMProjector(
input_dim=self.encoder.config.hidden_size,
hidden_dim=2048,
output_dim=self.joint_output_dim,
)
self.register_buffer(
"mean",
torch.tensor([0.485, 0.456, 0.406], dtype=torch.float32).view(1, 3, 1, 1),
)
self.register_buffer(
"std",
torch.tensor([0.229, 0.224, 0.225], dtype=torch.float32).view(1, 3, 1, 1),
)
if checkpoint_path is not None and checkpoint is not None:
raise ValueError("checkpoint_path and checkpoint cannot both be provided")
if checkpoint_path is not None:
self.load_lewm_checkpoint(checkpoint_path)
elif checkpoint is not None:
self.load_lewm_checkpoint(checkpoint)
if self.freeze_backbone:
self._freeze_encoder_and_projector()
@staticmethod
def _build_encoder_config(image_size: int):
from transformers import ViTConfig
return ViTConfig(
image_size=image_size,
patch_size=14,
num_channels=3,
hidden_size=192,
intermediate_size=768,
num_hidden_layers=12,
num_attention_heads=3,
qkv_bias=True,
hidden_dropout_prob=0.0,
attention_probs_dropout_prob=0.0,
)
@classmethod
def _build_encoder(cls, image_size: int) -> nn.Module:
from transformers import ViTModel
return ViTModel(cls._build_encoder_config(image_size), add_pooling_layer=False)
@staticmethod
def _unwrap_state_dict(payload: Mapping[str, Any]) -> Mapping[str, torch.Tensor]:
state_dict = payload.get("state_dict", payload)
if not isinstance(state_dict, Mapping):
raise TypeError("checkpoint payload must contain a mapping state_dict")
return state_dict
@staticmethod
def _extract_prefixed_state_dict(
state_dict: Mapping[str, torch.Tensor],
prefix: str,
) -> Dict[str, torch.Tensor]:
extracted = {
key[len(prefix) :]: value
for key, value in state_dict.items()
if key.startswith(prefix)
}
if not extracted:
raise KeyError(f"checkpoint missing parameters with prefix {prefix!r}")
return extracted
def load_lewm_checkpoint(self, checkpoint_or_path: str | Path | Mapping[str, Any]) -> None:
if isinstance(checkpoint_or_path, (str, Path)):
payload = torch.load(Path(checkpoint_or_path), map_location="cpu", weights_only=False)
else:
payload = checkpoint_or_path
state_dict = self._unwrap_state_dict(payload)
encoder_state_dict = self._extract_prefixed_state_dict(state_dict, "model.encoder.")
projector_state_dict = self._extract_prefixed_state_dict(state_dict, "model.projector.")
self.encoder.load_state_dict(encoder_state_dict, strict=True)
self.projector.load_state_dict(projector_state_dict, strict=True)
def _freeze_encoder_and_projector(self) -> None:
for module in (self.encoder, self.projector):
module.eval()
for parameter in module.parameters():
parameter.requires_grad = False
def train(self, mode: bool = True) -> "LEWMViTBackbone":
super().train(mode)
if self.freeze_backbone:
self._freeze_encoder_and_projector()
return self
def _ordered_images(self, images: Dict[str, torch.Tensor]) -> list[torch.Tensor]:
missing = [camera_name for camera_name in self.camera_names if camera_name not in images]
if missing:
raise ValueError(
f"image input missing required cameras. missing={missing}, expected={list(self.camera_names)}"
)
ordered = [images[camera_name] for camera_name in self.camera_names]
reference_shape = ordered[0].shape
if len(reference_shape) != 5:
raise ValueError(f"expected image tensors shaped (B, T, C, H, W), got {reference_shape}")
for camera_name, image in zip(self.camera_names[1:], ordered[1:]):
if image.shape != reference_shape:
raise ValueError(
f"camera {camera_name!r} shape {tuple(image.shape)} does not match {tuple(reference_shape)}"
)
return ordered
def _prepare_pixels(self, images: Dict[str, torch.Tensor]) -> tuple[torch.Tensor, int, int]:
self._ordered_images(images)
fused = torch.cat([images[camera_name] for camera_name in self.fused_camera_names], dim=-2)
bsz, steps = fused.shape[:2]
fused = fused.reshape(bsz * steps, *fused.shape[2:]).contiguous().float()
fused = fused.clamp(0.0, 1.0)
fused = (fused - self.mean) / self.std
height, width = fused.shape[-2:]
short_side = min(height, width)
if short_side <= 0:
raise ValueError(f"invalid fused image shape: {tuple(fused.shape)}")
scale = self.image_size / float(short_side)
resized_height = int(round(height * scale))
resized_width = int(round(width * scale))
if (resized_height, resized_width) != (height, width):
fused = F.interpolate(
fused,
size=(resized_height, resized_width),
mode="bilinear",
align_corners=False,
antialias=True,
)
return fused, bsz, steps
def forward(self, images: Dict[str, torch.Tensor]) -> torch.Tensor:
pixels, bsz, steps = self._prepare_pixels(images)
with torch.set_grad_enabled(torch.is_grad_enabled() and not self.freeze_backbone):
output = self.encoder(pixel_values=pixels, interpolate_pos_encoding=True)
cls = output.last_hidden_state[:, 0]
embedding = self.projector(cls)
return embedding.view(bsz, steps, self.joint_output_dim)
@property
def output_dim(self) -> int:
return self._output_dim
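A usage sketch for the backbone above, run with random weights (no checkpoint) and illustrative tensor sizes; it assumes `transformers` is installed so the ViT encoder can be built:

```python
import torch
from roboimi.vla.models.backbones.lewm_vit_backbone import LEWMViTBackbone

backbone = LEWMViTBackbone(checkpoint_path=None, freeze_backbone=True)
images = {
    cam: torch.rand(2, 2, 3, 256, 256)  # (B, T, C, H, W) per camera
    for cam in ("r_vis", "top", "front")
}
with torch.no_grad():
    joint = backbone(images)  # views fused in front/top/r_vis order, encoded jointly
print(joint.shape)  # torch.Size([2, 2, 192])
```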

View File

@@ -211,6 +211,7 @@ class ResNetDiffusionBackbone(VLABackbone):
use_group_norm: bool = True,
spatial_softmax_num_keypoints: int = 32,
use_separate_rgb_encoder_per_camera: bool = False, # new: whether each camera gets its own encoder
output_tokens_per_camera: bool = False, # whether to return one token per camera instead of concatenating into a single token
num_cameras: int = 1, # new: number of cameras (only used in separate-encoder mode)
camera_names: Optional[Tuple[str, ...]] = None, # explicit camera order
freeze_backbone: bool = True, # new: whether to freeze the ResNet backbone (True recommended)
@@ -229,7 +230,9 @@ class ResNetDiffusionBackbone(VLABackbone):
super().__init__()
self.use_separate_rgb_encoder_per_camera = use_separate_rgb_encoder_per_camera
self.output_tokens_per_camera = bool(output_tokens_per_camera)
self.num_cameras = num_cameras
self.tokens_per_step = self.num_cameras if self.output_tokens_per_camera else 1
self.camera_names = tuple(camera_names) if camera_names is not None else None
if self.camera_names is not None and len(self.camera_names) != self.num_cameras:
raise ValueError(
@@ -319,22 +322,24 @@ class ResNetDiffusionBackbone(VLABackbone):
B, T = any_tensor.shape[:2]
cam_names = self._ordered_camera_names(images)
features_all = []
if self.use_separate_rgb_encoder_per_camera:
# Separate-encoder mode: each camera goes through its own encoder
features_all = []
for cam_idx, cam_name in enumerate(cam_names):
img = images[cam_name]
encoder = self.rgb_encoder[cam_idx]
features = encoder.forward_single_image(img.reshape(B * T, *img.shape[2:]))
features_all.append(features)
return torch.cat(features_all, dim=1).view(B, T, -1)
else:
# Shared-encoder mode: all cameras share the same encoder
features_all = []
for cam_name in cam_names:
img = images[cam_name]
features = self.rgb_encoder.forward_single_image(img.reshape(B * T, *img.shape[2:]))
features_all.append(features)
if self.output_tokens_per_camera:
stacked = torch.stack(features_all, dim=1) # (B*T, num_cams, feature_dim)
return stacked.view(B, T, len(cam_names), self.feature_dim)
return torch.cat(features_all, dim=1).view(B, T, -1)
@property

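A pure-tensor sketch of the two output layouts the `output_tokens_per_camera` flag switches between (dims are illustrative, not from the diff):

```python
import torch

B, T, num_cams, feature_dim = 1, 2, 3, 4
features_all = [torch.full((B * T, feature_dim), float(i)) for i in range(num_cams)]

# output_tokens_per_camera=False: one fused condition token per step
fused = torch.cat(features_all, dim=1).view(B, T, -1)                        # (1, 2, 12)

# output_tokens_per_camera=True: one token per camera per step
tokens = torch.stack(features_all, dim=1).view(B, T, num_cams, feature_dim)  # (1, 2, 3, 4)
```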
View File

@@ -0,0 +1,124 @@
from __future__ import annotations
from typing import Dict, Optional, Sequence, Tuple
import torch
from torch import nn
from transformers import SiglipVisionModel
from roboimi.vla.core.interfaces import VLABackbone
class SigLIP2DiffusionBackbone(VLABackbone):
"""Shared SigLIP vision tower for multiview diffusion-policy conditioning.
We intentionally load the checkpoint `google/siglip2-base-patch16-256` through
`SiglipVisionModel.from_pretrained(...)` so each camera can be fed as a normal
`(B, C, H, W)` image tensor and produce one pooled global feature vector.
"""
def __init__(
self,
model_name: str = 'google/siglip2-base-patch16-256',
*,
model_name_or_path: str | None = None,
vision_model: nn.Module | None = None,
camera_names: Sequence[str] = ('r_vis', 'top', 'front'),
num_cameras: Optional[int] = None,
per_view_output_dim: int = 96,
output_dim: int | None = None,
freeze_backbone: bool = True,
dataset_image_resize_shape: Sequence[int] | None = None,
eval_image_resize_shape: Sequence[int] | None = (256, 256),
) -> None:
super().__init__()
if model_name_or_path is not None:
model_name = model_name_or_path
if output_dim is not None:
per_view_output_dim = output_dim
self.model_name = str(model_name)
self.camera_names = tuple(camera_names)
self.num_cameras = int(num_cameras) if num_cameras is not None else len(self.camera_names)
if len(self.camera_names) != self.num_cameras:
raise ValueError(
f'camera_names length ({len(self.camera_names)}) must match num_cameras ({self.num_cameras})'
)
self._output_dim = int(per_view_output_dim)
self.joint_output_dim = self._output_dim * self.num_cameras
self.freeze_backbone = bool(freeze_backbone)
self.dataset_image_resize_shape = self._normalize_resize_shape(dataset_image_resize_shape)
self.eval_image_resize_shape = self._normalize_resize_shape(eval_image_resize_shape)
self.encoder = vision_model if vision_model is not None else SiglipVisionModel.from_pretrained(self.model_name)
hidden_size = int(getattr(self.encoder.config, 'hidden_size'))
self.view_projector = nn.Linear(hidden_size, self._output_dim)
self.projector = self.view_projector
self.register_buffer('mean', torch.tensor([0.5, 0.5, 0.5], dtype=torch.float32).view(1, 3, 1, 1))
self.register_buffer('std', torch.tensor([0.5, 0.5, 0.5], dtype=torch.float32).view(1, 3, 1, 1))
if self.freeze_backbone:
self._freeze_encoder()
@staticmethod
def _normalize_resize_shape(shape: Sequence[int] | None) -> tuple[int, int] | None:
if shape is None:
return None
normalized = tuple(int(v) for v in shape)
if len(normalized) != 2:
raise ValueError(f'resize shape must contain exactly two values, got {normalized}')
return normalized
@property
def output_dim(self) -> int:
return self._output_dim
def _freeze_encoder(self) -> None:
self.encoder.eval()
for param in self.encoder.parameters():
param.requires_grad = False
def train(self, mode: bool = True):
super().train(mode)
if self.freeze_backbone:
self._freeze_encoder()
return self
def _ordered_camera_names(self, images: Dict[str, torch.Tensor]) -> Tuple[str, ...]:
missing = [camera_name for camera_name in self.camera_names if camera_name not in images]
if missing:
raise ValueError(
f'image input missing required cameras. missing={missing}, expected={list(self.camera_names)}'
)
return self.camera_names
def _prepare_pixels(self, image: torch.Tensor) -> torch.Tensor:
if image.ndim != 5:
raise ValueError(f'expected image tensor shaped (B, T, C, H, W), got {tuple(image.shape)}')
pixels = image.reshape(-1, *image.shape[2:]).contiguous().float()
pixels = pixels.clamp(0.0, 1.0)
return (pixels - self.mean) / self.std
def forward(self, images: Dict[str, torch.Tensor]) -> torch.Tensor:
camera_names = self._ordered_camera_names(images)
reference_shape = images[camera_names[0]].shape
batch_size, steps = reference_shape[:2]
per_view_features = []
for camera_name in camera_names:
image = images[camera_name]
if image.shape != reference_shape:
raise ValueError(
f'camera {camera_name!r} shape {tuple(image.shape)} does not match {tuple(reference_shape)}'
)
pixels = self._prepare_pixels(image)
with torch.set_grad_enabled(torch.is_grad_enabled() and not self.freeze_backbone):
encoded = self.encoder(pixel_values=pixels)
pooled = encoded.pooler_output
per_view_features.append(self.view_projector(pooled))
features = torch.cat(per_view_features, dim=-1)
return features.view(batch_size, steps, self.joint_output_dim)
Siglip2DiffusionBackbone = SigLIP2DiffusionBackbone
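A sketch that exercises the per-view contract with an injected stub instead of downloading the real SigLIP weights; the stub class below is a placeholder, not part of this commit:

```python
import types
import torch
from torch import nn
from roboimi.vla.models.backbones.siglip2_diffusion_backbone import SigLIP2DiffusionBackbone

class _TinyVisionStub(nn.Module):
    """Stands in for SiglipVisionModel: channel means as the pooled feature."""
    def __init__(self):
        super().__init__()
        self.config = types.SimpleNamespace(hidden_size=3)
    def forward(self, pixel_values=None, **kwargs):
        return types.SimpleNamespace(pooler_output=pixel_values.mean(dim=(2, 3)))

backbone = SigLIP2DiffusionBackbone(vision_model=_TinyVisionStub(), per_view_output_dim=2)
images = {cam: torch.rand(1, 2, 3, 32, 32) for cam in ("r_vis", "top", "front")}
print(backbone(images).shape)  # torch.Size([1, 2, 6]) == (B, T, per_view_output_dim * num_cameras)
```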

View File

@@ -0,0 +1,17 @@
from __future__ import annotations
import torch
from torch import nn
class LinearConditionProjector(nn.Module):
"""Projects per-step visual+state conditioning to the head conditioning width."""
def __init__(self, input_dim: int, output_dim: int, bias: bool = True):
super().__init__()
self.input_dim = int(input_dim)
self.output_dim = int(output_dim)
self.linear = nn.Linear(self.input_dim, self.output_dim, bias=bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.linear(x)

View File

@@ -90,6 +90,24 @@ class _FakeRenderer:
class EvalVLAHeadlessTest(unittest.TestCase):
def test_prepare_observation_skips_resize_when_image_resize_shape_is_none(self):
obs = {
"images": {
"front": np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3),
},
"qpos": np.zeros(16, dtype=np.float32),
}
with mock.patch("cv2.resize", side_effect=AssertionError("resize should be skipped")):
prepared = eval_vla.prepare_observation(
obs,
["front"],
image_resize_shape=None,
)
self.assertEqual(tuple(prepared["images"]["front"].shape), (3, 8, 8))
self.assertEqual(tuple(prepared["qpos"].shape), (16,))
def test_headless_eval_sets_mujoco_gl_to_egl_when_display_missing(self):
cfg = OmegaConf.create({"eval": {"headless": True}})
with mock.patch.dict(eval_vla.os.environ, {}, clear=True):

View File

@@ -1,5 +1,6 @@
import contextlib
import importlib
import importlib.machinery
import sys
import types
import unittest
@@ -69,6 +70,68 @@ class _FakeRearrange(nn.Module):
return x
class _FakeViTConfig:
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
class _FakeViTModel(nn.Module):
def __init__(self, config, add_pooling_layer=False):
super().__init__()
del add_pooling_layer
self.config = config
hidden_size = int(getattr(config, 'hidden_size', 192))
self.proj = nn.Linear(hidden_size, hidden_size)
def forward(self, pixel_values=None, interpolate_pos_encoding=False, **kwargs):
del interpolate_pos_encoding, kwargs
batch_size = pixel_values.shape[0]
hidden_size = int(getattr(self.config, 'hidden_size', 192))
seq_len = 2
last_hidden_state = torch.zeros(batch_size, seq_len, hidden_size, dtype=pixel_values.dtype, device=pixel_values.device)
return types.SimpleNamespace(last_hidden_state=last_hidden_state)
class _FakeSiglipVisionOutput:
def __init__(self, pooler_output):
self.pooler_output = pooler_output
class _FakeSiglipVisionConfig:
def __init__(self, hidden_size=768, image_size=256):
self.hidden_size = hidden_size
self.image_size = image_size
class _FakeSiglipVisionModel(nn.Module):
load_calls = []
def __init__(self, hidden_size=768):
super().__init__()
self.config = _FakeSiglipVisionConfig(hidden_size=hidden_size)
self.scale = nn.Parameter(torch.tensor(1.0))
self.forward_calls = []
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
model = cls()
cls.load_calls.append({
'pretrained_model_name_or_path': pretrained_model_name_or_path,
'args': args,
'kwargs': kwargs,
})
return model
def forward(self, pixel_values=None, **kwargs):
self.forward_calls.append({
'pixel_values': pixel_values.detach().clone(),
'kwargs': dict(kwargs),
})
pooled = pixel_values.mean(dim=(2, 3), keepdim=False) * self.scale
return _FakeSiglipVisionOutput(pooler_output=pooled)
class _StubIMFHead(nn.Module):
def __init__(
self,
@@ -105,6 +168,11 @@ class _StubIMFHead(nn.Module):
def _stub_optional_modules(include_imf_head=False):
previous_modules = {}
def remember_and_remove(name):
if name not in previous_modules:
previous_modules[name] = sys.modules.get(name, _MISSING)
sys.modules.pop(name, None)
def inject(name, module):
if name not in previous_modules:
previous_modules[name] = sys.modules.get(name, _MISSING)
@@ -125,6 +193,9 @@ def _stub_optional_modules(include_imf_head=False):
torchvision_module = types.ModuleType('torchvision')
models_module = types.ModuleType('torchvision.models')
transforms_module = types.ModuleType('torchvision.transforms')
torchvision_module.__spec__ = importlib.machinery.ModuleSpec('torchvision', loader=None)
models_module.__spec__ = importlib.machinery.ModuleSpec('torchvision.models', loader=None)
transforms_module.__spec__ = importlib.machinery.ModuleSpec('torchvision.transforms', loader=None)
models_module.resnet18 = lambda weights=None: _FakeResNet()
transforms_module.CenterCrop = _IdentityCrop
transforms_module.RandomCrop = _IdentityCrop
@@ -139,7 +210,14 @@ def _stub_optional_modules(include_imf_head=False):
einops_module.layers = einops_layers_module
einops_layers_module.torch = einops_layers_torch_module
transformers_module = types.ModuleType('transformers')
transformers_module.__spec__ = importlib.machinery.ModuleSpec('transformers', loader=None)
transformers_module.ViTConfig = _FakeViTConfig
transformers_module.ViTModel = _FakeViTModel
transformers_module.SiglipVisionModel = _FakeSiglipVisionModel
try:
remember_and_remove('roboimi.vla.models.backbones.siglip2_diffusion_backbone')
inject('diffusers', diffusers_module)
inject('diffusers.schedulers', schedulers_module)
inject('diffusers.schedulers.scheduling_ddpm', ddpm_module)
@@ -150,6 +228,7 @@ def _stub_optional_modules(include_imf_head=False):
inject('einops', einops_module)
inject('einops.layers', einops_layers_module)
inject('einops.layers.torch', einops_layers_torch_module)
inject('transformers', transformers_module)
if include_imf_head:
import roboimi.vla.models.heads as heads_package
@@ -200,6 +279,67 @@ class _StubVisionBackbone(nn.Module):
return torch.cat(per_camera_features, dim=-1)
class _StubJointVisionBackbone(nn.Module):
joint_output_dim = 5
output_dim = 5
def __init__(self, camera_names=_CAMERA_NAMES):
super().__init__()
self.camera_names = tuple(camera_names)
self.num_cameras = len(self.camera_names)
def forward(self, images):
batch_size, obs_horizon = next(iter(images.values())).shape[:2]
features = []
for camera_name in ('front', 'top', 'r_vis'):
image_batch = images[camera_name]
features.append(image_batch.mean(dim=(2, 3, 4), keepdim=False).unsqueeze(-1))
joint_features = torch.cat(features, dim=-1)
front_top_sum = joint_features[..., :2].sum(dim=-1, keepdim=True)
r_vis_minus_front = (joint_features[..., 2:] - joint_features[..., :1])
time_marker = torch.arange(obs_horizon, dtype=joint_features.dtype).view(1, obs_horizon, 1)
time_marker = time_marker.expand(batch_size, -1, -1)
return torch.cat([joint_features, front_top_sum, r_vis_minus_front + time_marker], dim=-1)
class _StubMultiTokenVisionBackbone(nn.Module):
output_dim = 2
tokens_per_step = 3
def __init__(self, camera_names=_CAMERA_NAMES):
super().__init__()
self.camera_names = tuple(camera_names)
self.num_cameras = len(self.camera_names)
def forward(self, images):
batch_size, obs_horizon = next(iter(images.values())).shape[:2]
features = []
time_marker = torch.arange(obs_horizon, dtype=torch.float32).view(1, obs_horizon, 1).expand(batch_size, -1, -1)
for camera_name in self.camera_names:
image_batch = images[camera_name]
camera_marker = image_batch.mean(dim=(2, 3, 4), keepdim=False).unsqueeze(-1)
features.append(torch.cat([camera_marker, camera_marker + time_marker], dim=-1))
return torch.stack(features, dim=2)
class _StubMultiTokenVisionBackbone(nn.Module):
output_dim = 2
tokens_per_step = 3
def __init__(self, camera_names=_CAMERA_NAMES):
super().__init__()
self.camera_names = tuple(camera_names)
self.num_cameras = len(self.camera_names)
def forward(self, images):
per_camera = []
for camera_name in self.camera_names:
image_batch = images[camera_name]
base = image_batch.mean(dim=(2, 3, 4), keepdim=False)
per_camera.append(torch.stack([base, base + 0.5], dim=-1))
return torch.stack(per_camera, dim=2)
class _RecordingLinearIMFHead(nn.Module):
def __init__(self):
super().__init__()
@@ -390,6 +530,178 @@ class IMFVLAAgentTest(unittest.TestCase):
self.assertTrue(torch.equal(third_action, second_chunk[0, 1]))
self.assertEqual(mock_predict_chunk.call_count, 2)
def test_joint_visual_backbone_uses_joint_output_dim_for_conditioning(self):
agent_cls, _agent_module = _load_imf_agent_class()
head = _RecordingLinearIMFHead()
vision_backbone = _StubJointVisionBackbone()
agent = agent_cls(
vision_backbone=vision_backbone,
state_encoder=nn.Identity(),
action_encoder=nn.Identity(),
head=head,
action_dim=2,
obs_dim=1,
pred_horizon=3,
obs_horizon=2,
diffusion_steps=10,
inference_steps=1,
num_cams=len(_CAMERA_NAMES),
camera_names=_CAMERA_NAMES,
num_action_steps=2,
head_type='transformer',
)
self.assertEqual(agent.per_step_cond_dim, vision_backbone.joint_output_dim + agent.obs_dim)
self.assertEqual(
agent.global_cond_dim,
vision_backbone.joint_output_dim * agent.obs_horizon + agent.obs_dim * agent.obs_horizon,
)
images = _make_images(
batch_size=1,
obs_horizon=2,
per_camera_fill={'r_vis': 10.0, 'top': 20.0, 'front': 30.0},
)
qpos = torch.tensor([[[1.0], [2.0]]], dtype=torch.float32)
initial_noise = torch.tensor(
[[[1.0, -1.0], [0.0, 2.0], [3.0, -2.0]]],
dtype=torch.float32,
)
with mock.patch.object(torch, 'randn', return_value=initial_noise):
predicted_actions = agent.predict_action(images, qpos)
self.assertEqual(predicted_actions.shape, (1, 3, 2))
self.assertEqual(len(head.calls), 1)
expected_cond = torch.tensor(
[[[30.0, 20.0, 10.0, 50.0, -20.0, 1.0], [30.0, 20.0, 10.0, 50.0, -19.0, 2.0]]],
dtype=torch.float32,
)
self.assertEqual(head.calls[0]['cond'].shape[-1], 6)
self.assertTrue(torch.allclose(head.calls[0]['cond'], expected_cond))
def test_multitoken_visual_backbone_flattens_camera_tokens_and_projects_each_with_state(self):
agent_cls, _agent_module = _load_imf_agent_class()
head = _RecordingLinearIMFHead()
projector = nn.Linear(3, 4, bias=False)
with torch.no_grad():
projector.weight.copy_(
torch.tensor(
[
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
[1.0, 0.0, 1.0],
],
dtype=torch.float32,
)
)
agent = agent_cls(
vision_backbone=_StubMultiTokenVisionBackbone(),
state_encoder=nn.Identity(),
action_encoder=nn.Identity(),
head=head,
action_dim=2,
obs_dim=1,
pred_horizon=3,
obs_horizon=2,
diffusion_steps=10,
inference_steps=1,
num_cams=len(_CAMERA_NAMES),
camera_names=_CAMERA_NAMES,
num_action_steps=2,
head_type='transformer',
cond_projector=projector,
)
self.assertEqual(agent.condition_tokens_per_step, 3)
self.assertEqual(agent.condition_sequence_length, 6)
self.assertEqual(agent.per_step_cond_dim, 4)
self.assertEqual(agent.global_cond_dim, 24)
images = _make_images(
batch_size=1,
obs_horizon=2,
per_camera_fill={'r_vis': 10.0, 'top': 20.0, 'front': 30.0},
)
qpos = torch.tensor([[[1.0], [2.0]]], dtype=torch.float32)
cond = agent._build_cond(images, qpos)
expected = torch.tensor(
[
[
[10.0, 10.5, 1.0, 11.0],
[20.0, 20.5, 1.0, 21.0],
[30.0, 30.5, 1.0, 31.0],
[10.0, 10.5, 2.0, 12.0],
[20.0, 20.5, 2.0, 22.0],
[30.0, 30.5, 2.0, 32.0],
]
],
dtype=torch.float32,
)
self.assertEqual(cond.shape, (1, 6, 4))
self.assertTrue(torch.allclose(cond, expected))
def test_multi_token_visual_backbone_pairs_state_per_camera_and_flattens_condition_sequence(self):
agent_cls, agent_module = _load_imf_agent_class()
head = _RecordingLinearIMFHead()
cond_projector = nn.Linear(3, 4, bias=False)
with torch.no_grad():
cond_projector.weight.copy_(torch.tensor([
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
[1.0, 0.0, 1.0],
], dtype=torch.float32))
agent = agent_cls(
vision_backbone=_StubMultiTokenVisionBackbone(),
state_encoder=nn.Identity(),
action_encoder=nn.Identity(),
head=head,
action_dim=2,
obs_dim=1,
pred_horizon=3,
obs_horizon=2,
diffusion_steps=10,
inference_steps=1,
num_cams=len(_CAMERA_NAMES),
camera_names=_CAMERA_NAMES,
num_action_steps=2,
head_type='transformer',
cond_projector=cond_projector,
)
agent.infer_scheduler = _ForbiddenScheduler()
images = _make_images(
batch_size=1,
obs_horizon=2,
per_camera_fill={'r_vis': 10.0, 'top': 20.0, 'front': 30.0},
)
qpos = torch.tensor([[[1.0], [2.0]]], dtype=torch.float32)
initial_noise = torch.tensor([[[1.0, -1.0], [0.0, 2.0], [3.0, -2.0]]], dtype=torch.float32)
with mock.patch.object(agent_module.torch, 'randn', return_value=initial_noise):
predicted_actions = agent.predict_action(images, qpos)
expected_cond = torch.tensor([[[10.0, 10.5, 1.0, 11.0],
[20.0, 20.5, 1.0, 21.0],
[30.0, 30.5, 1.0, 31.0],
[10.0, 10.5, 2.0, 12.0],
[20.0, 20.5, 2.0, 22.0],
[30.0, 30.5, 2.0, 32.0]]], dtype=torch.float32)
self.assertEqual(agent.condition_tokens_per_step, 3)
self.assertEqual(agent.condition_sequence_length, 6)
self.assertEqual(agent.raw_per_step_cond_dim, 3)
self.assertEqual(agent.per_step_cond_dim, 4)
self.assertEqual(agent.global_cond_dim, 24)
self.assertEqual(predicted_actions.shape, (1, 3, 2))
self.assertEqual(len(head.calls), 1)
self.assertEqual(head.calls[0]['cond'].shape, (1, 6, 4))
self.assertTrue(torch.allclose(head.calls[0]['cond'], expected_cond))
def test_hydra_config_instantiates_resnet_imf_attnres_with_stub_head(self):
cfg = _compose_cfg(
overrides=[
@@ -448,6 +760,130 @@ class IMFVLAAgentTest(unittest.TestCase):
self.assertEqual(agent.per_step_cond_dim, 64 * agent.num_cams + agent.obs_dim)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['cond_dim'], agent.per_step_cond_dim)
def test_hydra_config_instantiates_lewm_imf_attnres_with_joint_visual_condition_dim(self):
cfg = _compose_cfg(
overrides=[
'agent=lewm_imf_attnres',
'agent.vision_backbone.checkpoint_path=null',
'agent.head.n_layer=1',
'agent.head.n_emb=16',
]
)
self.assertEqual(cfg.agent._target_, 'roboimi.vla.agent_imf.IMFVLAAgent')
self.assertEqual(cfg.agent.vision_backbone._target_, 'roboimi.vla.models.backbones.lewm_vit_backbone.LEWMViTBackbone')
self.assertEqual(list(cfg.agent.camera_names), list(_CAMERA_NAMES))
self.assertEqual(list(cfg.agent.vision_backbone.camera_names), list(_CAMERA_NAMES))
self.assertEqual(list(cfg.agent.vision_backbone.fused_camera_names), ['front', 'top', 'r_vis'])
self.assertIsNone(cfg.agent.vision_backbone.dataset_image_resize_shape)
self.assertEqual(list(cfg.agent.vision_backbone.eval_image_resize_shape), [256, 256])
self.assertEqual(cfg.agent.head.cond_dim, 208)
with _stub_optional_modules(include_imf_head=True):
agent = instantiate(cfg.agent)
self.assertEqual(agent.per_step_cond_dim, agent.vision_encoder.joint_output_dim + agent.obs_dim)
self.assertEqual(agent.per_step_cond_dim, 208)
self.assertEqual(agent.global_cond_dim, agent.obs_horizon * 208)
self.assertIsNone(agent.vision_encoder.dataset_image_resize_shape)
self.assertEqual(agent.vision_encoder.eval_image_resize_shape, (256, 256))
self.assertIsInstance(agent.noise_pred_net, _StubIMFHead)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['cond_dim'], 208)
def test_hydra_config_instantiates_resnet_imf_attnres_multitoken_with_projected_camera_tokens(self):
cfg = _compose_cfg(
overrides=[
'agent=resnet_imf_attnres_multitoken',
'agent.vision_backbone.pretrained_backbone_weights=null',
'agent.vision_backbone.input_shape=[3,16,16]',
'agent.head.n_layer=1',
'agent.head.n_emb=32',
]
)
self.assertEqual(cfg.agent._target_, 'roboimi.vla.agent_imf.IMFVLAAgent')
self.assertEqual(cfg.agent.vision_backbone.vision_backbone_mode, 'resnet')
self.assertTrue(cfg.agent.vision_backbone.use_separate_rgb_encoder_per_camera)
self.assertTrue(cfg.agent.vision_backbone.output_tokens_per_camera)
self.assertEqual(cfg.agent.cond_projector.output_dim, 32)
self.assertEqual(cfg.agent.head.cond_dim, 32)
with _stub_optional_modules(include_imf_head=True):
agent = instantiate(cfg.agent)
self.assertEqual(agent.condition_tokens_per_step, 3)
self.assertEqual(agent.condition_sequence_length, agent.obs_horizon * 3)
self.assertEqual(agent.per_step_cond_dim, 32)
self.assertEqual(agent.global_cond_dim, agent.condition_sequence_length * 32)
self.assertIsInstance(agent.noise_pred_net, _StubIMFHead)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['cond_dim'], 32)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['n_obs_steps'], 6)
def test_hydra_config_instantiates_siglip2_imf_attnres_with_condition_projection(self):
cfg = _compose_cfg(
overrides=[
'agent=siglip2_imf_attnres',
'agent.vision_backbone.per_view_output_dim=96',
'agent.head.n_layer=1',
'agent.head.n_emb=16',
'agent.cond_projector.output_dim=384',
]
)
self.assertEqual(cfg.agent._target_, 'roboimi.vla.agent_imf.IMFVLAAgent')
self.assertEqual(
cfg.agent.vision_backbone._target_,
'roboimi.vla.models.backbones.siglip2_diffusion_backbone.SigLIP2DiffusionBackbone',
)
self.assertEqual(list(cfg.agent.camera_names), list(_CAMERA_NAMES))
self.assertIsNone(cfg.agent.vision_backbone.dataset_image_resize_shape)
self.assertEqual(list(cfg.agent.vision_backbone.eval_image_resize_shape), [256, 256])
self.assertEqual(cfg.agent.head.cond_dim, 384)
with _stub_optional_modules(include_imf_head=True):
agent = instantiate(cfg.agent)
self.assertEqual(agent.raw_per_step_cond_dim, 3 * 96 + agent.obs_dim)
self.assertEqual(agent.per_step_cond_dim, 384)
self.assertEqual(agent.global_cond_dim, agent.obs_horizon * 384)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['cond_dim'], 384)
self.assertEqual(agent.vision_encoder.output_dim, 96)
self.assertEqual(agent.vision_encoder.eval_image_resize_shape, (256, 256))
def test_hydra_config_instantiates_resnet_imf_attnres_multitoken_with_sequence_length_three_times_obs_horizon(self):
cfg = _compose_cfg(
overrides=[
'agent=resnet_imf_attnres_multitoken',
'agent.vision_backbone.pretrained_backbone_weights=null',
'agent.vision_backbone.input_shape=[3,16,16]',
'agent.vision_backbone.freeze_backbone=false',
'agent.head.n_layer=1',
'agent.head.n_emb=16',
]
)
self.assertEqual(cfg.agent._target_, 'roboimi.vla.agent_imf.IMFVLAAgent')
self.assertEqual(list(cfg.agent.camera_names), list(_CAMERA_NAMES))
self.assertTrue(cfg.agent.vision_backbone.use_separate_rgb_encoder_per_camera)
self.assertTrue(cfg.agent.vision_backbone.output_tokens_per_camera)
self.assertEqual(cfg.agent.vision_backbone.vision_backbone_mode, 'resnet')
self.assertEqual(cfg.agent.cond_projector.output_dim, 16)
self.assertEqual(cfg.agent.head.cond_dim, 16)
with _stub_optional_modules(include_imf_head=True):
agent = instantiate(cfg.agent)
self.assertEqual(agent.condition_tokens_per_step, 3)
self.assertEqual(agent.condition_sequence_length, agent.obs_horizon * 3)
self.assertEqual(agent.per_step_cond_dim, 16)
self.assertEqual(agent.global_cond_dim, agent.condition_sequence_length * 16)
self.assertEqual(agent.vision_encoder.tokens_per_step, 3)
self.assertIsInstance(agent.noise_pred_net, _StubIMFHead)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['cond_dim'], 16)
self.assertEqual(agent.noise_pred_net.constructor_kwargs['n_obs_steps'], agent.condition_sequence_length)
if __name__ == '__main__':
unittest.main()

View File

@@ -0,0 +1,220 @@
import tempfile
import types
import unittest
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTConfig, ViTModel
_INPUT_CAMERA_NAMES = ("r_vis", "top", "front")
_FUSED_CAMERA_NAMES = ("front", "top", "r_vis")
class _ReferenceProjector(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(192, 2048),
nn.BatchNorm1d(2048),
nn.GELU(),
nn.Linear(2048, 192),
)
def forward(self, x):
return self.net(x)
def _build_reference_encoder() -> ViTModel:
return ViTModel(
ViTConfig(
image_size=224,
patch_size=14,
num_channels=3,
hidden_size=192,
intermediate_size=768,
num_hidden_layers=12,
num_attention_heads=3,
qkv_bias=True,
),
add_pooling_layer=False,
)
def _write_synthetic_lightning_ckpt(path: Path):
torch.manual_seed(7)
encoder = _build_reference_encoder()
projector = _ReferenceProjector()
lightning_state_dict = {}
for key, value in encoder.state_dict().items():
lightning_state_dict[f"model.encoder.{key}"] = value.detach().clone()
for key, value in projector.state_dict().items():
lightning_state_dict[f"model.projector.{key}"] = value.detach().clone()
torch.save({"state_dict": lightning_state_dict}, path)
return encoder.state_dict(), projector.state_dict()
class LEWMViTBackboneTest(unittest.TestCase):
def test_loads_lightning_encoder_and_projector_checkpoint_and_emits_joint_embedding(self):
from roboimi.vla.models.backbones.lewm_vit_backbone import LEWMViTBackbone
with tempfile.TemporaryDirectory() as tmpdir:
ckpt_path = Path(tmpdir) / "synthetic-lewm.ckpt"
reference_encoder_state, reference_projector_state = _write_synthetic_lightning_ckpt(
ckpt_path
)
backbone = LEWMViTBackbone(
checkpoint_path=ckpt_path,
camera_names=_INPUT_CAMERA_NAMES,
fused_camera_names=_FUSED_CAMERA_NAMES,
freeze_backbone=True,
)
self.assertEqual(backbone.camera_names, _INPUT_CAMERA_NAMES)
self.assertEqual(backbone.fused_camera_names, _FUSED_CAMERA_NAMES)
self.assertEqual(backbone.num_cameras, 3)
self.assertEqual(backbone.joint_output_dim, 192)
self.assertEqual(backbone.output_dim, 192)
self.assertEqual(backbone.encoder.config.hidden_size, 192)
self.assertEqual(backbone.encoder.config.patch_size, 14)
self.assertEqual(backbone.encoder.config.num_hidden_layers, 12)
self.assertEqual(backbone.encoder.config.num_attention_heads, 3)
for key, value in reference_encoder_state.items():
self.assertTrue(torch.equal(backbone.encoder.state_dict()[key], value), key)
for key, value in reference_projector_state.items():
self.assertTrue(torch.equal(backbone.projector.state_dict()[key], value), key)
images = {
cam_name: torch.rand(1, 1, 3, 224, 224)
for cam_name in _INPUT_CAMERA_NAMES
}
output = backbone(images)
self.assertEqual(output.shape, (1, 1, 192))
self.assertFalse(output.requires_grad)
def test_forward_uses_front_top_rvis_fusion_order_and_exact_lewm_cwh_resize_path(self):
from roboimi.vla.models.backbones.lewm_vit_backbone import LEWMViTBackbone
with tempfile.TemporaryDirectory() as tmpdir:
ckpt_path = Path(tmpdir) / "synthetic-lewm.ckpt"
_write_synthetic_lightning_ckpt(ckpt_path)
backbone = LEWMViTBackbone(
checkpoint_path=ckpt_path,
camera_names=_INPUT_CAMERA_NAMES,
fused_camera_names=_FUSED_CAMERA_NAMES,
freeze_backbone=True,
)
captured = {}
def fake_encoder_forward(module, pixel_values, interpolate_pos_encoding=False, **kwargs):
del module, kwargs
captured["pixel_values"] = pixel_values.detach().clone()
captured["interpolate_pos_encoding"] = interpolate_pos_encoding
batch = pixel_values.shape[0]
patch_tokens = (pixel_values.shape[-2] // 14) * (pixel_values.shape[-1] // 14)
cls = (
torch.arange(192, dtype=pixel_values.dtype, device=pixel_values.device)
.unsqueeze(0)
.expand(batch, -1)
)
last_hidden_state = torch.zeros(
batch,
patch_tokens + 1,
192,
dtype=pixel_values.dtype,
device=pixel_values.device,
)
last_hidden_state[:, 0] = cls
return types.SimpleNamespace(last_hidden_state=last_hidden_state)
backbone.encoder.forward = types.MethodType(fake_encoder_forward, backbone.encoder)
r_vis = torch.full((1, 1, 3, 256, 256), 0.30)
top = torch.full((1, 1, 3, 256, 256), 0.20)
front = torch.full((1, 1, 3, 256, 256), 0.10)
bn = backbone.projector.net[1]
running_mean_before = bn.running_mean.detach().clone()
running_var_before = bn.running_var.detach().clone()
backbone.train()
self.assertFalse(backbone.encoder.training)
self.assertFalse(backbone.projector.training)
output = backbone({"r_vis": r_vis, "top": top, "front": front})
self.assertEqual(output.shape, (1, 1, 192))
self.assertEqual(captured["pixel_values"].shape, (1, 3, 672, 224))
self.assertTrue(captured["interpolate_pos_encoding"])
normalized_views = [
((view.reshape(-1, *view.shape[2:]).float()).clamp(0.0, 1.0) - backbone.mean) / backbone.std
for view in (front, top, r_vis)
]
expected_fuse_then_resize = F.interpolate(
torch.cat(normalized_views, dim=-2),
size=(672, 224),
mode="bilinear",
align_corners=False,
antialias=True,
)
expected_pre_resize_then_fuse = torch.cat(
[
F.interpolate(
view,
size=(224, 224),
mode="bilinear",
align_corners=False,
antialias=True,
)
for view in normalized_views
],
dim=-2,
)
self.assertTrue(
torch.allclose(captured["pixel_values"], expected_fuse_then_resize, atol=1e-6, rtol=1e-6)
)
self.assertFalse(
torch.allclose(
expected_fuse_then_resize,
expected_pre_resize_then_fuse,
atol=1e-6,
rtol=1e-6,
)
)
self.assertFalse(
torch.allclose(
captured["pixel_values"],
expected_pre_resize_then_fuse,
atol=1e-6,
rtol=1e-6,
)
)
self.assertTrue(
torch.allclose(
captured["pixel_values"][0, :, 223, :],
expected_fuse_then_resize[0, :, 223, :],
atol=1e-6,
rtol=1e-6,
)
)
self.assertTrue(
torch.allclose(
captured["pixel_values"][0, :, 447, :],
expected_fuse_then_resize[0, :, 447, :],
atol=1e-6,
rtol=1e-6,
)
)
self.assertTrue(torch.equal(bn.running_mean, running_mean_before))
self.assertTrue(torch.equal(bn.running_var, running_var_before))
if __name__ == "__main__":
unittest.main()

View File

@@ -180,6 +180,14 @@ def _extract_camera_markers(cond, feature_dim, num_cams):
return camera_block[:, 0]
def _extract_token_camera_markers(tokens):
return tokens[0, 0, :, 0]
def _extract_token_markers(token_sequence):
return token_sequence[0, 0, :, 0]
class ResNetTransformerAgentWiringTest(unittest.TestCase):
def test_hydra_wiring_uses_required_three_camera_transformer_conditioning_in_agent_order_and_ignores_extra_keys(self):
cfg = _compose_cfg(
@@ -246,6 +254,36 @@ class ResNetTransformerAgentWiringTest(unittest.TestCase):
with self.assertRaisesRegex(ValueError, 'missing=.*top'):
agent.predict_action(missing_images, proprioception)
def test_multitoken_resnet_backbone_emits_one_token_per_camera_in_agent_order(self):
cfg = _compose_cfg(
overrides=[
'agent=resnet_imf_attnres_multitoken',
'agent.vision_backbone.pretrained_backbone_weights=null',
'agent.vision_backbone.input_shape=[3,16,16]',
]
)
with _stub_optional_modules():
backbone = instantiate(cfg.agent.vision_backbone)
_patch_backbone_for_order_tracking(backbone)
images = _make_images(
batch_size=1,
obs_horizon=cfg.agent.obs_horizon,
image_shape=tuple(cfg.agent.vision_backbone.input_shape),
per_camera_fill={
'front': 30.0,
'top': 20.0,
'r_vis': 10.0,
'left_wrist': 99.0,
},
)
tokens = backbone(images)
self.assertEqual(tokens.shape, (1, cfg.agent.obs_horizon, 3, backbone.output_dim))
self.assertEqual(backbone.tokens_per_step, 3)
camera_markers = _extract_token_camera_markers(tokens)
self.assertTrue(torch.allclose(camera_markers, torch.tensor([10.0, 20.0, 30.0])))
def test_agent_rejects_conflicting_explicit_backbone_camera_names(self):
cfg = _compose_cfg(
overrides=[
@@ -382,6 +420,36 @@ class ResNetTransformerAgentWiringTest(unittest.TestCase):
with self.assertRaisesRegex(InstantiationException, 'num_cams'):
instantiate(cfg.agent)
def test_multitoken_resnet_backbone_emits_one_token_per_camera_in_agent_order(self):
cfg = _compose_cfg(
overrides=[
'agent=resnet_imf_attnres_multitoken',
'agent.vision_backbone.pretrained_backbone_weights=null',
'agent.vision_backbone.input_shape=[3,16,16]',
'agent.head.n_layer=1',
'agent.head.n_emb=32',
]
)
with _stub_optional_modules():
backbone = instantiate(cfg.agent.vision_backbone)
_patch_backbone_for_order_tracking(backbone)
images = _make_images(
batch_size=1,
obs_horizon=cfg.agent.obs_horizon,
image_shape=tuple(cfg.agent.vision_backbone.input_shape),
per_camera_fill={
'front': 30.0,
'top': 20.0,
'r_vis': 10.0,
},
)
output = backbone(images)
self.assertEqual(output.shape, (1, cfg.agent.obs_horizon, 3, backbone.output_dim))
token_markers = _extract_token_markers(output)
self.assertTrue(torch.allclose(token_markers, torch.tensor([10.0, 20.0, 30.0])))
if __name__ == '__main__':
unittest.main()

View File

@@ -0,0 +1,121 @@
import types
import unittest
from unittest import mock
import torch
from torch import nn
_CAMERA_NAMES = ("r_vis", "top", "front")
class _FakeSiglipVisionOutput:
def __init__(self, pooler_output):
self.pooler_output = pooler_output
class _FakeSiglipVisionConfig:
def __init__(self, hidden_size=768, image_size=256):
self.hidden_size = hidden_size
self.image_size = image_size
class _FakeSiglipVisionModel(nn.Module):
def __init__(self, hidden_size=768):
super().__init__()
self.config = _FakeSiglipVisionConfig(hidden_size=hidden_size)
self.forward_calls = []
@classmethod
def from_pretrained(cls, *args, **kwargs):
del args, kwargs
return cls()
def forward(self, pixel_values=None, **kwargs):
self.forward_calls.append({
"pixel_values": pixel_values.detach().clone(),
"kwargs": dict(kwargs),
})
pooled = pixel_values.mean(dim=(2, 3), keepdim=False)
return _FakeSiglipVisionOutput(pooler_output=pooled)
class SigLIP2DiffusionBackboneTest(unittest.TestCase):
def test_forward_encodes_each_view_independently_and_concatenates_projected_features(self):
from roboimi.vla.models.backbones.siglip2_diffusion_backbone import SigLIP2DiffusionBackbone
fake_model = _FakeSiglipVisionModel(hidden_size=3)
with mock.patch(
"roboimi.vla.models.backbones.siglip2_diffusion_backbone.SiglipVisionModel.from_pretrained",
return_value=fake_model,
) as mock_from_pretrained:
backbone = SigLIP2DiffusionBackbone(
model_name="google/siglip2-base-patch16-256",
camera_names=_CAMERA_NAMES,
num_cameras=3,
per_view_output_dim=2,
freeze_backbone=True,
)
self.assertEqual(backbone.camera_names, _CAMERA_NAMES)
self.assertEqual(backbone.num_cameras, 3)
self.assertEqual(backbone.output_dim, 2)
self.assertEqual(backbone.joint_output_dim, 6)
self.assertIsNone(backbone.dataset_image_resize_shape)
self.assertEqual(backbone.eval_image_resize_shape, (256, 256))
mock_from_pretrained.assert_called_once_with("google/siglip2-base-patch16-256")
self.assertTrue(all(not p.requires_grad for p in backbone.encoder.parameters()))
self.assertFalse(backbone.encoder.training)
with torch.no_grad():
backbone.view_projector.weight.zero_()
backbone.view_projector.bias.zero_()
backbone.view_projector.weight[0, 0] = 1.0
backbone.view_projector.weight[1, 1] = 1.0
images = {
"r_vis": torch.full((1, 2, 3, 256, 256), 0.25),
"top": torch.full((1, 2, 3, 256, 256), 0.50),
"front": torch.full((1, 2, 3, 256, 256), 0.75),
}
output = backbone(images)
self.assertEqual(output.shape, (1, 2, 6))
self.assertEqual(len(fake_model.forward_calls), 3)
expected_per_camera = []
for cam_name in _CAMERA_NAMES:
img = images[cam_name].reshape(2, 3, 256, 256)
normalized = (img - 0.5) / 0.5
expected_per_camera.append(normalized.mean(dim=(2, 3))[:, :2])
expected = torch.cat(expected_per_camera, dim=-1).view(1, 2, 6)
self.assertTrue(torch.allclose(output, expected, atol=1e-6, rtol=1e-6))
for call, cam_name in zip(fake_model.forward_calls, _CAMERA_NAMES):
pixels = call["pixel_values"]
self.assertEqual(tuple(pixels.shape), (2, 3, 256, 256))
self.assertTrue(
torch.allclose(
pixels,
(images[cam_name].reshape(2, 3, 256, 256) - 0.5) / 0.5,
)
)
def test_forward_rejects_missing_required_camera(self):
from roboimi.vla.models.backbones.siglip2_diffusion_backbone import SigLIP2DiffusionBackbone
backbone = SigLIP2DiffusionBackbone(
vision_model=_FakeSiglipVisionModel(hidden_size=4),
camera_names=_CAMERA_NAMES,
num_cameras=3,
)
with self.assertRaisesRegex(ValueError, "missing"):
backbone({
"r_vis": torch.rand(1, 1, 3, 256, 256),
"top": torch.rand(1, 1, 3, 256, 256),
})
if __name__ == "__main__":
unittest.main()

View File

@@ -56,3 +56,26 @@ class SimpleRobotDatasetImageLoadingTest(unittest.TestCase):
self.assertEqual(len(resize_calls), 2)
self.assertEqual(tuple(sample["observation.front"].shape), (2, 3, 8, 8))
def test_getitem_skips_resize_when_image_resize_shape_is_none(self):
with tempfile.TemporaryDirectory() as tmpdir:
dataset_dir = Path(tmpdir)
self._write_episode(dataset_dir)
dataset = SimpleRobotDataset(
dataset_dir,
obs_horizon=2,
pred_horizon=3,
camera_names=["front"],
image_resize_shape=None,
)
fake_cv2 = types.SimpleNamespace(
INTER_LINEAR=1,
resize=mock.Mock(side_effect=AssertionError("resize should be skipped when image_resize_shape=None")),
)
with mock.patch.dict(sys.modules, {"cv2": fake_cv2}):
sample = dataset[1]
fake_cv2.resize.assert_not_called()
self.assertEqual(tuple(sample["observation.front"].shape), (2, 3, 8, 8))

View File

@@ -159,6 +159,92 @@ class TrainVLARolloutValidationTest(unittest.TestCase):
self.assertGreater(cfg.train.num_workers, 8)
self.assertEqual(cfg.train.rollout_val_freq_epochs, 50)
def test_training_passes_backbone_image_resize_override_to_dataset_instantiation(self):
cfg = OmegaConf.create(
{
'agent': {
'vision_backbone': {
'dataset_image_resize_shape': None,
},
'normalization_type': 'min_max',
},
'data': {
'dataset_dir': 'unused',
'camera_names': ['front'],
},
'train': {
'batch_size': 2,
'lr': 1e-4,
'max_steps': 0,
'device': 'cpu',
'disable_cudnn': False,
'num_workers': 0,
'val_split': 0.0,
'seed': 42,
'log_freq': 1,
'save_freq': 10,
'use_swanlab': False,
'rollout_val_freq_epochs': 0,
'rollout_validate_on_checkpoint': False,
'rollout_num_episodes': 1,
'warmup_steps': 1,
'scheduler_type': 'constant',
'min_lr': 1e-6,
'weight_decay': 1e-5,
'grad_clip': 1.0,
'pretrained_ckpt': None,
},
'eval': {
'ckpt_path': 'unused.pt',
'num_episodes': 1,
'headless': True,
'device': 'cpu',
'verbose_action': False,
},
'experiment': {},
}
)
captured_dataset_kwargs = {}
def fake_instantiate(config_node, **kwargs):
if config_node is cfg.data:
captured_dataset_kwargs.update(kwargs)
return _FakeDataset()
if config_node is cfg.agent:
return _FakeAgent()
raise AssertionError(f'unexpected instantiate config: {config_node!r}')
def fake_dataloader(_dataset, *, shuffle, **_kwargs):
del shuffle, _kwargs
return _FakeLoader(
{
'observation.front': torch.zeros(1, 3, 2, 2),
'observation.state': torch.zeros(1, 4),
'action': torch.zeros(1, 2),
'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
},
length=1,
)
with tempfile.TemporaryDirectory() as tempdir:
previous_cwd = os.getcwd()
try:
os.chdir(tempdir)
with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
mock.patch.object(train_vla, '_init_swanlab', return_value=None), \
mock.patch.object(train_vla, '_finish_swanlab', return_value=None), \
mock.patch.object(train_vla.torch, 'save', return_value=None):
train_vla._run_training(cfg)
finally:
os.chdir(previous_cwd)
self.assertIn('image_resize_shape', captured_dataset_kwargs)
self.assertIsNone(captured_dataset_kwargs['image_resize_shape'])
def test_eval_main_delegates_to_plain_run_eval_helper(self):
cfg = OmegaConf.create(
{