merge: imf attnres policy

# Conflicts:
#	roboimi/demos/vla_scripts/eval_vla.py
#	roboimi/envs/double_base.py
This commit is contained in:
Logic
2026-05-02 22:23:29 +08:00
90 changed files with 6824 additions and 87 deletions


@@ -0,0 +1,268 @@
# IMF-AttnRes Policy Migration Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Migrate the IMF-AttnRes model, training objective, and one-step inference mechanism from the external `diffusion_policy@185ed659` into RoboIMI, and launch training with the same hyperparameters while keeping the three-camera visual conditioning input and the existing training/rollout workflow.
**Architecture:** Keep RoboIMI's existing ResNet three-camera observation encoding, normalization, queue-based online rollout, and training script; add the AttnRes components and the IMF transformer head, plus a dedicated IMF agent that overrides the DDPM loss / DDIM inference semantics. The training script only receives minimal wiring changes so the new head/agent can use the existing optimizer, checkpointing, SwanLab, and headless rollout.
**Tech Stack:** PyTorch, Hydra, diffusers schedulers (kept only for compatible initialization), MuJoCo rollout, unittest, SwanLab
---
## File Map
### New files
- `roboimi/vla/models/heads/attnres_transformer_components.py` — local IMF AttnRes building blocks
- `roboimi/vla/models/heads/imf_transformer1d.py` — IMF transformer head exposing `forward(sample, r, t, cond=None)`
- `roboimi/vla/agent_imf.py` — dedicated IMF VLA agent reusing the existing observation/queue/normalization logic while overriding loss / inference
- `roboimi/vla/conf/head/imf_transformer1d.yaml` — IMF head config
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml` — IMF agent + backbone/head composition config
- `tests/test_imf_transformer1d_external_alignment.py` — alignment tests against external `185ed659`
- `tests/test_imf_vla_agent.py` — loss / inference / queue semantics tests for the IMF agent
### Modified files
- `roboimi/demos/vla_scripts/train_vla.py` — optimizer parameter-group wiring so the new agent trains seamlessly
- `roboimi/vla/conf/config.yaml` — keep the defaults unchanged; the IMF agent is only enabled via override
- `tests/test_train_vla_transformer_optimizer.py` — cover the optimizer-group behavior of the IMF head
- (if needed) `roboimi/vla/models/heads/__init__.py` or a nearby export file — expose the new head
---
### Task 1: Write the IMF transformer alignment tests
**Files:**
- Create: `tests/test_imf_transformer1d_external_alignment.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- [ ] **Step 1: Write a failing test that verifies the local IMF head matches external `185ed659` on state-dict keys, forward shapes, forward values, and optim groups**
```python
with torch.no_grad():
    external_out = external_model(sample=sample, r=r, t=t, cond=cond)
    local_out = local_model(sample=sample, r=r, t=t, cond=cond)
assert torch.allclose(local_out, external_out, atol=1e-6, rtol=1e-5)
```
- [ ] **Step 2: Run the unit test and confirm it currently fails**
Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: FAIL, reporting that the `imf_transformer1d` / `attnres` modules do not exist
- [ ] **Step 3: If the test needs the existing external-loader logic, copy the minimal necessary helpers from `tests/test_transformer1d_external_alignment.py` instead of depending on session context**
- [ ] **Step 4: Commit the test skeleton**
```bash
git add tests/test_imf_transformer1d_external_alignment.py
git commit -m "test: add IMF transformer external alignment coverage"
```
### Task 2: Implement the AttnRes components and the IMF transformer head
**Files:**
- Create: `roboimi/vla/models/heads/attnres_transformer_components.py`
- Create: `roboimi/vla/models/heads/imf_transformer1d.py`
- Modify: `tests/test_imf_transformer1d_external_alignment.py`
- [ ] **Step 1: Port the AttnRes building blocks from external `185ed659`, keeping names and parameter semantics identical**
Must include:
- `RMSNorm`
- `RMSNormNoWeight`
- `precompute_rope_freqs`
- `apply_rope`
- `GroupedQuerySelfAttention`
- `SwiGLUFFN`
- `AttnResOperator`
- `AttnResSubLayer`
- `AttnResTransformerBackbone`
- [ ] **Step 2: Implement the local IMF head in `imf_transformer1d.py`**
Must satisfy:
- `forward(sample, r, t, cond=None)`
- supports `backbone_type='attnres_full'` by default
- the token sequence is `[r_token, t_token, cond_tokens..., sample_tokens...]`
- the output slices back only the sample-token segment
- keeps `get_optim_groups()` for AdamW parameter grouping
- [ ] **Step 3: Run the alignment tests and fix any mismatches in state-dict keys / init / no-decay parameter grouping**
Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: PASS
- [ ] **Step 4: Commit the model component implementation**
```bash
git add roboimi/vla/models/heads/attnres_transformer_components.py \
roboimi/vla/models/heads/imf_transformer1d.py \
tests/test_imf_transformer1d_external_alignment.py
git commit -m "feat: add IMF AttnRes transformer head"
```
### Task 3: Write the IMF agent behavior tests
**Files:**
- Create: `tests/test_imf_vla_agent.py`
- Reference: `roboimi/vla/agent.py`
- Reference: `tests/test_resnet_transformer_agent_wiring.py`
- [ ] **Step 1: Write failing tests covering the IMF agent's core contract**
Must cover:
1. `compute_loss()` accepts the current batch structure and returns a scalar loss
2. `predict_action()` outputs `(B, pred_horizon, action_dim)`
3. `select_action()` still follows the queue/chunk semantics
4. `predict_action()` does not run a multi-step DDIM loop; it triggers exactly one IMF sampling step
5. when `action_is_pad` is present, the loss is computed only over valid actions
- [ ] **Step 2: Use a stub backbone / stub head that records call arguments to verify `r, t, cond` are passed through correctly and the observation-conditioning dimensions are right**
```python
self.assertEqual(recorded['cond'].shape, (B, obs_horizon, expected_cond_dim))
self.assertTrue(torch.allclose(recorded['r'], torch.zeros(B)))
self.assertTrue(torch.allclose(recorded['t'], torch.ones(B)))
```
- [ ] **Step 3: Run the tests and confirm they currently fail**
Run: `python -m unittest tests.test_imf_vla_agent -v`
Expected: FAIL, reporting that `roboimi.vla.agent_imf` does not exist
- [ ] **Step 4: Commit the test skeleton**
```bash
git add tests/test_imf_vla_agent.py
git commit -m "test: add IMF VLA agent behavior coverage"
```
### Task 4: Implement the IMF agent and Hydra wiring
**Files:**
- Create: `roboimi/vla/agent_imf.py`
- Create: `roboimi/vla/conf/head/imf_transformer1d.yaml`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_transformer_optimizer.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Implement `IMFVLAAgent` on top of `VLAAgent`**
Implementation strategy:
- Reuse `VLAAgent.__init__`, `_build_cond()`, `reset()`, `_populate_queues()`, `_prepare_observation_batch()`, `select_action()`, `get_normalization_stats()`
- Override:
- `compute_loss()` -> IMF objective
- `predict_action()` -> one-step sample
- Provide internal helpers:
- `_broadcast_batch_time`
- `_apply_conditioning` (if needed)
- `_compute_u_and_du_dt`
- `_compound_velocity`
- `_sample_one_step`
- [ ] **Step 2: Add a CUDA math-SDPA fallback on the JVP path, keeping the external repo's stability strategy**
- [ ] **Step 3: Add Hydra configs so that `agent=resnet_imf_attnres` is instantiable**
Key defaults:
- `_target_: roboimi.vla.agent_imf.IMFVLAAgent`
- `head._target_: roboimi.vla.models.heads.imf_transformer1d.IMFTransformer1D`
- `head.backbone_type: attnres_full`
- `head.causal_attn: false`
- `head.time_as_cond: true`
- `head.n_cond_layers: 0`
- `inference_steps: 1`
- `camera_names: ${data.camera_names}`
- `vision_backbone.camera_names: ${agent.camera_names}`
- [ ] **Step 4: Make the training script reuse parameter grouping for any head that provides `get_optim_groups()` instead of hard-coding the old transformer head_type**
Recommended minimal change:
```python
use_head_groups = callable(getattr(noise_pred_net, 'get_optim_groups', None))
```
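A minimal usage sketch of that gate, assuming `get_optim_groups(weight_decay=...)` keeps the external head's signature and that `noise_pred_net`, `lr`, and `weight_decay` are the names already used in `train_vla.py`:
```python
import torch

# Hedged sketch: prefer the head's own grouping whenever it exposes get_optim_groups(),
# otherwise fall back to a single default parameter group.
use_head_groups = callable(getattr(noise_pred_net, 'get_optim_groups', None))
if use_head_groups:
    # Expected to return AdamW-style param-group dicts (e.g. decay vs. no-decay).
    param_groups = noise_pred_net.get_optim_groups(weight_decay=weight_decay)
else:
    param_groups = [{"params": noise_pred_net.parameters(), "weight_decay": weight_decay}]
optimizer = torch.optim.AdamW(param_groups, lr=lr)
```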
- [ ] **Step 5: Run the tests and fix wiring issues**
Run:
- `python -m unittest tests.test_imf_vla_agent -v`
- `python -m unittest tests.test_train_vla_transformer_optimizer -v`
Expected: PASS
- [ ] **Step 6: Commit the agent / config / train-script wiring**
```bash
git add roboimi/vla/agent_imf.py \
roboimi/vla/conf/head/imf_transformer1d.yaml \
roboimi/vla/conf/agent/resnet_imf_attnres.yaml \
roboimi/demos/vla_scripts/train_vla.py \
tests/test_imf_vla_agent.py \
tests/test_train_vla_transformer_optimizer.py
git commit -m "feat: add IMF VLA agent and training wiring"
```
### Task 5: Integration verification and training launch
**Files:**
- Modify: none required unless verification exposes real issues
- Use run artifacts under: `runs/`
- [ ] **Step 1: Run the focused test set**
Run:
```bash
python -m unittest \
tests.test_imf_transformer1d_external_alignment \
tests.test_imf_vla_agent \
tests.test_resnet_transformer_agent_wiring \
tests.test_train_vla_transformer_optimizer -v
```
Expected: PASS
- [ ] **Step 2: Run a minimal GPU training smoke job (no long run required)**
Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=resnet_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
data.camera_names=[r_vis,top,front] \
train.device=cuda train.max_steps=2 train.batch_size=4 train.num_workers=2 \
train.use_swanlab=false train.rollout_val_freq_epochs=0
```
Expected: completes 2 steps and produces a checkpoint / log, with no shape or JVP errors
- [ ] **Step 3: Launch the real IMF training run with the official hyperparameters**
Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=resnet_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
data.camera_names=[r_vis,top,front] \
train.device=cuda train.val_split=0.0 train.seed=42 \
train.batch_size=80 train.lr=5e-4 train.num_workers=12 train.max_steps=150000 \
train.log_freq=100 train.save_freq=10000 train.use_swanlab=true \
train.swanlab_project=roboimi-vla \
train.rollout_val_freq_epochs=5 train.rollout_validate_on_checkpoint=false \
train.rollout_num_episodes=5 train.warmup_steps=2000 \
train.scheduler_type=cosine train.min_lr=1e-6 train.weight_decay=1e-5 train.grad_clip=1.0 \
agent.pred_horizon=16 agent.inference_steps=1 \
agent.head.n_emb=384 agent.head.n_layer=18 agent.head.n_head=1 agent.head.n_kv_head=1 \
agent.vision_backbone.pretrained_backbone_weights=null \
agent.vision_backbone.freeze_backbone=false \
agent.vision_backbone.use_separate_rgb_encoder_per_camera=true
```
Expected: training launches successfully, SwanLab records the full config, and a headless rollout runs every 5 epochs
- [ ] **Step 4: Record the run path, training PID, and SwanLab run name, and report them to the user**
- [ ] **Step 5: Commit any final follow-up changes (if the smoke test required extra patches)**
```bash
git add <changed files>
git commit -m "chore: verify IMF AttnRes training launch"
```


@@ -0,0 +1,79 @@
# IMF Rollout Trajectory Images and Short-Horizon Training Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add training-time rollout front trajectory image export plus SwanLab image logging, then start a new local IMF training run with `emb=384`, `layer=12`, `pred_horizon=8`, `num_action_steps=4`, `max_steps=50000`.
**Architecture:** Extend `eval_vla.py` so a rollout can emit one per-episode static front-view image with red EE trajectory overlay. Extend `train_vla.py` so rollout validation forces image export, forces video off, and uploads those per-episode images to SwanLab. Launch the requested new run through explicit command-line overrides rather than branch-default config changes.
**Tech Stack:** Python, PyTorch, Hydra/OmegaConf, MuJoCo, OpenCV, SwanLab.
---
### Task 1: Add and validate rollout image tests
**Files:**
- Modify: `tests/test_eval_vla_rollout_artifacts.py`
- Modify: `tests/test_train_vla_swanlab_logging.py`
- Modify: `tests/test_train_vla_rollout_validation.py`
- [ ] Add/adjust eval tests so they assert per-episode trajectory image paths are produced without requiring video export.
- [ ] Add/adjust training tests so they assert training-time rollout validation forces `record_video=false`.
- [ ] Add/adjust training tests so they assert trajectory image paths flow from eval summary into SwanLab media logging.
- [ ] Add/adjust training tests so they assert image media is logged, not only scalar reward metrics.
### Task 2: Implement per-episode front trajectory image export in eval
**Files:**
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
- Reuse/Read: `roboimi/utils/raw_action_trajectory_viewer.py`
- Modify: `roboimi/vla/conf/eval/eval.yaml`
- [ ] Add config plumbing for `save_trajectory_image` and `trajectory_image_camera_name`.
- [ ] Ensure the default training-time camera resolution path is pinned to `front`.
- [ ] Implement distinct per-episode image naming so 5 rollout episodes create 5 distinct PNGs.
- [ ] Reuse the existing red trajectory representation logic when composing the PNG.
- [ ] Ensure headless eval works under EGL even on machines with `DISPLAY` set.
### Task 3: Implement SwanLab rollout image logging in training
**Files:**
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_swanlab_logging.py`
- Modify: `tests/test_train_vla_rollout_validation.py`
- [ ] Make `run_rollout_validation()` force `record_video=false`.
- [ ] Make `run_rollout_validation()` force `save_trajectory_image=true` and `trajectory_image_camera_name=front`.
- [ ] Ensure rollout validation still uses 5 episodes per validation event for the requested run.
- [ ] Add a best-effort helper that converts per-episode image paths into SwanLab image media payloads.
- [ ] Keep image-upload failures non-fatal and warning-only.
### Task 4: Verify action-chunk semantics for the new run
**Files:**
- Verify: `roboimi/vla/agent.py`
- Verify: `roboimi/vla/agent_imf.py`
- Test: `tests/test_imf_vla_agent.py`
- [ ] Confirm the existing queue logic still means “predict 8, execute first 4”.
- [ ] Do not change branch defaults unless strictly necessary; prefer launch-time overrides.
### Task 5: Verify and launch the requested local training run
**Files:**
- Use: `roboimi/demos/vla_scripts/train_vla.py`
- Use: `roboimi/demos/vla_scripts/eval_vla.py`
- [ ] Run the targeted verification suite.
- [ ] Run one real headless smoke eval and confirm a front trajectory PNG is produced while `video_mp4` stays null.
- [ ] Launch the new local training run with explicit overrides including:
- `agent=resnet_imf_attnres`
- `agent.head.n_emb=384`
- `agent.head.n_layer=12`
- `agent.pred_horizon=8`
- `agent.num_action_steps=4`
- `train.max_steps=50000`
- `train.rollout_num_episodes=5`
- `train.use_swanlab=true`
- current local baseline dataset/camera/CUDA/batch/lr/num_workers/backbone settings
- [ ] Verify PID, GPU allocation, log tail, and SwanLab run URL.


@@ -0,0 +1,68 @@
# IMF Horizon Grid and AttnRes Ablation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Run a 6-run Phase-1 IMF horizon/action-step experiment grid across available GPUs, monitor progress and collect best rollout metrics, then use the best horizon setting for a Phase-2 visual-attnres ablation.
**Architecture:** Use the current IMF training code as-is for Phase-1 by sweeping explicit `(pred_horizon, num_action_steps)` overrides while keeping emb=384, layer=12, and max_steps=50k fixed. Maintain a local experiment suite directory with a manifest and machine-readable status snapshots so progress can be resumed and summarized across turns. After Phase-1 completes, compare the current head-only attnres setup against a variant that also adds attnres into the visual ResNet path.
**Tech Stack:** Python, Hydra/OmegaConf, PyTorch, SSH/Tailscale, JSON/CSV status files, SwanLab.
---
### Task 1: Prepare the experiment suite manifest and state tracking
**Files:**
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/manifest.json`
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/status.json`
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/notes.md`
- [ ] Define the 6 legal Phase-1 combinations: `(8,8)`, `(16,8)`, `(16,16)`, `(32,8)`, `(32,16)`, `(32,32)`.
- [ ] Record for each run: name, host, GPU slot, command, log path, SwanLab run name, and completion criteria.
- [ ] Define the comparison metric as the maximum rollout average reward seen during training (`max avg_reward`), preferably read from the best-checkpoint metadata and cross-checked against logs.
- [ ] Keep `status.json` updated with per-run state: queued / running / finished / failed plus latest parsed progress.
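A minimal sketch of how the per-run state could be refreshed in `status.json`; the schema, run name, and helper name below are illustrative assumptions rather than a fixed contract:
```python
import json
from pathlib import Path
from typing import Optional

STATUS_PATH = Path("experiment_suites/2026-04-04-imf-horizon-grid/status.json")

def update_run_status(run_name: str, state: str,
                      latest_step: Optional[int] = None,
                      best_avg_reward: Optional[float] = None) -> None:
    """Merge one run's latest parsed progress into the suite status snapshot."""
    status = json.loads(STATUS_PATH.read_text()) if STATUS_PATH.exists() else {}
    entry = status.setdefault(run_name, {"state": "queued"})
    entry["state"] = state  # queued / running / finished / failed
    if latest_step is not None:
        entry["latest_step"] = latest_step
    if best_avg_reward is not None:
        entry["best_avg_reward"] = max(best_avg_reward, entry.get("best_avg_reward", float("-inf")))
    STATUS_PATH.write_text(json.dumps(status, indent=2))

# Example: mark a hypothetical (pred_horizon=16, num_action_steps=8) run as running.
update_run_status("imf_h16_a8", "running", latest_step=12000, best_avg_reward=0.42)
```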
### Task 2: Prepare the remote 8-GPU execution target
**Files:**
- Remote working directory under `/home/droid/`
- Reuse or create a synced code directory for this suite
- [ ] Verify the remote dataset path and environment path.
- [ ] Verify GPU availability and reserve 6 GPUs for Phase-1 launches.
- [ ] Sync the required code to a dedicated remote suite directory.
- [ ] Record exact remote paths back into the local suite manifest.
### Task 3: Launch the 6 Phase-1 experiments in parallel
**Files:**
- Reuse: `roboimi/demos/vla_scripts/train_vla.py`
- Modify only local suite tracking files unless a launch bug is discovered
- [ ] Launch 6 runs concurrently with fixed settings: IMF, emb=384, layer=12, max_steps=50k.
- [ ] Keep all other relevant training hyperparameters aligned to the current strong baseline unless a concrete blocker appears.
- [ ] Assign one GPU per run on the 8xL20 host.
- [ ] Capture PID, log path, and SwanLab URL for each run in `status.json`.
### Task 4: Monitor and summarize Phase-1 until all 6 finish
**Files:**
- Update: `experiment_suites/2026-04-04-imf-horizon-grid/status.json`
- Update: `experiment_suites/2026-04-04-imf-horizon-grid/notes.md`
- [ ] Periodically parse each run's log/checkpoints to extract latest step, latest rollout reward, and best rollout reward so far.
- [ ] Keep a resumable local summary so progress can be continued in later turns without rediscovery.
- [ ] After all 6 runs finish, rank them by `max avg_reward` and write a compact Phase-1 summary.
### Task 5: Prepare the Phase-2 visual-attnres ablation
**Files:**
- Likely modify: vision backbone implementation and config files (to be confirmed after code inspection)
- Add/update targeted tests for the visual backbone path if code changes are needed
- [ ] Use the best Phase-1 `(pred_horizon, num_action_steps)` combination as the fixed rollout setting for Phase-2.
- [ ] Compare:
1. current setup: attnres only in the IMF head
2. ablation setup: attnres in both IMF head and visual encoder path
- [ ] Keep the rest of the training settings fixed.
- [ ] Launch and monitor the Phase-2 pair after Phase-1 summary is complete.


@@ -0,0 +1,92 @@
# LEWM ViT Backbone Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k runs on the 5880 machine.
**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.
**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.
---
### Task 1: Add failing tests for LEWM joint-vision backbone contract
**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
### Task 2: Implement LEWM joint-multiview frozen backbone
**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed
- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
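A minimal sketch of the Step 4 freeze-and-eval behavior, assuming the backbone keeps the checkpoint's encoder and projector as `self.encoder` / `self.projector` (attribute names illustrative):
```python
import torch.nn as nn

class LEWMViTBackbone(nn.Module):
    """Sketch: wraps a frozen LEWM encoder + projector; construction/loading omitted."""

    def __init__(self, encoder: nn.Module, projector: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.projector = projector
        for p in self.parameters():
            p.requires_grad_(False)  # freeze all LEWM weights

    def train(self, mode: bool = True):
        # Allow normal train()/eval() switching on the wrapper, but pin the frozen
        # parts to eval so BatchNorm statistics and dropout never drift in training.
        super().train(mode)
        self.encoder.eval()
        self.projector.eval()
        return self
```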
### Task 3: Add minimal agent support for joint visual dim
**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
### Task 4: Add Hydra configs for LEWM backbone training
**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**
### Task 5: Verify focused local tests
**Files:**
- Reuse the above
- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**
### Task 6: Sync to 5880 and remote smoke with real checkpoint
**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
### Task 7: Launch two real 50k runs on the 5880 machine
**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`
- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
### Task 8: Update experiment tracking docs and commit
**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`
- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**


@@ -0,0 +1,64 @@
# Phase-2 Full-AttnRes Vision Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace all ResNet residual units in the vision backbone with AttnRes-based image blocks while preserving the current IMF agent interfaces and launch a Phase-2 experiment anchored on the best Phase-1 horizon setting.
**Architecture:** Keep the current multi-camera encoder shell and per-camera output contract, but introduce a new ResNet-like 2D AttnRes backbone that preserves stage-wise downsampling and final SpatialSoftmax conditioning. Wire it into the existing `ResNetDiffusionBackbone` via an opt-in mode and keep the agent/head/data interfaces unchanged.
**Tech Stack:** PyTorch, Hydra/OmegaConf, existing IMF AttnRes transformer components, pytest.
---
### Task 1: Add failing tests for the new full-AttnRes visual backbone
**Files:**
- Create: `tests/test_attnres_resnet2d_backbone.py`
- Update: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write a failing backbone shape test**
- [ ] **Step 2: Run it to confirm the new backbone/config does not exist yet**
- [ ] **Step 3: Add a failing IMF agent wiring test for unchanged cond_dim=208**
- [ ] **Step 4: Run the targeted tests and capture the failure**
### Task 2: Implement a ResNet-like 2D AttnRes backbone
**Files:**
- Create: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- [ ] **Step 1: Add minimal 2D tokenization helpers and positional encoding / bias handling**
- [ ] **Step 2: Implement `AttnResImageBlock2D` for feature maps**
- [ ] **Step 3: Implement `AttnResResNetLikeBackbone2D` with stage-wise downsampling**
- [ ] **Step 4: Wire `_SingleRgbEncoder` to choose between original ResNet trunk and the new full-AttnRes trunk**
- [ ] **Step 5: Run the new backbone tests**
### Task 3: Expose config switches and agent wiring
**Files:**
- Modify: `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- Modify: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- [ ] **Step 1: Add a backbone mode/config flag for the full-AttnRes vision trunk**
- [ ] **Step 2: Add defaults for attnres image depth/heads/etc. if needed**
- [ ] **Step 3: Add a Phase-2 launch override path that enables the new visual trunk**
- [ ] **Step 4: Run agent wiring tests again**
### Task 4: Smoke-verify training path
**Files:**
- Reuse existing training scripts and configs
- [ ] **Step 1: Run a short CPU or tiny-step smoke instantiation / `compute_loss` test**
- [ ] **Step 2: If needed, run a very short training smoke launch**
- [ ] **Step 3: Verify no cond-dim or rollout-loading regressions**
### Task 5: Launch the Phase-2 experiment
**Files:**
- Update experiment tracking under `experiment_suites/`
- [ ] **Step 1: Use Phase-1 best setting (`pred_horizon=16`, `num_action_steps=8`)**
- [ ] **Step 2: Launch baseline reference or reuse existing result**
- [ ] **Step 3: Launch full-AttnRes vision experiment**
- [ ] **Step 4: Track rollout metrics and compare max avg_reward**


@@ -0,0 +1,81 @@
# ResNet Multitoken IMF Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step and launch four L20 experiments for `n_emb in {256,384}` and `n_layer in {12,16}`.
**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.
**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.
---
### Task 1: Add failing tests for multi-token conditioning
**Files:**
- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`
- [ ] **Step 1: Add a direct agent test**
- Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
- Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add Hydra wiring test**
- Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
- Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
- `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
### Task 2: Implement multi-token ResNet conditioning path
**Files:**
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`
- [ ] **Step 1: Extend ResNet backbone**
- Add an opt-in flag to return `(B,T,num_cams,D)` camera tokens instead of one concatenated `(B,T,num_cams*D)` token.
- Keep standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
- Support visual features with rank 4 `(B,T,K,D)`.
- Broadcast state to `(B,T,K,D_state)`, concatenate per camera, apply projector per token, then flatten to `(B,T*K,D_cond)`.
- Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
- Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add Hydra config**
- New agent config uses:
- separate ResNet-18 per camera
- standard residual vision trunk (`vision_backbone_mode=resnet`)
- condition projector output dim tied to `${agent.head.n_emb}`
- rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
### Task 3: Verify locally
**Files:**
- Modify only if verification reveals issues
- [ ] **Step 1: Run focused tests and make them pass**
- `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run regression subset**
- `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run local smoke instantiation**
- instantiate the new Hydra config and verify cond shape / sequence length
### Task 4: Launch 4 L20 experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke the new config on remote**
- [ ] **Step 3: Launch runs**
- `(n_emb=256, n_layer=12)`
- `(n_emb=256, n_layer=16)`
- `(n_emb=384, n_layer=12)`
- `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
- rollout episodes `10`
- `pred_horizon=16`
- `num_action_steps=8`
- standard ResNet-18 vision trunk
- three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**


@@ -0,0 +1,78 @@
# SigLIP2 Multiview VLA Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.
**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.
**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.
---
### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning
**Files:**
- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write failing backbone tests**
- Instantiate the new backbone with a stub SigLIP2 vision model.
- Assert raw dataset resize is `None`, eval resize is `(256, 256)`, output shape is `(B, T, 3 * per_view_output_dim)`.
- Assert three views are encoded independently and projected.
- [ ] **Step 2: Run focused tests and verify RED**
- Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
- Expect failure because the backbone/config/projector do not exist yet.
- [ ] **Step 3: Extend agent wiring tests**
- Add a Hydra/instantiate test for a new SigLIP2 IMF config.
- Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim `384`, and head `cond_dim == 384`.
### Task 2: Implement SigLIP2 backbone and optional condition projector
**Files:**
- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`
- [ ] **Step 1: Implement backbone**
- Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
- Normalize `[0,1]` pixels with mean/std `0.5` and encode each view independently.
- Project each 768-d pooled feature to configurable per-view dim and concatenate across cameras.
- [ ] **Step 2: Implement optional condition projector**
- Allow `VLAAgent` to accept `cond_projector`.
- Track `raw_per_step_cond_dim` and projected `per_step_cond_dim` / `global_cond_dim`.
- Apply the projector in `_build_cond()` after visual+state concatenation.
- [ ] **Step 3: Add Hydra configs**
- New agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
- Backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
### Task 3: Verify locally and prepare remote execution
**Files:**
- Modify as needed only if tests/smoke reveal issues
- [ ] **Step 1: Run focused tests and make them pass**
- `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 2: Run a local smoke instantiation**
- Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**
### Task 4: Sync to 5880 and launch experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Stop superseded remote jobs**
- [ ] **Step 2: Sync updated code to remote**
- Prefer `rsync` or `git push/pull` without overwriting unrelated files.
- [ ] **Step 3: Remote smoke test**
- Confirm SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
- Confirm headless rollout path still uses `256x256` eval resize.
- [ ] **Step 4: Launch experiment A**
- `per_view_output_dim=96`, `embed=384`, `layer=12`, `pred=16`, `exec=8`, `steps=50000`.
- [ ] **Step 5: Launch experiment B**
- `per_view_output_dim=192`, same other hyperparameters.
- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.**


@@ -0,0 +1,272 @@
# IMF-AttnRes Policy Migration Design
**Date:** 2026-04-01
**Status:** Approved in chat, written spec pending review
## Goal
Migrate the IMF-AttnRes diffusion policy from commit `185ed659` of `/home/droid/project/diffusion_policy` into the current `roboimi` repository as an alternative training option to the existing DiT / Transformer diffusion policy. Also migrate its training objective and one-step inference mechanism, while keeping RoboIMI's existing simulation environment, three-camera visual input, dataset format, training script, and rollout-validation workflow usable.
## Non-Goals
- Do not migrate obs encoders, datasets, env wrappers, or PushT-specific logic from the external repo that are irrelevant to the current task.
- Do not replicate the external repo's full directory structure; migrate only the model, loss, and inference semantics that RoboIMI training currently needs.
- Do not keep the old DiT as the default training target in this work; the old configs remain usable, and the new model gets its own config entry point.
## User-Confirmed Requirements
1. The migration target is the **IMF-AttnRes model code** from `185ed659`.
2. Migrate not only the backbone skeleton but also:
- **the training objective**
- **the one-step inference mechanism**
3. Visual input stays consistent with the current RoboIMI diffusion policy:
- use the three camera images as the conditioning input
- the image observations must be conditions, not concatenated into the output prediction target
4. In this task the IMF policy replaces the existing DiT/Transformer diffusion policy for training.
5. Training hyperparameters largely follow the most recent run (explicitly overridden later by the training command), but inference switches to the IMF one-step mechanism.
6. The user accepts the IMF constraint of full / non-causal attention.
## External Source of Truth
The migration semantics follow these files from the external repo:
- `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- `diffusion_policy/policy/imf_transformer_hybrid_image_policy.py`
- Reference config: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`
The most important difference: this policy is not multi-step DDPM/DDIM denoising; it uses the IMF training objective plus one-step inference.
## Current RoboIMI Baseline
The RoboIMI baseline directly relevant to this task:
- Visual encoding: `ResNetDiffusionBackbone`
- three cameras: `r_vis`, `top`, `front`
- at each timestep the camera features are concatenated with `qpos` into the per-step condition
- Policy body: `VLAAgent`
- `compute_loss()` uses the DDPM noise-prediction loss
- `predict_action()` uses multi-step DDIM sampling
- online control triggers predictions chunk-by-chunk via the action queue in `select_action()`
- Training script: `roboimi/demos/vla_scripts/train_vla.py`
- supports GPU training, SwanLab logging, and headless rollout validation
So the core of this migration is not swapping the visual backbone but replacing the **head + loss + inference semantics**.
## Recommended Integration Approach
Use a **minimally invasive integration**:
1. **Keep RoboIMI's current visual encoding, data loading, rollout/eval, and training-script skeleton.**
2. **Add a dedicated IMF head module**, implemented locally inside RoboIMI:
- the AttnRes components
- the IMF transformer body
3. **Add a dedicated IMF agent** that reuses the current `VLAAgent`'s:
- normalization logic
- camera-order management
- observation cache / action-chunk cache
- rollout interface
but overrides:
- `compute_loss()`
- `predict_action()`
4. **Add standalone Hydra configs** so the IMF policy becomes a new agent option without breaking the existing resnet_transformer / gr00t_dit configs.
Rationale:
- migrating the IMF semantics does not disturb the current DDPM agent;
- the rollout / eval / checkpoint logic can still be reused;
- it makes direct A/B training comparisons against the existing Transformer / DiT straightforward.
## Architecture
### 1. Observation / Conditioning Path
Keep RoboIMI's current visual path:
- input observations: `images={r_vis, top, front}` + `qpos`
- `ResNetDiffusionBackbone` encodes each camera into a per-camera feature
- `state_encoder` encodes `qpos`
- the three camera features and the state feature are concatenated per timestep into `per_step_cond`
We do not migrate the external repo's obs_encoder implementation; we only align on the semantics of **feeding images into the transformer as condition tokens**.
### 2. Condition Tokenization
Align with how the external IMF transformer uses tokens:
- action trajectory tokens: `(B, pred_horizon, action_dim)` mapped to `n_emb` through a linear layer
- time tokens: the two scalars `r` and `t`, each turned into a token via sinusoidal embedding + linear projection
- observation tokens: `per_step_cond` mapped to `n_emb` through a linear layer
- the final token sequence is:
- `[r_token, t_token, obs_cond_tokens..., action_tokens...]`
In this task the number of obs tokens equals `obs_horizon`, and the image observations are always conditioning input.
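A minimal sketch of the token assembly above; the embedder names (`input_emb`, `cond_obs_emb`, `time_emb`) are placeholders, and the real module layout must follow the external `185ed659` head:
```python
import torch

def assemble_imf_tokens(sample, r, t, cond, input_emb, cond_obs_emb, time_emb):
    """Build the [r_token, t_token, obs_cond_tokens..., action_tokens...] sequence.

    sample: (B, pred_horizon, action_dim) action trajectory
    r, t:   (B,) scalars in [0, 1]
    cond:   (B, obs_horizon, per_step_cond_dim) per-step observation condition
    """
    action_tokens = input_emb(sample)        # (B, pred_horizon, n_emb)
    obs_tokens = cond_obs_emb(cond)          # (B, obs_horizon, n_emb)
    r_token = time_emb(r).unsqueeze(1)       # (B, 1, n_emb), sinusoidal + linear
    t_token = time_emb(t).unsqueeze(1)       # (B, 1, n_emb)
    tokens = torch.cat([r_token, t_token, obs_tokens, action_tokens], dim=1)
    # After the AttnRes backbone only the trailing action-token segment is kept:
    # out = backbone(tokens)[:, -sample.shape[1]:, :]
    return tokens
```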
### 3. IMF-AttnRes Backbone
Add an AttnRes backbone implementation inside RoboIMI, keeping the key semantics of the external commit:
- `RMSNorm` / `RMSNormNoWeight`
- RoPE
- Grouped Query Self-Attention
- SwiGLU FFN
- AttnRes operator / residual source aggregation
- `AttnResTransformerBackbone`
And keep:
- **full attention** (no causal attention)
- `backbone_type='attnres_full'`
- the output slices back only the action-token segment, then goes through a final norm + head to produce the velocity-like output
### 4. Training Objective
The training objective changes from the current DDPM epsilon prediction to the external IMF objective:
Given a ground-truth trajectory `x` and random noise `e`:
1. sample `t ~ U(0,1)` and `r ~ U(0,1)`, sorted so that `t >= r`
2. build the interpolated state:
- `z_t = (1 - t) x + t e`
3. compute with the model:
- `v = f(z_t, t, t, cond)`
4. take the JVP of `g(z, r, t) = f(z, r, t, cond)` to obtain:
- `u, du_dt`
5. build the compound velocity:
- `V = u + (t - r) * du_dt`
6. the target is:
- `target = e - x`
7. use MSE over the action dimensions as the final loss
Support for `action_is_pad` in the existing RoboIMI batches must be kept; if padding is present, compute the loss only over valid actions.
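A minimal sketch of this objective using `torch.func.jvp`. The tangent choice here (unit tangent on `t`, zeros on `z` and `r`) and the place where the instantaneous velocity `v = f(z_t, t, t, cond)` would enter (e.g. as the tangent on `z`) are illustrative assumptions; the actual tangent construction must be copied from the external `185ed659` source, as noted under Risk 1 below.
```python
import torch
from torch.func import jvp

def imf_loss(model, x, cond, action_is_pad=None):
    """Sketch of the IMF objective: V = u + (t - r) * du_dt vs. target = e - x."""
    B = x.shape[0]
    e = torch.randn_like(x)
    t = torch.rand(B, device=x.device)
    r = torch.rand(B, device=x.device)
    t, r = torch.maximum(t, r), torch.minimum(t, r)       # enforce t >= r
    t_b, r_b = t.view(B, 1, 1), r.view(B, 1, 1)
    z_t = (1.0 - t_b) * x + t_b * e                       # interpolated state

    def g(z, r_in, t_in):
        return model(z, r_in, t_in, cond=cond)

    tangents = (torch.zeros_like(z_t), torch.zeros_like(r), torch.ones_like(t))
    u, du_dt = jvp(g, (z_t, r, t), tangents)              # u and its t-derivative

    V = u + (t_b - r_b) * du_dt                           # compound velocity
    per_elem = (V - (e - x)) ** 2                         # MSE against target = e - x
    if action_is_pad is not None:
        per_elem = per_elem * (~action_is_pad).unsqueeze(-1)  # mask padded actions
    return per_elem.mean()
```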
### 5. One-Step Inference
Inference switches to the external IMF one-step sampling semantics:
1. initialize the action trajectory `z_t` from a standard Gaussian
2. compute `u = f(z_t, r=0, t=1, cond)`
3. one-step update:
- `x_hat = z_t - (t-r) * u = z_t - u`
4. unnormalize to obtain the action sequence
This means:
- `num_inference_steps` is fixed at `1` for the IMF policy
- the DDIM scheduler's multi-step `step()` is no longer called
- online control keeps the current chunk mechanism:
- when the action queue is empty, trigger one `predict_action_chunk()`
- enqueue the `[obs_horizon-1 : obs_horizon-1+num_action_steps]` slice of the predicted sequence
In other words, **the rule for triggering a model forward does not change; what changes is how the action sequence is generated on each trigger**.
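A minimal sketch of the one-step sampler above, assuming an `unnormalize_action` helper on the agent side (name illustrative):
```python
import torch

@torch.no_grad()
def imf_sample_one_step(model, cond, pred_horizon, action_dim, unnormalize_action):
    """One-step IMF sampling: x_hat = z - (t - r) * u with r=0, t=1."""
    B = cond.shape[0]
    z = torch.randn(B, pred_horizon, action_dim, device=cond.device)
    r = torch.zeros(B, device=cond.device)
    t = torch.ones(B, device=cond.device)
    u = model(z, r, t, cond=cond)     # predicted average velocity over [r, t]
    x_hat = z - u                     # (t - r) == 1, so a single update finishes sampling
    return unnormalize_action(x_hat)  # back to the raw action space
```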
## API / Code Structure
The main code boundaries in this plan:
- `roboimi/vla/models/heads/attnres_transformer_components.py`
- the IMF AttnRes building blocks
- `roboimi/vla/models/heads/imf_transformer1d.py`
- the RoboIMI version of the IMF transformer head
- exposes `forward(sample, r, t, cond=None)`
- exposes `get_optim_groups()` for AdamW parameter grouping
- `roboimi/vla/agent_imf.py`
- reuses `VLAAgent`'s observation handling / normalization / queue infrastructure
- overrides the IMF training loss and one-step prediction logic
- Hydra configs:
- `roboimi/vla/conf/head/imf_transformer1d.yaml`
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
The main training-script flow stays as unchanged as possible; it only needs to instantiate the new agent and keep using the current rollout / checkpoint / SwanLab logic.
## Compatibility Decisions
### Initial Config Defaults To Preserve
To avoid semantic drift during the migration, the first IMF config pins these defaults explicitly:
- `backbone_type: attnres_full`
- `n_head: 1`
- `n_kv_head: 1`
- `n_cond_layers: 0`
- `time_as_cond: true`
- `causal_attn: false`
- `num_inference_steps: 1`
These defaults match how external `185ed659` uses IMF-AttnRes; later tuning may override them, but the first migration must run with exactly these semantics.
### Reuse From RoboIMI
Keep:
- the three-camera data loading
- the ResNet visual backbone
- qpos / action normalization
- the training loop, optimizer, scheduler, SwanLab, and headless rollout
- the online chunked execution in `select_action()`
### Replace With External IMF Semantics
Replace:
- the transformer head implementation
- the diffusion training objective
- the inference sampling semantics
### Intentionally Not Mirrored 1:1
Parts intentionally not kept identical to the external repo:
- the external repo's policy base-class hierarchy
- the external repo's obs-encoder module tree
- the external repo's normalizer / mask-generator framework
RoboIMI already has stable data interfaces and a stable rollout flow, so grafting the IMF pieces into them directly is the safer route.
## Testing / Verification Strategy
After the migration, verify at least the following:
1. **Unit / smoke checks**
- the IMF head forward produces the correct shapes
- the IMF agent `compute_loss()` runs forward and backward on a real batch
- the IMF agent `predict_action()` outputs `(B, pred_horizon, action_dim)`
2. **Training-path checks**
- run a short GPU training job and confirm:
- the dataloader works
- the optimizer / lr scheduler work
- SwanLab logs the config and training metrics correctly
3. **Rollout checks**
- periodic headless rollouts during training run through
- the environment still receives actions via the EE-style `step()`
4. **Final deliverable**
- launch the real training with the user-specified hyperparameters of the same kind
## Risks and Mitigations
### Risk 1: JVP is unstable on CUDA attention kernels
Mitigation: follow the external repo's strategy of switching to the math SDP kernel on the JVP path, falling back to `torch.autograd.functional.jvp` if necessary. The JVP tangent construction and the `u, du_dt` computation must stay strictly aligned with the external source; their mathematical semantics are not to be rewritten in this migration.
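A minimal sketch of the mitigation, using the pre-2.3 `torch.backends.cuda.sdp_kernel` context manager to force the math SDPA backend around the JVP; whether the `torch.autograd.functional.jvp` fallback is actually needed should follow the external repo:
```python
import torch
from torch.func import jvp

def jvp_with_math_sdpa(g, primals, tangents):
    """Run the JVP under the math SDPA kernel, the backend that supports
    forward-mode autograd most reliably on CUDA attention."""
    try:
        with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                            enable_math=True,
                                            enable_mem_efficient=False):
            return jvp(g, primals, tangents)
    except RuntimeError:
        # Last-resort fallback if functorch-style JVP is unsupported in this path.
        return torch.autograd.functional.jvp(g, primals, tangents)
```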
### Risk 2: Optimizer parameter grouping misses the new modules
Mitigation: the IMF head provides `get_optim_groups()`, and the training script uses it whenever a head exposes that interface, instead of binding to the old `head_type`.
### Risk 3: The existing rollout logic assumes multi-step DDIM sampling
Mitigation: keep the `select_action()` / `predict_action_chunk()` interfaces unchanged and replace only the internals of `predict_action()`, so the eval code never needs to understand IMF details.
### Risk 4: The training-command arguments drift from the new config
Mitigation: add a standalone agent config and keep the previous training arguments as an explicit CLI-override template.
## Success Criteria
The migration is considered successful when all of the following hold:
1. RoboIMI has a new IMF-AttnRes policy that can be enabled on its own via a Hydra config.
2. Training uses the external IMF loss instead of the current DDPM epsilon loss.
3. Inference uses one-step IMF sampling instead of multi-step DDIM sampling.
4. The three camera images always participate in the model forward as conditioning input.
5. Online rollout runs through in the headless simulation environment.
6. Training can be launched from the most recent experiment-parameter template.


@@ -0,0 +1,75 @@
# IMF Rollout Trajectory Images + Short-Horizon Training Design
## Background
The current RoboIMI IMF training flow can perform rollout validation and log scalar reward metrics to SwanLab, but it does not yet emit the qualitative rollout artifacts now required for analysis. The user wants training-time rollout validation to save front-view trajectory images with the model-generated trajectory drawn in red, upload those images to SwanLab, and then start a new local short-horizon IMF training run.
## Goals
1. During training-time rollout validation, save one **front-camera** trajectory image per rollout episode.
2. The image must show the rollout EE trajectory in red.
3. Reuse the existing repository trajectory visualization logic as much as practical, especially the existing red capsule-marker trajectory representation.
4. Save 5 rollout images locally for each validation event and upload the same 5 images to SwanLab.
5. Do **not** record rollout videos for this training-time validation flow.
6. Start a new local IMF-AttnRes training run with:
- `agent.head.n_emb=384`
- `agent.head.n_layer=12`
- `agent.pred_horizon=8`
- `agent.num_action_steps=4`
- `train.max_steps=50000`
- `train.rollout_num_episodes=5`
- `train.use_swanlab=true`
## Non-Goals
- No IMF architecture or loss-function change.
- No dataset schema change.
- No rollout video generation for the new training flow.
- No interactive viewer requirement.
## Existing Relevant Code
- `roboimi/demos/vla_scripts/eval_vla.py`
- already supports rollout summaries, optional trajectory export, and optional video export.
- `roboimi/utils/raw_action_trajectory_viewer.py`
- already contains the red trajectory capsule-marker construction logic.
- `roboimi/demos/vla_scripts/train_vla.py`
- already performs periodic rollout validation and scalar SwanLab logging.
- `roboimi/vla/agent.py`
- already implements “predict pred_horizon, execute first num_action_steps” queue semantics.
## Design Decisions
### 1. Artifact contract
Each rollout episode will emit one distinct PNG file under the eval artifact directory. The file naming/path contract must be per-episode, not shared, so a 5-episode validation event yields 5 stable image paths without overwriting.
### 2. Trajectory definition
The red trajectory corresponds to the **actually executed model action sequence** over the rollout loop: the raw EE actions returned and consumed step-by-step by the policy loop. For the requested short-horizon run, this means the visualization reflects repeated execution of the first 4 actions from each predicted 8-action chunk, not every discarded future prediction from replanning.
### 3. Camera choice
The training-time image export path is explicitly pinned to the repo's concrete `front` camera key. It must not silently use `camera_names[0]` if that is not `front`.
### 4. Rendering path
`eval_vla.py` will add a lightweight headless image-export path that:
- renders the `front` camera frame,
- overlays the trajectory using the existing red trajectory representation,
- saves a static PNG per episode.
The implementation may reuse the existing marker-construction logic directly and add a minimal helper for final image composition/export.
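A minimal sketch of the composition/export helper, assuming the EE trajectory has already been projected to front-camera pixel coordinates by the existing viewer logic (the `pixel_points` input and function name are illustrative):
```python
import cv2
import numpy as np

def save_trajectory_image(front_frame_rgb: np.ndarray,
                          pixel_points: np.ndarray,
                          out_path: str) -> str:
    """Overlay the executed EE trajectory in red on a front-camera frame and save a PNG.

    front_frame_rgb: (H, W, 3) uint8 RGB render of the front camera
    pixel_points:    (N, 2) pixel coordinates of the executed trajectory
    """
    img = cv2.cvtColor(front_frame_rgb, cv2.COLOR_RGB2BGR)
    pts = pixel_points.reshape(-1, 1, 2).astype(np.int32)
    cv2.polylines(img, [pts], isClosed=False, color=(0, 0, 255), thickness=2)  # red in BGR
    for p in pts[:, 0]:
        cv2.circle(img, (int(p[0]), int(p[1])), radius=3, color=(0, 0, 255), thickness=-1)
    cv2.imwrite(out_path, img)
    return out_path
```
Per-episode naming such as `episode_03_front_trajectory.png` would keep the five validation images distinct.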
### 5. Training-time behavior
`train_vla.py` rollout validation must explicitly:
- request/save trajectory images,
- keep `record_video=false`,
- return the 5 per-episode image paths in the rollout summary payload,
- upload those 5 images to SwanLab,
- keep image-upload failures non-fatal.
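A best-effort helper sketch for the upload; `swanlab.Image` is assumed to mirror the wandb-style media API (exactly the mismatch risk called out below), so any failure degrades to a warning:
```python
import logging

def log_rollout_images_to_swanlab(image_paths, step):
    """Upload per-episode trajectory PNGs to SwanLab without ever failing training."""
    try:
        import swanlab
        media = {f"rollout/trajectory_ep{i}": swanlab.Image(path)
                 for i, path in enumerate(image_paths)}
        swanlab.log(media, step=step)
    except Exception as exc:  # media-API or I/O problems must stay non-fatal
        logging.warning("SwanLab rollout-image upload skipped: %s", exc)
```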
## Expected User-Visible Outcome
For each scheduled validation event in the new training run:
- 5 rollout episodes execute,
- 5 front-view PNG trajectory images are saved locally,
- the same 5 images are uploaded to SwanLab,
- scalar reward metrics continue to be logged,
- no rollout videos are generated.
## Risks and Mitigations
- **Headless rendering conflicts from desktop env vars**: force headless eval onto EGL when `headless=true`.
- **Image overwrite risk**: use explicit per-episode artifact paths.
- **SwanLab media API mismatch**: isolate media logging in a small best-effort helper.


@@ -0,0 +1,138 @@
# LEWM ViT Backbone Replacement Design
## Goal
Replace the ResNet visual encoder in the current roboimi VLA policy with the frozen ViT visual encoder (encoder + projector) from the LEWM checkpoint, using only the 192-d embedding of the final CLS token as the visual feature.
## User constraints
- Use the trained checkpoint confirmed in `/home/droid/下载/lewm_sim_transfer_checkpoint_usage.md`
- Use only the visual-encoding part: `encoder + projector`
- Weights stay frozen
- Keep the overall scheme of "visual features + state concatenation, then feed into the diffusion transformer"
- Input uses three views: `[r_vis, top, front]`
- Launch two training runs on the 5880 machine: `embed=384/layer=12` and `embed=256/layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- `50k` steps per run
- Rollout validation uses `10` episodes per event (not the previous `5`)
## Trusted existing facts
1. LEWM checkpoint path:
- `/home/droid/le-wm/lewm-sim-transfer/pa1w85md8jop6bvol8oxp/checkpoints/epoch=99-step=47800.ckpt`
2. state_dict prefixes to load:
- `model.encoder.*`
- `model.projector.*`
3. LEWM ViT configuration:
- encoder scale: `tiny`
- hidden size: `192`
- layers: `12`
- attention heads: `3`
- patch size: `14`
- projector: `MLP(192 -> 2048 -> 192)` with `BatchNorm1d + GELU`
4. During LEWM training the three views are first fused into a single image and fed into one ViT encoder; the overall visual embedding is **192-d**.
## Key design decision
### Chosen design: fuse 3 cameras into one LEWM-style image, output one 192-d visual vector per timestep
Do not treat the LEWM ViT as a per-camera 192-d encoder; follow LEWM's original training scheme:
- take the three-view image dict `{r_vis, top, front}`
- concatenate the views into one fused image in a fixed order
- run the single frozen ViT + projector
- obtain one **192-d overall visual feature**
### Why this is the right replacement
The **total visual feature dimension** the current ResNet backbone hands to the policy head is:
- `64` per camera
- `192` across the three cameras
The CLS/projector embedding from the LEWM checkpoint is also:
- `192` in total
So the most natural drop-in replacement for the current ResNet visual encoder is:
- let the LEWM backbone emit one 192-d overall visual vector
- after concatenating with the `16-d` state, the condition vector is still `208-d`
- the diffusion head's overall interface and semantics stay unchanged
## Interface compatibility plan
The existing `VLAAgent` assumes the backbone exposes:
- `camera_names`
- `num_cameras`
- `output_dim` (semantically the per-camera feature dimension)
- `forward(images_dict) -> (B, T, total_visual_dim)`
For minimal-change compatibility with the existing agent:
- the new LEWM backbone's `forward()` returns `(B, T, 192)`
- `camera_names = ('r_vis', 'top', 'front')`
- `num_cameras = 3`
- `output_dim = 64`
`VLAAgent` then still computes internally:
- `per_step_cond_dim = output_dim * num_cams + obs_dim = 64*3 + 16 = 208`
which matches the actual `forward()` output of `192 + 16 = 208`.
> In other words, `output_dim` in this backbone is kept as a per-camera placeholder dimension equivalent to the old ResNet total feature, not the real projector output dimension. It is a compatibility shim that avoids changing the agent's main logic.
## Image preprocessing design
The current roboimi dataset already loads each camera image as:
- `(C, 224, 224)`
- values in `[0, 1]`
The new LEWM backbone will (see the sketch below):
1. take `r_vis`, `top`, `front` in order
2. concatenate them along the width into the fused image
- `(C, 224, 672)`
3. apply the same ImageNet normalization LEWM uses
- mean `[0.485, 0.456, 0.406]`
- std `[0.229, 0.224, 0.225]`
4. call `ViTModel(..., interpolate_pos_encoding=True)`
5. take `last_hidden_state[:, 0]`
6. feed it through the frozen projector to obtain `(B*T, 192)`
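A minimal sketch of steps 1–6, assuming `encoder` is a transformers `ViTModel` and `projector` is the checkpoint's frozen MLP, both already loaded:
```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

@torch.no_grad()
def lewm_visual_features(images, encoder, projector):
    """images: dict of camera name -> (B*T, C, 224, 224) tensors in [0, 1]."""
    fused = torch.cat([images["r_vis"], images["top"], images["front"]], dim=-1)  # (B*T, C, 224, 672)
    fused = (fused - IMAGENET_MEAN.to(fused.device)) / IMAGENET_STD.to(fused.device)
    out = encoder(pixel_values=fused, interpolate_pos_encoding=True)
    cls = out.last_hidden_state[:, 0]   # final CLS token, (B*T, 192)
    return projector(cls)               # frozen projector output, (B*T, 192)
```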
## Files to create / modify
### New files
- `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- `tests/test_lewm_vit_backbone.py`
### Modified files
- `roboimi/vla/models/backbones/__init__` (if an export is needed)
- `tests/test_imf_vla_agent.py` (add the new-backbone integration case)
- `roboimi/demos/vla_scripts/train_vla.py` (only if rollout defaults/logging need adjusting; prefer not touching the main logic if command-line overrides suffice)
- training/experiment suite docs (record this LEWM ViT training run)
## Testing plan
1. **Unit test: load + forward**
- use a synthetic checkpoint to verify the new backbone correctly loads `model.encoder.*` and `model.projector.*`
- input: 3 cameras `(B,T,C,224,224)`
- output: `(B,T,192)`
2. **Agent integration test**
- backbone.output_dim=64, num_cameras=3
- the agent's `_build_cond()` output has last dimension `208`
3. **Remote smoke test on 5880**
- use the real checkpoint
- `max_steps=2`
- smoke each of the two experiments once
4. **Full run**
- GPU0: `embed=384, layer=12`
- GPU1: `embed=256, layer=12`
- `rollout_num_episodes=10`
## Training launch contract
- host: `100.73.14.65`
- code dir: `/home/droid/roboimi_suite_20260404`
- python: `/home/droid/miniforge3/envs/roboimi/bin/python`
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[r_vis, top, front]`
- agent: new `lewm_imf_attnres`
- max_steps: `50000`
- rollout every `5` epochs
- rollout episodes: `10`
## Risks
1. If the fused-image preprocessing orientation is implemented wrong (224x672 vs 672x224), the input distribution will shift.
2. The roboimi env must have `transformers` installed; `environment.yml` shows the dependency exists locally, but the remote training environment needs a smoke check.
3. Because this is a frozen ViT + projector, the projector's BatchNorm statistics would drift if it stayed in train mode, so the whole module must be put in `eval()` and frozen.
## Recommended first implementation path
- First implement a standalone `LEWMViTBackbone` class without touching the existing `ResNetDiffusionBackbone` main logic.
- Then wire it in through new Hydra backbone/agent configs.
- Prioritize minimal invasiveness + a runnable smoke + remote trainability.


@@ -0,0 +1,81 @@
# Phase-2 Full-AttnRes Vision Design
## Goal
In the current roboimi IMF policy, replace all residual units in the vision backbone originally provided by ResNet BasicBlock/Bottleneck with AttnRes-style units, while keeping the existing agent / cond / rollout / training-script interfaces as unchanged as possible.
## User requirement interpretation
The strictest interpretation applies here:
- not "append an AttnRes module after the ResNet"
- not "mix AttnRes into only a few stages"
- instead: every place in the visual trunk that relied on a ResNet residual block becomes a block driven by the AttnRes residual operator
- the final output still matches the per-camera feature interface of the existing `ResNetDiffusionBackbone`, so `SpatialSoftmax -> Linear -> ReLU`, multi-camera concatenation, state concat, and the IMF head conditioning input can all be reused
## Recommended design
### Option A (recommended)
Keep ResNet's macroscopic stage/stem structure and channel/stride plan, but replace the BasicBlock/Bottleneck inside each stage with a new `AttnResImageBlock2D`:
- the input is still a `(B, C, H, W)` feature map
- inside the block, the spatial dims are first flattened into a token sequence `(B, H*W, C)`
- the block transform uses 2D positional encoding / learnable positional bias + AttnRes self-attention + AttnRes FFN
- then reshape back to `(B, C, H, W)`
- downsampling between stages is still done by explicit stride/downsample paths
Pros:
- closest to the requirement that every residual in the ResNet is replaced by AttnRes
- keeps the existing visual output interface and cond_dim, so the agent/head/data pipeline stays untouched
- the existing multi-camera encoder framework can still be reused
Cons:
- requires writing a new 2D AttnRes image block instead of reusing the 1D IMF head block directly
### Option B
Remove the ResNet stages entirely and switch to patchify + a ViT/AttnRes image transformer, followed by SpatialSoftmax/MLP.
Pro: conceptually more uniform.
Con: this no longer counts as "replacing the residuals inside ResNet" but swaps the whole backbone, which does not exactly match the user requirement.
### Option C
Keep the existing ResNet blocks and only add AttnRes mixing around them.
Not recommended, because it fails the requirement that "all residuals are replaced by AttnRes".
## Concrete architecture choice
Adopt Option A:
1. Keep the stem (conv / bn-or-gn / relu / maxpool) and the stage boundaries
2. Add `AttnResImageBlock2D` (see the sketch below)
3. Add `AttnResResNetLikeBackbone2D`, which stacks the stages/blocks
4. Add an optional backbone mode to `ResNetDiffusionBackbone`, e.g.:
- `vision_backbone_mode: resnet`
- `vision_backbone_mode: attnres_resnet`
5. Add a Phase-2 variant of the `resnet_imf_attnres` agent config that enables `attnres_resnet` by default
6. Still keep:
- per-camera output of `64`
- `3 * 64` total multi-camera visual output
- `cond_dim = 208` after concatenation with the state
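A minimal sketch of the `AttnResImageBlock2D` flatten–attend–reshape flow from Option A; the `attnres_attention` / `attnres_ffn` submodules are placeholders that would wrap the existing 1D AttnRes components:
```python
import torch
import torch.nn as nn

class AttnResImageBlock2D(nn.Module):
    """Drop-in replacement for a ResNet residual block operating on (B, C, H, W) feature maps."""

    def __init__(self, channels: int, num_tokens: int,
                 attnres_attention: nn.Module, attnres_ffn: nn.Module):
        super().__init__()
        self.attn = attnres_attention   # AttnRes self-attention over the H*W tokens
        self.ffn = attnres_ffn          # AttnRes SwiGLU-style FFN
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, channels))  # learnable positional bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2) + self.pos_embed   # (B, H*W, C)
        tokens = self.attn(tokens)                               # residual handled inside the AttnRes op
        tokens = self.ffn(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)        # back to a feature map
```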
## Files likely to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- new: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- tests:
- new: `tests/test_attnres_resnet2d_backbone.py`
- update/add wiring test for agent cond dims
## Test plan
1. New backbone instantiates and forwards `(B,T,C,H,W)` multi-camera input
2. Output shape unchanged vs current backbone
3. `output_dim == 64`
4. 3-camera cond path still yields `208`
5. Phase-2 config instantiates full IMF agent successfully
6. One short CPU smoke forward for `compute_loss`
## Phase-2 experiment plan
Fix the best Phase-1 combination:
- `pred_horizon=16`
- `num_action_steps=8`
Compare:
1. baseline: current IMF head-only AttnRes + original ResNet vision backbone
2. phase2: IMF head AttnRes + fully AttnRes-replaced vision backbone
Keep the training hyperparameters identical to the best Phase-1 setting and run one 50k-step comparison first.


@@ -0,0 +1,32 @@
# ResNet Multitoken IMF Design
**Status:** user-specified architecture, treated as approved on 2026-04-06.
## Goal
Keep a standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from one concatenated multiview token per obs step into three camera-specific condition tokens per obs step.
## Approved architecture
- Vision trunk: standard `resnet18` residual network
- Cameras: `front`, `top`, `r_vis`
- Each camera uses its **own** ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
- Each camera produces one visual token
- For each obs step and each camera:
1. take that camera visual token
2. concatenate robot state
3. project to one condition token
- IMF input should receive **3 condition tokens per obs step**, not one concatenated token
- With `obs_horizon=2`, IMF cond sequence length becomes `2 * 3 = 6`
- IMF head remains on the existing IMF/AttnRes implementation path
- Vision trunk remains standard ResNet; **no AttnRes vision replacement**
## Design choices
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
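A minimal sketch of the multi-token condition building described above; `cond_projector` stands in for the existing per-token condition projector and the exact `VLAAgent` hook names may differ:
```python
import torch

def build_multitoken_cond(visual_tokens, state, cond_projector):
    """visual_tokens: (B, T, K, D_vis) per-camera tokens; state: (B, T, D_state)."""
    B, T, K, _ = visual_tokens.shape
    state_per_cam = state.unsqueeze(2).expand(B, T, K, state.shape[-1])  # broadcast state per camera
    paired = torch.cat([visual_tokens, state_per_cam], dim=-1)           # (B, T, K, D_vis + D_state)
    cond_tokens = cond_projector(paired)                                 # (B, T, K, D_cond)
    cond = cond_tokens.flatten(1, 2)                                     # (B, T*K, D_cond) sequence
    condition_tokens_per_step = K                                        # 3 cameras -> 3 tokens per step
    condition_sequence_length = T * K                                    # obs_horizon=2 -> 6
    return cond, condition_tokens_per_step, condition_sequence_length
```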
## Files expected to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/agent.py`
- new Hydra agent config for the multitoken ResNet IMF variant
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`


@@ -0,0 +1,41 @@
# SigLIP2 Multiview VLA Design
**Status:** user-specified architecture, treated as approved on 2026-04-06
## Goal
Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.
## Approved architecture
- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
- dataset images stay at native `256x256` (no dataset-side resize)
- eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
1. `768 -> 96`
2. `768 -> 192`
- Conditioning pipeline:
1. concatenate 3 projected camera vectors
2. concatenate robot state
3. project concatenated condition to `384`
4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for requested experiments:
- `n_emb=384`
- `n_layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- rollout count for validation: keep current requested behavior on this branch unless explicitly overridden later
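A minimal sketch of the conditioning arithmetic in the pipeline above, using `per_view_output_dim=96` (experiment A) and assuming the 16-d robot state used elsewhere in this repo; the projector modules are placeholders:
```python
import torch
import torch.nn as nn

per_view_output_dim = 96            # experiment A; experiment B uses 192
num_views, state_dim, head_cond_dim = 3, 16, 384

view_projector = nn.Linear(768, per_view_output_dim)   # applied to each pooled SigLIP2 feature
cond_projector = nn.Linear(num_views * per_view_output_dim + state_dim, head_cond_dim)

pooled = [torch.randn(2, 768) for _ in range(num_views)]   # per-view pooler_output, batch of 2
state = torch.randn(2, state_dim)
visual = torch.cat([view_projector(p) for p in pooled], dim=-1)   # (2, 3 * per_view_output_dim)
cond = cond_projector(torch.cat([visual, state], dim=-1))         # (2, 384) per-step condition
assert cond.shape[-1] == head_cond_dim
```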
## Design decisions
- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
## Files expected to change
- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training