merge: imf attnres policy

Conflicts:
- roboimi/demos/vla_scripts/eval_vla.py
- roboimi/envs/double_base.py
# IMF-AttnRes Policy Migration Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Migrate the IMF-AttnRes model, training objective, and one-step inference mechanism from the external `diffusion_policy@185ed659` into RoboIMI, and launch training with matching hyperparameters while preserving the three-camera visual conditioning and the existing training/rollout workflows.

**Architecture:** Keep RoboIMI's existing ResNet three-camera observation encoding, normalization, queue-based online rollout, and training scripts. Add the AttnRes components and the IMF transformer head, plus a dedicated IMF agent that overrides the DDPM-loss / DDIM-inference semantics. The training script receives only minimal wiring changes so the new head/agent can use the existing optimizer, checkpointing, SwanLab, and headless rollout.

**Tech Stack:** PyTorch, Hydra, diffusers schedulers (kept only for compatible initialization), MuJoCo rollout, unittest, SwanLab

---
## File Map

### New files

- `roboimi/vla/models/heads/attnres_transformer_components.py` — local IMF AttnRes building blocks
- `roboimi/vla/models/heads/imf_transformer1d.py` — IMF transformer head exposing `forward(sample, r, t, cond=None)`
- `roboimi/vla/agent_imf.py` — dedicated IMF VLA agent that reuses the existing observation/queue/normalization logic and overrides loss / inference
- `roboimi/vla/conf/head/imf_transformer1d.yaml` — IMF head config
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml` — IMF agent + backbone/head composition config
- `tests/test_imf_transformer1d_external_alignment.py` — alignment tests against external `185ed659`
- `tests/test_imf_vla_agent.py` — loss / inference / queue semantics tests for the IMF agent

### Modified files

- `roboimi/demos/vla_scripts/train_vla.py` — optimizer parameter-group wiring so the new agent trains without further changes
- `roboimi/vla/conf/config.yaml` — defaults stay unchanged; the IMF agent is enabled only via override
- `tests/test_train_vla_transformer_optimizer.py` — cover optimizer-group behavior for the IMF head
- (if needed) `roboimi/vla/models/heads/__init__.py` or a nearby export file — export the new head

---
### Task 1: Write the IMF transformer alignment test

**Files:**
- Create: `tests/test_imf_transformer1d_external_alignment.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`

- [ ] **Step 1: Write a failing test verifying the local IMF head matches external `185ed659` on state-dict keys, forward shapes, forward values, and optim groups**

```python
with torch.no_grad():
    external_out = external_model(sample=sample, r=r, t=t, cond=cond)
    local_out = local_model(sample=sample, r=r, t=t, cond=cond)
assert torch.allclose(local_out, external_out, atol=1e-6, rtol=1e-5)
```

- [ ] **Step 2: Run the unit test and confirm it currently fails**

Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: FAIL, reporting that the `imf_transformer1d` / `attnres` modules do not exist

- [ ] **Step 3: If the test needs to reuse the existing external-loader logic, copy the minimal necessary helpers from `tests/test_transformer1d_external_alignment.py` to avoid depending on session context**

- [ ] **Step 4: Commit the test skeleton**

```bash
git add tests/test_imf_transformer1d_external_alignment.py
git commit -m "test: add IMF transformer external alignment coverage"
```
### Task 2: Implement the AttnRes components and the IMF transformer head

**Files:**
- Create: `roboimi/vla/models/heads/attnres_transformer_components.py`
- Create: `roboimi/vla/models/heads/imf_transformer1d.py`
- Modify: `tests/test_imf_transformer1d_external_alignment.py`

- [ ] **Step 1: Port the AttnRes building blocks from external `185ed659`, keeping names and parameter semantics identical**

Must include:
- `RMSNorm`
- `RMSNormNoWeight`
- `precompute_rope_freqs`
- `apply_rope`
- `GroupedQuerySelfAttention`
- `SwiGLUFFN`
- `AttnResOperator`
- `AttnResSubLayer`
- `AttnResTransformerBackbone`
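As a reference for the expected semantics of the first few components (not the external implementation itself, whose exact epsilon placement and frequency layout must be copied from `185ed659`), `RMSNorm` and the RoPE helpers can be sketched as:

```python
import torch


class RMSNorm(torch.nn.Module):
    """Root-mean-square norm with a learned per-channel scale (sketch)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def precompute_rope_freqs(head_dim: int, max_len: int, theta: float = 10000.0):
    # Complex rotation factors, one per (position, frequency) pair.
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_len).float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)  # (max_len, head_dim // 2)


def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # x: (B, T, n_head, head_dim) -> rotate adjacent channel pairs by position.
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    out = torch.view_as_real(x_c * freqs[: x.shape[1]].unsqueeze(0).unsqueeze(2))
    return out.flatten(-2).type_as(x)
```

The alignment test in Task 1 is what actually pins these down; the sketch only illustrates the intended tensor shapes.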
- [ ] **Step 2: Implement the local IMF head in `imf_transformer1d.py`**

Must satisfy:
- `forward(sample, r, t, cond=None)`
- Support `backbone_type='attnres_full'` by default
- The token sequence is `[r_token, t_token, cond_tokens..., sample_tokens...]`
- The output slices back only the sample-token segment
- Keep `get_optim_groups()` for AdamW parameter grouping
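The contract above can be illustrated with a hypothetical minimal head (the class name, projection layers, and the plain `TransformerEncoder` stand-in are illustrative only; the real layer stack comes from the AttnRes backbone):

```python
import torch


class IMFTransformer1DSketch(torch.nn.Module):
    """Sketch of the IMF head contract: r/t embeddings, cond tokens, and sample
    tokens share one sequence; only the sample segment is returned."""

    def __init__(self, input_dim: int, n_emb: int, cond_dim: int):
        super().__init__()
        self.sample_proj = torch.nn.Linear(input_dim, n_emb)
        self.cond_proj = torch.nn.Linear(cond_dim, n_emb)
        self.r_emb = torch.nn.Linear(1, n_emb)
        self.t_emb = torch.nn.Linear(1, n_emb)
        self.backbone = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(n_emb, nhead=1, batch_first=True), 2)
        self.head = torch.nn.Linear(n_emb, input_dim)

    def forward(self, sample, r, t, cond=None):
        B = sample.shape[0]
        # Token order: [r_token, t_token, cond_tokens..., sample_tokens...]
        tokens = [self.r_emb(r.view(B, 1, 1)), self.t_emb(t.view(B, 1, 1))]
        if cond is not None:
            tokens.append(self.cond_proj(cond))
        tokens.append(self.sample_proj(sample))
        x = self.backbone(torch.cat(tokens, dim=1))
        return self.head(x[:, -sample.shape[1]:])  # slice back only sample tokens
```

Only the interface and token ordering are meaningful here; numerics are fixed by the Task 1 alignment test.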
- [ ] **Step 3: Run the alignment tests and fix any state-dict key / init / no-decay parameter-group mismatches**

Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: PASS

- [ ] **Step 4: Commit the model components**

```bash
git add roboimi/vla/models/heads/attnres_transformer_components.py \
  roboimi/vla/models/heads/imf_transformer1d.py \
  tests/test_imf_transformer1d_external_alignment.py
git commit -m "feat: add IMF AttnRes transformer head"
```
### Task 3: Write IMF agent behavior tests

**Files:**
- Create: `tests/test_imf_vla_agent.py`
- Reference: `roboimi/vla/agent.py`
- Reference: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Write failing tests covering the IMF agent's core contract**

Must cover:
1. `compute_loss()` accepts the current batch structure and returns a scalar loss
2. `predict_action()` outputs `(B, pred_horizon, action_dim)`
3. `select_action()` still follows the queue/chunk semantics
4. `predict_action()` does not run a multi-step DDIM loop; it triggers exactly one IMF sampling step
5. When `action_is_pad` is present, the loss is computed only over valid actions
- [ ] **Step 2: Use a stub backbone / stub head to record call arguments, verifying that `r, t, cond` are passed through and the observation-conditioning dimensions are correct**

```python
self.assertEqual(recorded['cond'].shape, (B, obs_horizon, expected_cond_dim))
self.assertTrue(torch.allclose(recorded['r'], torch.zeros(B)))
self.assertTrue(torch.allclose(recorded['t'], torch.ones(B)))
```

- [ ] **Step 3: Run the tests and confirm they currently fail**

Run: `python -m unittest tests.test_imf_vla_agent -v`
Expected: FAIL, reporting that `roboimi.vla.agent_imf` does not exist

- [ ] **Step 4: Commit the test skeleton**

```bash
git add tests/test_imf_vla_agent.py
git commit -m "test: add IMF VLA agent behavior coverage"
```
### Task 4: Implement the IMF agent and Hydra wiring

**Files:**
- Create: `roboimi/vla/agent_imf.py`
- Create: `roboimi/vla/conf/head/imf_transformer1d.yaml`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_transformer_optimizer.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Implement `IMFVLAAgent` on top of `VLAAgent`**

Implementation strategy:
- Reuse `VLAAgent.__init__`, `_build_cond()`, `reset()`, `_populate_queues()`, `_prepare_observation_batch()`, `select_action()`, `get_normalization_stats()`
- Override:
  - `compute_loss()` -> IMF objective
  - `predict_action()` -> one-step sample
- Provide internal helpers:
  - `_broadcast_batch_time`
  - `_apply_conditioning` (if needed)
  - `_compute_u_and_du_dt`
  - `_compound_velocity`
  - `_sample_one_step`
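For orientation only, a one-step sampler in the spirit of `_sample_one_step` might look like the following; the `r=0` / `t=1` endpoints, the sign of the update, and the meaning of the head output are assumptions that must be taken from the external `185ed659` objective, not from this sketch:

```python
import torch


@torch.no_grad()
def sample_one_step(head, cond, pred_horizon, action_dim, device):
    """One-step IMF sampling sketch: start from pure noise and apply a single
    Euler step of the learned field across the whole interval.
    Endpoint and sign conventions here are assumptions."""
    B = cond.shape[0]
    noise = torch.randn(B, pred_horizon, action_dim, device=device)
    r = torch.zeros(B, device=device)  # assumed: start of the integration interval
    t = torch.ones(B, device=device)   # assumed: noise end of the interval
    velocity = head(sample=noise, r=r, t=t, cond=cond)
    return noise - velocity            # single step; direction is an assumption
```

The behavior test in Task 3 (one head call per `predict_action()`) is the real contract; this sketch only shows why no scheduler loop is needed.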
- [ ] **Step 2: Add a CUDA math-SDPA fallback on the JVP path, preserving the external repo's stability strategy**
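Fused CUDA SDPA kernels generally do not support forward-mode autodiff, so forcing the math backend around the JVP computation is the usual workaround. One plausible shape, using the `torch.nn.attention.sdpa_kernel` API from recent PyTorch releases (verify availability against the pinned torch version and the external repo's actual strategy):

```python
import torch


def jvp_safe_attention(q, k, v):
    """Run scaled_dot_product_attention; on CUDA, restrict SDPA to the math
    backend, which supports forward-mode autodiff (JVP), unlike fused kernels."""
    if q.is_cuda:
        from torch.nn.attention import SDPBackend, sdpa_kernel
        with sdpa_kernel(SDPBackend.MATH):
            return torch.nn.functional.scaled_dot_product_attention(q, k, v)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)
```

On CPU the fallback is unnecessary, so the context manager is applied only on CUDA tensors.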
- [ ] **Step 3: Add Hydra configs so that `agent=resnet_imf_attnres` is instantiable**

Key defaults:
- `_target_: roboimi.vla.agent_imf.IMFVLAAgent`
- `head._target_: roboimi.vla.models.heads.imf_transformer1d.IMFTransformer1D`
- `head.backbone_type: attnres_full`
- `head.causal_attn: false`
- `head.time_as_cond: true`
- `head.n_cond_layers: 0`
- `inference_steps: 1`
- `camera_names: ${data.camera_names}`
- `vision_backbone.camera_names: ${agent.camera_names}`
- [ ] **Step 4: Make the training script reuse parameter grouping for any head exposing `get_optim_groups()`, instead of hard-coding the old transformer head_type**

Recommended minimal change:

```python
use_head_groups = callable(getattr(noise_pred_net, 'get_optim_groups', None))
```
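How that capability check could feed optimizer construction (variable names are illustrative; the `get_optim_groups(weight_decay)` signature follows the diffusion_policy convention):

```python
import torch


def build_optimizer(noise_pred_net, other_params, lr, weight_decay):
    """Prefer the head's own AdamW parameter groups when it exposes them,
    otherwise fall back to a single flat group (sketch)."""
    use_head_groups = callable(getattr(noise_pred_net, "get_optim_groups", None))
    if use_head_groups:
        # Head decides decay / no-decay grouping; backbone params join separately.
        groups = noise_pred_net.get_optim_groups(weight_decay=weight_decay)
        groups.append({"params": list(other_params), "weight_decay": weight_decay})
    else:
        groups = [{"params": list(noise_pred_net.parameters()) + list(other_params),
                   "weight_decay": weight_decay}]
    return torch.optim.AdamW(groups, lr=lr)
```

This keeps the old single-group path intact for heads without `get_optim_groups()`.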
- [ ] **Step 5: Run the tests and fix wiring issues**

Run:
- `python -m unittest tests.test_imf_vla_agent -v`
- `python -m unittest tests.test_train_vla_transformer_optimizer -v`

Expected: PASS

- [ ] **Step 6: Commit the agent / config / train-script wiring**

```bash
git add roboimi/vla/agent_imf.py \
  roboimi/vla/conf/head/imf_transformer1d.yaml \
  roboimi/vla/conf/agent/resnet_imf_attnres.yaml \
  roboimi/demos/vla_scripts/train_vla.py \
  tests/test_imf_vla_agent.py \
  tests/test_train_vla_transformer_optimizer.py
git commit -m "feat: add IMF VLA agent and training wiring"
```
### Task 5: Integration verification and training launch

**Files:**
- Modify: none required unless verification exposes real issues
- Use run artifacts under: `runs/`

- [ ] **Step 1: Run the focused test set**

Run:

```bash
python -m unittest \
  tests.test_imf_transformer1d_external_alignment \
  tests.test_imf_vla_agent \
  tests.test_resnet_transformer_agent_wiring \
  tests.test_train_vla_transformer_optimizer -v
```

Expected: PASS

- [ ] **Step 2: Run a minimal GPU training smoke job (no long run needed)**

Run:

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  agent=resnet_imf_attnres \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  data.camera_names=[r_vis,top,front] \
  train.device=cuda train.max_steps=2 train.batch_size=4 train.num_workers=2 \
  train.use_swanlab=false train.rollout_val_freq_epochs=0
```

Expected: completes 2 steps successfully, produces checkpoint / log, with no shape or JVP errors
- [ ] **Step 3: Launch IMF training with the production parameters**

Run:

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  agent=resnet_imf_attnres \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  data.camera_names=[r_vis,top,front] \
  train.device=cuda train.val_split=0.0 train.seed=42 \
  train.batch_size=80 train.lr=5e-4 train.num_workers=12 train.max_steps=150000 \
  train.log_freq=100 train.save_freq=10000 train.use_swanlab=true \
  train.swanlab_project=roboimi-vla \
  train.rollout_val_freq_epochs=5 train.rollout_validate_on_checkpoint=false \
  train.rollout_num_episodes=5 train.warmup_steps=2000 \
  train.scheduler_type=cosine train.min_lr=1e-6 train.weight_decay=1e-5 train.grad_clip=1.0 \
  agent.pred_horizon=16 agent.inference_steps=1 \
  agent.head.n_emb=384 agent.head.n_layer=18 agent.head.n_head=1 agent.head.n_kv_head=1 \
  agent.vision_backbone.pretrained_backbone_weights=null \
  agent.vision_backbone.freeze_backbone=false \
  agent.vision_backbone.use_separate_rgb_encoder_per_camera=true
```

Expected: training launches successfully, SwanLab records the full config, and a headless rollout runs every 5 epochs

- [ ] **Step 4: Record the run path, training PID, and SwanLab run name, and report them to the user**

- [ ] **Step 5: Commit final cleanup changes (if the smoke run required extra patches)**

```bash
git add <changed files>
git commit -m "chore: verify IMF AttnRes training launch"
```
# IMF Rollout Trajectory Images and Short-Horizon Training Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add training-time rollout front trajectory image export plus SwanLab image logging, then start a new local IMF training run with `emb=384`, `layer=12`, `pred_horizon=8`, `num_action_steps=4`, `max_steps=50000`.

**Architecture:** Extend `eval_vla.py` so a rollout can emit one per-episode static front-view image with red EE trajectory overlay. Extend `train_vla.py` so rollout validation forces image export, forces video off, and uploads those per-episode images to SwanLab. Launch the requested new run through explicit command-line overrides rather than branch-default config changes.

**Tech Stack:** Python, PyTorch, Hydra/OmegaConf, MuJoCo, OpenCV, SwanLab.

---
### Task 1: Add and validate rollout image tests

**Files:**
- Modify: `tests/test_eval_vla_rollout_artifacts.py`
- Modify: `tests/test_train_vla_swanlab_logging.py`
- Modify: `tests/test_train_vla_rollout_validation.py`

- [ ] Add/adjust eval tests so they assert per-episode trajectory image paths are produced without requiring video export.
- [ ] Add/adjust training tests so they assert training-time rollout validation forces `record_video=false`.
- [ ] Add/adjust training tests so they assert trajectory image paths flow from eval summary into SwanLab media logging.
- [ ] Add/adjust training tests so they assert image media is logged, not only scalar reward metrics.
### Task 2: Implement per-episode front trajectory image export in eval

**Files:**
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
- Reuse/Read: `roboimi/utils/raw_action_trajectory_viewer.py`
- Modify: `roboimi/vla/conf/eval/eval.yaml`

- [ ] Add config plumbing for `save_trajectory_image` and `trajectory_image_camera_name`.
- [ ] Ensure the default training-time camera resolution path is pinned to `front`.
- [ ] Implement distinct per-episode image naming so 5 rollout episodes create 5 distinct PNGs.
- [ ] Reuse the existing red trajectory representation logic when composing the PNG.
- [ ] Ensure headless eval works under EGL even on machines with `DISPLAY` set.
### Task 3: Implement SwanLab rollout image logging in training

**Files:**
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_swanlab_logging.py`
- Modify: `tests/test_train_vla_rollout_validation.py`

- [ ] Make `run_rollout_validation()` force `record_video=false`.
- [ ] Make `run_rollout_validation()` force `save_trajectory_image=true` and `trajectory_image_camera_name=front`.
- [ ] Ensure rollout validation still uses 5 episodes per validation event for the requested run.
- [ ] Add a best-effort helper that converts per-episode image paths into SwanLab image media payloads.
- [ ] Keep image-upload failures non-fatal and warning-only.
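The best-effort helper might look like this; `swanlab.Image` accepting a file path mirrors the wandb-style API and is an assumption to verify against the installed SwanLab version:

```python
import warnings


def log_rollout_images(swanlab_run, image_paths, step):
    """Convert per-episode trajectory PNGs into SwanLab media payloads.
    Any failure (missing package, bad path, offline run) only warns."""
    try:
        import swanlab  # imported lazily so training never hard-depends on it
        media = [swanlab.Image(p, caption=f"episode_{i}")
                 for i, p in enumerate(image_paths)]
        swanlab_run.log({"rollout/trajectory_images": media}, step=step)
    except Exception as exc:  # non-fatal by design
        warnings.warn(f"rollout image upload failed: {exc}")
```

Wrapping the whole conversion-plus-log in one `try` keeps the upload strictly best-effort, per the last bullet above.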
### Task 4: Verify action-chunk semantics for the new run

**Files:**
- Verify: `roboimi/vla/agent.py`
- Verify: `roboimi/vla/agent_imf.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] Confirm the existing queue logic still means “predict 8, execute first 4”.
- [ ] Do not change branch defaults unless strictly necessary; prefer launch-time overrides.
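The “predict 8, execute first 4” behavior is the usual action-chunk queue, which can be sketched as follows (illustrative of the `select_action()` semantics being verified, not the actual implementation):

```python
from collections import deque

import torch


class ActionChunkQueue:
    """Refill from predict_fn only when empty, keeping the first
    num_action_steps actions of each predicted pred_horizon chunk (sketch)."""

    def __init__(self, predict_fn, num_action_steps: int):
        self.predict_fn = predict_fn
        self.num_action_steps = num_action_steps
        self.queue = deque()

    def select_action(self, obs):
        if not self.queue:
            chunk = self.predict_fn(obs)  # (B, pred_horizon, action_dim)
            for i in range(self.num_action_steps):  # enqueue only the first K steps
                self.queue.append(chunk[:, i])
        return self.queue.popleft()
```

With `pred_horizon=8` and `num_action_steps=4`, the model is queried once every 4 control steps.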
### Task 5: Verify and launch the requested local training run

**Files:**
- Use: `roboimi/demos/vla_scripts/train_vla.py`
- Use: `roboimi/demos/vla_scripts/eval_vla.py`

- [ ] Run the targeted verification suite.
- [ ] Run one real headless smoke eval and confirm a front trajectory PNG is produced while `video_mp4` stays null.
- [ ] Launch the new local training run with explicit overrides including:
  - `agent=resnet_imf_attnres`
  - `agent.head.n_emb=384`
  - `agent.head.n_layer=12`
  - `agent.pred_horizon=8`
  - `agent.num_action_steps=4`
  - `train.max_steps=50000`
  - `train.rollout_num_episodes=5`
  - `train.use_swanlab=true`
  - current local baseline dataset/camera/CUDA/batch/lr/num_workers/backbone settings
- [ ] Verify PID, GPU allocation, log tail, and SwanLab run URL.
# IMF Horizon Grid and AttnRes Ablation Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Run a 6-run Phase-1 IMF horizon/action-step experiment grid across available GPUs, monitor progress and collect best rollout metrics, then use the best horizon setting for a Phase-2 visual-attnres ablation.

**Architecture:** Use the current IMF training code as-is for Phase-1 by sweeping explicit `(pred_horizon, num_action_steps)` overrides while keeping emb=384, layer=12, and max_steps=50k fixed. Maintain a local experiment suite directory with a manifest and machine-readable status snapshots so progress can be resumed and summarized across turns. After Phase-1 completes, compare the current head-only attnres setup against a variant that also adds attnres into the visual ResNet path.

**Tech Stack:** Python, Hydra/OmegaConf, PyTorch, SSH/Tailscale, JSON/CSV status files, SwanLab.

---
### Task 1: Prepare the experiment suite manifest and state tracking

**Files:**
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/manifest.json`
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/status.json`
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/notes.md`

- [ ] Define the 6 legal Phase-1 combinations: `(8,8)`, `(16,8)`, `(16,16)`, `(32,8)`, `(32,16)`, `(32,32)`.
- [ ] Record for each run: name, host, GPU slot, command, log path, SwanLab run name, and completion criteria.
- [ ] Define the comparison metric as the maximum rollout average reward seen during training (`max avg_reward`), preferably read from the best-checkpoint metadata and cross-checked against logs.
- [ ] Keep `status.json` updated with per-run state: queued / running / finished / failed plus latest parsed progress.
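A minimal shape for the resumable `status.json` update (field names are suggestions consistent with the bullets above):

```python
import json
from pathlib import Path


def update_run_status(status_path, run_name, **fields):
    """Merge per-run progress into status.json so monitoring is resumable
    across turns without rediscovering run state."""
    path = Path(status_path)
    status = json.loads(path.read_text()) if path.exists() else {"runs": {}}
    status["runs"].setdefault(run_name, {"state": "queued"}).update(fields)
    path.write_text(json.dumps(status, indent=2))
    return status


# e.g. update_run_status("status.json", "ph16_as8", state="running", latest_step=1200)
```

Merging rather than overwriting means partial updates (say, only `latest_step`) never lose previously recorded fields like the SwanLab URL.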
### Task 2: Prepare the remote 8-GPU execution target

**Files:**
- Remote working directory under `/home/droid/`
- Reuse or create a synced code directory for this suite

- [ ] Verify the remote dataset path and environment path.
- [ ] Verify GPU availability and reserve 6 GPUs for Phase-1 launches.
- [ ] Sync the required code to a dedicated remote suite directory.
- [ ] Record exact remote paths back into the local suite manifest.
### Task 3: Launch the 6 Phase-1 experiments in parallel

**Files:**
- Reuse: `roboimi/demos/vla_scripts/train_vla.py`
- Modify only local suite tracking files unless a launch bug is discovered

- [ ] Launch 6 runs concurrently with fixed settings: IMF, emb=384, layer=12, max_steps=50k.
- [ ] Keep all other relevant training hyperparameters aligned to the current strong baseline unless a concrete blocker appears.
- [ ] Assign one GPU per run on the 8xL20 host.
- [ ] Capture PID, log path, and SwanLab URL for each run in `status.json`.
### Task 4: Monitor and summarize Phase-1 until all 6 finish

**Files:**
- Update: `experiment_suites/2026-04-04-imf-horizon-grid/status.json`
- Update: `experiment_suites/2026-04-04-imf-horizon-grid/notes.md`

- [ ] Periodically parse each run’s log/checkpoints to extract latest step, latest rollout reward, and best rollout reward so far.
- [ ] Keep a resumable local summary so progress can be continued in later turns without rediscovery.
- [ ] After all 6 runs finish, rank them by `max avg_reward` and write a compact Phase-1 summary.
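Log parsing for the comparison metric could be sketched as follows; the `avg_reward=` log-line format is an assumption about the training logs, which is why the plan also cross-checks against best-checkpoint metadata:

```python
import re


def best_avg_reward(log_text: str):
    """Extract the max rollout avg_reward from training log text (assumed
    lines like 'rollout avg_reward=0.83'); None when no rollout lines exist."""
    rewards = [float(m) for m in re.findall(r"avg_reward[=:]\s*([0-9.]+)", log_text)]
    return max(rewards) if rewards else None
```

Returning `None` (rather than 0.0) distinguishes "no rollout yet" from "rollouts all failed" when ranking runs.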
### Task 5: Prepare the Phase-2 visual-attnres ablation

**Files:**
- Likely modify: vision backbone implementation and config files (to be confirmed after code inspection)
- Add/update targeted tests for the visual backbone path if code changes are needed

- [ ] Use the best Phase-1 `(pred_horizon, num_action_steps)` combination as the fixed rollout setting for Phase-2.
- [ ] Compare:
  1. current setup: attnres only in the IMF head
  2. ablation setup: attnres in both IMF head and visual encoder path
- [ ] Keep the rest of the training settings fixed.
- [ ] Launch and monitor the Phase-2 pair after the Phase-1 summary is complete.
# LEWM ViT Backbone Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k runs on the 5880 machine.

**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from the joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.

**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.

---
### Task 1: Add failing tests for LEWM joint-vision backbone contract

**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
### Task 2: Implement LEWM joint-multiview frozen backbone

**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed

- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
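Step 4's freeze-plus-`train()` override pattern (so the training loop's `model.train()` cannot re-enable dropout or batch-norm updates inside the frozen part) can be sketched as:

```python
import torch


class FrozenEncoderWrapper(torch.nn.Module):
    """Sketch: freeze a wrapped encoder and pin it to eval mode even when the
    parent model is switched to train()."""

    def __init__(self, encoder: torch.nn.Module):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.encoder.eval()

    def train(self, mode: bool = True):
        super().train(mode)       # flips every submodule, including encoder...
        self.encoder.eval()       # ...so force the frozen part back to eval
        return self

    def forward(self, x):
        with torch.no_grad():     # no graph through the frozen trunk
            return self.encoder(x)
```

The same pattern applies to both the LEWM encoder and the projector.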
### Task 3: Add minimal agent support for joint visual dim

**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
### Task 4: Add Hydra configs for LEWM backbone training

**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`

- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**
### Task 5: Verify focused local tests

**Files:**
- Reuse the above

- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**
### Task 6: Sync to 5880 and remote smoke with real checkpoint

**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
### Task 7: Launch two real 50k runs on the 5880 machine

**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`

- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
### Task 8: Update experiment tracking docs and commit

**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`

- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**
# Phase-2 Full-AttnRes Vision Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace all ResNet residual units in the vision backbone with AttnRes-based image blocks while preserving the current IMF agent interfaces, and launch a Phase-2 experiment anchored on the best Phase-1 horizon setting.

**Architecture:** Keep the current multi-camera encoder shell and per-camera output contract, but introduce a new ResNet-like 2D AttnRes backbone that preserves stage-wise downsampling and final SpatialSoftmax conditioning. Wire it into the existing `ResNetDiffusionBackbone` via an opt-in mode and keep the agent/head/data interfaces unchanged.

**Tech Stack:** PyTorch, Hydra/OmegaConf, existing IMF AttnRes transformer components, pytest.

---
### Task 1: Add failing tests for the new full-AttnRes visual backbone

**Files:**
- Create: `tests/test_attnres_resnet2d_backbone.py`
- Update: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write a failing backbone shape test**
- [ ] **Step 2: Run it to confirm the new backbone/config does not exist yet**
- [ ] **Step 3: Add a failing IMF agent wiring test for unchanged cond_dim=208**
- [ ] **Step 4: Run the targeted tests and capture the failure**
### Task 2: Implement a ResNet-like 2D AttnRes backbone

**Files:**
- Create: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`

- [ ] **Step 1: Add minimal 2D tokenization helpers and positional encoding / bias handling**
- [ ] **Step 2: Implement `AttnResImageBlock2D` for feature maps**
- [ ] **Step 3: Implement `AttnResResNetLikeBackbone2D` with stage-wise downsampling**
- [ ] **Step 4: Wire `_SingleRgbEncoder` to choose between the original ResNet trunk and the new full-AttnRes trunk**
- [ ] **Step 5: Run the new backbone tests**
### Task 3: Expose config switches and agent wiring

**Files:**
- Modify: `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- Modify: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`

- [ ] **Step 1: Add a backbone mode/config flag for the full-AttnRes vision trunk**
- [ ] **Step 2: Add defaults for attnres image depth/heads/etc. if needed**
- [ ] **Step 3: Add a Phase-2 launch override path that enables the new visual trunk**
- [ ] **Step 4: Run agent wiring tests again**
### Task 4: Smoke-verify training path

**Files:**
- Reuse existing training scripts and configs

- [ ] **Step 1: Run a short CPU or tiny-step smoke instantiation / `compute_loss` test**
- [ ] **Step 2: If needed, run a very short training smoke launch**
- [ ] **Step 3: Verify no cond-dim or rollout-loading regressions**
### Task 5: Launch the Phase-2 experiment

**Files:**
- Update experiment tracking under `experiment_suites/`

- [ ] **Step 1: Use Phase-1 best setting (`pred_horizon=16`, `num_action_steps=8`)**
- [ ] **Step 2: Launch baseline reference or reuse existing result**
- [ ] **Step 3: Launch full-AttnRes vision experiment**
- [ ] **Step 4: Track rollout metrics and compare max avg_reward**
`docs/superpowers/plans/2026-04-06-resnet-multitoken-imf.md` (new file, 81 lines)
# ResNet Multitoken IMF Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step, and launch four L20 experiments for `n_emb in {256,384}` and `n_layer in {12,16}`.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.

---
### Task 1: Add failing tests for multi-token conditioning

**Files:**
- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Add a direct agent test**
  - Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add Hydra wiring test**
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
### Task 2: Implement multi-token ResNet conditioning path

**Files:**

- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`

- [ ] **Step 1: Extend the ResNet backbone**
  - Add an opt-in flag to return `(B, T, num_cams, D)` camera tokens instead of one concatenated `(B, T, num_cams*D)` token.
  - Keep the standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend `VLAAgent` condition building**
  - Support rank-4 visual features `(B, T, K, D)`.
  - Broadcast state to `(B, T, K, D_state)`, concatenate per camera, apply the projector per token, then flatten to `(B, T*K, D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add the Hydra config**
  - The new agent config uses:
    - a separate ResNet-18 per camera
    - the standard residual vision trunk (`vision_backbone_mode=resnet`)
    - a condition projector output dim tied to `${agent.head.n_emb}`
    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
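The broadcast → concat → project → flatten path in Step 2 can be sketched as a standalone module. The name `MultiTokenCondBuilder` and the exact shapes are illustrative assumptions, not the actual `VLAAgent` API:

```python
import torch
import torch.nn as nn

class MultiTokenCondBuilder(nn.Module):
    """Sketch of the proposed multi-token `_build_cond()` path for rank-4
    visual features. All names here are illustrative, not RoboIMI's API."""

    def __init__(self, d_vis: int, d_state: int, d_cond: int, num_cams: int = 3):
        super().__init__()
        self.projector = nn.Linear(d_vis + d_state, d_cond)
        self.condition_tokens_per_step = num_cams
        self.condition_sequence_length = None  # set once T is known

    def forward(self, vis: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, K, D_vis); state: (B, T, D_state)
        B, T, K, _ = vis.shape
        state_per_cam = state.unsqueeze(2).expand(-1, -1, K, -1)  # (B, T, K, D_state)
        paired = torch.cat([vis, state_per_cam], dim=-1)          # (B, T, K, D_vis+D_state)
        cond = self.projector(paired)                             # (B, T, K, D_cond)
        self.condition_sequence_length = T * K
        return cond.flatten(1, 2)                                 # (B, T*K, D_cond)
```

A 2-step, 3-camera batch then yields a 6-token condition sequence, which is the `n_obs_steps` value Step 3 passes to the head.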
### Task 3: Verify locally

**Files:**

- Modify only if verification reveals issues

- [ ] **Step 1: Run focused tests and make them pass**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run the regression subset**
  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run a local smoke instantiation**
  - Instantiate the new Hydra config and verify the condition shape / sequence length.
### Task 4: Launch 4 L20 experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke-test the new config on the remote host**
- [ ] **Step 3: Launch runs**
  - `(n_emb=256, n_layer=12)`
  - `(n_emb=256, n_layer=16)`
  - `(n_emb=384, n_layer=12)`
  - `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
  - rollout episodes `10`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - standard ResNet-18 vision trunk
  - three separate per-camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, and SwanLab run URLs**
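The 2×2 run matrix can be driven by a small launcher loop. This sketch only echoes each command for review before running; the entry-point path and the Hydra override names (`agent=...`, `agent.head.n_emb`, `agent.head.n_layer`) are assumptions based on this plan's config layout, not verified CLI flags:

```shell
# Echo the four launch commands (n_emb x n_layer grid) without executing them.
# NOTE: override names and script path are assumptions from this plan.
launch_matrix() {
  for n_emb in 256 384; do
    for n_layer in 12 16; do
      echo "python roboimi/demos/vla_scripts/train_vla.py agent=resnet_imf_attnres_multitoken agent.head.n_emb=${n_emb} agent.head.n_layer=${n_layer}"
    done
  done
}
launch_matrix
```

Piping each echoed command into `nohup ... &` (with per-run log paths) would then produce the PIDs and logs Step 5 records.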
New file: `docs/superpowers/plans/2026-04-06-siglip2-multiview-vla.md` (78 lines)
# SigLIP2 Multiview VLA Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.

**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.

**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.

---
### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning

**Files:**

- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write failing backbone tests**
  - Instantiate the new backbone with a stub SigLIP2 vision model.
  - Assert the raw dataset resize is `None`, the eval resize is `(256, 256)`, and the output shape is `(B, T, 3 * per_view_output_dim)`.
  - Assert the three views are encoded independently and projected.
- [ ] **Step 2: Run focused tests and verify RED**
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`.
  - Expect failure because the backbone/config/projector do not exist yet.
- [ ] **Step 3: Extend agent wiring tests**
  - Add a Hydra/instantiate test for the new SigLIP2 IMF config.
  - Assert a raw condition dim of `3 * per_view_output_dim + obs_dim`, a projected cond dim of `384`, and head `cond_dim == 384`.
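The Step 3 dimension assertions, worked through for one assumed setting — `obs_dim` here is a placeholder value, not the real robot state dim:

```python
# Worked example of the wiring-test dimension arithmetic.
# obs_dim is an assumed placeholder; the real value comes from the dataset.
per_view_output_dim = 96
obs_dim = 14
raw_cond_dim = 3 * per_view_output_dim + obs_dim   # 3*96 + 14 = 302
projected_cond_dim = 384                           # must equal head.cond_dim
assert raw_cond_dim == 302
assert projected_cond_dim == 384
```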
### Task 2: Implement SigLIP2 backbone and optional condition projector

**Files:**

- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`

- [ ] **Step 1: Implement the backbone**
  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
  - Normalize `[0, 1]` pixels with mean/std `0.5` and encode each view independently.
  - Project each 768-d pooled feature to the configurable per-view dim and concatenate across cameras.
- [ ] **Step 2: Implement the optional condition projector**
  - Allow `VLAAgent` to accept a `cond_projector`.
  - Track `raw_per_step_cond_dim` and the projected `per_step_cond_dim` / `global_cond_dim`.
  - Apply the projector in `_build_cond()` after the visual+state concatenation.
- [ ] **Step 3: Add Hydra configs**
  - The new agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
  - The backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
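Step 1's encode-independently-then-project flow can be sketched as follows. The `encoder` argument stands in for the frozen SigLIP2 vision tower (in the real backbone, a `SiglipVisionModel` whose pooled output is 768-d); whether the per-view projections share weights is not specified in this plan, so a per-view `ModuleList` is an assumption. All names are illustrative:

```python
import torch
import torch.nn as nn

class MultiviewSiglipBackbone(nn.Module):
    """Sketch of the proposed backbone. `encoder` is any module mapping
    (N, C, H, W) -> (N, 768) pooled features; it is frozen and shared
    across views. Names and the per-view projections are assumptions."""

    def __init__(self, encoder: nn.Module, per_view_output_dim: int, num_cams: int = 3):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)            # frozen, shared across views
        self.proj = nn.ModuleList(
            nn.Linear(768, per_view_output_dim) for _ in range(num_cams)
        )

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, T, K, C, H, W) with pixel values in [0, 1]
        B, T, K = views.shape[:3]
        x = (views - 0.5) / 0.5                # SigLIP-style mean/std 0.5 normalization
        feats = []
        for k in range(K):                     # encode each camera view independently
            pooled = self.encoder(x[:, :, k].flatten(0, 1))    # (B*T, 768)
            feats.append(self.proj[k](pooled).view(B, T, -1))  # (B, T, per_view)
        return torch.cat(feats, dim=-1)        # (B, T, K * per_view_output_dim)
```

With a stub encoder, a `(2, 4, 3, 3, 8, 8)` input and `per_view_output_dim=96` yields a `(2, 4, 288)` feature, matching the `(B, T, 3 * per_view_output_dim)` shape asserted in Task 1.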
### Task 3: Verify locally and prepare remote execution

**Files:**

- Modify as needed only if tests/smoke reveal issues

- [ ] **Step 1: Run focused tests and make them pass**
  - `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 2: Run a local smoke instantiation**
  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**
### Task 4: Sync to 5880 and launch experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Stop superseded remote jobs**
- [ ] **Step 2: Sync updated code to the remote host**
  - Prefer `rsync` or `git push/pull` without overwriting unrelated files.
- [ ] **Step 3: Remote smoke test**
  - Confirm the SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
  - Confirm the headless rollout path still uses the `256x256` eval resize.
- [ ] **Step 4: Launch experiment A**
  - `per_view_output_dim=96`, `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `steps=50000`.
- [ ] **Step 5: Launch experiment B**
  - `per_view_output_dim=192`, with all other hyperparameters identical.
- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs**
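The Step 2 sync can be sketched with `rsync`. This plan does not give the 5880 host's address, so `<5880-host>` is a placeholder, and the exclude list is an assumption about what should not be overwritten; the function only echoes the command for review rather than executing it:

```shell
# Echo (not run) an rsync command for Step 2. <5880-host> is a placeholder;
# the excludes are assumptions. --dry-run previews the transfer first.
sync_cmd() {
  echo rsync -avz --dry-run \
    --exclude '.git' --exclude 'outputs/' --exclude '__pycache__' \
    ./ "droid@<5880-host>:/home/droid/roboimi_suite_20260404/"
}
sync_cmd
```

Once the dry-run output looks right, drop `--dry-run` and run the printed command against the real host.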