feat: add vision transfer backbones and IMF variants

2026-04-09 14:02:24 +08:00
parent d51b3ecafa
commit ff7c9c1f2a
58 changed files with 2788 additions and 26 deletions
@@ -0,0 +1,92 @@
+# LEWM ViT Backbone Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k-step runs on the 5880 machine.
+
+**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.
+
+**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.
+
+---
+
+### Task 1: Add failing tests for LEWM joint-vision backbone contract
+
+**Files:**
+- Create: `tests/test_lewm_vit_backbone.py`
+- Modify: `tests/test_imf_vla_agent.py`
+
+- [ ] **Step 1: Write the failing backbone shape/load test**
+- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
+- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
+- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
+
+### Task 2: Implement LEWM joint-multiview frozen backbone
+
+**Files:**
+- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
+- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed
+
+- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
+- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
+- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
+- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
+- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
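
Review note on Step 3: loading from `model.encoder.*` and `model.projector.*` amounts to filtering the checkpoint's state dict by prefix and stripping that prefix before handing the weights to each submodule. The following is a minimal, framework-free sketch of that filtering, assuming the checkpoint is a flat mapping of dotted parameter names. The helper name `split_lewm_state_dict` and the example keys are hypothetical and not taken from this commit.

```python
from typing import Any, Dict, Mapping, Tuple

# Prefixes named in the plan (Task 2, Step 3).
ENCODER_PREFIX = "model.encoder."
PROJECTOR_PREFIX = "model.projector."


def _strip_prefix(state_dict: Mapping[str, Any], prefix: str) -> Dict[str, Any]:
    """Keep only keys that start with `prefix`, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


def split_lewm_state_dict(
    state_dict: Mapping[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (encoder_weights, projector_weights) with prefixes stripped.

    Keys outside the two prefixes are ignored. A KeyError is raised if either
    group is empty, so a wrong checkpoint fails loudly instead of silently
    leaving randomly initialised weights in a "frozen" backbone.
    """
    encoder = _strip_prefix(state_dict, ENCODER_PREFIX)
    projector = _strip_prefix(state_dict, PROJECTOR_PREFIX)
    if not encoder or not projector:
        raise KeyError(
            "checkpoint is missing model.encoder.* or model.projector.* keys"
        )
    return encoder, projector
```

In a PyTorch implementation the two returned dicts could then be passed to each submodule's `load_state_dict(..., strict=True)`, which would also surface any missing or unexpected keys.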
+
+### Task 3: Add minimal agent support for joint visual dim
+
+**Files:**
+- Modify: `roboimi/vla/agent.py`
+- Test: `tests/test_imf_vla_agent.py`
+
+- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
+- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
+- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
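
Review note on Step 1: the dimension contract can be sketched as a single branch on whether the backbone exposes `joint_output_dim`. This is only an illustration under stated assumptions: the function names are hypothetical, and `state_dim=16` is an assumed low-dimensional state width, chosen because 192 + 16 matches the `head.cond_dim=208` in Task 4. The plan itself does not state the state dimension.

```python
from types import SimpleNamespace


def visual_feature_dim(backbone, num_cams: int) -> int:
    """Width of the visual feature for one timestep.

    Joint backbones emit one embedding for all cameras together, so the
    camera count must not multiply the width. Per-camera backbones keep
    the existing `output_dim * num_cams` rule.
    """
    joint_dim = getattr(backbone, "joint_output_dim", None)
    if joint_dim is not None:
        return int(joint_dim)
    return int(backbone.output_dim) * num_cams


def per_step_cond_dim(backbone, num_cams: int, state_dim: int) -> int:
    """Visual feature width plus the (assumed) low-dim state width."""
    return visual_feature_dim(backbone, num_cams) + state_dim


# Stand-ins for real backbones; only the attributes used above are modelled.
joint_backbone = SimpleNamespace(joint_output_dim=192)
per_camera_backbone = SimpleNamespace(output_dim=32)

joint_cond = per_step_cond_dim(joint_backbone, num_cams=3, state_dim=16)
legacy_cond = per_step_cond_dim(per_camera_backbone, num_cams=3, state_dim=16)
```

Using `getattr` with a `None` default keeps existing per-camera backbones working without requiring them to define the new attribute.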
+
+### Task 4: Add Hydra configs for LEWM backbone training
+
+**Files:**
+- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
+- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
+
+- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
+- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
+- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**
+
+### Task 5: Verify focused local tests
+
+**Files:**
+- Reuse the above
+
+- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
+- [ ] **Step 2: If needed, run one tiny local import/forward smoke**
+
+### Task 6: Sync to 5880 and remote smoke with real checkpoint
+
+**Files:**
+- Remote target: `/home/droid/roboimi_suite_20260404`
+
+- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
+- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
+- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
+
+### Task 7: Launch two real 50k-step runs on the 5880 machine
+
+**Files:**
+- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`
+
+- [ ] **Step 1: Launch embed384/layer12 on GPU0**
+- [ ] **Step 2: Launch embed256/layer12 on GPU1**
+- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
+- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
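
Review note on Step 3: one way to keep both launches consistent is to build their override lists from a single shared table, so the two runs can differ only in GPU and `agent.head.n_emb`. The sketch below assumes GPU pinning via the `CUDA_VISIBLE_DEVICES` environment variable and copies the override keys exactly as written in this plan; the fully qualified Hydra paths for `pred_horizon`, `num_action_steps`, and `max_steps` may differ in the real configs. The training entry point is not named in the plan, so it is left out. `RUNS` and `build_launch_spec` are hypothetical names.

```python
from typing import Dict, List, Union

# Overrides shared by both runs, copied verbatim from Task 7, Step 3.
SHARED_OVERRIDES: List[str] = [
    "agent=lewm_imf_attnres",
    "data.camera_names=[r_vis,top,front]",
    "pred_horizon=16",
    "num_action_steps=8",
    "train.rollout_num_episodes=10",
    "max_steps=50000",
]

# Per-run differences from Task 7, Steps 1 and 2.
RUNS: Dict[str, Dict[str, int]] = {
    "embed384": {"gpu": 0, "n_emb": 384},
    "embed256": {"gpu": 1, "n_emb": 256},
}


def build_launch_spec(run_name: str) -> Dict[str, Union[Dict[str, str], List[str]]]:
    """Return the environment and Hydra-style overrides for one run."""
    run = RUNS[run_name]
    return {
        # Assumption: each run is pinned to one GPU through this variable.
        "env": {"CUDA_VISIBLE_DEVICES": str(run["gpu"])},
        "overrides": SHARED_OVERRIDES + [f"agent.head.n_emb={run['n_emb']}"],
    }


specs = {name: build_launch_spec(name) for name in RUNS}
```

Comparing `specs["embed384"]["overrides"]` with `specs["embed256"]["overrides"]` before launch gives a quick check that the runs differ only in the intended setting.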
+
+### Task 8: Update experiment tracking docs and commit
+
+**Files:**
+- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
+- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
+- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`
+
+- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
+- [ ] **Step 2: Record running status after launch**
+- [ ] **Step 3: Commit implementation + docs with a focused message**