feat: add vision transfer backbones and IMF variants

2026-04-09 14:02:24 +08:00
parent d51b3ecafa
commit ff7c9c1f2a
58 changed files with 2788 additions and 26 deletions
@@ -0,0 +1,92 @@
+# LEWM ViT Backbone Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k-step runs on the 5880 machine.
+
+**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so that condition dims are sized from the joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.
+
+**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.
+
+---
+
+### Task 1: Add failing tests for LEWM joint-vision backbone contract
+
+**Files:**
+- Create: `tests/test_lewm_vit_backbone.py`
+- Modify: `tests/test_imf_vla_agent.py`
+
+- [ ] **Step 1: Write the failing backbone shape/load test**
+- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
+- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
+- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
+
+### Task 2: Implement LEWM joint-multiview frozen backbone
+
+**Files:**
+- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
+- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed
+
+- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
+- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
+- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
+- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
+- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
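The freezing pattern in Step 4 can be sketched as follows. This is a minimal illustration, not the planned implementation: small `nn.Linear` layers stand in for the real LEWM ViT encoder and projector, and the class name `FrozenBackboneSketch` and the input width are hypothetical. It shows only how disabling gradients and overriding `train()` keeps the backbone in eval mode even when a parent module calls `.train()`.

```python
import torch
from torch import nn


class FrozenBackboneSketch(nn.Module):
    """Hypothetical stand-in for a frozen encoder + projector backbone."""

    def __init__(self, in_dim: int = 8, out_dim: int = 192):
        super().__init__()
        # Assumption: Linear layers replace the real ViT encoder/projector.
        self.encoder = nn.Linear(in_dim, out_dim)
        self.projector = nn.Linear(out_dim, out_dim)
        self.joint_output_dim = out_dim
        # Freeze every parameter so the optimizer never updates them.
        for param in self.parameters():
            param.requires_grad_(False)
        self.eval()

    def train(self, mode: bool = True):
        # Ignore the requested mode: the frozen backbone always stays in eval.
        super().train(False)
        return self

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projector(self.encoder(x))


backbone = FrozenBackboneSketch()
parent = nn.Sequential(backbone)
parent.train()  # the parent switches to train mode, the backbone does not
features = backbone(torch.zeros(2, 8))
```

Because `nn.Module.eval()` delegates to `train(False)`, the single override covers both entry points.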
+
+### Task 3: Add minimal agent support for joint visual dim
+
+**Files:**
+- Modify: `roboimi/vla/agent.py`
+- Test: `tests/test_imf_vla_agent.py`
+
+- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
+- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
+- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
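The dimension branch in Step 1 can be sketched in plain Python. The helper names below are hypothetical, `obs_dim=16` is an assumption inferred from `head.cond_dim=208` minus the 192-d joint embedding, and treating `global_cond_dim` as the per-step dim times the observation horizon is likewise an assumption about the existing agent.

```python
from types import SimpleNamespace


def visual_cond_dim(backbone, num_cams: int) -> int:
    """Width of the per-step visual feature the agent should expect."""
    joint_dim = getattr(backbone, "joint_output_dim", None)
    if joint_dim is not None:
        # Joint backbones already fuse all cameras into one embedding.
        return joint_dim
    # Per-camera backbones concatenate one feature vector per camera.
    return backbone.output_dim * num_cams


def cond_dims(backbone, num_cams: int, obs_dim: int, obs_horizon: int):
    """Return (per_step_cond_dim, global_cond_dim) under the stated assumptions."""
    per_step = visual_cond_dim(backbone, num_cams) + obs_dim
    return per_step, per_step * obs_horizon


OBS_DIM = 16  # assumption: 208 - 192
lewm = SimpleNamespace(joint_output_dim=192)
per_camera = SimpleNamespace(output_dim=32)  # hypothetical per-camera backbone

lewm_dims = cond_dims(lewm, num_cams=3, obs_dim=OBS_DIM, obs_horizon=2)
per_camera_dims = cond_dims(per_camera, num_cams=3, obs_dim=OBS_DIM, obs_horizon=2)
print(lewm_dims)        # (208, 416)
print(per_camera_dims)  # (112, 224)
```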
+
+### Task 4: Add Hydra configs for LEWM backbone training
+
+**Files:**
+- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
+- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
+
+- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
+- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
+- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke test**
+
+### Task 5: Verify focused local tests
+
+**Files:**
+- Reuse the above
+
+- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
+- [ ] **Step 2: If needed, run one tiny local import/forward smoke test**
+
+### Task 6: Sync to 5880 and run remote smoke tests with the real checkpoint
+
+**Files:**
+- Remote target: `/home/droid/roboimi_suite_20260404`
+
+- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
+- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
+- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
+
+### Task 7: Launch two real 50k runs on the 5880 machine
+
+**Files:**
+- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`
+
+- [ ] **Step 1: Launch embed384/layer12 on GPU0**
+- [ ] **Step 2: Launch embed256/layer12 on GPU1**
+- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
+- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
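One way to keep the Step 4 bookkeeping consistent is a small record type that is filled in at launch time. The sketch below is an assumption about how such records could look: the `RunRecord` fields, the run-name pattern, and the `.log` naming are hypothetical, and `pid` / `swanlab_url` stay `None` until a run actually starts. Only the log directory is taken from this plan.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Log directory from this plan; everything else below is illustrative.
LOG_DIR = "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs"


@dataclass
class RunRecord:
    name: str
    gpu: int
    n_emb: int
    n_layer: int
    log_path: str
    pid: Optional[int] = None          # filled in after launch
    swanlab_url: Optional[str] = None  # filled in once the run registers


def make_record(n_emb: int, n_layer: int, gpu: int) -> RunRecord:
    name = f"lewm_embed{n_emb}_layer{n_layer}"  # hypothetical naming scheme
    return RunRecord(name, gpu, n_emb, n_layer, f"{LOG_DIR}/{name}.log")


records = [make_record(384, 12, gpu=0), make_record(256, 12, gpu=1)]
records_json = json.dumps([asdict(r) for r in records], indent=2)
print(records_json)
```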
+
+### Task 8: Update experiment tracking docs and commit
+
+**Files:**
+- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
+- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
+- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`
+
+- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
+- [ ] **Step 2: Record running status after launch**
+- [ ] **Step 3: Commit implementation + docs with a focused message**
@@ -0,0 +1,81 @@
+# ResNet Multitoken IMF Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step, and launch four L20 experiments covering every combination of `n_emb` in `{256, 384}` and `n_layer` in `{12, 16}`.
+
+**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.
+
+**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.
+
+---
+
+### Task 1: Add failing tests for multi-token conditioning
+
+**Files:**
+- Modify: `tests/test_imf_vla_agent.py`
+- Modify: `tests/test_resnet_transformer_agent_wiring.py`
+
+- [ ] **Step 1: Add a direct agent test**
+  - Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
+  - Assert state is paired with each camera token, not concatenated across cameras first.
+- [ ] **Step 2: Add Hydra wiring test**
+  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
+  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
+- [ ] **Step 3: Run focused tests and verify RED**
+  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
+
+### Task 2: Implement multi-token ResNet conditioning path
+
+**Files:**
+- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
+- Modify: `roboimi/vla/agent.py`
+- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`
+
+- [ ] **Step 1: Extend ResNet backbone**
+  - Add an opt-in flag to return `(B,T,num_cams,D)` camera tokens instead of one concatenated `(B,T,num_cams*D)` token.
+  - Keep standard ResNet-18 vision mode; do not switch to AttnRes vision.
+- [ ] **Step 2: Extend VLAAgent condition building**
+  - Support visual features with rank 4 `(B,T,K,D)`.
+  - Broadcast state to `(B,T,K,D_state)`, concatenate per camera, apply projector per token, then flatten to `(B,T*K,D_cond)`.
+  - Track `condition_tokens_per_step` and `condition_sequence_length`.
+- [ ] **Step 3: Update transformer-head instantiation**
+  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
+- [ ] **Step 4: Add Hydra config**
+  - New agent config uses:
+    - separate ResNet-18 per camera
+    - standard residual vision trunk (`vision_backbone_mode=resnet`)
+    - condition projector output dim tied to `${agent.head.n_emb}`
+    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
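The condition-building path from Step 2 can be sketched as below. This is an illustration under assumptions, not the agent's actual `_build_cond()`: the function name `build_multitoken_cond`, the tensor sizes, and the use of a single shared `nn.Linear` projector are hypothetical. It shows the state being broadcast to every camera token, concatenated per camera, projected per token, and flattened to `(B, T*K, D_cond)`.

```python
import torch
from torch import nn


def build_multitoken_cond(visual: torch.Tensor, state: torch.Tensor,
                          projector: nn.Module) -> torch.Tensor:
    """visual: (B, T, K, D), state: (B, T, D_state) -> (B, T*K, D_cond)."""
    B, T, K, _ = visual.shape
    # Broadcast the state so each camera token is paired with the same state.
    state_per_cam = state.unsqueeze(2).expand(B, T, K, state.shape[-1])
    paired = torch.cat([visual, state_per_cam], dim=-1)  # (B, T, K, D + D_state)
    tokens = projector(paired)                           # applied per token
    # Flatten step-major: token index = t * K + k.
    return tokens.reshape(B, T * K, tokens.shape[-1])


# Hypothetical sizes for illustration only.
B, T, K, D, D_STATE, D_COND = 2, 2, 3, 64, 16, 256
visual = torch.randn(B, T, K, D)
state = torch.randn(B, T, D_STATE)
cond = build_multitoken_cond(visual, state, nn.Linear(D + D_STATE, D_COND))
print(tuple(cond.shape))  # (2, 6, 256)
```

With this layout the head sees a sequence of length `obs_horizon * 3`, which is why `n_obs_steps` must be set to `condition_sequence_length` rather than the raw observation horizon.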
+
+### Task 3: Verify locally
+
+**Files:**
+- Modify only if verification reveals issues
+
+- [ ] **Step 1: Run focused tests and make them pass**
+  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
+- [ ] **Step 2: Run regression subset**
+  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
+- [ ] **Step 3: Run local smoke instantiation**
+  - instantiate the new Hydra config and verify cond shape / sequence length
+
+### Task 4: Launch 4 L20 experiments
+
+**Files:**
+- Remote repo copy under `/home/droid/roboimi_suite_20260404`
+
+- [ ] **Step 1: Sync code to `100.119.99.14`**
+- [ ] **Step 2: Smoke the new config on remote**
+- [ ] **Step 3: Launch runs**
+  - `(n_emb=256, n_layer=12)`
+  - `(n_emb=256, n_layer=16)`
+  - `(n_emb=384, n_layer=12)`
+  - `(n_emb=384, n_layer=16)`
+- [ ] **Step 4: Keep fixed across runs**
+  - rollout episodes `10`
+  - `pred_horizon=16`
+  - `num_action_steps=8`
+  - standard ResNet-18 vision trunk
+  - separate ResNet-18 weights for each of the three cameras
+- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**
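The four-run grid in Step 3 can be enumerated programmatically so that no combination is missed. The sketch below only builds Hydra-style override strings; the run-name pattern is hypothetical, and `agent.head.n_layer` is an assumed key chosen by analogy with `agent.head.n_emb`. The remaining fixed settings (`pred_horizon`, `num_action_steps`) are assumed to live in the agent config from Task 2.

```python
from itertools import product

N_EMB = (256, 384)
N_LAYER = (12, 16)


def overrides_for(n_emb: int, n_layer: int) -> list:
    """Hydra-style overrides for one run (key names partly assumed)."""
    return [
        "agent=resnet_imf_attnres_multitoken",
        f"agent.head.n_emb={n_emb}",
        f"agent.head.n_layer={n_layer}",  # assumption: key mirrors n_emb
        "train.rollout_num_episodes=10",
    ]


grid = [
    {
        "name": f"resnet_multitoken_emb{n_emb}_layer{n_layer}",  # hypothetical
        "overrides": overrides_for(n_emb, n_layer),
    }
    for n_emb, n_layer in product(N_EMB, N_LAYER)
]

for run in grid:
    print(run["name"], " ".join(run["overrides"]))
```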
@@ -0,0 +1,78 @@
+# SigLIP2 Multiview VLA Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.
+
+**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.
+
+**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.
+
+---
+
+### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning
+
+**Files:**
+- Create: `tests/test_siglip2_diffusion_backbone.py`
+- Modify: `tests/test_imf_vla_agent.py`
+
+- [ ] **Step 1: Write failing backbone tests**
+  - Instantiate the new backbone with a stub SigLIP2 vision model.
+  - Assert raw dataset resize is `None`, eval resize is `(256, 256)`, output shape is `(B, T, 3 * per_view_output_dim)`.
+  - Assert three views are encoded independently and projected.
+- [ ] **Step 2: Run focused tests and verify RED**
+  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
+  - Expect failure because the backbone/config/projector do not exist yet.
+- [ ] **Step 3: Extend agent wiring tests**
+  - Add a Hydra/instantiate test for a new SigLIP2 IMF config.
+  - Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim `384`, and head `cond_dim == 384`.
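The dimension contract asserted in Step 3 works out as follows for the two planned projection widths. This is a small arithmetic sketch; `OBS_DIM = 16` is a hypothetical value used only for illustration, and the helper names are not part of the codebase.

```python
NUM_VIEWS = 3
HEAD_COND_DIM = 384
OBS_DIM = 16  # assumption: illustrative state dimension


def raw_cond_dim(per_view_output_dim: int, obs_dim: int = OBS_DIM,
                 num_views: int = NUM_VIEWS) -> int:
    """Raw per-step condition width before the optional projector."""
    return num_views * per_view_output_dim + obs_dim


def projector_shape(per_view_output_dim: int) -> tuple:
    """(in_features, out_features) of a linear condition projector."""
    return raw_cond_dim(per_view_output_dim), HEAD_COND_DIM


shapes = {dim: projector_shape(dim) for dim in (96, 192)}
print(shapes)  # {96: (304, 384), 192: (592, 384)}
```

In both cases the head still sees `cond_dim == 384`; only the projector's input width changes.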
+
+### Task 2: Implement SigLIP2 backbone and optional condition projector
+
+**Files:**
+- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
+- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
+- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
+- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
+- Modify: `roboimi/vla/models/backbones/__init__.py`
+- Modify: `roboimi/vla/agent.py`
+
+- [ ] **Step 1: Implement backbone**
+  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
+  - Normalize `[0,1]` pixels with mean/std `0.5` and encode each view independently.
+  - Project each 768-d pooled feature to configurable per-view dim and concatenate across cameras.
+- [ ] **Step 2: Implement optional condition projector**
+  - Allow `VLAAgent` to accept `cond_projector`.
+  - Track `raw_per_step_cond_dim` and projected `per_step_cond_dim` / `global_cond_dim`.
+  - Apply the projector in `_build_cond()` after visual+state concatenation.
+- [ ] **Step 3: Add Hydra configs**
+  - New agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
+  - Backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
+
+### Task 3: Verify locally and prepare remote execution
+
+**Files:**
+- Modify only if tests or smoke runs reveal issues
+
+- [ ] **Step 1: Run focused tests and make them pass**
+  - `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
+- [ ] **Step 2: Run a local smoke instantiation**
+  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
+- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**
+
+### Task 4: Sync to 5880 and launch experiments
+
+**Files:**
+- Remote repo copy under `/home/droid/roboimi_suite_20260404`
+
+- [ ] **Step 1: Stop superseded remote jobs**
+- [ ] **Step 2: Sync updated code to remote**
+  - Prefer `rsync` or `git push/pull` without overwriting unrelated files.
+- [ ] **Step 3: Remote smoke test**
+  - Confirm SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
+  - Confirm headless rollout path still uses `256x256` eval resize.
+- [ ] **Step 4: Launch experiment A**
+  - `per_view_output_dim=96`, `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `max_steps=50000`.
+- [ ] **Step 5: Launch experiment B**
+  - `per_view_output_dim=192`, same other hyperparameters.
+- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.**