feat: add vision transfer backbones and IMF variants
@@ -0,0 +1,92 @@
# LEWM ViT Backbone Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k-step runs on the 5880 machine.

**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so condition dimensions can be sized from the joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.

**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.

---

### Task 1: Add failing tests for LEWM joint-vision backbone contract

**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
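
The attribute part of the backbone contract tested in this task could be sketched roughly as below. This is only an illustration: `backbone_contract_violations` and the two stand-in objects are hypothetical names, and the real tests would also need to cover output shape and checkpoint loading against the actual `LEWMViTBackbone` class once it exists.

```python
from types import SimpleNamespace


def backbone_contract_violations(backbone, expected_cameras, expected_dim=192):
    """Return readable contract violations; an empty list means the contract holds."""
    problems = []
    cameras = tuple(getattr(backbone, "camera_names", ()))
    if cameras != tuple(expected_cameras):
        problems.append(f"camera_names={cameras!r}")
    if getattr(backbone, "num_cameras", None) != len(expected_cameras):
        problems.append("num_cameras does not match camera_names")
    if getattr(backbone, "joint_output_dim", None) != expected_dim:
        problems.append("joint_output_dim is missing or wrong")
    return problems


# Hypothetical stand-ins for illustration only (not the real backbone classes).
CAMERAS = ("front", "top", "r_vis")
resnet_like = SimpleNamespace(camera_names=CAMERAS, num_cameras=3, output_dim=64)
lewm_like = SimpleNamespace(camera_names=CAMERAS, num_cameras=3, joint_output_dim=192)
```

A per-camera backbone without `joint_output_dim` is reported as a violation, which is the kind of "red" result Steps 2 and 4 expect before the implementation lands.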

### Task 2: Implement LEWM joint-multiview frozen backbone

**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed

- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
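
Steps 3 and 4 might look roughly like the following sketch. It is not the real backbone: tiny `nn.Linear` layers stand in for the ViT encoder and projector, `FrozenBackboneSketch`, `strip_prefix`, and `load_lewm_weights` are hypothetical names, and it assumes the checkpoint is a flat state dict whose keys start with `model.encoder.` and `model.projector.` (the actual checkpoint may nest these under another key). Preprocessing and multiview fusion are left out.

```python
import torch
from torch import nn


def strip_prefix(state_dict, prefix):
    """Keep only keys under `prefix` and drop the prefix from each key."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


class FrozenBackboneSketch(nn.Module):
    """Hypothetical stand-in: linear layers replace the real ViT encoder/projector."""

    def __init__(self, in_dim=8, hidden_dim=16, joint_output_dim=192):
        super().__init__()
        self.joint_output_dim = joint_output_dim
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.projector = nn.Linear(hidden_dim, joint_output_dim)
        self._freeze()

    def _freeze(self):
        # No gradients for any parameter, and eval mode for every submodule.
        for param in self.parameters():
            param.requires_grad_(False)
        self.eval()

    def load_lewm_weights(self, checkpoint_state):
        # Assumed key layout: "model.encoder.*" and "model.projector.*".
        self.encoder.load_state_dict(strip_prefix(checkpoint_state, "model.encoder."))
        self.projector.load_state_dict(strip_prefix(checkpoint_state, "model.projector."))
        self._freeze()

    def train(self, mode=True):
        # Ignore the requested mode so an outer agent.train() cannot flip
        # the frozen modules back into training mode.
        super().train(False)
        return self

    @torch.no_grad()
    def forward(self, x):
        return self.projector(self.encoder(x))
```

Overriding `train()` matters because a parent module's `train()` call propagates to all children; without the override, dropout or normalization layers inside a frozen encoder would silently switch back to training behaviour.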

### Task 3: Add minimal agent support for joint visual dim

**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
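
The dimension branch in Step 1 could be sketched as below. This is a standalone illustration, not the actual `VLAAgent` code: the function name and the `state_dim` argument are assumptions, and the value `state_dim=16` in the usage is only a guess at how `head.cond_dim=208` might decompose (192 visual dims plus 16 others); the plan itself does not say where the extra 16 dims come from.

```python
from types import SimpleNamespace


def per_step_cond_dim(backbone, num_cams, state_dim):
    """Size the per-step condition from the joint dim when the backbone exposes one."""
    joint_dim = getattr(backbone, "joint_output_dim", None)
    if joint_dim is not None:
        # Joint backbone: one embedding already covers all cameras.
        visual_dim = int(joint_dim)
    else:
        # Legacy per-camera path: features are concatenated across cameras.
        visual_dim = int(backbone.output_dim) * num_cams
    return visual_dim + state_dim


# Hypothetical stand-ins for illustration only.
joint_backbone = SimpleNamespace(joint_output_dim=192)
per_camera_backbone = SimpleNamespace(output_dim=32)

joint_dim_total = per_step_cond_dim(joint_backbone, num_cams=3, state_dim=16)
legacy_dim_total = per_step_cond_dim(per_camera_backbone, num_cams=3, state_dim=16)
```

Using `getattr` with a default keeps existing backbones untouched, which matches the intent of leaving the rest of the pipeline unchanged.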

### Task 4: Add Hydra configs for LEWM backbone training

**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`

- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**

### Task 5: Verify focused local tests

**Files:**
- Reuse the above

- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**

### Task 6: Sync to 5880 and remote smoke with real checkpoint

**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
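
One way to assemble the sync command from Step 1 is sketched below. It only builds and prints the command rather than running it. The file list is taken from the tasks above and may be incomplete, the helper name is hypothetical, and the sketch assumes the command is run from the repository root with SSH access already configured for the host.

```python
import shlex

REMOTE = "100.73.14.65:/home/droid/roboimi_suite_20260404/"

# Files named in Tasks 1-4; adjust to whatever was actually modified.
FILES = [
    "roboimi/vla/models/backbones/lewm_vit_backbone.py",
    "roboimi/vla/agent.py",
    "roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml",
    "roboimi/vla/conf/agent/lewm_imf_attnres.yaml",
    "tests/test_lewm_vit_backbone.py",
    "tests/test_imf_vla_agent.py",
]


def rsync_command(files, remote=REMOTE, dry_run=True):
    """Build an rsync argv; --relative keeps each file's directory path on the remote."""
    cmd = ["rsync", "-av", "--relative"]
    if dry_run:
        cmd.append("--dry-run")  # list what would be transferred without copying
    return cmd + list(files) + [remote]


print(shlex.join(rsync_command(FILES)))
```

Starting with `--dry-run` gives a chance to confirm the file list before anything on the remote tree is overwritten.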

### Task 7: Launch two real 50k runs on the 5880 machine

**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`

- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
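
To keep the two runs identical except for embedding width, the shared overrides can be defined once, as in this sketch. The override strings are copied verbatim from Step 3 and Task 6; the bare keys (`pred_horizon`, `num_action_steps`, `max_steps`) may need a config-group prefix in the real Hydra tree, the layer-count override for "layer12" is omitted because its key is not given in the plan, and `run_spec` is a hypothetical helper. The training entry point is deliberately not shown.

```python
COMMON_OVERRIDES = [
    "agent=lewm_imf_attnres",
    "data.camera_names=[r_vis,top,front]",
    "pred_horizon=16",            # key copied as written in the plan
    "num_action_steps=8",         # key copied as written in the plan
    "train.rollout_num_episodes=10",
    "max_steps=50000",            # key copied as written in the plan
]


def run_spec(n_emb, gpu):
    """Describe one run: which GPU it is pinned to and its Hydra overrides."""
    return {
        "env": {"CUDA_VISIBLE_DEVICES": str(gpu)},
        "overrides": COMMON_OVERRIDES + [f"agent.head.n_emb={n_emb}"],
    }


# embed384 on GPU0 and embed256 on GPU1, as listed in Steps 1 and 2.
RUNS = [run_spec(384, 0), run_spec(256, 1)]
```

Each spec can then be handed to whatever launcher is used, with `env` merged into the process environment and `overrides` appended to the training command.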

### Task 8: Update experiment tracking docs and commit

**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`

- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**
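
A minimal sketch of writing the manifest from Step 1 follows. The JSON field names are illustrative assumptions, not an existing schema in the repository, and the checkpoint path is passed in by the caller because the plan does not state it.

```python
import json
from pathlib import Path


def write_manifest(suite_dir, checkpoint_path, runs):
    """Write manifest.json for the suite and return its path (field names are hypothetical)."""
    suite_dir = Path(suite_dir)
    suite_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "suite": suite_dir.name,
        "backbone": {
            "type": "lewm_vit",
            "frozen": True,
            "joint_output_dim": 192,
            "checkpoint": checkpoint_path,  # supplied by the caller, not known here
        },
        "rollout_num_episodes": 10,
        "runs": runs,
    }
    path = suite_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2) + "\n")
    return path
```

For example, `write_manifest("experiment_suites/2026-04-05-lewm-vit-transfer", checkpoint_path, runs)` would create the first of the three files listed above; `status.json` could follow the same pattern after launch.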