# LEWM ViT Backbone Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep. Then launch two 50k-step training runs on the 5880 machine.

**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so that condition dimensions can be sized from the joint visual dim instead of `output_dim * num_cams`, leaving the rest of the diffusion pipeline unchanged.

**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.

---

### Task 1: Add failing tests for LEWM joint-vision backbone contract

**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**

### Task 2: Implement LEWM joint-multiview frozen backbone

**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed

- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
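The prefix handling in Step 3 can be sketched in plain Python. This is a minimal illustration, assuming the checkpoint exposes a flat mapping whose keys start with `model.encoder.` and `model.projector.` as named above. The helper name `split_by_prefix` and the key suffixes in the stand-in checkpoint are hypothetical, and whether the real file nests this mapping under another key is not known from this plan.

```python
from typing import Any, Dict, Mapping


def split_by_prefix(state_dict: Mapping[str, Any], prefix: str) -> Dict[str, Any]:
    """Return the entries under `prefix`, with the prefix stripped from each key."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Stand-in checkpoint: real values would be tensors, and the key names after
# the two prefixes are invented for illustration only.
fake_checkpoint = {
    "model.encoder.layer0.weight": "enc_w",
    "model.projector.fc.weight": "proj_w",
    "model.predictor.fc.weight": "unused",
}

encoder_state = split_by_prefix(fake_checkpoint, "model.encoder.")
projector_state = split_by_prefix(fake_checkpoint, "model.projector.")
```

Each stripped dict could then be passed to the matching submodule's `load_state_dict` with `strict=True`, so that a missing or unexpected key fails loudly instead of leaving part of the backbone randomly initialized.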

### Task 3: Add minimal agent support for joint visual dim

**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
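The dimension branch in Step 1 could look roughly like the sketch below. It assumes the agent can tell the two backbone kinds apart by the presence of a `joint_output_dim` attribute. The function name `visual_cond_dim`, the `SimpleNamespace` stand-ins, and the per-camera width of 32 are hypothetical and not taken from `roboimi/vla/agent.py`.

```python
from types import SimpleNamespace


def visual_cond_dim(backbone, num_cams: int) -> int:
    """Visual feature width per timestep for either backbone style."""
    joint_dim = getattr(backbone, "joint_output_dim", None)
    if joint_dim is not None:
        # Joint backbone: one embedding already covers all camera views.
        return int(joint_dim)
    # Per-camera backbone: features from each camera are concatenated.
    return int(backbone.output_dim) * num_cams


# Stand-in backbones; only the attributes read above are modelled.
lewm_like = SimpleNamespace(joint_output_dim=192)
per_camera_like = SimpleNamespace(output_dim=32)  # 32 is an arbitrary example

joint_dim = visual_cond_dim(lewm_like, num_cams=3)        # 192
concat_dim = visual_cond_dim(per_camera_like, num_cams=3)  # 32 * 3 = 96
```

With the joint branch the visual part stays at 192 regardless of camera count. If the `head.cond_dim=208` used in Task 4 is the visual width plus non-visual conditioning, that would leave 208 - 192 = 16 dims for the latter; this split is inferred from the numbers in this plan and should be checked against the agent code.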

### Task 4: Add Hydra configs for LEWM backbone training

**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`

- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke test**

### Task 5: Verify focused local tests

**Files:**
- Reuse the above

- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke test**

### Task 6: Sync to 5880 and run remote smoke tests with the real checkpoint

**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke test on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, and the real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke test on GPU1 with `agent.head.n_emb=256` and the same checkpoint**

### Task 7: Launch two full 50k-step runs on the 5880 machine

**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`

- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
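To keep the two launches identical apart from GPU and embedding width, the shared overrides from Step 3 can be built once and reused. The sketch below only assembles the command lines and does not start anything. The entry point `train_vla.py` and the helper `build_launch` are hypothetical, the override keys are copied as written in this plan (the real config may nest `pred_horizon`, `num_action_steps`, and `max_steps` under a group), and the layer-count override is left out because its key is not given here.

```python
import shlex
from typing import Dict, List

# Overrides shared by both runs, copied from Step 3 of this task. Key paths are
# written as they appear in the plan; the real config may nest them deeper.
SHARED_OVERRIDES: List[str] = [
    "agent=lewm_imf_attnres",
    "data.camera_names=[r_vis,top,front]",
    "pred_horizon=16",
    "num_action_steps=8",
    "train.rollout_num_episodes=10",
    "max_steps=50000",
]


def build_launch(gpu: int, n_emb: int, entrypoint: str) -> Dict[str, object]:
    """Assemble the environment and argv for one run without executing it."""
    overrides = SHARED_OVERRIDES + [f"agent.head.n_emb={n_emb}"]
    return {
        "env": {"CUDA_VISIBLE_DEVICES": str(gpu)},
        "argv": ["python", entrypoint, *overrides],
    }


ENTRYPOINT = "train_vla.py"  # hypothetical script name, not from the plan
runs = [build_launch(0, 384, ENTRYPOINT), build_launch(1, 256, ENTRYPOINT)]

for run in runs:
    # shlex.join quotes the bracketed camera list so it is safe to paste into a shell.
    print(run["env"], shlex.join(run["argv"]))
```

Printing the quoted command before launching gives a line that can be pasted into the run notes for Step 4 alongside the pid and log path.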

### Task 8: Update experiment tracking docs and commit

**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`

- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**