# LEWM ViT Backbone Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k runs on the 5880 machine.

**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.

**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.

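
The dimension contract described under **Architecture** can be sketched as follows. This is a minimal illustration, not the actual `VLAAgent` code: `BackboneDims` and `visual_cond_dim` are hypothetical names, the per-camera size of 64 is an arbitrary example, and the extra 16 dimensions added to reach `head.cond_dim=208` are an assumption about the non-visual part of the condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackboneDims:
    """Hypothetical stand-in for the size attributes a backbone exposes."""
    num_cameras: int
    output_dim: Optional[int] = None        # per-camera feature size
    joint_output_dim: Optional[int] = None  # one feature shared by all cameras


def visual_cond_dim(backbone: BackboneDims) -> int:
    """Return the visual part of the condition size.

    A joint backbone contributes `joint_output_dim` once; a per-camera
    backbone contributes `output_dim * num_cameras`.
    """
    if backbone.joint_output_dim is not None:
        return backbone.joint_output_dim
    if backbone.output_dim is None:
        raise ValueError("backbone exposes neither joint_output_dim nor output_dim")
    return backbone.output_dim * backbone.num_cameras


# Joint LEWM-style backbone: three cameras fused into one 192-d embedding.
lewm = BackboneDims(num_cameras=3, joint_output_dim=192)
# Per-camera backbone with an illustrative 64-d feature per view.
per_camera = BackboneDims(num_cameras=3, output_dim=64)

# Assumption: 16 non-visual dims are concatenated to reach cond_dim=208.
ASSUMED_EXTRA_DIM = 16

print(visual_cond_dim(lewm))                      # 192
print(visual_cond_dim(per_camera))                # 192
print(visual_cond_dim(lewm) + ASSUMED_EXTRA_DIM)  # 208
```

Checking for `joint_output_dim` first keeps existing per-camera backbones on their current code path, which matches the plan's intent of a minimal compatibility branch.
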

---

### Task 1: Add failing tests for LEWM joint-vision backbone contract

**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**

### Task 2: Implement LEWM joint-multiview frozen backbone

**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed

- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**

### Task 3: Add minimal agent support for joint visual dim

**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**

### Task 4: Add Hydra configs for LEWM backbone training

**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`

- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**

### Task 5: Verify focused local tests

**Files:**
- Reuse the above

- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**

### Task 6: Sync to 5880 and remote smoke with real checkpoint

**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**

### Task 7: Launch two real 50k runs on the 5880 machine

**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`

- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**

### Task 8: Update experiment tracking docs and commit

**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`

- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**
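
For Task 8, Step 1, one possible shape for `manifest.json` is sketched below. The JSON schema (key names such as `backbone`, `runs`, `n_emb`) is an assumption made for illustration, not an existing roboimi format, and the checkpoint path is left as a placeholder because the plan does not state it. The run settings are taken from Tasks 6 and 7; the demo writes into a temporary directory rather than the repository.

```python
import json
import tempfile
from pathlib import Path

# Settings shared by both 50k runs, taken from Task 7 of the plan.
SHARED = {
    "camera_names": ["r_vis", "top", "front"],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "rollout_num_episodes": 10,
    "max_steps": 50000,
}


def build_manifest(checkpoint_path: str) -> dict:
    """Assemble a manifest dict; the key layout is a hypothetical schema."""
    return {
        "suite": "2026-04-05-lewm-vit-transfer",
        "backbone": {
            "checkpoint": checkpoint_path,
            "frozen": True,
            "joint_output_dim": 192,
        },
        "runs": [
            {"gpu": 0, "n_emb": 384, **SHARED},
            {"gpu": 1, "n_emb": 256, **SHARED},
        ],
    }


def write_manifest(suite_dir: Path, manifest: dict) -> Path:
    """Write manifest.json under suite_dir, creating the directory if needed."""
    suite_dir.mkdir(parents=True, exist_ok=True)
    path = suite_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2) + "\n")
    return path


# Demo: write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    suite_dir = Path(tmp) / "experiment_suites" / "2026-04-05-lewm-vit-transfer"
    out = write_manifest(suite_dir, build_manifest("<LEWM checkpoint path>"))
    loaded = json.loads(out.read_text())

print([run["n_emb"] for run in loaded["runs"]])
```

Keeping the shared settings in one dict makes it harder for the two run entries to drift apart, which is the check Task 7, Step 3 asks for.
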