LEWM ViT Backbone Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k-step training runs on the 5880 machine.

Architecture: Add a new joint-multiview LEWM backbone that fuses front/top/r_vis into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes joint_output_dim=192. Add a minimal VLAAgent compatibility branch so that condition dims can be sized from the joint visual dim instead of output_dim * num_cams, while leaving the rest of the diffusion pipeline unchanged.

Tech Stack: PyTorch, transformers ViTModel, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.


Task 1: Add failing tests for LEWM joint-vision backbone contract

Files:

  • Create: tests/test_lewm_vit_backbone.py

  • Modify: tests/test_imf_vla_agent.py

  • Step 1: Write the failing backbone shape/load test

  • Step 2: Run pytest tests/test_lewm_vit_backbone.py -q and verify it fails

  • Step 3: Extend tests/test_imf_vla_agent.py with a failing joint-output backbone case

  • Step 4: Run pytest tests/test_imf_vla_agent.py -q and verify it fails

Task 2: Implement LEWM joint-multiview frozen backbone

Files:

  • Create: roboimi/vla/models/backbones/lewm_vit_backbone.py

  • Modify: roboimi/vla/models/backbones/__init__.py only if exports are needed

  • Step 1: Create LEWMViTBackbone with public attrs camera_names, num_cameras, joint_output_dim=192

  • Step 2: Reproduce LEWM preprocessing and joint multiview fusion

  • Step 3: Load checkpoint weights from model.encoder.* and model.projector.*

  • Step 4: Freeze encoder/projector and keep them in eval mode via train() override

  • Step 5: Run pytest tests/test_lewm_vit_backbone.py -q and verify green
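
A minimal sketch of Steps 3 and 4, assuming PyTorch. The class name FrozenPair, the helper split_by_prefix, and the tiny nn.Linear stand-ins for the encoder and projector are hypothetical; the real backbone would wrap the LEWM ViT and its projector, and the checkpoint may nest its weights differently from the flat state dict assumed here. Preprocessing and multiview fusion (Step 2) are not covered.

```python
import torch
from torch import nn


def split_by_prefix(state_dict, prefix):
    """Return the entries under `prefix`, with the prefix stripped from each key."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


class FrozenPair(nn.Module):
    """Hypothetical stand-in: an encoder + projector pair that never trains."""

    def __init__(self, encoder, projector):
        super().__init__()
        self.encoder = encoder
        self.projector = projector
        self.requires_grad_(False)  # freeze every parameter
        self.eval()

    def load_from_checkpoint_state(self, state_dict):
        # strict=True (the default) raises on missing or unexpected keys.
        self.encoder.load_state_dict(split_by_prefix(state_dict, "model.encoder."))
        self.projector.load_state_dict(split_by_prefix(state_dict, "model.projector."))

    def train(self, mode=True):
        # Ignore the requested mode so a parent's .train() call cannot
        # switch the frozen modules back to train-time behavior.
        return super().train(False)

    @torch.no_grad()
    def forward(self, x):
        return self.projector(self.encoder(x))


# Fake "checkpoint" whose keys follow the model.encoder.* / model.projector.* layout.
source = nn.ModuleDict({"encoder": nn.Linear(4, 8), "projector": nn.Linear(8, 192)})
ckpt_state = {f"model.{k}": v for k, v in source.state_dict().items()}

backbone = FrozenPair(nn.Linear(4, 8), nn.Linear(8, 192))
backbone.load_from_checkpoint_state(ckpt_state)
backbone.train()  # simulates the agent calling .train(); the backbone stays in eval mode
features = backbone(torch.zeros(2, 4))
```

Overriding train() rather than calling eval() once matters because the enclosing agent's train() call recurses into every child module and would otherwise flip the backbone back into training mode.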

Task 3: Add minimal agent support for joint visual dim

Files:

  • Modify: roboimi/vla/agent.py

  • Test: tests/test_imf_vla_agent.py

  • Step 1: Add a joint_output_dim branch in VLAAgent.__init__ for per_step_cond_dim / global_cond_dim

  • Step 2: Keep _build_cond() semantics unchanged except for matching the new dim contract

  • Step 3: Run pytest tests/test_imf_vla_agent.py -q and verify green
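
The dim contract in Step 1 can be sketched in plain Python. The function names below are hypothetical, and the sketch assumes the condition is the visual feature concatenated with a low-dimensional robot state. The state width of 16 is inferred from head.cond_dim=208 minus the 192-d joint embedding; the plan does not state it. The per-camera output_dim of 32 is a made-up value for contrast.

```python
from types import SimpleNamespace


def visual_feature_dim(backbone):
    """Per-timestep visual feature width for a backbone.

    Joint backbones advertise `joint_output_dim` (one vector for all cameras);
    per-camera backbones advertise `output_dim` and are concatenated across cameras.
    """
    joint_dim = getattr(backbone, "joint_output_dim", None)
    if joint_dim is not None:
        return int(joint_dim)
    return int(backbone.output_dim) * int(backbone.num_cameras)


def per_step_cond_dim(backbone, state_dim):
    # Assumed layout: visual features concatenated with the robot state.
    return visual_feature_dim(backbone) + state_dim


# Hypothetical stand-ins for real backbone objects.
STATE_DIM = 16  # assumption: 208 - 192
lewm_like = SimpleNamespace(joint_output_dim=192, num_cameras=3)
per_camera = SimpleNamespace(output_dim=32, num_cameras=3)

lewm_cond_dim = per_step_cond_dim(lewm_like, STATE_DIM)         # 192 + 16 = 208
per_camera_cond_dim = per_step_cond_dim(per_camera, STATE_DIM)  # 32 * 3 + 16 = 112
```

Checking for joint_output_dim first keeps existing per-camera backbones on the old output_dim * num_cams path without any change to their classes.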

Task 4: Add Hydra configs for LEWM backbone training

Files:

  • Create: roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml

  • Create: roboimi/vla/conf/agent/lewm_imf_attnres.yaml

  • Step 1: Add backbone config pointing to the new LEWM backbone

  • Step 2: Add agent=lewm_imf_attnres config with 3 cameras and head.cond_dim=208

  • Step 3: Verify Hydra instantiation with a one-shot compose smoke test

Task 5: Verify focused local tests

Files:

  • Reuse the above

  • Step 1: Run pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q

  • Step 2: If needed, run one tiny local import/forward smoke

Task 6: Sync to 5880 and remote smoke with real checkpoint

Files:

  • Remote target: /home/droid/roboimi_suite_20260404

  • Step 1: Rsync modified source/config files to 100.73.14.65:/home/droid/roboimi_suite_20260404

  • Step 2: Run a 2-step smoke on GPU0 with agent.head.n_emb=384, train.rollout_num_episodes=10, real LEWM checkpoint

  • Step 3: Run a 2-step smoke on GPU1 with agent.head.n_emb=256, same checkpoint

Task 7: Launch two real 50k runs on the 5880 machine

Files:

  • Remote logs under /home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/

  • Step 1: Launch embed384/layer12 on GPU0

  • Step 2: Launch embed256/layer12 on GPU1

  • Step 3: Ensure both use data.camera_names=[r_vis,top,front], pred_horizon=16, num_action_steps=8, train.rollout_num_episodes=10, max_steps=50000

  • Step 4: Record run names, pids, log paths, SwanLab URLs
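
One way to keep the two launches consistent is to build both command lines from a single list of shared overrides. The sketch below only constructs the argument list and environment; it does not start anything. The override keys are copied as written in this plan and may need fully qualified Hydra paths in the real configs. The "<train_entrypoint>" placeholder, the RUNS table, and the use of CUDA_VISIBLE_DEVICES for GPU pinning are assumptions, and no layer-count override is included because the plan does not name that key.

```python
import os

# Settings that Step 3 requires both runs to share.
SHARED_OVERRIDES = [
    "agent=lewm_imf_attnres",
    "data.camera_names=[r_vis,top,front]",
    "pred_horizon=16",
    "num_action_steps=8",
    "train.rollout_num_episodes=10",
    "max_steps=50000",
]

# Hypothetical run table: one entry per GPU.
RUNS = [
    {"name": "embed384", "gpu": 0, "n_emb": 384},
    {"name": "embed256", "gpu": 1, "n_emb": 256},
]


def build_launch(run, entrypoint):
    """Return (argv, env) for one run without starting it."""
    argv = ["python", entrypoint, *SHARED_OVERRIDES, f"agent.head.n_emb={run['n_emb']}"]
    # Restrict the process to a single GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(run["gpu"]))
    return argv, env


launches = {run["name"]: build_launch(run, "<train_entrypoint>") for run in RUNS}
```

Each (argv, env) pair could then be passed to subprocess.Popen(argv, env=env), with the returned pid recorded for Step 4.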

Task 8: Update experiment tracking docs and commit

Files:

  • Create: experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json

  • Create: experiment_suites/2026-04-05-lewm-vit-transfer/status.json

  • Create: experiment_suites/2026-04-05-lewm-vit-transfer/notes.md

  • Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs

  • Step 2: Record running status after launch

  • Step 3: Commit implementation + docs with a focused message
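
A possible shape for the tracking files in Steps 1 and 2, written with the standard library only. The JSON schema, the function name write_suite_records, and the placeholder checkpoint path are all hypothetical; the plan names the files but not their fields. The demo writes to a temporary directory rather than to experiment_suites/.

```python
import json
import tempfile
from pathlib import Path


def write_suite_records(suite_dir, checkpoint_path, runs):
    """Write manifest.json and status.json for an experiment suite."""
    suite_dir = Path(suite_dir)
    suite_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "checkpoint_path": checkpoint_path,
        "backbone": {"type": "lewm_vit", "frozen": True, "joint_output_dim": 192},
        "rollout_num_episodes": 10,
        "runs": runs,
    }
    # pid and log_path are filled in once the runs have been launched.
    status = {r["name"]: {"state": "running", "pid": None, "log_path": None} for r in runs}
    (suite_dir / "manifest.json").write_text(json.dumps(manifest, indent=2) + "\n")
    (suite_dir / "status.json").write_text(json.dumps(status, indent=2) + "\n")
    return manifest, status


runs = [
    {"name": "embed384", "gpu": 0, "n_emb": 384, "max_steps": 50000},
    {"name": "embed256", "gpu": 1, "n_emb": 256, "max_steps": 50000},
]

with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "2026-04-05-lewm-vit-transfer"
    write_suite_records(suite, "<path to LEWM checkpoint>", runs)
    manifest_back = json.loads((suite / "manifest.json").read_text())
    status_back = json.loads((suite / "status.json").read_text())
```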