JiajunLI/roboimi

roboimi/docs/superpowers/plans/2026-04-05-lewm-vit-backbone-implementation.md

Logic ff7c9c1f2a feat: add vision transfer backbones and IMF variants

2026-04-09 14:02:24 +08:00

LEWM ViT Backbone Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep. Then launch two 50k-step runs on the 5880 machine.

Architecture: Add a new joint-multiview LEWM backbone that fuses front/top/r_vis into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes joint_output_dim=192. Add a minimal VLAAgent compatibility branch so that condition dims can be sized from the joint visual dim instead of output_dim * num_cams, leaving the rest of the diffusion pipeline unchanged.

Tech Stack: PyTorch, transformers ViTModel, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.

Task 1: Add failing tests for LEWM joint-vision backbone contract

Files:

Create: tests/test_lewm_vit_backbone.py
Modify: tests/test_imf_vla_agent.py
Step 1: Write the failing backbone shape/load test
Step 2: Run pytest tests/test_lewm_vit_backbone.py -q and verify it fails
Step 3: Extend tests/test_imf_vla_agent.py with a failing joint-output backbone case
Step 4: Run pytest tests/test_imf_vla_agent.py -q and verify it fails

Task 2: Implement LEWM joint-multiview frozen backbone

Files:

Create: roboimi/vla/models/backbones/lewm_vit_backbone.py
Modify: roboimi/vla/models/backbones/__init__.py (only if exports are needed)
Step 1: Create LEWMViTBackbone with public attrs camera_names, num_cameras, joint_output_dim=192
Step 2: Reproduce LEWM preprocessing and joint multiview fusion
Step 3: Load checkpoint weights from model.encoder.* and model.projector.*
Step 4: Freeze encoder/projector and keep them in eval mode via train() override
Step 5: Run pytest tests/test_lewm_vit_backbone.py -q and verify green
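
Steps 3 and 4 above can be sketched as follows. This is a minimal illustration, not the planned LEWMViTBackbone: the class name FrozenJointBackbone, the in_dim parameter, and the tiny linear layers standing in for the ViT encoder and projector are all assumptions made so the example runs without a real checkpoint. It shows only the pattern: filter the checkpoint state dict by the model.encoder.* and model.projector.* prefixes, disable gradients, and override train() so the frozen parts stay in eval mode.

```python
import torch
from torch import nn


class FrozenJointBackbone(nn.Module):
    """Hypothetical stand-in for LEWMViTBackbone (linear layers replace the ViT)."""

    def __init__(self, camera_names, in_dim=32, joint_output_dim=192):
        super().__init__()
        self.camera_names = tuple(camera_names)
        self.num_cameras = len(self.camera_names)
        self.joint_output_dim = joint_output_dim
        # Assumption: tiny layers stand in for the real encoder and projector.
        self.encoder = nn.Linear(in_dim, 64)
        self.projector = nn.Linear(64, joint_output_dim)
        # Freeze every parameter and switch to eval mode once at construction.
        for param in self.parameters():
            param.requires_grad_(False)
        self.eval()

    def load_from_checkpoint_state(self, state_dict):
        """Load only the model.encoder.* and model.projector.* entries."""
        for name, module in (("encoder", self.encoder), ("projector", self.projector)):
            prefix = f"model.{name}."
            sub_state = {
                key[len(prefix):]: value
                for key, value in state_dict.items()
                if key.startswith(prefix)
            }
            module.load_state_dict(sub_state, strict=True)

    def train(self, mode=True):
        # Let nn.Module update the flags, then force the frozen parts back to eval
        # so a parent agent calling .train() cannot re-enable dropout and the like.
        super().train(mode)
        self.encoder.eval()
        self.projector.eval()
        return self

    @torch.no_grad()
    def forward(self, fused):
        return self.projector(self.encoder(fused))


torch.manual_seed(0)
fake_ckpt = {
    "model.encoder.weight": torch.randn(64, 32),
    "model.encoder.bias": torch.zeros(64),
    "model.projector.weight": torch.randn(192, 64),
    "model.projector.bias": torch.zeros(192),
    "model.predictor.weight": torch.randn(4, 4),  # unrelated key, ignored by the filter
}
backbone = FrozenJointBackbone(["front", "top", "r_vis"])
backbone.load_from_checkpoint_state(fake_ckpt)
backbone.train()  # the frozen submodules stay in eval mode
features = backbone(torch.randn(2, 8, 32))  # (batch, time, in_dim)
```

Loading each submodule with strict=True means a missing or misnamed checkpoint key fails loudly at load time, which is the behaviour the Task 1 load test would presumably want to pin down.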

Task 3: Add minimal agent support for joint visual dim

Files:

Modify: roboimi/vla/agent.py
Test: tests/test_imf_vla_agent.py
Step 1: Add a joint_output_dim branch in VLAAgent.__init__ for per_step_cond_dim / global_cond_dim
Step 2: Keep _build_cond() semantics unchanged except for matching the new dim contract
Step 3: Run pytest tests/test_imf_vla_agent.py -q and verify green
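
The dimension branch in Step 1 might look like the sketch below. The helper name visual_cond_dim and the two stub classes are hypothetical, and the per-camera output_dim of 128 is an arbitrary illustrative value. The 16-d remainder is also an assumption: it is simply the value that makes 192 + 16 match the head.cond_dim=208 used in Task 4, for example a 16-d state vector concatenated per step.

```python
def visual_cond_dim(backbone):
    """Return the visual feature width that the agent conditions on.

    Joint backbones expose joint_output_dim (one embedding for all cameras).
    Per-camera backbones expose output_dim, which is multiplied by num_cameras.
    """
    joint_dim = getattr(backbone, "joint_output_dim", None)
    if joint_dim is not None:
        return int(joint_dim)
    return int(backbone.output_dim) * int(backbone.num_cameras)


class PerCameraBackboneStub:
    """Hypothetical ResNet-style backbone: one feature vector per camera."""
    output_dim = 128  # illustrative value, not taken from the plan
    num_cameras = 3


class JointBackboneStub:
    """Hypothetical LEWM-style backbone: one joint embedding per timestep."""
    joint_output_dim = 192
    num_cameras = 3


STATE_DIM = 16  # assumption: chosen so that 192 + 16 equals head.cond_dim=208

legacy_visual_dim = visual_cond_dim(PerCameraBackboneStub())   # 128 * 3
joint_visual_dim = visual_cond_dim(JointBackboneStub())        # 192
per_step_cond_dim = joint_visual_dim + STATE_DIM
```

Keeping the branch in one small helper means the existing per-camera path is untouched, which is in line with leaving _build_cond() semantics unchanged.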

Task 4: Add Hydra configs for LEWM backbone training

Files:

Create: roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml
Create: roboimi/vla/conf/agent/lewm_imf_attnres.yaml
Step 1: Add backbone config pointing to the new LEWM backbone
Step 2: Add agent=lewm_imf_attnres config with 3 cameras and head.cond_dim=208
Step 3: Verify Hydra instantiation with a one-shot compose smoke test

Task 5: Verify focused local tests

Files:

Reuse the files above (no new files)
Step 1: Run pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q
Step 2: If needed, run one tiny local import/forward smoke test

Task 6: Sync to 5880 and remote smoke with real checkpoint

Files:

Remote target: /home/droid/roboimi_suite_20260404
Step 1: Rsync modified source/config files to 100.73.14.65:/home/droid/roboimi_suite_20260404
Step 2: Run a 2-step smoke on GPU0 with agent.head.n_emb=384, train.rollout_num_episodes=10, real LEWM checkpoint
Step 3: Run a 2-step smoke on GPU1 with agent.head.n_emb=256, same checkpoint

Task 7: Launch two real 50k runs on the 5880 machine

Files:

Remote logs under /home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/
Step 1: Launch embed384/layer12 on GPU0
Step 2: Launch embed256/layer12 on GPU1
Step 3: Ensure both use data.camera_names=[r_vis,top,front], pred_horizon=16, num_action_steps=8, train.rollout_num_episodes=10, max_steps=50000
Step 4: Record run names, pids, log paths, SwanLab URLs
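
To keep the two launches consistent with Step 3, the shared overrides can be generated from one list rather than typed twice. The sketch below only builds command strings and does not run anything. The script name train_vla.py and the run names are hypothetical placeholders, and the override keys are copied as written in this plan, so their fully qualified Hydra paths in the real configs may differ.

```python
import shlex

# Overrides shared by both runs, copied verbatim from Step 3 and Task 4.
COMMON_OVERRIDES = [
    "agent=lewm_imf_attnres",
    "data.camera_names=[r_vis,top,front]",
    "pred_horizon=16",
    "num_action_steps=8",
    "train.rollout_num_episodes=10",
    "max_steps=50000",
]

# Run names are illustrative; only the GPU index and n_emb come from the plan.
RUNS = [
    {"name": "embed384_layer12", "gpu": 0, "n_emb": 384},
    {"name": "embed256_layer12", "gpu": 1, "n_emb": 256},
]


def build_command(run, script="train_vla.py"):
    """Build one shell command string; `script` is a hypothetical entry point."""
    overrides = COMMON_OVERRIDES + [f"agent.head.n_emb={run['n_emb']}"]
    args = ["python", script] + overrides
    quoted = " ".join(shlex.quote(arg) for arg in args)
    # CUDA_VISIBLE_DEVICES pins the process to a single GPU.
    return f"CUDA_VISIBLE_DEVICES={run['gpu']} {quoted}"


commands = {run["name"]: build_command(run) for run in RUNS}
for name, command in commands.items():
    print(f"{name}: {command}")
```

shlex.quote wraps the camera list in single quotes, which keeps the shell from interpreting the square brackets before Hydra sees them.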

Task 8: Update experiment tracking docs and commit

Files:

Create: experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json
Create: experiment_suites/2026-04-05-lewm-vit-transfer/status.json
Create: experiment_suites/2026-04-05-lewm-vit-transfer/notes.md
Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs
Step 2: Record running status after launch
Step 3: Commit implementation + docs with a focused message
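
Steps 1 and 2 could be scripted along these lines. The JSON field names are assumptions rather than an existing roboimi schema, and the checkpoint path is left as a placeholder argument because this plan does not state it. The example writes into a temporary directory so that it has no side effects.

```python
import json
import tempfile
from pathlib import Path


def write_suite_records(suite_dir, checkpoint_path, runs):
    """Write manifest.json and status.json; all field names are hypothetical."""
    suite_dir = Path(suite_dir)
    suite_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "checkpoint_path": checkpoint_path,
        "backbone": {"type": "lewm_vit", "frozen": True, "joint_output_dim": 192},
        "rollout_num_episodes": 10,
        "runs": runs,
    }
    # pid, log path and SwanLab URL are filled in after launch (Task 7, Step 4).
    status = {
        run["name"]: {"state": "running", "pid": None, "log_path": None, "swanlab_url": None}
        for run in runs
    }
    (suite_dir / "manifest.json").write_text(json.dumps(manifest, indent=2) + "\n")
    (suite_dir / "status.json").write_text(json.dumps(status, indent=2) + "\n")
    return manifest, status


runs = [
    {"name": "embed384_layer12", "gpu": 0, "n_emb": 384, "max_steps": 50000},
    {"name": "embed256_layer12", "gpu": 1, "n_emb": 256, "max_steps": 50000},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest, status = write_suite_records(
        Path(tmp) / "2026-04-05-lewm-vit-transfer",
        "path/to/lewm_checkpoint",  # placeholder, not the real checkpoint path
        runs,
    )
    suite_path = Path(tmp) / "2026-04-05-lewm-vit-transfer"
    reloaded_manifest = json.loads((suite_path / "manifest.json").read_text())
    reloaded_status = json.loads((suite_path / "status.json").read_text())
```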
