roboimi/docs/superpowers/plans/2026-04-06-siglip2-multiview-vla.md

# SigLIP2 Multiview VLA Implementation Plan

**For agentic workers:** REQUIRED SUB-SKILL: Use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task by task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.

**Architecture:** A new backbone independently encodes each camera view with SigLIP2 and projects each 768-d pooled feature down to a configurable per-view dimension. `VLAAgent` concatenates the visual features with the robot state, then optionally projects the combined per-step condition to the head's required 384-d interface before diffusion training/inference.
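The wiring above can be sketched in PyTorch. This is a minimal sketch, not the repo's actual API: it assumes an injected vision module that returns the pooled `(B, 768)` feature directly (with the real `transformers` model, the pooled vector sits on the output object's `pooler_output` attribute), and all class names, the `obs_dim` value, and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class StubSigLIP2(nn.Module):
    """Offline stand-in for the pretrained vision tower (hypothetical)."""
    def forward(self, pixel_values):
        return pixel_values.new_zeros(pixel_values.shape[0], 768)

class SigLIP2DiffusionBackbone(nn.Module):
    """Sketch: a frozen shared SigLIP2 encoder applied per view,
    followed by a trainable per-view linear projection."""
    def __init__(self, vision_model, feat_dim=768, per_view_output_dim=96, n_views=3):
        super().__init__()
        self.vision_model = vision_model.eval()
        for p in self.vision_model.parameters():
            p.requires_grad_(False)  # frozen shared encoder
        self.proj = nn.ModuleList(
            nn.Linear(feat_dim, per_view_output_dim) for _ in range(n_views)
        )

    def forward(self, views):
        # views: list of n_views tensors, each (B, T, 3, 256, 256) in [0, 1]
        feats = []
        for view, proj in zip(views, self.proj):
            b, t = view.shape[:2]
            x = (view.flatten(0, 1) - 0.5) / 0.5  # mean/std 0.5 normalization
            pooled = self.vision_model(x)         # (B*T, 768) pooled feature
            feats.append(proj(pooled).view(b, t, -1))
        return torch.cat(feats, dim=-1)           # (B, T, n_views * per_view_output_dim)

# Condition assembly roughly as the agent would do it (obs_dim is a placeholder):
obs_dim = 14
backbone = SigLIP2DiffusionBackbone(StubSigLIP2())
cond_projector = nn.Linear(3 * 96 + obs_dim, 384)  # optional 384-d head interface
views = [torch.rand(2, 4, 3, 256, 256) for _ in range(3)]
visual = backbone(views)                           # (2, 4, 288)
cond = cond_projector(torch.cat([visual, torch.rand(2, 4, obs_dim)], dim=-1))
```

With `per_view_output_dim=96` this yields a raw per-step condition of `3 * 96 + obs_dim` and a projected condition of 384, matching the head interface described above.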

**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.


## Task 1: Add failing tests for SigLIP2 backbone and projected conditioning

Files:

- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] Step 1: Write failing backbone tests
  - Instantiate the new backbone with a stub SigLIP2 vision model.
  - Assert raw dataset resize is `None`, eval resize is `(256, 256)`, and output shape is `(B, T, 3 * per_view_output_dim)`.
  - Assert the three views are encoded independently and projected.
- [ ] Step 2: Run focused tests and verify RED
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
  - Expect failure because the backbone/config/projector do not exist yet.
- [ ] Step 3: Extend agent wiring tests
  - Add a Hydra/instantiate test for the new SigLIP2 IMF config.
  - Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim 384, and `head.cond_dim == 384`.

## Task 2: Implement SigLIP2 backbone and optional condition projector

Files:

- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`

- [ ] Step 1: Implement the backbone
  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
  - Normalize `[0, 1]` pixels with mean/std 0.5 and encode each view independently.
  - Project each 768-d pooled feature to the configurable per-view dim and concatenate across cameras.
- [ ] Step 2: Implement the optional condition projector
  - Allow `VLAAgent` to accept a `cond_projector`.
  - Track `raw_per_step_cond_dim` and the projected `per_step_cond_dim` / `global_cond_dim`.
  - Apply the projector in `_build_cond()` after the visual+state concatenation.
- [ ] Step 3: Add Hydra configs
  - The new agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
  - The backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
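The backbone config of Step 3 could look like the sketch below. Only the two resize keys are specified by this plan; the `_target_` path and the remaining key names are assumptions that must match whatever Task 2's implementation actually exposes.

```yaml
# roboimi/vla/conf/backbone/siglip2_diffusion.yaml (illustrative sketch)
_target_: roboimi.vla.models.backbones.siglip2_diffusion_backbone.SigLIP2DiffusionBackbone
model_name: google/siglip2-base-patch16-256
per_view_output_dim: 96
dataset_image_resize_shape: null      # keep raw 256-px training frames
eval_image_resize_shape: [256, 256]   # resize rollout observations to 256x256
```

Experiment B then only needs a `backbone.per_view_output_dim=192` override rather than a second config file.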

## Task 3: Verify locally and prepare remote execution

Files:

- Modify files only as needed if tests or the smoke run reveal issues.

- [ ] Step 1: Run focused tests and make them pass
  - `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] Step 2: Run a local smoke instantiation
  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] Step 3: Review diffs for unintended LEWM/raw-256 regressions
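For the offline-safe monkeypatching in Step 2, one pattern is to replace the pretrained loader before instantiation runs. The sketch below demonstrates it against a hypothetical stand-in class rather than the real `transformers` class, so it runs without network access or the repo; in practice the patch target would be the actual class the backbone calls `from_pretrained` on.

```python
from unittest import mock
import torch

class FakeVisionTower(torch.nn.Module):
    """Tiny offline stub returning SigLIP2-shaped pooled features."""
    def forward(self, x):
        return torch.zeros(x.shape[0], 768)

class VisionTower(torch.nn.Module):
    """Stand-in for the real pretrained class (hypothetical)."""
    @classmethod
    def from_pretrained(cls, name):
        raise RuntimeError("would require network access")

# Smoke pattern: swap the downloader out before config instantiation.
with mock.patch.object(VisionTower, "from_pretrained",
                       return_value=FakeVisionTower()):
    tower = VisionTower.from_pretrained("google/siglip2-base-patch16-256")
    pooled = tower(torch.rand(2, 3, 256, 256))
assert pooled.shape == (2, 768)
```

The same patch works inside a pytest `monkeypatch` fixture, keeping the smoke test green on machines without the model cache.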

## Task 4: Sync to 5880 and launch experiments

Files:

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] Step 1: Stop superseded remote jobs
- [ ] Step 2: Sync updated code to the remote host
  - Prefer `rsync` or `git push`/`git pull`; do not overwrite unrelated files.
- [ ] Step 3: Run a remote smoke test
  - Confirm the SigLIP2 model download and import work under `/home/droid/miniforge3/envs/roboimi/bin/python`.
  - Confirm the headless rollout path still uses the 256x256 eval resize.
- [ ] Step 4: Launch experiment A
  - `per_view_output_dim=96`, embed=384, layers=12, pred=16, exec=8, steps=50000.
- [ ] Step 5: Launch experiment B
  - `per_view_output_dim=192`, all other hyperparameters unchanged.
- [ ] Step 6: Record PIDs, GPU assignments, log paths, and SwanLab run URLs.
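A dry-run sketch of the experiment A launch from Step 4. The training entry point, override names, and log path are assumptions and must be adapted to the repo's real Hydra CLI; the command is echoed for review first, then wrapped in `nohup` for the actual launch.

```shell
# Hypothetical entry point and overrides -- verify against the repo before use.
REMOTE_PY=/home/droid/miniforge3/envs/roboimi/bin/python
CMD="$REMOTE_PY -m roboimi.vla.train agent=siglip2_imf_attnres \
  agent.backbone.per_view_output_dim=96 train.num_steps=50000"
echo "$CMD"   # preview only; launch with: nohup $CMD > logs/expA.log 2>&1 &
```

Experiment B reuses the same command with `per_view_output_dim=192`; capture the PID printed by the shell and the SwanLab URL from the log head for Step 6.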