# SigLIP2 Multiview VLA Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.

**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate the visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.

**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.
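As a quick sanity check on the conditioning arithmetic implied above (the `obs_dim` value below is only a placeholder; the real value comes from the dataset):

```python
# Per-step condition width before the optional projection to the 384-d head.
n_views = 3
obs_dim = 14                                 # placeholder robot-state dimension
for per_view in (96, 192):                   # the two experiments in this plan
    raw_cond = n_views * per_view + obs_dim  # concatenated visual + state features
    print(per_view, raw_cond)                # 96 -> 302, 192 -> 590; both -> 384
```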
## Task 1: Add failing tests for SigLIP2 backbone and projected conditioning
**Files:**
- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] Step 1: Write failing backbone tests (see the test sketch after this task)
  - Instantiate the new backbone with a stub SigLIP2 vision model.
  - Assert raw dataset resize is `None`, eval resize is `(256, 256)`, and output shape is `(B, T, 3 * per_view_output_dim)`.
  - Assert the three views are encoded independently and projected.
- [ ] Step 2: Run focused tests and verify RED
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
  - Expect failure because the backbone, config, and projector do not exist yet.
- [ ] Step 3: Extend agent wiring tests
  - Add a Hydra instantiate test for the new SigLIP2 IMF config.
  - Assert the raw condition dim is `3 * per_view_output_dim + obs_dim`, the projected cond dim is `384`, and the head's `cond_dim == 384`.
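A minimal sketch of the RED tests. The class name `Siglip2DiffusionBackbone`, its constructor arguments, the `siglip2_agent` fixture, and the `obs_dim` attribute are placeholder assumptions; the asserted attributes and shapes come from the steps above.

```python
import types

import torch

from roboimi.vla.models.backbones.siglip2_diffusion_backbone import (
    Siglip2DiffusionBackbone,
)


class StubSiglipVisionModel(torch.nn.Module):
    """Stands in for SiglipVisionModel: returns a 768-d pooled feature per image."""

    def forward(self, pixel_values):
        pooled = torch.zeros(pixel_values.shape[0], 768)
        return types.SimpleNamespace(pooler_output=pooled)


def test_backbone_resizes_and_output_shape():
    backbone = Siglip2DiffusionBackbone(
        per_view_output_dim=96, vision_model=StubSiglipVisionModel()
    )
    assert backbone.dataset_image_resize_shape is None   # raw-256 handling
    assert backbone.eval_image_resize_shape == (256, 256)
    images = torch.rand(2, 4, 3, 3, 256, 256)            # (B, T, views, C, H, W)
    feats = backbone(images)
    assert feats.shape == (2, 4, 3 * 96)                 # (B, T, 3 * per_view_output_dim)


def test_agent_cond_dims(siglip2_agent):                 # hypothetical Hydra-built fixture
    assert siglip2_agent.raw_per_step_cond_dim == 3 * 96 + siglip2_agent.obs_dim
    assert siglip2_agent.per_step_cond_dim == 384
    assert siglip2_agent.head.cond_dim == 384
```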
## Task 2: Implement SigLIP2 backbone and optional condition projector
**Files:**
- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`
- [ ] Step 1: Implement the backbone (first sketch after this task)
  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
  - Normalize `[0, 1]` pixels with mean/std `0.5` and encode each view independently.
  - Project each 768-d pooled feature to the configurable per-view dim and concatenate across cameras.
- [ ] Step 2: Implement the optional condition projector (second sketch after this task)
  - Allow `VLAAgent` to accept a `cond_projector`.
  - Track `raw_per_step_cond_dim` and the projected `per_step_cond_dim`/`global_cond_dim`.
  - Apply the projector in `_build_cond()` after the visual+state concatenation.
- [ ] Step 3: Add Hydra configs (third sketch after this task)
  - The new agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, and `head.cond_dim=384`.
  - The backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
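First sketch: the backbone (Task 2, Step 1). The checkpoint id, the 0.5 normalization, the 768-d pooled width, the per-view projection, and the resize attributes come from this plan; the class name, constructor signature, and the choice of one `nn.Linear` per view are assumptions.

```python
import torch
from torch import nn
from transformers import SiglipVisionModel


class Siglip2DiffusionBackbone(nn.Module):
    def __init__(self, per_view_output_dim, n_views=3,
                 dataset_image_resize_shape=None,
                 eval_image_resize_shape=(256, 256),
                 vision_model=None):
        super().__init__()
        self.vision_model = vision_model or SiglipVisionModel.from_pretrained(
            "google/siglip2-base-patch16-256"
        )
        self.vision_model.requires_grad_(False)        # frozen, shared across views
        self.per_view_output_dim = per_view_output_dim
        self.projections = nn.ModuleList(
            nn.Linear(768, per_view_output_dim) for _ in range(n_views)
        )
        self.dataset_image_resize_shape = dataset_image_resize_shape  # None keeps raw 256
        self.eval_image_resize_shape = tuple(eval_image_resize_shape)

    def forward(self, images):
        # images: (B, T, n_views, C, H, W) with pixel values in [0, 1]
        b, t, v, c, h, w = images.shape
        pixels = (images - 0.5) / 0.5                  # mean/std 0.5 normalization
        feats = []
        for i, proj in enumerate(self.projections):    # each view encoded independently
            flat = pixels[:, :, i].reshape(b * t, c, h, w)
            pooled = self.vision_model(pixel_values=flat).pooler_output  # (B*T, 768)
            feats.append(proj(pooled))                 # (B*T, per_view_output_dim)
        return torch.cat(feats, dim=-1).reshape(b, t, -1)  # (B, T, 3 * per_view_output_dim)
```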
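Second sketch: the agent-side wiring (Task 2, Step 2). The names `cond_projector`, `raw_per_step_cond_dim`, `per_step_cond_dim`, `global_cond_dim`, and `_build_cond()` come from the steps above; everything else is abridged and illustrative, including the assumption that the global condition flattens per-step conditions over observation steps.

```python
import torch
from torch import nn


class VLAAgent(nn.Module):                         # heavily abridged
    def __init__(self, backbone, head, obs_dim, n_obs_steps, cond_projector=None):
        super().__init__()
        self.backbone, self.head = backbone, head
        self.cond_projector = cond_projector
        visual_dim = 3 * backbone.per_view_output_dim
        self.raw_per_step_cond_dim = visual_dim + obs_dim
        # The projector output must match the head's 384-d interface.
        self.per_step_cond_dim = (
            self.raw_per_step_cond_dim if cond_projector is None else head.cond_dim
        )
        self.global_cond_dim = self.per_step_cond_dim * n_obs_steps

    def _build_cond(self, images, state):
        visual = self.backbone(images)             # (B, T, 3 * per_view_output_dim)
        cond = torch.cat([visual, state], dim=-1)  # width = raw_per_step_cond_dim
        if self.cond_projector is not None:
            cond = self.cond_projector(cond)       # width = per_step_cond_dim (384)
        return cond
```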
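Third sketch: what the new configs pin down (Task 2, Step 3), expressed with OmegaConf rather than the YAML files themselves. The keys mirror the plan; the `_target_` paths and the `LazyLinear` stand-in for `linear_condition_projector.yaml` are assumptions.

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "backbone": {
        "_target_": ("roboimi.vla.models.backbones."
                     "siglip2_diffusion_backbone.Siglip2DiffusionBackbone"),
        "per_view_output_dim": 96,            # 192 for experiment B
        "dataset_image_resize_shape": None,   # raw-256: no dataset-side resize
        "eval_image_resize_shape": [256, 256],
    },
    "cond_projector": {
        "_target_": "torch.nn.LazyLinear",    # placeholder for the real projector
        "out_features": 384,                  # the head's cond_dim interface
    },
    "n_emb": 384,
    "n_layer": 12,
    "pred_horizon": 16,
    "num_action_steps": 8,
})
backbone = instantiate(cfg.backbone)          # downloads SigLIP2 weights if online
projector = instantiate(cfg.cond_projector)
```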
## Task 3: Verify locally and prepare remote execution
**Files:**
- Modify as needed, only if the tests or smoke run reveal issues.
- [ ] Step 1: Run the focused tests and make them pass
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] Step 2: Run a local smoke instantiation (see the sketch after this task)
  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] Step 3: Review diffs for unintended LEWM/raw-256 regressions.
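A sketch of the offline-safe smoke instantiation (Task 3, Step 2), assuming the config path and name from Task 2 and that the agent config carries a top-level `_target_`. The patch keeps `from_pretrained` off the network; an `Identity` stub suffices because instantiation never runs a forward pass.

```python
from unittest import mock

import torch
from hydra import compose, initialize
from hydra.utils import instantiate

with initialize(version_base=None, config_path="roboimi/vla/conf"):
    cfg = compose(config_name="agent/siglip2_imf_attnres")

with mock.patch(
    "transformers.SiglipVisionModel.from_pretrained",
    return_value=torch.nn.Identity(),
):
    agent = instantiate(cfg)

print(type(agent).__name__, "instantiated OK")
```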
## Task 4: Sync to 5880 and launch experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] Step 1: Stop superseded remote jobs.
- [ ] Step 2: Sync the updated code to the remote host
  - Prefer `rsync` or `git push`/`git pull`; do not overwrite unrelated files.
- [ ] Step 3: Run a remote smoke test (see the import check after this task)
  - Confirm the SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
  - Confirm the headless rollout path still uses the `256x256` eval resize.
- [ ] Step 4: Launch experiment A with `per_view_output_dim=96`, `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `steps=50000`.
- [ ] Step 5: Launch experiment B with `per_view_output_dim=192` and otherwise identical hyperparameters.
- [ ] Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.
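A minimal import check for the remote smoke test (Task 4, Step 3), run with `/home/droid/miniforge3/envs/roboimi/bin/python`; it verifies that the SigLIP2 weights download and load on the 5880 host.

```python
from transformers import SiglipVisionModel

model = SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")
print(model.config.hidden_size)  # expect 768, the pooled feature width
print(model.config.image_size)   # expect 256, matching the 256x256 eval resize
```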