roboimi/docs/superpowers/plans/2026-04-06-siglip2-multiview-vla.md

# SigLIP2 Multiview VLA Implementation Plan

**For agentic workers:** REQUIRED SUB-SKILL: Use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task by task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.

**Architecture:** A new backbone independently encodes each camera view with SigLIP2 and projects each 768-d pooled feature down to a configurable per-view dimension. `VLAAgent` concatenates the visual features with the robot state, then optionally projects the combined per-step condition to the head's required 384-d interface before diffusion training/inference.
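The wiring above can be sketched in PyTorch. This is a minimal sketch, not the repo's actual API: it assumes an injected vision module that returns the pooled `(B, 768)` feature directly (with the real `transformers` model, the pooled vector sits on the output object's `pooler_output` attribute), and all class names, the `obs_dim` value, and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class StubSigLIP2(nn.Module):
    """Offline stand-in for the pretrained vision tower (hypothetical)."""
    def forward(self, pixel_values):
        return pixel_values.new_zeros(pixel_values.shape[0], 768)

class SigLIP2DiffusionBackbone(nn.Module):
    """Sketch: a frozen shared SigLIP2 encoder applied per view,
    followed by a trainable per-view linear projection."""
    def __init__(self, vision_model, feat_dim=768, per_view_output_dim=96, n_views=3):
        super().__init__()
        self.vision_model = vision_model.eval()
        for p in self.vision_model.parameters():
            p.requires_grad_(False)  # frozen shared encoder
        self.proj = nn.ModuleList(
            nn.Linear(feat_dim, per_view_output_dim) for _ in range(n_views)
        )

    def forward(self, views):
        # views: list of n_views tensors, each (B, T, 3, 256, 256) in [0, 1]
        feats = []
        for view, proj in zip(views, self.proj):
            b, t = view.shape[:2]
            x = (view.flatten(0, 1) - 0.5) / 0.5  # mean/std 0.5 normalization
            pooled = self.vision_model(x)         # (B*T, 768) pooled feature
            feats.append(proj(pooled).view(b, t, -1))
        return torch.cat(feats, dim=-1)           # (B, T, n_views * per_view_output_dim)

# Condition assembly roughly as the agent would do it (obs_dim is a placeholder):
obs_dim = 14
backbone = SigLIP2DiffusionBackbone(StubSigLIP2())
cond_projector = nn.Linear(3 * 96 + obs_dim, 384)  # optional 384-d head interface
views = [torch.rand(2, 4, 3, 256, 256) for _ in range(3)]
visual = backbone(views)                           # (2, 4, 288)
cond = cond_projector(torch.cat([visual, torch.rand(2, 4, obs_dim)], dim=-1))
```

With `per_view_output_dim=96` this yields a raw per-step condition of `3 * 96 + obs_dim` and a projected condition of 384, matching the head interface described above.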

**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.


## Task 1: Add failing tests for SigLIP2 backbone and projected conditioning

Files:

- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] Step 1: Write failing backbone tests
  - Instantiate the new backbone with a stub SigLIP2 vision model.
  - Assert raw dataset resize is `None`, eval resize is `(256, 256)`, and output shape is `(B, T, 3 * per_view_output_dim)`.
  - Assert the three views are encoded independently and projected.
- [ ] Step 2: Run focused tests and verify RED
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
  - Expect failure because the backbone/config/projector do not exist yet.
- [ ] Step 3: Extend agent wiring tests
  - Add a Hydra/instantiate test for the new SigLIP2 IMF config.
  - Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim 384, and `head.cond_dim == 384`.

## Task 2: Implement SigLIP2 backbone and optional condition projector

Files:

- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`

- [ ] Step 1: Implement the backbone
  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
  - Normalize `[0, 1]` pixels with mean/std 0.5 and encode each view independently.
  - Project each 768-d pooled feature to the configurable per-view dim and concatenate across cameras.
- [ ] Step 2: Implement the optional condition projector
  - Allow `VLAAgent` to accept a `cond_projector`.
  - Track `raw_per_step_cond_dim` and the projected `per_step_cond_dim` / `global_cond_dim`.
  - Apply the projector in `_build_cond()` after the visual+state concatenation.
- [ ] Step 3: Add Hydra configs
  - The new agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
  - The backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
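The backbone config of Step 3 could look like the sketch below. Only the two resize keys are specified by this plan; the `_target_` path and the remaining key names are assumptions that must match whatever Task 2's implementation actually exposes.

```yaml
# roboimi/vla/conf/backbone/siglip2_diffusion.yaml (illustrative sketch)
_target_: roboimi.vla.models.backbones.siglip2_diffusion_backbone.SigLIP2DiffusionBackbone
model_name: google/siglip2-base-patch16-256
per_view_output_dim: 96
dataset_image_resize_shape: null      # keep raw 256-px training frames
eval_image_resize_shape: [256, 256]   # resize rollout observations to 256x256
```

Experiment B then only needs a `backbone.per_view_output_dim=192` override rather than a second config file.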

## Task 3: Verify locally and prepare remote execution

Files:

- Modify files only as needed if tests or the smoke run reveal issues.

- [ ] Step 1: Run focused tests and make them pass
  - `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] Step 2: Run a local smoke instantiation
  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] Step 3: Review diffs for unintended LEWM/raw-256 regressions
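For the offline-safe monkeypatching in Step 2, one pattern is to replace the pretrained loader before instantiation runs. The sketch below demonstrates it against a hypothetical stand-in class rather than the real `transformers` class, so it runs without network access or the repo; in practice the patch target would be the actual class the backbone calls `from_pretrained` on.

```python
from unittest import mock
import torch

class FakeVisionTower(torch.nn.Module):
    """Tiny offline stub returning SigLIP2-shaped pooled features."""
    def forward(self, x):
        return torch.zeros(x.shape[0], 768)

class VisionTower(torch.nn.Module):
    """Stand-in for the real pretrained class (hypothetical)."""
    @classmethod
    def from_pretrained(cls, name):
        raise RuntimeError("would require network access")

# Smoke pattern: swap the downloader out before config instantiation.
with mock.patch.object(VisionTower, "from_pretrained",
                       return_value=FakeVisionTower()):
    tower = VisionTower.from_pretrained("google/siglip2-base-patch16-256")
    pooled = tower(torch.rand(2, 3, 256, 256))
assert pooled.shape == (2, 768)
```

The same patch works inside a pytest `monkeypatch` fixture, keeping the smoke test green on machines without the model cache.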

## Task 4: Sync to 5880 and launch experiments

Files:

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] Step 1: Stop superseded remote jobs
- [ ] Step 2: Sync updated code to the remote host
  - Prefer `rsync` or `git push`/`git pull`; do not overwrite unrelated files.
- [ ] Step 3: Run a remote smoke test
  - Confirm the SigLIP2 model download and import work under `/home/droid/miniforge3/envs/roboimi/bin/python`.
  - Confirm the headless rollout path still uses the 256x256 eval resize.
- [ ] Step 4: Launch experiment A
  - `per_view_output_dim=96`, embed=384, layers=12, pred=16, exec=8, steps=50000.
- [ ] Step 5: Launch experiment B
  - `per_view_output_dim=192`, all other hyperparameters unchanged.
- [ ] Step 6: Record PIDs, GPU assignments, log paths, and SwanLab run URLs.
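A dry-run sketch of the experiment A launch from Step 4. The training entry point, override names, and log path are assumptions and must be adapted to the repo's real Hydra CLI; the command is echoed for review first, then wrapped in `nohup` for the actual launch.

```shell
# Hypothetical entry point and overrides -- verify against the repo before use.
REMOTE_PY=/home/droid/miniforge3/envs/roboimi/bin/python
CMD="$REMOTE_PY -m roboimi.vla.train agent=siglip2_imf_attnres \
  agent.backbone.per_view_output_dim=96 train.num_steps=50000"
echo "$CMD"   # preview only; launch with: nohup $CMD > logs/expA.log 2>&1 &
```

Experiment B reuses the same command with `per_view_output_dim=192`; capture the PID printed by the shell and the SwanLab URL from the log head for Step 6.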