roboimi/docs/superpowers/specs/2026-04-06-siglip2-multiview-vla-design.md

SigLIP2 Multiview VLA Design

Status: user-specified architecture, treated as approved on 2026-04-06

Goal

Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.

Approved architecture

  • Backbone model: google/siglip2-base-patch16-256
  • Camera inputs: three views, encoded independently with a shared SigLIP2 vision encoder
  • Input size:
    • dataset images stay at native 256x256 (no dataset-side resize)
    • eval/rollout images resize to 256x256 before SigLIP2 because env renders are larger
  • Per-view feature: use the global pooled image feature (pooler_output, 768-d)
  • Per-view projection experiments:
    1. 768 -> 96
    2. 768 -> 192
  • Conditioning pipeline:
    1. concatenate 3 projected camera vectors
    2. concatenate robot state
    3. project concatenated condition to 384
    4. feed that 384-d per-step condition into the existing IMF/AttnRes diffusion head
  • Training/run defaults for requested experiments:
    • n_emb=384
    • n_layer=12
    • pred_horizon=16
    • num_action_steps=8
    • rollout count for validation: keep this branch's current behavior unless explicitly overridden later
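The conditioning pipeline above can be sketched in PyTorch as follows. Class and parameter names are illustrative, not the actual roboimi modules, and state_dim=14 is an assumed placeholder for the robot state width:

```python
import torch
import torch.nn as nn

class MultiviewCondSketch(nn.Module):
    """Illustrative sketch: 3 SigLIP2 pooled features -> 384-d condition."""

    def __init__(self, n_views=3, feat_dim=768, view_dim=96,
                 state_dim=14, n_emb=384):
        super().__init__()
        # one projector per camera view (experiment 1: 768 -> 96)
        self.view_proj = nn.ModuleList(
            nn.Linear(feat_dim, view_dim) for _ in range(n_views)
        )
        # post-concat projector: (n_views * view_dim + state_dim) -> n_emb
        self.cond_proj = nn.Linear(n_views * view_dim + state_dim, n_emb)

    def forward(self, view_feats, state):
        # view_feats: list of (B, 768) pooler_output tensors, one per camera
        projected = [p(f) for p, f in zip(self.view_proj, view_feats)]
        # steps 1-2: concat projected camera vectors, then robot state
        cond = torch.cat(projected + [state], dim=-1)
        # steps 3-4: project to the 384-d per-step condition for the head
        return self.cond_proj(cond)
```

Swapping view_dim to 192 covers the second projection experiment without touching the rest of the pipeline.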

Design decisions

  • The condition projector lives in VLAAgent._build_cond() so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
  • The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
  • The backbone exposes dataset_image_resize_shape=None and eval_image_resize_shape=(256, 256) so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
  • A single shared vision encoder is used across all cameras. This keeps memory and download size reasonable while still honoring the user's request for per-view independent encoding rather than a fused multiview image.
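A minimal sketch of the freeze-by-default and shared-encoder decisions. A stand-in nn.Linear is used in place of the pretrained SigLIP2 vision tower so the snippet stays self-contained; the real code would load google/siglip2-base-patch16-256 via transformers instead:

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a backbone in place: no grads flow into its parameters."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()  # eval() also fixes norm/dropout behavior

# stand-in for the shared, frozen SigLIP2 vision encoder
encoder = freeze(nn.Linear(768, 768))

# one encoder instance encodes each camera view independently
views = [torch.randn(2, 768) for _ in range(3)]
with torch.no_grad():
    feats = [encoder(v) for v in views]  # 3 x (B, 768) per-view features
```

Only the per-view projectors and downstream policy layers would be handed to the optimizer; the frozen encoder contributes no trainable parameters.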

Files expected to change

  • roboimi/vla/models/backbones/ for the new SigLIP2 backbone
  • roboimi/vla/agent.py for optional post-concat condition projection
  • Hydra configs under roboimi/vla/conf/{agent,backbone,modules}
  • tests for backbone wiring and agent conditioning dims
  • remote launch commands/scripts only as needed for training
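One of the conditioning-dims tests might look like the following pytest-style sketch. The function name and the state dimension (14) are hypothetical, not the repo's actual tests:

```python
import torch

def test_agent_cond_dims():
    # 3 projected views (96-d each) plus robot state must map to n_emb=384
    n_views, view_dim, state_dim, n_emb = 3, 96, 14, 384
    cond_proj = torch.nn.Linear(n_views * view_dim + state_dim, n_emb)
    cond_in = torch.randn(2, n_views * view_dim + state_dim)
    assert cond_proj(cond_in).shape == (2, n_emb)

test_agent_cond_dims()
```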