roboimi/docs/superpowers/specs/2026-04-06-siglip2-multiview-vla-design.md

SigLIP2 Multiview VLA Design

Status: user-specified architecture, treated as approved on 2026-04-06

Goal

Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.

Approved architecture

  • Backbone model: google/siglip2-base-patch16-256
  • Camera inputs: three views, encoded independently with a shared SigLIP2 vision encoder
  • Input size:
    • dataset images stay at native 256x256 (no dataset-side resize)
    • eval/rollout images resize to 256x256 before SigLIP2 because env renders are larger
  • Per-view feature: use the global pooled image feature (pooler_output, 768-d)
  • Per-view projection experiments:
    1. 768 -> 96
    2. 768 -> 192
  • Conditioning pipeline:
    1. concatenate 3 projected camera vectors
    2. concatenate robot state
    3. project concatenated condition to 384
    4. feed that 384-d per-step condition into the existing IMF/AttnRes diffusion head
  • Training/run defaults for requested experiments:
    • n_emb=384
    • n_layer=12
    • pred_horizon=16
    • num_action_steps=8
    • rollout count for validation: keep this branch's current behavior unless explicitly overridden later
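The conditioning pipeline above can be sketched in PyTorch as follows. Class and parameter names are illustrative, not the actual roboimi modules, and state_dim=14 is an assumed placeholder for the robot state width:

```python
import torch
import torch.nn as nn

class MultiviewCondSketch(nn.Module):
    """Illustrative sketch: 3 SigLIP2 pooled features -> 384-d condition."""

    def __init__(self, n_views=3, feat_dim=768, view_dim=96,
                 state_dim=14, n_emb=384):
        super().__init__()
        # one projector per camera view (experiment 1: 768 -> 96)
        self.view_proj = nn.ModuleList(
            nn.Linear(feat_dim, view_dim) for _ in range(n_views)
        )
        # post-concat projector: (n_views * view_dim + state_dim) -> n_emb
        self.cond_proj = nn.Linear(n_views * view_dim + state_dim, n_emb)

    def forward(self, view_feats, state):
        # view_feats: list of (B, 768) pooler_output tensors, one per camera
        projected = [p(f) for p, f in zip(self.view_proj, view_feats)]
        # steps 1-2: concat projected camera vectors, then robot state
        cond = torch.cat(projected + [state], dim=-1)
        # steps 3-4: project to the 384-d per-step condition for the head
        return self.cond_proj(cond)
```

Swapping view_dim to 192 covers the second projection experiment without touching the rest of the pipeline.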

Design decisions

  • The condition projector lives in VLAAgent._build_cond() so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
  • The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
  • The backbone exposes dataset_image_resize_shape=None and eval_image_resize_shape=(256, 256) so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
  • A single shared vision encoder is used across all cameras. This keeps memory and download size reasonable while still honoring the user's request for per-view independent encoding rather than a fused multiview image.
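A minimal sketch of the freeze-by-default and shared-encoder decisions. A stand-in nn.Linear is used in place of the pretrained SigLIP2 vision tower so the snippet stays self-contained; the real code would load google/siglip2-base-patch16-256 via transformers instead:

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a backbone in place: no grads flow into its parameters."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()  # eval() also fixes norm/dropout behavior

# stand-in for the shared, frozen SigLIP2 vision encoder
encoder = freeze(nn.Linear(768, 768))

# one encoder instance encodes each camera view independently
views = [torch.randn(2, 768) for _ in range(3)]
with torch.no_grad():
    feats = [encoder(v) for v in views]  # 3 x (B, 768) per-view features
```

Only the per-view projectors and downstream policy layers would be handed to the optimizer; the frozen encoder contributes no trainable parameters.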

Files expected to change

  • roboimi/vla/models/backbones/ for the new SigLIP2 backbone
  • roboimi/vla/agent.py for optional post-concat condition projection
  • Hydra configs under roboimi/vla/conf/{agent,backbone,modules}
  • tests for backbone wiring and agent conditioning dims
  • remote launch commands/scripts only as needed for training
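One of the conditioning-dims tests might look like the following pytest-style sketch. The function name and the state dimension (14) are hypothetical, not the repo's actual tests:

```python
import torch

def test_agent_cond_dims():
    # 3 projected views (96-d each) plus robot state must map to n_emb=384
    n_views, view_dim, state_dim, n_emb = 3, 96, 14, 384
    cond_proj = torch.nn.Linear(n_views * view_dim + state_dim, n_emb)
    cond_in = torch.randn(2, n_views * view_dim + state_dim)
    assert cond_proj(cond_in).shape == (2, n_emb)

test_agent_cond_dims()
```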