# SigLIP2 Multiview VLA Design

**Status:** user-specified architecture, treated as approved on 2026-04-06

## Goal

Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.

## Approved architecture

- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
  - dataset images stay at native `256x256` (no dataset-side resize)
  - eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
  1. `768 -> 96`
  2. `768 -> 192`
- Conditioning pipeline (see the sketches at the end of this doc):
  1. concatenate the 3 projected camera vectors
  2. concatenate the robot state
  3. project the concatenated condition to `384`
  4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for the requested experiments:
  - `n_emb=384`
  - `n_layer=12`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - rollout count for validation: keep the currently requested behavior on this branch unless explicitly overridden later

## Design decisions

- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added on this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable, and to match the user's request for per-view independent encoding rather than a fused multiview image.

## Files expected to change

- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training
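
## Reference sketches

A minimal sketch of the backbone half of the pipeline, assuming the fixed-resolution SigLIP2 checkpoint loads through transformers' `AutoModel` with a standard `vision_model` tower that accepts `(B, 3, 256, 256)` pixel values (exact class and signature details may differ by transformers version). The module name `MultiviewSiglip2Backbone` and its internals are hypothetical, not the actual implementation in `roboimi/vla/models/backbones/`:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiviewSiglip2Backbone(nn.Module):
    """Hypothetical sketch: one shared, frozen SigLIP2 vision tower with per-view projectors."""

    def __init__(self, model_name: str = "google/siglip2-base-patch16-256",
                 proj_dim: int = 96, n_views: int = 3):
        super().__init__()
        # Shared vision tower; the text tower is unused and discarded.
        self.encoder = AutoModel.from_pretrained(model_name).vision_model
        self.encoder.requires_grad_(False)  # frozen by default; only projectors train
        # One trainable projector per view: 768 -> 96 (or 192 in the second experiment).
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        self.projectors = nn.ModuleList(nn.Linear(hidden, proj_dim) for _ in range(n_views))
        # Resize contract consumed by the existing train/eval plumbing.
        self.dataset_image_resize_shape = None      # dataset images stay native 256x256
        self.eval_image_resize_shape = (256, 256)   # env renders are larger; resize at rollout

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: n_views tensors of shape (B, 3, 256, 256), preprocessed for SigLIP2
        feats = []
        for proj, pixels in zip(self.projectors, views):
            with torch.no_grad():  # encoder is frozen; skip activation grads
                pooled = self.encoder(pixel_values=pixels).pooler_output  # (B, 768)
            feats.append(proj(pooled))  # (B, proj_dim)
        return torch.cat(feats, dim=-1)  # (B, n_views * proj_dim)
```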
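
The agent-side conditioning step could then look like the sketch below. Only the contract (concatenate projected views and robot state, project to a 384-d per-step condition) comes from this doc; the class name `CondProjector` and the `state_dim` parameter are illustrative stand-ins for whatever `VLAAgent._build_cond()` actually wires up:

```python
import torch
import torch.nn as nn


class CondProjector(nn.Module):
    """Hypothetical sketch of the post-concat condition projection owned by the agent."""

    def __init__(self, per_view_dim: int, state_dim: int, n_emb: int = 384, n_views: int = 3):
        super().__init__()
        # e.g. 3 * 96 + state_dim -> 384, or 3 * 192 + state_dim -> 384
        self.proj = nn.Linear(n_views * per_view_dim + state_dim, n_emb)

    def forward(self, view_feats: torch.Tensor, robot_state: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, n_views * per_view_dim) from the backbone
        # robot_state: (B, state_dim)
        cond = torch.cat([view_feats, robot_state], dim=-1)
        return self.proj(cond)  # (B, 384) per-step condition for the IMF/AttnRes head
```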
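
A quick shape check tying the two sketches together under the `768 -> 96` experiment; the batch size and `state_dim=14` are arbitrary assumptions, and random tensors stand in for properly preprocessed images:

```python
backbone = MultiviewSiglip2Backbone(proj_dim=96)
cond_proj = CondProjector(per_view_dim=96, state_dim=14)

views = [torch.randn(2, 3, 256, 256) for _ in range(3)]  # three camera views
feats = backbone(views)                                  # (2, 288)
cond = cond_proj(feats, torch.randn(2, 14))              # (2, 384)
assert cond.shape == (2, 384)                            # matches n_emb=384 for the diffusion head
```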