# SigLIP2 Multiview VLA Design

**Status:** user-specified architecture, treated as approved on 2026-04-06

## Goal

Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.

## Approved architecture

- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
  - dataset images stay at native `256x256` (no dataset-side resize)
  - eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
  1. `768 -> 96`
  2. `768 -> 192`
- Conditioning pipeline (see the sketches at the end of this doc):
  1. concatenate the 3 projected camera vectors
  2. concatenate the robot state
  3. project the concatenated condition to `384`
  4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for the requested experiments:
  - `n_emb=384`
  - `n_layer=12`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - rollout count for validation: keep the currently requested behavior on this branch unless explicitly overridden later

## Design decisions

- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added on this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable, and to match the user's request for per-view independent encoding rather than a fused multiview image.

## Files expected to change

- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training
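
## Reference sketches

A minimal sketch of the backbone half of the pipeline, assuming the fixed-resolution SigLIP2 checkpoint loads through transformers' `AutoModel` with a standard `vision_model` tower that accepts `(B, 3, 256, 256)` pixel values (exact class and signature details may differ by transformers version). The module name `MultiviewSiglip2Backbone` and its internals are hypothetical, not the actual implementation in `roboimi/vla/models/backbones/`:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiviewSiglip2Backbone(nn.Module):
    """Hypothetical sketch: one shared, frozen SigLIP2 vision tower with per-view projectors."""

    def __init__(self, model_name: str = "google/siglip2-base-patch16-256",
                 proj_dim: int = 96, n_views: int = 3):
        super().__init__()
        # Shared vision tower; the text tower is unused and discarded.
        self.encoder = AutoModel.from_pretrained(model_name).vision_model
        self.encoder.requires_grad_(False)  # frozen by default; only projectors train
        # One trainable projector per view: 768 -> 96 (or 192 in the second experiment).
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        self.projectors = nn.ModuleList(nn.Linear(hidden, proj_dim) for _ in range(n_views))
        # Resize contract consumed by the existing train/eval plumbing.
        self.dataset_image_resize_shape = None      # dataset images stay native 256x256
        self.eval_image_resize_shape = (256, 256)   # env renders are larger; resize at rollout

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: n_views tensors of shape (B, 3, 256, 256), preprocessed for SigLIP2
        feats = []
        for proj, pixels in zip(self.projectors, views):
            with torch.no_grad():  # encoder is frozen; skip activation grads
                pooled = self.encoder(pixel_values=pixels).pooler_output  # (B, 768)
            feats.append(proj(pooled))  # (B, proj_dim)
        return torch.cat(feats, dim=-1)  # (B, n_views * proj_dim)
```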
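
The agent-side conditioning step could then look like the sketch below. Only the contract (concatenate projected views and robot state, project to a 384-d per-step condition) comes from this doc; the class name `CondProjector` and the `state_dim` parameter are illustrative stand-ins for whatever `VLAAgent._build_cond()` actually wires up:

```python
import torch
import torch.nn as nn


class CondProjector(nn.Module):
    """Hypothetical sketch of the post-concat condition projection owned by the agent."""

    def __init__(self, per_view_dim: int, state_dim: int, n_emb: int = 384, n_views: int = 3):
        super().__init__()
        # e.g. 3 * 96 + state_dim -> 384, or 3 * 192 + state_dim -> 384
        self.proj = nn.Linear(n_views * per_view_dim + state_dim, n_emb)

    def forward(self, view_feats: torch.Tensor, robot_state: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, n_views * per_view_dim) from the backbone
        # robot_state: (B, state_dim)
        cond = torch.cat([view_feats, robot_state], dim=-1)
        return self.proj(cond)  # (B, 384) per-step condition for the IMF/AttnRes head
```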
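
A quick shape check tying the two sketches together under the `768 -> 96` experiment; the batch size and `state_dim=14` are arbitrary assumptions, and random tensors stand in for properly preprocessed images:

```python
backbone = MultiviewSiglip2Backbone(proj_dim=96)
cond_proj = CondProjector(per_view_dim=96, state_dim=14)

views = [torch.randn(2, 3, 256, 256) for _ in range(3)]  # three camera views
feats = backbone(views)                                  # (2, 288)
cond = cond_proj(feats, torch.randn(2, 14))              # (2, 384)
assert cond.shape == (2, 384)                            # matches n_emb=384 for the diffusion head
```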