SigLIP2 Multiview VLA Design
Status: user-specified architecture, treated as approved on 2026-04-06
Goal
Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.
Approved architecture
- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded independently with a shared SigLIP2 vision encoder
- Input size:
  - dataset images stay at native `256x256` (no dataset-side resize)
  - eval/rollout images are resized to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
  - `768 -> 96`
  - `768 -> 192`
- Conditioning pipeline:
  - concatenate 3 projected camera vectors
  - concatenate robot state
  - project concatenated condition to `384`
  - feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for requested experiments:
  - `n_emb=384`
  - `n_layer=12`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - rollout count for validation: keep current requested behavior on this branch unless explicitly overridden later
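The conditioning pipeline listed above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the class name `MultiviewCondSketch`, the helper `cond_shape`, the robot-state width of 16, the `(batch, steps, views, features)` tensor layout, and the choice of one unshared projector per view are all hypothetical. Only the 768-d pooled feature, the `96`/`192` projection widths, the three views, and the `384`-d output come from the design.

```python
import torch
from torch import nn


class MultiviewCondSketch(nn.Module):
    """Illustrative stand-in for the conditioning path (hypothetical name)."""

    def __init__(self, view_proj_dim: int, state_dim: int,
                 num_views: int = 3, pooled_dim: int = 768, cond_dim: int = 384):
        super().__init__()
        # Assumption: each camera view gets its own (unshared) linear projector.
        self.view_proj = nn.ModuleList(
            [nn.Linear(pooled_dim, view_proj_dim) for _ in range(num_views)]
        )
        # Post-concat projection to the width the diffusion head expects.
        self.cond_proj = nn.Linear(num_views * view_proj_dim + state_dim, cond_dim)

    def forward(self, pooled: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # pooled: (B, T, num_views, pooled_dim); state: (B, T, state_dim)
        views = [proj(pooled[:, :, i]) for i, proj in enumerate(self.view_proj)]
        # Concatenate the projected camera vectors, then the robot state.
        cond = torch.cat(views + [state], dim=-1)
        return self.cond_proj(cond)  # (B, T, cond_dim)


def cond_shape(view_proj_dim: int, state_dim: int = 16) -> tuple:
    """Push random inputs through the sketch and return the output shape."""
    model = MultiviewCondSketch(view_proj_dim, state_dim)
    pooled = torch.randn(2, 2, 3, 768)    # batch=2, steps=2 are arbitrary
    state = torch.randn(2, 2, state_dim)  # state_dim=16 is a placeholder
    return tuple(model(pooled, state).shape)
```

Both projection experiments end at the same `384`-d condition; only the width of the concatenated vector before the final projection differs (`3 * 96 + state_dim` versus `3 * 192 + state_dim`).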
Design decisions
- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
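The resize contract described above can be sketched as below, to show how train/eval plumbing might consult the two attributes. Only the attribute names and their default values come from the design. The `BackboneResizeSpec` class, the `resize_shape_for` and `needs_resize` helpers, and the `"dataset"`/`"eval"` mode strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Shape = Tuple[int, int]  # (height, width)


@dataclass(frozen=True)
class BackboneResizeSpec:
    """Hypothetical holder for the two resize attributes the backbone exposes."""
    dataset_image_resize_shape: Optional[Shape] = None    # raw-256 path, no resize
    eval_image_resize_shape: Optional[Shape] = (256, 256)  # env renders are larger

    def resize_shape_for(self, mode: str) -> Optional[Shape]:
        # Mode names are assumptions made for this sketch.
        if mode == "dataset":
            return self.dataset_image_resize_shape
        if mode == "eval":
            return self.eval_image_resize_shape
        raise ValueError(f"unknown mode: {mode!r}")


def needs_resize(image_hw: Shape, target: Optional[Shape]) -> bool:
    """True only when a target shape is set and the image does not match it."""
    return target is not None and image_hw != target
```

With these defaults, a native `256x256` dataset frame passes through untouched, while a larger rollout render (for example a hypothetical `480x640` frame) is flagged for resizing before it reaches SigLIP2.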
Files expected to change
- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training