feat: add vision transfer backbones and IMF variants

Logic
2026-04-09 14:02:24 +08:00
parent d51b3ecafa
commit ff7c9c1f2a
58 changed files with 2788 additions and 26 deletions


@@ -0,0 +1,41 @@
# SigLIP2 Multiview VLA Design
**Status:** user-specified architecture, treated as approved on 2026-04-06
## Goal
Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.
## Approved architecture
- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
  - dataset images stay at native `256x256` (no dataset-side resize)
  - eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
  1. `768 -> 96`
  2. `768 -> 192`
- Conditioning pipeline:
  1. concatenate 3 projected camera vectors
  2. concatenate robot state
  3. project concatenated condition to `384`
  4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for requested experiments:
  - `n_emb=384`
  - `n_layer=12`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - rollout count for validation: keep current requested behavior on this branch unless explicitly overridden later
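The conditioning pipeline above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the code in this commit: the class name `CondProjectorSketch`, the use of plain `nn.Linear` layers, one separate projector per view, the observation horizon of 2, and `state_dim=16` are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class CondProjectorSketch(nn.Module):
    """Hypothetical module: project each view, concat with state, project to cond_dim."""

    def __init__(self, num_views=3, feat_dim=768, proj_dim=96, state_dim=16, cond_dim=384):
        super().__init__()
        # One projector per camera view (768 -> 96 or 768 -> 192).
        # Assumption: projectors are separate per view; the design does not say.
        self.view_proj = nn.ModuleList(
            [nn.Linear(feat_dim, proj_dim) for _ in range(num_views)]
        )
        # Final projection of [views..., state] to the 384-d per-step condition.
        self.cond_proj = nn.Linear(num_views * proj_dim + state_dim, cond_dim)

    def forward(self, view_feats, state):
        # view_feats: (B, T, V, feat_dim) pooled features; state: (B, T, state_dim)
        projected = [proj(view_feats[:, :, i]) for i, proj in enumerate(self.view_proj)]
        cond = torch.cat(projected + [state], dim=-1)  # steps 1 and 2
        return self.cond_proj(cond)                    # step 3: (B, T, cond_dim)


# Dummy inputs: batch of 2, 2 observation steps, 3 views, 768-d features.
feats = torch.randn(2, 2, 3, 768)
state = torch.randn(2, 2, 16)  # state_dim=16 is a placeholder value

cond_96 = CondProjectorSketch(proj_dim=96)(feats, state)
cond_192 = CondProjectorSketch(proj_dim=192)(feats, state)
print(cond_96.shape, cond_192.shape)
```

Both projection experiments end at the same `384`-d condition, so the diffusion head's input contract should not need to change between them; only the width of the intermediate concatenation differs.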
## Design decisions
- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
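A rough sketch of the frozen, shared, per-view encoding path described in these decisions is shown below. To keep it runnable without downloading weights, it uses a small stand-in module in place of the SigLIP2 vision tower; `StandInEncoder`, `freeze`, `maybe_resize`, `encode_views`, the bilinear resize mode, and the `480x640` render size are all hypothetical choices for illustration, not names or values from this branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StandInEncoder(nn.Module):
    """Placeholder for the SigLIP2 vision tower; returns one 768-d vector per image."""

    def __init__(self, out_dim=768):
        super().__init__()
        self.conv = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)

    def forward(self, pixels):
        # Mean over the patch grid, standing in for a pooled image feature.
        return self.conv(pixels).mean(dim=(2, 3))


def freeze(module):
    """Disable gradients and switch to eval mode (frozen-by-default backbone)."""
    for param in module.parameters():
        param.requires_grad_(False)
    return module.eval()


def maybe_resize(images, shape):
    """Resize (N, 3, H, W) images to `shape`; `None` means use them as-is."""
    if shape is None or tuple(images.shape[-2:]) == tuple(shape):
        return images
    return F.interpolate(images, size=shape, mode="bilinear", align_corners=False)


def encode_views(encoder, images, resize_shape=None):
    """Encode each view independently with one shared encoder.

    images: (B, V, 3, H, W). Views are folded into the batch axis, so no
    information is mixed across cameras inside the encoder.
    """
    b, v = images.shape[:2]
    flat = maybe_resize(images.flatten(0, 1), resize_shape)
    with torch.no_grad():
        feats = encoder(flat)
    return feats.reshape(b, v, -1)


encoder = freeze(StandInEncoder())

# Dataset path: images are already 256x256, so no resize is requested.
train_feats = encode_views(encoder, torch.rand(2, 3, 3, 256, 256), resize_shape=None)

# Eval path: larger env renders (480x640 assumed here) are resized first.
eval_feats = encode_views(encoder, torch.rand(1, 3, 3, 480, 640), resize_shape=(256, 256))
print(train_feats.shape, eval_feats.shape)
```

Folding the view axis into the batch axis is one simple way to get independent per-view encoding from a single set of weights; the actual backbone may loop over cameras instead.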
## Files expected to change
- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training
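For the conditioning-dimension tests mentioned above, one possible starting point is a plain arithmetic check on the width of the pre-projection condition vector. The helper name `cond_input_dim` and the value `STATE_DIM = 16` are placeholders for this sketch; the real robot state dimension is not given in this document.

```python
def cond_input_dim(num_views: int, proj_dim: int, state_dim: int) -> int:
    """Width of the concatenated [view_1, ..., view_V, state] vector."""
    return num_views * proj_dim + state_dim


STATE_DIM = 16  # placeholder; substitute the actual robot state dimension

# The two per-view projection experiments: 768 -> 96 and 768 -> 192.
widths = {proj_dim: cond_input_dim(3, proj_dim, STATE_DIM) for proj_dim in (96, 192)}

for proj_dim, width in widths.items():
    print(f"proj_dim={proj_dim}: condition projector input width = {width}")
```

A real test would presumably compare these values against the input width of the projection layer built by the agent.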