feat: add vision transformer backbones and IMF variants

# SigLIP2 Multiview VLA Design

**Status:** user-specified architecture, treated as approved on 2026-04-06

## Goal

Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.

## Approved architecture

- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
  - dataset images stay at native `256x256` (no dataset-side resize)
  - eval/rollout images are resized to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
  1. `768 -> 96`
  2. `768 -> 192`
- Conditioning pipeline (sketched in code after this list):
  1. concatenate the 3 projected camera vectors
  2. concatenate the robot state
  3. project the concatenated condition to `384`
  4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for the requested experiments:
  - `n_emb=384`
  - `n_layer=12`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - rollout count for validation: keep the currently requested behavior on this branch unless explicitly overridden later
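
A minimal sketch of the approved encode-project-condition pipeline, assuming `torch` and `transformers`. The class names (`SigLIP2MultiviewBackbone`, `CondProjector`), the choice of one linear projector per view, and the freezing mechanics are illustrative, not the modules actually landing in `roboimi/vla`:

```python
import torch
from torch import nn
from transformers import AutoModel


class SigLIP2MultiviewBackbone(nn.Module):
    """Encode each camera view independently with one frozen, shared SigLIP2
    vision tower, then project each pooled 768-d feature down to proj_dim."""

    def __init__(self, proj_dim: int = 96, num_views: int = 3):
        super().__init__()
        # Shared vision tower, frozen: only the projectors below train.
        self.vision = AutoModel.from_pretrained(
            "google/siglip2-base-patch16-256"
        ).vision_model
        self.vision.requires_grad_(False)
        self.vision.eval()
        # One linear projector per view: 768 -> 96 or 768 -> 192 in the experiments.
        hidden = self.vision.config.hidden_size  # 768 for the base model
        self.projectors = nn.ModuleList(
            nn.Linear(hidden, proj_dim) for _ in range(num_views)
        )

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: num_views tensors of shape (B, 3, 256, 256), preprocessed for SigLIP2.
        feats = []
        for img, proj in zip(views, self.projectors):
            with torch.no_grad():
                pooled = self.vision(pixel_values=img).pooler_output  # (B, 768)
            feats.append(proj(pooled))  # (B, proj_dim)
        return torch.cat(feats, dim=-1)  # (B, num_views * proj_dim)


class CondProjector(nn.Module):
    """Post-concat conditioning: [projected views || robot state] -> n_emb (384)."""

    def __init__(self, vis_dim: int, state_dim: int, n_emb: int = 384):
        super().__init__()
        self.proj = nn.Linear(vis_dim + state_dim, n_emb)

    def forward(self, vis: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # (B, vis_dim + state_dim) -> (B, 384), the per-step condition for the head.
        return self.proj(torch.cat([vis, state], dim=-1))
```

With `proj_dim=96` the concatenated visual vector is `3 * 96 = 288`-d, so `CondProjector` maps `288 + state_dim` inputs to the `384`-d per-step condition; with `proj_dim=192` the input side becomes `576 + state_dim`.
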
## Design decisions

- The condition projector lives in `VLAAgent._build_cond()`, so the backbone owns only visual features while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch (see the resize sketch after this list).
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
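
A sketch of how the train/eval plumbing might consume that resize contract, assuming channels-first `torch` image batches; `maybe_resize` is a hypothetical helper name, not code from this branch:

```python
import torch
import torch.nn.functional as F


def maybe_resize(img: torch.Tensor, shape: tuple[int, int] | None) -> torch.Tensor:
    """Resize a (B, C, H, W) batch to `shape`, or pass through when shape is None.

    None is the dataset path (images stay at native 256x256); eval/rollout
    passes (256, 256) because env renders are larger than the encoder input.
    """
    if shape is None or tuple(img.shape[-2:]) == shape:
        return img
    return F.interpolate(img, size=shape, mode="bilinear", align_corners=False, antialias=True)
```
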
## Files expected to change

- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for the optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims (a shape-test sketch follows this list)
- remote launch commands/scripts only as needed for training
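
A hypothetical shape test in the spirit of those wiring tests, reusing `SigLIP2MultiviewBackbone` and `CondProjector` from the sketch above; `state_dim=14` is an arbitrary placeholder, and a real test would stub the frozen tower rather than download weights in CI:

```python
import torch


def test_multiview_conditioning_dims():
    B, state_dim, proj_dim, n_emb = 2, 14, 96, 384
    backbone = SigLIP2MultiviewBackbone(proj_dim=proj_dim)   # from the sketch above
    cond = CondProjector(vis_dim=3 * proj_dim, state_dim=state_dim, n_emb=n_emb)

    views = [torch.randn(B, 3, 256, 256) for _ in range(3)]  # three camera views
    state = torch.randn(B, state_dim)

    out = cond(backbone(views), state)
    assert out.shape == (B, n_emb)  # 384-d per-step condition for the diffusion head
```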