feat: add vision transfer backbones and IMF variants
docs/superpowers/plans/2026-04-06-siglip2-multiview-vla.md
@@ -0,0 +1,78 @@
# SigLIP2 Multiview VLA Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.

**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.

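The dimension bookkeeping in the architecture note can be checked with a few lines of arithmetic. This is a minimal sketch: `obs_dim=16` is a placeholder chosen only for illustration (the plan does not state the robot state size), and `cond_dims` is a hypothetical helper, not a function from the repository.

```python
NUM_VIEWS = 3        # the plan encodes three camera views
HEAD_COND_DIM = 384  # conditioning width the head expects


def cond_dims(per_view_output_dim: int, obs_dim: int) -> tuple[int, int]:
    """Return the (raw, projected) per-step condition widths."""
    raw = NUM_VIEWS * per_view_output_dim + obs_dim
    return raw, HEAD_COND_DIM


# obs_dim=16 is an assumed placeholder, not a value taken from the plan.
for per_view in (96, 192):
    raw, projected = cond_dims(per_view, obs_dim=16)
    print(f"per_view={per_view}: raw={raw} -> projected={projected}")
```

With these placeholder numbers the raw condition is narrower than 384 in the 96-d run and wider in the 192-d run, so the same projector interface would expand in one experiment and compress in the other.
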
**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.

---

### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning

**Files:**
- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write failing backbone tests**
  - Instantiate the new backbone with a stub SigLIP2 vision model.
  - Assert that the raw dataset resize is `None`, the eval resize is `(256, 256)`, and the output shape is `(B, T, 3 * per_view_output_dim)`.
  - Assert that the three views are encoded independently and projected.
- [ ] **Step 2: Run focused tests and verify RED**
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
  - Expect failure because the backbone/config/projector do not exist yet.
- [ ] **Step 3: Extend agent wiring tests**
  - Add a Hydra/instantiate test for a new SigLIP2 IMF config.
  - Assert that the raw condition dim is `3 * per_view_output_dim + obs_dim`, the projected cond dim is `384`, and the head has `cond_dim == 384`.

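For the stub SigLIP2 vision model mentioned in Step 1, something along these lines could keep the tests offline. It is only a sketch: `StubSiglipVisionModel` and `call_batch_sizes` are hypothetical names, and the stub mimics nothing beyond the `pooler_output` field that the plan relies on.

```python
from types import SimpleNamespace

import torch
from torch import nn


class StubSiglipVisionModel(nn.Module):
    """Offline stand-in that mimics only the `pooler_output` field."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.hidden_size = hidden_size
        self.call_batch_sizes = []  # batch size seen by each forward call

    def forward(self, pixel_values: torch.Tensor):
        batch = pixel_values.shape[0]
        self.call_batch_sizes.append(batch)
        # Zeros are enough for shape and wiring assertions.
        pooled = torch.zeros(batch, self.hidden_size)
        return SimpleNamespace(pooler_output=pooled)


def test_stub_returns_768d_pooled_features():
    stub = StubSiglipVisionModel()
    out = stub(pixel_values=torch.rand(4, 3, 256, 256))
    assert out.pooler_output.shape == (4, 768)
    assert stub.call_batch_sizes == [4]
```

Recording the batch size of each call gives a backbone test a simple way to check how many encoder passes were made and with what input sizes.
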
### Task 2: Implement SigLIP2 backbone and optional condition projector

**Files:**
- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`

- [ ] **Step 1: Implement backbone**
  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
  - Normalize `[0,1]` pixels with mean/std `0.5` and encode each view independently.
  - Project each 768-d pooled feature to configurable per-view dim and concatenate across cameras.
- [ ] **Step 2: Implement optional condition projector**
  - Allow `VLAAgent` to accept `cond_projector`.
  - Track `raw_per_step_cond_dim` and projected `per_step_cond_dim` / `global_cond_dim`.
  - Apply the projector in `_build_cond()` after visual+state concatenation.
- [ ] **Step 3: Add Hydra configs**
  - New agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
  - Backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.

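The backbone and projector steps above might look roughly like the following. This is a sketch under stated assumptions, not the repository's implementation: the class and function names are hypothetical, the input layout `(B, T, V, C, H, W)` is assumed, one trainable `nn.Linear` per view is a design guess (a single shared projection would also match the plan), and `_StubEncoder` replaces the real `SiglipVisionModel` so the block runs without a download.

```python
from types import SimpleNamespace

import torch
from torch import nn


class _StubEncoder(nn.Module):
    """Stand-in for the SigLIP2 vision model; returns zero pooled features."""

    def forward(self, pixel_values):
        return SimpleNamespace(pooler_output=torch.zeros(pixel_values.shape[0], 768))


class MultiviewBackboneSketch(nn.Module):
    def __init__(self, vision_model, num_views=3, hidden_size=768,
                 per_view_output_dim=96):
        super().__init__()
        self.vision_model = vision_model.eval()
        for p in self.vision_model.parameters():
            p.requires_grad_(False)  # the encoder stays frozen
        # Assumption: one trainable projection per camera view.
        self.projections = nn.ModuleList(
            [nn.Linear(hidden_size, per_view_output_dim) for _ in range(num_views)]
        )

    def forward(self, images):
        # images: (B, T, V, C, H, W) with pixel values in [0, 1]
        b, t = images.shape[:2]
        feats = []
        for i, proj in enumerate(self.projections):
            view = images[:, :, i].flatten(0, 1)   # (B*T, C, H, W)
            view = (view - 0.5) / 0.5              # mean/std 0.5 normalization
            with torch.no_grad():
                pooled = self.vision_model(pixel_values=view).pooler_output
            feats.append(proj(pooled))             # (B*T, per_view_output_dim)
        return torch.cat(feats, dim=-1).view(b, t, -1)


def build_cond(visual, state, cond_projector=None):
    """Concatenate visual features with state, then optionally project."""
    cond = torch.cat([visual, state], dim=-1)
    return cond if cond_projector is None else cond_projector(cond)


backbone = MultiviewBackboneSketch(_StubEncoder(), per_view_output_dim=96)
visual = backbone(torch.rand(2, 4, 3, 3, 256, 256))  # (2, 4, 288)
state = torch.rand(2, 4, 16)                         # obs_dim=16 is a placeholder
cond = build_cond(visual, state, nn.Linear(288 + 16, 384))
```

Passing `cond_projector=None` leaves the raw concatenated condition unchanged, which keeps the projector optional as the plan describes.
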
### Task 3: Verify locally and prepare remote execution

**Files:**
- Modify files only as needed, if the tests or smoke run reveal issues.

- [ ] **Step 1: Run focused tests and make them pass**
  - `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 2: Run a local smoke instantiation**
  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**

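The offline-safe monkeypatching mentioned in Step 2 could follow a pattern like this one, built on the standard library's `unittest.mock`. Everything here is illustrative: `VisionModelLoader` stands in for a class with a network-dependent `from_pretrained` (in the repository the patch target would presumably be the real SigLIP2 loader), and `build_backbone` is a hypothetical factory.

```python
from types import SimpleNamespace
from unittest import mock


class VisionModelLoader:
    """Stand-in for a class whose `from_pretrained` needs network access."""

    @classmethod
    def from_pretrained(cls, name):
        raise RuntimeError(f"would download {name}")


def build_backbone(model_name):
    """Hypothetical factory of the kind a config would instantiate."""
    return SimpleNamespace(vision_model=VisionModelLoader.from_pretrained(model_name))


def smoke_instantiate_offline():
    fake_model = SimpleNamespace(name="stub-vision-model")
    # Swap the loader for the duration of the smoke test only.
    with mock.patch.object(
        VisionModelLoader, "from_pretrained", return_value=fake_model
    ) as patched:
        backbone = build_backbone("google/siglip2-base-patch16-256")
    patched.assert_called_once_with("google/siglip2-base-patch16-256")
    return backbone
```

Because the patch is scoped to the `with` block, the original loader is restored afterwards and other tests are unaffected.
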
### Task 4: Sync to 5880 and launch experiments

**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Stop superseded remote jobs**
- [ ] **Step 2: Sync updated code to remote**
  - Prefer `rsync` or `git push/pull` without overwriting unrelated files.
- [ ] **Step 3: Remote smoke test**
  - Confirm SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
  - Confirm headless rollout path still uses `256x256` eval resize.
- [ ] **Step 4: Launch experiment A**
  - `per_view_output_dim=96`, `embed=384`, `layer=12`, `pred=16`, `exec=8`, `steps=50000`.
- [ ] **Step 5: Launch experiment B**
  - `per_view_output_dim=192`, same other hyperparameters.
- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.**

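To keep experiments A and B identical apart from the per-view dimension, the override lists could be generated from one shared table. This is a sketch only: the flat key names below are hypothetical (the real Hydra override paths depend on the config tree), `max_steps` is an assumed name for the step budget, and the mapping of `embed`, `layer`, `pred`, and `exec` to `n_emb`, `n_layer`, `pred_horizon`, and `num_action_steps` is inferred from the Task 2 defaults.

```python
# Settings shared by both runs, taken from the plan's hyperparameters.
SHARED = {
    "n_emb": 384,
    "n_layer": 12,
    "pred_horizon": 16,
    "num_action_steps": 8,
    "max_steps": 50000,  # assumed key name for the 50k-step budget
}

# The only value that differs between the two experiments.
EXPERIMENTS = {"A": 96, "B": 192}


def overrides_for(per_view_output_dim):
    """Build key=value override strings for one experiment."""
    settings = {"per_view_output_dim": per_view_output_dim, **SHARED}
    return [f"{key}={value}" for key, value in settings.items()]


for name, dim in EXPERIMENTS.items():
    print(name, " ".join(overrides_for(dim)))
```

Generating both lists from `SHARED` makes it harder for the two launches to drift apart on a hyperparameter that was meant to stay fixed.
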