feat: add vision transfer backbones and IMF variants
docs/superpowers/plans/2026-04-06-resnet-multitoken-imf.md
@@ -0,0 +1,81 @@
# ResNet Multitoken IMF Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step and launch four L20 experiments for `n_emb in {256,384}` and `n_layer in {12,16}`.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.

---

### Task 1: Add failing tests for multi-token conditioning

**Files:**
- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Add a direct agent test**
  - Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add Hydra wiring test**
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
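
One possible shape for the direct agent test in Step 1 is sketched below. It is only an illustration: `StubBackbone` and `reference_build_cond` are hypothetical names, and the reference function stands in for `VLAAgent._build_cond()`, whose real signature is not given in this plan. In the actual test the stand-in would be replaced by the agent method, so the test fails (RED) until Task 2 lands. With an identity projector, each output token should be exactly one camera feature followed by the unmodified state, which distinguishes per-camera pairing from concatenating all cameras first.

```python
import torch
from torch import nn


class StubBackbone(nn.Module):
    """Hypothetical stub: camera k always emits the constant k + 1."""

    def __init__(self, num_cams: int = 3, feat_dim: int = 4):
        super().__init__()
        self.num_cams = num_cams
        self.feat_dim = feat_dim

    def forward(self, batch_size: int, obs_horizon: int) -> torch.Tensor:
        cam_ids = torch.arange(1, self.num_cams + 1, dtype=torch.float32)
        # (1, 1, K, 1) broadcast to (B, T, K, D)
        return cam_ids.view(1, 1, -1, 1).expand(
            batch_size, obs_horizon, -1, self.feat_dim
        )


def reference_build_cond(visual, state, projector):
    """Stand-in for `_build_cond()` (assumed behaviour, not the real method)."""
    k = visual.shape[2]
    state_k = state.unsqueeze(2).expand(-1, -1, k, -1)  # (B, T, K, D_state)
    paired = torch.cat([visual, state_k], dim=-1)       # (B, T, K, D + D_state)
    return projector(paired).flatten(1, 2)              # (B, T*K, D_cond)


def test_state_is_paired_with_each_camera_token():
    b, t, k, d, d_state = 2, 2, 3, 4, 5
    visual = StubBackbone(k, d)(b, t)
    state = torch.randn(b, t, d_state)

    # Identity projector keeps the paired features visible in the output.
    cond = reference_build_cond(visual, state, nn.Identity())
    assert cond.shape == (b, t * k, d + d_state)

    tokens = cond.reshape(b, t, k, d + d_state)
    for cam in range(k):
        # Each token carries exactly one camera's feature...
        assert torch.all(tokens[:, :, cam, :d] == cam + 1)
        # ...followed by the same state vector for that step.
        assert torch.equal(tokens[:, :, cam, d:], state)
```

If the cameras were concatenated before pairing, the last dimension would be `k * d + d_state` rather than `d + d_state`, and the shape assertion would fail.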

### Task 2: Implement multi-token ResNet conditioning path

**Files:**
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`

- [ ] **Step 1: Extend ResNet backbone**
  - Add an opt-in flag to return `(B,T,num_cams,D)` camera tokens instead of one concatenated `(B,T,num_cams*D)` token.
  - Keep standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
  - Support visual features with rank 4 `(B,T,K,D)`.
  - Broadcast state to `(B,T,K,D_state)`, concatenate per camera, apply projector per token, then flatten to `(B,T*K,D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add Hydra config**
  - New agent config uses:
    - separate ResNet-18 per camera
    - standard residual vision trunk (`vision_backbone_mode=resnet`)
    - condition projector output dim tied to `${agent.head.n_emb}`
    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
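
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `resnet_diffusion.py` or `agent.py`: the class name `MultiCameraBackbone`, the flag name `return_camera_tokens`, and the free function `build_cond` are hypothetical, and a tiny conv encoder stands in for each per-camera ResNet-18 so the example stays small.

```python
import torch
from torch import nn


class MultiCameraBackbone(nn.Module):
    """Stand-in backbone: one small encoder per camera (ResNet-18 in the plan)."""

    def __init__(self, num_cams: int = 3, feat_dim: int = 32,
                 return_camera_tokens: bool = False):
        super().__init__()
        # Separate weights per camera; nothing is shared across views.
        self.encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
            for _ in range(num_cams)
        ])
        # Hypothetical name for the opt-in flag described in Step 1.
        self.return_camera_tokens = return_camera_tokens

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, T, K, C, H, W)
        b, t = images.shape[:2]
        feats = [
            enc(images[:, :, i].flatten(0, 1)).reshape(b, t, -1)
            for i, enc in enumerate(self.encoders)
        ]
        if self.return_camera_tokens:
            return torch.stack(feats, dim=2)  # (B, T, K, D)
        return torch.cat(feats, dim=-1)       # (B, T, K*D)


def build_cond(visual: torch.Tensor, state: torch.Tensor,
               projector: nn.Module) -> torch.Tensor:
    # visual: (B, T, K, D), state: (B, T, D_state)
    k = visual.shape[2]
    state_k = state.unsqueeze(2).expand(-1, -1, k, -1)  # (B, T, K, D_state)
    paired = torch.cat([visual, state_k], dim=-1)       # (B, T, K, D + D_state)
    tokens = projector(paired)  # nn.Linear acts on the last dim, i.e. per token
    return tokens.flatten(1, 2)                         # (B, T*K, D_cond)


# Example shapes: B=2, T=2 (obs horizon), K=3 cameras, 16x16 RGB images.
backbone = MultiCameraBackbone(num_cams=3, feat_dim=32, return_camera_tokens=True)
images = torch.randn(2, 2, 3, 3, 16, 16)
state = torch.randn(2, 2, 7)
projector = nn.Linear(32 + 7, 64)  # output dim would track the head's n_emb

cond = build_cond(backbone(images), state, projector)
condition_tokens_per_step = 3
condition_sequence_length = 2 * condition_tokens_per_step  # obs_horizon * K
```

Flattening dimensions `(T, K)` in that order keeps the tokens step-major: the three camera tokens for step 0 come first, then those for step 1. The resulting length is the value Step 3 passes to the head as `n_obs_steps`.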

### Task 3: Verify locally

**Files:**
- Modify only if verification reveals issues

- [ ] **Step 1: Run focused tests and make them pass**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run regression subset**
  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run local smoke instantiation**
  - Instantiate the new Hydra config and verify cond shape / sequence length

### Task 4: Launch 4 L20 experiments

**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke the new config on remote**
- [ ] **Step 3: Launch runs**
  - `(n_emb=256, n_layer=12)`
  - `(n_emb=256, n_layer=16)`
  - `(n_emb=384, n_layer=12)`
  - `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
  - rollout episodes `10`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - standard ResNet-18 vision trunk
  - separate ResNet-18 weights for each of the three cameras (no sharing)
- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**
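
The four-run grid in Step 3 could be enumerated with a small helper like the one below, which builds Hydra-style override lists for each `(n_emb, n_layer)` pair. This is a sketch under assumptions: `agent.head.n_emb` follows the interpolation named in Task 2, while the `agent.head.n_layer` key and the run-name scheme are guesses. The training entry point is not named in this plan, so the sketch stops at producing overrides rather than a full command.

```python
from itertools import product

N_EMB = (256, 384)
N_LAYER = (12, 16)


def build_overrides(n_emb: int, n_layer: int) -> list[str]:
    """Hydra-style overrides for one run (key paths partly assumed)."""
    return [
        "agent=resnet_imf_attnres_multitoken",
        f"agent.head.n_emb={n_emb}",
        f"agent.head.n_layer={n_layer}",  # assumed key; check the head config
    ]


# Hypothetical run names, keyed so PIDs, GPUs, and log paths can be recorded per run.
runs = {
    f"multitoken_emb{n_emb}_layer{n_layer}": build_overrides(n_emb, n_layer)
    for n_emb, n_layer in product(N_EMB, N_LAYER)
}

for name, overrides in runs.items():
    print(name, " ".join(overrides))
```

Only the two swept values vary; the settings listed in Step 4 are meant to come from the agent config itself, so they are not overridden here.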