# ResNet Multitoken IMF Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step, and launch four L20 experiments for `n_emb in {256, 384}` and `n_layer in {12, 16}`.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.

---

### Task 1: Add failing tests for multi-token conditioning

**Files:**

- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Add a direct agent test**
  - Stub a vision backbone returning `(B, T, 3, D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add a Hydra wiring test**
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and that the head's `n_obs_steps` receives that sequence length.
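The shape contract that the Step 1 test asserts can be sketched standalone. This is a minimal reference of the expected behavior, not the repo's actual `_build_cond()` implementation; all dimension values and the `projector` layer here are illustrative:

```python
import torch

# Stand-in for the multi-token cond path: a stubbed backbone emits
# (B, T, K, D) camera tokens, and the agent is expected to produce a
# condition sequence of shape (B, T*K, D_cond).
B, T, K, D, D_state, D_cond = 2, 4, 3, 32, 7, 48

vision = torch.randn(B, T, K, D)        # stubbed per-camera tokens
state = torch.randn(B, T, D_state)      # one state vector per obs step

# Pair each camera token with that step's state (broadcast over K),
# rather than concatenating cameras first.
state_k = state.unsqueeze(2).expand(B, T, K, D_state)
paired = torch.cat([vision, state_k], dim=-1)     # (B, T, K, D + D_state)

# Project each pair into a condition token, then flatten per-step tokens.
projector = torch.nn.Linear(D + D_state, D_cond)  # illustrative projector
cond = projector(paired).reshape(B, T * K, D_cond)

assert cond.shape == (B, T * K, D_cond)           # (2, 12, 48)
```

The real test would drive `_build_cond()` through the agent's public API; this sketch only pins down the tensor shapes the assertions should check.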
- [ ] **Step 3: Run focused tests and verify RED**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`

### Task 2: Implement multi-token ResNet conditioning path

**Files:**

- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`

- [ ] **Step 1: Extend the ResNet backbone**
  - Add an opt-in flag to return `(B, T, num_cams, D)` camera tokens instead of one concatenated `(B, T, num_cams*D)` token.
  - Keep the standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
  - Support visual features of rank 4, `(B, T, K, D)`.
  - Broadcast state to `(B, T, K, D_state)`, concatenate per camera, apply the projector per token, then flatten to `(B, T*K, D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
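The opt-in backbone flag in Step 1 reduces to a small branch over the output rank. A hedged sketch follows; the flag name `return_camera_tokens` and the function boundary are assumptions about where this lands in `resnet_diffusion.py`, not the repo's actual API:

```python
import torch

def pool_cameras(per_cam_feats: torch.Tensor,
                 return_camera_tokens: bool) -> torch.Tensor:
    """Combine per-camera ResNet features.

    per_cam_feats: (B, T, K, D) — one D-dim feature per camera per step.
    Flag name `return_camera_tokens` is hypothetical.
    """
    if return_camera_tokens:
        # New multi-token path: keep one token per camera.
        return per_cam_feats                        # (B, T, K, D)
    # Legacy path: concatenate cameras into a single token per step.
    B, T, K, D = per_cam_feats.shape
    return per_cam_feats.reshape(B, T, K * D)       # (B, T, K*D)

x = torch.randn(2, 4, 3, 512)                       # 3 cameras, ResNet-18 dim
assert pool_cameras(x, True).shape == (2, 4, 3, 512)
assert pool_cameras(x, False).shape == (2, 4, 1536)
```

Keeping the legacy concatenation as the default makes the flag backward compatible with existing configs.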
- [ ] **Step 4: Add the Hydra config**
  - The new agent config uses:
    - a separate ResNet-18 per camera
    - the standard residual vision trunk (`vision_backbone_mode=resnet`)
    - a condition-projector output dim tied to `${agent.head.n_emb}`
    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`

### Task 3: Verify locally

**Files:**

- Modify only if verification reveals issues

- [ ] **Step 1: Run the focused tests and make them pass**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run the regression subset**
  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run a local smoke instantiation**
  - Instantiate the new Hydra config and verify the cond shape and sequence length.

### Task 4: Launch 4 L20 experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke-test the new config on the remote machine**
- [ ] **Step 3: Launch runs**
  - `(n_emb=256, n_layer=12)`
  - `(n_emb=256, n_layer=16)`
  - `(n_emb=384, n_layer=12)`
  - `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
  - rollout episodes `10`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - standard ResNet-18 vision trunk
  - three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, and SwanLab URLs**
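The 2×2 launch grid can be generated rather than typed by hand, which avoids copy-paste drift in the fixed settings. This is a sketch only: the train entrypoint (`roboimi.train`) and the override keys other than `agent.head.n_emb`/`agent.head.n_layer` are assumptions about the repo's Hydra layout and must be checked against the actual config tree:

```python
from itertools import product

# Build the four run commands for the (n_emb, n_layer) grid.
# Entrypoint and pred_horizon/num_action_steps override paths are
# hypothetical; verify against the repo before launching.
runs = []
for n_emb, n_layer in product((256, 384), (12, 16)):
    runs.append(
        "python -m roboimi.train "               # hypothetical entrypoint
        "agent=resnet_imf_attnres_multitoken "
        f"agent.head.n_emb={n_emb} agent.head.n_layer={n_layer} "
        "agent.pred_horizon=16 agent.num_action_steps=8"
    )

assert len(runs) == 4
for cmd in runs:
    print(cmd)
```

Generating the commands also makes it easy to keep the Step 4 invariants (rollout episodes, horizons, vision trunk) identical across all four runs.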