# ResNet Multitoken IMF Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step, and launch four L20 experiments for `n_emb in {256, 384}` and `n_layer in {12, 16}`.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.

---

### Task 1: Add failing tests for multi-token conditioning

**Files:**

- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Add a direct agent test**
  - Stub a vision backbone returning `(B, T, 3, D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add a Hydra wiring test**
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and that the head's `n_obs_steps` receives that sequence length.
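The shape contract that the Step 1 test asserts can be sketched standalone. This is a minimal reference of the expected behavior, not the repo's actual `_build_cond()` implementation; all dimension values and the `projector` layer here are illustrative:

```python
import torch

# Stand-in for the multi-token cond path: a stubbed backbone emits
# (B, T, K, D) camera tokens, and the agent is expected to produce a
# condition sequence of shape (B, T*K, D_cond).
B, T, K, D, D_state, D_cond = 2, 4, 3, 32, 7, 48

vision = torch.randn(B, T, K, D)        # stubbed per-camera tokens
state = torch.randn(B, T, D_state)      # one state vector per obs step

# Pair each camera token with that step's state (broadcast over K),
# rather than concatenating cameras first.
state_k = state.unsqueeze(2).expand(B, T, K, D_state)
paired = torch.cat([vision, state_k], dim=-1)     # (B, T, K, D + D_state)

# Project each pair into a condition token, then flatten per-step tokens.
projector = torch.nn.Linear(D + D_state, D_cond)  # illustrative projector
cond = projector(paired).reshape(B, T * K, D_cond)

assert cond.shape == (B, T * K, D_cond)           # (2, 12, 48)
```

The real test would drive `_build_cond()` through the agent's public API; this sketch only pins down the tensor shapes the assertions should check.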
- [ ] **Step 3: Run focused tests and verify RED**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`

### Task 2: Implement multi-token ResNet conditioning path

**Files:**

- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`

- [ ] **Step 1: Extend the ResNet backbone**
  - Add an opt-in flag to return `(B, T, num_cams, D)` camera tokens instead of one concatenated `(B, T, num_cams*D)` token.
  - Keep the standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
  - Support visual features of rank 4, `(B, T, K, D)`.
  - Broadcast state to `(B, T, K, D_state)`, concatenate per camera, apply the projector per token, then flatten to `(B, T*K, D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
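The opt-in backbone flag in Step 1 reduces to a small branch over the output rank. A hedged sketch follows; the flag name `return_camera_tokens` and the function boundary are assumptions about where this lands in `resnet_diffusion.py`, not the repo's actual API:

```python
import torch

def pool_cameras(per_cam_feats: torch.Tensor,
                 return_camera_tokens: bool) -> torch.Tensor:
    """Combine per-camera ResNet features.

    per_cam_feats: (B, T, K, D) — one D-dim feature per camera per step.
    Flag name `return_camera_tokens` is hypothetical.
    """
    if return_camera_tokens:
        # New multi-token path: keep one token per camera.
        return per_cam_feats                        # (B, T, K, D)
    # Legacy path: concatenate cameras into a single token per step.
    B, T, K, D = per_cam_feats.shape
    return per_cam_feats.reshape(B, T, K * D)       # (B, T, K*D)

x = torch.randn(2, 4, 3, 512)                       # 3 cameras, ResNet-18 dim
assert pool_cameras(x, True).shape == (2, 4, 3, 512)
assert pool_cameras(x, False).shape == (2, 4, 1536)
```

Keeping the legacy concatenation as the default makes the flag backward compatible with existing configs.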
- [ ] **Step 4: Add the Hydra config**
  - The new agent config uses:
    - a separate ResNet-18 per camera
    - the standard residual vision trunk (`vision_backbone_mode=resnet`)
    - a condition-projector output dim tied to `${agent.head.n_emb}`
    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`

### Task 3: Verify locally

**Files:**

- Modify only if verification reveals issues

- [ ] **Step 1: Run the focused tests and make them pass**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run the regression subset**
  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run a local smoke instantiation**
  - Instantiate the new Hydra config and verify the cond shape and sequence length.

### Task 4: Launch 4 L20 experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke-test the new config on the remote machine**
- [ ] **Step 3: Launch runs**
  - `(n_emb=256, n_layer=12)`
  - `(n_emb=256, n_layer=16)`
  - `(n_emb=384, n_layer=12)`
  - `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
  - rollout episodes `10`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - standard ResNet-18 vision trunk
  - three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, and SwanLab URLs**
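The 2×2 launch grid can be generated rather than typed by hand, which avoids copy-paste drift in the fixed settings. This is a sketch only: the train entrypoint (`roboimi.train`) and the override keys other than `agent.head.n_emb`/`agent.head.n_layer` are assumptions about the repo's Hydra layout and must be checked against the actual config tree:

```python
from itertools import product

# Build the four run commands for the (n_emb, n_layer) grid.
# Entrypoint and pred_horizon/num_action_steps override paths are
# hypothetical; verify against the repo before launching.
runs = []
for n_emb, n_layer in product((256, 384), (12, 16)):
    runs.append(
        "python -m roboimi.train "               # hypothetical entrypoint
        "agent=resnet_imf_attnres_multitoken "
        f"agent.head.n_emb={n_emb} agent.head.n_layer={n_layer} "
        "agent.pred_horizon=16 agent.num_action_steps=8"
    )

assert len(runs) == 4
for cmd in runs:
    print(cmd)
```

Generating the commands also makes it easy to keep the Step 4 invariants (rollout episodes, horizons, vision trunk) identical across all four runs.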