# ResNet Multitoken IMF Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step, and launch four L20 experiments covering `n_emb ∈ {256, 384}` × `n_layer ∈ {12, 16}`.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.
## Task 1: Add failing tests for multi-token conditioning

**Files:**
- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Add a direct agent test**
  - Stub a vision backbone returning `(B, T, 3, D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
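The shape contract from Step 1 can be sketched with a stub backbone and a toy condition builder. The names `stub_backbone` and `build_cond` below are illustrative stand-ins; the real test would exercise `VLAAgent._build_cond`:

```python
import torch

B, T, K, D, D_STATE, D_COND = 2, 4, 3, 32, 7, 16

def stub_backbone(_obs):
    # Stand-in for the multi-token ResNet: one feature token per camera.
    return torch.randn(B, T, K, D)

def build_cond(vis, state, proj):
    # Toy version of the behavior under test: pair state with EACH camera
    # token (not with a concatenation across cameras), project, then flatten.
    state = state[:, :, None, :].expand(B, T, K, D_STATE)
    tokens = torch.cat([vis, state], dim=-1)   # (B, T, K, D + D_STATE)
    return proj(tokens).flatten(1, 2)          # (B, T*K, D_COND)

def test_build_cond_shape():
    vis = stub_backbone(None)
    state = torch.randn(B, T, D_STATE)
    proj = torch.nn.Linear(D + D_STATE, D_COND)
    cond = build_cond(vis, state, proj)
    assert cond.shape == (B, T * K, D_COND)

test_build_cond_shape()
```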
- [ ] **Step 2: Add Hydra wiring test**
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and that the head's `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
  ```
  python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q
  ```
## Task 2: Implement multi-token ResNet conditioning path

**Files:**
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`
- [ ] **Step 1: Extend ResNet backbone**
  - Add an opt-in flag to return `(B, T, num_cams, D)` camera tokens instead of one concatenated `(B, T, num_cams*D)` token.
  - Keep the standard ResNet-18 vision mode; do not switch to AttnRes vision.
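One way the opt-in flag could look. This is a sketch only: a small linear encoder stands in for the per-camera torchvision ResNet-18, and the class and flag names are assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn

class MultiCamBackbone(nn.Module):
    # Sketch: a tiny linear encoder per camera stands in for ResNet-18.
    def __init__(self, num_cams=3, in_dim=3 * 8 * 8, feat_dim=64,
                 return_camera_tokens=False):
        super().__init__()
        self.return_camera_tokens = return_camera_tokens  # opt-in multi-token flag
        self.encoders = nn.ModuleList(
            nn.Linear(in_dim, feat_dim) for _ in range(num_cams)
        )

    def forward(self, imgs):  # imgs: (B, T, num_cams, C, H, W)
        B, T, K = imgs.shape[:3]
        feats = [enc(imgs[:, :, k].reshape(B, T, -1))
                 for k, enc in enumerate(self.encoders)]
        if self.return_camera_tokens:
            return torch.stack(feats, dim=2)  # (B, T, K, D): one token per camera
        return torch.cat(feats, dim=-1)       # (B, T, K*D): existing concatenated layout
```

Defaulting the flag to `False` keeps the existing single-token path untouched for current configs.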
- [ ] **Step 2: Extend VLAAgent condition building**
  - Support rank-4 visual features `(B, T, K, D)`.
  - Broadcast state to `(B, T, K, D_state)`, concatenate per camera, apply the projector per token, then flatten to `(B, T*K, D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
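A possible shape for the agent-side branch, keeping the existing rank-3 path working. The function signature and variable names are illustrative, not the actual `VLAAgent` API:

```python
import torch

def build_condition(vis_feat, state, projector):
    """Sketch: accept (B,T,D) legacy features or (B,T,K,D) per-camera tokens."""
    if vis_feat.dim() == 3:
        vis_feat = vis_feat.unsqueeze(2)        # (B, T, 1, D): legacy single-token path
    B, T, K, _ = vis_feat.shape
    state_b = state.unsqueeze(2).expand(B, T, K, state.shape[-1])
    tokens = torch.cat([vis_feat, state_b], dim=-1)  # pair state with each camera token
    cond = projector(tokens).flatten(1, 2)           # (B, T*K, D_cond)
    tokens_per_step, seq_len = K, T * K              # bookkeeping for head wiring
    return cond, tokens_per_step, seq_len
```

The returned `tokens_per_step` and `seq_len` correspond to the `condition_tokens_per_step` and `condition_sequence_length` attributes named above.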
- [ ] **Step 3: Update transformer-head instantiation**
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add Hydra config**
  - New agent config uses:
    - a separate ResNet-18 per camera
    - the standard residual vision trunk (`vision_backbone_mode=resnet`)
    - condition projector output dim tied to `${agent.head.n_emb}`
    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
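A hedged sketch of what the new config file might contain; beyond the values stated above, every key name here is an assumption that must be reconciled with the existing agent configs in `roboimi/vla/conf/agent/`:

```yaml
# roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml (illustrative only)
vision_backbone_mode: resnet          # standard residual vision trunk
separate_backbone_per_camera: true    # one ResNet-18 per camera (hypothetical key)
return_camera_tokens: true            # emit (B,T,num_cams,D), not (B,T,num_cams*D)
cond_proj_dim: ${agent.head.n_emb}    # projector output tied to head width
rollout_episodes: 10
pred_horizon: 16
num_action_steps: 8
```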
## Task 3: Verify locally

**Files:**
- Modify only if verification reveals issues

- [ ] **Step 1: Run focused tests and make them pass**
  ```
  python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q
  ```
- [ ] **Step 2: Run regression subset**
  ```
  python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q
  ```
- [ ] **Step 3: Run local smoke instantiation**
  - Instantiate the new Hydra config and verify cond shape / sequence length.
## Task 4: Launch 4 L20 experiments

**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke the new config on remote**
- [ ] **Step 3: Launch runs**
  - `(n_emb=256, n_layer=12)`
  - `(n_emb=256, n_layer=16)`
  - `(n_emb=384, n_layer=12)`
  - `(n_emb=384, n_layer=16)`
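The four launches can be enumerated with a small loop. The entrypoint name `train_vla.py` and the override paths are assumptions; adjust them to the repo's actual training script and config keys, and prepend `nohup`/GPU pinning when actually launching:

```shell
# Print the four launch commands (echo only, so this is safe to dry-run).
for n_emb in 256 384; do
  for n_layer in 12 16; do
    echo "python train_vla.py agent=resnet_imf_attnres_multitoken" \
         "agent.head.n_emb=${n_emb} agent.head.n_layer=${n_layer}"
  done
done
```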
- [ ] **Step 4: Keep fixed across runs**
  - rollout episodes `10`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - standard ResNet-18 vision trunk
  - three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**