`roboimi/docs/superpowers/plans/2026-04-06-resnet-multitoken-imf.md`

# ResNet Multitoken IMF Implementation Plan

**For agentic workers:** REQUIRED SUB-SKILL: use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task by task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step, and launch four L20 experiments sweeping `n_emb` in {256, 384} and `n_layer` in {12, 16}.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.
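The pair-project-flatten path described above can be sketched as follows. This is a minimal illustration, not the repo's actual API: `build_cond_multitoken` and the concrete dimensions are assumptions.

```python
import torch

def build_cond_multitoken(vis, state, projector):
    """Illustrative sketch of the planned conditioning path (hypothetical name).

    vis:   (B, T, K, D_vis)  one token per camera per obs step
    state: (B, T, D_state)   robot state at each obs step
    Returns a cond sequence of shape (B, T*K, D_cond).
    """
    B, T, K, _ = vis.shape
    # Pair every camera token with the same step's state.
    state_per_cam = state.unsqueeze(2).expand(B, T, K, state.shape[-1])
    pairs = torch.cat([vis, state_per_cam], dim=-1)  # (B, T, K, D_vis + D_state)
    cond = projector(pairs)                          # (B, T, K, D_cond)
    return cond.flatten(1, 2)                        # (B, T*K, D_cond)

# Example: 3 cameras, obs_horizon 2 -> 6 condition tokens per sample.
proj = torch.nn.Linear(8 + 5, 16)
cond = build_cond_multitoken(torch.randn(2, 2, 3, 8), torch.randn(2, 2, 5), proj)
assert cond.shape == (2, 6, 16)
```

The key design point is that the projector is applied per camera token, so each token carries one camera view plus the state, rather than all cameras fused into a single vector first.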


## Task 1: Add failing tests for multi-token conditioning

**Files:**

- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] Step 1: Add a direct agent test
  - Stub a vision backbone returning `(B, T, 3, D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] Step 2: Add a Hydra wiring test
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and that the head's `n_obs_steps` receives that sequence length.
- [ ] Step 3: Run the focused tests and verify they fail (RED)
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
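The stub in Step 1 could look roughly like this; `StubBackbone` is a hypothetical name, and the real test would feed its output through the agent's `_build_cond`:

```python
import torch

class StubBackbone(torch.nn.Module):
    """Hypothetical stub: emits one random token per camera, shape (B, T, 3, D)."""
    def __init__(self, num_cams=3, d=8):
        super().__init__()
        self.num_cams, self.d = num_cams, d

    def forward(self, imgs):
        B, T = imgs.shape[:2]
        return torch.randn(B, T, self.num_cams, self.d)

def test_stub_backbone_emits_per_camera_tokens():
    feats = StubBackbone()(torch.zeros(2, 4, 3, 3, 64, 64))  # B=2, obs_horizon=4
    assert feats.shape == (2, 4, 3, 8)
    # The real test then asserts the agent turns this into a (B, T*3, D_cond)
    # cond sequence; that assertion is what makes the test RED before Task 2.
```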

## Task 2: Implement multi-token ResNet conditioning path

**Files:**

- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`

- [ ] Step 1: Extend the ResNet backbone
  - Add an opt-in flag that returns per-camera tokens `(B, T, num_cams, D)` instead of one concatenated `(B, T, num_cams*D)` token.
  - Keep the standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] Step 2: Extend `VLAAgent` condition building
  - Support rank-4 visual features `(B, T, K, D)`.
  - Broadcast state to `(B, T, K, D_state)`, concatenate per camera, apply the projector per token, then flatten to `(B, T*K, D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] Step 3: Update transformer-head instantiation
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] Step 4: Add the Hydra config
  - The new agent config uses:
    - a separate ResNet-18 per camera
    - the standard residual vision trunk (`vision_backbone_mode=resnet`)
    - a condition projector whose output dim is tied to `${agent.head.n_emb}`
    - rollout episodes 10, `pred_horizon=16`, `num_action_steps=8`

## Task 3: Verify locally

**Files:**

- Modify only if verification reveals issues

- [ ] Step 1: Run the focused tests and make them pass
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] Step 2: Run the regression subset
  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] Step 3: Run a local smoke instantiation
  - Instantiate the new Hydra config and verify the cond shape and sequence length.

## Task 4: Launch 4 L20 experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] Step 1: Sync code to 100.119.99.14
- [ ] Step 2: Smoke-test the new config on the remote machine
- [ ] Step 3: Launch runs
  - (`n_emb=256`, `n_layer=12`)
  - (`n_emb=256`, `n_layer=16`)
  - (`n_emb=384`, `n_layer=12`)
  - (`n_emb=384`, `n_layer=16`)
- [ ] Step 4: Keep fixed across runs
  - rollout episodes 10
  - `pred_horizon=16`
  - `num_action_steps=8`
  - the standard ResNet-18 vision trunk
  - three separate per-camera weights
- [ ] Step 5: Record PIDs, GPU assignments, log paths, and SwanLab URLs
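The four launches in Step 3 can be generated with a small loop. The training entrypoint (`python -m roboimi.train`) and the override paths are assumptions to verify against the repo before running:

```shell
# Print the four launch commands (entrypoint and override names are hypothetical).
for ne in 256 384; do
  for nl in 12 16; do
    echo "python -m roboimi.train agent=resnet_imf_attnres_multitoken" \
         "agent.head.n_emb=${ne} agent.head.n_layer=${nl}"
  done
done
```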