roboimi/docs/superpowers/plans/2026-04-05-phase2-full-attnres-vision-plan.md
2026-04-05 00:07:59 +08:00


Phase-2 Full-AttnRes Vision Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace all ResNet residual units in the vision backbone with AttnRes-based image blocks while preserving the current IMF agent interfaces, then launch a Phase-2 experiment anchored on the best Phase-1 horizon setting.

Architecture: Keep the current multi-camera encoder shell and per-camera output contract, but introduce a new ResNet-like 2D AttnRes backbone that preserves stage-wise downsampling and final SpatialSoftmax conditioning. Wire it into the existing ResNetDiffusionBackbone via an opt-in mode and keep the agent/head/data interfaces unchanged.

Tech Stack: PyTorch, Hydra/OmegaConf, existing IMF AttnRes transformer components, pytest.


Task 1: Add failing tests for the new full-AttnRes visual backbone

Files:

  • Create: tests/test_attnres_resnet2d_backbone.py

  • Update: tests/test_imf_vla_agent.py

  • Step 1: Write a failing backbone shape test

  • Step 2: Run it and confirm it fails because the new backbone/config does not exist yet

  • Step 3: Add a failing IMF agent wiring test for unchanged cond_dim=208

  • Step 4: Run the targeted tests and capture the failure

Task 2: Implement a ResNet-like 2D AttnRes backbone

Files:

  • Create: roboimi/vla/models/backbones/attnres_resnet2d.py

  • Modify: roboimi/vla/models/backbones/resnet_diffusion.py

  • Step 1: Add minimal 2D tokenization helpers and positional encoding / bias handling

  • Step 2: Implement AttnResImageBlock2D for feature maps

  • Step 3: Implement AttnResResNetLikeBackbone2D with stage-wise downsampling

  • Step 4: Wire _SingleRgbEncoder to choose between original ResNet trunk and the new full-AttnRes trunk

  • Step 5: Run the new backbone tests
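
Steps 1 to 3 can be pictured with the minimal PyTorch sketch below. It assumes feature maps are laid out as (B, C, H, W). The names `to_tokens`, `to_map`, `PlaceholderAttnBlock2D`, and `DownsampleStage` are hypothetical, and the attention block is a generic pre-norm self-attention stand-in built on `nn.MultiheadAttention`, not the repository's AttnRes component. Positional encoding and bias handling are left out.

```python
import torch
from torch import nn


def to_tokens(feature_map: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) -> (B, H*W, C): one token per spatial location."""
    return feature_map.flatten(2).transpose(1, 2)


def to_map(tokens: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """(B, H*W, C) -> (B, C, H, W): inverse of to_tokens."""
    batch, num_tokens, channels = tokens.shape
    if num_tokens != height * width:
        raise ValueError("token count does not match height * width")
    return tokens.transpose(1, 2).reshape(batch, channels, height, width)


class PlaceholderAttnBlock2D(nn.Module):
    """Generic pre-norm self-attention over a feature map (stand-in only)."""

    def __init__(self, channels: int, num_heads: int = 4) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        _, _, height, width = feature_map.shape
        tokens = to_tokens(feature_map)
        normed = self.norm(tokens)
        mixed, _ = self.attn(normed, normed, normed, need_weights=False)
        # Residual connection around the attention mix, then back to a map.
        return to_map(tokens + mixed, height, width)


class DownsampleStage(nn.Module):
    """One stage: strided 3x3 conv for downsampling, then one token-mixing block."""

    def __init__(self, in_channels: int, out_channels: int, stride: int) -> None:
        super().__init__()
        self.reduce = nn.Conv2d(
            in_channels, out_channels, kernel_size=3, stride=stride, padding=1
        )
        self.block = PlaceholderAttnBlock2D(out_channels)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        return self.block(self.reduce(feature_map))
```

Keeping downsampling in a strided convolution means each later stage attends over fewer tokens, which is one plausible way to preserve the ResNet-like stage layout the plan calls for.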

Task 3: Expose config switches and agent wiring

Files:

  • Modify: roboimi/vla/conf/backbone/resnet_diffusion.yaml

  • Modify: roboimi/vla/conf/agent/resnet_imf_attnres.yaml

  • Step 1: Add a backbone mode/config flag for the full-AttnRes vision trunk

  • Step 2: Add defaults for attnres image depth/heads/etc. if needed

  • Step 3: Add a Phase-2 launch override path that enables the new visual trunk

  • Step 4: Run agent wiring tests again
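
The opt-in switch described in Steps 1 to 3 might look like the sketch below, written as a plain dataclass rather than actual Hydra YAML. Every field and mode name here (`vision_trunk`, `attnres_depth`, `attnres_heads`, `"resnet"`, `"attnres_full"`) is a hypothetical placeholder, as are the default values; the only property taken from the plan is that the default must keep the existing ResNet trunk.

```python
from dataclasses import dataclass, replace

# Hypothetical mode names; the plan does not fix the actual flag values.
VALID_TRUNKS = ("resnet", "attnres_full")


@dataclass(frozen=True)
class VisionTrunkConfig:
    vision_trunk: str = "resnet"  # default keeps the existing behavior
    attnres_depth: int = 2        # hypothetical; only read by the new trunk
    attnres_heads: int = 4        # hypothetical; only read by the new trunk

    def __post_init__(self) -> None:
        if self.vision_trunk not in VALID_TRUNKS:
            raise ValueError(f"unknown vision_trunk: {self.vision_trunk!r}")
        if self.attnres_depth < 1 or self.attnres_heads < 1:
            raise ValueError("attnres_depth and attnres_heads must be >= 1")


def phase2_override(base: VisionTrunkConfig) -> VisionTrunkConfig:
    """Return a copy of `base` with the full-AttnRes trunk enabled."""
    return replace(base, vision_trunk="attnres_full")
```

Validating the mode string at construction time makes a mistyped launch override fail immediately instead of silently falling back to the ResNet trunk.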

Task 4: Smoke-verify training path

Files:

  • Reuse existing training scripts and configs

  • Step 1: Run a short smoke test on CPU (or with a tiny step count) that instantiates the agent and calls compute_loss

  • Step 2: If needed, run a very short training smoke launch

  • Step 3: Verify no cond-dim or rollout-loading regressions
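
The cond-dim check in Step 3 could be reduced to a small helper like the one below. It only compares measured values against the `cond_dim=208` contract from Task 1; the function name and the trunk-mode keys are hypothetical, and the measured dimensions would come from instantiating the agent in each mode.

```python
from typing import Dict, List

EXPECTED_COND_DIM = 208  # from Task 1, Step 3 of this plan


def cond_dim_regressions(
    measured: Dict[str, int], expected: int = EXPECTED_COND_DIM
) -> List[str]:
    """Return the trunk modes whose measured cond_dim differs from `expected`."""
    return sorted(mode for mode, dim in measured.items() if dim != expected)
```

An empty list means every trunk mode still honors the conditioning contract; any returned name points at the mode to investigate.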

Task 5: Launch the Phase-2 experiment

Files:

  • Update experiment tracking under experiment_suites/

  • Step 1: Use the best Phase-1 setting (pred_horizon=16, num_action_steps=8)

  • Step 2: Launch a baseline reference run, or reuse the existing result

  • Step 3: Launch full-AttnRes vision experiment

  • Step 4: Track rollout metrics and compare the maximum avg_reward across runs
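
The final comparison can be sketched as follows, assuming each run's rollout log can be reduced to `(training_step, avg_reward)` pairs. That log format, the function names, and the tie-breaking rule (earliest step wins) are assumptions for illustration, not part of the existing tracking code.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[int, float]  # (training_step, avg_reward); assumed log format


def best_avg_reward(history: Iterable[Record]) -> Record:
    """Return the record with the highest avg_reward; earliest step wins ties."""
    records = list(history)
    if not records:
        raise ValueError("history is empty")
    return max(records, key=lambda record: (record[1], -record[0]))


def compare_runs(
    baseline: Iterable[Record], candidate: Iterable[Record]
) -> Dict[str, object]:
    """Summarize both runs' best records and the candidate-minus-baseline gap."""
    best_baseline = best_avg_reward(baseline)
    best_candidate = best_avg_reward(candidate)
    return {
        "baseline": best_baseline,
        "candidate": best_candidate,
        "delta": best_candidate[1] - best_baseline[1],
    }
```

A positive `delta` would indicate that the full-AttnRes run reached a higher peak avg_reward than the baseline. Comparing peaks alone ignores variance across checkpoints, so the per-step histories are worth keeping alongside the summary.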