
# PushT Image iMF AttnRes Implementation Plan

**For agentic workers:** REQUIRED SUB-SKILL: use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add an AttnRes-backed full-attention iMF backbone for the PushT image experiment path, verify it with tests and smoke runs, then launch the 9-run, 350-epoch architecture sweep across the local 5090 and remote 5880 GPUs.

**Architecture:** Extend `IMFTransformerForDiffusion` with a selectable `attnres_full` backbone that keeps the current iMF training/inference API unchanged while replacing the transformer internals with RMSNorm, RoPE self-attention, SwiGLU FFNs, and Full AttnRes depth-wise residual routing. Add one standalone Hydra config for the PushT image sweep and reuse queue-style launch scripts with unique SwanLab names.
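The intended token flow, as a minimal sketch (the argument names, block internals, and `final_norm`/`head` callables are illustrative assumptions, not the final API):

```python
import torch

def attnres_full_forward(blocks, final_norm, head,
                         r_tok, t_tok, obs_tok, sample_tok):
    """Hypothetical token flow for backbone_type='attnres_full'.

    Conditioning tokens [r, t, obs] are prepended to the sample tokens,
    routed through the shared stack, then sliced away so the output
    still matches the sample/action horizon.
    """
    x = torch.cat([r_tok, t_tok, obs_tok, sample_tok], dim=1)
    n_cond = x.shape[1] - sample_tok.shape[1]
    for block in blocks:  # each block: RMSNorm + RoPE attn + SwiGLU + AttnRes routing
        x = block(x)
    return head(final_norm(x)[:, n_cond:])  # (B, horizon, output_dim)
```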

**Tech Stack:** Python 3.9 via uv, PyTorch 2.8 with CUDA, Hydra, SwanLab online logging, local shell plus SSH to the trusted 5880 host.


## Task 1: Add regression tests for the new AttnRes path

**Files:**

- Modify: `tests/test_imf_transformer_for_diffusion.py`
- Modify: `tests/test_pusht_swanlab_config.py`

**Steps:**

- [ ] Add a failing model test that instantiates `IMFTransformerForDiffusion(backbone_type='attnres_full', causal_attn=False, ...)`, runs a forward pass with conditional observations, and asserts the output shape plus optimizer construction (see the sketch after this list).
- [ ] Run the targeted pytest selection and confirm the new test fails for the expected missing-backbone reason.
- [ ] Add a failing config regression test for `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml` asserting the SwanLab naming fields and `policy.causal_attn == False`.
- [ ] Re-run the targeted pytest selection and confirm the config test fails before implementation.
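A minimal sketch of the failing model test, assuming the constructor kwargs (`input_dim`, `horizon`, `n_obs_steps`, `cond_dim`) and the `configure_optimizers` helper mirror the existing vanilla-path tests; copy the real signature from `tests/test_imf_transformer_for_diffusion.py`:

```python
import torch
from diffusion_policy.model.diffusion.imf_transformer_for_diffusion import (
    IMFTransformerForDiffusion,
)

def test_attnres_full_backbone():
    # Assumed kwargs; mirror whatever the existing vanilla-path test passes.
    model = IMFTransformerForDiffusion(
        input_dim=2, output_dim=2, horizon=16,
        n_obs_steps=2, cond_dim=66,
        backbone_type='attnres_full', causal_attn=False,
    )
    sample = torch.randn(4, 16, 2)          # (B, horizon, input_dim)
    timestep = torch.randint(0, 100, (4,))  # diffusion timesteps
    cond = torch.randn(4, 2, 66)            # (B, n_obs_steps, cond_dim)
    out = model(sample, timestep, cond)
    assert out.shape == sample.shape
    # Optimizer grouping must also cover the new RMSNorm/RoPE parameters.
    opt = model.configure_optimizers(learning_rate=1e-4, weight_decay=1e-3)
    assert opt is not None
```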

## Task 2: Implement the AttnRes-backed iMF backbone

**Files:**

- Create: `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Modify: `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`

**Steps:**

- [ ] Add focused, reusable modules for RMSNorm, RoPE helpers, grouped-query self-attention, the SwiGLU FFN, and the Full AttnRes operator (see the component sketches after this list).
- [ ] Extend `IMFTransformerForDiffusion` with a `backbone_type` switch that preserves the existing vanilla path and adds an `attnres_full` path using concatenated `[r, t, obs, sample]` tokens.
- [ ] Ensure the AttnRes path slices condition tokens away before the output head so the returned tensor still matches the sample/action horizon.
- [ ] Update optimizer parameter grouping to treat RMSNorm weights like LayerNorm weights (no weight decay) and include any new positional/conditioning parameters.
- [ ] Run the targeted tests and confirm they pass.
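Minimal sketches of two of the reusable components, following the standard definitions (names and dims are illustrative; the grouped-query attention, RoPE helpers, and AttnRes operator are omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by RMS, no mean-centering or bias.

    The single `weight` vector belongs in the no-decay optimizer group,
    exactly like a LayerNorm weight.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```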

## Task 3: Add the new PushT config and smoke-test path

**Files:**

- Create: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`
- Modify: `tests/test_pusht_swanlab_config.py`

**Steps:**

- [ ] Add a standalone PushT image config for the AttnRes iMF variant with SwanLab online logging, `policy.backbone_type=attnres_full`, and `policy.causal_attn=false`.
- [ ] Verify that `uv run python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_dit_imf_attnres_full.yaml --help` succeeds.
- [ ] Run a real smoke training command with `training.debug=true`, `training.device=cuda:0`, and safety overrides (`dataloader.num_workers=0`, `task.env_runner.n_envs=1`, no visualization), then confirm it reaches the training loop and writes a run directory.
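A sketch of the config regression test, assuming the YAML sits at the repo root (as the `--config-dir=.` command implies) and exposes `policy.*` and `logging.*` keys like the existing SwanLab configs:

```python
from omegaconf import OmegaConf

def test_attnres_full_config():
    cfg = OmegaConf.load('image_pusht_diffusion_policy_dit_imf_attnres_full.yaml')
    assert cfg.policy.backbone_type == 'attnres_full'
    assert cfg.policy.causal_attn is False
    # SwanLab naming fields: check key presence only, to avoid resolving
    # Hydra-style interpolations outside a Hydra run.
    assert 'name' in cfg.logging and 'group' in cfg.logging
```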

## Task 4: Prepare launch scripts and start the 9-run sweep

**Files:**

- Create or modify: `data/run_logs/imf_attnres_local_queue.sh`
- Create or modify locally before copying: `data/run_logs/imf_attnres_remote_gpu0_queue.sh`
- Create or modify locally before copying: `data/run_logs/imf_attnres_remote_gpu1_queue.sh`

**Steps:**

- [ ] Write queue command templates for the 9 runs using config `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`, `training.num_epochs=350`, a unique `exp_name`/`logging.name` per run, and the shared `logging.group=imf_pusht_attnres_arch_sweep` (a generator sketch follows this list).
- [ ] Sync the necessary config/model files plus the remote queue scripts to `droid@100.73.14.65:~/project/diffusion_policy-smoke`.
- [ ] Start the local queue under nohup, record its PID, and verify the first run's log is advancing.
- [ ] Start the two remote queues under nohup, record their PIDs, and verify both first-run logs are advancing.
- [ ] Confirm all three GPUs have entered training for the new sweep.
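A sketch of generating the local queue script; the nine variant tags and their per-run overrides are placeholders, since the actual sweep axes come from the architecture sweep definition:

```python
# Hypothetical generator for data/run_logs/imf_attnres_local_queue.sh.
CONFIG = 'image_pusht_diffusion_policy_dit_imf_attnres_full.yaml'
GROUP = 'imf_pusht_attnres_arch_sweep'
VARIANTS = [f'v{i}' for i in range(9)]  # placeholder tags for the 9 runs

lines = ['#!/usr/bin/env bash', 'set -e']
for tag in VARIANTS:
    name = f'imf_attnres_{tag}'
    lines.append(
        'uv run python train.py --config-dir=. '
        f'--config-name={CONFIG} '
        'training.num_epochs=350 training.device=cuda:0 '
        f'exp_name={name} logging.name={name} logging.group={GROUP}'
    )

with open('data/run_logs/imf_attnres_local_queue.sh', 'w') as f:
    f.write('\n'.join(lines) + '\n')
```

Listing the commands sequentially in one script gives the queue behavior under a single nohup; the remote gpu0/gpu1 scripts would differ only in `training.device` and the run subset.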