# PushT Image iMF AttnRes Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add an AttnRes-backed full-attention iMF backbone for the PushT image experiment path, verify it with tests/smoke runs, then launch the 9-run 350-epoch architecture sweep across the local 5090 and remote 5880 GPUs.
**Architecture:** Extend `IMFTransformerForDiffusion` with a selectable `attnres_full` backbone that keeps the current iMF training/inference API unchanged while replacing the transformer internals with RMSNorm + RoPE self-attention + SwiGLU + Full AttnRes depth-wise residual routing. Add one standalone Hydra config for the PushT image sweep and reuse queue-style launch scripts with unique SwanLab names.
**Tech Stack:** Python 3.9 via uv, PyTorch 2.8 CUDA, Hydra, SwanLab online logging, local shell + SSH to the trusted 5880 host.
## Task 1: Add regression tests for the new AttnRes path
**Files:**

- Modify: `tests/test_imf_transformer_for_diffusion.py`
- Modify: `tests/test_pusht_swanlab_config.py`

- [ ] Add a failing model test that instantiates `IMFTransformerForDiffusion(backbone_type='attnres_full', causal_attn=False, ...)`, runs a forward pass with conditional observations, and asserts the output shape plus optimizer construction.
- [ ] Run the targeted pytest selection and confirm the new test fails for the expected missing-backbone reason.
- [ ] Add a failing config regression test for `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml` asserting the SwanLab naming fields and `policy.causal_attn == False`.
- [ ] Re-run the targeted pytest selection and confirm the config test fails before implementation.
## Task 2: Implement the AttnRes-backed iMF backbone
**Files:**

- Create: `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Modify: `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- [ ] Add focused, reusable modules for `RMSNorm`, RoPE helpers, grouped-query self-attention, the SwiGLU FFN, and the Full AttnRes operator.
- [ ] Extend `IMFTransformerForDiffusion` with a `backbone_type` switch that preserves the existing vanilla path and adds an `attnres_full` path using concatenated `[r, t, obs, sample]` tokens.
- [ ] Ensure the AttnRes path slices the condition tokens away before the output head so the returned tensor still matches the sample/action horizon.
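Two of the smaller components can be sketched directly; this is a minimal standalone version of what `attnres_transformer_components.py` might contain (module names and constructor shapes are assumptions, not the repo's actual API):

```python
# Minimal sketches of RMSNorm and the SwiGLU FFN; names/shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """LayerNorm variant: scales by the root-mean-square, no mean-centering, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W3 x), projected back down by W2."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Both modules are shape-preserving on the last dimension, which keeps them drop-in replacements for `nn.LayerNorm` and the vanilla MLP inside the transformer block.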
- [ ] Update the optimizer parameter grouping to treat RMSNorm weights like LayerNorm weights (no weight decay) and include any new positional/conditioning parameters.
- [ ] Run the targeted tests and get them green.
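The grouping change amounts to adding the new norm class to the no-decay set. A self-contained sketch of the idea (the helper name and heuristics are assumptions; the repo's actual grouping code may differ):

```python
# Sketch of decay/no-decay optimizer grouping that treats RMSNorm like
# LayerNorm. Helper name and heuristics are assumptions, not the repo's code.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * self.weight


# norm and embedding weights are conventionally excluded from weight decay
NO_DECAY_TYPES = (nn.LayerNorm, RMSNorm, nn.Embedding)


def get_optim_groups(model: nn.Module, weight_decay: float):
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            # biases and norm/embedding weights: no weight decay
            if name.endswith("bias") or isinstance(module, NO_DECAY_TYPES):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

Any new positional or conditioning parameters (e.g. learned register tokens) would be routed into one of these groups the same way, so the optimizer never silently skips them.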
## Task 3: Add the new PushT config and smoke-test the path
**Files:**

- Create: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`
- Modify: `tests/test_pusht_swanlab_config.py`
- [ ] Add a standalone PushT image config for the AttnRes iMF variant with SwanLab online logging, `policy.backbone_type=attnres_full`, and `policy.causal_attn=false`.
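The distinguishing fields of the new config might look like the fragment below; everything except `backbone_type`, `causal_attn`, and the sweep group name is an assumption modeled on the existing PushT image configs, which the real file should inherit from or copy.

```yaml
# Sketch only: key names other than backbone_type / causal_attn / logging.group
# are assumptions; base the real file on the existing PushT image DiT configs.
name: image_pusht_diffusion_policy_dit_imf_attnres_full
policy:
  backbone_type: attnres_full
  causal_attn: false
logging:
  mode: online                          # SwanLab online logging
  group: imf_pusht_attnres_arch_sweep
```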
- [ ] Verify that `uv run python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_dit_imf_attnres_full.yaml --help` succeeds.
- [ ] Run a real smoke training command with `training.debug=true`, `training.device=cuda:0`, and safety overrides (`dataloader.num_workers=0`, `task.env_runner.n_envs=1`, no vis), and confirm it reaches the training loop and writes a run directory.
## Task 4: Prepare launch scripts and start the 9-run sweep
**Files:**

- Create or modify: `data/run_logs/imf_attnres_local_queue.sh`
- Create or modify locally before copy: `data/run_logs/imf_attnres_remote_gpu0_queue.sh`
- Create or modify locally before copy: `data/run_logs/imf_attnres_remote_gpu1_queue.sh`
- [ ] Write queue command templates for the 9 runs using the config `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`, `training.num_epochs=350`, a unique `exp_name`/`logging.name` per run, and a shared `logging.group=imf_pusht_attnres_arch_sweep`.
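One queue script could follow the shape below. The three run names are placeholders for whatever per-run architecture overrides the sweep actually varies, and the script is written to the current directory here only so the sketch is self-contained; the real files live under `data/run_logs/`.

```shell
# Sketch: generate one sequential queue script. Run names and per-run
# overrides are placeholders, not the real 9-run sweep grid.
cat > imf_attnres_local_queue.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Runs execute sequentially so a single GPU is never oversubscribed.
for run in run_a run_b run_c; do
  uv run python train.py --config-dir=. \
    --config-name=image_pusht_diffusion_policy_dit_imf_attnres_full.yaml \
    training.num_epochs=350 \
    training.device=cuda:0 \
    exp_name="imf_attnres_${run}" \
    logging.name="imf_attnres_${run}" \
    logging.group=imf_pusht_attnres_arch_sweep
done
EOF
chmod +x imf_attnres_local_queue.sh
```

The two remote scripts would differ only in `training.device` (or a `CUDA_VISIBLE_DEVICES` prefix) and their run-name slices of the 9-run grid.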
- [ ] Sync the necessary config/model files plus the remote queue scripts to `droid@100.73.14.65:~/project/diffusion_policy-smoke`.
- [ ] Start the local queue under `nohup`, record the PID, and verify the first run's log is advancing.
- [ ] Start the two remote queues under `nohup`, record the PIDs, and verify both first-run logs are advancing.
- [ ] Confirm all three GPUs have entered training for the new sweep.