diffusion_policy/docs/superpowers/plans/2026-03-29-pusht-imf-attnres-implementation.md
2026-03-29 11:15:59 +08:00

# PushT Image iMF AttnRes Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add an AttnRes-backed full-attention iMF backbone for the PushT image experiment path, verify it with tests/smoke runs, then launch the 9-run 350-epoch architecture sweep across the local 5090 and remote 5880 GPUs.
**Architecture:** Extend `IMFTransformerForDiffusion` with a selectable `attnres_full` backbone that keeps the current iMF training/inference API unchanged while replacing the transformer internals with RMSNorm + RoPE self-attention + SwiGLU + Full AttnRes depth-wise residual routing. Add one standalone Hydra config for the PushT image sweep and reuse queue-style launch scripts with unique SwanLab names.
**Tech Stack:** Python 3.9 via uv, PyTorch 2.8 CUDA, Hydra, SwanLab online logging, local shell + SSH to trusted 5880 host.
---
### Task 1: Add regression tests for the new AttnRes path
**Files:**
- Modify: `tests/test_imf_transformer_for_diffusion.py`
- Modify: `tests/test_pusht_swanlab_config.py`
- [ ] Add a failing model test that instantiates `IMFTransformerForDiffusion(backbone_type='attnres_full', causal_attn=False, ...)`, runs a forward pass with conditional observations, and asserts the output shape and that optimizer construction succeeds.
- [ ] Run the targeted pytest selection and confirm the new test fails for the expected missing-backbone reason.
- [ ] Add a failing config regression test for `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml` asserting SwanLab naming fields and `policy.causal_attn == False`.
- [ ] Re-run the targeted pytest selection and confirm the config test fails before implementation.
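The config regression test can be sketched as below. This is a hypothetical stand-in: the real test would parse `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml` (e.g. via `yaml.safe_load` or OmegaConf), and the exact logging field names (`logging.mode`, `logging.name`) are assumptions here; only `policy.causal_attn` and `exp_name` come from this plan.

```python
def check_attnres_config(cfg: dict) -> None:
    """Assert SwanLab naming fields and the non-causal attention flag."""
    assert cfg["policy"]["causal_attn"] is False
    assert cfg["logging"]["mode"] == "online"   # assumed field name for SwanLab online mode
    assert cfg["logging"]["name"]               # run name must be set (unique per run)
    assert cfg["exp_name"]

# Plain dict standing in for the parsed YAML config.
example_cfg = {
    "exp_name": "imf_attnres_full_pusht",
    "logging": {"mode": "online", "name": "imf_attnres_full_pusht"},
    "policy": {"backbone_type": "attnres_full", "causal_attn": False},
}
check_attnres_config(example_cfg)  # a missing or wrong field would raise AssertionError
```

Written before Task 3, this test fails simply because the YAML file does not exist yet, which is the expected red state.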
### Task 2: Implement the AttnRes-backed iMF backbone
**Files:**
- Create: `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Modify: `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- [ ] Add focused reusable modules for `RMSNorm`, RoPE helpers, grouped-query self-attention, SwiGLU FFN, and the Full AttnRes operator.
- [ ] Extend `IMFTransformerForDiffusion` with a `backbone_type` switch that preserves the existing vanilla path and adds an `attnres_full` path using concatenated `[r, t, obs, sample]` tokens.
- [ ] Ensure the AttnRes path slices condition tokens away before the output head so the returned tensor still matches the sample/action horizon.
- [ ] Update optimizer parameter grouping to treat RMSNorm weights like LayerNorm weights (no decay) and include any new positional/conditioning parameters.
- [ ] Run the targeted tests and get them green.
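Two of the reusable modules and the token layout can be sketched as below. These are standard formulations of RMSNorm and SwiGLU, not the project's actual code; the RoPE helpers, grouped-query attention, and the Full AttnRes operator are omitted, and the token dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization: scale by reciprocal RMS, then a learned gain.
    Its only parameter is `weight`, so the Task 2 optimizer grouping can
    exclude it from weight decay the same way it handles LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W2 x), projected back down by W3."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w2 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w3 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

# Token layout for the attnres_full path: concatenate [r, t, obs, sample]
# along the sequence axis, run the backbone over all tokens, then slice the
# condition tokens away so the output still matches the action horizon.
B, To, Ta, D = 2, 3, 8, 64                      # batch, obs tokens, horizon, width
r, t = torch.randn(B, 1, D), torch.randn(B, 1, D)
obs, sample = torch.randn(B, To, D), torch.randn(B, Ta, D)
tokens = torch.cat([r, t, obs, sample], dim=1)  # full-attention input
out = tokens[:, -Ta:, :]                        # drop condition tokens before the head
assert out.shape == sample.shape
```

Because the condition tokens always sit at the front of the sequence, slicing the last `Ta` positions keeps the output head agnostic to how many observation tokens were prepended.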
### Task 3: Add the new PushT config and smoke-test path
**Files:**
- Create: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`
- Modify: `tests/test_pusht_swanlab_config.py`
- [ ] Add a standalone PushT image config for the AttnRes iMF variant with SwanLab online logging, `policy.backbone_type=attnres_full`, and `policy.causal_attn=false`.
- [ ] Verify `uv run python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_dit_imf_attnres_full.yaml --help` succeeds.
- [ ] Run a real smoke training command with `training.debug=true`, `training.device=cuda:0`, safety overrides (`dataloader.num_workers=0`, `task.env_runner.n_envs=1`, no vis), and confirm it reaches the training loop and writes a run directory.
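The config's key fields can be sketched as the fragment below. This is a hypothetical excerpt, not the full file: only `policy.backbone_type`, `policy.causal_attn`, and the debug/device overrides are named in this plan, and the `logging.mode` field name is an assumption about how SwanLab online logging is enabled.

```yaml
# Hypothetical fragment of image_pusht_diffusion_policy_dit_imf_attnres_full.yaml
policy:
  backbone_type: attnres_full
  causal_attn: false
logging:
  mode: online          # SwanLab online logging (field name assumed)
training:
  debug: false          # override with training.debug=true for the smoke run
  device: cuda:0
```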
### Task 4: Prepare launch scripts and start the 9-run sweep
**Files:**
- Create or modify: `data/run_logs/imf_attnres_local_queue.sh`
- Create or modify locally before copy: `data/run_logs/imf_attnres_remote_gpu0_queue.sh`
- Create or modify locally before copy: `data/run_logs/imf_attnres_remote_gpu1_queue.sh`
- [ ] Write queue command templates for the 9 runs using config `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`, `training.num_epochs=350`, unique `exp_name/logging.name`, and shared `logging.group=imf_pusht_attnres_arch_sweep`.
- [ ] Sync the necessary config/model files plus remote queue scripts to `droid@100.73.14.65:~/project/diffusion_policy-smoke`.
- [ ] Start the local queue under `nohup`, record PID, and verify the first run log is advancing.
- [ ] Start the two remote queues under `nohup`, record PIDs, and verify both first-run logs are advancing.
- [ ] Confirm all three GPUs have entered training for the new sweep.
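The queue template can be sketched as below. This is a dry-run sketch under assumptions: the run names and the per-host split (runs 1-3 on the local queue) are illustrative, and commands are collected and printed rather than executed; the real queue script would run each command sequentially (the commented loop) and be launched under `nohup`.

```shell
#!/usr/bin/env bash
set -euo pipefail

CONFIG=image_pusht_diffusion_policy_dit_imf_attnres_full.yaml
GROUP=imf_pusht_attnres_arch_sweep

CMDS=()
for i in 1 2 3; do                     # local queue takes 3 of the 9 runs (illustrative split)
  NAME="imf_attnres_run${i}"           # hypothetical unique run name
  CMDS+=("uv run python train.py --config-dir=. --config-name=${CONFIG} \
training.num_epochs=350 exp_name=${NAME} logging.name=${NAME} logging.group=${GROUP}")
done

# Real queue: execute sequentially so runs never share the GPU:
# for c in "${CMDS[@]}"; do eval "$c"; done
printf '%s\n' "${CMDS[@]}"
```

Launching the script itself under `nohup bash queue.sh > queue.log 2>&1 &` and recording `$!` gives the PID the later checkboxes ask for.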