feat: add pusht imf attnres backbone

This commit is contained in:
Logic
2026-03-29 11:15:59 +08:00
parent 78ab18e8f3
commit 185ed6596c
8 changed files with 647 additions and 61 deletions


@@ -0,0 +1,57 @@
# PushT Image iMF AttnRes Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add an AttnRes-backed full-attention iMF backbone for the PushT image experiment path, verify it with tests/smoke runs, then launch the 9-run 350-epoch architecture sweep across the local 5090 and remote 5880 GPUs.
**Architecture:** Extend `IMFTransformerForDiffusion` with a selectable `attnres_full` backbone that keeps the current iMF training/inference API unchanged while replacing the transformer internals with RMSNorm + RoPE self-attention + SwiGLU + Full AttnRes depth-wise residual routing. Add one standalone Hydra config for the PushT image sweep and reuse queue-style launch scripts with unique SwanLab names.
**Tech Stack:** Python 3.9 via uv, PyTorch 2.8 CUDA, Hydra, SwanLab online logging, local shell + SSH to trusted 5880 host.
---
### Task 1: Add regression tests for the new AttnRes path
**Files:**
- Modify: `tests/test_imf_transformer_for_diffusion.py`
- Modify: `tests/test_pusht_swanlab_config.py`
- [ ] Add a failing model test that instantiates `IMFTransformerForDiffusion(backbone_type='attnres_full', causal_attn=False, ...)`, runs a forward pass with conditional observations, and asserts output shape plus optimizer construction (see the sketch after this list).
- [ ] Run the targeted pytest selection and confirm the new test fails for the expected missing-backbone reason.
- [ ] Add a failing config regression test for `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml` asserting SwanLab naming fields and `policy.causal_attn == False`.
- [ ] Re-run the targeted pytest selection and confirm the config test fails before implementation.
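A minimal sketch of the new model test, assuming the constructor keywords beyond `backbone_type` and `causal_attn` (dims, horizon, optimizer arguments) roughly match the existing fixtures:
```python
import torch
from diffusion_policy.model.diffusion.imf_transformer_for_diffusion import (
    IMFTransformerForDiffusion,
)

def test_attnres_full_forward_and_optimizer():
    # Hypothetical dimensions; align them with the real test fixtures.
    model = IMFTransformerForDiffusion(
        backbone_type="attnres_full",
        causal_attn=False,
        input_dim=2, output_dim=2, horizon=16,
        n_obs_steps=2, cond_dim=66,
    )
    sample = torch.randn(4, 16, 2)          # (B, horizon, input_dim)
    r, t = torch.rand(4), torch.rand(4)     # iMF conditioning scalars
    cond = torch.randn(4, 2, 66)            # conditional observations
    out = model(sample, r, t, cond=cond)
    assert out.shape == sample.shape
    # Optimizer construction must also work for the new path
    # (argument names here are assumptions).
    opt = model.configure_optimizers(learning_rate=1e-4, weight_decay=1e-3)
    assert opt is not None
```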
### Task 2: Implement the AttnRes-backed iMF backbone
**Files:**
- Create: `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Modify: `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- [ ] Add focused reusable modules for `RMSNorm`, RoPE helpers, grouped-query self-attention, SwiGLU FFN, and the Full AttnRes operator.
- [ ] Extend `IMFTransformerForDiffusion` with a `backbone_type` switch that preserves the existing vanilla path and adds an `attnres_full` path using concatenated `[r, t, obs, sample]` tokens.
- [ ] Ensure the AttnRes path slices condition tokens away before the output head so the returned tensor still matches the sample/action horizon.
- [ ] Update optimizer parameter grouping to treat RMSNorm weights like LayerNorm weights (no decay) and include any new positional/conditioning parameters (see the grouping sketch after this list).
- [ ] Run the targeted tests and get them green.
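A minimal sketch of the no-decay grouping change, assuming the existing loop follows the usual GPT-style whitelist/blacklist pattern:
```python
import torch.nn as nn
# Assumption: RMSNorm lives in the new components module created in this task.
from diffusion_policy.model.diffusion.attnres_transformer_components import RMSNorm

def build_optim_groups(model: nn.Module, weight_decay: float):
    # Treat RMSNorm exactly like LayerNorm: its weights join the
    # no-weight-decay group, as do biases and embedding weights.
    no_decay_types = (nn.LayerNorm, nn.Embedding, RMSNorm)
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if isinstance(module, no_decay_types) or name.endswith("bias"):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```
Any new positional/conditioning parameters introduced by the AttnRes path must also land in one of these groups; the real implementation should assert the union covers every parameter.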
### Task 3: Add the new PushT config and smoke-test path
**Files:**
- Create: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`
- Modify: `tests/test_pusht_swanlab_config.py`
- [ ] Add a standalone PushT image config for the AttnRes iMF variant with SwanLab online logging, `policy.backbone_type=attnres_full`, and `policy.causal_attn=false` (a config-test sketch follows this list).
- [ ] Verify `uv run python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_dit_imf_attnres_full.yaml --help` succeeds.
- [ ] Run a real smoke training command with `training.debug=true`, `training.device=cuda:0`, safety overrides (`dataloader.num_workers=0`, `task.env_runner.n_envs=1`, no vis), and confirm it reaches the training loop and writes a run directory.
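A sketch of the config regression test (key paths follow this plan; a Hydra config with a `defaults` list may require composition rather than a plain `OmegaConf.load`):
```python
from omegaconf import OmegaConf

def test_attnres_full_config_fields():
    cfg = OmegaConf.load("image_pusht_diffusion_policy_dit_imf_attnres_full.yaml")
    assert cfg.policy.backbone_type == "attnres_full"
    assert cfg.policy.causal_attn is False
    assert cfg.logging.backend == "swanlab"
    assert cfg.logging.mode == "online"
    # Scalars only: no train/test visualization videos.
    assert cfg.task.env_runner.n_train_vis == 0
    assert cfg.task.env_runner.n_test_vis == 0
```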
### Task 4: Prepare launch scripts and start the 9-run sweep
**Files:**
- Create or modify: `data/run_logs/imf_attnres_local_queue.sh`
- Create or modify locally, then copy to the remote: `data/run_logs/imf_attnres_remote_gpu0_queue.sh`
- Create or modify locally, then copy to the remote: `data/run_logs/imf_attnres_remote_gpu1_queue.sh`
- [ ] Write queue command templates for the 9 runs using config `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`, `training.num_epochs=350`, unique `exp_name/logging.name`, and shared `logging.group=imf_pusht_attnres_arch_sweep` (a generation sketch follows this list).
- [ ] Sync the necessary config/model files plus remote queue scripts to `droid@100.73.14.65:~/project/diffusion_policy-smoke`.
- [ ] Start the local queue under `nohup`, record PID, and verify the first run log is advancing.
- [ ] Start the two remote queues under `nohup`, record PIDs, and verify both first-run logs are advancing.
- [ ] Confirm all three GPUs have officially entered training for the new sweep.
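A sketch of how the 9 queue commands could be generated; the override keys `policy.n_emb`, `policy.n_layer`, and `training.seed` are assumptions about the config schema:
```python
GROUP = "imf_pusht_attnres_arch_sweep"
CONFIG = "image_pusht_diffusion_policy_dit_imf_attnres_full.yaml"

def queue_commands(queue_suffix: str, pairs):
    # pairs: the (n_emb, n_layer) combinations assigned to this queue.
    for n_emb, n_layer in pairs:
        name = f"imf_attnres_emb{n_emb}_layer{n_layer}_seed42_{queue_suffix}"
        yield (
            f"uv run python train.py --config-dir=. --config-name={CONFIG} "
            f"policy.n_emb={n_emb} policy.n_layer={n_layer} "
            f"training.seed=42 training.num_epochs=350 "
            f"exp_name={name} logging.name={name} logging.group={GROUP}"
        )

# Local 5090 queue, per the scheduling section of the design doc.
for cmd in queue_commands("5090", [(384, 18), (256, 6), (128, 6)]):
    print(cmd)
```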


@@ -0,0 +1,108 @@
# PushT Image iMF AttnRes Design
## Goal
On top of the existing PushT image iMF full-attention path, introduce the **Full AttnRes** residual-aggregation scheme from the `attn_res` repository, together with its matching **RMSNorm + self-attention + SwiGLU FFN** modules, while keeping the iMF training objective and one-step inference semantics unchanged and touching only this experiment path. Once implemented and verified, launch the same 9-run `n_emb × n_layer` sweep as before (350 epochs, seed=42, SwanLab online, no video recording).
## Scope
This work covers only:
1. Adding an AttnRes-backed backbone variant to `IMFTransformerForDiffusion`;
2. Keeping `forward(sample, r, t, cond=None)`, the iMF loss, and the one-step inference policy interface unchanged;
3. Adding a standalone PushT image config for this variant;
4. Reusing the local 5090 + remote dual-GPU 5880 three-queue parallel scheduling for the 9 runs.
Out of scope:
- Replacing the existing vanilla iMF/full-attn configs;
- Modifying the DiT baseline;
- Adding video logging;
- Expanding to multiple seeds.
## Recommended Approach
Add an **optional AttnRes backbone inside the current iMF model** rather than building a separate policy path.
Rationale:
- The policy / workspace / loss / sampling paths are already verified; keeping them minimizes the change surface;
- Switching only the backbone inside the model keeps the new experiment comparable with existing iMF results;
- The config only needs explicit switches such as `backbone_type=attnres_full` and `causal_attn=false`, which makes the experiment straightforward to reproduce.
## Architecture
### 1. Backbone split
`IMFTransformerForDiffusion` keeps the existing vanilla encoder/decoder implementation as the default path and adds an `attnres_full` path:
- **vanilla**: the current implementation, unchanged;
- **attnres_full**: a single-stack full-attention transformer whose input token sequence is
`[r token, t token, obs cond tokens..., action/sample tokens...]`.
The model emits `u` predictions only at the trailing action/sample token positions; the leading condition tokens participate in context modeling only (see the dispatch sketch below).
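A minimal sketch of this dispatch (the private helper names are hypothetical):
```python
# Sketch: inside IMFTransformerForDiffusion; the public iMF API is unchanged.
def forward(self, sample, r, t, cond=None):
    if self.backbone_type == "vanilla":
        return self._forward_vanilla(sample, r, t, cond)
    if self.backbone_type == "attnres_full":
        return self._forward_attnres_full(sample, r, t, cond)
    raise ValueError(f"unknown backbone_type: {self.backbone_type!r}")
```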
### 2. AttnRes stack
The new backbone uses the following modules:
- `RMSNorm`
- rotary position embedding (applied to the self-attention q/k)
- `GroupedQueryAttention` (defaulting to `n_kv_head=1` for this experiment, compatible with a single-head configuration)
- `SwiGLU` FFN
- `AttnResOperator` (one pseudo-query per sublayer, performing full depth-wise residual aggregation)
Each transformer block consists of two sublayers:
1. a self-attention sublayer
2. an FFN sublayer
The input to each sublayer is no longer the plain residual `x + f(x)`; instead, `h_l` is aggregated from the embedding and the outputs of all previous sublayers via Full AttnRes, then passed through `RMSNorm(h_l) -> sublayer_fn(...)`, as sketched below.
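A minimal sketch of the aggregation, assuming one learned pseudo-query per sublayer that attends over the per-token depth history (the exact parameterization in the `attn_res` repository may differ):
```python
import torch
import torch.nn as nn

class AttnResOperator(nn.Module):
    """Full AttnRes aggregation for one sublayer (sketch).

    Instead of the plain residual x + f(x), the sublayer input h_l is a
    learned per-token weighted combination of the embedding and all
    previous sublayer outputs.
    """

    def __init__(self, n_emb: int):
        super().__init__()
        self.pseudo_query = nn.Parameter(torch.zeros(n_emb))
        self.key_proj = nn.Linear(n_emb, n_emb, bias=False)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history: [embedding, y_1, ..., y_{l-1}], each (B, T, n_emb).
        h = torch.stack(history, dim=2)                # (B, T, L, n_emb)
        scores = self.key_proj(h) @ self.pseudo_query  # (B, T, L)
        weights = torch.softmax(scores / h.shape[-1] ** 0.5, dim=-1)
        return (weights.unsqueeze(-1) * h).sum(dim=2)  # (B, T, n_emb) = h_l
```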
### 3. Conditioning and token flow
- `sample` is first mapped to action tokens by `input_emb`;
- `r` and `t` are each mapped to one condition token via `SinusoidalPosEmb + linear`;
- the encoded image observation `cond` is mapped to obs tokens by `cond_obs_emb`;
- the concatenated token sequence enters the AttnRes stack;
- at the output, the leading condition tokens are sliced off, keeping only the action/sample token segment, which then passes through `RMSNorm + head` to produce the final `u` (see the sketch below).
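A sketch of that flow as it might appear inside `IMFTransformerForDiffusion` (the embedding-module names `r_emb`/`t_emb` are assumptions; `input_emb` and `cond_obs_emb` follow the design above):
```python
import torch

# Sketch, not the verified implementation:
def _forward_attnres_full(self, sample, r, t, cond):
    x = self.input_emb(sample)           # (B, T_a, n_emb) action tokens
    r_tok = self.r_emb(r).unsqueeze(1)   # (B, 1, n_emb) SinusoidalPosEmb + linear
    t_tok = self.t_emb(t).unsqueeze(1)   # (B, 1, n_emb)
    obs = self.cond_obs_emb(cond)        # (B, T_o, n_emb) obs tokens
    tokens = torch.cat([r_tok, t_tok, obs, x], dim=1)
    h = self.attnres_stack(tokens)       # non-causal full attention
    h = h[:, -x.shape[1]:, :]            # slice off the condition tokens
    return self.head(self.final_norm(h)) # u prediction, matches sample horizon
```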
### 4. Attention mode
This experiment path is fixed to **non-causal full attention**:
- `causal_attn=false`
- no causal mask is constructed
- all tokens are bidirectionally visible to one another
This matches the user's requirement that training still use full attention (no causal masking).
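In PyTorch 2.x terms, the attention call inside the stack reduces to scaled dot-product attention with no mask:
```python
import torch
import torch.nn.functional as F

def full_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Non-causal full attention: no mask is built, so every token
    # (condition and action alike) attends bidirectionally.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)
```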
## Config and Logging
Add a standalone config file, e.g.:
- `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`
This config must:
- point to the existing `IMFTransformerHybridImagePolicy`;
- explicitly enable the AttnRes backbone parameters;
- set `policy.causal_attn=false`;
- keep `logging.backend=swanlab` and `logging.mode=online`;
- guarantee, via launch-time overrides:
  - `logging.name=<unique_run_name>`
  - `logging.group=imf_pusht_attnres_arch_sweep`
  - `exp_name=<unique_run_name>`
- keep `task.env_runner.n_test_vis=0` and `n_train_vis=0`, logging scalars only.
## Experiment Matrix
A fixed 9-run grid:
- `n_emb ∈ {128, 256, 384}`
- `n_layer ∈ {6, 12, 18}`
- `seed=42`
- `training.num_epochs=350`
## Scheduling
Reuse the previously verified three-queue assignment:
- local 5090: `384x18`, `256x6`, `128x6`
- 5880 GPU0: `384x12`, `256x12`, `128x12`
- 5880 GPU1: `384x6`, `256x18`, `128x18`
Each run name encodes the backbone and architecture, e.g.:
`imf_attnres_emb256_layer12_seed42_5880gpu0`
## Verification
During implementation, verify at least that:
1. the new config's SwanLab naming and `causal_attn=false` are correct;
2. the new backbone's forward shapes and `configure_optimizers()` work;
3. the existing vanilla-path tests do not regress;
4. a `training.debug=true` smoke run passes end to end.
## Success Criteria
1. The new AttnRes iMF variant trains and supports one-step inference on this branch;
2. The existing vanilla iMF/full-attn path is unaffected;
3. All 9 runs are officially launched across the three GPUs;
4. SwanLab run names are unique, with no collisions;
5. No video is recorded; scalars only.