2 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Logic | 78ab18e8f3 | feat: add pusht imf full-attention config | 2026-03-27 22:02:31 +08:00 |
| Logic | 484d008997 | docs: add pusht imf full-attention sweep spec | 2026-03-27 16:20:41 +08:00 |
4 changed files with 212 additions and 0 deletions


@@ -0,0 +1,60 @@
# PushT iMF Full-Attention Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a separate full-attention PushT image iMF config, commit/push it on a new branch, and launch the 9-run 350-epoch architecture sweep across 3 GPUs.
**Architecture:** Keep the existing causal iMF path untouched and add a standalone full-attention config that only flips `policy.causal_attn=false` while retaining one-step iMF inference and SwanLab-safe naming. Reuse the previous 9-run architecture matrix and balanced 3-queue scheduling across local 5090 plus 5880 GPU0/GPU1.
**Tech Stack:** Hydra, Diffusion Policy iMF image workspace, SwanLab, uv env, local shell + trusted remote 5880 over SSH.
---
### Task 1: Add full-attention iMF config with TDD
**Files:**
- Create: `image_pusht_diffusion_policy_dit_imf_fullattn.yaml`
- Modify: `tests/test_pusht_swanlab_config.py`
- [ ] Write a failing config regression test asserting the new config uses SwanLab-safe naming and `policy.causal_attn == False`.
- [ ] Run the targeted pytest command and verify it fails because the config does not exist yet.
- [ ] Add the minimal full-attention config by composing from the existing PushT image iMF config and overriding only `exp_name` and `policy.causal_attn=false`.
- [ ] Re-run the targeted pytest and verify it passes.
### Task 2: Verify the new config
**Files:**
- Read: `image_pusht_diffusion_policy_dit_imf_fullattn.yaml`
- [ ] Run `train.py --help` for the new config.
- [ ] Run a real `training.debug=true` smoke test locally to confirm the training path is valid.
### Task 3: Commit and push the new branch
**Files:**
- Commit only the new config/test/plan files needed for the full-attention experiment chain.
- [ ] Run verification commands again before commit.
- [ ] Commit with a focused message.
- [ ] Push `feat/pusht-imf-fullattn` to origin.
### Task 4: Launch the 9-run sweep
**Files:**
- Write queue scripts and logs under `data/run_logs/` locally and on 5880.
- Write outputs under `data/outputs/` locally and on 5880.
- [ ] Use the same matrix as the prior iMF sweep: `n_emb ∈ {128,256,384}`, `n_layer ∈ {6,12,18}`, `seed=42`.
- [ ] Set `training.num_epochs=350` for all 9 runs.
- [ ] Encode `fullattn` in every `exp_name`, `logging.name`, and run directory to avoid collisions.
- [ ] Balance the 9 runs across local 5090, 5880 GPU0, and 5880 GPU1 as three serial queues.
- [ ] Sync the new config to the remote smoke repo before launching remote queues.
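
The matrix, epoch count, and naming rules in Task 4 can be sketched as a small generator. This is only an illustration under stated assumptions: the `exp_name` pattern, the `policy.n_emb`, `policy.n_layer`, and `training.seed` override keys, and the per-run output directory are not given in the plan and are hypothetical here.

```python
from itertools import product

CONFIG_NAME = "image_pusht_diffusion_policy_dit_imf_fullattn.yaml"
N_EMB = (128, 256, 384)
N_LAYER = (6, 12, 18)
SEED = 42
NUM_EPOCHS = 350


def build_runs():
    """Return one dict per sweep run: a unique name plus Hydra-style overrides."""
    runs = []
    for n_emb, n_layer in product(N_EMB, N_LAYER):
        # Hypothetical naming scheme: "fullattn" plus every swept value.
        exp_name = f"pusht_image_dit_imf_fullattn_emb{n_emb}_layer{n_layer}_seed{SEED}"
        overrides = [
            f"--config-name={CONFIG_NAME}",
            f"exp_name={exp_name}",
            f"logging.name={exp_name}",
            f"policy.n_emb={n_emb}",      # assumed override key
            f"policy.n_layer={n_layer}",  # assumed override key
            f"training.seed={SEED}",      # assumed key; the plan only says seed=42
            f"training.num_epochs={NUM_EPOCHS}",
            f"hydra.run.dir=data/outputs/{exp_name}",  # assumed run directory layout
        ]
        runs.append(
            {"exp_name": exp_name, "n_emb": n_emb, "n_layer": n_layer, "overrides": overrides}
        )
    return runs


if __name__ == "__main__":
    for run in build_runs():
        print("python train.py " + " ".join(run["overrides"]))
```

Because every swept value appears in `exp_name`, the SwanLab run name and the output directory derived from it cannot collide within the sweep.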
### Task 5: Monitor and auto-summarize
**Files:**
- Read local and remote pid files, logs, outputs, checkpoints.
- [ ] Start an xhigh monitoring agent that polls all three queues.
- [ ] On completion, parse all 9 `logs.json.txt` files and rank by max `test_mean_score`.
- [ ] Report embedding/layer trends and the best configuration.
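
A minimal sketch of the ranking step in Task 5 follows. It assumes each `logs.json.txt` holds one JSON object per line and that evaluation lines carry a `test_mean_score` field; the function names and the one-directory-per-run layout are hypothetical.

```python
import json
from pathlib import Path


def max_test_mean_score(log_path):
    """Return the largest test_mean_score in a JSON-lines log, or None if absent."""
    path = Path(log_path)
    if not path.exists():
        return None
    best = None
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written trailing line
            if not isinstance(record, dict):
                continue
            score = record.get("test_mean_score")
            if isinstance(score, (int, float)) and (best is None or score > best):
                best = score
    return best


def rank_runs(run_dirs):
    """Return (run_name, score) pairs, best first; runs without a score go last."""
    scored = [
        (Path(d).name, max_test_mean_score(Path(d) / "logs.json.txt")) for d in run_dirs
    ]
    return sorted(scored, key=lambda item: (item[1] is None, -(item[1] or 0.0)))
```

Runs with no score are kept at the end of the ranking rather than dropped, so a failed run stays visible in the final report.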


@@ -0,0 +1,107 @@
# PushT Image iMF Full-Attention Sweep Design
## Goal
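On a dedicated new branch, add a **full-attention** variant (causal attention disabled) to the PushT image iMF pipeline, and run **9 experiments** on the same architecture sweep grid as before, training each for **350 epochs**. Once all experiments finish, extract **`max(test_mean_score)`** for each run and report a full ranking and trend summary.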
## Scope
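This work covers only:
1. Adding a full-attention experiment pipeline without affecting the existing causal iMF pipeline;
2. Running a 350-epoch sweep over the 9 combinations of `n_emb ∈ {128, 256, 384}` × `n_layer ∈ {6, 12, 18}`;
3. Scheduling three parallel queues across the local 5090 and the two GPUs on the 5880;
4. Automatically summarizing the results after all experiments finish and reporting them directly to the user.
Out of scope:
- Replacing or deleting the existing causal iMF config;
- Changing the existing DiT baseline implementation;
- Extending to multiple seeds;
- Adding video recording.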
## Design Choice
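Use a **new standalone config + new branch** rather than overwriting the existing iMF default config.
Rationale:
- The existing causal iMF pipeline already has completed experiments and recorded results; leaving it untouched makes comparison easier;
- Full attention is a new experiment pipeline, and a standalone config makes it easier to reproduce;
- At run time only the config switch `policy.causal_attn=false` is needed; the iMF algorithm itself does not need to be redesigned.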
## Configuration Design
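Add one standalone config file, for example:
- `image_pusht_diffusion_policy_dit_imf_fullattn.yaml`
Its responsibilities:
- Inherit the current PushT image iMF config chain;
- Keep iMF one-step inference, SwanLab scalar logging, and no video recording;
- Explicitly set:
  - `policy.causal_attn=false`
  - `policy.n_head=1`
- Keep all other iMF training semantics unchanged.
SwanLab naming follows the current post-fix strategy:
- `logging.name=${exp_name}`
- `logging.resume=false`
- `logging.id=null`
- `logging.group=${exp_name}`, or a single sweep-wide group override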
## Code Change Strategy
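Prefer minimal changes:
- If the current `IMFTransformerForDiffusion` already supports a `causal_attn=False` branch, leave the core algorithm unchanged and disable the causal mask through the new config only;
- If additional regression coverage is needed, add a minimal test of the full-attention config / mask behavior;
- Do not change existing causal experiment configs or the semantics of existing tests.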
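
To make the difference between the two modes concrete, here is a minimal sketch of the attention pattern each one allows. It is not the `IMFTransformerForDiffusion` implementation, whose masking details are not shown in this diff; `attention_mask` is a hypothetical helper, and the sketch only illustrates the general meaning of a causal mask versus full attention.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len boolean matrix.

    Entry [i][j] is True when query position i may attend to key position j.
    With causal=True each position sees only itself and earlier positions;
    with causal=False (full attention) every position sees every other one.
    """
    return [[(j <= i) if causal else True for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    for row in attention_mask(4, causal=True):
        print(["x" if allowed else "." for allowed in row])
```

A mask-behavior regression test of the kind mentioned above could compare the mask a model builds against this lower-triangular pattern for the causal case and an all-True pattern for the full-attention case.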
## Experiment Matrix
The experiment grid is fixed as:
- `n_emb=128, n_layer=6`
- `n_emb=128, n_layer=12`
- `n_emb=128, n_layer=18`
- `n_emb=256, n_layer=6`
- `n_emb=256, n_layer=12`
- `n_emb=256, n_layer=18`
- `n_emb=384, n_layer=6`
- `n_emb=384, n_layer=12`
- `n_emb=384, n_layer=18`
Common settings:
- `training.num_epochs=350`
- `training.resume=false`
- `seed=42`
- PushT image data path unchanged
- The metric is the **maximum `test_mean_score` in `logs.json.txt`**
## Scheduling Design
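Run the 9 experiments as three serial queues executing in parallel:
- Local 5090: 1 sequential queue
- 5880 GPU0: 1 sequential queue
- 5880 GPU1: 1 sequential queue
Allocation principles:
- As before, balance the workload approximately by `n_emb × n_layer`;
- Each GPU runs only 1 experiment at a time;
- The queue script is responsible for automatically starting the next run once the previous one finishes.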
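
One way to realize the allocation principle is a greedy longest-first assignment, sketched below. The spec does not say which balancing method was actually used, so this algorithm and the queue labels are assumptions; the cost proxy `n_emb * n_layer` is the one named in the spec.

```python
from itertools import product

QUEUES = ("local-5090", "5880-gpu0", "5880-gpu1")  # hypothetical queue labels
GRID = list(product((128, 256, 384), (6, 12, 18)))


def balance(runs, queues=QUEUES):
    """Assign each (n_emb, n_layer) run to the currently least-loaded queue."""
    loads = {q: 0 for q in queues}
    plan = {q: [] for q in queues}
    # Place the most expensive runs first so small runs can fill the gaps.
    for n_emb, n_layer in sorted(runs, key=lambda r: r[0] * r[1], reverse=True):
        target = min(queues, key=lambda q: loads[q])
        plan[target].append((n_emb, n_layer))
        loads[target] += n_emb * n_layer
    return plan, loads


if __name__ == "__main__":
    plan, loads = balance(GRID)
    for queue in QUEUES:
        print(queue, loads[queue], plan[queue])
```

Each queue then runs its list in order, one experiment at a time. Note that `n_emb * n_layer` is only a rough proxy for wall-clock time, so the queues may still finish at different times.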
## Monitoring Design
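Continue to use the two-layer **training queue scripts + monitoring agent** mechanism:
1. **Actual scheduling** is handled by the local/remote queue scripts;
2. **Monitoring** is handled by one xhigh sub-agent that polls:
   - pid status
   - the master log
   - each run's `logs.json.txt`
   - whether runs are stuck, failed, or all complete
3. Once everything is complete, the monitoring agent directly returns:
   - the final epoch of each of the 9 experiments
   - `max(test_mean_score)` for each
   - a ranking table
   - an embedding / layer trend summary
For this request, once the agent receives the all-complete signal it should report the results directly to the main session without waiting for the user to prompt again.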
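
The stuck / failed / complete decision can be sketched as a pure function over the signals the monitor polls. The status labels, the 30-minute stall threshold, and the assumption that the logged epoch is 0-based (so the last epoch of a 350-epoch run is 349) are all hypothetical and should be checked against the real logs.

```python
def classify_run(pid_alive, seconds_since_log_update, last_epoch,
                 num_epochs=350, stall_after_s=1800.0):
    """Classify one run from polled signals.

    pid_alive: whether the queue process that owns this run is still alive.
    seconds_since_log_update: age of logs.json.txt, or None if it does not exist yet.
    last_epoch: last epoch found in the log (assumed 0-based), or None.
    """
    if last_epoch is not None and last_epoch >= num_epochs - 1:
        return "done"
    if not pid_alive:
        return "failed"  # process gone before reaching the final epoch
    if seconds_since_log_update is None:
        return "pending"  # queue is alive but this run has not started logging
    if seconds_since_log_update > stall_after_s:
        return "stalled"
    return "running"


def all_complete(statuses):
    """True only when every run in the sweep reports 'done'."""
    return len(statuses) > 0 and all(s == "done" for s in statuses)
```

Keeping the decision free of I/O makes it easy to unit-test; the polling code only has to gather the three inputs locally and over SSH.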
## Success Criteria
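The work is considered complete when all of the following hold:
1. The full-attention iMF config runs on the new branch;
2. All 9 350-epoch experiments have finished;
3. No simulation videos are recorded, only scalars;
4. SwanLab run names do not collide;
5. A full summary of `max(test_mean_score)` for the 9 experiments, with conclusions, is produced;
6. After all experiments end, the main session can give the user the final summary directly.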


@@ -0,0 +1,33 @@
defaults:
  - diffusion_policy/config/train_diffusion_transformer_hybrid_workspace@_here_
  - override /diffusion_policy/config/task@task: pusht_image
  - _self_

exp_name: pusht_image_dit_imf_fullattn

policy:
  _target_: diffusion_policy.policy.imf_transformer_hybrid_image_policy.IMFTransformerHybridImagePolicy
  num_inference_steps: 1
  n_head: 1
  causal_attn: false

logging:
  backend: swanlab
  mode: online
  name: ${exp_name}
  resume: false
  tags: ["${name}", "${task_name}", "${exp_name}", "swanlab"]
  id: null
  group: ${exp_name}

dataloader:
  num_workers: 0

val_dataloader:
  num_workers: 0

task:
  env_runner:
    n_envs: 1
    n_test_vis: 0
    n_train_vis: 0


@@ -30,3 +30,15 @@ def test_image_pusht_dit_imf_swanlab_config_uses_exp_name_and_no_resume_collisio
    assert cfg.logging.resume is False
    assert cfg.logging.id is None
    assert cfg.logging.group == cfg.exp_name


def test_image_pusht_dit_imf_fullattn_config_uses_exp_name_and_disables_causal_attention():
    cfg = _load_cfg('image_pusht_diffusion_policy_dit_imf_fullattn.yaml')
    assert cfg.logging.backend == 'swanlab'
    assert cfg.logging.mode == 'online'
    assert cfg.logging.name == cfg.exp_name
    assert cfg.logging.resume is False
    assert cfg.logging.id is None
    assert cfg.logging.group == cfg.exp_name
    assert cfg.policy.causal_attn is False