roboimi/docs/superpowers/plans/2026-04-01-imf-attnres-policy-migration.md
# IMF-AttnRes Policy Migration Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Migrate the IMF-AttnRes model, training objective, and one-step inference mechanism from the external `diffusion_policy@185ed659` into RoboIMI, and launch training with matching hyperparameters while preserving the three-camera visual conditioning and the existing training/rollout workflow.
**Architecture:** Keep RoboIMI's existing ResNet three-camera observation encoding, normalization, queue-based online rollout, and training scripts; add the AttnRes components and the IMF transformer head, plus a dedicated IMF agent that overrides the DDPM-loss / DDIM-inference semantics. The training script gets only minimal wiring changes so the new head/agent can reuse the existing optimizer, checkpointing, SwanLab, and headless rollout.
**Tech Stack:** PyTorch, Hydra, diffusers schedulers (kept only for compatible initialization), MuJoCo rollout, unittest, SwanLab
---
## File Map
### New files
- `roboimi/vla/models/heads/attnres_transformer_components.py` — local IMF AttnRes building blocks
- `roboimi/vla/models/heads/imf_transformer1d.py` — IMF transformer head, exposing `forward(sample, r, t, cond=None)`
- `roboimi/vla/agent_imf.py` — dedicated IMF VLA agent; reuses the existing observation/queue/normalization logic and overrides loss / inference
- `roboimi/vla/conf/head/imf_transformer1d.yaml` — IMF head config
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml` — combined IMF agent + backbone/head config
- `tests/test_imf_transformer1d_external_alignment.py` — alignment tests against external `185ed659`
- `tests/test_imf_vla_agent.py` — loss / inference / queue-semantics tests for the IMF agent
### Modified files
- `roboimi/demos/vla_scripts/train_vla.py` — optimizer parameter-group wiring; ensure the new agent trains without further changes
- `roboimi/vla/conf/config.yaml` — keep defaults unchanged; the IMF agent is enabled only via override
- `tests/test_train_vla_transformer_optimizer.py` — cover the optimizer-group behavior of the IMF head
- (if needed) `roboimi/vla/models/heads/__init__.py` or the nearest export file — expose the new head
---
### Task 1: Write the IMF transformer alignment test
**Files:**
- Create: `tests/test_imf_transformer1d_external_alignment.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- [ ] **Step 1: Write a failing test verifying that the local IMF head matches external `185ed659` on state-dict keys, forward shapes, forward values, and optim groups**
```python
with torch.no_grad():
    external_out = external_model(sample=sample, r=r, t=t, cond=cond)
    local_out = local_model(sample=sample, r=r, t=t, cond=cond)
assert torch.allclose(local_out, external_out, atol=1e-6, rtol=1e-5)
```
- [ ] **Step 2: Run the test and confirm it currently fails**
Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: FAIL, reporting that the `imf_transformer1d` / `attnres` modules do not exist
- [ ] **Step 3: If the test needs the existing external-loader logic, copy the minimal required helpers from `tests/test_transformer1d_external_alignment.py` to avoid depending on shared session context**
- [ ] **Step 4: Commit the test skeleton**
```bash
git add tests/test_imf_transformer1d_external_alignment.py
git commit -m "test: add IMF transformer external alignment coverage"
```
### Task 2: Implement the AttnRes components and the IMF transformer head
**Files:**
- Create: `roboimi/vla/models/heads/attnres_transformer_components.py`
- Create: `roboimi/vla/models/heads/imf_transformer1d.py`
- Modify: `tests/test_imf_transformer1d_external_alignment.py`
- [ ] **Step 1: Port the AttnRes building blocks from external `185ed659`, keeping names and parameter semantics identical**
Must include:
- `RMSNorm`
- `RMSNormNoWeight`
- `precompute_rope_freqs`
- `apply_rope`
- `GroupedQuerySelfAttention`
- `SwiGLUFFN`
- `AttnResOperator`
- `AttnResSubLayer`
- `AttnResTransformerBackbone`
- [ ] **Step 2: Implement the local IMF head in `imf_transformer1d.py`**
Must satisfy:
- `forward(sample, r, t, cond=None)`
- supports `backbone_type='attnres_full'` by default
- token sequence is `[r_token, t_token, cond_tokens..., sample_tokens...]`
- the output slices back only the sample-token segment
- keeps `get_optim_groups()` for AdamW parameter grouping
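To make the token layout concrete, here is a minimal, hypothetical sketch of the assembly. The embedding layers, names, and shapes are assumptions rather than the external implementation, and the AttnRes backbone is replaced by an identity pass-through:

```python
import torch
import torch.nn as nn

class IMFTokenLayoutSketch(nn.Module):
    """Hypothetical sketch of the IMF head token assembly; the real
    AttnRes backbone is omitted (identity pass-through)."""

    def __init__(self, n_emb: int, input_dim: int, cond_dim: int):
        super().__init__()
        self.input_emb = nn.Linear(input_dim, n_emb)   # sample tokens
        self.cond_emb = nn.Linear(cond_dim, n_emb)     # cond tokens
        self.r_emb = nn.Linear(1, n_emb)               # scalar r -> one token
        self.t_emb = nn.Linear(1, n_emb)               # scalar t -> one token
        self.out_proj = nn.Linear(n_emb, input_dim)

    def forward(self, sample, r, t, cond=None):
        B, T, _ = sample.shape
        tokens = [
            self.r_emb(r.view(B, 1, 1).float()),       # [r_token]
            self.t_emb(t.view(B, 1, 1).float()),       # [t_token]
        ]
        if cond is not None:
            tokens.append(self.cond_emb(cond))         # [cond_tokens...]
        tokens.append(self.input_emb(sample))          # [sample_tokens...]
        x = torch.cat(tokens, dim=1)
        # the backbone would run here; slice back only the sample segment
        return self.out_proj(x[:, -T:, :])
```

The alignment test then only has to check that the real head produces the same slicing behavior and shapes as the external model.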
- [ ] **Step 3: Run the alignment test; fix any state-dict key / init / no-decay grouping mismatches**
Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: PASS
- [ ] **Step 4: Commit the model components**
```bash
git add roboimi/vla/models/heads/attnres_transformer_components.py \
roboimi/vla/models/heads/imf_transformer1d.py \
tests/test_imf_transformer1d_external_alignment.py
git commit -m "feat: add IMF AttnRes transformer head"
```
### Task 3: Write IMF agent behavior tests
**Files:**
- Create: `tests/test_imf_vla_agent.py`
- Reference: `roboimi/vla/agent.py`
- Reference: `tests/test_resnet_transformer_agent_wiring.py`
- [ ] **Step 1: Write failing tests covering the IMF agent's core contract**
Must cover:
1. `compute_loss()` accepts the current batch structure and returns a scalar loss
2. `predict_action()` outputs `(B, pred_horizon, action_dim)`
3. `select_action()` still follows the queue/chunk semantics
4. `predict_action()` does not run a multi-step DDIM loop; it triggers exactly one IMF sampling step
5. when `action_is_pad` is present, the loss is computed only over valid actions
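Contract 5 (masking) can be sketched as follows; `masked_action_loss` is a hypothetical helper name, assuming `action_is_pad` is a boolean `(B, T)` tensor marking padded steps:

```python
import torch
import torch.nn.functional as F

def masked_action_loss(pred, target, action_is_pad=None):
    """MSE restricted to valid (non-padded) action steps (sketch)."""
    loss = F.mse_loss(pred, target, reduction="none")   # (B, T, D)
    if action_is_pad is None:
        return loss.mean()
    valid = (~action_is_pad).float().unsqueeze(-1)      # (B, T, 1)
    # mean over valid elements only: masked sum / number of valid elements
    return (loss * valid).sum() / (valid.sum() * pred.shape[-1]).clamp(min=1.0)
```

A good test sets a wildly wrong prediction on a padded step and asserts the loss is unaffected.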
- [ ] **Step 2: Use a stub backbone / stub head that records call arguments, verifying the `r, t, cond` plumbing and the observation-conditioning dimensions**
```python
self.assertEqual(recorded['cond'].shape, (B, obs_horizon, expected_cond_dim))
self.assertTrue(torch.allclose(recorded['r'], torch.zeros(B)))
self.assertTrue(torch.allclose(recorded['t'], torch.ones(B)))
```
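A minimal recording stub for this step might look like the following (a test-helper sketch; the class name and the zero-output convention are assumptions):

```python
import torch
import torch.nn as nn

class RecordingStubHead(nn.Module):
    """Records the kwargs of each forward call so wiring tests can assert
    on the r / t / cond values the agent passes in."""

    def __init__(self):
        super().__init__()
        self.recorded = {}

    def forward(self, sample, r, t, cond=None):
        self.recorded.update(sample=sample, r=r, t=t, cond=cond)
        return torch.zeros_like(sample)  # shape-preserving dummy prediction
```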
- [ ] **Step 3: Run the tests and confirm they currently fail**
Run: `python -m unittest tests.test_imf_vla_agent -v`
Expected: FAIL, reporting that `roboimi.vla.agent_imf` does not exist
- [ ] **Step 4: Commit the test skeleton**
```bash
git add tests/test_imf_vla_agent.py
git commit -m "test: add IMF VLA agent behavior coverage"
```
### Task 4: Implement the IMF agent and Hydra wiring
**Files:**
- Create: `roboimi/vla/agent_imf.py`
- Create: `roboimi/vla/conf/head/imf_transformer1d.yaml`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_transformer_optimizer.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Implement `IMFVLAAgent` on top of `VLAAgent`**
Implementation strategy:
- Reuse `VLAAgent.__init__`, `_build_cond()`, `reset()`, `_populate_queues()`, `_prepare_observation_batch()`, `select_action()`, `get_normalization_stats()`
- Override:
  - `compute_loss()` -> IMF objective
  - `predict_action()` -> one-step sample
- Provide internal helpers:
  - `_broadcast_batch_time`
  - `_apply_conditioning` (if needed)
  - `_compute_u_and_du_dt`
  - `_compound_velocity`
  - `_sample_one_step`
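For orientation, `_sample_one_step` could look roughly like this. This is a sketch under the assumption, suggested by the agent tests (`r=0`, `t=1`), that a single head call maps Gaussian noise over the full `[0, 1]` interval directly to the normalized action chunk; verify the exact semantics against the external repo before relying on it:

```python
import torch

@torch.no_grad()
def sample_one_step(head, cond, pred_horizon, action_dim):
    """One-step IMF sampling sketch (assumed semantics): one head call
    with r=0, t=1 turns Gaussian noise into the action chunk."""
    B = cond.shape[0]
    noise = torch.randn(B, pred_horizon, action_dim, device=cond.device)
    r = torch.zeros(B, device=cond.device)  # source time
    t = torch.ones(B, device=cond.device)   # target time
    return head(sample=noise, r=r, t=t, cond=cond)
```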
- [ ] **Step 2: Add a CUDA math-SDPA fallback on the JVP path, matching the external repo's stability strategy**
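The rationale: fused flash / memory-efficient SDPA kernels generally lack forward-mode AD rules, so `torch.func.jvp` through attention needs the math path. One way to sketch the fallback without touching global backend flags (an illustrative helper, not necessarily the external repo's exact mechanism):

```python
import math
import torch
import torch.nn.functional as F

def math_sdpa(q, k, v):
    """Plain math-path attention; uses SDPA's default 1/sqrt(E) scaling
    and is safe to differentiate with torch.func.jvp."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v

def attention(q, k, v, under_jvp: bool = False):
    # Route to the explicit math path inside JVP; use fused SDPA otherwise.
    if under_jvp:
        return math_sdpa(q, k, v)
    return F.scaled_dot_product_attention(q, k, v)
```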
- [ ] **Step 3: Add the Hydra configs so that `agent=resnet_imf_attnres` is instantiable**
Key defaults:
- `_target_: roboimi.vla.agent_imf.IMFVLAAgent`
- `head._target_: roboimi.vla.models.heads.imf_transformer1d.IMFTransformer1D`
- `head.backbone_type: attnres_full`
- `head.causal_attn: false`
- `head.time_as_cond: true`
- `head.n_cond_layers: 0`
- `inference_steps: 1`
- `camera_names: ${data.camera_names}`
- `vision_backbone.camera_names: ${agent.camera_names}`
- [ ] **Step 4: Make the training script reuse parameter grouping for any head that exposes `get_optim_groups()`, instead of hard-coding the old transformer head_type**
Recommended minimal change:
```python
use_head_groups = callable(getattr(noise_pred_net, 'get_optim_groups', None))
```
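Expanded slightly, the wiring could look like this. This is a sketch: `build_optimizer` is a hypothetical name, and the `get_optim_groups(weight_decay=...)` signature follows the minGPT-style convention used by diffusion_policy, which should be verified against the actual head:

```python
import torch

def build_optimizer(noise_pred_net, other_params, lr, weight_decay):
    """Prefer head-provided AdamW param groups when the head exposes them."""
    use_head_groups = callable(getattr(noise_pred_net, "get_optim_groups", None))
    if use_head_groups:
        groups = noise_pred_net.get_optim_groups(weight_decay=weight_decay)
    else:
        groups = [{"params": list(noise_pred_net.parameters()),
                   "weight_decay": weight_decay}]
    # vision backbone / other module parameters keep the default policy
    groups.append({"params": list(other_params), "weight_decay": weight_decay})
    return torch.optim.AdamW(groups, lr=lr)
```

This keeps the old heads on the default path while any head with `get_optim_groups()` (including the new IMF head) gets its own no-decay split.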
- [ ] **Step 5: Run the tests and fix any wiring issues**
Run:
- `python -m unittest tests.test_imf_vla_agent -v`
- `python -m unittest tests.test_train_vla_transformer_optimizer -v`
Expected: PASS
- [ ] **Step 6: Commit the agent / config / train-script wiring**
```bash
git add roboimi/vla/agent_imf.py \
roboimi/vla/conf/head/imf_transformer1d.yaml \
roboimi/vla/conf/agent/resnet_imf_attnres.yaml \
roboimi/demos/vla_scripts/train_vla.py \
tests/test_imf_vla_agent.py \
tests/test_train_vla_transformer_optimizer.py
git commit -m "feat: add IMF VLA agent and training wiring"
```
### Task 5: Integration verification and training launch
**Files:**
- Modify: none required unless verification exposes a real issue
- Use run artifacts under: `runs/`
- [ ] **Step 1: Run the focused test set**
Run:
```bash
python -m unittest \
tests.test_imf_transformer1d_external_alignment \
tests.test_imf_vla_agent \
tests.test_resnet_transformer_agent_wiring \
tests.test_train_vla_transformer_optimizer -v
```
Expected: PASS
- [ ] **Step 2: Run a minimal GPU training smoke test (no long run needed)**
Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=resnet_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
data.camera_names=[r_vis,top,front] \
train.device=cuda train.max_steps=2 train.batch_size=4 train.num_workers=2 \
train.use_swanlab=false train.rollout_val_freq_epochs=0
```
Expected: completes 2 steps successfully, produces a checkpoint / logs, no shape or JVP errors
- [ ] **Step 3: Launch the full IMF training run with the production parameters**
Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=resnet_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
data.camera_names=[r_vis,top,front] \
train.device=cuda train.val_split=0.0 train.seed=42 \
train.batch_size=80 train.lr=5e-4 train.num_workers=12 train.max_steps=150000 \
train.log_freq=100 train.save_freq=10000 train.use_swanlab=true \
train.swanlab_project=roboimi-vla \
train.rollout_val_freq_epochs=5 train.rollout_validate_on_checkpoint=false \
train.rollout_num_episodes=5 train.warmup_steps=2000 \
train.scheduler_type=cosine train.min_lr=1e-6 train.weight_decay=1e-5 train.grad_clip=1.0 \
agent.pred_horizon=16 agent.inference_steps=1 \
agent.head.n_emb=384 agent.head.n_layer=18 agent.head.n_head=1 agent.head.n_kv_head=1 \
agent.vision_backbone.pretrained_backbone_weights=null \
agent.vision_backbone.freeze_backbone=false \
agent.vision_backbone.use_separate_rgb_encoder_per_camera=true
```
Expected: training launches successfully, SwanLab records the full config, and a headless rollout runs every 5 epochs
- [ ] **Step 4: Record the run path, the training PID, and the SwanLab run name, then report back to the user**
- [ ] **Step 5: Commit any final cleanup (if the smoke test required extra patches)**
```bash
git add <changed files>
git commit -m "chore: verify IMF AttnRes training launch"
```