feat: add full attnres vision backbone
@@ -0,0 +1,64 @@
# Phase-2 Full-AttnRes Vision Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace all ResNet residual units in the vision backbone with AttnRes-based image blocks while preserving the current IMF agent interfaces, then launch a Phase-2 experiment anchored on the best Phase-1 horizon setting.

**Architecture:** Keep the current multi-camera encoder shell and per-camera output contract, but introduce a new ResNet-like 2D AttnRes backbone that preserves stage-wise downsampling and the final SpatialSoftmax conditioning. Wire it into the existing `ResNetDiffusionBackbone` via an opt-in mode and keep the agent/head/data interfaces unchanged.

**Tech Stack:** PyTorch, Hydra/OmegaConf, existing IMF AttnRes transformer components, pytest.

---

### Task 1: Add failing tests for the new full-AttnRes visual backbone

**Files:**
- Create: `tests/test_attnres_resnet2d_backbone.py`
- Update: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write a failing backbone shape test**
- [ ] **Step 2: Run it to confirm the new backbone/config does not exist yet**
- [ ] **Step 3: Add a failing IMF agent wiring test for unchanged cond_dim=208**
- [ ] **Step 4: Run the targeted tests and capture the failure**

### Task 2: Implement a ResNet-like 2D AttnRes backbone

**Files:**
- Create: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`

- [ ] **Step 1: Add minimal 2D tokenization helpers and positional encoding / bias handling**
- [ ] **Step 2: Implement `AttnResImageBlock2D` for feature maps**
- [ ] **Step 3: Implement `AttnResResNetLikeBackbone2D` with stage-wise downsampling**
- [ ] **Step 4: Wire `_SingleRgbEncoder` to choose between original ResNet trunk and the new full-AttnRes trunk**
- [ ] **Step 5: Run the new backbone tests**

### Task 3: Expose config switches and agent wiring

**Files:**
- Modify: `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- Modify: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`

- [ ] **Step 1: Add a backbone mode/config flag for the full-AttnRes vision trunk**
- [ ] **Step 2: Add defaults for attnres image depth/heads/etc. if needed**
- [ ] **Step 3: Add a Phase-2 launch override path that enables the new visual trunk**
- [ ] **Step 4: Run agent wiring tests again**
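
The opt-in switch in Steps 1 and 2 could be validated along the following lines. This is a sketch under assumptions: the flag name `vision_backbone_mode` and its two values follow the accompanying design document, while `VisionBackboneConfig`, `select_trunk_name`, and the depth/heads fields with their defaults are hypothetical placeholders, not the project's actual config schema.

```python
from dataclasses import dataclass

# The two modes named in the accompanying design document.
VALID_MODES = ("resnet", "attnres_resnet")


@dataclass(frozen=True)
class VisionBackboneConfig:
    # Defaulting to "resnet" keeps the existing behavior unless opted in.
    vision_backbone_mode: str = "resnet"
    # Hypothetical defaults for the new trunk; the real values are not given.
    attnres_image_depth: int = 2
    attnres_image_heads: int = 4

    def __post_init__(self):
        if self.vision_backbone_mode not in VALID_MODES:
            raise ValueError(
                f"vision_backbone_mode must be one of {VALID_MODES}, "
                f"got {self.vision_backbone_mode!r}"
            )


def select_trunk_name(cfg: VisionBackboneConfig) -> str:
    """Return which trunk the encoder should build for this config."""
    if cfg.vision_backbone_mode == "attnres_resnet":
        return "AttnResResNetLikeBackbone2D"
    return "ResNet trunk"
```

Rejecting unknown modes at construction time means a mistyped launch override fails immediately instead of silently falling back to the original trunk.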

### Task 4: Smoke-verify training path

**Files:**
- Reuse existing training scripts and configs

- [ ] **Step 1: Run a short CPU or tiny-step smoke instantiation / `compute_loss` test**
- [ ] **Step 2: If needed, run a very short training smoke launch**
- [ ] **Step 3: Verify no cond-dim or rollout-loading regressions**

### Task 5: Launch the Phase-2 experiment

**Files:**
- Update experiment tracking under `experiment_suites/`

- [ ] **Step 1: Use Phase-1 best setting (`pred_horizon=16`, `num_action_steps=8`)**
- [ ] **Step 2: Launch baseline reference or reuse existing result**
- [ ] **Step 3: Launch full-AttnRes vision experiment**
- [ ] **Step 4: Track rollout metrics and compare max avg_reward**
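
The comparison in Step 4 might be computed as sketched below. The `(step, avg_reward)` record layout and both function names are assumptions made for illustration; the actual tracking format under `experiment_suites/` is not described here.

```python
from typing import Dict, Iterable, Tuple

Rollout = Tuple[int, float]  # (training step, avg_reward); assumed layout


def best_avg_reward(rollouts: Iterable[Rollout]) -> Rollout:
    """Return the (step, avg_reward) pair with the highest avg_reward."""
    return max(rollouts, key=lambda record: record[1])


def compare_runs(baseline: Iterable[Rollout], phase2: Iterable[Rollout]) -> Dict:
    """Compare the two runs by their best rollout checkpoint."""
    best_base = best_avg_reward(baseline)
    best_new = best_avg_reward(phase2)
    return {
        "baseline": best_base,
        "phase2": best_new,
        # Positive delta means the full-AttnRes trunk reached a higher peak.
        "delta": best_new[1] - best_base[1],
    }
```

Comparing peaks rather than final values matches the "max avg_reward" wording, but it is sensitive to rollout noise, so the step at which each peak occurs is returned alongside the reward.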
@@ -0,0 +1,81 @@
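# Phase-2 Full-AttnRes Vision Design

## Goal
In the current roboimi IMF policy, replace every residual unit in the vision backbone that was previously provided by a ResNet BasicBlock/Bottleneck with an AttnRes-style unit, while keeping the existing agent / cond / rollout / training-script interfaces as unchanged as possible.

## User requirement interpretation
The requirement is read in its strictest sense:
- Not "add an AttnRes module after the ResNet".
- Not "mix in AttnRes at only a few stages".
- Instead: everywhere the vision trunk currently relies on a ResNet residual block, switch to a block driven by the AttnRes residual operator.
- The final output must still expose the same per-camera feature interface as the existing `ResNetDiffusionBackbone`, so that `SpatialSoftmax -> Linear -> ReLU`, multi-camera concatenation, state concat, and the IMF head's conditioning input can all be reused.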
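
For reference, the `SpatialSoftmax -> Linear -> ReLU` head that both trunks must feed can be sketched roughly as follows. This is an illustrative stand-in, not the project's implementation: the class name, the 1x1 convolution producing keypoint maps, and `num_keypoints=32` are assumptions; only the 64-dimensional per-camera output comes from this document.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialSoftmaxHead(nn.Module):
    """Hypothetical head: feature map -> expected keypoints -> Linear -> ReLU."""

    def __init__(self, in_channels: int, num_keypoints: int = 32, out_dim: int = 64):
        super().__init__()
        # Assumed: a 1x1 conv maps trunk channels to one heatmap per keypoint.
        self.to_keypoints = nn.Conv2d(in_channels, num_keypoints, kernel_size=1)
        self.proj = nn.Linear(2 * num_keypoints, out_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        logits = self.to_keypoints(feat).flatten(2)          # (B, K, H*W)
        probs = F.softmax(logits, dim=-1).view(b, -1, h, w)  # (B, K, H, W)
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device, dtype=feat.dtype)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device, dtype=feat.dtype)
        # Expected x/y coordinate of each keypoint under its softmax heatmap.
        exp_x = (probs.sum(dim=2) * xs).sum(dim=-1)          # (B, K)
        exp_y = (probs.sum(dim=3) * ys).sum(dim=-1)          # (B, K)
        keypoints = torch.cat([exp_x, exp_y], dim=-1)        # (B, 2K)
        return F.relu(self.proj(keypoints))                  # (B, out_dim)
```

Because the head only needs a `(B, C, H, W)` feature map, any trunk that preserves that layout can be swapped in without touching downstream code.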

## Recommended design

### Option A (recommended)
Keep ResNet's macro stage/stem structure and its channel/stride plan, but replace the BasicBlock/Bottleneck inside each stage with a new `AttnResImageBlock2D`:
- The input is still a `(B, C, H, W)` feature map.
- Inside the block, the spatial dimensions are first flattened into a token sequence `(B, H*W, C)`.
- The block transform is 2D positional encoding / learnable positional bias + AttnRes self-attention + AttnRes FFN.
- The result is reshaped back to `(B, C, H, W)`.
- Downsampling between stages is still done by an explicit stride/downsample path.
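
The data flow in the bullets above can be sketched as follows. This is a minimal stand-in, not `AttnResImageBlock2D` itself: the AttnRes residual operator is not specified in this document, so the sketch uses ordinary pre-norm attention and an FFN with plain additive residuals in its place. The class name, the fixed-size learnable positional bias, and all hyperparameter defaults are assumptions.

```python
import torch
import torch.nn as nn


class TokenMixingBlock2D(nn.Module):
    """Hypothetical stand-in showing flatten -> attention/FFN -> reshape."""

    def __init__(self, channels: int, height: int, width: int,
                 num_heads: int = 4, ffn_mult: int = 4):
        super().__init__()
        # Assumed: a learnable positional bias tied to a fixed H x W grid.
        self.pos_bias = nn.Parameter(torch.zeros(1, height * width, channels))
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, ffn_mult * channels),
            nn.GELU(),
            nn.Linear(ffn_mult * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        tokens = tokens + self.pos_bias
        normed = self.norm1(tokens)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        # Plain "+" residuals: the AttnRes operator would replace these.
        tokens = tokens + attn_out
        tokens = tokens + self.ffn(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```

Since the block maps `(B, C, H, W)` to the same shape, it can sit wherever a BasicBlock did. Note that attention cost grows quadratically with `H*W`, which may matter in the early, high-resolution stages.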
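
Pros:
- Closest to the requirement that "all residuals in the ResNet are replaced by AttnRes".
- Preserves the existing visual output interface and `cond_dim`, so the agent/head/data pipeline needs no changes.
- The existing multi-camera encoder framework can still be used.

Cons:
- Requires writing a new 2D AttnRes image block rather than directly reusing the 1D IMF head block.

### Option B
Remove the ResNet stages entirely and switch to patchify + a ViT/AttnRes image transformer, followed by SpatialSoftmax/MLP.

Pros: conceptually more uniform to implement.
Cons: this no longer counts as "replacing the residuals inside ResNet"; it swaps the backbone outright, which does not fully match the user requirement.

### Option C
Keep the existing ResNet blocks and only add AttnRes mixing around the outside of each block.

Not recommended, because it does not satisfy "all residuals are replaced by AttnRes".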

## Concrete architecture choice
Adopt Option A:
1. Keep the stem (conv / bn-or-gn / relu / maxpool) and the stage boundaries.
2. Add `AttnResImageBlock2D`.
3. Add `AttnResResNetLikeBackbone2D`, which stacks the stages and blocks.
4. Add an optional backbone mode to `ResNetDiffusionBackbone`, for example:
   - `vision_backbone_mode: resnet`
   - `vision_backbone_mode: attnres_resnet`
5. Add a Phase-2 variant of the `resnet_imf_attnres` agent config that enables `attnres_resnet` by default.
6. Keep unchanged:
   - per-camera output dimension: `64`
   - total multi-camera visual output: `3 * 64`
   - `cond_dim = 208` after concatenation with state
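
The dimension bookkeeping in item 6 can be checked with a few lines of arithmetic. The sketch assumes the conditioning vector is a plain concatenation of per-camera features and state; the state dimension of 16 is not stated in this document and is only inferred here as `208 - 3 * 64`.

```python
def cond_dim(num_cameras: int, per_camera_dim: int, state_dim: int) -> int:
    """Conditioning size, assuming simple concatenation of visual and state features."""
    return num_cameras * per_camera_dim + state_dim


NUM_CAMERAS = 3          # from "3 * 64" above
PER_CAMERA_DIM = 64      # per-camera output stated above
TARGET_COND_DIM = 208    # cond_dim stated above

# Inferred, not stated in the document: what the state must contribute.
implied_state_dim = TARGET_COND_DIM - NUM_CAMERAS * PER_CAMERA_DIM
```

If the new trunk changed the per-camera output size, `cond_dim` would drift from 208 and the IMF head's input layer would no longer match, which is why the wiring test pins this value.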

## Files likely to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- new: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- tests:
  - new: `tests/test_attnres_resnet2d_backbone.py`
  - update/add wiring test for agent cond dims

## Test plan
1. New backbone instantiates and forwards `(B,T,C,H,W)` multi-camera input
2. Output shape unchanged vs current backbone
3. `output_dim == 64`
4. 3-camera cond path still yields `208`
5. Phase-2 config instantiates full IMF agent successfully
6. One short CPU smoke forward for `compute_loss`
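
Items 1 to 3 hinge on how a `(B,T,C,H,W)` input reaches a 2D encoder. One common approach, shown here only as an assumption about how the existing backbone might handle it, is to fold time into the batch axis and unfold afterwards. `encode_sequence` is a hypothetical helper, and the encoder can be any module mapping `(N, C, H, W)` to `(N, D)`.

```python
import torch
import torch.nn as nn


def encode_sequence(images: torch.Tensor, encoder: nn.Module) -> torch.Tensor:
    """Apply a 2D encoder to (B, T, C, H, W) input and return (B, T, D)."""
    b, t = images.shape[:2]
    flat = images.flatten(0, 1)          # (B*T, C, H, W)
    feats = encoder(flat)                # (B*T, D)
    return feats.reshape(b, t, -1)       # (B, T, D)
```

A shape test for the new trunk could then assert that the last dimension equals 64 for each camera, independent of `B` and `T`.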

## Phase-2 experiment plan
Fix the best Phase-1 combination:
- `pred_horizon=16`
- `num_action_steps=8`

Compare:
1. baseline: current IMF head-only AttnRes + original ResNet vision backbone
2. phase2: IMF head AttnRes + full AttnRes-replaced vision backbone

Keep the training hyperparameters identical to the best Phase-1 setting, and start with one 50k-step comparison run.