Compare commits

1 commit — `feat-imf-a`

| Author | SHA1 | Date |
|---|---|---|
| ddt | 2376f494d2 | … |

`.gitignore` (vendored) — 7 lines changed
```diff
@@ -123,9 +123,4 @@ CLAUDE.md
 GEMINI.md
 
 # Copilot
 .github/copilot-instructions.md
-
-.hydra/
-
-# Local git worktrees
-.worktrees/
```
`README.en.md` — new file, 36 lines

@@ -0,0 +1,36 @@
# robo-imi-act

#### Description
{**When you're done, you can delete the content in this README and update the file with details for others getting started with your repository**}

#### Software Architecture
Software architecture description

#### Installation

1. xxxx
2. xxxx
3. xxxx

#### Instructions

1. xxxx
2. xxxx
3. xxxx

#### Contribution

1. Fork the repository
2. Create Feat_xxx branch
3. Commit your code
4. Create Pull Request

#### Gitee Feature

1. You can use Readme_XXX.md to support different languages, such as Readme_en.md, Readme_zh.md
2. Gitee blog [blog.gitee.com](https://blog.gitee.com)
3. Explore open source project [https://gitee.com/explore](https://gitee.com/explore)
4. The most valuable open source project [GVP](https://gitee.com/gvp)
5. The manual of Gitee [https://gitee.com/help](https://gitee.com/help)
6. The most popular members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
`README.md` — 208 lines → 39 lines

@@ -1,208 +1,39 @@
**Removed:**

# RoboIMI

A MuJoCo-based robot simulation and imitation-learning framework implementing a Vision-Language-Action (VLA) model with a diffusion policy for robot manipulation tasks.

## Key Features

- **Multi-robot support**: supports the Diana and vx300s arms, extensible to other robots
- **Diffusion policy**: state-of-the-art diffusion models (DDPM/DDIM) for action-sequence prediction
- **Vision-Language-Action model**: ResNet-18 vision backbone with spatial softmax for visual feature extraction
- **Flexible control modes**: joint-space and end-effector (Cartesian) control
- **Hydra configuration system**: modular configs for easy experimentation
- **HDF5 dataset format**: efficient storage and loading of demonstration data
- **Single- and dual-arm tasks**: supports both single-arm and dual-arm manipulation

## Installation

### Requirements

- Python 3.8+
- CUDA-capable GPU (recommended for training)
- Conda or Miniconda

### Steps

```bash
# Clone the repository
git clone <repository-url>
cd robo-imi-act

# Create and activate the conda environment
conda env create -f environment.yml
conda activate roboimi

# Install the package in development mode
pip install -e .
```

## Quick Start

### 1. Data Collection

Record demonstration trajectories in simulation:

```bash
# Record trajectories for the vx300s robot
python roboimi/demos/record_sim_episodes.py

# Record trajectories for the Diana robot
python roboimi/demos/diana_record_sim_episodes.py
```

Trajectories are saved as HDF5 files containing robot states, actions, and camera observations.

### 2. Compute Dataset Statistics

Normalization statistics must be computed before training:

```bash
python roboimi/vla/scripts/calculate_stats.py
```

This generates `data_stats.pkl`, containing the mean/std or min/max of the actions and observations.

### 3. Train the VLA Model

Train the vision-language-action model on the collected data:

```bash
# Train with the default configuration
python roboimi/demos/vla_scripts/train_vla.py

# Override specific parameters
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=32 train.lr=5e-5 train.max_steps=50000

# Use a different model architecture
python roboimi/demos/vla_scripts/train_vla.py agent=resnet_diffusion data=resnet_dataset
```

Training output is saved to `outputs/<date>/<time>/`, and model checkpoints to `checkpoints/`.

### 4. Evaluate the Model

Evaluate a trained model in simulation:

```bash
# Evaluate with the default configuration (uses the best checkpoint)
python roboimi/demos/vla_scripts/eval_vla.py

# Specify a checkpoint and number of evaluation episodes
python roboimi/demos/vla_scripts/eval_vla.py eval.ckpt_path=checkpoints/vla_model_step_8000.pt eval.num_episodes=5

# Enable action smoothing for smoother execution
python roboimi/demos/vla_scripts/eval_vla.py eval.use_smoothing=true eval.smooth_alpha=0.5
```

## Project Structure

```
robo-imi-act/
├── roboimi/
│   ├── assets/                    # Robot models and resources
│   │   ├── models/manipulators/   # URDF and MuJoCo XML files
│   │   └── robots/                # Robot abstraction classes
│   ├── envs/                      # Simulation environments
│   │   ├── mujoco_base.py         # MuJoCo environment base class
│   │   ├── single_base.py         # Single-arm task base class
│   │   └── double_base.py         # Dual-arm task base class
│   ├── vla/                       # Vision-Language-Action model
│   │   ├── agent.py               # VLAAgent (training and inference)
│   │   ├── models/
│   │   │   ├── backbones/         # Vision encoders (ResNet, etc.)
│   │   │   └── heads/             # Policy heads (diffusion UNet1D)
│   │   ├── conf/                  # Hydra configuration files
│   │   └── scripts/               # Training and utility scripts
│   └── demos/                     # Demo scripts and examples
├── checkpoints/                   # Saved model checkpoints
├── outputs/                       # Training output (Hydra)
├── environment.yml                # Conda environment definition
└── CLAUDE.md                      # Claude Code development guide
```

## Architecture

### VLA Training Pipeline

```
HDF5 trajectories → Dataset → DataLoader → VLAAgent → model checkpoints
```

**Model components**:
- **Vision backbone**: ResNet-18 + spatial softmax for extracting visual features from camera images
- **Diffusion head**: conditional UNet1D that predicts action sequences with DDPM/DDIM
- **VLAAgent**: combines the vision encoder and diffusion policy; handles training and inference

### Configuration System

Hydra configuration files live in `roboimi/vla/conf/`:
- `config.yaml`: main training configuration (batch size, learning rate, device)
- `agent/resnet_diffusion.yaml`: model architecture (action dims, observation dims, horizons)
- `data/resnet_dataset.yaml`: dataset paths, camera names, normalization type
- `eval/eval.yaml`: evaluation settings (checkpoint path, episode count, smoothing parameters)

Use config interpolation for consistency: `${agent.obs_horizon}`

### Dataset Format

HDF5 trajectory files (`episode_*.hdf5`) contain:
- `action`: robot actions `[T, action_dim]`
- `observations/qpos`: joint positions `[T, obs_dim]`
- `observations/images/<cam_name>`: camera images `[T, H, W, C]`

The statistics file (`data_stats.pkl`) stores normalization parameters (min/max/mean/std).

## Development Guide

### Adding a New Robot

1. Create URDF/XML files under `roboimi/assets/models/manipulators/<robot_name>/`
2. Define the robot class in `roboimi/assets/robots/<robot_name>.py` (inheriting from `arm_base.py`)
3. Create environment classes in `roboimi/envs/<robot_name>_*.py`
4. Register the robot in the constants if needed

### Modifying the VLA Architecture

1. **Custom backbone**: add a new class under `roboimi/vla/models/backbones/` inheriting `VLABackbone`
2. **Custom head**: add a new class under `roboimi/vla/models/heads/` inheriting `VLAHead`
3. **Update configs**: add a new YAML file under `roboimi/vla/conf/agent/`
4. **Interfaces**: see the abstract base classes in `roboimi/vla/core/interfaces.py`

### Training Best Practices

- Always run `calculate_stats.py` after collecting new data
- Inputs/outputs are normalized during training; inference unnormalizes using the statistics saved in the checkpoint
- The model predicts `pred_horizon` steps but only the first `action_horizon` steps are executed
- Inference uses DDIM (10 steps) for fast sampling; training uses DDPM (100 steps)
- Monitor the validation loss to prevent overfitting

## Technical Details

- **Coordinate spaces**: joint space (qpos) or end-effector space (xyz + rpy + gripper)
- **Action horizons**: `obs_horizon` is the observation window, `pred_horizon` the prediction window, and `action_horizon` the execution window
- **Normalization**: critical for stable training; always compute statistics before training
- **Inference speed-up**: the DDIM scheduler is about 10x faster than the DDPM used in training
- **Device**: configured via `train.device` (cuda/cpu)

## License

[Add license information here]

## Citation

If you use this codebase in your research, please cite:

```bibtex
[Add citation information here]
```

## Contributing

Contributions are welcome! Feel free to open a Pull Request or an Issue.

## Acknowledgements

This project builds on the following open-source projects:
- [MuJoCo](https://mujoco.org/) — physics simulation engine
- [PyTorch](https://pytorch.org/) — deep learning framework
- [Hydra](https://hydra.cc/) — configuration management
- [Diffusers](https://github.com/huggingface/diffusers) — diffusion model library

**Added:**

# robo-imi-act

#### Description

{**The following is the Gitee platform blurb; you can replace this introduction**

Gitee is a Git-based code hosting platform launched by OSCHINA (SVN is also supported), providing developers with a stable, efficient, and secure cloud platform for collaborative software development.

Whether you are an individual, a team, or an enterprise, you can use Gitee for code hosting, project management, and collaborative development. For enterprise projects, see [https://gitee.com/enterprises](https://gitee.com/enterprises)}

#### Software Architecture

Software architecture description

#### Installation

1. xxxx
2. xxxx
3. xxxx

#### Instructions

1. xxxx
2. xxxx
3. xxxx

#### Contribution

1. Fork this repository
2. Create a Feat_xxx branch
3. Commit your code
4. Create a Pull Request

#### Gitee Feature

1. Use Readme_XXX.md to support different languages, such as Readme_en.md, Readme_zh.md
2. Gitee official blog [blog.gitee.com](https://blog.gitee.com)
3. Visit [https://gitee.com/explore](https://gitee.com/explore) to discover outstanding open-source projects on Gitee
4. [GVP](https://gitee.com/gvp) stands for Gitee's Most Valuable open-source Projects, selected by overall evaluation
5. The official Gitee user manual [https://gitee.com/help](https://gitee.com/help)
6. Gitee Cover People is a column showcasing Gitee members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
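The receding-horizon scheme described in the removed README (predict `pred_horizon` steps, execute only the first `action_horizon`) can be sketched in pure Python. This is an illustrative toy, not the project's implementation: the `predict` callable stands in for the diffusion policy, and the "observation" is reduced to a single number.

```python
def run_receding_horizon(predict, obs, total_steps, pred_horizon=16, action_horizon=8):
    """Repeatedly predict pred_horizon actions but execute only the first action_horizon."""
    executed = []
    while len(executed) < total_steps:
        plan = predict(obs, pred_horizon)       # pred_horizon future actions
        executed.extend(plan[:action_horizon])  # execute only the head of the plan
        obs = executed[-1]                      # stand-in for reading the new observation
    return executed[:total_steps]

# Stub policy: "actions" simply count upward from the current observation.
def dummy_predict(obs, horizon):
    return [obs + i + 1 for i in range(horizon)]

actions = run_receding_horizon(dummy_predict, obs=0, total_steps=20)
```

Replanning every `action_horizon` steps is what lets the policy react to new observations while still predicting a longer, temporally consistent action sequence.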
Deleted file — 91 lines

@@ -1,91 +0,0 @@

```python
#!/usr/bin/env python3
"""
Check every episode for duplicated frames.

Identifies which episodes are problematic and should be deleted or re-collected.
"""
import glob
import os

import h5py
import numpy as np


def check_all_episodes():
    """Check the quality of every episode."""
    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
    # Re-sort numerically by episode index rather than lexicographically.
    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))

    print("=" * 80)
    print("Episode quality check")
    print("=" * 80)

    good_episodes = []
    bad_episodes = []

    for ep_idx, ep_file in enumerate(episode_files):
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')
        try:
            with h5py.File(ep_file, 'r') as f:
                img_path = '/observations/images/top'
                if img_path not in f:
                    continue
                images = f[img_path][:]

                # Check the first 50 frames for duplicates.
                check_frames = min(50, len(images))
                duplicate_count = 0
                for i in range(check_frames - 1):
                    img1 = images[i]
                    img2 = images[i + 1]
                    diff = np.mean(np.abs(img1.astype(float) - img2.astype(float)))
                    if diff < 1.0:  # Nearly identical: count as a duplicate.
                        duplicate_count += 1

                duplicate_rate = duplicate_count / check_frames * 100

                # Classify quality: more than 10% duplicates is a problem.
                if duplicate_rate > 10:
                    bad_episodes.append((ep_idx, ep_name, duplicate_rate, duplicate_count))
                    status = "❌"
                else:
                    good_episodes.append((ep_idx, ep_name, duplicate_rate, duplicate_count))
                    status = "✅"

                print(f"{status} Episode {ep_idx:2d}: {duplicate_rate:5.1f}% duplicates "
                      f"({duplicate_count:2d}/{check_frames}) - {ep_name}")
        except Exception as e:
            print(f"❌ Episode {ep_idx}: error - {e}")

    # Summary
    print("\n" + "=" * 80)
    print("Summary")
    print("=" * 80)
    print(f"Checked:     {len(episode_files)} episodes")
    print(f"Good:        {len(good_episodes)} ✅")
    print(f"Problematic: {len(bad_episodes)} ❌")

    if bad_episodes:
        print("\nProblematic episodes:")
        for ep_idx, ep_name, rate, count in bad_episodes:
            print(f"  - {ep_name}.hdf5: {rate:.1f}% duplicates")

        print("\nDeletion command:")
        ep_names = [name for _, name, _, _ in bad_episodes]
        print("  rm " + " ".join(f"{dataset_dir}/{name}.hdf5" for name in ep_names))

    print("\nRecommendation:")
    if bad_episodes:
        print(f"  1. Delete the {len(bad_episodes)} problematic episodes")
        print(f"  2. Re-collect data, or use the remaining {len(good_episodes)} good episodes")
    else:
        print("  ✅ All episodes look good and can be used as-is!")


if __name__ == "__main__":
    check_all_episodes()
```
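The duplicate test in the deleted script reduces to a mean absolute difference between consecutive frames, flagged when it falls below a threshold. A minimal pure-Python sketch of that metric (flat integer lists stand in for image arrays; the `1.0` threshold follows the script):

```python
def mean_abs_diff(img1, img2):
    """Mean absolute per-pixel difference between two equally sized flat frames."""
    return sum(abs(a - b) for a, b in zip(img1, img2)) / len(img1)

def count_duplicates(frames, threshold=1.0):
    """Count consecutive frame pairs that are nearly identical."""
    return sum(
        1 for i in range(len(frames) - 1)
        if mean_abs_diff(frames[i], frames[i + 1]) < threshold
    )

# Four 3-pixel "frames": pairs (0,1) and (2,3) are identical, pair (1,2) differs.
frames = [[10, 10, 10], [10, 10, 10], [10, 10, 50], [10, 10, 50]]
dups = count_duplicates(frames)
```

On real uint8 images the cast to float before subtracting (as the script does) matters, since unsigned subtraction would wrap around instead of producing a signed difference.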
`check_specific_frames.py` — deleted file, 202 lines

@@ -1,202 +0,0 @@

```python
#!/usr/bin/env python3
"""
Inspect specific frames - used to verify data-recording problems.

Features:
1. Extract frames 0, 1, and 2 of each episode
2. Compare the same frame index across episodes
3. Save the images for manual inspection
"""
import glob
import os

import cv2
import h5py
import numpy as np


def check_specific_frames(frame_indices=[0, 1, 2], camera='top', num_episodes=10):
    """
    Inspect the images and qpos of specific frames.

    Args:
        frame_indices: list of frame indices to inspect
        camera: camera name
        num_episodes: number of episodes to inspect
    """
    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
    # Sort numerically by episode index.
    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))

    # Create the output directory.
    output_dir = '/tmp/dataset_frames'
    os.makedirs(output_dir, exist_ok=True)

    print(f"Inspecting specific frames of the first {min(num_episodes, len(episode_files))} episodes")
    print(f"Frame indices: {frame_indices}")
    print(f"Camera: {camera}")
    print("=" * 80)

    for ep_idx in range(min(num_episodes, len(episode_files))):
        ep_file = episode_files[ep_idx]
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')
        try:
            with h5py.File(ep_file, 'r') as f:
                # Read qpos.
                qpos = f['/observations/qpos'][:]

                # Read the images.
                img_path = f'/observations/images/{camera}'
                if img_path not in f:
                    print(f"Episode {ep_name}: camera {camera} not found")
                    continue
                images = f[img_path][:]

                print(f"\nEpisode {ep_name}:")
                print(f"  Total frames: {len(images)}")

                # Save the requested frames.
                for frame_idx in frame_indices:
                    if frame_idx >= len(images):
                        print(f"  Frame {frame_idx}: out of range")
                        continue

                    img = images[frame_idx]
                    filename = f"{output_dir}/ep{ep_idx:02d}_frame{frame_idx:03d}.png"
                    cv2.imwrite(filename, img)

                    # Print qpos.
                    q = qpos[frame_idx]
                    print(f"  Frame {frame_idx}: qpos[0:3]=[{q[0]:6.2f}, {q[1]:6.2f}, {q[2]:6.2f}], "
                          f"qpos[3]={q[3]:6.2f} → (unknown)")
        except Exception as e:
            print(f"Episode {ep_name}: error - {e}")

    print("\n" + "=" * 80)
    print(f"✅ All images saved to: {output_dir}")
    print("\nTo view them:")
    print(f"  eog {output_dir}/*.png")
    print()
    print("  # Or compare a specific frame:")
    print(f"  eog {output_dir}/*_frame000.png  # frame 0 of every episode")
    print(f"  eog {output_dir}/*_frame001.png  # frame 1 of every episode")
    print(f"  eog {output_dir}/*_frame002.png  # frame 2 of every episode")


def compare_frame_across_episodes(frame_idx=0, camera='top', num_episodes=10):
    """
    Compare one frame side by side across all episodes.

    Produces a single large comparison image containing the given frame of every episode.
    """
    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))

    num_compare = min(num_episodes, len(episode_files))
    cols = 5  # 5 tiles per row
    rows = (num_compare + cols - 1) // cols

    # Create the output directory.
    output_dir = '/tmp/dataset_frames'
    os.makedirs(output_dir, exist_ok=True)

    print(f"Building comparison image: frame {frame_idx} of every episode")
    print("=" * 80)

    # Collect the images.
    images_compare = []
    qpos_list = []
    for ep_idx in range(num_compare):
        ep_file = episode_files[ep_idx]
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')
        try:
            with h5py.File(ep_file, 'r') as f:
                qpos = f['/observations/qpos'][:]
                img_path = f'/observations/images/{camera}'

                if img_path in f and frame_idx < f[img_path].shape[0]:
                    img = f[img_path][frame_idx]
                    images_compare.append(img)
                    qpos_list.append(qpos[frame_idx])
                    print(f"Episode {ep_name}: qpos[0:3]=[{qpos[frame_idx][0]:.2f}, "
                          f"{qpos[frame_idx][1]:.2f}, {qpos[frame_idx][2]:.2f}]")
        except Exception as e:
            print(f"Episode {ep_name}: error - {e}")

    if not images_compare:
        print("❌ No images collected")
        return

    # Tile size.
    h, w = images_compare[0].shape[:2]

    # Build the comparison canvas.
    compare_img = np.zeros((rows * h + 50, cols * w, 3), dtype=np.uint8)

    for i, (img, qpos) in enumerate(zip(images_compare, qpos_list)):
        row = i // cols
        col = i % cols

        y_start = row * h + 30
        y_end = y_start + h
        x_start = col * w
        x_end = x_start + w

        # Resize if necessary.
        if img.shape[:2] != (h, w):
            img = cv2.resize(img, (w, h))

        compare_img[y_start:y_end, x_start:x_end] = img

        # Annotate the tile.
        ep_name = f"Ep {i}"
        cv2.putText(compare_img, ep_name, (x_start + 10, row * h + 20),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
        cv2.putText(compare_img, f"qpos[3]={qpos[3]:.2f}", (x_start + 10, y_end - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

    # Save the comparison image.
    output_path = f"{output_dir}/compare_frame{frame_idx:03d}.png"
    cv2.imwrite(output_path, compare_img)

    print(f"\n✅ Comparison image saved: {output_path}")
    print(f"   View it with: eog {output_path}")


if __name__ == "__main__":
    import sys

    print("=" * 80)
    print("Specific-frame inspection tool")
    print("=" * 80)

    if len(sys.argv) > 1:
        frame_idx = int(sys.argv[1])
        compare_frame_across_episodes(frame_idx=frame_idx, camera='top', num_episodes=10)
    else:
        # By default, inspect frames 0, 1, and 2.
        check_specific_frames(frame_indices=[0, 1, 2], camera='top', num_episodes=10)

        print("\n" + "=" * 80)
        print("Building comparison images...")
        print("=" * 80)

        # Build comparison images for frames 0-2.
        compare_frame_across_episodes(frame_idx=0, camera='top', num_episodes=10)
        compare_frame_across_episodes(frame_idx=1, camera='top', num_episodes=10)
        compare_frame_across_episodes(frame_idx=2, camera='top', num_episodes=10)

    print("\n" + "=" * 80)
    print("Other usage:")
    print("  python check_specific_frames.py 0  # inspect only frame 0")
    print("  python check_specific_frames.py 1  # inspect only frame 1")
    print("=" * 80)
```
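Both deleted scripts re-sort the glob results with a numeric key, because a plain lexicographic sort puts `episode_10` before `episode_2`. A standalone sketch with hypothetical filenames:

```python
files = ["episode_10.hdf5", "episode_2.hdf5", "episode_1.hdf5"]

# Lexicographic sort: "episode_10" sorts before "episode_2" because '1' < '2'.
lexicographic = sorted(files)

# Numeric sort on the trailing index, as the scripts do.
numeric = sorted(files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))
```

The numeric key is what makes `episode_{ep_idx}` in the scripts' log output line up with the actual file order on disk.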
@@ -1,238 +0,0 @@
|
|||||||
#!/usr/bin/env python
|
|
||||||
|
|
||||||
# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
|
|
||||||
# and The HuggingFace Inc. team. All rights reserved.
|
|
||||||
#
|
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
# you may not use this file except in compliance with the License.
|
|
||||||
# You may obtain a copy of the License at
|
|
||||||
#
|
|
||||||
# http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
#
|
|
||||||
# Unless required by applicable law or agreed to in writing, software
|
|
||||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
# See the License for the specific language governing permissions and
|
|
||||||
# limitations under the License.
|
|
||||||
from dataclasses import dataclass, field
|
|
||||||
|
|
||||||
from lerobot.configs.policies import PreTrainedConfig
|
|
||||||
from lerobot.configs.types import NormalizationMode
|
|
||||||
from lerobot.optim.optimizers import AdamConfig
|
|
||||||
from lerobot.optim.schedulers import DiffuserSchedulerConfig
|
|
||||||
|
|
||||||
|
|
||||||
@PreTrainedConfig.register_subclass("diffusion")
|
|
||||||
@dataclass
|
|
||||||
class DiffusionConfig(PreTrainedConfig):
|
|
||||||
"""Configuration class for DiffusionPolicy.
|
|
||||||
|
|
||||||
Defaults are configured for training with PushT providing proprioceptive and single camera observations.
|
|
||||||
|
|
||||||
The parameters you will most likely need to change are the ones which depend on the environment / sensors.
|
|
||||||
Those are: `input_shapes` and `output_shapes`.
|
|
||||||
|
|
||||||
Notes on the inputs and outputs:
|
|
||||||
- "observation.state" is required as an input key.
|
|
||||||
- Either:
|
|
||||||
- At least one key starting with "observation.image is required as an input.
|
|
||||||
AND/OR
|
|
||||||
- The key "observation.environment_state" is required as input.
|
|
||||||
- If there are multiple keys beginning with "observation.image" they are treated as multiple camera
|
|
||||||
views. Right now we only support all images having the same shape.
|
|
||||||
- "action" is required as an output key.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
|
|
||||||
current step and additional steps going back).
|
|
||||||
horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
|
|
||||||
n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
|
|
||||||
See `DiffusionPolicy.select_action` for more details.
|
|
||||||
input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
|
|
||||||
the input data name, and the value is a list indicating the dimensions of the corresponding data.
|
|
||||||
For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
|
|
||||||
indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
|
|
||||||
include batch dimension or temporal dimension.
|
|
||||||
output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
|
|
||||||
the output data name, and the value is a list indicating the dimensions of the corresponding data.
|
|
||||||
For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
|
|
||||||
Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
|
|
||||||
input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
|
|
||||||
and the value specifies the normalization mode to apply. The two available modes are "mean_std"
|
|
||||||
which subtracts the mean and divides by the standard deviation and "min_max" which rescale in a
|
|
||||||
[-1, 1] range.
|
|
||||||
output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
|
|
||||||
original scale. Note that this is also used for normalizing the training targets.
|
|
||||||
vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
|
|
||||||
crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
|
|
||||||
within the image size. If None, no cropping is done.
|
|
||||||
crop_is_random: Whether the crop should be random at training time (it's always a center crop in eval
|
|
||||||
mode).
|
|
||||||
pretrained_backbone_weights: Pretrained weights from torchvision to initialize the backbone.
|
|
||||||
`None` means no pretrained weights.
|
|
||||||
use_group_norm: Whether to replace batch normalization with group normalization in the backbone.
|
|
||||||
The group sizes are set to be about 16 (to be precise, feature_dim // 16).
|
|
||||||
spatial_softmax_num_keypoints: Number of keypoints for SpatialSoftmax.
|
|
||||||
use_separate_rgb_encoders_per_camera: Whether to use a separate RGB encoder for each camera view.
|
|
||||||
down_dims: Feature dimension for each stage of temporal downsampling in the diffusion modeling Unet.
|
|
||||||
You may provide a variable number of dimensions, therefore also controlling the degree of
|
|
||||||
downsampling.
|
|
||||||
kernel_size: The convolutional kernel size of the diffusion modeling Unet.
|
|
||||||
n_groups: Number of groups used in the group norm of the Unet's convolutional blocks.
|
|
||||||
diffusion_step_embed_dim: The Unet is conditioned on the diffusion timestep via a small non-linear
|
|
||||||
network. This is the output dimension of that network, i.e., the embedding dimension.
|
|
||||||
use_film_scale_modulation: FiLM (https://huggingface.co/papers/1709.07871) is used for the Unet conditioning.
|
|
||||||
Bias modulation is used be default, while this parameter indicates whether to also use scale
|
|
||||||
modulation.
|
|
||||||
noise_scheduler_type: Name of the noise scheduler to use. Supported options: ["DDPM", "DDIM"].
|
|
||||||
num_train_timesteps: Number of diffusion steps for the forward diffusion schedule.
|
|
||||||
beta_schedule: Name of the diffusion beta schedule as per DDPMScheduler from Hugging Face diffusers.
|
|
||||||
beta_start: Beta value for the first forward-diffusion step.
|
|
||||||
beta_end: Beta value for the last forward-diffusion step.
|
|
||||||
prediction_type: The type of prediction that the diffusion modeling Unet makes. Choose from "epsilon"
|
|
||||||
or "sample". These have equivalent outcomes from a latent variable modeling perspective, but
|
|
||||||
"epsilon" has been shown to work better in many deep neural network settings.
|
|
||||||
clip_sample: Whether to clip the sample to [-`clip_sample_range`, +`clip_sample_range`] for each
|
|
||||||
denoising step at inference time. WARNING: you will need to make sure your action-space is
|
|
||||||
normalized to fit within this range.
|
|
||||||
clip_sample_range: The magnitude of the clipping range as described above.
|
|
||||||
num_inference_steps: Number of reverse diffusion steps to use at inference time (steps are evenly
|
|
||||||
spaced). If not provided, this defaults to be the same as `num_train_timesteps`.
|
|
||||||
do_mask_loss_for_padding: Whether to mask the loss when there are copy-padded actions. See
|
|
||||||
`LeRobotDataset` and `load_previous_and_future_frames` for more information. Note, this defaults
|
|
||||||
to False as the original Diffusion Policy implementation does the same.
|
|
||||||
"""
|
|
||||||
|
|
||||||
# Inputs / output structure.
|
|
||||||
n_obs_steps: int = 2
|
|
||||||
horizon: int = 16
|
|
||||||
n_action_steps: int = 8
|
|
||||||
|
|
||||||
normalization_mapping: dict[str, NormalizationMode] = field(
|
|
||||||
default_factory=lambda: {
|
|
||||||
"VISUAL": NormalizationMode.MEAN_STD,
|
|
||||||
"STATE": NormalizationMode.MIN_MAX,
|
|
||||||
"ACTION": NormalizationMode.MIN_MAX,
|
|
||||||
}
|
|
||||||
)
|
|
||||||
|
|
||||||
# The original implementation doesn't sample frames for the last 7 steps,
|
|
||||||
# which avoids excessive padding and leads to improved training results.
|
|
||||||
drop_n_last_frames: int = 7 # horizon - n_action_steps - n_obs_steps + 1
|
|
||||||
|
|
||||||
# Architecture / modeling.
|
|
||||||
# Vision backbone.
|
|
||||||
vision_backbone: str = "resnet18"
|
|
||||||
crop_shape: tuple[int, int] | None = (84, 84)
|
|
||||||
crop_is_random: bool = True
|
|
||||||
pretrained_backbone_weights: str | None = None
|
|
||||||
use_group_norm: bool = True
|
|
||||||
spatial_softmax_num_keypoints: int = 32
|
|
||||||
use_separate_rgb_encoder_per_camera: bool = False
|
|
||||||
# Unet.
|
|
||||||
down_dims: tuple[int, ...] = (512, 1024, 2048)
|
|
||||||
kernel_size: int = 5
|
|
||||||
n_groups: int = 8
|
|
||||||
diffusion_step_embed_dim: int = 128
|
|
||||||
use_film_scale_modulation: bool = True
|
|
||||||
# Noise scheduler.
|
|
||||||
noise_scheduler_type: str = "DDPM"
|
|
||||||
num_train_timesteps: int = 100
|
|
||||||
beta_schedule: str = "squaredcos_cap_v2"
|
|
||||||
beta_start: float = 0.0001
|
|
||||||
beta_end: float = 0.02
|
|
||||||
prediction_type: str = "epsilon"
|
|
||||||
clip_sample: bool = True
|
|
||||||
clip_sample_range: float = 1.0
|
|
||||||
|
|
||||||
# Inference
|
|
||||||
num_inference_steps: int | None = None
|
|
||||||
|
|
||||||
# Loss computation
|
|
||||||
do_mask_loss_for_padding: bool = False
|
|
||||||
|
|
||||||
# Training presets
|
|
||||||
optimizer_lr: float = 1e-4
|
|
||||||
optimizer_betas: tuple = (0.95, 0.999)
|
|
||||||
optimizer_eps: float = 1e-8
|
|
||||||
optimizer_weight_decay: float = 1e-6
|
|
||||||
scheduler_name: str = "cosine"
|
|
||||||
scheduler_warmup_steps: int = 500
|
|
||||||
|
|
||||||
def __post_init__(self):
|
|
||||||
super().__post_init__()
|
|
||||||
|
|
||||||
"""Input validation (not exhaustive)."""
|
|
||||||
if not self.vision_backbone.startswith("resnet"):
|
|
||||||
raise ValueError(
|
|
||||||
f"`vision_backbone` must be one of the ResNet variants. Got {self.vision_backbone}."
|
|
||||||
)
|
|
||||||
|
|
||||||
supported_prediction_types = ["epsilon", "sample"]
|
|
||||||
if self.prediction_type not in supported_prediction_types:
|
|
||||||
raise ValueError(
|
|
||||||
f"`prediction_type` must be one of {supported_prediction_types}. Got {self.prediction_type}."
|
|
||||||
)
|
|
||||||
supported_noise_schedulers = ["DDPM", "DDIM"]
|
|
||||||
if self.noise_scheduler_type not in supported_noise_schedulers:
|
|
||||||
raise ValueError(
|
|
||||||
f"`noise_scheduler_type` must be one of {supported_noise_schedulers}. "
|
|
||||||
f"Got {self.noise_scheduler_type}."
|
|
||||||
)
|
|
||||||
|
|
||||||
# Check that the horizon size and U-Net downsampling is compatible.
|
|
||||||
# U-Net downsamples by 2 with each stage.
|
|
||||||
downsampling_factor = 2 ** len(self.down_dims)
|
|
||||||
if self.horizon % downsampling_factor != 0:
|
|
||||||
raise ValueError(
|
|
||||||
"The horizon should be an integer multiple of the downsampling factor (which is determined "
|
|
||||||
f"by `len(down_dims)`). Got {self.horizon=} and {self.down_dims=}"
|
|
||||||
)
|
|
||||||
|
|
||||||
def get_optimizer_preset(self) -> AdamConfig:
|
|
||||||
return AdamConfig(
|
|
||||||
lr=self.optimizer_lr,
|
|
||||||
betas=self.optimizer_betas,
|
|
||||||
eps=self.optimizer_eps,
|
|
||||||
weight_decay=self.optimizer_weight_decay,
|
|
||||||
)
|
|
||||||
|
|
||||||
def get_scheduler_preset(self) -> DiffuserSchedulerConfig:
|
|
||||||
return DiffuserSchedulerConfig(
|
|
||||||
name=self.scheduler_name,
|
|
||||||
num_warmup_steps=self.scheduler_warmup_steps,
|
|
||||||
)
|
|
||||||
|
|
||||||
    def validate_features(self) -> None:
        if len(self.image_features) == 0 and self.env_state_feature is None:
            raise ValueError("You must provide at least one image or the environment state among the inputs.")

        if self.crop_shape is not None:
            for key, image_ft in self.image_features.items():
                if self.crop_shape[0] > image_ft.shape[1] or self.crop_shape[1] > image_ft.shape[2]:
                    raise ValueError(
                        f"`crop_shape` should fit within the images shapes. Got {self.crop_shape} "
                        f"for `crop_shape` and {image_ft.shape} for "
                        f"`{key}`."
                    )

        # Check that all input images have the same shape.
        if len(self.image_features) > 0:
            first_image_key, first_image_ft = next(iter(self.image_features.items()))
            for key, image_ft in self.image_features.items():
                if image_ft.shape != first_image_ft.shape:
                    raise ValueError(
                        f"`{key}` does not match `{first_image_key}`, but we expect all image shapes to match."
                    )
    @property
    def observation_delta_indices(self) -> list:
        return list(range(1 - self.n_obs_steps, 1))

    @property
    def action_delta_indices(self) -> list:
        return list(range(1 - self.n_obs_steps, 1 - self.n_obs_steps + self.horizon))

    @property
    def reward_delta_indices(self) -> None:
        return None
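The delta-index properties define which dataset frames (relative to the current step) are loaded for observations and actions. A stdlib sketch with example values `n_obs_steps=2` and `horizon=16` (illustrative numbers, not universal defaults):

```python
def observation_delta_indices(n_obs_steps: int) -> list:
    # The n_obs_steps most recent frames, ending at the current step (offset 0).
    return list(range(1 - n_obs_steps, 1))

def action_delta_indices(n_obs_steps: int, horizon: int) -> list:
    # A horizon-length action window measured from the oldest observation.
    return list(range(1 - n_obs_steps, 1 - n_obs_steps + horizon))

print(observation_delta_indices(2))       # offsets of the cached observations
print(action_delta_indices(2, 16)[:3])    # first few action offsets
```

With these values the observations sit at offsets `[-1, 0]`, and the 16 action offsets run from -1 through 14, which is why `select_action` later discards the first `n_obs_steps - 1` sampled actions.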
@@ -1,764 +0,0 @@
#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"

TODO(alexander-soare):
  - Remove reliance on diffusers for DDPMScheduler and LR scheduler.
"""
import math
from collections import deque
from collections.abc import Callable

import einops
import numpy as np
import torch
import torch.nn.functional as F  # noqa: N812
import torchvision
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
from torch import Tensor, nn

from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.policies.utils import (
    get_device_from_parameters,
    get_dtype_from_parameters,
    get_output_shape,
    populate_queues,
)
from lerobot.utils.constants import ACTION, OBS_ENV_STATE, OBS_IMAGES, OBS_STATE
class DiffusionPolicy(PreTrainedPolicy):
    """
    Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
    (paper: https://huggingface.co/papers/2303.04137, code: https://github.com/real-stanford/diffusion_policy).
    """

    config_class = DiffusionConfig
    name = "diffusion"

    def __init__(
        self,
        config: DiffusionConfig,
        **kwargs,
    ):
        """
        Args:
            config: Policy configuration class instance or None, in which case the default instantiation of
                the configuration class is used.
            dataset_stats: Dataset statistics to be used for normalization. If not passed here, it is expected
                that they will be passed with a call to `load_state_dict` before the policy is used.
        """
        super().__init__(config)
        config.validate_features()
        self.config = config

        # Queues are populated during rollout of the policy. They contain the n latest observations and actions.
        self._queues = None

        self.diffusion = DiffusionModel(config)

        self.reset()

    def get_optim_params(self) -> dict:
        return self.diffusion.parameters()

    def reset(self):
        """Clear observation and action queues. Should be called on `env.reset()`"""
        self._queues = {
            OBS_STATE: deque(maxlen=self.config.n_obs_steps),
            ACTION: deque(maxlen=self.config.n_action_steps),
        }
        if self.config.image_features:
            self._queues[OBS_IMAGES] = deque(maxlen=self.config.n_obs_steps)
        if self.config.env_state_feature:
            self._queues[OBS_ENV_STATE] = deque(maxlen=self.config.n_obs_steps)
    @torch.no_grad()
    def predict_action_chunk(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
        """Predict a chunk of actions given environment observations."""
        # Stack the n latest observations from the queue.
        batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
        actions = self.diffusion.generate_actions(batch, noise=noise)

        return actions
    @torch.no_grad()
    def select_action(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
        """Select a single action given environment observations.

        This method handles caching a history of observations and an action trajectory generated by the
        underlying diffusion model. Here's how it works:
          - `n_obs_steps` steps worth of observations are cached (for the first steps, the observation is
            copied `n_obs_steps` times to fill the cache).
          - The diffusion model generates `horizon` steps worth of actions.
          - `n_action_steps` worth of actions are actually kept for execution, starting from the current step.
        Schematically this looks like:
            ----------------------------------------------------------------------------------------------
            (legend: o = n_obs_steps, h = horizon, a = n_action_steps)
            |timestep            | n-o+1 | n-o+2 | ..... | n     | ..... | n+a-1 | n+a   | ..... | n-o+h |
            |observation is used | YES   | YES   | YES   | YES   | NO    | NO    | NO    | NO    | NO    |
            |action is generated | YES   | YES   | YES   | YES   | YES   | YES   | YES   | YES   | YES   |
            |action is used      | NO    | NO    | NO    | YES   | YES   | YES   | NO    | NO    | NO    |
            ----------------------------------------------------------------------------------------------
        Note that this means we require: `n_action_steps <= horizon - n_obs_steps + 1`. Also, note that
        "horizon" may not be the best name to describe what the variable actually means, because this period
        is actually measured from the first observation which (if `n_obs_steps` > 1) happened in the past.
        """
        # NOTE: for offline evaluation, we have the action in the batch, so we need to pop it out.
        if ACTION in batch:
            batch.pop(ACTION)

        if self.config.image_features:
            batch = dict(batch)  # shallow copy so that adding a key doesn't modify the original
            batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
        # NOTE: It's important that this happens after stacking the images into a single key.
        self._queues = populate_queues(self._queues, batch)

        if len(self._queues[ACTION]) == 0:
            actions = self.predict_action_chunk(batch, noise=noise)
            self._queues[ACTION].extend(actions.transpose(0, 1))

        action = self._queues[ACTION].popleft()
        return action
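The queueing logic above can be re-enacted with a stdlib deque and a stub in place of the diffusion model (`generate_chunk` below is a placeholder, not part of the library): a new chunk is only generated when the action queue runs dry, so the expensive model is queried once every `n_action_steps` environment steps.

```python
from collections import deque

def select_action(queue: deque, generate_chunk) -> int:
    # Refill the queue with a fresh chunk only when it is empty,
    # then pop one action per environment step.
    if len(queue) == 0:
        queue.extend(generate_chunk())
    return queue.popleft()

queue = deque(maxlen=3)
calls = []

def generate_chunk():
    calls.append(1)  # track how often the "model" is invoked
    return [10, 20, 30]  # stand-in for a chunk of n_action_steps=3 actions

actions = [select_action(queue, generate_chunk) for _ in range(6)]
print(actions)      # six executed actions
print(len(calls))   # chunk generations needed for six steps
```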
    def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, None]:
        """Run the batch through the model and compute the loss for training or validation."""
        if self.config.image_features:
            batch = dict(batch)  # shallow copy so that adding a key doesn't modify the original
            batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
        loss = self.diffusion.compute_loss(batch)
        # no output_dict, so returning None
        return loss, None
def _make_noise_scheduler(name: str, **kwargs: dict) -> DDPMScheduler | DDIMScheduler:
    """
    Factory for noise scheduler instances of the requested type. All kwargs are passed
    to the scheduler.
    """
    if name == "DDPM":
        return DDPMScheduler(**kwargs)
    elif name == "DDIM":
        return DDIMScheduler(**kwargs)
    else:
        raise ValueError(f"Unsupported noise scheduler type {name}")
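The same factory pattern, sketched without the diffusers dependency. The scheduler classes here are stand-in placeholders (not the real `DDPMScheduler`/`DDIMScheduler`); the point is the name-to-constructor dispatch with forwarded keyword arguments.

```python
class FakeDDPM:
    def __init__(self, **kwargs):
        self.kwargs = kwargs  # the factory forwards all kwargs untouched

class FakeDDIM:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

_SCHEDULERS = {"DDPM": FakeDDPM, "DDIM": FakeDDIM}

def make_noise_scheduler(name: str, **kwargs):
    # Dict dispatch replaces the if/elif chain; unknown names raise ValueError.
    try:
        return _SCHEDULERS[name](**kwargs)
    except KeyError:
        raise ValueError(f"Unsupported noise scheduler type {name}") from None

scheduler = make_noise_scheduler("DDPM", num_train_timesteps=100)
print(type(scheduler).__name__, scheduler.kwargs)
```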
class DiffusionModel(nn.Module):
    def __init__(self, config: DiffusionConfig):
        super().__init__()
        self.config = config

        # Build observation encoders (depending on which observations are provided).
        global_cond_dim = self.config.robot_state_feature.shape[0]
        if self.config.image_features:
            num_images = len(self.config.image_features)
            if self.config.use_separate_rgb_encoder_per_camera:
                encoders = [DiffusionRgbEncoder(config) for _ in range(num_images)]
                self.rgb_encoder = nn.ModuleList(encoders)
                global_cond_dim += encoders[0].feature_dim * num_images
            else:
                self.rgb_encoder = DiffusionRgbEncoder(config)
                global_cond_dim += self.rgb_encoder.feature_dim * num_images
        if self.config.env_state_feature:
            global_cond_dim += self.config.env_state_feature.shape[0]

        self.unet = DiffusionConditionalUnet1d(config, global_cond_dim=global_cond_dim * config.n_obs_steps)

        self.noise_scheduler = _make_noise_scheduler(
            config.noise_scheduler_type,
            num_train_timesteps=config.num_train_timesteps,
            beta_start=config.beta_start,
            beta_end=config.beta_end,
            beta_schedule=config.beta_schedule,
            clip_sample=config.clip_sample,
            clip_sample_range=config.clip_sample_range,
            prediction_type=config.prediction_type,
        )

        if config.num_inference_steps is None:
            self.num_inference_steps = self.noise_scheduler.config.num_train_timesteps
        else:
            self.num_inference_steps = config.num_inference_steps
    # ========= inference ============
    def conditional_sample(
        self,
        batch_size: int,
        global_cond: Tensor | None = None,
        generator: torch.Generator | None = None,
        noise: Tensor | None = None,
    ) -> Tensor:
        device = get_device_from_parameters(self)
        dtype = get_dtype_from_parameters(self)

        # Sample prior.
        sample = (
            noise
            if noise is not None
            else torch.randn(
                size=(batch_size, self.config.horizon, self.config.action_feature.shape[0]),
                dtype=dtype,
                device=device,
                generator=generator,
            )
        )

        self.noise_scheduler.set_timesteps(self.num_inference_steps)

        for t in self.noise_scheduler.timesteps:
            # Predict model output.
            model_output = self.unet(
                sample,
                torch.full(sample.shape[:1], t, dtype=torch.long, device=sample.device),
                global_cond=global_cond,
            )
            # Compute the previous sample: x_t -> x_{t-1}
            sample = self.noise_scheduler.step(model_output, t, sample, generator=generator).prev_sample

        return sample
    def _prepare_global_conditioning(self, batch: dict[str, Tensor]) -> Tensor:
        """Encode image features and concatenate them all together along with the state vector."""
        batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
        global_cond_feats = [batch[OBS_STATE]]
        # Extract image features.
        if self.config.image_features:
            if self.config.use_separate_rgb_encoder_per_camera:
                # Combine batch and sequence dims while rearranging to make the camera index dimension first.
                images_per_camera = einops.rearrange(batch[OBS_IMAGES], "b s n ... -> n (b s) ...")
                img_features_list = torch.cat(
                    [
                        encoder(images)
                        for encoder, images in zip(self.rgb_encoder, images_per_camera, strict=True)
                    ]
                )
                # Separate batch and sequence dims back out. The camera index dim gets absorbed into the
                # feature dim (effectively concatenating the camera features).
                img_features = einops.rearrange(
                    img_features_list, "(n b s) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
                )
            else:
                # Combine batch, sequence, and "which camera" dims before passing to shared encoder.
                img_features = self.rgb_encoder(
                    einops.rearrange(batch[OBS_IMAGES], "b s n ... -> (b s n) ...")
                )
                # Separate batch dim and sequence dim back out. The camera index dim gets absorbed into the
                # feature dim (effectively concatenating the camera features).
                img_features = einops.rearrange(
                    img_features, "(b s n) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
                )
            global_cond_feats.append(img_features)

        if self.config.env_state_feature:
            global_cond_feats.append(batch[OBS_ENV_STATE])

        # Concatenate features then flatten to (B, global_cond_dim).
        return torch.cat(global_cond_feats, dim=-1).flatten(start_dim=1)
    def generate_actions(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
        """
        This function expects `batch` to have:
        {
            "observation.state": (B, n_obs_steps, state_dim)

            "observation.images": (B, n_obs_steps, num_cameras, C, H, W)
                AND/OR
            "observation.environment_state": (B, n_obs_steps, environment_dim)
        }
        """
        batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
        assert n_obs_steps == self.config.n_obs_steps

        # Encode image features and concatenate them all together along with the state vector.
        global_cond = self._prepare_global_conditioning(batch)  # (B, global_cond_dim)

        # Run sampling.
        actions = self.conditional_sample(batch_size, global_cond=global_cond, noise=noise)

        # Extract `n_action_steps` steps worth of actions (from the current observation).
        start = n_obs_steps - 1
        end = start + self.config.n_action_steps
        actions = actions[:, start:end]

        return actions
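The slice at the end can be checked with stdlib index arithmetic. With example values `n_obs_steps=2`, `horizon=16`, `n_action_steps=8` (illustrative, not mandated by the library), step index `n_obs_steps - 1` is "now", so the executed window is steps 1 through 8 of the 16-step trajectory.

```python
def action_slice(n_obs_steps: int, n_action_steps: int) -> slice:
    # The first n_obs_steps - 1 sampled steps lie in the past; keep
    # n_action_steps starting from the current step.
    start = n_obs_steps - 1
    return slice(start, start + n_action_steps)

horizon_actions = list(range(16))  # stand-in for the (horizon,) time axis of the samples
executed = horizon_actions[action_slice(2, 8)]
print(executed)
```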
    def compute_loss(self, batch: dict[str, Tensor]) -> Tensor:
        """
        This function expects `batch` to have (at least):
        {
            "observation.state": (B, n_obs_steps, state_dim)

            "observation.images": (B, n_obs_steps, num_cameras, C, H, W)
                AND/OR
            "observation.environment_state": (B, n_obs_steps, environment_dim)

            "action": (B, horizon, action_dim)
            "action_is_pad": (B, horizon)
        }
        """
        # Input validation.
        assert set(batch).issuperset({OBS_STATE, ACTION, "action_is_pad"})
        assert OBS_IMAGES in batch or OBS_ENV_STATE in batch
        n_obs_steps = batch[OBS_STATE].shape[1]
        horizon = batch[ACTION].shape[1]
        assert horizon == self.config.horizon
        assert n_obs_steps == self.config.n_obs_steps

        # Encode image features and concatenate them all together along with the state vector.
        global_cond = self._prepare_global_conditioning(batch)  # (B, global_cond_dim)

        # Forward diffusion.
        trajectory = batch[ACTION]
        # Sample noise to add to the trajectory.
        eps = torch.randn(trajectory.shape, device=trajectory.device)
        # Sample a random noising timestep for each item in the batch.
        timesteps = torch.randint(
            low=0,
            high=self.noise_scheduler.config.num_train_timesteps,
            size=(trajectory.shape[0],),
            device=trajectory.device,
        ).long()
        # Add noise to the clean trajectories according to the noise magnitude at each timestep.
        noisy_trajectory = self.noise_scheduler.add_noise(trajectory, eps, timesteps)

        # Run the denoising network (that might denoise the trajectory, or attempt to predict the noise).
        pred = self.unet(noisy_trajectory, timesteps, global_cond=global_cond)

        # Compute the loss.
        # The target is either the original trajectory, or the noise.
        if self.config.prediction_type == "epsilon":
            target = eps
        elif self.config.prediction_type == "sample":
            target = batch[ACTION]
        else:
            raise ValueError(f"Unsupported prediction type {self.config.prediction_type}")

        loss = F.mse_loss(pred, target, reduction="none")

        # Mask loss wherever the action is padded with copies (edges of the dataset trajectory).
        if self.config.do_mask_loss_for_padding:
            if "action_is_pad" not in batch:
                raise ValueError(
                    "You need to provide 'action_is_pad' in the batch when "
                    f"{self.config.do_mask_loss_for_padding=}."
                )
            in_episode_bound = ~batch["action_is_pad"]
            loss = loss * in_episode_bound.unsqueeze(-1)

        return loss.mean()
class SpatialSoftmax(nn.Module):
    """
    Spatial Soft Argmax operation described in "Deep Spatial Autoencoders for Visuomotor Learning" by Finn et al.
    (https://huggingface.co/papers/1509.06113). A minimal port of the robomimic implementation.

    At a high level, this takes 2D feature maps (from a convnet/ViT) and returns the "center of mass"
    of activations of each channel, i.e., keypoints in the image space for the policy to focus on.

    Example: take feature maps of size (512x10x12). We generate a grid of normalized coordinates (10x12x2):
        -----------------------------------------------------
        | (-1., -1.)   | (-0.82, -1.)   | ... | (1., -1.)   |
        | (-1., -0.78) | (-0.82, -0.78) | ... | (1., -0.78) |
        | ...          | ...            | ... | ...         |
        | (-1., 1.)    | (-0.82, 1.)    | ... | (1., 1.)    |
        -----------------------------------------------------
    This is achieved by applying a channel-wise softmax over the activations (512x120) and computing the dot
    product with the coordinates (120x2) to get expected points of maximal activation (512x2).

    The example above results in 512 keypoints (corresponding to the 512 input channels). We can optionally
    provide num_kp != None to control the number of keypoints. This is achieved by first applying a learnable
    linear mapping (in_channels, H, W) -> (num_kp, H, W).
    """
    def __init__(self, input_shape, num_kp=None):
        """
        Args:
            input_shape (list): (C, H, W) input feature map shape.
            num_kp (int): number of keypoints in output. If None, output will have the same number of
                channels as input.
        """
        super().__init__()

        assert len(input_shape) == 3
        self._in_c, self._in_h, self._in_w = input_shape

        if num_kp is not None:
            self.nets = torch.nn.Conv2d(self._in_c, num_kp, kernel_size=1)
            self._out_c = num_kp
        else:
            self.nets = None
            self._out_c = self._in_c

        # We could use torch.linspace directly, but that seems to behave slightly differently than numpy
        # and causes a small degradation in pc_success of pre-trained models.
        pos_x, pos_y = np.meshgrid(np.linspace(-1.0, 1.0, self._in_w), np.linspace(-1.0, 1.0, self._in_h))
        pos_x = torch.from_numpy(pos_x.reshape(self._in_h * self._in_w, 1)).float()
        pos_y = torch.from_numpy(pos_y.reshape(self._in_h * self._in_w, 1)).float()
        # Register as a buffer so it's moved to the correct device.
        self.register_buffer("pos_grid", torch.cat([pos_x, pos_y], dim=1))
    def forward(self, features: Tensor) -> Tensor:
        """
        Args:
            features: (B, C, H, W) input feature maps.
        Returns:
            (B, K, 2) image-space coordinates of keypoints.
        """
        if self.nets is not None:
            features = self.nets(features)

        # [B, K, H, W] -> [B * K, H * W] where K is the number of keypoints
        features = features.reshape(-1, self._in_h * self._in_w)
        # 2D softmax normalization
        attention = F.softmax(features, dim=-1)
        # [B * K, H * W] x [H * W, 2] -> [B * K, 2] for spatial coordinate mean in x and y dimensions
        expected_xy = attention @ self.pos_grid
        # Reshape to [B, K, 2].
        feature_keypoints = expected_xy.view(-1, self._out_c, 2)

        return feature_keypoints
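The softmax-then-dot-product can be sketched for a single channel with the stdlib only (a simplified stand-in for the tensor version, with the same linspace-style normalized grid): softmax over the flattened H*W activations, then an expectation over the coordinate grid.

```python
import math

def spatial_softmax_keypoint(feature_map: list[list[float]]) -> tuple[float, float]:
    h, w = len(feature_map), len(feature_map[0])
    # Normalized coordinate axes, matching np.linspace(-1, 1, n).
    xs = [-1.0 + 2.0 * j / (w - 1) for j in range(w)]
    ys = [-1.0 + 2.0 * i / (h - 1) for i in range(h)]
    flat = [v for row in feature_map for v in row]
    # Numerically stable softmax over all H*W activations.
    peak = max(flat)
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Expected (x, y): dot product of attention with the coordinate grid.
    x = sum(a * xs[idx % w] for idx, a in enumerate(attn))
    y = sum(a * ys[idx // w] for idx, a in enumerate(attn))
    return x, y

# A uniform map has its "center of mass" at the grid center (0, 0); a map with
# all its mass in the top-left corner yields a keypoint near (-1, -1).
print(spatial_softmax_keypoint([[10.0, 0.0], [0.0, 0.0]]))
```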
class DiffusionRgbEncoder(nn.Module):
    """Encodes an RGB image into a 1D feature vector.

    Includes the ability to normalize and crop the image first.
    """

    def __init__(self, config: DiffusionConfig):
        super().__init__()
        # Set up optional preprocessing.
        if config.crop_shape is not None:
            self.do_crop = True
            # Always use center crop for eval.
            self.center_crop = torchvision.transforms.CenterCrop(config.crop_shape)
            if config.crop_is_random:
                self.maybe_random_crop = torchvision.transforms.RandomCrop(config.crop_shape)
            else:
                self.maybe_random_crop = self.center_crop
        else:
            self.do_crop = False

        # Set up backbone.
        backbone_model = getattr(torchvision.models, config.vision_backbone)(
            weights=config.pretrained_backbone_weights
        )
        # Note: This assumes that the layer4 feature map is children()[-3]
        # TODO(alexander-soare): Use a safer alternative.
        self.backbone = nn.Sequential(*(list(backbone_model.children())[:-2]))
        if config.use_group_norm:
            if config.pretrained_backbone_weights:
                raise ValueError(
                    "You can't replace BatchNorm in a pretrained model without ruining the weights!"
                )
            self.backbone = _replace_submodules(
                root_module=self.backbone,
                predicate=lambda x: isinstance(x, nn.BatchNorm2d),
                func=lambda x: nn.GroupNorm(num_groups=x.num_features // 16, num_channels=x.num_features),
            )

        # Set up pooling and final layers.
        # Use a dry run to get the feature map shape.
        # The dummy input should take the number of image channels from `config.image_features` and it should
        # use the height and width from `config.crop_shape` if it is provided, otherwise it should use the
        # height and width from `config.image_features`.

        # Note: we have a check in the config class to make sure all images have the same shape.
        images_shape = next(iter(config.image_features.values())).shape
        dummy_shape_h_w = config.crop_shape if config.crop_shape is not None else images_shape[1:]
        dummy_shape = (1, images_shape[0], *dummy_shape_h_w)
        feature_map_shape = get_output_shape(self.backbone, dummy_shape)[1:]

        self.pool = SpatialSoftmax(feature_map_shape, num_kp=config.spatial_softmax_num_keypoints)
        self.feature_dim = config.spatial_softmax_num_keypoints * 2
        self.out = nn.Linear(config.spatial_softmax_num_keypoints * 2, self.feature_dim)
        self.relu = nn.ReLU()
    def forward(self, x: Tensor) -> Tensor:
        """
        Args:
            x: (B, C, H, W) image tensor with pixel values in [0, 1].
        Returns:
            (B, D) image feature.
        """
        # Preprocess: maybe crop (if it was set up in the __init__).
        if self.do_crop:
            if self.training:  # noqa: SIM108
                x = self.maybe_random_crop(x)
            else:
                # Always use center crop for eval.
                x = self.center_crop(x)
        # Extract backbone feature.
        x = torch.flatten(self.pool(self.backbone(x)), start_dim=1)
        # Final linear layer with non-linearity.
        x = self.relu(self.out(x))
        return x
def _replace_submodules(
    root_module: nn.Module, predicate: Callable[[nn.Module], bool], func: Callable[[nn.Module], nn.Module]
) -> nn.Module:
    """
    Args:
        root_module: The module for which the submodules need to be replaced
        predicate: Takes a module as an argument and must return True if that module is to be replaced.
        func: Takes a module as an argument and returns a new module to replace it with.
    Returns:
        The root module with its submodules replaced.
    """
    if predicate(root_module):
        return func(root_module)

    replace_list = [k.split(".") for k, m in root_module.named_modules(remove_duplicate=True) if predicate(m)]
    for *parents, k in replace_list:
        parent_module = root_module
        if len(parents) > 0:
            parent_module = root_module.get_submodule(".".join(parents))
        if isinstance(parent_module, nn.Sequential):
            src_module = parent_module[int(k)]
        else:
            src_module = getattr(parent_module, k)
        tgt_module = func(src_module)
        if isinstance(parent_module, nn.Sequential):
            parent_module[int(k)] = tgt_module
        else:
            setattr(parent_module, k, tgt_module)
    # Verify that all modules matching the predicate (e.g. all BatchNorm layers) were replaced.
    assert not any(predicate(m) for _, m in root_module.named_modules(remove_duplicate=True))
    return root_module
class DiffusionSinusoidalPosEmb(nn.Module):
    """1D sinusoidal positional embeddings as in Attention is All You Need."""

    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, x: Tensor) -> Tensor:
        device = x.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = x.unsqueeze(-1) * emb.unsqueeze(0)
        emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
        return emb
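The embedding above can be reproduced for a single scalar timestep with the stdlib, making the layout explicit: geometrically spaced frequencies, first half sines, second half cosines.

```python
import math

def sinusoidal_pos_emb(x: float, dim: int) -> list[float]:
    # Frequencies decay geometrically from 1 down to 1/10000,
    # mirroring exp(arange(half_dim) * -log(10000) / (half_dim - 1)).
    half_dim = dim // 2
    scale = math.log(10000) / (half_dim - 1)
    freqs = [math.exp(-scale * i) for i in range(half_dim)]
    # Concatenate sines then cosines, matching torch.cat((emb.sin(), emb.cos())).
    return [math.sin(x * f) for f in freqs] + [math.cos(x * f) for f in freqs]

# At timestep 0 every sine is 0 and every cosine is 1.
print(sinusoidal_pos_emb(0.0, 4))
```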
class DiffusionConv1dBlock(nn.Module):
    """Conv1d --> GroupNorm --> Mish"""

    def __init__(self, inp_channels, out_channels, kernel_size, n_groups=8):
        super().__init__()

        self.block = nn.Sequential(
            nn.Conv1d(inp_channels, out_channels, kernel_size, padding=kernel_size // 2),
            nn.GroupNorm(n_groups, out_channels),
            nn.Mish(),
        )

    def forward(self, x):
        return self.block(x)
class DiffusionConditionalUnet1d(nn.Module):
    """A 1D convolutional UNet with FiLM modulation for conditioning.

    Note: this removes local conditioning as compared to the original diffusion policy code.
    """

    def __init__(self, config: DiffusionConfig, global_cond_dim: int):
        super().__init__()

        self.config = config

        # Encoder for the diffusion timestep.
        self.diffusion_step_encoder = nn.Sequential(
            DiffusionSinusoidalPosEmb(config.diffusion_step_embed_dim),
            nn.Linear(config.diffusion_step_embed_dim, config.diffusion_step_embed_dim * 4),
            nn.Mish(),
            nn.Linear(config.diffusion_step_embed_dim * 4, config.diffusion_step_embed_dim),
        )

        # The FiLM conditioning dimension.
        cond_dim = config.diffusion_step_embed_dim + global_cond_dim

        # In channels / out channels for each downsampling block in the Unet's encoder. For the decoder, we
        # just reverse these.
        in_out = [(config.action_feature.shape[0], config.down_dims[0])] + list(
            zip(config.down_dims[:-1], config.down_dims[1:], strict=True)
        )

        # Unet encoder.
        common_res_block_kwargs = {
            "cond_dim": cond_dim,
            "kernel_size": config.kernel_size,
            "n_groups": config.n_groups,
            "use_film_scale_modulation": config.use_film_scale_modulation,
        }
        self.down_modules = nn.ModuleList([])
        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= (len(in_out) - 1)
            self.down_modules.append(
                nn.ModuleList(
                    [
                        DiffusionConditionalResidualBlock1d(dim_in, dim_out, **common_res_block_kwargs),
                        DiffusionConditionalResidualBlock1d(dim_out, dim_out, **common_res_block_kwargs),
                        # Downsample as long as it is not the last block.
                        nn.Conv1d(dim_out, dim_out, 3, 2, 1) if not is_last else nn.Identity(),
                    ]
                )
            )

        # Processing in the middle of the auto-encoder.
        self.mid_modules = nn.ModuleList(
            [
                DiffusionConditionalResidualBlock1d(
                    config.down_dims[-1], config.down_dims[-1], **common_res_block_kwargs
                ),
                DiffusionConditionalResidualBlock1d(
                    config.down_dims[-1], config.down_dims[-1], **common_res_block_kwargs
                ),
            ]
        )

        # Unet decoder.
        self.up_modules = nn.ModuleList([])
        for ind, (dim_out, dim_in) in enumerate(reversed(in_out[1:])):
            is_last = ind >= (len(in_out) - 1)
            self.up_modules.append(
                nn.ModuleList(
                    [
                        # dim_in * 2, because it takes the encoder's skip connection as well
                        DiffusionConditionalResidualBlock1d(dim_in * 2, dim_out, **common_res_block_kwargs),
                        DiffusionConditionalResidualBlock1d(dim_out, dim_out, **common_res_block_kwargs),
                        # Upsample as long as it is not the last block.
                        nn.ConvTranspose1d(dim_out, dim_out, 4, 2, 1) if not is_last else nn.Identity(),
                    ]
                )
            )

        self.final_conv = nn.Sequential(
            DiffusionConv1dBlock(config.down_dims[0], config.down_dims[0], kernel_size=config.kernel_size),
            nn.Conv1d(config.down_dims[0], config.action_feature.shape[0], 1),
        )
def forward(self, x: Tensor, timestep: Tensor | int, global_cond=None) -> Tensor:
|
|
||||||
"""
|
|
||||||
Args:
|
|
||||||
x: (B, T, input_dim) tensor for input to the Unet.
|
|
||||||
timestep: (B,) tensor of (timestep_we_are_denoising_from - 1).
|
|
||||||
global_cond: (B, global_cond_dim)
|
|
||||||
output: (B, T, input_dim)
|
|
||||||
Returns:
|
|
||||||
(B, T, input_dim) diffusion model prediction.
|
|
||||||
"""
|
|
||||||
# For 1D convolutions we'll need feature dimension first.
|
|
||||||
x = einops.rearrange(x, "b t d -> b d t")
|
|
||||||
|
|
||||||
timesteps_embed = self.diffusion_step_encoder(timestep)
|
|
||||||
|
|
||||||
# If there is a global conditioning feature, concatenate it to the timestep embedding.
|
|
||||||
if global_cond is not None:
|
|
||||||
global_feature = torch.cat([timesteps_embed, global_cond], axis=-1)
|
|
||||||
else:
|
|
||||||
global_feature = timesteps_embed
|
|
||||||
|
|
||||||
# Run encoder, keeping track of skip features to pass to the decoder.
|
|
||||||
encoder_skip_features: list[Tensor] = []
|
|
||||||
for resnet, resnet2, downsample in self.down_modules:
|
|
||||||
x = resnet(x, global_feature)
|
|
||||||
x = resnet2(x, global_feature)
|
|
||||||
encoder_skip_features.append(x)
|
|
||||||
x = downsample(x)
|
|
||||||
|
|
||||||
for mid_module in self.mid_modules:
|
|
||||||
x = mid_module(x, global_feature)
|
|
||||||
|
|
||||||
# Run decoder, using the skip features from the encoder.
|
|
||||||
for resnet, resnet2, upsample in self.up_modules:
|
|
||||||
x = torch.cat((x, encoder_skip_features.pop()), dim=1)
|
|
||||||
x = resnet(x, global_feature)
|
|
||||||
x = resnet2(x, global_feature)
|
|
||||||
x = upsample(x)
|
|
||||||
|
|
||||||
x = self.final_conv(x)
|
|
||||||
|
|
||||||
x = einops.rearrange(x, "b d t -> b t d")
|
|
||||||
return x
|
|
||||||
|
|
||||||
|
|
||||||
class DiffusionConditionalResidualBlock1d(nn.Module):
|
|
||||||
"""ResNet style 1D convolutional block with FiLM modulation for conditioning."""
|
|
||||||
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
in_channels: int,
|
|
||||||
out_channels: int,
|
|
||||||
cond_dim: int,
|
|
||||||
kernel_size: int = 3,
|
|
||||||
n_groups: int = 8,
|
|
||||||
# Set to True to do scale modulation with FiLM as well as bias modulation (defaults to False meaning
|
|
||||||
# FiLM just modulates bias).
|
|
||||||
use_film_scale_modulation: bool = False,
|
|
||||||
):
|
|
||||||
super().__init__()
|
|
||||||
|
|
||||||
self.use_film_scale_modulation = use_film_scale_modulation
|
|
||||||
self.out_channels = out_channels
|
|
||||||
|
|
||||||
self.conv1 = DiffusionConv1dBlock(in_channels, out_channels, kernel_size, n_groups=n_groups)
|
|
||||||
|
|
||||||
# FiLM modulation (https://huggingface.co/papers/1709.07871) outputs per-channel bias and (maybe) scale.
|
|
||||||
cond_channels = out_channels * 2 if use_film_scale_modulation else out_channels
|
|
||||||
self.cond_encoder = nn.Sequential(nn.Mish(), nn.Linear(cond_dim, cond_channels))
|
|
||||||
|
|
||||||
self.conv2 = DiffusionConv1dBlock(out_channels, out_channels, kernel_size, n_groups=n_groups)
|
|
||||||
|
|
||||||
# A final convolution for dimension matching the residual (if needed).
|
|
||||||
self.residual_conv = (
|
|
||||||
nn.Conv1d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()
|
|
||||||
)
|
|
||||||
|
|
||||||
def forward(self, x: Tensor, cond: Tensor) -> Tensor:
|
|
||||||
"""
|
|
||||||
Args:
|
|
||||||
x: (B, in_channels, T)
|
|
||||||
cond: (B, cond_dim)
|
|
||||||
Returns:
|
|
||||||
(B, out_channels, T)
|
|
||||||
"""
|
|
||||||
out = self.conv1(x)
|
|
||||||
|
|
||||||
# Get condition embedding. Unsqueeze for broadcasting to `out`, resulting in (B, out_channels, 1).
|
|
||||||
cond_embed = self.cond_encoder(cond).unsqueeze(-1)
|
|
||||||
if self.use_film_scale_modulation:
|
|
||||||
# Treat the embedding as a list of scales and biases.
|
|
||||||
scale = cond_embed[:, : self.out_channels]
|
|
||||||
bias = cond_embed[:, self.out_channels :]
|
|
||||||
out = scale * out + bias
|
|
||||||
else:
|
|
||||||
# Treat the embedding as biases.
|
|
||||||
out = out + cond_embed
|
|
||||||
|
|
||||||
out = self.conv2(out)
|
|
||||||
out = out + self.residual_conv(x)
|
|
||||||
return out
|
|
||||||
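The FiLM conditioning rule used by `DiffusionConditionalResidualBlock1d` is simple enough to check by hand. A minimal NumPy sketch of the same rule; the shapes and function name are illustrative, not the module's API:

```python
import numpy as np

def film_modulate(out, cond_embed, out_channels, use_scale=True):
    """Apply FiLM conditioning to a (B, C, T) feature map.

    cond_embed has shape (B, 2*C, 1) when scale modulation is on
    (scales first, then biases), else (B, C, 1) for bias-only.
    """
    if use_scale:
        scale = cond_embed[:, :out_channels]
        bias = cond_embed[:, out_channels:]
        return scale * out + bias
    return out + cond_embed

out = np.ones((1, 2, 4))                          # (B=1, C=2, T=4)
cond = np.array([[[2.0], [3.0], [0.5], [-1.0]]])  # scales (2, 3), biases (0.5, -1)
mod = film_modulate(out, cond, out_channels=2)
# channel 0 -> 2*1 + 0.5 = 2.5 ; channel 1 -> 3*1 - 1 = 2.0
```

The (B, C, 1) trailing axis is what lets the per-channel scale/bias broadcast across the time dimension, mirroring the `unsqueeze(-1)` in the block above.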
@@ -1,92 +0,0 @@
#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any

import torch

from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.processor import (
    AddBatchDimensionProcessorStep,
    DeviceProcessorStep,
    NormalizerProcessorStep,
    PolicyAction,
    PolicyProcessorPipeline,
    RenameObservationsProcessorStep,
    UnnormalizerProcessorStep,
)
from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME


def make_diffusion_pre_post_processors(
    config: DiffusionConfig,
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
    PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
    """
    Constructs pre-processor and post-processor pipelines for a diffusion policy.

    The pre-processing pipeline prepares the input data for the model by:
    1. Renaming features.
    2. Adding a batch dimension.
    3. Moving the data to the specified device.
    4. Normalizing the input and output features based on dataset statistics.

    The post-processing pipeline handles the model's output by:
    1. Unnormalizing the output features to their original scale.
    2. Moving the data to the CPU.

    Args:
        config: The configuration object for the diffusion policy,
            containing feature definitions, normalization mappings, and device information.
        dataset_stats: A dictionary of statistics used for normalization.
            Defaults to None.

    Returns:
        A tuple containing the configured pre-processor and post-processor pipelines.
    """

    input_steps = [
        RenameObservationsProcessorStep(rename_map={}),
        AddBatchDimensionProcessorStep(),
        DeviceProcessorStep(device=config.device),
        NormalizerProcessorStep(
            features={**config.input_features, **config.output_features},
            norm_map=config.normalization_mapping,
            stats=dataset_stats,
        ),
    ]
    output_steps = [
        UnnormalizerProcessorStep(
            features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
        ),
        DeviceProcessorStep(device="cpu"),
    ]
    return (
        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
            steps=input_steps,
            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
        ),
        PolicyProcessorPipeline[PolicyAction, PolicyAction](
            steps=output_steps,
            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
            to_transition=policy_action_to_transition,
            to_output=transition_to_policy_action,
        ),
    )
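The factory above just composes ordered steps into two pipelines. A toy sketch of that composition pattern in plain Python; `Pipeline`, `normalize`, and `add_batch` are hypothetical stand-ins for the lerobot classes, not their actual API:

```python
class Pipeline:
    """Applies its steps in order, like a processor pipeline."""
    def __init__(self, steps):
        self.steps = steps

    def __call__(self, x):
        for step in self.steps:
            x = step(x)
        return x

# Toy stand-ins for a normalizer step and a batch-dimension step.
normalize = lambda obs: {k: (v - 1.0) / 2.0 for k, v in obs.items()}
add_batch = lambda obs: {k: [v] for k, v in obs.items()}

pre = Pipeline([normalize, add_batch])
out = pre({"observation.state": 3.0})
# → {"observation.state": [1.0]}
```

Because each step receives the previous step's output, the order in the list is the order of execution, which is why the docstring enumerates the steps in sequence.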
@@ -1,42 +0,0 @@
# Streaming HDF5 EE Action Dataset Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Switch Diana simulation data collection to streaming HDF5 writes, save images as 256x256 frames from four camera views, and change `/action` to the raw end-effector pose action before IK.

**Architecture:** Add a standalone streaming HDF5 episode writer that writes qpos, raw actions, and resized images frame by frame, commits atomically when an episode succeeds, and deletes the temporary file on failure. The collection script only handles the rollout and hands each step's observation/action to the writer, so a whole episode is never accumulated in memory first.

**Tech Stack:** Python, h5py, numpy, cv2, unittest, MuJoCo demo scripts

---

### Task 1: Establish a test boundary for the streaming writer

**Files:**
- Create: `tests/test_streaming_episode_writer.py`
- Create: `roboimi/utils/streaming_episode_writer.py`

- [ ] **Step 1: Write the failing test**
- [ ] **Step 2: Run `python -m unittest tests.test_streaming_episode_writer -v` and confirm it fails because the writer module does not exist**
- [ ] **Step 3: Implement the minimal streaming writer with temp-file commit/discard, per-frame append, and 256x256 image resize**
- [ ] **Step 4: Re-run `python -m unittest tests.test_streaming_episode_writer -v` and confirm it passes**

### Task 2: Wire the writer into the Diana collection script

**Files:**
- Modify: `roboimi/demos/diana_record_sim_episodes.py`
- Reuse: `roboimi/utils/streaming_episode_writer.py`

- [ ] **Step 1: Replace in-memory `data_dict` / `obs` accumulation with a per-episode streaming writer lifecycle**
- [ ] **Step 2: Keep four cameras (`angle`, `r_vis`, `top`, `front`) and resize to 256x256 before persistence**
- [ ] **Step 3: Capture the raw policy output before IK and write that to `/action`**
- [ ] **Step 4: On success commit to `episode_{idx}.hdf5`; on failure remove the temp file**

### Task 3: Verify the changes

**Files:**
- Verify only

- [ ] **Step 1: Run unit tests for the writer**
- [ ] **Step 2: Run one end-to-end collection episode and stop after `episode_0.hdf5` becomes readable**
- [ ] **Step 3: Verify HDF5 keys and shapes: `action=(700,16)`, image datasets are `(700,256,256,3)`, and `/action` matches raw EE action semantics**
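The temp-file commit/discard lifecycle from Tasks 1 and 2 can be sketched without h5py. A real writer would append frames to resizable HDF5 datasets instead of text lines; the class and method names below are illustrative, not the planned module's API:

```python
import os
import tempfile

class StreamingEpisodeWriter:
    """Minimal sketch of the temp-file commit/discard lifecycle.

    The file format is elided: each appended frame is written as one
    text line here, purely to show the atomic-commit pattern.
    """
    def __init__(self, final_path):
        self.final_path = final_path
        fd, self.tmp_path = tempfile.mkstemp(
            suffix=".tmp", dir=os.path.dirname(final_path) or "."
        )
        self._f = os.fdopen(fd, "w")

    def append(self, frame):
        self._f.write(f"{frame}\n")  # stand-in for per-frame dataset appends

    def commit(self):
        self._f.close()
        os.replace(self.tmp_path, self.final_path)  # atomic rename on POSIX

    def discard(self):
        self._f.close()
        os.remove(self.tmp_path)  # failed episodes leave nothing behind

w = StreamingEpisodeWriter("episode_0.txt")
w.append("qpos+action+images for step 0")
w.commit()
```

The key property is that readers only ever see `episode_0.*` after a successful `commit()`, so a crashed or failed episode can never produce a half-written dataset file.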
@@ -1,26 +0,0 @@
# Raw Action Trajectory Viewer Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** In an interactive MuJoCo simulation window, mark the raw EE action trajectory exported from a rollout with a red trace and launch the simulation for manual inspection.

**Architecture:** Read the raw_action / step data from the existing trajectory artifact, generate end-effector trajectory points for the left and right arms, and continuously inject red markers in the viewer render loop. Keep the implementation as a small standalone, reusable script so the training/evaluation main paths are unaffected.

**Tech Stack:** Python, NumPy, MuJoCo viewer, unittest/mock.

---

### Task 1: Extract the raw_action trajectory and generate the visualization point set
- [ ] Write a failing test that verifies extraction of left/right arm trajectory points from trajectory.npz
- [ ] Implement the minimal helper
- [ ] Run the test and confirm it passes

### Task 2: Render the red trajectory in the viewer and support interactive inspection
- [ ] Write a failing test that verifies the marker configuration/calls
- [ ] Implement the viewer visualization script
- [ ] Run the test and confirm it passes

### Task 3: Launch the real simulation window for manual inspection
- [ ] Launch the viewer with an existing trajectory artifact
- [ ] Confirm the window is interactive and the red trace appears
- [ ] Report the launch command and script path to the user
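Task 1's helper can be as small as an array slice. A sketch under an assumed layout (8 dims per arm, xyz leading); the project's actual `trajectory.npz` layout may differ, and `split_ee_trajectories` is a hypothetical name:

```python
import numpy as np

def split_ee_trajectories(raw_actions: np.ndarray):
    """Split a (T, 16) raw EE action array into left/right xyz point sets.

    Assumed layout: dims 0-7 are the left arm, 8-15 the right arm,
    with the xyz position in the first three dims of each half.
    """
    left = raw_actions[:, 0:3]
    right = raw_actions[:, 8:11]
    return left, right

traj = np.arange(32, dtype=float).reshape(2, 16)  # two fake steps
left, right = split_ee_trajectories(traj)
# left[0] == [0, 1, 2]; right[0] == [8, 9, 10]
```

The viewer script would then feed these (T, 3) point sets into per-frame red sphere markers.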
@@ -1,44 +0,0 @@
# Rollout Artifacts Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Extend rollout evaluation so one selected checkpoint can be run once with video capture, timing breakdown, and saved EE trajectory artifacts.

**Architecture:** Keep the implementation centered in `eval_vla.py` so existing training-time rollout validation remains compatible. Add config-gated artifact capture helpers, serialize outputs under the eval run directory, and add lightweight tests for helper behavior and summary wiring; default eval behavior must remain unchanged when artifact capture is off.

**Tech Stack:** Python, Hydra/OmegaConf, NumPy, OpenCV, JSON, PyTorch unittest/mocking.

---

### Task 1: Add artifact capture configuration and helper wiring

**Files:**
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
- Modify: `roboimi/vla/conf/eval/eval.yaml`
- Test: `tests/test_eval_vla_rollout_artifacts.py`

- [ ] **Step 1: Write failing tests for optional artifact config / summary wiring**
- [ ] **Step 2: Implement config-backed artifact flags and output paths with defaults that write nothing**
- [ ] **Step 3: Verify existing eval call sites still work with defaults**

### Task 2: Add timing breakdown, video recording, and trajectory export

**Files:**
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
- Test: `tests/test_eval_vla_rollout_artifacts.py`

- [ ] **Step 1: Write failing tests for timing aggregation, trajectory serialization, and the summary schema**
- [ ] **Step 2: Implement per-step timing capture for `obs_read_ms`, `preprocess_ms`, `inference_ms`, `env_step_ms`, `loop_total_ms`**
- [ ] **Step 3: Implement MP4 recording from a chosen camera stream and canonical `trajectory.npz` export using `left_link7/right_link7` executed poses after `env.step`**
- [ ] **Step 4: Run focused tests and fix issues**

### Task 3: Stop training safely and execute one real rollout

**Files:**
- Use: `roboimi/demos/vla_scripts/eval_vla.py`
- Output: `runs/.../eval_artifacts/...`

- [ ] **Step 1: Stop the active training process, wait for exit, and confirm the target checkpoint is readable**
- [ ] **Step 2: Select the latest completed checkpoint if an explicit one is not provided; fall back to a prior completed / best checkpoint if needed**
- [ ] **Step 3: Run one headless rollout with artifact capture enabled**
- [ ] **Step 4: Verify the MP4 / timing summary / trajectory files exist and summarize findings**
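The per-step timing capture in Task 2, Step 2 amounts to wrapping each stage with a monotonic clock. A sketch with placeholder stage callables (none of these are the project's real functions):

```python
import time

def timed_step(observe, preprocess, infer, env_step):
    """Run one control-loop step, returning a timing breakdown in ms."""
    timings = {}
    loop_start = time.perf_counter()
    for name, fn in [
        ("obs_read_ms", observe),
        ("preprocess_ms", preprocess),
        ("inference_ms", infer),
        ("env_step_ms", env_step),
    ]:
        t0 = time.perf_counter()
        fn()
        timings[name] = (time.perf_counter() - t0) * 1000.0
    # The loop total covers all stages plus the bookkeeping between them.
    timings["loop_total_ms"] = (time.perf_counter() - loop_start) * 1000.0
    return timings

t = timed_step(lambda: None, lambda: None, lambda: None, lambda: None)
```

Using `perf_counter` (monotonic, high resolution) rather than `time.time` avoids timing artifacts from clock adjustments during long rollouts.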
@@ -1,268 +0,0 @@
# IMF-AttnRes Policy Migration Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Migrate the IMF-AttnRes model, training objective, and one-step inference mechanism from the external `diffusion_policy@185ed659` into RoboIMI, and launch training with the same hyperparameters while keeping the three-camera visual conditioning input and the existing training/rollout workflow.

**Architecture:** Keep RoboIMI's existing ResNet three-camera observation encoding, normalization, queue-based online rollout, and training scripts; add the AttnRes components and the IMF transformer head, plus a dedicated IMF agent that overrides the DDPM loss / DDIM inference semantics. The training script gets only minimal wiring changes so the new head/agent can use the existing optimizer, checkpointing, SwanLab, and headless rollout.

**Tech Stack:** PyTorch, Hydra, diffusers schedulers (kept only for compatible initialization), MuJoCo rollout, unittest, SwanLab

---

## File Map

### New files
- `roboimi/vla/models/heads/attnres_transformer_components.py` - local IMF AttnRes building blocks
- `roboimi/vla/models/heads/imf_transformer1d.py` - IMF transformer head exposing `forward(sample, r, t, cond=None)`
- `roboimi/vla/agent_imf.py` - dedicated IMF VLA agent that reuses the existing observation/queue/normalization logic and overrides loss / inference
- `roboimi/vla/conf/head/imf_transformer1d.yaml` - IMF head config
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml` - IMF agent + backbone/head composition config
- `tests/test_imf_transformer1d_external_alignment.py` - alignment tests against external `185ed659`
- `tests/test_imf_vla_agent.py` - tests for the IMF agent's loss / inference / queue semantics

### Modified files
- `roboimi/demos/vla_scripts/train_vla.py` - optimizer parameter-group wiring; ensure the new agent trains seamlessly
- `roboimi/vla/conf/config.yaml` - keep the defaults unchanged; enable the IMF agent only via override
- `tests/test_train_vla_transformer_optimizer.py` - cover optimizer-group behavior for the IMF head
- (if needed) `roboimi/vla/models/heads/__init__.py` or a nearby export file - expose the new head

---

### Task 1: Write the IMF transformer alignment tests

**Files:**
- Create: `tests/test_imf_transformer1d_external_alignment.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`

- [ ] **Step 1: Write failing tests verifying that the local IMF head matches external `185ed659` on state-dict keys, forward shapes, forward values, and optim groups**

```python
with torch.no_grad():
    external_out = external_model(sample=sample, r=r, t=t, cond=cond)
    local_out = local_model(sample=sample, r=r, t=t, cond=cond)
assert torch.allclose(local_out, external_out, atol=1e-6, rtol=1e-5)
```

- [ ] **Step 2: Run the unit test and confirm it currently fails**

  Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
  Expected: FAIL, reporting that the `imf_transformer1d` / `attnres` modules do not exist

- [ ] **Step 3: If the tests need the existing external-loader logic, copy the minimal necessary helpers from `tests/test_transformer1d_external_alignment.py` to avoid depending on session context**

- [ ] **Step 4: Commit the test skeleton**

```bash
git add tests/test_imf_transformer1d_external_alignment.py
git commit -m "test: add IMF transformer external alignment coverage"
```

### Task 2: Implement the AttnRes components and the IMF transformer head

**Files:**
- Create: `roboimi/vla/models/heads/attnres_transformer_components.py`
- Create: `roboimi/vla/models/heads/imf_transformer1d.py`
- Modify: `tests/test_imf_transformer1d_external_alignment.py`

- [ ] **Step 1: Port the AttnRes building blocks from external `185ed659`, keeping names and parameter semantics identical**

  Must include:
  - `RMSNorm`
  - `RMSNormNoWeight`
  - `precompute_rope_freqs`
  - `apply_rope`
  - `GroupedQuerySelfAttention`
  - `SwiGLUFFN`
  - `AttnResOperator`
  - `AttnResSubLayer`
  - `AttnResTransformerBackbone`

- [ ] **Step 2: Implement the local IMF head in `imf_transformer1d.py`**

  Must satisfy:
  - `forward(sample, r, t, cond=None)`
  - support `backbone_type='attnres_full'` by default
  - the token sequence is `[r_token, t_token, cond_tokens..., sample_tokens...]`
  - the output slices back only the sample-token segment
  - keep `get_optim_groups()` for AdamW grouping

- [ ] **Step 3: Run the alignment tests; fix any state-dict key / init / no-decay parameter-group mismatches**

  Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
  Expected: PASS

- [ ] **Step 4: Commit the model component implementation**

```bash
git add roboimi/vla/models/heads/attnres_transformer_components.py \
        roboimi/vla/models/heads/imf_transformer1d.py \
        tests/test_imf_transformer1d_external_alignment.py
git commit -m "feat: add IMF AttnRes transformer head"
```
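Among the ported components, `RMSNorm` is the easiest to sanity-check in isolation. A NumPy sketch of the standard RMSNorm rule; the external `185ed659` implementation may differ in details such as eps placement or dtype handling:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis: x / sqrt(mean(x^2) + eps) * weight."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[3.0, 4.0]])
out = rms_norm(x, weight=np.ones(2))
# mean(x^2) = 12.5, rms ~ 3.5355, so out ~ [0.8485, 1.1314]
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so an alignment test against the external repo should compare outputs on inputs with a nonzero mean to catch an accidental LayerNorm substitution.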
### Task 3: Write IMF agent behavior tests

**Files:**
- Create: `tests/test_imf_vla_agent.py`
- Reference: `roboimi/vla/agent.py`
- Reference: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Write failing tests covering the IMF agent's core contract**

  Must cover:
  1. `compute_loss()` accepts the current batch structure and returns a scalar loss
  2. `predict_action()` outputs `(B, pred_horizon, action_dim)`
  3. `select_action()` still works with the queue/chunk semantics
  4. `predict_action()` does not run the multi-step DDIM loop; it triggers only a single IMF sampling step
  5. when `action_is_pad` is present, the loss is computed only over valid actions

- [ ] **Step 2: Use a stub backbone / stub head that records call arguments to verify that `r,t,cond` are passed through and the observation-conditioning dimensions are correct**

```python
self.assertEqual(recorded['cond'].shape, (B, obs_horizon, expected_cond_dim))
self.assertTrue(torch.allclose(recorded['r'], torch.zeros(B)))
self.assertTrue(torch.allclose(recorded['t'], torch.ones(B)))
```

- [ ] **Step 3: Run the tests and confirm they currently fail**

  Run: `python -m unittest tests.test_imf_vla_agent -v`
  Expected: FAIL, reporting that `roboimi.vla.agent_imf` does not exist

- [ ] **Step 4: Commit the test skeleton**

```bash
git add tests/test_imf_vla_agent.py
git commit -m "test: add IMF VLA agent behavior coverage"
```

### Task 4: Implement the IMF agent and the Hydra wiring

**Files:**
- Create: `roboimi/vla/agent_imf.py`
- Create: `roboimi/vla/conf/head/imf_transformer1d.yaml`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_transformer_optimizer.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Implement `IMFVLAAgent` on top of `VLAAgent`**

  Implementation strategy:
  - Reuse `VLAAgent.__init__`, `_build_cond()`, `reset()`, `_populate_queues()`, `_prepare_observation_batch()`, `select_action()`, `get_normalization_stats()`
  - Override:
    - `compute_loss()` -> IMF objective
    - `predict_action()` -> one-step sample
  - Provide internal helpers:
    - `_broadcast_batch_time`
    - `_apply_conditioning` (if needed)
    - `_compute_u_and_du_dt`
    - `_compound_velocity`
    - `_sample_one_step`

- [ ] **Step 2: Add a CUDA math SDPA fallback on the JVP path, preserving the external repo's stability strategy**

- [ ] **Step 3: Add the Hydra configs so `agent=resnet_imf_attnres` is instantiable**

  Key defaults:
  - `_target_: roboimi.vla.agent_imf.IMFVLAAgent`
  - `head._target_: roboimi.vla.models.heads.imf_transformer1d.IMFTransformer1D`
  - `head.backbone_type: attnres_full`
  - `head.causal_attn: false`
  - `head.time_as_cond: true`
  - `head.n_cond_layers: 0`
  - `inference_steps: 1`
  - `camera_names: ${data.camera_names}`
  - `vision_backbone.camera_names: ${agent.camera_names}`

- [ ] **Step 4: Make the training script reuse parameter grouping for any head that provides `get_optim_groups()`, instead of hard-coding the old transformer head_type**

  Recommended minimal change:
```python
use_head_groups = callable(getattr(noise_pred_net, 'get_optim_groups', None))
```

- [ ] **Step 5: Run the tests and fix wiring issues**

  Run:
  - `python -m unittest tests.test_imf_vla_agent -v`
  - `python -m unittest tests.test_train_vla_transformer_optimizer -v`

  Expected: PASS

- [ ] **Step 6: Commit the agent / config / train-script wiring**

```bash
git add roboimi/vla/agent_imf.py \
        roboimi/vla/conf/head/imf_transformer1d.yaml \
        roboimi/vla/conf/agent/resnet_imf_attnres.yaml \
        roboimi/demos/vla_scripts/train_vla.py \
        tests/test_imf_vla_agent.py \
        tests/test_train_vla_transformer_optimizer.py
git commit -m "feat: add IMF VLA agent and training wiring"
```
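The `get_optim_groups()` duck-typing from Task 4, Step 4 can be exercised with stub heads; `build_param_groups`, `IMFHead`, and `LegacyHead` are hypothetical names used for illustration only:

```python
class LegacyHead:
    """A head without its own optimizer grouping."""

class IMFHead:
    """A head that provides its own decay / no-decay split."""
    def get_optim_groups(self, weight_decay=1e-5):
        # A real head would partition its parameters here.
        return [
            {"params": [], "weight_decay": weight_decay},
            {"params": [], "weight_decay": 0.0},
        ]

def build_param_groups(head, weight_decay=1e-5):
    """Duck-typed dispatch: use the head's own grouping when it offers one."""
    if callable(getattr(head, "get_optim_groups", None)):
        return head.get_optim_groups(weight_decay)
    return [{"params": [], "weight_decay": weight_decay}]  # single-group fallback

imf_groups = build_param_groups(IMFHead())      # two groups
legacy_groups = build_param_groups(LegacyHead())  # one group
```

Dispatching on the method's presence rather than a `head_type` string means any future head that defines `get_optim_groups()` gets the same treatment without touching the training script again.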
### Task 5: Integration verification and training launch

**Files:**
- Modify: none required unless verification exposes a real problem
- Use run artifacts under: `runs/`

- [ ] **Step 1: Run the focused test suite**

  Run:
```bash
python -m unittest \
    tests.test_imf_transformer1d_external_alignment \
    tests.test_imf_vla_agent \
    tests.test_resnet_transformer_agent_wiring \
    tests.test_train_vla_transformer_optimizer -v
```
  Expected: PASS

- [ ] **Step 2: Run a minimal GPU training smoke job (no long run needed)**

  Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
    agent=resnet_imf_attnres \
    data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
    data.camera_names=[r_vis,top,front] \
    train.device=cuda train.max_steps=2 train.batch_size=4 train.num_workers=2 \
    train.use_swanlab=false train.rollout_val_freq_epochs=0
```
  Expected: completes 2 steps successfully, produces a checkpoint / log, no shape or JVP errors

- [ ] **Step 3: Launch IMF training with the production parameters**

  Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
    agent=resnet_imf_attnres \
    data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
    data.camera_names=[r_vis,top,front] \
    train.device=cuda train.val_split=0.0 train.seed=42 \
    train.batch_size=80 train.lr=5e-4 train.num_workers=12 train.max_steps=150000 \
    train.log_freq=100 train.save_freq=10000 train.use_swanlab=true \
    train.swanlab_project=roboimi-vla \
    train.rollout_val_freq_epochs=5 train.rollout_validate_on_checkpoint=false \
    train.rollout_num_episodes=5 train.warmup_steps=2000 \
    train.scheduler_type=cosine train.min_lr=1e-6 train.weight_decay=1e-5 train.grad_clip=1.0 \
    agent.pred_horizon=16 agent.inference_steps=1 \
    agent.head.n_emb=384 agent.head.n_layer=18 agent.head.n_head=1 agent.head.n_kv_head=1 \
    agent.vision_backbone.pretrained_backbone_weights=null \
    agent.vision_backbone.freeze_backbone=false \
    agent.vision_backbone.use_separate_rgb_encoder_per_camera=true
```
  Expected: training launches successfully, SwanLab records the full config, one headless rollout every 5 epochs

- [ ] **Step 4: Record the run path, training PID, and SwanLab run name, and report them to the user**

- [ ] **Step 5: Commit any final cleanup (if smoke fixes required an extra patch)**

```bash
git add <changed files>
git commit -m "chore: verify IMF AttnRes training launch"
```
@@ -1,79 +0,0 @@
|
|||||||
# IMF Rollout Trajectory Images and Short-Horizon Training Implementation Plan
|
|
||||||
|
|
||||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
|
||||||
|
|
||||||
**Goal:** Add training-time rollout front trajectory image export plus SwanLab image logging, then start a new local IMF training run with `emb=384`, `layer=12`, `pred_horizon=8`, `num_action_steps=4`, `max_steps=50000`.
|
|
||||||
|
|
||||||
**Architecture:** Extend `eval_vla.py` so a rollout can emit one per-episode static front-view image with red EE trajectory overlay. Extend `train_vla.py` so rollout validation forces image export, forces video off, and uploads those per-episode images to SwanLab. Launch the requested new run through explicit command-line overrides rather than branch-default config changes.
|
|
||||||
|
|
||||||
**Tech Stack:** Python, PyTorch, Hydra/OmegaConf, MuJoCo, OpenCV, SwanLab.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Task 1: Add and validate rollout image tests
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `tests/test_eval_vla_rollout_artifacts.py`
|
|
||||||
- Modify: `tests/test_train_vla_swanlab_logging.py`
|
|
||||||
- Modify: `tests/test_train_vla_rollout_validation.py`
|
|
||||||
|
|
||||||
- [ ] Add/adjust eval tests so they assert per-episode trajectory image paths are produced without requiring video export.
|
|
||||||
- [ ] Add/adjust training tests so they assert training-time rollout validation forces `record_video=false`.
|
|
||||||
- [ ] Add/adjust training tests so they assert trajectory image paths flow from eval summary into SwanLab media logging.
|
|
||||||
- [ ] Add/adjust training tests so they assert image media is logged, not only scalar reward metrics.
|
|
||||||
|
|
||||||
### Task 2: Implement per-episode front trajectory image export in eval
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
|
|
||||||
- Reuse/Read: `roboimi/utils/raw_action_trajectory_viewer.py`
|
|
||||||
- Modify: `roboimi/vla/conf/eval/eval.yaml`
|
|
||||||
|
|
||||||
- [ ] Add config plumbing for `save_trajectory_image` and `trajectory_image_camera_name`.
|
|
||||||
- [ ] Ensure the default training-time camera resolution path is pinned to `front`.
|
|
||||||
- [ ] Implement distinct per-episode image naming so 5 rollout episodes create 5 distinct PNGs.
|
|
||||||
- [ ] Reuse the existing red trajectory representation logic when composing the PNG.
|
|
||||||
- [ ] Ensure headless eval works under EGL even on machines with `DISPLAY` set.
|
|
||||||
|
|
||||||
### Task 3: Implement SwanLab rollout image logging in training
|
|
||||||
|
|
||||||
**Files:**
|
|
||||||
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
|
|
||||||
- Modify: `tests/test_train_vla_swanlab_logging.py`
|
|
||||||
- Modify: `tests/test_train_vla_rollout_validation.py`
|
|
||||||
|
|
||||||
- [ ] Make `run_rollout_validation()` force `record_video=false`.
|
|
||||||
- [ ] Make `run_rollout_validation()` force `save_trajectory_image=true` and `trajectory_image_camera_name=front`.
|
|
||||||
- [ ] Ensure rollout validation still uses 5 episodes per validation event for the requested run.
|
|
||||||
- [ ] Add a best-effort helper that converts per-episode image paths into SwanLab image media payloads.
|
|
||||||
- [ ] Keep image-upload failures non-fatal and warning-only.
|
|
||||||
|
|
||||||
### Task 4: Verify action-chunk semantics for the new run

**Files:**

- Verify: `roboimi/vla/agent.py`
- Verify: `roboimi/vla/agent_imf.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] Confirm the existing queue logic still means “predict 8, execute first 4”.
- [ ] Do not change branch defaults unless strictly necessary; prefer launch-time overrides.

### Task 5: Verify and launch the requested local training run

**Files:**

- Use: `roboimi/demos/vla_scripts/train_vla.py`
- Use: `roboimi/demos/vla_scripts/eval_vla.py`

- [ ] Run the targeted verification suite.
- [ ] Run one real headless smoke eval and confirm a front trajectory PNG is produced while `video_mp4` stays null.
- [ ] Launch the new local training run with explicit overrides including:
  - `agent=resnet_imf_attnres`
  - `agent.head.n_emb=384`
  - `agent.head.n_layer=12`
  - `agent.pred_horizon=8`
  - `agent.num_action_steps=4`
  - `train.max_steps=50000`
  - `train.rollout_num_episodes=5`
  - `train.use_swanlab=true`
  - current local baseline dataset/camera/CUDA/batch/lr/num_workers/backbone settings
- [ ] Verify PID, GPU allocation, log tail, and SwanLab run URL.
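The explicit-override launch above can be composed programmatically so the override list is checked once and reused for both smoke and real launches. This is a sketch; `build_launch_command` is not existing repo code, and the baseline-specific settings (dataset dir, cameras, CUDA device, batch size, ...) are intentionally left as `extra_overrides`.

```python
import shlex

# Hydra overrides requested for this run; baseline settings are passed separately.
OVERRIDES = [
    "agent=resnet_imf_attnres",
    "agent.head.n_emb=384",
    "agent.head.n_layer=12",
    "agent.pred_horizon=8",
    "agent.num_action_steps=4",
    "train.max_steps=50000",
    "train.rollout_num_episodes=5",
    "train.use_swanlab=true",
]

def build_launch_command(python_bin="python", extra_overrides=()):
    """Compose the training launch as an argv list (avoids shell-quoting bugs)."""
    return [python_bin, "roboimi/demos/vla_scripts/train_vla.py",
            *OVERRIDES, *extra_overrides]

print(shlex.join(build_launch_command()))
```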
@@ -1,68 +0,0 @@
# IMF Horizon Grid and AttnRes Ablation Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Run a 6-run Phase-1 IMF horizon/action-step experiment grid across available GPUs, monitor progress and collect best rollout metrics, then use the best horizon setting for a Phase-2 visual-attnres ablation.

**Architecture:** Use the current IMF training code as-is for Phase-1 by sweeping explicit `(pred_horizon, num_action_steps)` overrides while keeping emb=384, layer=12, and max_steps=50k fixed. Maintain a local experiment suite directory with a manifest and machine-readable status snapshots so progress can be resumed and summarized across turns. After Phase-1 completes, compare the current head-only attnres setup against a variant that also adds attnres into the visual ResNet path.

**Tech Stack:** Python, Hydra/OmegaConf, PyTorch, SSH/Tailscale, JSON/CSV status files, SwanLab.

---

### Task 1: Prepare the experiment suite manifest and state tracking

**Files:**

- Create: `experiment_suites/2026-04-04-imf-horizon-grid/manifest.json`
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/status.json`
- Create: `experiment_suites/2026-04-04-imf-horizon-grid/notes.md`

- [ ] Define the 6 legal Phase-1 combinations: `(8,8)`, `(16,8)`, `(16,16)`, `(32,8)`, `(32,16)`, `(32,32)`.
- [ ] Record for each run: name, host, GPU slot, command, log path, SwanLab run name, and completion criteria.
- [ ] Define the comparison metric as the maximum rollout average reward seen during training (`max avg_reward`), preferably read from the best-checkpoint metadata and cross-checked against logs.
- [ ] Keep `status.json` updated with per-run state: queued / running / finished / failed plus latest parsed progress.
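The "6 legal combinations" rule and the per-run manifest entries can be generated rather than hand-written; the combinations are exactly those where execute-steps do not exceed the prediction horizon. The field names below are assumptions about what `manifest.json` could contain, not a fixed schema.

```python
import itertools
import json

# Legal (pred_horizon, num_action_steps) pairs: action steps must fit in the horizon.
GRID = [(p, a) for p, a in itertools.product((8, 16, 32), (8, 16, 32)) if a <= p]

def make_manifest(host="remote-8xL20"):
    """Sketch of one manifest entry per run; field names are illustrative."""
    runs = []
    for gpu, (pred, act) in enumerate(GRID):
        runs.append({
            "name": f"imf_h{pred}_a{act}",
            "host": host,
            "gpu": gpu,
            "overrides": [f"agent.pred_horizon={pred}",
                          f"agent.num_action_steps={act}"],
            "state": "queued",        # queued / running / finished / failed
            "best_avg_reward": None,  # filled in by the monitoring loop
        })
    return runs

manifest_json = json.dumps(make_manifest(), indent=2)
```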
### Task 2: Prepare the remote 8-GPU execution target

**Files:**

- Remote working directory under `/home/droid/`
- Reuse or create a synced code directory for this suite

- [ ] Verify the remote dataset path and environment path.
- [ ] Verify GPU availability and reserve 6 GPUs for Phase-1 launches.
- [ ] Sync the required code to a dedicated remote suite directory.
- [ ] Record exact remote paths back into the local suite manifest.

### Task 3: Launch the 6 Phase-1 experiments in parallel

**Files:**

- Reuse: `roboimi/demos/vla_scripts/train_vla.py`
- Modify only local suite tracking files unless a launch bug is discovered

- [ ] Launch 6 runs concurrently with fixed settings: IMF, emb=384, layer=12, max_steps=50k.
- [ ] Keep all other relevant training hyperparameters aligned to the current strong baseline unless a concrete blocker appears.
- [ ] Assign one GPU per run on the 8xL20 host.
- [ ] Capture PID, log path, and SwanLab URL for each run in `status.json`.

### Task 4: Monitor and summarize Phase-1 until all 6 finish

**Files:**

- Update: `experiment_suites/2026-04-04-imf-horizon-grid/status.json`
- Update: `experiment_suites/2026-04-04-imf-horizon-grid/notes.md`

- [ ] Periodically parse each run’s log/checkpoints to extract latest step, latest rollout reward, and best rollout reward so far.
- [ ] Keep a resumable local summary so progress can be continued in later turns without rediscovery.
- [ ] After all 6 runs finish, rank them by `max avg_reward` and write a compact Phase-1 summary.
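The log-parsing half of the monitoring loop could be as small as the sketch below. The `avg_reward=<float>` line format is an assumption about what the training log emits; the regex would need adjusting to the actual rollout log line.

```python
import re

# Assumed log format: "... avg_reward=0.42 ..." (adjust to the real emitter).
ROLLOUT_RE = re.compile(r"avg_reward[=:\s]+([0-9]+(?:\.[0-9]+)?)")

def best_avg_reward(log_text):
    """Extract the best rollout average reward seen so far from a training log.

    Returns None when no rollout line has appeared yet, so the status
    snapshot can distinguish "no rollouts" from "reward 0.0".
    """
    rewards = [float(m.group(1)) for m in ROLLOUT_RE.finditer(log_text)]
    return max(rewards) if rewards else None
```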
### Task 5: Prepare the Phase-2 visual-attnres ablation

**Files:**

- Likely modify: vision backbone implementation and config files (to be confirmed after code inspection)
- Add/update targeted tests for the visual backbone path if code changes are needed

- [ ] Use the best Phase-1 `(pred_horizon, num_action_steps)` combination as the fixed rollout setting for Phase-2.
- [ ] Compare:
  1. current setup: attnres only in the IMF head
  2. ablation setup: attnres in both IMF head and visual encoder path
- [ ] Keep the rest of the training settings fixed.
- [ ] Launch and monitor the Phase-2 pair after the Phase-1 summary is complete.
@@ -1,92 +0,0 @@
# LEWM ViT Backbone Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k runs on the 5880 machine.

**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from the joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.

**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.

---

### Task 1: Add failing tests for LEWM joint-vision backbone contract

**Files:**

- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**

### Task 2: Implement LEWM joint-multiview frozen backbone

**Files:**

- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed

- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
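Step 4's freeze-and-stay-eval pattern can be sketched as below. `FrozenEncoder` is a minimal stand-in, not the actual `LEWMViTBackbone`: the point is that freezing parameters alone is not enough, because a later `agent.train()` would still flip BatchNorm/Dropout inside the frozen module back into training mode unless `train()` is overridden.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Freeze a wrapped module and keep it in eval mode permanently (sketch)."""

    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner
        for p in self.inner.parameters():
            p.requires_grad_(False)  # freeze weights
        self.inner.eval()

    def train(self, mode: bool = True):
        # Override train() so parent .train() calls cannot re-enable
        # Dropout/BatchNorm training behavior inside the frozen encoder.
        super().train(mode)
        self.inner.eval()
        return self

    @torch.no_grad()
    def forward(self, x):
        return self.inner(x)
```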
### Task 3: Add minimal agent support for joint visual dim

**Files:**

- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**

### Task 4: Add Hydra configs for LEWM backbone training

**Files:**

- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`

- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**

### Task 5: Verify focused local tests

**Files:**

- Reuse the above

- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**

### Task 6: Sync to 5880 and remote smoke with real checkpoint

**Files:**

- Remote target: `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**

### Task 7: Launch two real 50k runs on the 5880 machine

**Files:**

- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`

- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, PIDs, log paths, SwanLab URLs**

### Task 8: Update experiment tracking docs and commit

**Files:**

- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`

- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**
@@ -1,64 +0,0 @@
# Phase-2 Full-AttnRes Vision Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace all ResNet residual units in the vision backbone with AttnRes-based image blocks while preserving the current IMF agent interfaces, and launch a Phase-2 experiment anchored on the best Phase-1 horizon setting.

**Architecture:** Keep the current multi-camera encoder shell and per-camera output contract, but introduce a new ResNet-like 2D AttnRes backbone that preserves stage-wise downsampling and final SpatialSoftmax conditioning. Wire it into the existing `ResNetDiffusionBackbone` via an opt-in mode and keep the agent/head/data interfaces unchanged.

**Tech Stack:** PyTorch, Hydra/OmegaConf, existing IMF AttnRes transformer components, pytest.

---

### Task 1: Add failing tests for the new full-AttnRes visual backbone

**Files:**

- Create: `tests/test_attnres_resnet2d_backbone.py`
- Update: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write a failing backbone shape test**
- [ ] **Step 2: Run it to confirm the new backbone/config does not exist yet**
- [ ] **Step 3: Add a failing IMF agent wiring test for unchanged cond_dim=208**
- [ ] **Step 4: Run the targeted tests and capture the failure**

### Task 2: Implement a ResNet-like 2D AttnRes backbone

**Files:**

- Create: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`

- [ ] **Step 1: Add minimal 2D tokenization helpers and positional encoding / bias handling**
- [ ] **Step 2: Implement `AttnResImageBlock2D` for feature maps**
- [ ] **Step 3: Implement `AttnResResNetLikeBackbone2D` with stage-wise downsampling**
- [ ] **Step 4: Wire `_SingleRgbEncoder` to choose between original ResNet trunk and the new full-AttnRes trunk**
- [ ] **Step 5: Run the new backbone tests**
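The tokenize-attend-untokenize contract of `AttnResImageBlock2D` (Step 2) can be sketched with a minimal stand-in. The real block would reuse the existing IMF AttnRes components and the Step 1 positional-encoding helpers; this version uses plain pre-norm self-attention with a residual to show the shape contract only.

```python
import torch
import torch.nn as nn

class AttnResImageBlock2D(nn.Module):
    """Minimal attention-residual block over a 2D feature map (sketch).

    Flattens the (H, W) grid into H*W tokens of dim C, applies pre-norm
    self-attention with a residual connection, and restores the map shape,
    so it can drop in where a ResNet residual unit used to sit.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + attn_out                     # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```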
### Task 3: Expose config switches and agent wiring

**Files:**

- Modify: `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- Modify: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`

- [ ] **Step 1: Add a backbone mode/config flag for the full-AttnRes vision trunk**
- [ ] **Step 2: Add defaults for attnres image depth/heads/etc. if needed**
- [ ] **Step 3: Add a Phase-2 launch override path that enables the new visual trunk**
- [ ] **Step 4: Run agent wiring tests again**

### Task 4: Smoke-verify training path

**Files:**

- Reuse existing training scripts and configs

- [ ] **Step 1: Run a short CPU or tiny-step smoke instantiation / `compute_loss` test**
- [ ] **Step 2: If needed, run a very short training smoke launch**
- [ ] **Step 3: Verify no cond-dim or rollout-loading regressions**

### Task 5: Launch the Phase-2 experiment

**Files:**

- Update experiment tracking under `experiment_suites/`

- [ ] **Step 1: Use the Phase-1 best setting (`pred_horizon=16`, `num_action_steps=8`)**
- [ ] **Step 2: Launch a baseline reference or reuse the existing result**
- [ ] **Step 3: Launch the full-AttnRes vision experiment**
- [ ] **Step 4: Track rollout metrics and compare max avg_reward**
@@ -1,81 +0,0 @@
# ResNet Multitoken IMF Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step and launch four L20 experiments for `n_emb in {256,384}` and `n_layer in {12,16}`.

**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.

**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.

---

### Task 1: Add failing tests for multi-token conditioning

**Files:**

- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Add a direct agent test**
  - Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
  - Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add Hydra wiring test**
  - Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
  - Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`

### Task 2: Implement multi-token ResNet conditioning path

**Files:**

- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`

- [ ] **Step 1: Extend ResNet backbone**
  - Add an opt-in flag to return `(B,T,num_cams,D)` camera tokens instead of one concatenated `(B,T,num_cams*D)` token.
  - Keep standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
  - Support visual features with rank 4 `(B,T,K,D)`.
  - Broadcast state to `(B,T,K,D_state)`, concatenate per camera, apply projector per token, then flatten to `(B,T*K,D_cond)`.
  - Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
  - Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add Hydra config**
  - New agent config uses:
    - separate ResNet-18 per camera
    - standard residual vision trunk (`vision_backbone_mode=resnet`)
    - condition projector output dim tied to `${agent.head.n_emb}`
    - rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
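The Task 2, Step 2 conditioning path can be sketched as a free function (names are illustrative; the real change lives inside `VLAAgent._build_cond()`): state is broadcast to every camera token, paired by concatenation, projected per token, and only then flattened into the cond sequence.

```python
import torch
import torch.nn as nn

def build_multitoken_cond(vis: torch.Tensor, state: torch.Tensor,
                          projector: nn.Module) -> torch.Tensor:
    """Sketch of multi-token condition building.

    vis:   (B, T, K, D_vis)  -- one token per camera per obs step
    state: (B, T, D_state)   -- robot state per obs step
    Returns a condition sequence of shape (B, T*K, D_cond).
    """
    b, t, k, _ = vis.shape
    # Pair the state with EACH camera token (not concatenated across cameras first).
    state_per_cam = state.unsqueeze(2).expand(b, t, k, state.shape[-1])
    paired = torch.cat([vis, state_per_cam], dim=-1)   # (B, T, K, D_vis + D_state)
    cond = projector(paired)                           # projector applied per token
    return cond.flatten(1, 2)                          # (B, T*K, D_cond)
```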
### Task 3: Verify locally

**Files:**

- Modify only if verification reveals issues

- [ ] **Step 1: Run focused tests and make them pass**
  - `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run regression subset**
  - `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run local smoke instantiation**
  - Instantiate the new Hydra config and verify cond shape / sequence length.
- [ ] **Step 4: Keep fixed across runs**

### Task 4: Launch 4 L20 experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke the new config on remote**
- [ ] **Step 3: Launch runs**
  - `(n_emb=256, n_layer=12)`
  - `(n_emb=256, n_layer=16)`
  - `(n_emb=384, n_layer=12)`
  - `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
  - rollout episodes `10`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - standard ResNet-18 vision trunk
  - three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**
@@ -1,78 +0,0 @@
# SigLIP2 Multiview VLA Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.

**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.

**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.

---

### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning

**Files:**

- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Write failing backbone tests**
  - Instantiate the new backbone with a stub SigLIP2 vision model.
  - Assert raw dataset resize is `None`, eval resize is `(256, 256)`, output shape is `(B, T, 3 * per_view_output_dim)`.
  - Assert three views are encoded independently and projected.
- [ ] **Step 2: Run focused tests and verify RED**
  - Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
  - Expect failure because the backbone/config/projector do not exist yet.
- [ ] **Step 3: Extend agent wiring tests**
  - Add a Hydra/instantiate test for a new SigLIP2 IMF config.
  - Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim `384`, and head `cond_dim == 384`.

### Task 2: Implement SigLIP2 backbone and optional condition projector

**Files:**

- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`

- [ ] **Step 1: Implement backbone**
  - Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
  - Normalize `[0,1]` pixels with mean/std `0.5` and encode each view independently.
  - Project each 768-d pooled feature to a configurable per-view dim and concatenate across cameras.
- [ ] **Step 2: Implement optional condition projector**
  - Allow `VLAAgent` to accept `cond_projector`.
  - Track `raw_per_step_cond_dim` and projected `per_step_cond_dim` / `global_cond_dim`.
  - Apply the projector in `_build_cond()` after visual+state concatenation.
- [ ] **Step 3: Add Hydra configs**
  - New agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
  - Backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
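The Step 1 encode-and-project contract can be sketched with a stub in place of the real encoder. `MultiviewSiglipBackbone` is an illustrative name; the real code would load `google/siglip2-base-patch16-256` via transformers, while here any module mapping `(N, 3, H, W)` to `(N, 768)` pooled features stands in for it.

```python
import torch
import torch.nn as nn

class MultiviewSiglipBackbone(nn.Module):
    """Per-view encode-and-project sketch (encoder is a stand-in)."""

    def __init__(self, vision_model: nn.Module, per_view_output_dim: int,
                 num_cameras: int = 3, feat_dim: int = 768):
        super().__init__()
        self.vision_model = vision_model
        self.num_cameras = num_cameras
        self.proj = nn.Linear(feat_dim, per_view_output_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, T, K, 3, H, W), pixel values in [0, 1]
        b, t, k, c, h, w = images.shape
        x = (images - 0.5) / 0.5              # normalize with mean/std 0.5
        x = x.reshape(b * t * k, c, h, w)     # each view is encoded independently
        feats = self.proj(self.vision_model(x))          # (B*T*K, per_view_dim)
        return feats.reshape(b, t, k * feats.shape[-1])  # concat across cameras
```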
### Task 3: Verify locally and prepare remote execution

**Files:**

- Modify as needed only if tests/smoke reveal issues

- [ ] **Step 1: Run focused tests and make them pass**
  - `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 2: Run a local smoke instantiation**
  - Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**

### Task 4: Sync to 5880 and launch experiments

**Files:**

- Remote repo copy under `/home/droid/roboimi_suite_20260404`

- [ ] **Step 1: Stop superseded remote jobs**
- [ ] **Step 2: Sync updated code to remote**
  - Prefer `rsync` or `git push/pull` without overwriting unrelated files.
- [ ] **Step 3: Remote smoke test**
  - Confirm SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
  - Confirm headless rollout path still uses `256x256` eval resize.
- [ ] **Step 4: Launch experiment A**
  - `per_view_output_dim=96`, `embed=384`, `layer=12`, `pred=16`, `exec=8`, `steps=50000`.
- [ ] **Step 5: Launch experiment B**
  - `per_view_output_dim=192`, same other hyperparameters.
- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.**
@@ -1,241 +0,0 @@
# VLA Training + Headless Rollout + SwanLab Design

**Date:** 2026-03-30
**Branch:** feat-align-dp-transformer-ee

## Goal
Complete the training dependencies for the default `resnet_transformer` / `Transformer1D` route in the current repository and start training on the dataset `/home/droid/project/diana_sim/sim_transfer`. At the same time, support uploading SwanLab scalar logs during training, and provide a headless mode for later rollout validation so that no MuJoCo / OpenCV GUI windows pop up.

## Non-Goals
- Do not rewrite the whole training framework
- Do not introduce a new workspace / callback framework
- Do not implement complex video/media log uploads in this round
- Do not modify the dataset format itself

## Current State
- The default training config has been switched to `agent=resnet_transformer`, with `Transformer1D` as the head
- The current environment is missing several Python dependencies required for training: `diffusers`, `torchvision`, `einops`, `swanlab`
- The eval environment `make_sim_env(task_name)` currently hardcodes `is_render=True`
- The camera thread `camera_viewer()` calls `cv2.namedWindow/imshow` by default, so windows pop up even when only the images are wanted
- The training script currently supports train/val loss and checkpoints, but has no SwanLab integration
- The dataset directory `/home/droid/project/diana_sim/sim_transfer` already contains 100 episodes, but no `dataset_stats.pkl` yet

## User Requirements
1. Complete the training dependencies in the existing mamba environment
2. Start training on `/home/droid/project/diana_sim/sim_transfer`
3. If training needs rollout validation, support headless mode with no GUI pop-ups
4. Upload training metrics to SwanLab
5. Default SwanLab project name is `roboimi-vla`
## Proposed Approach
Take a "minimal necessary changes" approach:

### 1. Dependency Layer
Complete the missing training dependencies in the existing `roboimi` environment, keeping the current environment name and script entry points unchanged wherever possible.

#### Install Plan
- Environment: keep using the existing mamba environment `roboimi`
- Install method:
  - Prefer `python -m pip install` from the current env
- Packages to install:
  - `diffusers`
  - `torchvision`
  - `einops`
  - `swanlab`
- Version policy:
  - Prefer the latest installable versions compatible with the current `torch==2.4.0`
  - If compatibility issues appear, fall back to stable versions aligned with `torch 2.4`
- Reproducibility policy:
  - Write the **actually installed resolved versions** back into the repository's environment definition file to avoid later environment drift

Verify the following imports before training:
- `torch`
- `hydra`
- `omegaconf`
- `diffusers`
- `torchvision`
- `einops`
- `swanlab`
- `cv2`
- `h5py`
- `mujoco`
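The pre-training import check can be scripted so one run reports every missing package instead of stopping at the first failure. This is a sketch; `check_imports` is not existing repo code.

```python
import importlib.util

REQUIRED_MODULES = [
    "torch", "hydra", "omegaconf", "diffusers", "torchvision",
    "einops", "swanlab", "cv2", "h5py", "mujoco",
]

def check_imports(names):
    """Return {module_name: importable?} without raising, so all failures
    are visible in a single pass before training starts."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, ok in check_imports(REQUIRED_MODULES).items():
        print(f"{'OK  ' if ok else 'MISS'} {name}")
```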
### 2. Dataset Preparation
Reuse the existing `SimpleRobotDataset` directly, only pointing `data.dataset_dir` at:
- `/home/droid/project/diana_sim/sim_transfer`

Before training, use the existing stats script to generate:
- `/home/droid/project/diana_sim/sim_transfer/dataset_stats.pkl`

The stats-generation command should:
- Be executed from the repository root
- Output stats directly for `/home/droid/project/diana_sim/sim_transfer`
- Leave the training script free of any dependency on the default data directory
### 3. SwanLab Logging
Add a lightweight logging integration layer to the training script:
- Whether SwanLab is enabled is controlled by config; enabled by default
- Default project: `roboimi-vla`
- The API key is never written into the repository or config files; it is used only via local login state or environment variables
- When `train.use_swanlab=true`:
  - If `swanlab` cannot be imported, training fails fast
  - If not logged in or authentication fails, training fails fast
- At every training log point, upload:
  - `train/loss`
  - `train/lr`
  - `train/best_loss`
  - `train/step`
- At every validation, upload:
  - `val/loss`
- At the end of training, record the final checkpoint path and the best checkpoint path
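The fail-fast rule and the scalar payload could be wrapped as below. These helpers are assumptions about how the integration layer might be shaped, not existing code; `swanlab.init` / `swanlab.log` follow the SwanLab Python API, but verify against the installed version.

```python
def init_swanlab_or_fail(project: str = "roboimi-vla"):
    """Fail-fast SwanLab setup (sketch). Import/init errors are re-raised so a
    run with train.use_swanlab=true never silently trains without logging.
    The API key comes from local login state or env vars, never from the repo.
    """
    try:
        import swanlab
        return swanlab.init(project=project)
    except Exception as exc:
        raise RuntimeError(f"SwanLab required but unavailable: {exc}") from exc

def train_scalar_payload(loss: float, lr: float, best_loss: float, step: int) -> dict:
    """Scalars uploaded at every training log point (via swanlab.log)."""
    return {"train/loss": loss, "train/lr": lr,
            "train/best_loss": best_loss, "train/step": step}
```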
### 4. Headless Rollout Design
The goal is for rollout validation to "get image observations without popping any windows".

Minimal-change strategy:
- Add a `headless` / `is_render` parameter to `make_sim_env(...)`
- Add a switch to the camera thread's display logic:
  - In headless mode, keep updating the `r_vis/top/front/...` image caches
  - But do not call `cv2.namedWindow` / `cv2.imshow` / `cv2.waitKey`
- In the eval script:
  - Do not call `env.render()` in headless mode
  - Still allow `env._get_image_obs()` and policy inference to run normally

#### Training-Time Rollout Scope
- This round **will provide an optional checkpoint-time rollout validation path**, disabled by default
- When enabled, the rollout/eval logic in the same repository can be invoked at checkpoint-save time to validate a small number of episodes
- This path must support the **single authoritative switch** `eval.headless=true`, meaning:
  - No MuJoCo viewer pops up
  - No `cv2.namedWindow / cv2.imshow / cv2.waitKey` is executed
  - Images can still be read and policy inference still completes
- By default, no frequent rollouts are added, to avoid slowing training; only the capability and the config switch are provided

If verification shows the camera thread has a hard GUI dependency, the fallback strategy is:
- The training main loop + SwanLab must work first
- Rollout validation stays an explicitly optional capability
- But this round must still deliver at least one callable headless validation path, not documentation alone
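The capture/display decoupling above can be illustrated with a hypothetical cache class (the real change would gate the cv2 calls inside `camera_viewer()` instead): the frame cache is always updated, while the GUI functions are skipped entirely in headless mode.

```python
class CameraCache:
    """Sketch of decoupling image capture from GUI display (hypothetical class).

    In headless mode the latest frame per camera is still cached for
    _get_image_obs(), but no cv2 window function is ever called and cv2
    is not even imported.
    """

    def __init__(self, headless: bool):
        self.headless = headless
        self.frames = {}

    def update(self, camera_name, frame):
        self.frames[camera_name] = frame  # always keep the observation cache fresh
        if self.headless:
            return                        # skip all GUI work
        import cv2                        # imported only when actually displaying
        cv2.imshow(camera_name, frame)
        cv2.waitKey(1)
```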
### 5. Training Execution Strategy
Execute in two steps:

#### Step A: Smoke Run
Start one smoke training run with a small step count to confirm:
- The dataset can be read normally
- The stats file loads
- The model can be instantiated
- A single forward/backward step works
- Checkpoints are written out correctly
- SwanLab scalar uploads succeed

#### Step B: Real Training Run
Start the real training run only after the smoke run succeeds.

## Execution Commands

### A. Stats Generation
Execute from the repository root to generate:
- `/home/droid/project/diana_sim/sim_transfer/dataset_stats.pkl`

Command template:

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/vla/scripts/calculate_stats.py \
  --dataset_dir /home/droid/project/diana_sim/sim_transfer
```
### B. Smoke Training Command
Execute from the repository root; the key overrides include:
- `data.dataset_dir=/home/droid/project/diana_sim/sim_transfer`
- A small `train.max_steps`
- A high logging frequency
- SwanLab enabled
- Output directory: `checkpoints/` under the current run directory

Command template:

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  train.max_steps=20 \
  train.log_freq=1 \
  train.save_freq=10 \
  train.use_swanlab=true \
  train.swanlab_project=roboimi-vla \
  train.rollout_validate_on_checkpoint=false
```

### C. Real Training Command
Execute from the repository root; the key overrides include:
- `data.dataset_dir=/home/droid/project/diana_sim/sim_transfer`
- The real `train.max_steps`
- Default project `roboimi-vla`
- If rollout validation is enabled, also pass `eval.headless=true` plus the training-side rollout switch

Command template:

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  train.use_swanlab=true \
  train.swanlab_project=roboimi-vla \
  train.rollout_validate_on_checkpoint=true \
  eval.headless=true
```

### D. Output Behavior

- checkpoint output directory: `checkpoints/` under the current working directory
- key files:
  - `checkpoints/vla_model_step_<N>.pt`
  - `checkpoints/vla_model_best.pt`
  - `checkpoints/vla_model_final.pt`

## File-Level Changes

- `environment.yml`
  - record the newly added training dependencies so later runs stay reproducible
- `roboimi/demos/vla_scripts/train_vla.py`
  - add SwanLab integration
  - add clearer dataset-directory override support
  - add an optional checkpoint-time rollout-validation entry point
  - keep the current optimizer-alignment logic unchanged
- `roboimi/vla/conf/config.yaml`
  - add/extend config entries for training logs, SwanLab, and rollouts
- `roboimi/vla/conf/eval/eval.yaml`
  - add evaluation controls such as `headless`
- `roboimi/envs/double_pos_ctrl_env.py`
  - `make_sim_env` supports headless / no-render
- `roboimi/envs/double_base.py`
  - decouple camera capture from GUI display
- `roboimi/vla/scripts/calculate_stats.py`
  - support passing an external `dataset_dir` directly on the command line
- tests (new)
  - cover the optional SwanLab initialization path
  - cover the key headless-environment logic of "no window, but images still captured"

## Validation Plan

1. After filling in the dependencies, verify that all imports pass
2. Generate `dataset_stats.pkl`
3. Run the training smoke run
4. Confirm the SwanLab dashboard shows scalar updates under project `roboimi-vla`
5. If rollout validation is enabled: confirm no GUI appears in headless mode and that the rollout path actually executes
6. Then launch the real training run

## Config Contract

The config keys added/fixed in this round take the following form:

- `train.use_swanlab: true|false`
- `train.swanlab_project: roboimi-vla`
- `train.rollout_validate_on_checkpoint: true|false`
- `eval.headless: true|false`
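
The optional SwanLab initialization guarded by these keys can be sketched as below. `maybe_init_swanlab` is a hypothetical helper name (not existing repo code), and it assumes only the standard `swanlab.init(project=...)` entry point; failures stay non-fatal so training never dies on logging problems:

```python
def maybe_init_swanlab(train_cfg):
    """Best-effort SwanLab setup driven by the config contract above.

    train_cfg is a mapping with the `use_swanlab` / `swanlab_project` keys.
    Returns the run handle, or None when disabled or unavailable.
    """
    if not train_cfg.get("use_swanlab", False):
        return None
    try:
        import swanlab  # optional dependency; may be absent in some envs
        return swanlab.init(project=train_cfg.get("swanlab_project", "roboimi-vla"))
    except Exception as exc:  # missing package, no login state, network error
        print(f"[warn] SwanLab logging disabled: {exc}")
        return None

# disabled path: no import is even attempted
run = maybe_init_swanlab({"use_swanlab": False})
```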

## Risks and Mitigations

- **Risk:** GUI/camera threads are coupled with off-screen rendering
  - **Mitigation:** decouple display from image updates first; if necessary, defer rollout validation to a second phase
- **Risk:** incomplete dependencies in the existing env
  - **Mitigation:** verify imports first, then do the smoke run
- **Risk:** a very large dataset makes even the smoke run slow
  - **Mitigation:** run the smoke run with only a tiny step count
- **Risk:** SwanLab API key leakage
  - **Mitigation:** never write it into code/config; keep it only in the local login state or an environment variable

## Success Criteria

- The training script starts on `/home/droid/project/diana_sim/sim_transfer`
- Checkpoints are successfully written to `checkpoints/`
- SwanLab shows train/val scalars under the `roboimi-vla` project
- Headless rollout has an execution path that opens no GUI
- If training-side rollout validation is enabled, that path can actually be invoked in headless mode

# Rollout Artifacts Design

**Goal:** Add a one-off evaluation path that can record rollout video, export per-step timing breakdowns, and save executed end-effector trajectories for a selected checkpoint while preserving default eval behavior when artifact capture is disabled.

**Approach:** Extend `roboimi/demos/vla_scripts/eval_vla.py` with optional evaluation-time artifact capture that stays backward compatible when disabled. Reuse existing environment observation and camera streams, record one camera stream to MP4, collect per-step timing around observation read / preprocessing / model inference / env step / total loop, and save per-step raw predicted EE actions plus executed EE poses after stepping.

**Artifact contract:**

- `video.mp4`: optional MP4 encoded from a selected camera stream (`r_vis`, `top`, `front`, etc.), written only when recording is enabled.
- `trajectory.npz`: canonical trajectory export containing at minimum `step`, `reward`, `raw_action`, `executed_left_link7_pos`, `executed_left_link7_quat`, `executed_right_link7_pos`, `executed_right_link7_quat`, and optional duplicated tool-body poses if captured.
- `timing.json`: JSON-serializable per-episode timing summary with millisecond units for `obs_read_ms`, `preprocess_ms`, `inference_ms`, `env_step_ms`, `loop_total_ms`, plus aggregate mean/std/min/max and counts. Raw per-step timing arrays should also be persisted in the NPZ for later analysis.
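
The aggregate part of that `timing.json` contract can be sketched with the standard library; `summarize_timing` is a hypothetical helper name, not existing repo code:

```python
import json
import statistics

def summarize_timing(per_step_ms):
    """per_step_ms: dict of phase name -> list of per-step millisecond floats.

    Returns the JSON-serializable per-episode summary: mean/std/min/max and a
    count for each phase (obs_read_ms, preprocess_ms, inference_ms, ...).
    """
    summary = {}
    for phase, values in per_step_ms.items():
        summary[phase] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
    return summary

demo = summarize_timing({"inference_ms": [10.0, 12.0, 14.0]})
serialized = json.dumps(demo)  # must round-trip through JSON by contract
```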

**Checkpoint selection:** Prefer an explicitly requested checkpoint path. If the caller asks for "latest" or omits a path in the execution helper, select the newest fully written checkpoint file by mtime/name and fail clearly if none exists.
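
That selection rule can be sketched as follows; `select_checkpoint` and the demo directory are hypothetical, and the real helper would also handle the `vla_model_best.pt` fallback described below:

```python
import os
import tempfile
from pathlib import Path

def select_checkpoint(ckpt_dir, requested=None):
    """Prefer an explicit path; otherwise pick the newest step checkpoint."""
    if requested:
        path = Path(requested)
        if not path.is_file():
            raise FileNotFoundError(f"requested checkpoint missing: {path}")
        return path
    candidates = sorted(Path(ckpt_dir).glob("vla_model_step_*.pt"),
                        key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no checkpoints found in {ckpt_dir}")
    return candidates[-1]  # newest by mtime

# hypothetical demo: two step checkpoints, the newer mtime wins
demo_dir = tempfile.mkdtemp()
for mtime, name in [(1, "vla_model_step_10.pt"), (2, "vla_model_step_20.pt")]:
    p = os.path.join(demo_dir, name)
    open(p, "w").close()
    os.utime(p, (mtime, mtime))
picked = select_checkpoint(demo_dir)
```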

**Stop-training / execution safety:** Before rollout, stop any active training process using the target run, wait for process exit, then verify the chosen checkpoint exists and is readable. If the most recent checkpoint is missing or mid-write, fall back to the previous completed checkpoint or `vla_model_best.pt` with the decision logged.

**Backward compatibility:** With all new eval flags left at default values, `_run_eval` return shape must remain compatible with existing callers, training-time rollout validation should continue to work without passing new options, and no artifact files should be written.

# IMF-AttnRes Policy Migration Design

**Date:** 2026-04-01
**Status:** Approved in chat, written spec pending review

## Goal

Migrate the IMF-AttnRes diffusion policy from commit `185ed659` of `/home/droid/project/diffusion_policy` into the current `roboimi` repository as an alternative training option to the existing DiT/Transformer diffusion policies. Migrate its training objective and one-step inference mechanism as well, while keeping RoboIMI's existing simulation environment, three-camera visual input, dataset format, training script, and rollout-validation workflow usable.

## Non-Goals

- Do not migrate obs encoders, datasets, env wrappers, or PushT-specific logic from the external repo that are irrelevant to the current task.
- Do not replicate the external repo's full directory structure; migrate only the model, loss, and inference semantics needed for current RoboIMI training.
- Do not keep the old DiT as the default training target in this work; the old configs remain usable, but the new model gets its own config entry point.

## User-Confirmed Requirements

1. The migration target is the **IMF-AttnRes model code** from `185ed659`.
2. Migrate not just the skeleton, but also:
   - the **training objective**
   - the **one-step inference mechanism**
3. Visual input matches the current RoboIMI diffusion policy:
   - three camera images as conditioning input
   - image observations must serve as conditions, not be concatenated into the prediction target
4. In the current task, the IMF policy replaces the existing DiT/Transformer diffusion policy for training.
5. Training hyperparameters largely follow the most recent run (to be overridden explicitly by the training command later), but inference switches to IMF's one-step mechanism.
6. The user accepts IMF's implementation constraint of full / non-causal attention.

## External Source of Truth

Migration semantics follow these files in the external repo:

- `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- `diffusion_policy/policy/imf_transformer_hybrid_image_policy.py`
- reference config: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`

The key difference: this policy is not multi-step DDPM/DDIM denoising but an IMF training objective plus one-step inference.

## Current RoboIMI Baseline

The parts of RoboIMI directly relevant to this task:

- Visual encoding: `ResNetDiffusionBackbone`
  - three cameras: `r_vis`, `top`, `front`
  - at each timestep, camera features are concatenated with `qpos` into a per-step condition
- Policy body: `VLAAgent`
  - `compute_loss()` uses the DDPM noise-prediction loss
  - `predict_action()` uses multi-step DDIM sampling
  - online control triggers chunk-wise prediction through the action-queue mechanism in `select_action()`
- Training script: `roboimi/demos/vla_scripts/train_vla.py`
  - supports GPU training, SwanLab logging, and headless rollout validation

The core of this migration is therefore not swapping the visual backbone, but replacing the **head + loss + inference semantics**.

## Recommended Integration Approach

Use a **minimally invasive integration**:

1. **Keep RoboIMI's current visual encoding, data loading, rollout/eval, and training-script framework.**
2. **Add a dedicated IMF head module**, implemented locally in RoboIMI:
   - the AttnRes components
   - the IMF transformer body
3. **Add a dedicated IMF agent** that reuses the current `VLAAgent`'s:
   - normalization logic
   - camera-order management
   - observation cache / action-chunk cache
   - rollout interface

   while overriding:
   - `compute_loss()`
   - `predict_action()`
4. **Add a standalone Hydra config** so the IMF policy becomes a new agent option without breaking the existing resnet_transformer / gr00t_dit configs.

Reasons for this approach:

- migrating the IMF semantics does not disturb the current DDPM agent;
- the rollout / eval / checkpoint logic stays reusable;
- direct A/B training comparisons against the existing Transformer / DiT become easy.

## Architecture

### 1. Observation / Conditioning Path

Keep RoboIMI's current visual path:

- input observations: `images={r_vis, top, front}` + `qpos`
- `ResNetDiffusionBackbone` encodes each camera into a per-camera feature
- `state_encoder` encodes `qpos`
- the three camera features and the state feature are concatenated per timestep into `per_step_cond`

The external repo's obs_encoder implementation is not migrated; we only align with the semantics of **"images enter the transformer as condition tokens"**.

### 2. Condition Tokenization

Align with the external IMF transformer's token usage:

- action trajectory tokens: `(B, pred_horizon, action_dim)` mapped to `n_emb` through a linear layer
- time tokens: two scalars `r` and `t`, each turned into a token via sinusoidal embedding + linear projection
- observation tokens: `per_step_cond` mapped to `n_emb` through a linear layer
- the final token sequence is:
  - `[r_token, t_token, obs_cond_tokens..., action_tokens...]`

In the current task, the number of obs tokens equals `obs_horizon`, and image observations always enter as conditions.

### 3. IMF-AttnRes Backbone

Add an AttnRes backbone implementation inside RoboIMI, preserving the key semantics of the external commit:

- `RMSNorm` / `RMSNormNoWeight`
- RoPE
- grouped-query self-attention
- SwiGLU FFN
- the AttnRes operator / residual-source aggregation
- `AttnResTransformerBackbone`

Also preserve:

- **full attention** (no causal attention)
- `backbone_type='attnres_full'`
- the output slices back to the action-token portion only, then passes through a final norm + head to produce a velocity-like output

### 4. Training Objective

Replace the current DDPM epsilon-prediction objective with the external IMF objective.

Given a ground-truth trajectory `x` and random noise `e`:

1. Sample `t ~ U(0,1)` and `r ~ U(0,1)`, sorted so that `t >= r`
2. Construct the interpolated state:
   - `z_t = (1 - t) x + t e`
3. Compute with the model:
   - `v = f(z_t, t, t, cond)`
4. Take the JVP of `g(z, r, t) = f(z, r, t, cond)` to obtain:
   - `u, du_dt`
5. Construct the compound velocity:
   - `V = u + (t - r) * du_dt`
6. The target is:
   - `target = e - x`
7. The final loss is the MSE over the action dimensions

The existing `action_is_pad` field in RoboIMI batches must remain supported; when padding exists, compute the loss only over valid actions.
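
The steps above can be sketched in PyTorch. This is an illustrative sketch only: the tangent passed to the JVP (here `(v.detach(), 0, 1)`) is an assumption, and per the risks section the real migration must copy the exact tangent construction from the external source rather than this approximation. The `action_is_pad` masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def imf_loss(model, x, cond):
    """IMF compound-velocity loss sketch.

    model(z, r, t, cond) -> velocity prediction shaped like x.
    x: (B, H, A) ground-truth action trajectory.
    """
    B = x.shape[0]
    e = torch.randn_like(x)                          # noise endpoint
    tr = torch.rand(B, 2, device=x.device)
    t = tr.max(dim=1).values                         # enforce t >= r
    r = tr.min(dim=1).values
    tb = t.view(B, 1, 1)
    z_t = (1 - tb) * x + tb * e                      # interpolated state

    v = model(z_t, t, t, cond)                       # v = f(z_t, t, t, cond)

    # JVP of g(z, r, t) = f(z, r, t, cond); yields u and du/dt in one pass.
    g = lambda z_, r_, t_: model(z_, r_, t_, cond)
    u, du_dt = torch.func.jvp(
        g, (z_t, r, t), (v.detach(), torch.zeros_like(r), torch.ones_like(t))
    )
    V = u + (t - r).view(B, 1, 1) * du_dt            # compound velocity
    return F.mse_loss(V, e - x)                      # target = e - x

# toy stand-in model, purely for shape checking
toy = lambda z, r, t, c: z * (1.0 + t.view(-1, 1, 1)) - r.view(-1, 1, 1)
loss = imf_loss(toy, torch.randn(4, 8, 7), None)
```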

### 5. One-Step Inference

Inference switches to the external IMF one-step sampling semantics:

1. Initialize the action trajectory `z_t` from a standard Gaussian
2. Compute `u = f(z_t, r=0, t=1, cond)`
3. One-step update:
   - `x_hat = z_t - (t-r) * u = z_t - u`
4. Unnormalize to obtain the action sequence

This means:

- `num_inference_steps` is fixed to `1` for the IMF policy
- the DDIM scheduler's multi-step `step()` is no longer called
- online control keeps the current chunk mechanism:
  - when the action queue is empty, trigger one `predict_action_chunk()`
  - enqueue the slice `[obs_horizon-1 : obs_horizon-1+num_action_steps]` of the predicted sequence

In other words, **the rule for triggering a model forward pass is unchanged; what changes is how the action sequence is generated on each trigger**.
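
The one-step update plus the unchanged chunk slicing can be sketched as follows; `imf_predict_chunk` is a hypothetical name, and unnormalization is left as a comment:

```python
import torch

@torch.no_grad()
def imf_predict_chunk(model, cond, pred_horizon, action_dim,
                      obs_horizon, num_action_steps):
    """One-step IMF sampling followed by the existing chunk slice."""
    B = cond.shape[0]
    z = torch.randn(B, pred_horizon, action_dim)   # Gaussian init
    r = torch.zeros(B)
    t = torch.ones(B)
    u = model(z, r, t, cond)                       # single forward, no DDIM loop
    x_hat = z - u                                  # z - (t - r) * u with t=1, r=0
    # the real agent unnormalizes x_hat here
    start = obs_horizon - 1                        # slice fed to the action queue
    return x_hat[:, start:start + num_action_steps]

toy = lambda z, r, t, c: torch.zeros_like(z)       # stand-in model
chunk = imf_predict_chunk(toy, torch.zeros(2, 5), pred_horizon=8,
                          action_dim=7, obs_horizon=2, num_action_steps=4)
```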

## API / Code Structure

The planned code boundaries:

- `roboimi/vla/models/heads/attnres_transformer_components.py`
  - the IMF AttnRes base components
- `roboimi/vla/models/heads/imf_transformer1d.py`
  - the RoboIMI version of the IMF transformer head
  - exposes `forward(sample, r, t, cond=None)`
  - exposes `get_optim_groups()` for AdamW parameter grouping
- `roboimi/vla/agent_imf.py`
  - reuses `VLAAgent`'s observation handling / normalization / queue infrastructure
  - overrides the training loss and one-step prediction logic with IMF's
- Hydra configs
  - `roboimi/vla/conf/head/imf_transformer1d.yaml`
  - `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`

The training script's main flow should change as little as possible; it only needs to instantiate the new agent and keep using the current rollout / checkpoint / SwanLab logic.

## Compatibility Decisions

### Initial Config Defaults To Preserve

To avoid semantic drift during migration, the first-version IMF config defaults are fixed as:

- `backbone_type: attnres_full`
- `n_head: 1`
- `n_kv_head: 1`
- `n_cond_layers: 0`
- `time_as_cond: true`
- `causal_attn: false`
- `num_inference_steps: 1`

These defaults match the IMF-AttnRes usage in external `185ed659`; later tuning may override them, but the first migrated version must run with exactly these semantics.

### Reuse From RoboIMI

Keep:

- the three-camera data loading
- the ResNet visual backbone
- qpos / action normalization
- the training loop, optimizer, scheduler, SwanLab, headless rollout
- the online chunk execution in `select_action()`

### Replace With External IMF Semantics

Replace:

- the transformer head implementation
- the diffusion training objective
- the inference sampling semantics

### Intentionally Not Mirrored 1:1

Parts deliberately not kept identical to the external repo:

- the external repo's policy base-class hierarchy
- the external repo's obs-encoder module tree
- the external repo's normalizer / mask-generator framework

RoboIMI already has stable data interfaces and a stable rollout flow, so grafting onto them directly is more robust.

## Testing / Verification Strategy

After migration, verify at least the following:

1. **Unit / smoke checks**
   - the IMF head forward pass produces correct shapes
   - the IMF agent's `compute_loss()` runs forward and backward on a real batch
   - the IMF agent's `predict_action()` outputs `(B, pred_horizon, action_dim)`
2. **Training-pipeline checks**
   - run a short GPU training job and confirm:
     - the dataloader works
     - the optimizer / lr scheduler work
     - SwanLab records the config and training metrics
3. **Rollout checks**
   - periodic headless rollouts during training complete
   - the environment still receives actions through the EE-style `step()`
4. **Final delivery**
   - launch the real training run with the user-specified class of hyperparameters

## Risks and Mitigations

### Risk 1: JVP is unstable on CUDA attention kernels

Mitigation: follow the external repo's strategy of switching to the math SDP kernel on the JVP path, falling back to `torch.autograd.functional.jvp` if needed. The tangent construction and the `u, du_dt` computation flow must strictly match the external source; do not rewrite their mathematical semantics in this migration.

### Risk 2: Optimizer parameter grouping misses the new modules

Mitigation: the IMF head provides `get_optim_groups()`, and the training script uses it uniformly whenever a head exposes that interface, rather than keying off the old `head_type`.

### Risk 3: Existing rollout logic assumes multi-step DDIM sampling

Mitigation: keep the `select_action()` / `predict_action_chunk()` interfaces unchanged and replace only the internals of `predict_action()`, so eval code never needs to understand IMF details.

### Risk 4: Training-command arguments diverge from the new config

Mitigation: add a standalone agent config and keep the previous training parameters as an explicit CLI-override template.

## Success Criteria

The migration is considered successful when all of the following hold:

1. RoboIMI gains an IMF-AttnRes policy that can be enabled on its own via a Hydra config.
2. Training uses the external IMF loss, not the current DDPM epsilon loss.
3. Inference uses one-step IMF sampling, not multi-step DDIM sampling.
4. The three camera images always participate in the model forward pass as conditions.
5. Online rollouts complete in the headless simulation environment.
6. Training can be launched with the parameter template of the most recent experiment.

# IMF Rollout Trajectory Images + Short-Horizon Training Design

## Background
The current RoboIMI IMF training flow can perform rollout validation and log scalar reward metrics to SwanLab, but it does not yet emit the qualitative rollout artifacts now required for analysis. The user wants training-time rollout validation to save front-view trajectory images with the model-generated trajectory drawn in red, upload those images to SwanLab, and then start a new local short-horizon IMF training run.

## Goals
1. During training-time rollout validation, save one **front-camera** trajectory image per rollout episode.
2. The image must show the rollout EE trajectory in red.
3. Reuse the existing repository trajectory visualization logic as much as practical, especially the existing red capsule-marker trajectory representation.
4. Save 5 rollout images locally for each validation event and upload the same 5 images to SwanLab.
5. Do **not** record rollout videos for this training-time validation flow.
6. Start a new local IMF-AttnRes training run with:
   - `agent.head.n_emb=384`
   - `agent.head.n_layer=12`
   - `agent.pred_horizon=8`
   - `agent.num_action_steps=4`
   - `train.max_steps=50000`
   - `train.rollout_num_episodes=5`
   - `train.use_swanlab=true`

## Non-Goals
- No IMF architecture or loss-function change.
- No dataset schema change.
- No rollout video generation for the new training flow.
- No interactive viewer requirement.

## Existing Relevant Code
- `roboimi/demos/vla_scripts/eval_vla.py`
  - already supports rollout summaries, optional trajectory export, and optional video export.
- `roboimi/utils/raw_action_trajectory_viewer.py`
  - already contains the red trajectory capsule-marker construction logic.
- `roboimi/demos/vla_scripts/train_vla.py`
  - already performs periodic rollout validation and scalar SwanLab logging.
- `roboimi/vla/agent.py`
  - already implements "predict pred_horizon, execute first num_action_steps" queue semantics.

## Design Decisions

### 1. Artifact contract
Each rollout episode will emit one distinct PNG file under the eval artifact directory. The file naming/path contract must be per-episode, not shared, so a 5-episode validation event yields 5 stable image paths without overwriting.

### 2. Trajectory definition
The red trajectory corresponds to the **actually executed model action sequence** over the rollout loop: the raw EE actions returned and consumed step-by-step by the policy loop. For the requested short-horizon run, this means the visualization reflects repeated execution of the first 4 actions from each predicted 8-action chunk, not every discarded future prediction from replanning.

### 3. Camera choice
The training-time image export path is explicitly pinned to the repo's concrete `front` camera key. It must not silently use `camera_names[0]` if that is not `front`.

### 4. Rendering path
`eval_vla.py` will add a lightweight headless image-export path that:
- renders the `front` camera frame,
- overlays the trajectory using the existing red trajectory representation,
- saves a static PNG per episode.

The implementation may reuse the existing marker-construction logic directly and add a minimal helper for final image composition/export.

### 5. Training-time behavior
`train_vla.py` rollout validation must explicitly:
- request/save trajectory images,
- keep `record_video=false`,
- return the 5 per-episode image paths in the rollout summary payload,
- upload those 5 images to SwanLab,
- keep image-upload failures non-fatal.

## Expected User-Visible Outcome
For each scheduled validation event in the new training run:
- 5 rollout episodes execute,
- 5 front-view PNG trajectory images are saved locally,
- the same 5 images are uploaded to SwanLab,
- scalar reward metrics continue to be logged,
- no rollout videos are generated.

## Risks and Mitigations
- **Headless rendering conflicts from desktop env vars**: force headless eval onto EGL when `headless=true`.
- **Image overwrite risk**: use explicit per-episode artifact paths.
- **SwanLab media API mismatch**: isolate media logging in a small best-effort helper.

# LEWM ViT Backbone Replacement Design

## Goal
Replace the ResNet visual encoder in the current roboimi VLA policy with a frozen ViT visual encoder (encoder + projector) from the LEWM checkpoint, using only the 192-d embedding of the final CLS token as the visual feature.

## User constraints
- Use the trained checkpoint confirmed in `/home/droid/下载/lewm_sim_transfer_checkpoint_usage.md`
- Use only the visual encoding part: `encoder + projector`
- Weights frozen
- Keep the overall processing of "visual features + state concatenated, then fed to the diffusion transformer"
- Inputs use three views: `[r_vis, top, front]`
- Launch two training runs on the 5880 machine: `embed=384/layer=12` and `embed=256/layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- `50k` steps per run
- rollout validation uses `10` episodes each time, not the previous `5`

## Trusted existing facts
1. LEWM checkpoint path:
   - `/home/droid/le-wm/lewm-sim-transfer/pa1w85md8jop6bvol8oxp/checkpoints/epoch=99-step=47800.ckpt`
2. state_dict prefixes to load:
   - `model.encoder.*`
   - `model.projector.*`
3. LEWM ViT config:
   - encoder scale: `tiny`
   - hidden size: `192`
   - layers: `12`
   - attention heads: `3`
   - patch size: `14`
   - projector: `MLP(192 -> 2048 -> 192)` with `BatchNorm1d + GELU`
4. During LEWM training, the three views are first fused into a single image and fed to one ViT encoder; the overall visual embedding is **192-d**.

## Key design decision
### Chosen design: fuse 3 cameras into one LEWM-style image, output one 192-d visual vector per timestep
Rather than treating the LEWM ViT as "one 192-d encoder per camera", follow LEWM's original training setup:
- input the three-view image dict `{r_vis, top, front}`
- concatenate the views into one fused image in a fixed order
- run a single frozen ViT + projector
- obtain one **192-d total visual feature**

### Why this is the right replacement
The **total visual feature width** the current ResNet backbone hands to the policy head is:
- `64` per camera
- `192` across three cameras

The CLS/projector embedding from the LEWM checkpoint is likewise:
- `192` in total

So the most natural "drop-in replacement for the current ResNet visual encoder" is:
- let the LEWM backbone directly produce a 192-d total visual vector
- concatenate with the `16-d` state as before, still yielding a `208-d` condition vector
- leave the diffusion head's overall interface and semantics unchanged

## Interface compatibility plan
The existing `VLAAgent` assumes the backbone exposes:
- `camera_names`
- `num_cameras`
- `output_dim` (semantically the "per-camera feature width")
- `forward(images_dict) -> (B, T, total_visual_dim)`

For minimal-change compatibility with the existing agent:
- the new LEWM backbone's `forward()` returns `(B, T, 192)`
- `camera_names = ('r_vis', 'top', 'front')`
- `num_cameras = 3`
- `output_dim = 64`

`VLAAgent` then still computes:
- `per_step_cond_dim = output_dim * num_cams + obs_dim = 64*3 + 16 = 208`

which matches the actual `forward()` output of `192 + 16 = 208`.

> In other words, `output_dim` in this backbone is kept as a "per-camera placeholder width equivalent to the old ResNet total", not the "real projector output width". It is a compatibility shim to avoid touching the agent's main logic.

## Image preprocessing design
The current roboimi dataset already reads each camera image as:
- `(C, 224, 224)`
- values in `[0, 1]`

The new LEWM backbone will:
1. take `r_vis`, `top`, `front` in order
2. concatenate them along the width, producing the fused image:
   - `(C, 224, 672)`
3. apply the same ImageNet normalization as LEWM:
   - mean `[0.485, 0.456, 0.406]`
   - std `[0.229, 0.224, 0.225]`
4. call `ViTModel(..., interpolate_pos_encoding=True)`
5. take `last_hidden_state[:, 0]`
6. feed the frozen projector to obtain `(B*T, 192)`

## Files to create / modify
### New files
- `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- `tests/test_lewm_vit_backbone.py`

### Modified files
- `roboimi/vla/models/backbones/__init__` (if exports are needed)
- `tests/test_imf_vla_agent.py` (add an integration case for the new backbone)
- `roboimi/demos/vla_scripts/train_vla.py` (only if rollout defaults/logging need adjusting; avoid touching the main logic if command-line overrides suffice)
- the training/experiment suite docs (add a record for this LEWM ViT run)

## Testing plan
1. **Unit test: load + forward**
   - verify with a synthetic checkpoint that the new backbone correctly loads `model.encoder.*` and `model.projector.*`
   - input: 3 cameras `(B,T,C,224,224)`
   - output: `(B,T,192)`
2. **Agent integration test**
   - backbone.output_dim=64, num_cameras=3
   - the agent's `_build_cond()` output has last dimension `208`
3. **Remote smoke test on 5880**
   - use the real checkpoint
   - `max_steps=2`
   - smoke each of the two experiments once
4. **Full run**
   - GPU0: `embed=384, layer=12`
   - GPU1: `embed=256, layer=12`
   - `rollout_num_episodes=10`

## Training launch contract
- host: `100.73.14.65`
- code dir: `/home/droid/roboimi_suite_20260404`
- python: `/home/droid/miniforge3/envs/roboimi/bin/python`
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[r_vis, top, front]`
- agent: new `lewm_imf_attnres`
- max_steps: `50000`
- rollout every `5` epochs
- rollout episodes: `10`

## Risks
1. If the fused-image preprocessing gets the orientation wrong relative to LEWM training (224x672 vs 672x224), it causes a distribution shift.
2. The current roboimi env must have `transformers` installed; `environment.yml` shows it locally, but the remote training environment needs a smoke check.
3. Because this is a frozen ViT + projector, the projector's BatchNorm statistics would drift if left in train mode, so the whole module must be set to `eval()` and frozen.

## Recommended first implementation path
- First implement a standalone `LEWMViTBackbone` class without touching the existing `ResNetDiffusionBackbone` logic.
- Then wire it in through the new hydra backbone/agent configs.
- Prioritize "minimal intrusion + smoke-runnable + trainable remotely".

# Phase-2 Full-AttnRes Vision Design

## Goal
In the current roboimi IMF policy, replace every residual unit in the visual backbone originally provided by ResNet BasicBlock/Bottleneck with AttnRes-style units, while keeping the existing agent / cond / rollout / training-script interfaces as unchanged as possible.

## User requirement interpretation
Apply the strictest interpretation:
- not "append an AttnRes module after the ResNet"
- not "mix AttnRes into only a few stages"
- but: everywhere the visual trunk relied on a ResNet residual block, switch to a block driven by the AttnRes residual operator
- the final output keeps the same per-camera feature interface as the existing `ResNetDiffusionBackbone`, so `SpatialSoftmax -> Linear -> ReLU`, multi-camera concatenation, state concat, and the IMF head's condition input can all be reused

## Recommended design
### Option A (recommended)
Keep the ResNet's macro stage/stem structure and its channel/stride plan, but replace each stage's BasicBlock/Bottleneck with a new `AttnResImageBlock2D`:
- the input is still a `(B, C, H, W)` feature map
- inside the block, flatten the spatial dims into a token sequence `(B, H*W, C)`
- apply 2D positional encoding / learnable positional bias + AttnRes self-attention + AttnRes FFN
- reshape back to `(B, C, H, W)`
- downsampling between stages still happens through an explicit stride/downsample path
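
The flatten-attend-reshape block shape can be sketched as below. This is a structural sketch only: standard multi-head attention and a GELU FFN stand in for the AttnRes operator, RoPE, and SwiGLU of the real block, and the class name is hypothetical:

```python
import torch
import torch.nn as nn

class AttnResImageBlock2DSketch(nn.Module):
    """Residual block over a (B, C, H, W) feature map via token attention."""

    def __init__(self, channels, n_head=1):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, n_head, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(nn.Linear(channels, 4 * channels),
                                 nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        tok = x.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        h = self.norm1(tok)
        tok = tok + self.attn(h, h, h, need_weights=False)[0]  # attn residual
        tok = tok + self.ffn(self.norm2(tok))                  # FFN residual
        return tok.transpose(1, 2).reshape(B, C, H, W)         # back to 2D

y = AttnResImageBlock2DSketch(32)(torch.randn(2, 32, 8, 8))
```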

Pros:
- closest to the requirement that "all residuals in the ResNet are replaced by AttnRes"
- keeps the existing visual output interface and cond_dim, so the agent/head/data pipeline need no changes
- still fits the existing multi-camera encoder framework

Cons:
- requires writing a new 2D AttnRes image block rather than directly reusing the 1D IMF head block

### Option B
Remove the ResNet stages entirely and switch to patchify + a ViT/AttnRes image transformer, followed by SpatialSoftmax/MLP.

Pros: conceptually more uniform.
Cons: this no longer "replaces the residuals inside the ResNet" but swaps the backbone outright, which does not fully match the user's request.

### Option C
Keep the existing ResNet blocks and only add AttnRes mixing around them.

Not recommended: it fails "all residuals replaced by AttnRes".

## Concrete architecture choice
Adopt Option A:
1. keep the stem (conv/bn-or-gn/relu/maxpool) and the stage boundaries
2. add `AttnResImageBlock2D`
3. add `AttnResResNetLikeBackbone2D` to stack the stages/blocks
4. add an optional backbone mode to `ResNetDiffusionBackbone`, e.g.:
   - `vision_backbone_mode: resnet`
   - `vision_backbone_mode: attnres_resnet`
5. add a Phase-2 variant of the `resnet_imf_attnres` agent config with `attnres_resnet` on by default
6. still keep:
   - `64` output per camera
   - `3 * 64` total visual output across cameras
   - `cond_dim = 208` after concatenation with the state

## Files likely to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/conf/backbone/resnet_diffusion.yaml`
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- new: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- tests:
  - new: `tests/test_attnres_resnet2d_backbone.py`
  - update/add a wiring test for the agent cond dims

## Test plan
1. The new backbone instantiates and forwards `(B,T,C,H,W)` multi-camera input
2. Output shape unchanged vs the current backbone
3. `output_dim == 64`
4. The 3-camera cond path still yields `208`
5. The Phase-2 config instantiates the full IMF agent successfully
6. One short CPU smoke forward for `compute_loss`

## Phase-2 experiment plan
Fix the best Phase-1 combination:
- `pred_horizon=16`
- `num_action_steps=8`

Compare:
1. baseline: the current IMF head-only AttnRes + original ResNet vision backbone
2. phase2: IMF head AttnRes + the fully AttnRes-replaced vision backbone

Keep training hyperparameters identical to the best Phase-1 settings and run one 50k-step comparison first.

# ResNet Multitoken IMF Design

**Status:** user-specified architecture, treated as approved on 2026-04-06.

## Goal
Keep a standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from one concatenated multiview token per obs step into three camera-specific condition tokens per obs step.

## Approved architecture
- Vision trunk: standard `resnet18` residual network
- Cameras: `front`, `top`, `r_vis`
- Each camera uses its **own** ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
- Each camera produces one visual token
- For each obs step and each camera:
  1. take that camera's visual token
  2. concatenate the robot state
  3. project to one condition token
- The IMF input should receive **3 condition tokens per obs step**, not one concatenated token
- With `obs_horizon=2`, the IMF cond sequence length becomes `2 * 3 = 6`
- The IMF head remains on the existing IMF/AttnRes implementation path
- The vision trunk remains standard ResNet; **no AttnRes vision replacement**

## Design choices
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
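
The broadcast-project-flatten step above can be sketched as follows; `build_multitoken_cond` and the 64/16/384 widths are illustrative assumptions, not repo code:

```python
import torch
import torch.nn as nn

def build_multitoken_cond(vis_tokens, state, proj):
    """vis_tokens: (B, T, num_cams, D_vis); state: (B, T, D_state).

    Broadcasts the state onto every camera token, projects each token to
    n_emb, and flattens (T, num_cams) into one cond sequence of T*num_cams.
    """
    B, T, C, _ = vis_tokens.shape
    state_b = state.unsqueeze(2).expand(-1, -1, C, -1)        # state per camera
    tokens = proj(torch.cat([vis_tokens, state_b], dim=-1))   # (B, T, C, n_emb)
    return tokens.flatten(1, 2)                               # (B, T*C, n_emb)

proj = nn.Linear(64 + 16, 384)   # hypothetical per-token condition projector
cond = build_multitoken_cond(torch.randn(2, 2, 3, 64),        # obs_horizon=2
                             torch.randn(2, 2, 16), proj)
```

With `obs_horizon=2` and three cameras this yields the cond sequence length of 6 described above.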

## Files expected to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/agent.py`
- a new Hydra agent config for the multitoken ResNet IMF variant
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`

# SigLIP2 Multiview VLA Design

**Status:** user-specified architecture, treated as approved on 2026-04-06

## Goal
Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.

## Approved architecture
- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
  - dataset images stay at native `256x256` (no dataset-side resize)
  - eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
  1. `768 -> 96`
  2. `768 -> 192`
- Conditioning pipeline:
  1. concatenate the 3 projected camera vectors
  2. concatenate the robot state
  3. project the concatenated condition to `384`
  4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for the requested experiments:
  - `n_emb=384`
  - `n_layer=12`
  - `pred_horizon=16`
  - `num_action_steps=8`
  - rollout count for validation: keep the currently requested behavior on this branch unless explicitly overridden later
|
|
||||||
|
|
||||||
## Design decisions
|
|
||||||
- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
|
|
||||||
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
|
|
||||||
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
|
|
||||||
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
|
|
||||||
|
|
||||||
## Files expected to change
|
|
||||||
- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
|
|
||||||
- `roboimi/vla/agent.py` for optional post-concat condition projection
|
|
||||||
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
|
|
||||||
- tests for backbone wiring and agent conditioning dims
|
|
||||||
- remote launch commands/scripts only as needed for training
|
|
||||||
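The approved conditioning pipeline above can be sketched end to end (NumPy stand-in; the per-view projector and condition projector are learned layers in the real agent, and `state_dim=9` is a hypothetical robot-state width, not a value from this design):

```python
import numpy as np

views, siglip_dim = 3, 768                        # three cameras, SigLIP2 pooler_output width
per_view_dim, state_dim, cond_dim = 96, 9, 384    # the 768 -> 96 projection experiment

pooled = np.random.randn(views, siglip_dim)       # frozen-encoder pooled feature per view
W_view = np.random.randn(siglip_dim, per_view_dim) * 0.02   # stand-in per-view projector
state = np.random.randn(state_dim)

projected = pooled @ W_view                       # (3, 96), each view projected independently

# Steps 1-2: concatenate the projected views, then the robot state.
fused = np.concatenate([projected.reshape(-1), state])      # 3*96 + 9 = 297-d

# Step 3: project the concatenated condition to 384-d for the IMF/AttnRes head.
W_cond = np.random.randn(fused.shape[0], cond_dim) * 0.02
cond = fused @ W_cond
print(cond.shape)   # (384,)
```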
**`environment.yml`** (deleted, 458 lines)

```yaml
name: roboimi
channels:
  - conda-forge
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _python_abi3_support=1.0
  - aiohappyeyeballs=2.6.1
  - aiohttp=3.13.3
  - aiosignal=1.4.0
  - alsa-lib=1.2.9
  - anyio=4.12.1
  - aom=3.5.0
  - async-timeout=5.0.1
  - attr=2.5.1
  - attrs=25.4.0
  - aws-c-auth=0.7.22
  - aws-c-cal=0.6.15
  - aws-c-common=0.9.23
  - aws-c-compression=0.2.18
  - aws-c-event-stream=0.4.2
  - aws-c-http=0.8.2
  - aws-c-io=0.14.9
  - aws-c-mqtt=0.10.4
  - aws-c-s3=0.5.10
  - aws-c-sdkutils=0.1.16
  - aws-checksums=0.1.18
  - aws-crt-cpp=0.26.12
  - aws-sdk-cpp=1.11.329
  - box2d-py=2.3.8
  - brotli=1.1.0
  - brotli-bin=1.1.0
  - brotli-python=1.1.0
  - bzip2=1.0.8
  - c-ares=1.34.6
  - ca-certificates=2026.1.4
  - cairo=1.16.0
  - certifi=2026.1.4
  - cffi=1.17.1
  - charset-normalizer=3.4.4
  - click=8.3.1
  - cloudpickle=3.0.0
  - contourpy=1.3.0
  - cpython=3.10.19
  - cuda-cudart=12.6.68
  - cuda-cudart_linux-64=12.6.68
  - cuda-nvrtc=12.6.68
  - cuda-nvtx=12.6.68
  - cuda-version=12.6
  - cudnn=8.9.7.29
  - cycler=0.12.1
  - datasets=4.0.0
  - dav1d=1.2.1
  - dbus=1.13.6
  - dill=0.3.8
  - eigen=3.4.0
  - exceptiongroup=1.3.1
  - expat=2.6.3
  - farama-notifications=0.0.4
  - filelock=3.15.4
  - fluidsynth=2.3.3
  - font-ttf-dejavu-sans-mono=2.37
  - font-ttf-inconsolata=3.000
  - font-ttf-source-code-pro=2.038
  - font-ttf-ubuntu=0.83
  - fontconfig=2.14.2
  - fonts-conda-ecosystem=1
  - fonts-conda-forge=1
  - fonttools=4.53.1
  - freetype=2.12.1
  - frozenlist=1.7.0
  - fsspec=2024.6.1
  - gettext=0.22.5
  - gettext-tools=0.22.5
  - gflags=2.2.2
  - git-lfs=3.7.1
  - glog=0.7.1
  - gmp=6.3.0
  - gmpy2=2.1.5
  - graphite2=1.3.13
  - gym=0.26.1
  - gym-box2d=0.26.1
  - gym-notices=0.0.8
  - gymnasium=0.29.1
  - h11=0.16.0
  - h2=4.3.0
  - harfbuzz=7.3.0
  - hf-xet=1.2.1
  - hpack=4.1.0
  - httpcore=1.0.9
  - httpx=0.28.1
  - huggingface_hub=1.3.5
  - hyperframe=6.1.0
  - icu=72.1
  - idna=3.11
  - jack=1.9.22
  - jax-jumpy=1.0.0
  - jinja2=3.1.4
  - jpeg=9e
  - keyutils=1.6.3
  - kiwisolver=1.4.9
  - krb5=1.21.3
  - lame=3.100
  - lcms2=2.15
  - ld_impl_linux-64=2.40
  - lerc=4.0.0
  - libabseil=20240116.2
  - libarrow=16.1.0
  - libarrow-acero=16.1.0
  - libarrow-dataset=16.1.0
  - libarrow-substrait=16.1.0
  - libasprintf=0.22.5
  - libasprintf-devel=0.22.5
  - libavif=0.11.1
  - libblas=3.9.0
  - libbrotlicommon=1.1.0
  - libbrotlidec=1.1.0
  - libbrotlienc=1.1.0
  - libcap=2.69
  - libcblas=3.9.0
  - libcrc32c=1.1.2
  - libcublas=12.6.1.4
  - libcufft=11.2.6.59
  - libcurand=10.3.7.68
  - libcurl=8.12.1
  - libcusolver=11.6.4.69
  - libcusparse=12.5.3.3
  - libdb=6.2.32
  - libdeflate=1.17
  - libedit=3.1.20250104
  - libev=4.33
  - libevent=2.1.12
  - libexpat=2.6.3
  - libffi=3.4.2
  - libflac=1.4.3
  - libgcc=14.1.0
  - libgcc-ng=14.1.0
  - libgcrypt=1.11.0
  - libgettextpo=0.22.5
  - libgettextpo-devel=0.22.5
  - libgfortran=14.1.0
  - libgfortran-ng=14.1.0
  - libgfortran5=14.1.0
  - libglib=2.80.3
  - libgoogle-cloud=2.25.0
  - libgoogle-cloud-storage=2.25.0
  - libgpg-error=1.50
  - libgrpc=1.62.2
  - libhwloc=2.9.3
  - libiconv=1.17
  - libjpeg-turbo=2.1.4
  - liblapack=3.9.0
  - libmad=0.15.1b
  - libmagma=2.8.0
  - libmagma_sparse=2.8.0
  - libnghttp2=1.67.0
  - libnsl=2.0.1
  - libnvjitlink=12.6.68
  - libogg=1.3.5
  - libopenblas=0.3.27
  - libopus=1.3.1
  - libparquet=16.1.0
  - libpng=1.6.43
  - libprotobuf=4.25.3
  - libre2-11=2023.09.01
  - libsndfile=1.2.2
  - libsqlite=3.46.0
  - libssh2=1.11.1
  - libstdcxx=14.1.0
  - libstdcxx-ng=14.1.0
  - libsystemd0=256.5
  - libthrift=0.19.0
  - libtiff=4.5.0
  - libtorch=2.4.0
  - libutf8proc=2.8.0
  - libuuid=2.38.1
  - libuv=1.48.0
  - libvorbis=1.3.7
  - libwebp-base=1.4.0
  - libxcb=1.13
  - libxcrypt=4.4.36
  - libxml2=2.11.5
  - libzlib=1.3.1
  - llvm-openmp=18.1.8
  - lz4-c=1.9.4
  - markupsafe=2.1.5
  - matplotlib-base=3.9.2
  - mkl=2023.2.0
  - mpc=1.3.1
  - mpfr=4.2.1
  - mpg123=1.31.3
  - mpmath=1.3.0
  - multidict=6.7.0
  - multiprocess=0.70.16
  - munkres=1.1.4
  - nccl=2.22.3.1
  - ncurses=6.5
  - networkx=3.3
  - numpy=1.26.4
  - openjpeg=2.5.0
  - openssl=3.6.1
  - opusfile=0.12
  - orc=2.0.1
  - orocos-kdl=1.5.1
  - packaging=24.1
  - pandas=2.2.2
  - pcre2=10.44
  - pillow=9.4.0
  - pip=24.2
  - pixman=0.43.2
  - portaudio=19.6.0
  - portmidi=2.0.4
  - propcache=0.3.1
  - pthread-stubs=0.4
  - pulseaudio-client=16.1
  - pyarrow=16.1.0
  - pyarrow-core=16.1.0
  - pybind11=2.13.5
  - pybind11-global=2.13.5
  - pycparser=2.22
  - pygame=2.1.3
  - pyparsing=3.1.4
  - pysocks=1.7.1
  - python=3.10.14
  - python-dateutil=2.9.0
  - python-gil=3.10.19
  - python-orocos-kdl=1.5.1
  - python-tzdata=2024.1
  - python-xxhash=3.6.0
  - python_abi=3.10
  - pytorch=2.4.0
  - hydra-core=1.3.2
  - omegaconf=2.3.0
  - einops=0.8.2
  - diffusers=0.36.0
  - torchvision=0.19.0
  - pytz=2024.1
  - pyyaml=6.0.3
  - qhull=2020.2
  - re2=2023.09.01
  - readline=8.2
  - regex=2026.1.15
  - requests=2.32.5
  - s2n=1.4.16
  - safetensors=0.7.0
  - sdl2=2.26.5
  - sdl2_image=2.6.3
  - sdl2_mixer=2.6.3
  - sdl2_ttf=2.20.2
  - setuptools=72.2.0
  - shellingham=1.5.4
  - six=1.16.0
  - sleef=3.6.1
  - snappy=1.2.2
  - sniffio=1.3.1
  - stable-baselines3=2.3.2
  - sympy=1.13.2
  - tbb=2021.11.0
  - tk=8.6.13
  - tokenizers=0.22.2
  - tqdm=4.67.2
  - transformers=5.0.0
  - typer-slim=0.21.1
  - typing-extensions=4.12.2
  - typing_extensions=4.12.2
  - tzdata=2024a
  - unicodedata2=15.1.0
  - urllib3=2.5.0
  - wheel=0.44.0
  - xorg-kbproto=1.0.7
  - xorg-libice=1.1.1
  - xorg-libsm=1.2.4
  - xorg-libx11=1.8.4
  - xorg-libxau=1.0.11
  - xorg-libxdmcp=1.1.3
  - xorg-libxext=1.3.4
  - xorg-libxrender=0.9.10
  - xorg-renderproto=0.11.1
  - xorg-xextproto=7.3.0
  - xorg-xproto=7.0.31
  - xxhash=0.8.3
  - xz=5.2.6
  - yaml=0.2.5
  - yarl=1.22.0
  - zlib=1.3.1
  - zstandard=0.23.0
  - zstd=1.5.6
  - pip:
      - GitPython==3.1.46
      - Jinja2==3.1.6
      - MarkupSafe==3.0.3
      - PyOpenGL==3.1.7
      - PyYAML==6.0.3
      - Pygments==2.19.2
      - absl-py==2.1.0
      - accelerate==1.12.0
      - aiofiles==24.1.0
      - aiohappyeyeballs==2.6.1
      - aiohttp==3.13.3
      - aiosignal==1.4.0
      - annotated-doc==0.0.4
      - annotated-types==0.7.0
      - antlr4-python3-runtime==4.9.3
      - anyio==4.12.1
      - asciitree==0.3.3
      - asttokens==3.0.1
      - async-timeout==5.0.1
      - attrs==25.4.0
      - av==15.1.0
      - brotli==1.2.0
      - charset-normalizer==3.4.4
      - cmake==4.1.3
      - cmeel==0.58.0
      - cmeel-assimp==5.4.3.1
      - cmeel-boost==1.87.0.1
      - cmeel-console-bridge==1.0.2.3
      - cmeel-octomap==1.10.0
      - cmeel-qhull==8.0.2.1
      - cmeel-tinyxml==2.6.2.3
      - cmeel-tinyxml2==10.0.0
      - cmeel-urdfdom==3.1.1.1
      - cmeel-zlib==1.3.1
      - coal==3.0.2
      - coal-library==3.0.1
      - colorama==0.4.6
      - datasets==4.5.0
      - decorator==5.2.1
      - deepdiff==8.6.1
      - dill==0.4.0
      - docstring_parser==0.17.0
      - draccus==0.10.0
      - eigenpy==3.10.3
      - etils==1.7.0
      - evdev==1.9.2
      - exceptiongroup==1.3.1
      - executing==2.2.1
      - fastapi==0.128.0
      - fasteners==0.20
      - ffmpy==1.0.0
      - filelock==3.20.3
      - frozenlist==1.8.0
      - fsspec==2025.10.0
      - gitdb==4.0.12
      - glfw==2.7.0
      - gradio==6.3.0
      - gradio_client==2.0.3
      - groovy==0.1.2
      - gymnasium==1.2.3
      - h11==0.16.0
      - h5py==3.15.1
      - hf-xet==1.2.0
      - hf_transfer==0.1.9
      - httpcore==1.0.9
      - httpx==0.28.1
      - huggingface_hub==1.3.2
      - imageio==2.35.1
      - imageio-ffmpeg==0.6.0
      - importlib_metadata==8.7.1
      - importlib_resources==6.5.2
      - inquirerpy==0.3.4
      - ipython==8.38.0
      - jedi==0.19.2
      - jsonargparse==4.45.0
      - jsonlines==4.0.0
      - kiwisolver==1.4.5
      - lerobot==0.4.2
      - libcoal==3.0.2
      - libpinocchio==3.8.0
      - lightning==2.5.0.post0
      - lightning-utilities==0.15.2
      - lxml==5.3.0
      - markdown-it-py==4.0.0
      - matplotlib-inline==0.2.1
      - mdurl==0.1.2
      - mergedeep==1.3.4
      - mpmath==1.3.0
      - mujoco==3.2.2
      - mujoco-python-viewer==0.1.4
      - multidict==6.7.0
      - multiprocess==0.70.18
      - mypy_extensions==1.1.0
      - networkx==3.4.2
      - numcodecs==0.13.1
      - numpy==2.2.6
      - opencv-contrib-python==4.10.0.84
      - opencv-python==4.13.0.90
      - orderly-set==5.5.0
      - orjson==3.11.5
      - packaging==24.2
      - pandas==2.3.3
      - parso==0.8.5
      - pexpect==4.9.0
      - pfzy==0.3.4
      - pillow==12.1.0
      - pin==3.3.1
      - platformdirs==4.5.1
      - prompt_toolkit==3.0.52
      - propcache==0.4.1
      - protobuf==6.33.4
      - proxsuite==0.7.2
      - psutil==7.2.1
      - ptyprocess==0.7.0
      - pure_eval==0.2.3
      - pyarrow==22.0.0
      - pydantic==2.12.5
      - pydantic_core==2.41.5
      - pydub==0.25.1
      - pynput==1.8.1
      - pyquaternion==0.9.9
      - pyserial==3.5
      - python-dateutil==2.9.0.post0
      - python-multipart==0.0.21
      - python-xlib==0.33
      - pytorch-lightning==2.6.0
      - pyyaml-include==1.4.1
      - qwen-vl-utils==0.0.14
      - regex==2026.1.15
      - requests==2.32.5
      - rerun-sdk==0.26.2
      - rich==13.9.4
      - ruckig==0.9.2
      - safehttpx==0.1.7
      - safetensors==0.7.0
      - scipy==1.14.1
      - semantic-version==2.10.0
      - sentry-sdk==2.49.0
      - shellingham==1.5.4
      - smmap==5.0.2
      - stack-data==0.6.3
      - starlette==0.50.0
      - sympy==1.13.1
      - swanlab==0.7.13
      - termcolor==3.3.0
      - timm==1.0.24
      - toml==0.10.2
      - tomli==2.4.0
      - tomlkit==0.13.3
      - torchcodec==0.5
      - torchmetrics==1.8.2
      - tqdm==4.67.1
      - traitlets==5.14.3
      - typer==0.21.1
      - typer-slim==0.21.1
      - typeshed_client==2.8.2
      - typing-inspect==0.9.0
      - typing-inspection==0.4.2
      - typing_extensions==4.15.0
      - tzdata==2025.3
      - urdf_parser_py==0.0.4
      - urllib3==2.6.3
      - uv==0.9.28
      - uvicorn==0.40.0
      - wandb==0.24.0
      - wcwidth==0.2.14
      - xxhash==3.6.0
      - yarl==1.22.0
      - zarr==2.18.3
      - zipp==3.20.1
```
# Phase-1 Final Report and Phase-2 Handoff

- Finalized: 2026-04-05 00:34:20 CST
- Scope: IMF AttnRes policy horizon/action-step grid on `sim_transfer`
- Fixed setup: `n_emb=384`, `n_layer=12`, batch size `80`, learning rate `2.5e-4`, `max_steps=50k`, rollout every 5 epochs with 5 episodes, 3 cameras `[r_vis, top, front]`.
- Main metric: the maximum training-time `rollout_avg_reward` recorded in `checkpoints/vla_model_best.pt`.

## Final leaderboard

| Rank | Run ID | pred_horizon | executed action steps | Best avg_reward | Best step | Final loss |
|---:|---|---:|---:|---:|---:|---:|
| 1 | `ph16_ex8` | 16 | 8 | **610.8** | 21874 | 0.0034 |
| 2 | `ph16_ex16` | 16 | 16 | 561.2 | 48124 | 0.0045 |
| 3 | `ph32_ex32` | 32 | 32 | 513.2 | 43749 | 0.0040 |
| 4 | `ph8_ex8` | 8 | 8 | 415.6 | 48124 | 0.0070 |
| 5 | `ph32_ex8` | 32 | 8 | 361.6 | 43749 | 0.0048 |
| 6 | `ph32_ex16` | 32 | 16 | 239.6 | 48124 | 0.0038 |

## Final conclusions

1. **The best combination is `pred_horizon=16` + `num_action_steps=8`**, with a best average reward of **610.8**, reached at **step 21874**.
2. At `pred_horizon=16`, executing 8 steps beats executing 16 steps by about **+8.8%** (610.8 vs 561.2).
3. At `pred_horizon=32`, results are very sensitive to the execution length: `32/32` clearly beats `32/8` and `32/16`, with `32/16` degrading the most.
4. A longer prediction window does not automatically bring higher reward; **the match between the prediction window and the actually executed window** is what matters.
5. The best checkpoint did not appear at the end of training but relatively early, at **step 21.9k** of the 50k run, which shows that rollout validation matters more than watching train loss alone.
6. The Phase-2 comparison baseline is therefore fixed to **`ph16_ex8`**.
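The +8.8% figure in conclusion 2 is the relative improvement of the 8-step execution over the 16-step execution at `pred_horizon=16`, which can be checked directly from the leaderboard numbers:

```python
# Best rollout_avg_reward values from the leaderboard above.
best_ph16_ex8, best_ph16_ex16 = 610.8, 561.2

rel_gain = (best_ph16_ex8 - best_ph16_ex16) / best_ph16_ex16
print(f"{rel_gain:+.1%}")   # +8.8%
```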
## Recommended baseline for follow-up experiments

- Baseline run: `ph16_ex8`
- Baseline best checkpoint: `step 21874`
- Baseline best avg_reward: `610.8`
- Baseline run dir: `/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223`

## Phase-2 target: full-AttnRes vision backbone

As requested, this phase no longer uses AttnRes only inside the IMF head; instead it **replaces every residual unit in the previous ResNet vision trunk with AttnRes residual units**. The current implementation keeps the ResNet-style stage/downsample macro structure, but the visual residual trunk now runs on AttnRes:

- implementation: `roboimi/vla/models/backbones/attnres_resnet2d.py`
- wiring: `roboimi/vla/models/backbones/resnet_diffusion.py`
- config: `roboimi/vla/conf/backbone/resnet_diffusion.yaml`

The relevant code has been committed:

- `a780068` — headless rollout fix + Phase-1 summary
- `2033169` — full-AttnRes vision backbone

## Phase-2 launch status (observed on 2026-04-05 00:36 CST)

- Run: `imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-b40-lr1p25e4-ms50k-l20g3-20260405-002424`
- Host: `100.119.99.14`, GPU `3`
- Config anchor: `pred_horizon=16`, `num_action_steps=8`
- Vision backbone: `attnres_resnet`
- Because batch size `80` OOMed on both the local 5090 and the remote L20, Phase-2 currently uses:
  - batch size: `40`
  - learning rate: `1.25e-4`
- Latest confirmed progress: **step 1300**
- The first rollout had **not happened yet** at this observation point.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/xy7fjdmn0stdr19eu3gub

## Next action

Keep monitoring the Phase-2 full-AttnRes training; once it finishes, compare it directly against the Phase-1 baseline of `610.8` to judge whether replacing the entire vision trunk with AttnRes beats using AttnRes only inside the IMF head.
**Phase-1 results CSV** (deleted)

```csv
rank,run_id,status,pred_horizon,num_action_steps,best_rollout_avg_reward,best_step,final_step,final_loss,host,run_dir,latest_step
1,ph16_ex8,running,16,8,610.8,21874,50000,0.0034315965604037046,100.73.14.65,/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223,50000
2,ph16_ex16,running,16,16,561.2,48124,50000,0.004544622730463743,100.119.99.14,/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223,50000
3,ph32_ex32,finished,32,32,513.2,43749,50000,0.003953303210437298,local,/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223,49900
4,ph8_ex8,running,8,8,415.6,48124,50000,0.007008877582848072,100.73.14.65,/home/droid/roboimi_suite_20260404/runs/imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223,50000
5,ph32_ex8,running,32,8,361.6,43749,50000,0.004788532387465239,100.119.99.14,/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223,50000
6,ph32_ex16,running,32,16,239.6,48124,50000,0.0038348555099219084,100.119.99.14,/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223,50000
```
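A results file in this shape can be ranked with the stdlib `csv` module; a small sketch (the inline text is a two-column excerpt for illustration, not the full file):

```python
import csv
import io

# Two-row, four-column excerpt of the results CSV above.
text = """rank,run_id,best_rollout_avg_reward,best_step
1,ph16_ex8,610.8,21874
2,ph16_ex16,561.2,48124
"""

rows = list(csv.DictReader(io.StringIO(text)))
# Pick the run with the highest best rollout average reward.
best = max(rows, key=lambda r: float(r["best_rollout_avg_reward"]))
print(best["run_id"], best["best_rollout_avg_reward"])   # ph16_ex8 610.8
```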
**Suite status JSON** (deleted)

```json
{
  "suite_name": "2026-04-04-imf-horizon-grid",
  "created_at": "2026-04-04 13:19:52",
  "updated_at": "2026-04-04 13:19:52",
  "phase": "phase1_launching",
  "metric": "max_avg_reward",
  "baseline": {
    "agent": "resnet_imf_attnres",
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "val_split": 0.0,
    "seed": 42,
    "scheduler_type": "cosine",
    "warmup_steps": 2000,
    "min_lr": 1e-06,
    "weight_decay": 1e-05,
    "grad_clip": 1.0,
    "inference_steps": 1,
    "embed_dim": 384,
    "n_layer": 12,
    "n_head": 1,
    "n_kv_head": 1,
    "freeze_backbone": false,
    "pretrained_backbone_weights": null,
    "camera_names": ["r_vis", "top", "front"]
  },
  "runs": [
    {
      "id": "ph8_ex8",
      "pred_horizon": 8,
      "num_action_steps": 8,
      "host": "100.73.14.65",
      "host_label": "tailnet-5880",
      "gpu": 0,
      "workdir": "/home/droid/roboimi_suite_20260404",
      "python": "/home/droid/miniforge3/envs/roboimi/bin/python",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "run_name": "imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223",
      "launch_state": "ready"
    },
    {
      "id": "ph16_ex8",
      "pred_horizon": 16,
      "num_action_steps": 8,
      "host": "100.73.14.65",
      "host_label": "tailnet-5880",
      "gpu": 1,
      "workdir": "/home/droid/roboimi_suite_20260404",
      "python": "/home/droid/miniforge3/envs/roboimi/bin/python",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "run_name": "imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223",
      "launch_state": "ready"
    },
    {
      "id": "ph16_ex16",
      "pred_horizon": 16,
      "num_action_steps": 16,
      "host": "100.119.99.14",
      "host_label": "tailnet-l20",
      "gpu": 0,
      "workdir": "/home/droid/roboimi_suite_20260404",
      "python": "/home/droid/miniforge3/envs/roboimi/bin/python",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "run_name": "imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223",
      "launch_state": "provisioning_required"
    },
    {
      "id": "ph32_ex8",
      "pred_horizon": 32,
      "num_action_steps": 8,
      "host": "100.119.99.14",
      "host_label": "tailnet-l20",
      "gpu": 1,
      "workdir": "/home/droid/roboimi_suite_20260404",
      "python": "/home/droid/miniforge3/envs/roboimi/bin/python",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "run_name": "imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223",
      "launch_state": "provisioning_required"
    },
    {
      "id": "ph32_ex16",
      "pred_horizon": 32,
      "num_action_steps": 16,
      "host": "100.119.99.14",
      "host_label": "tailnet-l20",
      "gpu": 2,
      "workdir": "/home/droid/roboimi_suite_20260404",
      "python": "/home/droid/miniforge3/envs/roboimi/bin/python",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "run_name": "imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223",
      "launch_state": "provisioning_required"
    },
    {
      "id": "ph32_ex32",
      "pred_horizon": 32,
      "num_action_steps": 32,
      "host": "local",
      "host_label": "local-5090",
      "gpu": 0,
      "workdir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy",
      "python": "/home/droid/.conda/envs/roboimi/bin/python",
      "dataset_dir": "/home/droid/project/diana_sim/sim_transfer",
      "run_name": "imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223",
      "launch_state": "ready"
    }
  ]
}
```
# IMF Horizon Grid Suite Notes

- Created: 2026-04-04 13:19:52
- Phase-1 matrix: (8,8), (16,8), (16,16), (32,8), (32,16), (32,32)
- Fixed baseline: IMF AttnRes, n_emb=384, n_layer=12, batch_size=80, lr=2.5e-4, max_steps=50k, rollout every 5 epochs with 5 episodes.
- Host allocation:
  - local RTX 5090: ph32_ex32
  - 100.73.14.65 RTX 5880 GPU0: ph8_ex8
  - 100.73.14.65 RTX 5880 GPU1: ph16_ex8
  - 100.119.99.14 L20 GPU0: ph16_ex16
  - 100.119.99.14 L20 GPU1: ph32_ex8
  - 100.119.99.14 L20 GPU2: ph32_ex16
- 100.119.99.14 still needs env + dataset + swanlab credential copy before launch.

- 2026-04-04 13:23:43: launched local ph32_ex32 (pid 1437836) and remote 100.73 ph8_ex8 (pid 931824) and ph16_ex8 (pid 931826); started the 100.119 bootstrap (local pid 1437837).
- 2026-04-04 13:25:43: first status sync — local ph32_ex32 step≈500; remote ph8_ex8 step≈400; remote ph16_ex8 step≈400.
- 2026-04-04 13:27:41: second status sync — the 100.119 bootstrap finished the env copy and entered the dataset copy; local ph32_ex32 step≈900; remote ph8_ex8 step≈800; remote ph16_ex8 step≈800.
- 2026-04-04 13:35:31: 100.119 bootstrap data/env copy finished. The original validation command hit a quoting bug, so I manually revalidated torch+mujoco+swanlab and launched ph16_ex16/ph32_ex8/ph32_ex16 with pids 81129/81130/81131.
- 2026-04-04 13:37:36: all 6 Phase-1 runs are now up. SwanLab links are recorded in status.json; latest observed steps: local ≈900, 5880 runs ≈800, L20 runs ≈100.
- 2026-04-04 14:41:08: diagnosed the remote first-rollout crash as an early mujoco import (via raw_action_trajectory_viewer) happening before MUJOCO_GL=egl was set in eval_vla.py. Added the regression test tests/test_eval_vla_headless_import.py, made the import lazy, verified a 20-step headless eval on the 5880 and the L20, then resumed the 5 failed runs from step 4374. Current resumed pids: ph8_ex8=938714, ph16_ex8=938717, ph16_ex16=90169, ph32_ex8=90173, ph32_ex16=90175.
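The headless-rollout fix described in the last note boils down to making sure the rendering-backend environment variable is set before the first `import mujoco`, since mujoco picks its GL backend at import time. A minimal sketch of such a lazy import helper (the helper name is hypothetical, not the project's actual code; it is demonstrated on a stdlib module so the snippet stays self-contained):

```python
import importlib
import os

def lazy_import_with_env(module_name, env_key, env_value):
    """Set a backend-selection env var before importing a module that reads it at import time."""
    # setdefault keeps any value the caller already exported.
    os.environ.setdefault(env_key, env_value)
    return importlib.import_module(module_name)

# In eval code this would be: mujoco = lazy_import_with_env("mujoco", "MUJOCO_GL", "egl")
# Demonstrated here on a stdlib module:
mod = lazy_import_with_env("json", "MUJOCO_GL", "egl")
print(os.environ["MUJOCO_GL"], mod.dumps({"ok": True}))
```

The crucial design point is deferral: any module-level `import mujoco` (even an indirect one through a viewer utility) runs before the env var is set and locks in the wrong backend on a headless host.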
# Phase-1 IMF Horizon Grid Summary

- Generated: 2026-04-04 23:43:38
- Fixed baseline: IMF AttnRes head, n_emb=384, n_layer=12, batch_size=80, lr=2.5e-4, max_steps=50k, rollout every 5 epochs with 5 episodes, 3 cameras `[r_vis, top, front]`.
- Primary metric: `checkpoints/vla_model_best.pt -> rollout_avg_reward` (max training-time rollout average reward).

## Ranked results

| Rank | Run ID | pred_horizon | num_action_steps | Best avg_reward | Best step | Final loss | Host |
|---:|---|---:|---:|---:|---:|---:|---|
| 1 | `ph16_ex8` | 16 | 8 | 610.8 | 21874 | 0.0034 | 100.73.14.65 |
| 2 | `ph16_ex16` | 16 | 16 | 561.2 | 48124 | 0.0045 | 100.119.99.14 |
| 3 | `ph32_ex32` | 32 | 32 | 513.2 | 43749 | 0.0040 | local |
| 4 | `ph8_ex8` | 8 | 8 | 415.6 | 48124 | 0.0070 | 100.73.14.65 |
| 5 | `ph32_ex8` | 32 | 8 | 361.6 | 43749 | 0.0048 | 100.119.99.14 |
| 6 | `ph32_ex16` | 32 | 16 | 239.6 | 48124 | 0.0038 | 100.119.99.14 |

## Main observations

- The best overall setting was **`pred_horizon=16`, `num_action_steps=8`** with **max avg_reward = 610.8** at step **21874**.
- Comparing horizon 16: executing 8 steps outperformed executing 16 steps (`ph16_ex8` > `ph16_ex16`).
- Comparing horizon 32: executing the full 32-step chunk was much better than executing 16 or 8 steps (`ph32_ex32` > `ph32_ex8` > `ph32_ex16`).
- Short horizon 8 with 8-step execution was competitive but clearly below the best 16/8 and 32/32 settings.
- In this sweep, increasing the prediction horizon helped only when the executed chunk length matched a good control cadence; a mismatch could hurt a lot (especially `ph32_ex16`).

## Raw results

- `ph16_ex8`: best avg_reward=610.8 @ step 21874, final_loss=0.0034, run_dir=`/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223`
- `ph16_ex16`: best avg_reward=561.2 @ step 48124, final_loss=0.0045, run_dir=`/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223`
- `ph32_ex32`: best avg_reward=513.2 @ step 43749, final_loss=0.0040, run_dir=`/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223`
- `ph8_ex8`: best avg_reward=415.6 @ step 48124, final_loss=0.0070, run_dir=`/home/droid/roboimi_suite_20260404/runs/imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223`
- `ph32_ex8`: best avg_reward=361.6 @ step 43749, final_loss=0.0048, run_dir=`/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223`
- `ph32_ex16`: best avg_reward=239.6 @ step 48124, final_loss=0.0038, run_dir=`/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223`

## Recommendation for Phase-2 anchor

- Use **`pred_horizon=16`, `num_action_steps=8`** as the strongest Phase-1 baseline if the goal is purely maximizing rollout reward.
- If Phase-2 needs a more conservative action-execution budget, `ph16_ex8` is the strongest non-full-32 execution setting and may still be a good comparison anchor.
@@ -1,167 +0,0 @@
{
  "suite_name": "2026-04-04-imf-horizon-grid",
  "updated_at": "2026-04-05 00:34:20",
  "phase": "phase1_completed",
  "provisioning": {
    "100.119.99.14": {
      "state": "completed_manual_launch",
      "bootstrap_pid_local": 1437837,
      "log_path": "experiment_suites/2026-04-04-imf-horizon-grid/provision_logs/100.119.99.14-bootstrap-20260404-131223.log",
      "env_copy": "completed",
      "dataset_copy": "completed",
      "launch_watcher_pid_local": null,
      "launch_watcher_log": "experiment_suites/2026-04-04-imf-horizon-grid/launch_logs/100.119.99.14-launch-watcher-20260404-131223.log",
      "swanlab_copy": "completed",
      "bootstrap_validation_note": "initial validation command had a quoting bug; manual validation passed and launches were started successfully"
    }
  },
  "runs": {
    "ph8_ex8": {
      "status": "finished",
      "host": "100.73.14.65",
      "gpu": 0,
      "run_name": "imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223",
      "workdir": "/home/droid/roboimi_suite_20260404",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223/train_vla.log",
      "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223",
      "pred_horizon": 8,
      "num_action_steps": 8,
      "pid": 938714,
      "launch_log": "experiment_suite_launch_logs/imf-p1-ph08-ex08-emb384-l12-ms50k-5880g0-20260404-131223.restartfix-20260404-143827.log",
      "latest_step": 50000,
      "latest_log_sync": "2026-04-05 00:34:20",
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/i5syc57b6zq7rbkrtqy7b",
      "process_running": false,
      "best_step": 48124,
      "best_rollout_avg_reward": 415.6,
      "final_loss": 0.007008877582848072
    },
    "ph16_ex8": {
      "status": "finished",
      "host": "100.73.14.65",
      "gpu": 1,
      "run_name": "imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223",
      "workdir": "/home/droid/roboimi_suite_20260404",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223/train_vla.log",
      "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223",
      "pred_horizon": 16,
      "num_action_steps": 8,
      "pid": 938717,
      "launch_log": "experiment_suite_launch_logs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223.restartfix-20260404-143827.log",
      "latest_step": 50000,
      "latest_log_sync": "2026-04-05 00:34:20",
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/4rusbrpfxmw4ffii1ul5w",
      "process_running": false,
      "best_step": 21874,
      "best_rollout_avg_reward": 610.8,
      "final_loss": 0.0034315965604037046
    },
    "ph16_ex16": {
      "status": "finished",
      "host": "100.119.99.14",
      "gpu": 0,
      "run_name": "imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223",
      "workdir": "/home/droid/roboimi_suite_20260404",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223/train_vla.log",
      "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223",
      "pred_horizon": 16,
      "num_action_steps": 16,
      "pid": 90169,
      "launch_log": "experiment_suite_launch_logs/imf-p1-ph16-ex16-emb384-l12-ms50k-l20g0-20260404-131223.restartfix-20260404-143827.log",
      "latest_log_sync": "2026-04-05 00:34:20",
      "latest_step": 50000,
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/wwm232k6190gexnze8mg6",
      "process_running": false,
      "best_step": 48124,
      "best_rollout_avg_reward": 561.2,
      "final_loss": 0.004544622730463743
    },
    "ph32_ex8": {
      "status": "finished",
      "host": "100.119.99.14",
      "gpu": 1,
      "run_name": "imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223",
      "workdir": "/home/droid/roboimi_suite_20260404",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223/train_vla.log",
      "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223",
      "pred_horizon": 32,
      "num_action_steps": 8,
      "pid": 90173,
      "launch_log": "experiment_suite_launch_logs/imf-p1-ph32-ex08-emb384-l12-ms50k-l20g1-20260404-131223.restartfix-20260404-143827.log",
      "latest_log_sync": "2026-04-05 00:34:20",
      "latest_step": 50000,
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/o5y2xjb2rsb3lmfcuhy4p",
      "process_running": false,
      "best_step": 43749,
      "best_rollout_avg_reward": 361.6,
      "final_loss": 0.004788532387465239
    },
    "ph32_ex16": {
      "status": "finished",
      "host": "100.119.99.14",
      "gpu": 2,
      "run_name": "imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223",
      "workdir": "/home/droid/roboimi_suite_20260404",
      "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
      "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223/train_vla.log",
      "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223",
      "pred_horizon": 32,
      "num_action_steps": 16,
      "pid": 90175,
      "launch_log": "experiment_suite_launch_logs/imf-p1-ph32-ex16-emb384-l12-ms50k-l20g2-20260404-131223.restartfix-20260404-143827.log",
      "latest_log_sync": "2026-04-05 00:34:20",
      "latest_step": 50000,
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/54cjpgba9eqsopdm0l8d3",
      "process_running": false,
      "best_step": 48124,
      "best_rollout_avg_reward": 239.6,
      "final_loss": 0.0038348555099219084
    },
    "ph32_ex32": {
      "status": "finished",
      "host": "local",
      "gpu": 0,
      "run_name": "imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223",
      "workdir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy",
      "dataset_dir": "/home/droid/project/diana_sim/sim_transfer",
      "log_path": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223/train_vla.log",
      "run_dir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223",
      "pred_horizon": 32,
      "num_action_steps": 32,
      "pid": 1437836,
      "launch_log": "experiment_suites/2026-04-04-imf-horizon-grid/launch_logs/imf-p1-ph32-ex32-emb384-l12-ms50k-5090-20260404-131223.launch.log",
      "latest_step": 49900,
      "latest_log_sync": "2026-04-05 00:34:20",
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/ajs2m218jd260hawhy5ns",
      "process_running": false,
      "latest_rollout_avg_reward": 513.2,
      "best_rollout_avg_reward": 513.2,
      "best_step": 43749,
      "final_loss": 0.003953303210437298
    }
  },
  "monitor": {
    "state": "stopped",
    "pid_local": null,
    "log_path": "experiment_suites/2026-04-04-imf-horizon-grid/monitor_logs/status-sync-20260404-131223.log",
    "interval_seconds": 300,
    "stopped_at": "2026-04-05 00:34:20",
    "stop_reason": "phase1 suite finalized after all six runs completed"
  },
  "debug": {
    "remote_rollout_failure_20260404": {
      "root_cause": "eval_vla.py imported raw_action_trajectory_viewer at module import time, which imported mujoco before MUJOCO_GL=egl was set; remote headless rollout then fell back to GLFW/X11 and crashed with mujoco.FatalError: gladLoadGL error during env.reset()->mj.Renderer(...)",
      "fixed_file": "roboimi/demos/vla_scripts/eval_vla.py",
      "verification": {
        "pytest": "tests/test_eval_vla_headless_import.py passed",
        "remote_eval_5880": "1 episode x 20 steps headless eval passed",
        "remote_eval_l20": "1 episode x 20 steps headless eval passed"
      }
    }
  },
  "phase1_summary_md": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/experiment_suites/2026-04-04-imf-horizon-grid/phase1_summary.md"
}
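The `remote_rollout_failure_20260404` debug entry boils down to import order: `MUJOCO_GL` must be set before anything imports `mujoco`. A minimal sketch of that fix pattern; the deferred module name follows the root-cause note and is illustrative, not verified against the repo:

```python
import os

# Must run before the first `import mujoco` anywhere in the process; otherwise
# a headless host falls back to GLFW/X11 and rendering crashes with gladLoadGL.
os.environ.setdefault("MUJOCO_GL", "egl")

def get_trajectory_viewer():
    # Deferred import: pulling this in at module scope would import mujoco
    # before the environment variable above is set.
    import raw_action_trajectory_viewer  # hypothetical module from the note
    return raw_action_trajectory_viewer
```

Keeping viewer imports inside the function is what lets `eval_vla.py` be imported safely on a headless host.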
@@ -1,69 +0,0 @@
# Camera Ablation Summary (`pred_horizon=16`, `num_action_steps=8`, ResNet IMF)

- Generated: 2026-04-05
- Common setup: original ResNet vision backbone, `n_emb=384`, `n_layer=12`, `batch_size=80`, `lr=2.5e-4`, `max_steps=50k`, rollout every 5 epochs with 5 episodes, headless eval.
- Metric for comparison: `checkpoints/vla_model_best.pt -> rollout_avg_reward`.

## Leaderboard

| Rank | Cameras | Best avg_reward | Best step | Final loss | Run name |
|---:|---|---:|---:|---:|---|
| 1 | `top + front` | **274.8** | 48124 | 0.0056 | `imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023` |
| 2 | `top` | **271.2** | 43749 | 0.0052 | `imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844` |
| 3 | `r_vis + front` | **244.0** | 21874 | 0.0043 | `imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029` |
| 4 | `r_vis` | **6.4** | 17499 | 0.0047 | `imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844` |
| 5 | `r_vis + top` | **1.2** | 4374 | 0.0047 | `imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844` |
| 6 | `front` | **0.0** | 4374 | 0.0074 | `imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607` |

## Main takeaways

1. **`top` is the single most important camera view**: `top only = 271.2`, almost matching `top + front = 274.8`.
2. **`front` alone is nearly useless**: `front only = 0.0`.
3. **`r_vis` alone is also largely ineffective**: `r_vis only = 6.4`.
4. **`r_vis + front` clearly beats `front` / `r_vis` alone**, showing the two views are somewhat complementary, but it is still well below any healthy configuration that includes `top`.
5. **`r_vis + top` is anomalously bad**: only `1.2`, far below `top only = 271.2`. Simply adding `r_vis` does not guarantee a gain and can even break learning under the current setup.
6. **Training loss and rollout reward clearly disagree**: e.g. `r_vis + top` and `r_vis only` both reach low final loss yet very poor reward, so model selection for this batch must use rollout reward, not loss.

## Horizontal comparison views

### Single-camera comparison

- `top`: **271.2**
- `r_vis`: **6.4**
- `front`: **0.0**

Conclusion: **`top >>> r_vis > front`**.

### Two-camera comparison

- `top + front`: **274.8**
- `r_vis + front`: **244.0**
- `r_vis + top`: **1.2**

Conclusions:

- **The safest two-camera combination is `top + front`**.
- `r_vis + front` works, but is worse than `top + front`.
- `r_vis + top` essentially fails under the current setup.

### Incremental effect of adding a second view

- Adding `front` on top of `top`: `271.2 -> 274.8`, **a very small gain**.
- Adding `r_vis` on top of `front`: `0.0 -> 244.0`, **a very large gain**.
- Adding `r_vis` on top of `top`: `271.2 -> 1.2`, **severe degradation**.

## Practical recommendation

Choosing only among these 6 experiments:

- **First choice**: `top + front`
- **Second choice**: `top only`
- If `top` must be excluded: `r_vis + front` is clearly better than `front only` / `r_vis only`
- **Not recommended**: `r_vis + top`

## Note relative to previous 3-camera baseline

The previous 3-camera `[r_vis, top, front]` best reward was **610.8**.
The best result of this 6-run camera ablation (`top + front = 274.8`) therefore shows:

- In this training batch, **dropping any single view falls well short of the earlier 3-camera optimum**;
- But when views must be dropped, **`top` remains the most important one to keep**.
@@ -1,8 +0,0 @@
# CHECKLIST

- [x] Confirm remote free GPU
- [x] Create front-only run contract
- [x] Remote smoke test passes
- [x] Launch 50k run on remote GPU0
- [x] Record pid / log / SwanLab
- [x] Report status back to user
@@ -1,28 +0,0 @@
# PLAN

## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only the `front` camera as image conditioning.

## Fixed comparison contract
- Same as the active `top/front` run except image input is reduced to `[front]`
- Agent: `resnet_imf_attnres`
- Vision backbone mode: `resnet`
- `pred_horizon=16`, `num_action_steps=8`
- `n_emb=384`, `n_layer=12`, `n_head=1`, `n_kv_head=1`
- `inference_steps=1`
- `batch_size=80`, `lr=2.5e-4`, cosine, warmup=2000
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[front]` only
- rollout every 5 epochs with 5 episodes, headless

## Resource plan
- Host: `100.119.99.14`
- GPU: `0`

## Important dimension override
- Single-camera visual cond dim = `64 + 16 = 80`, so override `agent.head.cond_dim=80` and `agent.num_cams=1`.

## Execution path
1. 2-step smoke test on remote GPU0.
2. If smoke passes, launch 50k main run with SwanLab.
3. Record pid / run_dir / log / URL locally.
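The dimension-override rule used by the plans in this suite (64 visual features per camera plus a shared 16-dim state term) can be captured in a tiny helper. The feature sizes come from the plans themselves; the function name is made up:

```python
def visual_cond_dim(num_cams, per_cam_feat=64, state_feat=16):
    # each camera contributes `per_cam_feat` dims; the 16-dim state/proprio
    # term is shared across cameras, so it is added once
    return per_cam_feat * num_cams + state_feat

# matches the overrides recorded in this suite:
#   1 camera  -> agent.head.cond_dim=80
#   2 cameras -> agent.head.cond_dim=144
single_cam = visual_cond_dim(1)
two_cam = visual_cond_dim(2)
```

Deriving `cond_dim` from `num_cams` this way avoids the two overrides silently drifting apart.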
@@ -1,6 +0,0 @@
# Notes

- 2026-04-05 09:55:27: remote 2-step smoke passed on `100.119.99.14` GPU0 with `front` only, batch=80, no OOM.
- 2026-04-05 09:56:26: launched main run `imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607`.
- 2026-04-05 09:57:36: confirmed training is stable through step 200, latest loss 0.2830.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/7kdii8oc6tjkcyu5y0lwq
@@ -1,51 +0,0 @@
{
  "suite_name": "2026-04-05-front-only-resnet-1cam",
  "updated_at": "2026-04-05 09:57:36",
  "phase": "running",
  "baseline_reference": {
    "source_run": "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
    "notes": "Same hyperparameters as the active top/front run, but image input is reduced to [front] only."
  },
  "smoke_test": {
    "status": "passed",
    "host": "100.119.99.14",
    "gpu": 0,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-frontonly-resnet-ph16-ex08-20260405-095509",
    "batch_size": 80,
    "max_steps": 2,
    "note": "2-step remote CUDA smoke passed on L20 GPU0 without OOM."
  },
  "main_run": {
    "status": "running",
    "host": "100.119.99.14",
    "gpu": 0,
    "launch_pid": 158874,
    "pid": 158877,
    "run_name": "imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607",
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607",
    "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607/train_vla.log",
    "launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607.launch.log",
    "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
    "camera_names": ["front"],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "head_cond_dim": 80,
    "head_n_emb": 384,
    "head_n_layer": 12,
    "vision_backbone_mode": "resnet",
    "pretrained_backbone_weights": null,
    "freeze_backbone": false,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/7kdii8oc6tjkcyu5y0lwq",
    "latest_step": 200,
    "latest_loss": 0.283,
    "process_running": true
  }
}
@@ -1,8 +0,0 @@
# CHECKLIST

- [x] Confirm camera mapping (`right` -> `r_vis`)
- [x] Create front+r_vis run contract
- [x] Remote smoke test passes
- [x] Launch 50k run on remote GPU1
- [x] Record pid / log / SwanLab
- [x] Report status back to user
@@ -1,23 +0,0 @@
# PLAN

## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using `front` + `r_vis` cameras only.

## Fixed comparison contract
- Same hyperparameters as the active top/front and front-only runs
- Agent: `resnet_imf_attnres`
- Vision backbone mode: `resnet`
- `pred_horizon=16`, `num_action_steps=8`
- `n_emb=384`, `n_layer=12`, `n_head=1`, `n_kv_head=1`
- `inference_steps=1`
- `batch_size=80`, `lr=2.5e-4`, cosine warmup 2000
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[r_vis, front]`
- rollout every 5 epochs with 5 episodes, headless

## Important dimension override
- Two-camera visual cond dim = `64*2 + 16 = 144`, so set `agent.num_cams=2`, `agent.head.cond_dim=144`.

## Resource plan
- Host: `100.119.99.14`
- GPU: `1`
@@ -1,6 +0,0 @@
# Notes

- 2026-04-05 10:20:09: remote 2-step smoke passed on `100.119.99.14` GPU1 with `r_vis + front`, batch=80, no OOM.
- 2026-04-05 10:20:49: launched main run `imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029`.
- 2026-04-05 10:22:03: confirmed training is stable through step 200, latest loss 0.3321.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/3fyzjfdcbiq7frtbqv6ss
@@ -1,55 +0,0 @@
{
  "suite_name": "2026-04-05-front-rvis-resnet-2cam",
  "updated_at": "2026-04-05 10:22:03",
  "phase": "running",
  "interpretation": {
    "right_camera_name": "r_vis"
  },
  "baseline_reference": {
    "source_run": "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
    "notes": "Same hyperparameters as the active top/front run, replacing top with r_vis."
  },
  "smoke_test": {
    "status": "passed",
    "host": "100.119.99.14",
    "gpu": 1,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-frontrvis-resnet-ph16-ex08-20260405-102001",
    "batch_size": 80,
    "max_steps": 2,
    "note": "2-step remote CUDA smoke passed on L20 GPU1 without OOM."
  },
  "main_run": {
    "status": "running",
    "host": "100.119.99.14",
    "gpu": 1,
    "launch_pid": 159910,
    "pid": 159913,
    "run_name": "imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029",
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029",
    "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029/train_vla.log",
    "launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029.launch.log",
    "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
    "camera_names": ["r_vis", "front"],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "head_cond_dim": 144,
    "head_n_emb": 384,
    "head_n_layer": 12,
    "vision_backbone_mode": "resnet",
    "pretrained_backbone_weights": null,
    "freeze_backbone": false,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/3fyzjfdcbiq7frtbqv6ss",
    "latest_step": 200,
    "latest_loss": 0.3321,
    "process_running": true
  }
}
@@ -1,20 +0,0 @@
{
  "suite_name": "2026-04-05-full-attnres-vision-phase2",
  "created_at": "2026-04-05 00:12:14",
  "phase": "phase2_running",
  "baseline_reference": {
    "run_id": "ph16_ex8",
    "best_rollout_avg_reward": 610.8,
    "best_step": 21874,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223"
  },
  "candidate": {
    "run_name": "imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-ms50k-20260405-001214",
    "host": "local",
    "gpu": 0,
    "pred_horizon": 16,
    "num_action_steps": 8,
    "vision_backbone_mode": "attnres_resnet",
    "notes": "Full-AttnRes vision backbone replacing ResNet residual units; IMF head unchanged."
  }
}
@@ -1,9 +0,0 @@
# Full-AttnRes Vision Phase-2

- Created: 2026-04-05 00:12:14
- Baseline reference: ph16_ex8 best avg_reward=610.8
- Candidate run: imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-ms50k-20260405-001214
- 2026-04-05 00:23:03: batch=80 OOM on both 5090 and L20; using validated fallback batch=40, lr=1.25e-4 on remote L20 GPU3.
- 2026-04-05 00:24:24: launching candidate imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-b40-lr1p25e4-ms50k-l20g3-20260405-002424 on 100.119.99.14 GPU3 with batch=40 lr=1.25e-4.
- 2026-04-05 00:27:17: remote phase2 run is active on 100.119.99.14 GPU3, validated at least to step 200. SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/xy7fjdmn0stdr19eu3gub
- 2026-04-05 00:36:54: latest confirmed progress is step 1300 on 100.119.99.14 GPU3; first rollout not reached yet.
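The OOM fallback above halves the batch and scales the learning rate linearly with it (linear scaling is the rule the note itself states). A one-line sketch of that rule:

```python
def scale_lr_with_batch(base_lr, base_batch, new_batch):
    # linear learning-rate scaling: lr is kept proportional to batch size
    return base_lr * new_batch / base_batch

# batch 80 -> 40 with base lr 2.5e-4 gives the 1.25e-4 used for the fallback run
fallback_lr = scale_lr_with_batch(2.5e-4, 80, 40)
```

This keeps the per-sample gradient contribution roughly comparable between the batch=80 baselines and the batch=40 fallback.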
@@ -1,32 +0,0 @@
{
  "suite_name": "2026-04-05-full-attnres-vision-phase2",
  "updated_at": "2026-04-05 00:36:54",
  "phase": "phase2_running",
  "baseline_reference": {
    "run_id": "ph16_ex8",
    "best_rollout_avg_reward": 610.8,
    "best_step": 21874,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223"
  },
  "candidate": {
    "run_name": "imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-b40-lr1p25e4-ms50k-l20g3-20260405-002424",
    "host": "100.119.99.14",
    "gpu": 3,
    "pred_horizon": 16,
    "num_action_steps": 8,
    "vision_backbone_mode": "attnres_resnet",
    "notes": "Full-AttnRes vision backbone replacing ResNet residual units; IMF head unchanged.",
    "status": "running",
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-b40-lr1p25e4-ms50k-l20g3-20260405-002424",
    "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-b40-lr1p25e4-ms50k-l20g3-20260405-002424/train_vla.log",
    "pid": 151187,
    "batch_size": 40,
    "lr": 0.000125,
    "num_workers": 12,
    "launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-p2-full-attnres-vision-ph16-ex08-emb384-l12-b40-lr1p25e4-ms50k-l20g3-20260405-002424.launch.log",
    "note": "Local 5090 and remote L20 both OOM at batch=80; switched to batch=40 and linearly scaled lr to 1.25e-4 after smoke validation on L20.",
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/xy7fjdmn0stdr19eu3gub",
    "latest_step": 1300,
    "latest_log_sync": "2026-04-05 00:36:54"
  }
}
@@ -1,73 +0,0 @@
{
  "date": "2026-04-06",
  "branch": "feat-imf-attnres-policy",
  "worktree": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy",
  "model": "LEWM ViT frozen visual encoder + IMF AttnRes diffusion head",
  "checkpoint_path": "/home/droid/le-wm/lewm-sim-transfer/pa1w85md8jop6bvol8oxp/checkpoints/epoch=99-step=47800.ckpt",
  "visual_contract": {
    "input_camera_names": ["r_vis", "top", "front"],
    "fused_camera_names": ["front", "top", "r_vis"],
    "joint_output_dim": 192,
    "freeze_backbone": true,
    "dataset_image_resize_shape": null,
    "eval_image_resize_shape": [256, 256],
    "fused_short_side_resize": 224
  },
  "training_contract": {
    "pred_horizon": 16,
    "num_action_steps": 8,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 10,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "scheduler_type": "cosine",
    "warmup_steps": 2000,
    "min_lr": 1e-06,
    "weight_decay": 1e-05,
    "grad_clip": 1.0
  },
  "verification": {
    "local_tests": "38 passed",
    "remote_dataset_shape": [2, 3, 256, 256],
    "remote_eval_prepared_shape": [3, 256, 256],
    "remote_smoke_run": {
      "run_name": "smoke-lewm-imf-rawpath-emb384-20260406-002002",
      "result": "passed",
      "details": "2-step train + checkpoint-triggered 1-episode headless rollout succeeded with corrected raw256 path"
    }
  },
  "superseded_runs": [
    {
      "run_name": "lewm-vit-imf-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260405-201914",
      "reason": "stopped due to incorrect early per-camera 224 resize"
    },
    {
      "run_name": "lewm-vit-imf-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260405-201914",
      "reason": "stopped due to incorrect early per-camera 224 resize"
    }
  ],
  "full_runs": [
    {
      "host": "100.73.14.65",
      "gpu": 0,
      "run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124",
      "pid": 1058589,
      "log_path": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124.launch.log",
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/y5tzgqe0u966w9ak41i31",
      "head_n_emb": 384,
      "head_n_layer": 12
    },
    {
      "host": "100.73.14.65",
      "gpu": 1,
      "run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124",
      "pid": 1058590,
      "log_path": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124.launch.log",
      "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/2esr9y7t2dgesstgrn5i6",
      "head_n_emb": 256,
      "head_n_layer": 12
    }
  ]
}
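The `training_contract` above names a cosine schedule with `warmup_steps=2000` and `min_lr=1e-6`. Assuming the common linear-warmup-then-cosine-decay form (the repo's exact scheduler is not shown here), the learning rate at a given step would be:

```python
import math

def lr_at(step, base_lr=2.5e-4, warmup_steps=2000, max_steps=50000, min_lr=1e-6):
    # linear warmup from 0 to base_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    t = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Defaults mirror the contract fields (`lr=0.00025`, `max_steps=50000`), so `lr_at(2000)` is the peak and `lr_at(50000)` bottoms out at `min_lr`.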
@@ -1,25 +0,0 @@
# 2026-04-06 LEWM ViT Transfer Notes

## Root-cause fix

The first LEWM runs were stopped because the data path still resized each camera view to `224x224` **before** multiview fusion. That preserved the final tensor shape but broke the original LEWM geometry.

The corrected path is now:

- **Training dataset**: keep stored per-view `256x256` images (`data.image_resize_shape=null` at launch; dataset instantiate override is `None` for LEWM)
- **Eval rollout input**: resize live MuJoCo `480x640` camera images to `256x256` per view
- **Backbone**: fuse `front, top, r_vis` on the LEWM axis, then resize the fused short side to `224`

## Verification

- Local tests passed (`38 passed` across the focused suite)
- Remote check:
  - dataset sample image shape: `(2, 3, 256, 256)`
  - eval-prepared live frame shape: `(3, 256, 256)`
- Remote smoke passed with real checkpoint:
  - `smoke-lewm-imf-rawpath-emb384-20260406-002002`

## Current runs

- `lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124`
- `lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124`
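The corrected image path described in these notes can be sketched end to end. The nearest-neighbor resize and the width-axis fusion are simplifying assumptions for illustration; the real pipeline's interpolation and fusion axis may differ:

```python
import numpy as np

def resize_hw(img, h, w):
    # nearest-neighbor resize; stand-in for the real interpolation
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[ys][:, xs]

def prepare_lewm_input(live_frames):
    """live_frames: (480, 640, 3) MuJoCo images in [front, top, r_vis] order.
    Corrected path: per-view resize to 256x256 first, fuse the views, then
    resize the fused image so its short side is 224."""
    views = [resize_hw(f, 256, 256) for f in live_frames]
    fused = np.concatenate(views, axis=1)  # assumed width-axis fusion
    h, w, _ = fused.shape
    scale = 224 / min(h, w)
    return resize_hw(fused, round(h * scale), round(w * scale))

frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
fused = prepare_lewm_input(frames)  # short side is now 224
```

The key ordering fix is that the `224` resize happens only after fusion; resizing each view to 224 first is exactly the mistake that superseded the earlier runs.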
@@ -1,19 +0,0 @@
{
  "status": "running",
  "updated_at": "2026-04-06T00:22:10+08:00",
  "remote_host": "100.73.14.65",
  "runs": [
    {
      "run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb384-l12-ph16-ex08-step50k-roll10-5880g0-20260406-002124",
      "pid": 1058589,
      "gpu": 0,
      "state": "running"
    },
    {
      "run_name": "lewm-vit-imf-raw256fix-sim-transfer-emb256-l12-ph16-ex08-step50k-roll10-5880g1-20260406-002124",
      "pid": 1058590,
      "gpu": 1,
      "state": "running"
    }
  ]
}
@@ -1,7 +0,0 @@
# CHECKLIST

- [x] Create run contract
- [x] Remote smoke test passes
- [x] Launch 50k main run
- [x] Record pid / log / SwanLab
- [x] Report status back to user
@@ -1,12 +0,0 @@
# PLAN

## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using `r_vis` as the only image conditioning.

## Fixed comparison contract
- same hyperparameters as the active top/front run
- cameras: ['r_vis']
- num_cams=1
- head.cond_dim=80
- host: 100.119.99.14
- gpu: 3
@@ -1,6 +0,0 @@
# Notes

- 2026-04-05 12:58:22: smoke passed for ['r_vis'] on 100.119.99.14 GPU3.
- 2026-04-05 12:59:24: launched main run `imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844`.
- 2026-04-05 13:01:20: latest confirmed progress step=400, loss=0.1165.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/qnuh7vln9mqomxxldyecq
@@ -1,47 +0,0 @@
{
  "suite_name": "2026-04-05-rvis-only-resnet-1cam",
  "updated_at": "2026-04-05 13:01:20",
  "phase": "running",
  "smoke_test": {
    "status": "passed",
    "host": "100.119.99.14",
    "gpu": 3,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-rvisonly-resnet-ph16-ex08-20260405-125812",
    "batch_size": 80,
    "max_steps": 2,
    "note": "2-step remote CUDA smoke passed without OOM."
  },
  "main_run": {
    "status": "running",
    "host": "100.119.99.14",
    "gpu": 3,
    "launch_pid": 164812,
    "pid": 164816,
    "run_name": "imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844",
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844",
    "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844/train_vla.log",
    "launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844.launch.log",
    "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
    "camera_names": ["r_vis"],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "head_cond_dim": 80,
    "head_n_emb": 384,
    "head_n_layer": 12,
    "vision_backbone_mode": "resnet",
    "pretrained_backbone_weights": null,
    "freeze_backbone": false,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/qnuh7vln9mqomxxldyecq",
    "latest_step": 400,
    "latest_loss": 0.1165,
    "process_running": true
  }
}
@@ -1,7 +0,0 @@
# CHECKLIST

- [x] Create run contract
- [x] Remote smoke test passes
- [x] Launch 50k main run
- [x] Record pid / log / SwanLab
- [x] Report status back to user
@@ -1,12 +0,0 @@
# PLAN

## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only r_vis + top as image conditioning.

## Fixed comparison contract
- same hyperparameters as the active top/front run
- cameras: ['r_vis', 'top']
- num_cams=2
- head.cond_dim=144
- host: 100.119.99.14
- gpu: 2
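The one- and two-camera contracts pair num_cams=1 with head.cond_dim=80 and num_cams=2 with head.cond_dim=144. Both data points are consistent with a per-camera feature width of 64 plus a 16-dim state term; the sketch below encodes that relation, but note the 64/16 split is inferred from these two pairs, not taken from the training code:

```python
def head_cond_dim(num_cams: int, per_cam_dim: int = 64, state_dim: int = 16) -> int:
    """Conditioning width: one per-camera feature block plus a state term.

    The per_cam_dim=64 / state_dim=16 split is an inference from the
    (num_cams, cond_dim) pairs in the contracts above, not a confirmed value.
    """
    return state_dim + per_cam_dim * num_cams

print(head_cond_dim(1))  # -> 80
print(head_cond_dim(2))  # -> 144
```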
@@ -1,6 +0,0 @@
# Notes

- 2026-04-05 12:58:22: smoke passed for ['r_vis', 'top'] on 100.119.99.14 GPU2.
- 2026-04-05 12:59:24: launched main run `imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844`.
- 2026-04-05 13:01:20: latest confirmed progress step=200, loss=0.2845.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/umsm6402eb81et7wx7z4a
@@ -1,48 +0,0 @@
{
  "suite_name": "2026-04-05-rvistop-resnet-2cam",
  "updated_at": "2026-04-05 13:01:20",
  "phase": "running",
  "smoke_test": {
    "status": "passed",
    "host": "100.119.99.14",
    "gpu": 2,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-rvistop-resnet-ph16-ex08-20260405-125812",
    "batch_size": 80,
    "max_steps": 2,
    "note": "2-step remote CUDA smoke passed without OOM."
  },
  "main_run": {
    "status": "running",
    "host": "100.119.99.14",
    "gpu": 2,
    "launch_pid": 164745,
    "pid": 164749,
    "run_name": "imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844",
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844",
    "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844/train_vla.log",
    "launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844.launch.log",
    "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
    "camera_names": [
      "r_vis",
      "top"
    ],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "head_cond_dim": 144,
    "head_n_emb": 384,
    "head_n_layer": 12,
    "vision_backbone_mode": "resnet",
    "pretrained_backbone_weights": null,
    "freeze_backbone": false,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/umsm6402eb81et7wx7z4a",
    "latest_step": 200,
    "latest_loss": 0.2845,
    "process_running": true
  }
}
@@ -1,8 +0,0 @@
# CHECKLIST

- [x] Confirm baseline hyperparameters from trusted prior run
- [x] Confirm local GPU availability
- [x] Smoke test with `top/front` cameras only
- [x] Launch 50k run
- [x] Record pid / run dir / log path / SwanLab URL
- [x] Report status back to user
@@ -1,30 +0,0 @@
# PLAN

## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone (no full-AttnRes vision replacement), using only `top` and `front` cameras as image conditioning.

## Fixed comparison contract
- Agent: `resnet_imf_attnres`
- Vision backbone mode: `resnet`
- `pred_horizon=16`
- `num_action_steps=8`
- `n_emb=384`, `n_layer=12`, `n_head=1`, `n_kv_head=1`
- `inference_steps=1`
- `batch_size=80`, `lr=2.5e-4`, cosine scheduler, warmup 2000
- dataset: `/home/droid/project/diana_sim/sim_transfer`
- cameras: `[top, front]` only
- training budget: `max_steps=50000`
- rollout validation: every 5 epochs, 5 episodes, headless

## Resource plan
- Host: local
- GPU: RTX 5090 (GPU 0)

## Execution path
1. Run a short 2-step smoke test on GPU with the exact 2-camera config.
2. If smoke passes, launch the 50k main run with durable log redirection.
3. Record run name, pid, log path, and SwanLab URL into suite status.

## Fallbacks
- If batch 80 OOMs, fall back to batch 64 with scaled lr 2.0e-4.
- If dataloader startup is unstable, reduce num_workers from 12 to 8.
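The OOM fallback above follows linear learning-rate scaling with batch size: 2.5e-4 × 64/80 = 2.0e-4. A minimal sketch of that rule, assuming linear scaling is the intended policy:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linearly rescale a learning rate when the batch size changes."""
    return base_lr * new_batch / base_batch

# Fallback from the plan: batch 80 -> 64 at base lr 2.5e-4 gives ~2.0e-4
print(scaled_lr(2.5e-4, 80, 64))
```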
@@ -1,5 +0,0 @@
# Notes

- 2026-04-05 08:50:04: 2-step smoke test passed locally on RTX 5090 with `top/front` cameras, batch=80, no OOM.
- 2026-04-05 08:50:42: launched main run `imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023` on local GPU0.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/vi77mn5dwd19z4nttxab8
@@ -1,51 +0,0 @@
{
  "suite_name": "2026-04-05-top-front-resnet-2cam",
  "updated_at": "2026-04-05 08:52:12",
  "phase": "running",
  "baseline_reference": {
    "source_run": "imf-p1-ph16-ex08-emb384-l12-ms50k-5880g1-20260404-131223",
    "best_rollout_avg_reward": 610.8,
    "best_step": 21874,
    "notes": "Same IMF baseline as Phase-1 best, but switch cameras from [r_vis, top, front] to [top, front] and keep the original ResNet vision backbone."
  },
  "smoke_test": {
    "status": "passed",
    "run_dir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/smoke-topfront-resnet-ph16-ex08-20260405-085000",
    "batch_size": 80,
    "num_workers": 4,
    "max_steps": 2,
    "note": "2-step local CUDA smoke passed without OOM using top/front only."
  },
  "main_run": {
    "status": "running",
    "host": "local",
    "gpu": 0,
    "pid": 1693348,
    "run_name": "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
    "run_dir": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023",
    "log_path": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/runs/imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023/train_vla.log",
    "launch_log": "/home/droid/project/roboimi/.worktrees/feat-imf-attnres-policy/experiment_suites/2026-04-05-top-front-resnet-2cam/launch_logs/imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023.launch.log",
    "dataset_dir": "/home/droid/project/diana_sim/sim_transfer",
    "camera_names": [
      "top",
      "front"
    ],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "head_n_emb": 384,
    "head_n_layer": 12,
    "vision_backbone_mode": "resnet",
    "pretrained_backbone_weights": null,
    "freeze_backbone": false,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/vi77mn5dwd19z4nttxab8",
    "latest_step": 500,
    "latest_loss": 0.0978,
    "process_running": true
  }
}
@@ -1,7 +0,0 @@
# CHECKLIST

- [x] Create run contract
- [x] Remote smoke test passes
- [x] Launch 50k main run
- [x] Record pid / log / SwanLab
- [x] Report status back to user
@@ -1,12 +0,0 @@
# PLAN

## Goal
Train a 50k-step IMF baseline with the original ResNet vision backbone, using only top as image conditioning.

## Fixed comparison contract
- same hyperparameters as the active top/front run
- cameras: ['top']
- num_cams=1
- head.cond_dim=80
- host: 100.119.99.14
- gpu: 4
@@ -1,6 +0,0 @@
# Notes

- 2026-04-05 12:58:22: smoke passed for ['top'] on 100.119.99.14 GPU4.
- 2026-04-05 12:59:24: launched main run `imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844`.
- 2026-04-05 13:01:20: latest confirmed progress step=400, loss=0.1233.
- SwanLab: https://swanlab.cn/@game-loader/roboimi-vla/runs/egzo29l3z9ftsaunhf025
@@ -1,47 +0,0 @@
{
  "suite_name": "2026-04-05-top-only-resnet-1cam",
  "updated_at": "2026-04-05 13:01:20",
  "phase": "running",
  "smoke_test": {
    "status": "passed",
    "host": "100.119.99.14",
    "gpu": 4,
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/smoke-toponly-resnet-ph16-ex08-20260405-125812",
    "batch_size": 80,
    "max_steps": 2,
    "note": "2-step remote CUDA smoke passed without OOM."
  },
  "main_run": {
    "status": "running",
    "host": "100.119.99.14",
    "gpu": 4,
    "launch_pid": 164808,
    "pid": 164813,
    "run_name": "imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844",
    "run_dir": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844",
    "log_path": "/home/droid/roboimi_suite_20260404/runs/imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844/train_vla.log",
    "launch_log": "/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844.launch.log",
    "dataset_dir": "/home/droid/sim_dataset/sim_transfer",
    "camera_names": [
      "top"
    ],
    "pred_horizon": 16,
    "num_action_steps": 8,
    "head_cond_dim": 80,
    "head_n_emb": 384,
    "head_n_layer": 12,
    "vision_backbone_mode": "resnet",
    "pretrained_backbone_weights": null,
    "freeze_backbone": false,
    "batch_size": 80,
    "lr": 0.00025,
    "num_workers": 12,
    "max_steps": 50000,
    "rollout_val_freq_epochs": 5,
    "rollout_num_episodes": 5,
    "swanlab_url": "https://swanlab.cn/@game-loader/roboimi-vla/runs/egzo29l3z9ftsaunhf025",
    "latest_step": 400,
    "latest_loss": 0.1233,
    "process_running": true
  }
}
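Suite status files like the JSON above can be summarized mechanically instead of read by hand. A minimal sketch (the field names mirror the status files above; `summarize_status` itself is an illustrative helper, not part of the suite tooling):

```python
import json

def summarize_status(text: str) -> str:
    """Return a one-line summary of a suite status JSON document."""
    doc = json.loads(text)
    run = doc["main_run"]
    return (f"{doc['suite_name']}: {run['status']} "
            f"step={run['latest_step']} loss={run['latest_loss']}")

# A trimmed-down example document with the same field names
example = '''{
  "suite_name": "2026-04-05-top-only-resnet-1cam",
  "main_run": {"status": "running", "latest_step": 400, "latest_loss": 0.1233}
}'''
print(summarize_status(example))
# -> 2026-04-05-top-only-resnet-1cam: running step=400 loss=0.1233
```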
@@ -1,324 +0,0 @@
#!/usr/bin/env python3
"""
Convert HDF5 datasets to videos for visual inspection.

Features:
1. Convert a single episode to a video
2. Compare videos of multiple episodes side by side
3. Slow down playback for easier observation
"""
import os
import h5py
import glob
import cv2
import numpy as np


def episode_to_video(episode_file, output_path, camera='top', fps=30, slow_factor=1):
    """
    Convert a single episode to a video.

    Args:
        episode_file: path to the HDF5 file
        output_path: output video path
        camera: name of the camera to use
        fps: frame rate
        slow_factor: slow-down factor (1 = normal speed, 2 = half speed)
    """
    try:
        with h5py.File(episode_file, 'r') as f:
            # Read the image sequence
            img_path = f'/observations/images/{camera}'

            if img_path not in f:
                print(f"  ❌ Camera {camera} does not exist")
                return False

            images = f[img_path][:]  # shape: (T, H, W, C)
            qpos = f['/observations/qpos'][:]
            actions = f['/action'][:]

            total_frames = len(images)
            height, width = images.shape[1], images.shape[2]

            # Create the video writer
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            actual_fps = fps // slow_factor
            out = cv2.VideoWriter(output_path, fourcc, actual_fps, (width, height))

            # Write frame by frame
            for i in range(total_frames):
                frame = images[i].astype(np.uint8)

                # Overlay information on the image
                info_text = [
                    f"Episode: {os.path.basename(episode_file).replace('.hdf5', '')}",
                    f"Frame: {i}/{total_frames}",
                    f"qpos[0:3]: [{qpos[i, 0]:.2f}, {qpos[i, 1]:.2f}, {qpos[i, 2]:.2f}]",
                ]

                for j, text in enumerate(info_text):
                    cv2.putText(frame, text, (10, 30 + j*30),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

                out.write(frame)

            out.release()
            print(f"  ✅ Saved: {output_path}")
            print(f"  Frames: {total_frames}, size: {width}x{height}, FPS: {actual_fps}")
            return True

    except Exception as e:
        print(f"  ❌ Error: {e}")
        return False


def generate_all_videos(camera='top', num_episodes=5, slow_factor=1):
    """Generate videos for the first N episodes."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))

    if len(episode_files) == 0:
        print(f"❌ No data files found: {dataset_dir}")
        return

    # Create the output directory
    output_dir = '/tmp/dataset_videos'
    os.makedirs(output_dir, exist_ok=True)

    print(f"Found {len(episode_files)} episode files")
    print(f"Generating videos for the first {min(num_episodes, len(episode_files))} episodes\n")

    # Generate the videos
    for i in range(min(num_episodes, len(episode_files))):
        ep_file = episode_files[i]
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')
        output_path = f"{output_dir}/{ep_name}_{camera}.mp4"

        print(f"[{i+1}/{min(num_episodes, len(episode_files))}] {ep_name}")
        episode_to_video(ep_file, output_path, camera=camera, slow_factor=slow_factor)
        print()

    print(f"✅ All videos saved to: {output_dir}")
    print(f"\nHow to play:")
    print(f"  # Play a single video")
    print(f"  vlc {output_dir}/*.mp4")
    print(f"  ")
    print(f"  # Or use a file manager")
    print(f"  nautilus {output_dir}")


def generate_multi_camera_video(episode_idx=0, slow_factor=1):
    """Generate a video containing multiple cameras (split-screen layout)."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))

    if episode_idx >= len(episode_files):
        print(f"❌ Episode {episode_idx} does not exist")
        return

    ep_file = episode_files[episode_idx]

    try:
        with h5py.File(ep_file, 'r') as f:
            # Collect all cameras under /observations/images
            # (the original iterated f.keys(), which never matched 'images')
            cameras = []
            obs = f['/observations']
            if 'images' in obs:
                cameras = list(obs['images'].keys())

            print(f"Cameras in episode {episode_idx}: {cameras}")

            # Read images from all cameras
            all_images = {}
            for cam in cameras:
                img_path = f'/observations/images/{cam}'
                if img_path in f:
                    all_images[cam] = f[img_path][:]

            if not all_images:
                print("❌ No image data found")
                return

            # Use the first camera's dimensions
            first_cam = list(all_images.keys())[0]
            total_frames = len(all_images[first_cam])
            height, width = all_images[first_cam].shape[1], all_images[first_cam].shape[2]

            # Build the multi-camera layout
            num_cams = len(all_images)
            cols = min(2, num_cams)
            rows = (num_cams + cols - 1) // cols

            canvas_width = width * cols
            canvas_height = height * rows

            # Create the video writer
            output_path = f'/tmp/dataset_videos/episode_{episode_idx}_all_cameras.mp4'
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            out = cv2.VideoWriter(output_path, fourcc, 30 // slow_factor, (canvas_width, canvas_height))

            # Compose frame by frame
            for i in range(total_frames):
                canvas = np.zeros((canvas_height, canvas_width, 3), dtype=np.uint8)

                for cam_idx, cam_name in enumerate(all_images.keys()):
                    img = all_images[cam_name][i]

                    # Compute the position on the canvas
                    row = cam_idx // cols
                    col = cam_idx % cols
                    y_start = row * height
                    y_end = y_start + height
                    x_start = col * width
                    x_end = x_start + width

                    # Resize if necessary
                    if img.shape[:2] != (height, width):
                        img = cv2.resize(img, (width, height))

                    # Place onto the canvas
                    canvas[y_start:y_end, x_start:x_end] = img

                    # Add the camera name
                    cv2.putText(canvas, cam_name, (x_start + 10, y_start + 30),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)

                # Add frame info
                cv2.putText(canvas, f"Frame: {i}/{total_frames}", (10, canvas_height - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

                out.write(canvas)

            out.release()
            print(f"✅ Saved multi-camera video: {output_path}")

    except Exception as e:
        print(f"❌ Error: {e}")


def compare_episodes(camera='top', slow_factor=2):
    """Compare multiple episodes side by side."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))

    # Select the episodes to compare
    episodes_to_compare = [0, 1, 2, 3, 4]  # compare the first 5

    print(f"Comparing episodes: {episodes_to_compare}")

    # Read the data for every episode
    all_data = []
    for ep_idx in episodes_to_compare:
        if ep_idx >= len(episode_files):
            continue

        try:
            with h5py.File(episode_files[ep_idx], 'r') as f:
                img_path = f'/observations/images/{camera}'
                if img_path in f:
                    all_data.append({
                        'idx': ep_idx,
                        'images': f[img_path][:],
                        'qpos': f['/observations/qpos'][:]
                    })
        except Exception:
            pass

    if len(all_data) == 0:
        print("❌ No data")
        return

    # Derive the layout parameters
    first_data = all_data[0]
    height, width = first_data['images'].shape[1], first_data['images'].shape[2]
    total_frames = min([d['images'].shape[0] for d in all_data])

    # Build the side-by-side layout
    num_compare = len(all_data)
    canvas_width = width * num_compare
    canvas_height = height

    # Create the video
    output_path = f'/tmp/dataset_videos/compare_{camera}.mp4'
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, 30 // slow_factor, (canvas_width, canvas_height))

    print(f"Generating comparison video, {total_frames} frames total...")

    # Compare frame by frame
    for i in range(total_frames):
        canvas = np.zeros((canvas_height, canvas_width, 3), dtype=np.uint8)

        for j, data in enumerate(all_data):
            img = data['images'][i]
            qpos = data['qpos'][i]

            # Resize if necessary
            if img.shape[:2] != (height, width):
                img = cv2.resize(img, (width, height))

            # Place onto the canvas
            x_start = j * width
            x_end = x_start + width
            canvas[:, x_start:x_end] = img

            # Add info
            ep_name = f"Ep {data['idx']}"
            cv2.putText(canvas, ep_name, (x_start + 10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
            cv2.putText(canvas, f"qpos[0:3]: [{qpos[0]:.2f}, {qpos[1]:.2f}, {qpos[2]:.2f}]",
                        (x_start + 10, height - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

        # Add the frame number
        cv2.putText(canvas, f"Frame: {i}/{total_frames}", (10, canvas_height - 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)

        out.write(canvas)

        if i % 100 == 0:
            print(f"  Progress: {i}/{total_frames}")

    out.release()
    print(f"✅ Saved comparison video: {output_path}")


if __name__ == "__main__":
    import sys

    print("="*60)
    print("Dataset video generation tool")
    print("="*60)

    if len(sys.argv) > 1:
        command = sys.argv[1]

        if command == 'compare':
            # Compare multiple episodes
            camera = sys.argv[2] if len(sys.argv) > 2 else 'top'
            compare_episodes(camera=camera, slow_factor=2)

        elif command == 'multi':
            # Multi-camera video
            ep_idx = int(sys.argv[2]) if len(sys.argv) > 2 else 0
            generate_multi_camera_video(episode_idx=ep_idx, slow_factor=1)

        else:
            print("Unknown command")
    else:
        # Default: generate videos for the first 5 episodes
        print("\nGenerating videos for the first 5 episodes (top camera, 2x slow motion)...")
        print("="*60 + "\n")
        generate_all_videos(camera='top', num_episodes=5, slow_factor=2)

        print("\n" + "="*60)
        print("Other usage:")
        print("  python generate_dataset_videos.py compare top   # compare multiple episodes")
        print("  python generate_dataset_videos.py multi 0       # multi-camera video")
        print("="*60)
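The multi-camera composition in the script above tiles N camera streams onto a 2-column grid; that index arithmetic can be isolated and checked on its own, without OpenCV or HDF5. A minimal sketch mirroring the script's `cols`/`rows`/offset computation:

```python
def grid_layout(num_cams: int, width: int, height: int):
    """Compute canvas size and per-camera tile origins for a 2-column grid."""
    cols = min(2, num_cams)
    rows = (num_cams + cols - 1) // cols  # ceiling division
    canvas = (width * cols, height * rows)
    origins = []
    for idx in range(num_cams):
        row, col = divmod(idx, cols)
        origins.append((col * width, row * height))  # (x_start, y_start)
    return canvas, origins

canvas, origins = grid_layout(3, 640, 480)
print(canvas)   # -> (1280, 960)
print(origins)  # -> [(0, 0), (640, 0), (0, 480)]
```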
125 gr00t/main.py
@@ -1,125 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
GR00T (diffusion-based DiT policy) model builder.

This module provides functions to build GR00T models and optimizers
from configuration dictionaries (typically from config.yaml's 'gr00t:' section).
"""
import argparse
from pathlib import Path

import numpy as np
import torch
from .models import build_gr00t_model


def get_args_parser():
    """
    Create argument parser for GR00T model configuration.

    All parameters can be overridden via args_override dictionary in
    build_gr00t_model_and_optimizer(). This allows loading from config.yaml.
    """
    parser = argparse.ArgumentParser('GR00T training and evaluation script', add_help=False)

    # Training parameters
    parser.add_argument('--lr', default=1e-5, type=float,
                        help='Learning rate for main parameters')
    parser.add_argument('--lr_backbone', default=1e-5, type=float,
                        help='Learning rate for backbone parameters')
    parser.add_argument('--weight_decay', default=1e-4, type=float,
                        help='Weight decay for optimizer')

    # GR00T model architecture parameters
    parser.add_argument('--embed_dim', default=1536, type=int,
                        help='Embedding dimension for transformer')
    parser.add_argument('--hidden_dim', default=1024, type=int,
                        help='Hidden dimension for MLP layers')
    parser.add_argument('--state_dim', default=16, type=int,
                        help='State (qpos) dimension')
    parser.add_argument('--action_dim', default=16, type=int,
                        help='Action dimension')
    parser.add_argument('--num_queries', default=16, type=int,
                        help='Number of action queries (chunk size)')

    # DiT (Diffusion Transformer) parameters
    parser.add_argument('--num_layers', default=16, type=int,
                        help='Number of transformer layers')
    parser.add_argument('--nheads', default=32, type=int,
                        help='Number of attention heads')
    parser.add_argument('--mlp_ratio', default=4, type=float,
                        help='MLP hidden dimension ratio')
    parser.add_argument('--dropout', default=0.2, type=float,
                        help='Dropout rate')

    # Backbone parameters
    parser.add_argument('--backbone', default='dino_v2', type=str,
                        help='Backbone architecture (dino_v2, resnet18, resnet34)')
    parser.add_argument('--position_embedding', default='sine', type=str,
                        choices=('sine', 'learned'),
                        help='Type of positional encoding')

    # Camera configuration
    parser.add_argument('--camera_names', default=[], nargs='+',
                        help='List of camera names for observations')

    # Other parameters (not directly used but kept for compatibility)
    parser.add_argument('--batch_size', default=15, type=int)
    parser.add_argument('--epochs', default=20000, type=int)
    parser.add_argument('--masks', action='store_true',
                        help='Use intermediate layer features')
    parser.add_argument('--dilation', action='store_false',
                        help='Use dilated convolution in backbone')

    return parser


def build_gr00t_model_and_optimizer(args_override):
    """
    Build GR00T model and optimizer from config dictionary.

    This function is designed to work with config.yaml loading:
    1. Parse default arguments
    2. Override with values from args_override (typically from config['gr00t'])
    3. Build model and optimizer

    Args:
        args_override: Dictionary of config values, typically from config.yaml's 'gr00t:' section
            Expected keys: embed_dim, hidden_dim, state_dim, action_dim,
            num_queries, nheads, mlp_ratio, dropout, num_layers,
            lr, lr_backbone, camera_names, backbone, etc.

    Returns:
        model: GR00T model on CUDA
        optimizer: AdamW optimizer with separate learning rates for backbone and other params
    """
    parser = argparse.ArgumentParser('GR00T training and evaluation script',
                                     parents=[get_args_parser()])
    args = parser.parse_args()

    # Override with config values
    for k, v in args_override.items():
        setattr(args, k, v)

    # Build model
    model = build_gr00t_model(args)
    model.cuda()

    # Create parameter groups with different learning rates
    param_dicts = [
        {
            "params": [p for n, p in model.named_parameters()
                       if "backbone" not in n and p.requires_grad]
        },
        {
            "params": [p for n, p in model.named_parameters()
                       if "backbone" in n and p.requires_grad],
            "lr": args.lr_backbone,
        },
    ]

    optimizer = torch.optim.AdamW(param_dicts,
                                  lr=args.lr,
                                  weight_decay=args.weight_decay)

    return model, optimizer
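The override pattern used in `build_gr00t_model_and_optimizer` (parse defaults, then `setattr` each config key onto the namespace) can be sketched independently of the model code. A minimal, self-contained illustration with just two of the arguments; parsing `[]` instead of the real CLI keeps the sketch independent of `sys.argv`:

```python
import argparse

def build_args(args_override: dict) -> argparse.Namespace:
    """Parse defaults, then overwrite them with config-file values."""
    parser = argparse.ArgumentParser("sketch", add_help=False)
    parser.add_argument("--lr", default=1e-5, type=float)
    parser.add_argument("--embed_dim", default=1536, type=int)
    args = parser.parse_args([])  # no CLI: take defaults only
    for k, v in args_override.items():
        setattr(args, k, v)       # config values win over defaults
    return args

args = build_args({"lr": 2.5e-4})
print(args.lr, args.embed_dim)  # -> 0.00025 1536
```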
@@ -1,3 +0,0 @@
from .gr00t import build_gr00t_model

__all__ = ['build_gr00t_model']
@@ -1,142 +0,0 @@
from typing import Optional

from diffusers import ConfigMixin, ModelMixin
from diffusers.configuration_utils import register_to_config
from diffusers.models.embeddings import SinusoidalPositionalEmbedding, TimestepEmbedding, Timesteps
import torch
from torch import nn
import torch.nn.functional as F

class TimestepEncoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        embedding_dim = args.embed_dim
        self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=1)
        self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)

    def forward(self, timesteps):
        dtype = next(self.parameters()).dtype
        timesteps_proj = self.time_proj(timesteps).to(dtype)
        timesteps_emb = self.timestep_embedder(timesteps_proj)  # (N, D)
        return timesteps_emb


class AdaLayerNorm(nn.Module):
    def __init__(self, embedding_dim, norm_eps=1e-5, norm_elementwise_affine=False):
        super().__init__()

        output_dim = embedding_dim * 2
        self.silu = nn.SiLU()
        self.linear = nn.Linear(embedding_dim, output_dim)
        self.norm = nn.LayerNorm(output_dim // 2, norm_eps, norm_elementwise_affine)

    def forward(
        self,
        x: torch.Tensor,
        temb: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        temb = self.linear(self.silu(temb))
        scale, shift = temb.chunk(2, dim=1)
        x = self.norm(x) * (1 + scale[:, None]) + shift[:, None]
        return x


class BasicTransformerBlock(nn.Module):
    def __init__(self, args, crosss_attention_dim, use_self_attn=False):
        super().__init__()
        dim = args.embed_dim
        num_heads = args.nheads
        mlp_ratio = args.mlp_ratio
        dropout = args.dropout
        self.norm1 = AdaLayerNorm(dim)

        if not use_self_attn:
            self.attn = nn.MultiheadAttention(
                embed_dim=dim,
                num_heads=num_heads,
                dropout=dropout,
                kdim=crosss_attention_dim,
                vdim=crosss_attention_dim,
                batch_first=True,
            )
        else:
            self.attn = nn.MultiheadAttention(
                embed_dim=dim,
                num_heads=num_heads,
                dropout=dropout,
                batch_first=True,
            )

        self.norm2 = nn.LayerNorm(dim, eps=1e-5, elementwise_affine=False)

        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout)
        )

    def forward(self, hidden_states, temb, context=None):
        norm_hidden_states = self.norm1(hidden_states, temb)

        attn_output = self.attn(
            norm_hidden_states,
            context if context is not None else norm_hidden_states,
            context if context is not None else norm_hidden_states,
        )[0]

        hidden_states = attn_output + hidden_states

        norm_hidden_states = self.norm2(hidden_states)

        ff_output = self.mlp(norm_hidden_states)

        hidden_states = ff_output + hidden_states

        return hidden_states

class DiT(nn.Module):
    def __init__(self, args, cross_attention_dim):
        super().__init__()
        inner_dim = args.embed_dim
        num_layers = args.num_layers
        output_dim = args.hidden_dim

        self.timestep_encoder = TimestepEncoder(args)

        all_blocks = []
        for idx in range(num_layers):
            use_self_attn = idx % 2 == 1
            if use_self_attn:
                block = BasicTransformerBlock(args, crosss_attention_dim=None, use_self_attn=True)
            else:
|
|
||||||
block = BasicTransformerBlock(args, crosss_attention_dim=cross_attention_dim, use_self_attn=False)
|
|
||||||
all_blocks.append(block)
|
|
||||||
|
|
||||||
self.transformer_blocks = nn.ModuleList(all_blocks)
|
|
||||||
|
|
||||||
self.norm_out = nn.LayerNorm(inner_dim, eps=1e-6, elementwise_affine=False)
|
|
||||||
self.proj_out_1 = nn.Linear(inner_dim, 2 * inner_dim)
|
|
||||||
self.proj_out_2 = nn.Linear(inner_dim, output_dim)
|
|
||||||
|
|
||||||
def forward(self, hidden_states, timestep, encoder_hidden_states):
|
|
||||||
temb = self.timestep_encoder(timestep)
|
|
||||||
|
|
||||||
hidden_states = hidden_states.contiguous()
|
|
||||||
encoder_hidden_states = encoder_hidden_states.contiguous()
|
|
||||||
|
|
||||||
for idx, block in enumerate(self.transformer_blocks):
|
|
||||||
if idx % 2 == 1:
|
|
||||||
hidden_states = block(hidden_states, temb)
|
|
||||||
else:
|
|
||||||
hidden_states = block(hidden_states, temb, context=encoder_hidden_states)
|
|
||||||
|
|
||||||
conditioning = temb
|
|
||||||
shift, scale = self.proj_out_1(F.silu(conditioning)).chunk(2, dim=1)
|
|
||||||
hidden_states = self.norm_out(hidden_states) * (1 + scale[:, None]) + shift[:, None]
|
|
||||||
return self.proj_out_2(hidden_states)
|
|
||||||
|
|
||||||
|
|
||||||
def build_dit(args, cross_attention_dim):
|
|
||||||
return DiT(args, cross_attention_dim)
|
|
||||||
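The DiT above alternates block types by layer index: even-indexed blocks cross-attend to the image-feature tokens, odd-indexed blocks self-attend over the state/action tokens. A stdlib sketch of that schedule (the helper name is hypothetical):

```python
def dit_layer_schedule(num_layers):
    # mirrors `use_self_attn = idx % 2 == 1` in DiT.__init__:
    # even-indexed blocks cross-attend to encoder_hidden_states,
    # odd-indexed blocks self-attend over the state/action tokens
    return ["self" if idx % 2 == 1 else "cross" for idx in range(num_layers)]

schedule = dit_layer_schedule(4)
```

So a 4-layer DiT runs cross-, self-, cross-, then self-attention.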
@@ -1,124 +0,0 @@
from .modules import (
    build_action_decoder,
    build_action_encoder,
    build_state_encoder,
    build_time_sampler,
    build_noise_scheduler,
)
from .backbone import build_backbone
from .dit import build_dit
import torch
import torch.nn as nn
import torch.nn.functional as F


class gr00t(nn.Module):
    def __init__(
        self,
        backbones,
        dit,
        state_encoder,
        action_encoder,
        action_decoder,
        time_sampler,
        noise_scheduler,
        num_queries,
        camera_names,
    ):
        super().__init__()
        self.num_queries = num_queries
        self.camera_names = camera_names
        self.dit = dit
        self.state_encoder = state_encoder
        self.action_encoder = action_encoder
        self.action_decoder = action_decoder
        self.time_sampler = time_sampler
        self.noise_scheduler = noise_scheduler

        if backbones is not None:
            self.backbones = nn.ModuleList(backbones)
        else:
            raise NotImplementedError

    def forward(self, qpos, image, actions=None, is_pad=None):
        is_training = actions is not None  # train or val
        bs, _ = qpos.shape

        # Extract per-camera image features and flatten them into token sequences
        all_cam_features = []
        for cam_id, cam_name in enumerate(self.camera_names):
            features, pos = self.backbones[cam_id](image[:, cam_id])
            features = features[0]  # take the last layer feature
            B, C, H, W = features.shape
            features_seq = features.permute(0, 2, 3, 1).reshape(B, H * W, C)
            all_cam_features.append(features_seq)
        encoder_hidden_states = torch.cat(all_cam_features, dim=1)

        state_features = self.state_encoder(qpos)  # [B, 1, emb_dim]

        if is_training:
            # Training: corrupt the actions and regress the target velocity
            timesteps = self.time_sampler(bs, actions.device, actions.dtype)
            noisy_actions, target_velocity = self.noise_scheduler.add_noise(
                actions, timesteps
            )
            t_discretized = (timesteps[:, 0, 0] * 1000).long()
            action_features = self.action_encoder(noisy_actions, t_discretized)
            sa_embs = torch.cat((state_features, action_features), dim=1)
            model_output = self.dit(sa_embs, t_discretized, encoder_hidden_states)
            pred = self.action_decoder(model_output)
            pred_actions = pred[:, -actions.shape[1]:]
            action_loss = F.mse_loss(pred_actions, target_velocity, reduction='none')
            return pred_actions, action_loss
        else:
            # Inference: start from pure noise and integrate the predicted
            # velocity field with k Euler steps
            actions = torch.randn(bs, self.num_queries, qpos.shape[-1], device=qpos.device, dtype=qpos.dtype)
            k = 5
            dt = 1.0 / k
            for t in range(k):
                t_cont = t / float(k)
                t_discretized = int(t_cont * 1000)
                # Tensor of shape [B] for the DiT, consistent with the training path
                timesteps = torch.full((bs,), t_discretized, device=qpos.device, dtype=qpos.dtype)
                action_features = self.action_encoder(actions, timesteps)
                sa_embs = torch.cat((state_features, action_features), dim=1)
                model_output = self.dit(sa_embs, timesteps, encoder_hidden_states)
                pred = self.action_decoder(model_output)
                pred_velocity = pred[:, -self.num_queries:]
                actions = actions + pred_velocity * dt
            return actions, None


def build_gr00t_model(args):
    state_dim = args.state_dim
    action_dim = args.action_dim

    backbones = []
    for _ in args.camera_names:
        backbone = build_backbone(args)
        backbones.append(backbone)

    cross_attention_dim = backbones[0].num_channels

    dit = build_dit(args, cross_attention_dim)

    state_encoder = build_state_encoder(args)
    action_encoder = build_action_encoder(args)
    action_decoder = build_action_decoder(args)
    time_sampler = build_time_sampler(args)
    noise_scheduler = build_noise_scheduler(args)
    model = gr00t(
        backbones,
        dit,
        state_encoder,
        action_encoder,
        action_decoder,
        time_sampler,
        noise_scheduler,
        args.num_queries,
        args.camera_names,
    )

    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("number of parameters: %.2fM" % (n_parameters / 1e6,))
    return model
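The inference branch above is fixed-step Euler integration of the predicted velocity field from t=0 (noise) to t=1 (data). A minimal stdlib sketch of the loop's arithmetic (the function name is hypothetical):

```python
def euler_sample(velocity_fn, x, k=5):
    # integrate dx/dt = v(x, t) over [0, 1) in k uniform steps,
    # matching the k=5, dt=1/k loop in gr00t.forward
    dt = 1.0 / k
    for step in range(k):
        t = step / float(k)
        x = x + velocity_fn(x, t) * dt
    return x

result = euler_sample(lambda x, t: 2.0, 0.0)
```

With a constant velocity of 2.0 and x(0)=0, the five steps land on x(1)=2.0, as expected for Euler integration of a constant field.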
@@ -1,179 +0,0 @@
import torch
import torch.nn as nn
import torch.nn.functional as F


# ActionEncoder
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.embed_dim = args.embed_dim

    def forward(self, timesteps):
        timesteps = timesteps.float()
        B, T = timesteps.shape
        device = timesteps.device

        half_dim = self.embed_dim // 2

        exponent = -torch.arange(half_dim, dtype=torch.float, device=device) * (
            torch.log(torch.tensor(10000.0)) / half_dim
        )

        freqs = timesteps.unsqueeze(-1) * exponent.exp()

        sin = torch.sin(freqs)
        cos = torch.cos(freqs)
        enc = torch.cat([sin, cos], dim=-1)  # (B, T, w)

        return enc


class ActionEncoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        action_dim = args.action_dim
        embed_dim = args.embed_dim

        self.W1 = nn.Linear(action_dim, embed_dim)
        self.W2 = nn.Linear(2 * embed_dim, embed_dim)
        self.W3 = nn.Linear(embed_dim, embed_dim)

        self.pos_encoder = SinusoidalPositionalEncoding(args)

    def forward(self, actions, timesteps):
        B, T, _ = actions.shape

        # 1) Replicate each batch's single scalar time `tau` across all T steps
        #    so that shape => (B, T)
        if timesteps.dim() == 1 and timesteps.shape[0] == B:
            timesteps = timesteps.unsqueeze(1).expand(-1, T)
        else:
            raise ValueError(
                "Expected `timesteps` to have shape (B,) so we can replicate across T."
            )

        # 2) Standard action MLP step for shape => (B, T, w)
        a_emb = self.W1(actions)

        # 3) Get the sinusoidal encoding (B, T, w)
        tau_emb = self.pos_encoder(timesteps).to(dtype=a_emb.dtype)

        # 4) Concat along last dim => (B, T, 2w), then W2 => (B, T, w), swish
        x = torch.cat([a_emb, tau_emb], dim=-1)
        x = F.silu(self.W2(x))

        # 5) Finally W3 => (B, T, w)
        x = self.W3(x)

        return x


def build_action_encoder(args):
    return ActionEncoder(args)
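For a single scalar timestep, the encoding above reduces to concatenated sin/cos terms at geometrically spaced frequencies. A stdlib sketch (the helper name is hypothetical):

```python
import math

def sinusoidal_encoding(t, embed_dim):
    # mirrors SinusoidalPositionalEncoding.forward for one scalar timestep:
    # half sin terms and half cos terms over geometrically spaced frequencies
    half = embed_dim // 2
    freqs = [t * math.exp(-i * math.log(10000.0) / half) for i in range(half)]
    return [math.sin(f) for f in freqs] + [math.cos(f) for f in freqs]

enc = sinusoidal_encoding(0.0, 4)
```

At t=0 the sin half is all zeros and the cos half is all ones.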
# StateEncoder
class StateEncoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        input_dim = args.state_dim
        hidden_dim = args.hidden_dim
        output_dim = args.embed_dim

        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, states):
        state_emb = self.mlp(states)  # [B, emb_dim]
        state_emb = state_emb.unsqueeze(1)
        return state_emb  # [B, 1, emb_dim]


def build_state_encoder(args):
    return StateEncoder(args)


# ActionDecoder
class ActionDecoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        input_dim = args.hidden_dim
        hidden_dim = args.hidden_dim
        output_dim = args.action_dim

        self.num_queries = args.num_queries

        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, model_output):
        pred_actions = self.mlp(model_output)
        return pred_actions[:, -self.num_queries:]


def build_action_decoder(args):
    return ActionDecoder(args)


# TimeSampler
class TimeSampler(nn.Module):
    def __init__(self, noise_s=0.999, noise_beta_alpha=1.5, noise_beta_beta=1.0):
        super().__init__()
        self.noise_s = noise_s
        self.beta_dist = torch.distributions.Beta(noise_beta_alpha, noise_beta_beta)

    def forward(self, batch_size, device, dtype):
        sample = self.beta_dist.sample([batch_size]).to(device, dtype=dtype)
        sample = (1 - sample) * self.noise_s
        return sample[:, None, None]


def build_time_sampler(args):
    return TimeSampler()
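TimeSampler draws from Beta(1.5, 1.0), which skews toward 1, then flips and scales with (1 − sample) · 0.999, so training visits early, noisier timesteps more often. The same distribution is available in the Python stdlib; a one-scalar sketch (function name hypothetical):

```python
import random

def sample_time(noise_s=0.999, alpha=1.5, beta=1.0):
    # same distribution as TimeSampler.forward, one scalar at a time:
    # Beta(1.5, 1) skews toward 1, so (1 - sample) * noise_s skews toward 0
    return (1.0 - random.betavariate(alpha, beta)) * noise_s

random.seed(0)
samples = [sample_time() for _ in range(1000)]
```

Beta(1.5, 1) has mean 0.6, so the flipped samples average roughly 0.4 and always stay inside [0, 0.999].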
# NoiseScheduler
class FlowMatchingScheduler(nn.Module):
    def __init__(self):
        super().__init__()

    # --- Training: interpolate toward noise and compute the velocity target ---
    def add_noise(self, actions, timesteps):
        noise = torch.randn_like(actions)
        noisy_samples = actions * timesteps + noise * (1 - timesteps)
        target_velocity = actions - noise

        return noisy_samples, target_velocity

    # --- Inference: one Euler integration step ---
    def step(self, model_output, sample, dt):
        prev_sample = sample + model_output * dt
        return prev_sample


def build_noise_scheduler(args):
    return FlowMatchingScheduler()
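The scheduler implements linear-interpolation flow matching: x_t = t·x₁ + (1−t)·ε with velocity target x₁ − ε. When the velocity is exact, a single Euler step of size 1 from t=0 recovers the data. A pure-Python check of that identity with scalars:

```python
def add_noise(action, noise, t):
    # linear interpolation between noise (t=0) and data (t=1),
    # plus the constant velocity target, as in FlowMatchingScheduler
    return action * t + noise * (1 - t), action - noise

def euler_step(sample, velocity, dt):
    return sample + velocity * dt

action, noise = 2.0, -1.0
noisy, target_v = add_noise(action, noise, t=0.0)  # at t=0 the sample is pure noise
recovered = euler_step(noisy, target_v, dt=1.0)    # one exact step reaches the data
```

In practice the model only approximates the velocity, which is why inference takes several smaller Euler steps.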
@@ -1,90 +0,0 @@
"""
GR00T Policy wrapper for imitation learning.

This module provides the gr00tPolicy class that wraps the GR00T model
for training and evaluation in the imitation learning framework.
"""
import torch.nn as nn
from torchvision.transforms import v2

from roboimi.gr00t.main import build_gr00t_model_and_optimizer


class gr00tPolicy(nn.Module):
    """
    GR00T Policy for action prediction using a diffusion-based DiT architecture.

    This policy wraps the GR00T model and handles:
    - Image resizing to match DINOv2 patch size requirements
    - Image normalization (ImageNet stats)
    - Training with action chunks and loss computation
    - Inference with iterative denoising (Euler steps)
    """

    def __init__(self, args_override):
        super().__init__()
        model, optimizer = build_gr00t_model_and_optimizer(args_override)
        self.model = model
        self.optimizer = optimizer

        # DINOv2 requires image dimensions to be multiples of the patch size (14)
        # Common sizes: 224x224, 336x336, etc. (14*16=224, 14*24=336)
        self.patch_h = 16  # number of patches vertically
        self.patch_w = 22  # number of patches horizontally
        target_size = (self.patch_h * 14, self.patch_w * 14)  # (224, 308)

        # Training transform with data augmentation
        self.train_transform = v2.Compose([
            v2.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
            v2.RandomPerspective(distortion_scale=0.5),
            v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
            v2.GaussianBlur(kernel_size=(9, 9), sigma=(0.1, 2.0)),
            v2.Resize(target_size),
            v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ])

        # Inference transform (no augmentation)
        self.inference_transform = v2.Compose([
            v2.Resize(target_size),
            v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ])

    def __call__(self, qpos, image, actions=None, is_pad=None):
        """
        Forward pass for training or inference.

        Args:
            qpos: Joint positions [B, state_dim]
            image: Camera images [B, num_cameras, C, H, W]
            actions: Ground truth actions [B, chunk_size, action_dim] (training only)
            is_pad: Padding mask [B, chunk_size] (training only)

        Returns:
            Training: dict with 'loss' (masked MSE)
            Inference: predicted actions [B, num_queries, action_dim]
        """
        if actions is not None:  # training time
            # Apply transforms (augmentation + resize + normalization)
            image = self.train_transform(image)

            actions = actions[:, :self.model.num_queries]
            is_pad = is_pad[:, :self.model.num_queries]
            _, action_loss = self.model(qpos, image, actions, is_pad)

            # Mask out padded positions
            mse_loss = (action_loss * ~is_pad.unsqueeze(-1)).mean()

            loss_dict = {
                'loss': mse_loss
            }
            return loss_dict
        else:  # inference time
            # Apply transforms (resize + normalization only)
            image = self.inference_transform(image)
            a_hat, _ = self.model(qpos, image)
            return a_hat

    def configure_optimizers(self):
        """Return the optimizer for training."""
        return self.optimizer
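The target resolution comes from DINOv2's 14-pixel patches: the image height and width must each be an integer multiple of 14, and the policy picks 16×22 patches. The arithmetic, as a trivially checkable sketch (helper name hypothetical):

```python
PATCH = 14  # DINOv2 patch size in pixels

def dinov2_target_size(patch_h, patch_w):
    # image height/width must be integer multiples of the patch size
    return (patch_h * PATCH, patch_w * PATCH)

size = dinov2_target_size(16, 22)
```

So the policy resizes every camera frame to 224x308 before the backbone sees it.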
1 roboimi/.gitattributes vendored
@@ -1 +0,0 @@
-*.safetensors filter=lfs diff=lfs merge=lfs -text
@@ -3,7 +3,7 @@
 <body name="box" pos="0.2 1.0 0.47">
   <joint name="red_box_joint" type="free" frictionloss="0.01" />
   <inertial pos="0 0 0" mass="0.05" diaginertia="0.002 0.002 0.002" />
-  <geom contype="1" conaffinity="1" condim="4" solimp="2 1 0.01" solref="0.01 1" friction="1 0.005 0.0001" pos="0 0 0" size="0.018 0.018 0.02" type="box" name="red_box" rgba="1 0 0 1" />
+  <geom contype="1" conaffinity="1" condim="4" solimp="2 1 0.01" solref="0.01 1" friction="1 0.005 0.0001" pos="0 0 0" size="0.02 0.02 0.02" type="box" name="red_box" rgba="1 0 0 1" />
 </body>
 </worldbody>
 </mujoco>
@@ -8,6 +8,5 @@
 </body>
 <camera name="top" pos="0.0 1.0 2.0" fovy="44" mode="targetbody" target="table"/>
 <camera name="angle" pos="0.0 0.0 2.0" fovy="37" mode="targetbody" target="table"/>
-<camera name="front" pos="0 0 0.8" fovy="65" mode="fixed" quat="0.7071 0.7071 0 0"/>
 </worldbody>
 </mujoco>
@@ -1,46 +1,8 @@
 import mujoco
 import numpy as np
-from pathlib import Path
 from roboimi.utils.KDL_utils import KDL_utils
-
-
-def resolve_robot_asset_path(asset_path):
-    if asset_path is None:
-        return None
-
-    raw_path = Path(asset_path).expanduser()
-    if raw_path.is_absolute():
-        return str(raw_path.resolve())
-
-    current_dir = Path(__file__).resolve().parent
-    package_root = current_dir.parents[1]
-    repo_root = current_dir.parents[2]
-
-    candidates = []
-    if raw_path.parts and raw_path.parts[0] == 'roboimi':
-        candidates.append(repo_root / raw_path)
-
-    candidates.extend([
-        current_dir / raw_path,
-        package_root / raw_path,
-        repo_root / raw_path,
-    ])
-
-    normalized_candidates = []
-    seen = set()
-    for candidate in candidates:
-        resolved = candidate.resolve()
-        if resolved not in seen:
-            normalized_candidates.append(resolved)
-            seen.add(resolved)
-
-    for candidate in normalized_candidates:
-        if candidate.exists():
-            return str(candidate)
-
-    return str(normalized_candidates[0])
-
-
 class ArmBase(object):
     def __init__(self,
                  name=None,
@@ -49,8 +11,8 @@ class ArmBase(object):
                  gripper=None
                  ):
         self.name = name
-        self.urdf_path = resolve_robot_asset_path(urdf_path)
-        self.xml_path = resolve_robot_asset_path(xml_path)
+        self.urdf_path = urdf_path
+        self.xml_path = xml_path
         self.gripper = gripper
         self.robot_model = mujoco.MjModel.from_xml_path(filename=self.xml_path, assets=None)
         self.robot_data = mujoco.MjData(self.robot_model)
@@ -58,8 +58,8 @@ class BiDianaMed(ArmBase):
     def __init__(self):
         super().__init__(
             name="Bidiana",
-            urdf_path="roboimi/assets/models/manipulators/DianaMed/DualDianaMed.urdf",
-            xml_path="roboimi/assets/models/manipulators/DianaMed/bi_diana_transfer_ee.xml",
+            urdf_path="./assets/models/manipulators/DianaMed/DualDianaMed.urdf",
+            xml_path="./assets/models/manipulators/DianaMed/bi_diana_transfer_ee.xml",
             gripper=None
         )
         self.left_arm = self.Arm(self, 'single', self.urdf_path)
112 roboimi/ddt/main.py Normal file
@@ -0,0 +1,112 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DDT model construction and optimizer configuration.
"""
import argparse

import torch

from .models import build_DDT_model


def get_args_parser():
    """Return the argument parser for the DDT model."""
    parser = argparse.ArgumentParser('DDT model configuration', add_help=False)

    # Learning rates and training schedule
    parser.add_argument('--lr', default=1e-4, type=float)
    parser.add_argument('--lr_backbone', default=1e-5, type=float)
    parser.add_argument('--batch_size', default=2, type=int)
    parser.add_argument('--weight_decay', default=1e-4, type=float)
    parser.add_argument('--epochs', default=300, type=int)
    parser.add_argument('--lr_drop', default=200, type=int)
    parser.add_argument('--clip_max_norm', default=0.1, type=float,
                        help='gradient clipping max norm')
    parser.add_argument('--qpos_noise_std', action='store', default=0, type=float)

    # Backbone parameters
    parser.add_argument('--backbone', default='resnet18', type=str,
                        help="Name of the convolutional backbone to use")
    parser.add_argument('--dilation', action='store_true',
                        help="If true, replace stride with dilation in the last conv block")
    parser.add_argument('--position_embedding', default='sine', type=str,
                        choices=('sine', 'learned'),
                        help="Type of positional embedding")
    parser.add_argument('--camera_names', default=[], nargs='+',
                        help="A list of camera names")

    # Transformer parameters
    parser.add_argument('--enc_layers', default=4, type=int,
                        help="Number of encoding layers in the transformer")
    parser.add_argument('--dec_layers', default=6, type=int,
                        help="Number of decoding layers in the transformer")
    parser.add_argument('--dim_feedforward', default=2048, type=int,
                        help="Intermediate size of the feedforward layers")
    parser.add_argument('--hidden_dim', default=512, type=int,
                        help="Size of the embeddings (dimension of the transformer)")
    parser.add_argument('--dropout', default=0.1, type=float,
                        help="Dropout applied in the transformer")
    parser.add_argument('--nheads', default=8, type=int,
                        help="Number of attention heads")
    parser.add_argument('--num_queries', default=100, type=int,
                        help="Number of query slots (action horizon)")
    parser.add_argument('--pre_norm', action='store_true')
    parser.add_argument('--state_dim', default=14, type=int)
    parser.add_argument('--action_dim', default=14, type=int)

    # DDT-specific parameters
    parser.add_argument('--num_blocks', default=12, type=int,
                        help="Total number of transformer blocks in DDT")
    parser.add_argument('--mlp_ratio', default=4.0, type=float,
                        help="MLP hidden dimension ratio")
    parser.add_argument('--num_inference_steps', default=10, type=int,
                        help="Number of diffusion inference steps")

    # Segmentation (unused)
    parser.add_argument('--masks', action='store_true',
                        help="Train segmentation head if provided")

    return parser


def build_DDT_model_and_optimizer(args_override):
    """Build the DDT model and its optimizer.

    Args:
        args_override: dict of values overriding the parser defaults

    Returns:
        model: the DDT model
        optimizer: an AdamW optimizer
    """
    parser = argparse.ArgumentParser('DDT training script', parents=[get_args_parser()])
    args = parser.parse_args([])  # empty list keeps real CLI arguments from interfering

    # Apply overrides
    for k, v in args_override.items():
        setattr(args, k, v)

    # Build the model
    model = build_DDT_model(args)
    model.cuda()

    # Configure the optimizer (smaller learning rate for the backbone)
    param_dicts = [
        {
            "params": [p for n, p in model.named_parameters()
                       if "backbone" not in n and p.requires_grad]
        },
        {
            "params": [p for n, p in model.named_parameters()
                       if "backbone" in n and p.requires_grad],
            "lr": args.lr_backbone,
        },
    ]
    optimizer = torch.optim.AdamW(
        param_dicts,
        lr=args.lr,
        weight_decay=args.weight_decay
    )

    return model, optimizer
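The two param_dicts above split parameters by whether "backbone" appears in their name, so the backbone gets its smaller learning rate. That split can be checked without torch; this hypothetical sketch stands in for named_parameters() with (name, requires_grad) pairs:

```python
def split_param_groups(named_params):
    # mirrors the param_dicts construction: backbone weights get their own group,
    # frozen parameters (requires_grad=False) are excluded from both
    head = [n for n, requires_grad in named_params
            if "backbone" not in n and requires_grad]
    backbone = [n for n, requires_grad in named_params
                if "backbone" in n and requires_grad]
    return head, backbone

params = [
    ("backbone.conv1.weight", True),
    ("dit.proj_out_2.weight", True),
    ("backbone.bn1.bias", False),  # frozen: belongs to neither group
]
head, backbone = split_param_groups(params)
```

Every trainable parameter lands in exactly one group, which is what AdamW's per-group `lr` override requires.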
7 roboimi/ddt/models/__init__.py Normal file
@@ -0,0 +1,7 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
from .model import build as build_ddt


def build_DDT_model(args):
    """Unified entry point for building the DDT model."""
    return build_ddt(args)
631 roboimi/ddt/models/ddt.py Normal file
@@ -0,0 +1,631 @@
"""
Action-sequence diffusion transformer (Action Decoupled Diffusion Transformer).

Adapted from the DDT architecture to generate robot action sequences.
Main changes:
1. 2D RoPE -> 1D RoPE (suited to temporal data)
2. LabelEmbedder -> ObservationEncoder (observation conditioning)
3. No patchify/unpatchify (action sequences are already 1D)
"""

import math
from typing import Tuple, Optional

import torch
import torch.nn as nn
from torch.nn.functional import scaled_dot_product_attention


# ============================================================================
# Generic utilities
# ============================================================================

def modulate(x: torch.Tensor, shift: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """AdaLN modulation.

    Args:
        x: input tensor.
        shift: additive shift.
        scale: multiplicative scale.

    Returns:
        The modulated tensor: x * (1 + scale) + shift
    """
    return x * (1 + scale) + shift


# ============================================================================
# 1D rotary position embedding (RoPE)
# ============================================================================

def precompute_freqs_cis_1d(dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    """Precompute the complex frequencies for 1D rotary position embedding.

    Positional encoding for temporal data (such as action sequences); simpler
    and cheaper than 2D RoPE.

    Args:
        dim: dimension per attention head (head_dim).
        seq_len: sequence length.
        theta: RoPE base frequency, default 10000.0.

    Returns:
        Complex frequency tensor of shape (seq_len, dim//2).
    """
    # Frequencies: 1 / (theta^(2i/dim))
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))  # [dim//2]
    # Position indices
    t = torch.arange(seq_len).float()  # [seq_len]
    # Outer product gives the position-frequency matrix
    freqs = torch.outer(t, freqs)  # [seq_len, dim//2]
    # Convert to complex (polar) form
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # [seq_len, dim//2]
    return freqs_cis


def apply_rotary_emb_1d(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Apply 1D rotary position embedding to the queries and keys.

    Args:
        xq: query tensor of shape (B, N, H, Hc).
        xk: key tensor of shape (B, N, H, Hc).
        freqs_cis: precomputed complex frequencies of shape (N, Hc//2).

    Returns:
        (xq, xk) with RoPE applied; shapes unchanged.
    """
    # Reshape freqs_cis for broadcasting: [1, N, 1, Hc//2]
    freqs_cis = freqs_cis[None, :, None, :]

    # View real tensors as complex: [B, N, H, Hc] -> [B, N, H, Hc//2] (complex)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))

    # Complex multiplication performs the rotation
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)  # [B, N, H, Hc]
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)
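The complex-multiplication trick above can be checked without torch: treating each (even, odd) channel pair as a complex number and multiplying by e^(i·pos·freq) is a pure rotation, so vector norms are preserved and position 0 is the identity. A stdlib sketch (the function name is hypothetical):

```python
import cmath

def rope_rotate(vec, pos, theta=10000.0):
    # vec has even length; each pair (vec[2i], vec[2i+1]) is viewed as a
    # complex number and rotated by angle pos / theta^(2i/dim)
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        freq = pos / theta ** (2 * i / dim)
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * freq)
        out.extend([z.real, z.imag])
    return out

v = [1.0, 0.0, 0.5, -0.5]
rotated = rope_rotate(v, 7)
```

Because the rotation angle is linear in position, attention scores between rotated queries and keys depend only on their relative offset, which is the point of RoPE.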
|
# ============================================================================
|
||||||
|
# 基础组件
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
class Embed(nn.Module):
    """Linear embedding layer that projects inputs into the hidden space."""

    def __init__(
        self,
        in_chans: int = 3,
        embed_dim: int = 768,
        norm_layer: Optional[nn.Module] = None,
        bias: bool = True,
    ):
        """Initialize Embed.

        Args:
            in_chans: Number of input channels/dimensions.
            embed_dim: Output embedding dimension.
            norm_layer: Optional normalization layer.
            bias: Whether to use a bias term.
        """
        super().__init__()
        self.in_chans = in_chans
        self.embed_dim = embed_dim
        self.proj = nn.Linear(in_chans, embed_dim, bias=bias)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        x = self.norm(x)
        return x

class TimestepEmbedder(nn.Module):
    """Diffusion timestep embedder.

    Maps a scalar timestep to a high-dimensional vector via sinusoidal
    positional encoding followed by an MLP.
    """

    def __init__(self, hidden_size: int, frequency_embedding_size: int = 256):
        """Initialize TimestepEmbedder.

        Args:
            hidden_size: Output embedding dimension.
            frequency_embedding_size: Dimension of the sinusoidal encoding.
        """
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
            nn.SiLU(),
            nn.Linear(hidden_size, hidden_size, bias=True),
        )
        self.frequency_embedding_size = frequency_embedding_size

    @staticmethod
    def timestep_embedding(t: torch.Tensor, dim: int, max_period: float = 10.0) -> torch.Tensor:
        """Create sinusoidal timestep embeddings.

        Args:
            t: Timestep tensor of shape (B,).
            dim: Embedding dimension.
            max_period: Maximum period of the sinusoids.

        Returns:
            Timestep embeddings of shape (B, dim).
        """
        half = dim // 2
        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32, device=t.device) / half
        )
        args = t[..., None].float() * freqs[None, ...]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        if dim % 2:
            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
        return embedding

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
        t_emb = self.mlp(t_freq)
        return t_emb

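The sinusoidal part can be sketched for a single scalar timestep in pure Python (even `dim` only; the odd-`dim` zero-padding branch is omitted here):

```python
import math

def timestep_embedding(t: float, dim: int, max_period: float = 10.0):
    # Pure-Python version of TimestepEmbedder.timestep_embedding
    # for one scalar timestep: [cos(t * f_i)] ++ [sin(t * f_i)].
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]

emb = timestep_embedding(0.0, dim=8)
print(emb[:4], emb[4:])  # [1.0, 1.0, 1.0, 1.0] [0.0, 0.0, 0.0, 0.0] at t = 0
```

At t = 0 the cosine half is all ones and the sine half all zeros, so distinct timesteps get distinct, bounded codes.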
class ObservationEncoder(nn.Module):
    """Observation/state encoder.

    Encodes the robot observation vector (e.g., joint positions, end-effector
    pose, image features) into a conditioning vector for conditional diffusion
    generation.

    Attributes:
        encoder: Two-layer MLP encoder.

    Example:
        >>> encoder = ObservationEncoder(obs_dim=128, hidden_size=512)
        >>> obs = torch.randn(2, 128)
        >>> cond = encoder(obs)  # [2, 512]
    """

    def __init__(self, obs_dim: int, hidden_size: int):
        """Initialize ObservationEncoder.

        Args:
            obs_dim: Dimension of the observation vector.
            hidden_size: Dimension of the output conditioning vector.
        """
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            obs: Observation vector of shape (B, obs_dim).

        Returns:
            Conditioning vector of shape (B, hidden_size).
        """
        return self.encoder(obs)

class FinalLayer(nn.Module):
    """Final output layer: AdaLN modulation followed by a linear projection."""

    def __init__(self, hidden_size: int, out_channels: int):
        """Initialize FinalLayer.

        Args:
            hidden_size: Input hidden dimension.
            out_channels: Output channels/dimension.
        """
        super().__init__()
        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
        self.linear = nn.Linear(hidden_size, out_channels, bias=True)
        self.adaLN_modulation = nn.Sequential(
            nn.Linear(hidden_size, 2 * hidden_size, bias=True)
        )

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (B, N, hidden_size).
            c: Conditioning tensor of shape (B, N, hidden_size) or (B, 1, hidden_size).

        Returns:
            Output tensor of shape (B, N, out_channels).
        """
        shift, scale = self.adaLN_modulation(c).chunk(2, dim=-1)
        x = modulate(self.norm_final(x), shift, scale)
        x = self.linear(x)
        return x


# ============================================================================
# Normalization and feed-forward networks
# ============================================================================

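`modulate` itself is defined elsewhere in this file (outside this excerpt); in DiT-style AdaLN it is conventionally `x * (1 + scale) + shift`, which makes a zero-initialized modulation an identity map. A scalar sketch under that assumption:

```python
def modulate(x, shift, scale):
    # Assumed DiT-style AdaLN modulation (not shown in this excerpt);
    # with zero-initialized adaLN_modulation (shift = scale = 0)
    # this reduces to the identity.
    return x * (1 + scale) + shift

print(modulate(2.0, shift=0.0, scale=0.0))  # 2.0 (identity at init)
print(modulate(2.0, shift=1.0, scale=0.5))  # 4.0
```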
class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization.

    RMSNorm is a simplified LayerNorm that drops mean-centering and keeps only
    the scaling step. It is faster than LayerNorm with comparable quality, and
    is widely used in large models such as LLaMA and Mistral.

    Formula:
        RMSNorm(x) = x / sqrt(mean(x^2) + eps) * weight
    """

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        """Initialize RMSNorm.

        Args:
            hidden_size: Dimension of the input features.
            eps: Small constant to avoid division by zero.
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

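For a single vector the normalization reduces to a few lines; a pure-Python sketch showing that the output has approximately unit mean-square:

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    # Pure-Python RMSNorm over one vector: no mean-centering,
    # just division by the root-mean-square (plus eps).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    w = weight if weight is not None else [1.0] * len(x)
    return [wi * v / rms for wi, v in zip(w, x)]

y = rms_norm([3.0, -4.0])  # input mean-square is 12.5
print(sum(v * v for v in y) / len(y))  # ~1.0: unit mean-square output
```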
class FeedForward(nn.Module):
    """SwiGLU feed-forward network.

    Feed-forward network with the SwiGLU gated activation, as used in the
    LLaMA architecture.

    Structure:
        output = W2(SiLU(W1(x)) * W3(x))
    """

    def __init__(self, dim: int, hidden_dim: int):
        """Initialize FeedForward.

        Args:
            dim: Input and output feature dimension.
            hidden_dim: Hidden dimension (2/3 * hidden_dim is used in practice).
        """
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.w2(torch.nn.functional.silu(self.w1(x)) * self.w3(x))
        return x


# ============================================================================
# Attention
# ============================================================================

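Per scalar, the gating in `forward` reduces to `SiLU(a) * b` with `a = W1 x` and `b = W3 x` (the projections already applied); a minimal sketch:

```python
import math

def silu(v: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return v / (1.0 + math.exp(-v))

def swiglu(a: float, b: float) -> float:
    # Scalar SwiGLU gate: the SiLU path gates the linear value path.
    return silu(a) * b

print(swiglu(0.0, 5.0))  # 0.0: a zero gate fully suppresses the value path
```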
class RAttention(nn.Module):
    """Multi-head self-attention with rotary position embedding.

    Combines the following techniques:
    - 1D RoPE: encodes temporal position via complex rotation
    - QK-Norm: normalizes queries and keys to stabilize training
    - Flash Attention: uses scaled_dot_product_attention
    """

    def __init__(
        self,
        dim: int,
        num_heads: int = 8,
        qkv_bias: bool = False,
        qk_norm: bool = True,
        attn_drop: float = 0.,
        proj_drop: float = 0.,
        norm_layer: nn.Module = RMSNorm,
    ) -> None:
        """Initialize RAttention.

        Args:
            dim: Input feature dimension; must be divisible by num_heads.
            num_heads: Number of attention heads.
            qkv_bias: Whether the QKV projection uses a bias.
            qk_norm: Whether to normalize Q and K.
            attn_drop: Dropout rate on attention weights.
            proj_drop: Dropout rate on the output projection.
            norm_layer: Normalization layer type.
        """
        super().__init__()
        assert dim % num_heads == 0, 'dim should be divisible by num_heads'

        self.dim = dim
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.q_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
        self.k_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(
        self,
        x: torch.Tensor,
        pos: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (B, N, C).
            pos: 1D RoPE frequencies of shape (N, head_dim//2).
            mask: Optional attention mask.

        Returns:
            Output tensor of shape (B, N, C).
        """
        B, N, C = x.shape

        # QKV projection
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 1, 3, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # [B, N, H, Hc]

        # QK-Norm
        q = self.q_norm(q)
        k = self.k_norm(k)

        # Apply 1D RoPE
        q, k = apply_rotary_emb_1d(q, k, freqs_cis=pos)

        # Rearrange: [B, N, H, Hc] -> [B, H, N, Hc]
        q = q.view(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)
        k = k.view(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2).contiguous()
        v = v.view(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2).contiguous()

        # Scaled dot-product attention
        x = scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)

        # Output projection
        x = x.transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


# ============================================================================
# Transformer block
# ============================================================================

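What `scaled_dot_product_attention` computes per head can be sketched for a single query in pure Python (with a numerically stable softmax):

```python
import math

def sdpa(q, k, v):
    # Single-query scaled dot-product attention over lists of vectors,
    # mirroring what scaled_dot_product_attention does per head and row.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, kj)) / math.sqrt(d) for kj in k]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * vj[i] for w, vj in zip(weights, v)) for i in range(len(v[0]))]

out = sdpa(q=[1.0, 0.0], k=[[1.0, 0.0], [0.0, 1.0]], v=[[1.0], [0.0]])
print(out)  # leans toward v[0], since the first key matches the query
```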
class ActionDDTBlock(nn.Module):
    """Action DDT Transformer block.

    Structure: Pre-Norm + AdaLN + Attention + FFN
    """

    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: float = 4.0):
        """Initialize ActionDDTBlock.

        Args:
            hidden_size: Hidden dimension.
            num_heads: Number of attention heads.
            mlp_ratio: FFN hidden-layer multiplier.
        """
        super().__init__()
        self.norm1 = RMSNorm(hidden_size, eps=1e-6)
        self.attn = RAttention(hidden_size, num_heads=num_heads, qkv_bias=False)
        self.norm2 = RMSNorm(hidden_size, eps=1e-6)
        mlp_hidden_dim = int(hidden_size * mlp_ratio)
        self.mlp = FeedForward(hidden_size, mlp_hidden_dim)
        self.adaLN_modulation = nn.Sequential(
            nn.Linear(hidden_size, 6 * hidden_size, bias=True)
        )

    def forward(
        self,
        x: torch.Tensor,
        c: torch.Tensor,
        pos: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (B, N, hidden_size).
            c: Conditioning tensor of shape (B, 1, hidden_size) or (B, N, hidden_size).
            pos: Position encoding.
            mask: Optional attention mask.

        Returns:
            Output tensor of shape (B, N, hidden_size).
        """
        # AdaLN modulation parameters
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = \
            self.adaLN_modulation(c).chunk(6, dim=-1)

        # Attention branch
        x = x + gate_msa * self.attn(modulate(self.norm1(x), shift_msa, scale_msa), pos, mask=mask)
        # FFN branch
        x = x + gate_mlp * self.mlp(modulate(self.norm2(x), shift_mlp, scale_mlp))

        return x


# ============================================================================
# Main model: ActionDDT
# ============================================================================

class ActionDDT(nn.Module):
    """Action Decoupled Diffusion Transformer.

    Based on the DDT architecture, designed for robot action-sequence
    generation. The model is decoupled into an encoder and a decoder; the
    encoder state can be cached to speed up inference.

    Architecture:
    - Encoder: the first num_encoder_blocks blocks, producing state s
    - Decoder: the remaining blocks, denoising the action sequence x conditioned on s
    - 1D RoPE for temporal position encoding
    - AdaLN to inject the timestep and observation conditions

    Args:
        action_dim: Action vector dimension (e.g., 7 for a 7-DoF arm).
        obs_dim: Observation vector dimension.
        action_horizon: Length of the predicted action sequence.
        hidden_size: Transformer hidden dimension.
        num_blocks: Total number of Transformer blocks.
        num_encoder_blocks: Number of encoder blocks.
        num_heads: Number of attention heads.
        mlp_ratio: FFN hidden-layer multiplier.

    Inputs:
        x (Tensor): Noisy action sequence of shape (B, T, action_dim).
        t (Tensor): Diffusion timestep of shape (B,), in [0, 1].
        obs (Tensor): Observation condition of shape (B, obs_dim).
        s (Tensor, optional): Cached encoder state.

    Outputs:
        x (Tensor): Predicted velocity field/noise of shape (B, T, action_dim).
        s (Tensor): Encoder state, reusable across denoising steps.

    Example:
        >>> model = ActionDDT(action_dim=7, obs_dim=128, action_horizon=16)
        >>> x = torch.randn(2, 16, 7)   # noisy action sequence
        >>> t = torch.rand(2)           # random timesteps
        >>> obs = torch.randn(2, 128)   # observation condition
        >>> out, state = model(x, t, obs)
        >>> out.shape
        torch.Size([2, 16, 7])
    """

    def __init__(
        self,
        action_dim: int = 7,
        obs_dim: int = 128,
        action_horizon: int = 16,
        hidden_size: int = 512,
        num_blocks: int = 12,
        num_encoder_blocks: int = 4,
        num_heads: int = 8,
        mlp_ratio: float = 4.0,
    ):
        super().__init__()

        # Save configuration
        self.action_dim = action_dim
        self.obs_dim = obs_dim
        self.action_horizon = action_horizon
        self.hidden_size = hidden_size
        self.num_blocks = num_blocks
        self.num_encoder_blocks = num_encoder_blocks
        self.num_heads = num_heads

        # Action embedding layers
        self.x_embedder = Embed(action_dim, hidden_size, bias=True)
        self.s_embedder = Embed(action_dim, hidden_size, bias=True)

        # Condition embeddings
        self.t_embedder = TimestepEmbedder(hidden_size)
        self.obs_encoder = ObservationEncoder(obs_dim, hidden_size)

        # Output layer
        self.final_layer = FinalLayer(hidden_size, action_dim)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            ActionDDTBlock(hidden_size, num_heads, mlp_ratio)
            for _ in range(num_blocks)
        ])

        # Precompute the 1D position encoding
        pos = precompute_freqs_cis_1d(hidden_size // num_heads, action_horizon)
        self.register_buffer('pos', pos)

        # Initialize weights
        self.initialize_weights()

    def initialize_weights(self):
        """Initialize model weights."""
        # Xavier initialization for the embedding layers
        for embedder in [self.x_embedder, self.s_embedder]:
            w = embedder.proj.weight.data
            nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
            nn.init.constant_(embedder.proj.bias, 0)

        # Timestep embedding MLP
        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)

        # Observation encoder
        for m in self.obs_encoder.encoder:
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=0.02)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

        # Zero-initialize the output layer (AdaLN-Zero)
        nn.init.constant_(self.final_layer.adaLN_modulation[-1].weight, 0)
        nn.init.constant_(self.final_layer.adaLN_modulation[-1].bias, 0)
        nn.init.constant_(self.final_layer.linear.weight, 0)
        nn.init.constant_(self.final_layer.linear.bias, 0)

    def forward(
        self,
        x: torch.Tensor,
        t: torch.Tensor,
        obs: torch.Tensor,
        s: Optional[torch.Tensor] = None,
        mask: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward pass.

        Args:
            x: Noisy action sequence [B, T, action_dim]
            t: Diffusion timestep [B] or [B, 1], in [0, 1]
            obs: Observation condition [B, obs_dim]
            s: Optional cached encoder state [B, T, hidden_size]
            mask: Optional attention mask

        Returns:
            x: Predicted velocity field/noise [B, T, action_dim]
            s: Encoder state [B, T, hidden_size], reusable across steps
        """
        B, T, _ = x.shape

        # 1. Timestep embedding: [B] -> [B, 1, hidden_size]
        t_emb = self.t_embedder(t.view(-1)).view(B, 1, self.hidden_size)

        # 2. Observation embedding: [B, obs_dim] -> [B, 1, hidden_size]
        obs_emb = self.obs_encoder(obs).view(B, 1, self.hidden_size)

        # 3. Fuse the conditions: c = SiLU(t + obs)
        c = nn.functional.silu(t_emb + obs_emb)

        # 4. Encoder: produce state s
        if s is None:
            # State embedding: [B, T, action_dim] -> [B, T, hidden_size]
            s = self.s_embedder(x)
            # Run the encoder blocks
            for i in range(self.num_encoder_blocks):
                s = self.blocks[i](s, c, self.pos, mask)
            # Fuse in the timestep information
            s = nn.functional.silu(t_emb + s)

        # 5. Decoder: denoise
        # Input embedding: [B, T, action_dim] -> [B, T, hidden_size]
        x = self.x_embedder(x)
        # Run the decoder blocks, conditioned on s
        for i in range(self.num_encoder_blocks, self.num_blocks):
            x = self.blocks[i](x, s, self.pos, None)

        # 6. Final layer: [B, T, hidden_size] -> [B, T, action_dim]
        x = self.final_layer(x, s)

        return x, s
304	roboimi/ddt/models/model.py	Normal file
@@ -0,0 +1,304 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DDT model and criterion classes.

Core assembly file: combines the Backbone, Transformer, and Diffusion
components into the complete model.
"""
from typing import Optional

import torch
import torch.nn.functional as F
from torch import nn
import numpy as np

from .backbone import build_backbone
from .ddt import ActionDDT


class SpatialSoftmax(nn.Module):
    """Spatial Softmax layer that converts feature maps into keypoint coordinates.

    From Diffusion Policy; preserves spatial location information by computing
    the soft-attention-weighted expected coordinates for each channel.

    Args:
        num_kp: Number of keypoints (equals the number of input channels)
        temperature: Softmax temperature parameter
        learnable_temperature: Whether the temperature is learnable

    Input:  [B, C, H, W]
    Output: [B, C * 2] - an (x, y) coordinate per channel
    """

    def __init__(self, num_kp: Optional[int] = None, temperature: float = 1.0, learnable_temperature: bool = True):
        super().__init__()
        self.num_kp = num_kp
        if learnable_temperature:
            self.temperature = nn.Parameter(torch.ones(1) * temperature)
        else:
            self.register_buffer('temperature', torch.ones(1) * temperature)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape

        # Normalized coordinate grids in [-1, 1]
        pos_x = torch.linspace(-1, 1, W, device=x.device, dtype=x.dtype)
        pos_y = torch.linspace(-1, 1, H, device=x.device, dtype=x.dtype)

        # Flatten the spatial dimensions
        x_flat = x.view(B, C, -1)  # [B, C, H*W]

        # Softmax gives attention weights
        attention = F.softmax(x_flat / self.temperature, dim=-1)  # [B, C, H*W]

        # Broadcastable coordinate grids
        # pos_x: [W] -> [1, 1, 1, W] -> expand over H -> [1, 1, H*W]
        pos_x_grid = pos_x.view(1, 1, 1, W).expand(1, 1, H, W).reshape(1, 1, -1)
        pos_y_grid = pos_y.view(1, 1, H, 1).expand(1, 1, H, W).reshape(1, 1, -1)

        # Weighted sums give the expected coordinates
        expected_x = (attention * pos_x_grid).sum(dim=-1)  # [B, C]
        expected_y = (attention * pos_y_grid).sum(dim=-1)  # [B, C]

        # Concatenate x and y coordinates
        keypoints = torch.cat([expected_x, expected_y], dim=-1)  # [B, C * 2]

        return keypoints


def get_sinusoid_encoding_table(n_position, d_hid):
    """Build a sinusoidal position-encoding table."""
    def get_position_angle_vec(position):
        return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

    sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

    return torch.FloatTensor(sinusoid_table).unsqueeze(0)


class DDT(nn.Module):
    """DDT (Decoupled Diffusion Transformer) model.

    Combines visual backbones with the ActionDDT diffusion model to generate
    action sequences from image observations.

    Architecture:
    1. Backbone: extracts multi-camera image features
    2. Feature projection: projects image features into the hidden space (bottleneck)
    3. State encoding: encodes the robot joint state
    4. ActionDDT: diffusion Transformer that generates the action sequence

    Args:
        backbones: List of visual backbones (one per camera)
        state_dim: Robot state dimension
        action_dim: Action dimension
        num_queries: Length of the predicted action sequence
        camera_names: List of camera names
        hidden_dim: Transformer hidden dimension
        num_blocks: Number of Transformer blocks
        num_encoder_blocks: Number of encoder blocks
        num_heads: Number of attention heads
        num_kp: Number of Spatial Softmax keypoints (default 32)
    """

    def __init__(
        self,
        backbones,
        state_dim: int,
        action_dim: int,
        num_queries: int,
        camera_names: list,
        hidden_dim: int = 512,
        num_blocks: int = 12,
        num_encoder_blocks: int = 4,
        num_heads: int = 8,
        mlp_ratio: float = 4.0,
        num_kp: int = 32,
    ):
        super().__init__()

        self.num_queries = num_queries
        self.camera_names = camera_names
        self.hidden_dim = hidden_dim
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.num_kp = num_kp

        # Backbones
        self.backbones = nn.ModuleList(backbones)

        # Projection layer: ResNet channels -> num_kp. A bottleneck that
        # sharply reduces the number of feature channels.
        self.input_proj = nn.Conv2d(
            backbones[0].num_channels, num_kp, kernel_size=1
        )

        # State encoding (2-layer MLP, matching Diffusion Policy); the state
        # is still mapped to hidden_dim (512) to preserve information
        self.input_proj_robot_state = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

        # Image feature aggregation (SpatialSoftmax)
        # Input:  [B, num_kp, H, W]
        # Output: [B, num_kp * 2] (an (x, y) coordinate per channel)
        self.img_feature_proj = SpatialSoftmax(num_kp=num_kp)

        # Observation dimension: image features + state
        # Image part: keypoints * 2 (x, y) * number of cameras
        img_feature_dim = num_kp * 2 * len(camera_names)
        obs_dim = img_feature_dim + hidden_dim

        # ActionDDT diffusion model
        self.action_ddt = ActionDDT(
            action_dim=action_dim,
            obs_dim=obs_dim,  # the new, more compact dimension
            action_horizon=num_queries,
            hidden_size=hidden_dim,
            num_blocks=num_blocks,
            num_encoder_blocks=num_encoder_blocks,
            num_heads=num_heads,
            mlp_ratio=mlp_ratio,
        )

    def encode_observations(self, qpos, image):
        """Encode observations (images + state) into a conditioning vector.

        Args:
            qpos: Robot joint state [B, state_dim]
            image: Multi-camera images [B, num_cam, C, H, W]

        Returns:
            obs: Observation conditioning vector [B, obs_dim]
        """
        bs = qpos.shape[0]

        # Encode image features
        all_cam_features = []
        for cam_id, cam_name in enumerate(self.camera_names):
            features, pos = self.backbones[cam_id](image[:, cam_id])
            features = features[0]  # take the last feature level

            # input_proj compresses the channels down to num_kp
            features = self.input_proj(features)  # [B, num_kp, H', W']

            # SpatialSoftmax extracts num_kp keypoint coordinates
            features = self.img_feature_proj(features)  # [B, num_kp * 2]

            all_cam_features.append(features)

        # Concatenate features from all cameras
        img_features = torch.cat(all_cam_features, dim=-1)  # [B, num_kp * 2 * num_cam]

        # Encode the state
        qpos_features = self.input_proj_robot_state(qpos)  # [B, hidden_dim]

        # Concatenate the observation
        obs = torch.cat([img_features, qpos_features], dim=-1)  # [B, obs_dim]

        return obs

    def forward(
        self,
        qpos,
        image,
        env_state,
        actions=None,
        is_pad=None,
        timesteps=None,
    ):
        """Forward pass.

        Training:
            Given a noisy action sequence and timesteps, predict the noise/velocity field.
        Inference:
            Generate the action sequence via diffusion sampling.

        Args:
            qpos: Robot joint state [B, state_dim]
            image: Multi-camera images [B, num_cam, C, H, W]
            env_state: Environment state (unused)
            actions: Action sequence [B, T, action_dim] (noisy actions during training)
            is_pad: Padding flags [B, T] (unused)
            timesteps: Diffusion timesteps [B] (provided during training)

        Returns:
            Training: (noise_pred, encoder_state)
            Inference: (action_pred, encoder_state)
        """
        # 1. Encode observations
        obs = self.encode_observations(qpos, image)

        # 2. Diffusion model forward
        if actions is not None and timesteps is not None:
            # Training mode: predict the noise
            noise_pred, encoder_state = self.action_ddt(
                x=actions,
                t=timesteps,
                obs=obs,
            )
            return noise_pred, encoder_state
        else:
            # Inference mode: diffusion sampling happens at the Policy level;
            # return the encoded observation for the Policy to use
            return obs, None

    def get_obs_dim(self):
        """Return the dimension of the observation vector."""
        return self.num_kp * 2 * len(self.camera_names) + self.hidden_dim


def build(args):
    """Build the DDT model.

    Args:
        args: Configuration object containing
            - state_dim: State dimension
            - action_dim: Action dimension
            - camera_names: List of camera names
            - hidden_dim: Hidden dimension
            - num_queries: Action sequence length
            - num_blocks: Number of Transformer blocks
            - enc_layers: Number of encoder layers
            - nheads: Number of attention heads
            - num_kp: Number of keypoints (optional, default 32)

    Returns:
        model: DDT model instance
    """
    state_dim = args.state_dim
    action_dim = args.action_dim

    # Build one backbone per camera
    backbones = []
    for _ in args.camera_names:
        backbone = build_backbone(args)
        backbones.append(backbone)

    # Build the DDT model
    model = DDT(
        backbones=backbones,
        state_dim=state_dim,
        action_dim=action_dim,
        num_queries=args.num_queries,
        camera_names=args.camera_names,
        hidden_dim=args.hidden_dim,
        num_blocks=getattr(args, 'num_blocks', 12),
        num_encoder_blocks=getattr(args, 'enc_layers', 4),
        num_heads=args.nheads,
        mlp_ratio=getattr(args, 'mlp_ratio', 4.0),
        num_kp=getattr(args, 'num_kp', 32),
    )

    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("number of parameters: %.2fM" % (n_parameters / 1e6,))

    return model


def build_ddt(args):
    """Alias for build, kept for interface consistency."""
    return build(args)
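The observation dimension handed to ActionDDT follows directly from the construction above; for the defaults (`num_kp=32`, `hidden_dim=512`) and an illustrative two-camera setup:

```python
num_kp, hidden_dim = 32, 512
camera_names = ["top", "wrist"]  # illustrative camera list

# Image part: num_kp keypoints x (x, y) coordinates x number of cameras,
# then concatenated with the hidden_dim-sized robot-state embedding.
img_feature_dim = num_kp * 2 * len(camera_names)
obs_dim = img_feature_dim + hidden_dim
print(img_feature_dim, obs_dim)  # 128 640
```

This matches `DDT.get_obs_dim()` and keeps the conditioning vector far smaller than flattened feature maps would be.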
312	roboimi/ddt/models/transformer.py	Normal file
@@ -0,0 +1,312 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR Transformer class.

Copy-paste from torch.nn.Transformer with modifications:
    * positional encodings are passed in MHattention
    * extra LN at the end of encoder is removed
    * decoder returns a stack of activations from all decoding layers
"""
import copy
from typing import Optional, List

import torch
import torch.nn.functional as F
from torch import nn, Tensor


class Transformer(nn.Module):

    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed, latent_input=None, proprio_input=None, additional_pos_embed=None):
        # TODO flatten only when input has H and W
        if len(src.shape) == 4:  # has H and W
            # flatten NxCxHxW to HWxNxC
            bs, c, h, w = src.shape
            src = src.flatten(2).permute(2, 0, 1)
            pos_embed = pos_embed.flatten(2).permute(2, 0, 1).repeat(1, bs, 1)
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
            # mask = mask.flatten(1)

            additional_pos_embed = additional_pos_embed.unsqueeze(1).repeat(1, bs, 1)  # seq, bs, dim
            pos_embed = torch.cat([additional_pos_embed, pos_embed], dim=0)

            addition_input = torch.stack([latent_input, proprio_input], dim=0)
            src = torch.cat([addition_input, src], dim=0)
        else:
            assert len(src.shape) == 3
            # flatten NxHWxC to HWxNxC
            bs, hw, c = src.shape
            src = src.permute(1, 0, 2)
            pos_embed = pos_embed.unsqueeze(1).repeat(1, bs, 1)
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)

        tgt = torch.zeros_like(query_embed)
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)
        hs = hs.transpose(1, 2)
        return hs


class TransformerEncoder(nn.Module):

    def __init__(self, encoder_layer, num_layers, norm=None):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src

        for layer in self.layers:
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)

        if self.norm is not None:
            output = self.norm(output)

        return output


class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []
|
|
||||||
|
for layer in self.layers:
|
||||||
|
output = layer(output, memory, tgt_mask=tgt_mask,
|
||||||
|
memory_mask=memory_mask,
|
||||||
|
tgt_key_padding_mask=tgt_key_padding_mask,
|
||||||
|
memory_key_padding_mask=memory_key_padding_mask,
|
||||||
|
pos=pos, query_pos=query_pos)
|
||||||
|
if self.return_intermediate:
|
||||||
|
intermediate.append(self.norm(output))
|
||||||
|
|
||||||
|
if self.norm is not None:
|
||||||
|
output = self.norm(output)
|
||||||
|
if self.return_intermediate:
|
||||||
|
intermediate.pop()
|
||||||
|
intermediate.append(output)
|
||||||
|
|
||||||
|
if self.return_intermediate:
|
||||||
|
return torch.stack(intermediate)
|
||||||
|
|
||||||
|
return output.unsqueeze(0)
|
||||||
|
|
||||||
|
|
||||||
|
class TransformerEncoderLayer(nn.Module):
|
||||||
|
|
||||||
|
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
|
||||||
|
activation="relu", normalize_before=False):
|
||||||
|
super().__init__()
|
||||||
|
self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
|
||||||
|
# Implementation of Feedforward model
|
||||||
|
self.linear1 = nn.Linear(d_model, dim_feedforward)
|
||||||
|
self.dropout = nn.Dropout(dropout)
|
||||||
|
self.linear2 = nn.Linear(dim_feedforward, d_model)
|
||||||
|
|
||||||
|
self.norm1 = nn.LayerNorm(d_model)
|
||||||
|
self.norm2 = nn.LayerNorm(d_model)
|
||||||
|
self.dropout1 = nn.Dropout(dropout)
|
||||||
|
self.dropout2 = nn.Dropout(dropout)
|
||||||
|
|
||||||
|
self.activation = _get_activation_fn(activation)
|
||||||
|
self.normalize_before = normalize_before
|
||||||
|
|
||||||
|
def with_pos_embed(self, tensor, pos: Optional[Tensor]):
|
||||||
|
return tensor if pos is None else tensor + pos
|
||||||
|
|
||||||
|
def forward_post(self,
|
||||||
|
src,
|
||||||
|
src_mask: Optional[Tensor] = None,
|
||||||
|
src_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
pos: Optional[Tensor] = None):
|
||||||
|
q = k = self.with_pos_embed(src, pos)
|
||||||
|
src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
|
||||||
|
key_padding_mask=src_key_padding_mask)[0]
|
||||||
|
src = src + self.dropout1(src2)
|
||||||
|
src = self.norm1(src)
|
||||||
|
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
|
||||||
|
src = src + self.dropout2(src2)
|
||||||
|
src = self.norm2(src)
|
||||||
|
return src
|
||||||
|
|
||||||
|
def forward_pre(self, src,
|
||||||
|
src_mask: Optional[Tensor] = None,
|
||||||
|
src_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
pos: Optional[Tensor] = None):
|
||||||
|
src2 = self.norm1(src)
|
||||||
|
q = k = self.with_pos_embed(src2, pos)
|
||||||
|
src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
|
||||||
|
key_padding_mask=src_key_padding_mask)[0]
|
||||||
|
src = src + self.dropout1(src2)
|
||||||
|
src2 = self.norm2(src)
|
||||||
|
src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
|
||||||
|
src = src + self.dropout2(src2)
|
||||||
|
return src
|
||||||
|
|
||||||
|
def forward(self, src,
|
||||||
|
src_mask: Optional[Tensor] = None,
|
||||||
|
src_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
pos: Optional[Tensor] = None):
|
||||||
|
if self.normalize_before:
|
||||||
|
return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
|
||||||
|
return self.forward_post(src, src_mask, src_key_padding_mask, pos)
|
||||||
|
|
||||||
|
|
||||||
|
class TransformerDecoderLayer(nn.Module):
|
||||||
|
|
||||||
|
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
|
||||||
|
activation="relu", normalize_before=False):
|
||||||
|
super().__init__()
|
||||||
|
self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
|
||||||
|
self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
|
||||||
|
# Implementation of Feedforward model
|
||||||
|
self.linear1 = nn.Linear(d_model, dim_feedforward)
|
||||||
|
self.dropout = nn.Dropout(dropout)
|
||||||
|
self.linear2 = nn.Linear(dim_feedforward, d_model)
|
||||||
|
|
||||||
|
self.norm1 = nn.LayerNorm(d_model)
|
||||||
|
self.norm2 = nn.LayerNorm(d_model)
|
||||||
|
self.norm3 = nn.LayerNorm(d_model)
|
||||||
|
self.dropout1 = nn.Dropout(dropout)
|
||||||
|
self.dropout2 = nn.Dropout(dropout)
|
||||||
|
self.dropout3 = nn.Dropout(dropout)
|
||||||
|
|
||||||
|
self.activation = _get_activation_fn(activation)
|
||||||
|
self.normalize_before = normalize_before
|
||||||
|
|
||||||
|
def with_pos_embed(self, tensor, pos: Optional[Tensor]):
|
||||||
|
return tensor if pos is None else tensor + pos
|
||||||
|
|
||||||
|
def forward_post(self, tgt, memory,
|
||||||
|
tgt_mask: Optional[Tensor] = None,
|
||||||
|
memory_mask: Optional[Tensor] = None,
|
||||||
|
tgt_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
memory_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
pos: Optional[Tensor] = None,
|
||||||
|
query_pos: Optional[Tensor] = None):
|
||||||
|
q = k = self.with_pos_embed(tgt, query_pos)
|
||||||
|
tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
|
||||||
|
key_padding_mask=tgt_key_padding_mask)[0]
|
||||||
|
tgt = tgt + self.dropout1(tgt2)
|
||||||
|
tgt = self.norm1(tgt)
|
||||||
|
tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
|
||||||
|
key=self.with_pos_embed(memory, pos),
|
||||||
|
value=memory, attn_mask=memory_mask,
|
||||||
|
key_padding_mask=memory_key_padding_mask)[0]
|
||||||
|
tgt = tgt + self.dropout2(tgt2)
|
||||||
|
tgt = self.norm2(tgt)
|
||||||
|
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
|
||||||
|
tgt = tgt + self.dropout3(tgt2)
|
||||||
|
tgt = self.norm3(tgt)
|
||||||
|
return tgt
|
||||||
|
|
||||||
|
def forward_pre(self, tgt, memory,
|
||||||
|
tgt_mask: Optional[Tensor] = None,
|
||||||
|
memory_mask: Optional[Tensor] = None,
|
||||||
|
tgt_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
memory_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
pos: Optional[Tensor] = None,
|
||||||
|
query_pos: Optional[Tensor] = None):
|
||||||
|
tgt2 = self.norm1(tgt)
|
||||||
|
q = k = self.with_pos_embed(tgt2, query_pos)
|
||||||
|
tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
|
||||||
|
key_padding_mask=tgt_key_padding_mask)[0]
|
||||||
|
tgt = tgt + self.dropout1(tgt2)
|
||||||
|
tgt2 = self.norm2(tgt)
|
||||||
|
tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
|
||||||
|
key=self.with_pos_embed(memory, pos),
|
||||||
|
value=memory, attn_mask=memory_mask,
|
||||||
|
key_padding_mask=memory_key_padding_mask)[0]
|
||||||
|
tgt = tgt + self.dropout2(tgt2)
|
||||||
|
tgt2 = self.norm3(tgt)
|
||||||
|
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
|
||||||
|
tgt = tgt + self.dropout3(tgt2)
|
||||||
|
return tgt
|
||||||
|
|
||||||
|
def forward(self, tgt, memory,
|
||||||
|
tgt_mask: Optional[Tensor] = None,
|
||||||
|
memory_mask: Optional[Tensor] = None,
|
||||||
|
tgt_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
memory_key_padding_mask: Optional[Tensor] = None,
|
||||||
|
pos: Optional[Tensor] = None,
|
||||||
|
query_pos: Optional[Tensor] = None):
|
||||||
|
if self.normalize_before:
|
||||||
|
return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
|
||||||
|
tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
|
||||||
|
return self.forward_post(tgt, memory, tgt_mask, memory_mask,
|
||||||
|
tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
|
||||||
|
|
||||||
|
|
||||||
|
def _get_clones(module, N):
|
||||||
|
return nn.ModuleList([copy.deepcopy(module) for i in range(N)])
|
||||||
|
|
||||||
|
|
||||||
|
def build_transformer(args):
|
||||||
|
return Transformer(
|
||||||
|
d_model=args.hidden_dim,
|
||||||
|
dropout=args.dropout,
|
||||||
|
nhead=args.nheads,
|
||||||
|
dim_feedforward=args.dim_feedforward,
|
||||||
|
num_encoder_layers=args.enc_layers,
|
||||||
|
num_decoder_layers=args.dec_layers,
|
||||||
|
normalize_before=args.pre_norm,
|
||||||
|
return_intermediate_dec=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _get_activation_fn(activation):
|
||||||
|
"""Return an activation function given a string"""
|
||||||
|
if activation == "relu":
|
||||||
|
return F.relu
|
||||||
|
if activation == "gelu":
|
||||||
|
return F.gelu
|
||||||
|
if activation == "glu":
|
||||||
|
return F.glu
|
||||||
|
raise RuntimeError(F"activation should be relu/gelu, not {activation}.")
|
||||||
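The forward pass above reshapes image features from NxCxHxW into a (H*W)xNxC token sequence via `src.flatten(2).permute(2, 0, 1)`. As an illustrative, PyTorch-free sketch of that index mapping: element (n, c, h, w) ends up at sequence position h*W + w, batch index n, channel c.

```python
def flatten_nchw_to_hwnc(x):
    """Reorder a nested list of shape [N][C][H][W] into [H*W][N][C],
    mirroring src.flatten(2).permute(2, 0, 1) in the forward pass above."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[x[n][c][hw // W][hw % W] for c in range(C)] for n in range(N)]
            for hw in range(H * W)]

# toy tensor: each entry x[n][c][h][w] encodes its own indices
N, C, H, W = 2, 3, 2, 2
x = [[[[(n, c, h, w) for w in range(W)] for h in range(H)] for c in range(C)]
     for n in range(N)]
seq = flatten_nchw_to_hwnc(x)
# sequence position h*W + w holds the feature originally at (h, w)
assert seq[1 * W + 0][1][2] == (1, 2, 1, 0)
```

This is why `additional_pos_embed` and the latent/proprio tokens can simply be concatenated along dim 0: after the permute, dim 0 is the sequence dimension.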
147  roboimi/ddt/policy.py  Normal file
@@ -0,0 +1,147 @@
"""
DDT Policy - diffusion-model-based action generation policy.

Supports Flow Matching training and inference.
"""
import torch
import torch.nn as nn
from torch.nn import functional as F
import torchvision.transforms as transforms
from torchvision.transforms import v2
import math

from roboimi.ddt.main import build_DDT_model_and_optimizer


class DDTPolicy(nn.Module):
    """DDT (Decoupled Diffusion Transformer) policy.

    Trained with Flow Matching; inference runs multi-step diffusion sampling.
    Includes data augmentation and targets ViT backbones such as DINOv2.

    Args:
        args_override: configuration dict
            - num_inference_steps: number of diffusion steps at inference time
            - qpos_noise_std: qpos noise std (training-time data augmentation)
            - patch_h, patch_w: number of image patches (used to compute the target image size)
    """

    def __init__(self, args_override):
        super().__init__()
        model, optimizer = build_DDT_model_and_optimizer(args_override)
        self.model = model
        self.optimizer = optimizer

        self.num_inference_steps = args_override.get('num_inference_steps', 10)
        self.qpos_noise_std = args_override.get('qpos_noise_std', 0.0)

        # Image size configuration (matches DINOv2's 14x14 patches)
        self.patch_h = args_override.get('patch_h', 16)
        self.patch_w = args_override.get('patch_w', 22)

        print(f'DDT Policy: {self.num_inference_steps} steps, '
              f'image size ({self.patch_h*14}, {self.patch_w*14})')

    def __call__(self, qpos, image, actions=None, is_pad=None):
        """Forward pass.

        Training: computes the Flow Matching loss.
        Inference: generates actions via diffusion sampling.

        Args:
            qpos: robot joint state [B, state_dim]
            image: multi-camera images [B, num_cam, C, H, W]
            actions: target action sequence [B, T, action_dim] (training only)
            is_pad: padding mask [B, T]

        Returns:
            Training: loss_dict
            Inference: predicted action sequence [B, T, action_dim]
        """
        env_state = None

        # Image preprocessing
        if actions is not None:  # training: data augmentation
            transform = v2.Compose([
                v2.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
                v2.RandomPerspective(distortion_scale=0.5),
                v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
                v2.GaussianBlur(kernel_size=(9, 9), sigma=(0.1, 2.0)),
                v2.Resize((self.patch_h * 14, self.patch_w * 14)),
                v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
            ])
            if self.qpos_noise_std > 0:
                qpos = qpos + (self.qpos_noise_std ** 0.5) * torch.randn_like(qpos)
        else:  # inference
            transform = v2.Compose([
                v2.Resize((self.patch_h * 14, self.patch_w * 14)),
                v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
            ])

        image = transform(image)

        if actions is not None:
            actions = actions[:, :self.model.num_queries]
            is_pad = is_pad[:, :self.model.num_queries]
            loss_dict = self._compute_loss(qpos, image, actions, is_pad)
            return loss_dict
        else:
            a_hat = self._sample(qpos, image)
            return a_hat

    def _compute_loss(self, qpos, image, actions, is_pad):
        """Compute the Flow Matching loss.

        Flow Matching objective: learn the vector field from noise to data.
        Loss: ||v_theta(x_t, t) - (x_1 - x_0)||^2
        where x_t = (1-t)*x_0 + t*x_1, x_0 is noise, and x_1 is the target action.
        """
        B, T, action_dim = actions.shape
        device = actions.device

        t = torch.rand(B, device=device)
        noise = torch.randn_like(actions)

        t_expand = t.view(B, 1, 1).expand(B, T, action_dim)
        x_t = (1 - t_expand) * noise + t_expand * actions
        target_velocity = actions - noise

        pred_velocity, _ = self.model(
            qpos=qpos,
            image=image,
            env_state=None,
            actions=x_t,
            timesteps=t,
        )

        all_loss = F.mse_loss(pred_velocity, target_velocity, reduction='none')
        loss = (all_loss * ~is_pad.unsqueeze(-1)).mean()

        return {'flow_loss': loss, 'loss': loss}

    @torch.no_grad()
    def _sample(self, qpos, image):
        """Diffusion sampling by solving the probability-flow ODE.

        Integrates from t=0 to t=1 with the Euler method:
            x_{t+dt} = x_t + v_theta(x_t, t) * dt
        """
        B = qpos.shape[0]
        T = self.model.num_queries
        action_dim = self.model.action_dim
        device = qpos.device

        x = torch.randn(B, T, action_dim, device=device)
        obs = self.model.encode_observations(qpos, image)

        dt = 1.0 / self.num_inference_steps
        for i in range(self.num_inference_steps):
            t = torch.full((B,), i * dt, device=device)
            velocity, _ = self.model.action_ddt(x=x, t=t, obs=obs)
            x = x + velocity * dt

        return x

    def configure_optimizers(self):
        """Return the optimizer."""
        return self.optimizer
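The Flow Matching interpolant and the Euler sampler used in policy.py can be checked on a toy 1-D example. This is an illustrative pure-Python sketch, not the model itself: with x_t = (1-t)*x0 + t*x1, the target velocity x1 - x0 is constant in t, so Euler integration over [0, 1] with the ground-truth velocity lands on x1 for any step count.

```python
def euler_sample(x0, x1, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with the Euler method,
    mirroring the loop in DDTPolicy._sample; here the ground-truth
    velocity x1 - x0 stands in for the learned v_theta(x_t, t)."""
    x = x0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        velocity = x1 - x0
        x = x + velocity * dt
    return x

x0, x1 = -0.5, 2.0   # "noise" sample and data sample
for steps in (1, 10, 100):
    assert abs(euler_sample(x0, x1, steps) - x1) < 1e-9

# training target at any t: interpolate, regress the constant velocity
t = 0.3
x_t = (1 - t) * x0 + t * x1   # the point fed to the network during training
```

With a learned, imperfect velocity field the integration error does depend on `num_inference_steps`, which is why it is exposed as a config knob.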
1  roboimi/ddt/util/__init__.py  Normal file
@@ -0,0 +1 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
88  roboimi/ddt/util/box_ops.py  Normal file
@@ -0,0 +1,88 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Utilities for bounding box manipulation and GIoU.
"""
import torch
from torchvision.ops.boxes import box_area


def box_cxcywh_to_xyxy(x):
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
         (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=-1)


def box_xyxy_to_cxcywh(x):
    x0, y0, x1, y1 = x.unbind(-1)
    b = [(x0 + x1) / 2, (y0 + y1) / 2,
         (x1 - x0), (y1 - y0)]
    return torch.stack(b, dim=-1)


# modified from torchvision to also return the union
def box_iou(boxes1, boxes2):
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)

    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]

    wh = (rb - lt).clamp(min=0)  # [N,M,2]
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

    union = area1[:, None] + area2 - inter

    iou = inter / union
    return iou, union


def generalized_box_iou(boxes1, boxes2):
    """
    Generalized IoU from https://giou.stanford.edu/

    The boxes should be in [x0, y0, x1, y1] format

    Returns a [N, M] pairwise matrix, where N = len(boxes1)
    and M = len(boxes2)
    """
    # degenerate boxes give inf / nan results,
    # so do an early check
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
    iou, union = box_iou(boxes1, boxes2)

    lt = torch.min(boxes1[:, None, :2], boxes2[:, :2])
    rb = torch.max(boxes1[:, None, 2:], boxes2[:, 2:])

    wh = (rb - lt).clamp(min=0)  # [N,M,2]
    area = wh[:, :, 0] * wh[:, :, 1]

    return iou - (area - union) / area


def masks_to_boxes(masks):
    """Compute the bounding boxes around the provided masks.

    The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.

    Returns an [N, 4] tensor, with the boxes in xyxy format.
    """
    if masks.numel() == 0:
        return torch.zeros((0, 4), device=masks.device)

    h, w = masks.shape[-2:]

    y = torch.arange(0, h, dtype=torch.float)
    x = torch.arange(0, w, dtype=torch.float)
    y, x = torch.meshgrid(y, x)

    x_mask = (masks * x.unsqueeze(0))
    x_max = x_mask.flatten(1).max(-1)[0]
    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]

    y_mask = (masks * y.unsqueeze(0))
    y_max = y_mask.flatten(1).max(-1)[0]
    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]

    return torch.stack([x_min, y_min, x_max, y_max], 1)
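The math behind `box_iou` and `generalized_box_iou` can be sanity-checked for a single pair of boxes with plain Python (no batching, no tensors). A minimal sketch in xyxy format: boxes [0,0,2,2] and [1,1,3,3] overlap in a unit square, so IoU = 1/7, the enclosing box has area 9, and GIoU = 1/7 - 2/9.

```python
def iou_giou(b1, b2):
    """Scalar IoU and GIoU for two xyxy boxes, mirroring box_iou and
    generalized_box_iou above for a single pair."""
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    # intersection: max of mins vs min of maxes, clamped at zero overlap
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    union = area1 + area2 - inter
    iou = inter / union
    # smallest axis-aligned box enclosing both
    enclose = (max(b1[2], b2[2]) - min(b1[0], b2[0])) * \
              (max(b1[3], b2[3]) - min(b1[1], b2[1]))
    return iou, iou - (enclose - union) / enclose

iou, giou = iou_giou([0, 0, 2, 2], [1, 1, 3, 3])
assert abs(iou - 1 / 7) < 1e-9
assert abs(giou - (1 / 7 - 2 / 9)) < 1e-9
```

Unlike IoU, GIoU stays informative for disjoint boxes: it goes negative as the empty gap between them grows, which is what makes it usable as a regression loss.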
468  roboimi/ddt/util/misc.py  Normal file
@@ -0,0 +1,468 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Misc functions, including distributed helpers.

Mostly copy-paste from torchvision references.
"""
import os
import subprocess
import time
from collections import defaultdict, deque
import datetime
import pickle
from packaging import version
from typing import Optional, List

import torch
import torch.distributed as dist
from torch import Tensor

# needed due to empty tensor bug in pytorch and torchvision 0.5
import torchvision
if version.parse(torchvision.__version__) < version.parse('0.7'):
    from torchvision.ops import _new_empty_tensor
    from torchvision.ops.misc import _output_size


class SmoothedValue(object):
    """Track a series of values and provide access to smoothed values over a
    window or the global series average.
    """

    def __init__(self, window_size=20, fmt=None):
        if fmt is None:
            fmt = "{median:.4f} ({global_avg:.4f})"
        self.deque = deque(maxlen=window_size)
        self.total = 0.0
        self.count = 0
        self.fmt = fmt

    def update(self, value, n=1):
        self.deque.append(value)
        self.count += n
        self.total += value * n

    def synchronize_between_processes(self):
        """
        Warning: does not synchronize the deque!
        """
        if not is_dist_avail_and_initialized():
            return
        t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
        dist.barrier()
        dist.all_reduce(t)
        t = t.tolist()
        self.count = int(t[0])
        self.total = t[1]

    @property
    def median(self):
        d = torch.tensor(list(self.deque))
        return d.median().item()

    @property
    def avg(self):
        d = torch.tensor(list(self.deque), dtype=torch.float32)
        return d.mean().item()

    @property
    def global_avg(self):
        return self.total / self.count

    @property
    def max(self):
        return max(self.deque)

    @property
    def value(self):
        return self.deque[-1]

    def __str__(self):
        return self.fmt.format(
            median=self.median,
            avg=self.avg,
            global_avg=self.global_avg,
            max=self.max,
            value=self.value)
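SmoothedValue keeps two views of a metric stream: a bounded deque for windowed statistics (median, average) and running totals for the global average over everything ever logged. A stdlib-only sketch of the same bookkeeping, illustrative rather than a drop-in replacement:

```python
from collections import deque
from statistics import median

class MiniSmoothedValue:
    """Stdlib analogue of SmoothedValue above: windowed median plus a
    global average over all values ever seen."""
    def __init__(self, window_size=3):
        self.window = deque(maxlen=window_size)  # old values fall off the left
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.window.append(value)
        self.count += n
        self.total += value * n

    @property
    def median(self):
        return median(self.window)

    @property
    def global_avg(self):
        return self.total / self.count

m = MiniSmoothedValue(window_size=3)
for v in (1.0, 2.0, 9.0, 4.0):
    m.update(v)
# the window holds only the last three values (2.0, 9.0, 4.0)
assert m.median == 4.0
assert m.global_avg == 4.0   # (1 + 2 + 9 + 4) / 4
```

The split matters for logging: the windowed median is robust to per-iteration spikes, while the global average (used for the ETA in `log_every`) reflects the whole run.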
def all_gather(data):
    """
    Run all_gather on arbitrary picklable data (not necessarily tensors)
    Args:
        data: any picklable object
    Returns:
        list[data]: list of data gathered from each rank
    """
    world_size = get_world_size()
    if world_size == 1:
        return [data]

    # serialized to a Tensor
    buffer = pickle.dumps(data)
    storage = torch.ByteStorage.from_buffer(buffer)
    tensor = torch.ByteTensor(storage).to("cuda")

    # obtain Tensor size of each rank
    local_size = torch.tensor([tensor.numel()], device="cuda")
    size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
    dist.all_gather(size_list, local_size)
    size_list = [int(size.item()) for size in size_list]
    max_size = max(size_list)

    # receiving Tensor from all ranks
    # we pad the tensor because torch all_gather does not support
    # gathering tensors of different shapes
    tensor_list = []
    for _ in size_list:
        tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
    if local_size != max_size:
        padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
        tensor = torch.cat((tensor, padding), dim=0)
    dist.all_gather(tensor_list, tensor)

    data_list = []
    for size, tensor in zip(size_list, tensor_list):
        buffer = tensor.cpu().numpy().tobytes()[:size]
        data_list.append(pickle.loads(buffer))

    return data_list


def reduce_dict(input_dict, average=True):
    """
    Args:
        input_dict (dict): all the values will be reduced
        average (bool): whether to do average or sum
    Reduce the values in the dictionary from all processes so that all processes
    have the averaged results. Returns a dict with the same fields as
    input_dict, after reduction.
    """
    world_size = get_world_size()
    if world_size < 2:
        return input_dict
    with torch.no_grad():
        names = []
        values = []
        # sort the keys so that they are consistent across processes
        for k in sorted(input_dict.keys()):
            names.append(k)
            values.append(input_dict[k])
        values = torch.stack(values, dim=0)
        dist.all_reduce(values)
        if average:
            values /= world_size
        reduced_dict = {k: v for k, v in zip(names, values)}
    return reduced_dict


class MetricLogger(object):
    def __init__(self, delimiter="\t"):
        self.meters = defaultdict(SmoothedValue)
        self.delimiter = delimiter

    def update(self, **kwargs):
        for k, v in kwargs.items():
            if isinstance(v, torch.Tensor):
                v = v.item()
            assert isinstance(v, (float, int))
            self.meters[k].update(v)

    def __getattr__(self, attr):
        if attr in self.meters:
            return self.meters[attr]
        if attr in self.__dict__:
            return self.__dict__[attr]
        raise AttributeError("'{}' object has no attribute '{}'".format(
            type(self).__name__, attr))

    def __str__(self):
        loss_str = []
        for name, meter in self.meters.items():
            loss_str.append(
                "{}: {}".format(name, str(meter))
            )
        return self.delimiter.join(loss_str)

    def synchronize_between_processes(self):
        for meter in self.meters.values():
            meter.synchronize_between_processes()

    def add_meter(self, name, meter):
        self.meters[name] = meter

    def log_every(self, iterable, print_freq, header=None):
        i = 0
        if not header:
            header = ''
        start_time = time.time()
        end = time.time()
        iter_time = SmoothedValue(fmt='{avg:.4f}')
        data_time = SmoothedValue(fmt='{avg:.4f}')
        space_fmt = ':' + str(len(str(len(iterable)))) + 'd'
        if torch.cuda.is_available():
            log_msg = self.delimiter.join([
                header,
                '[{0' + space_fmt + '}/{1}]',
                'eta: {eta}',
                '{meters}',
                'time: {time}',
                'data: {data}',
                'max mem: {memory:.0f}'
            ])
        else:
            log_msg = self.delimiter.join([
                header,
                '[{0' + space_fmt + '}/{1}]',
                'eta: {eta}',
                '{meters}',
                'time: {time}',
                'data: {data}'
            ])
        MB = 1024.0 * 1024.0
        for obj in iterable:
            data_time.update(time.time() - end)
            yield obj
            iter_time.update(time.time() - end)
            if i % print_freq == 0 or i == len(iterable) - 1:
                eta_seconds = iter_time.global_avg * (len(iterable) - i)
                eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
                if torch.cuda.is_available():
                    print(log_msg.format(
                        i, len(iterable), eta=eta_string,
                        meters=str(self),
                        time=str(iter_time), data=str(data_time),
                        memory=torch.cuda.max_memory_allocated() / MB))
                else:
                    print(log_msg.format(
                        i, len(iterable), eta=eta_string,
                        meters=str(self),
                        time=str(iter_time), data=str(data_time)))
            i += 1
            end = time.time()
        total_time = time.time() - start_time
        total_time_str = str(datetime.timedelta(seconds=int(total_time)))
        print('{} Total time: {} ({:.4f} s / it)'.format(
            header, total_time_str, total_time / len(iterable)))


def get_sha():
    cwd = os.path.dirname(os.path.abspath(__file__))

    def _run(command):
        return subprocess.check_output(command, cwd=cwd).decode('ascii').strip()
    sha = 'N/A'
    diff = "clean"
    branch = 'N/A'
    try:
        sha = _run(['git', 'rev-parse', 'HEAD'])
        subprocess.check_output(['git', 'diff'], cwd=cwd)
        diff = _run(['git', 'diff-index', 'HEAD'])
        diff = "has uncommitted changes" if diff else "clean"
        branch = _run(['git', 'rev-parse', '--abbrev-ref', 'HEAD'])
    except Exception:
        pass
    message = f"sha: {sha}, status: {diff}, branch: {branch}"
    return message


def collate_fn(batch):
    batch = list(zip(*batch))
    batch[0] = nested_tensor_from_tensor_list(batch[0])
    return tuple(batch)


def _max_by_axis(the_list):
    # type: (List[List[int]]) -> List[int]
    maxes = the_list[0]
    for sublist in the_list[1:]:
        for index, item in enumerate(sublist):
            maxes[index] = max(maxes[index], item)
    return maxes


class NestedTensor(object):
    def __init__(self, tensors, mask: Optional[Tensor]):
        self.tensors = tensors
        self.mask = mask

    def to(self, device):
        # type: (Device) -> NestedTensor  # noqa
        cast_tensor = self.tensors.to(device)
        mask = self.mask
        if mask is not None:
            assert mask is not None
            cast_mask = mask.to(device)
        else:
            cast_mask = None
        return NestedTensor(cast_tensor, cast_mask)

    def decompose(self):
        return self.tensors, self.mask

    def __repr__(self):
        return str(self.tensors)


def nested_tensor_from_tensor_list(tensor_list: List[Tensor]):
    # TODO make this more general
    if tensor_list[0].ndim == 3:
        if torchvision._is_tracing():
            # nested_tensor_from_tensor_list() does not export well to ONNX
            # call _onnx_nested_tensor_from_tensor_list() instead
            return _onnx_nested_tensor_from_tensor_list(tensor_list)

        # TODO make it support different-sized images
        max_size = _max_by_axis([list(img.shape) for img in tensor_list])
        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))
        batch_shape = [len(tensor_list)] + max_size
        b, c, h, w = batch_shape
        dtype = tensor_list[0].dtype
        device = tensor_list[0].device
        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)
        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)
        for img, pad_img, m in zip(tensor_list, tensor, mask):
            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
            m[: img.shape[1], :img.shape[2]] = False
    else:
        raise ValueError('not supported')
    return NestedTensor(tensor, mask)


# _onnx_nested_tensor_from_tensor_list() is an implementation of
# nested_tensor_from_tensor_list() that is supported by ONNX tracing.
@torch.jit.unused
def _onnx_nested_tensor_from_tensor_list(tensor_list: List[Tensor]) -> NestedTensor:
    max_size = []
    for i in range(tensor_list[0].dim()):
        max_size_i = torch.max(torch.stack([img.shape[i] for img in tensor_list]).to(torch.float32)).to(torch.int64)
        max_size.append(max_size_i)
    max_size = tuple(max_size)

    # work around for
    # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
    # m[: img.shape[1], :img.shape[2]] = False
    # which is not yet supported in onnx
    padded_imgs = []
    padded_masks = []
    for img in tensor_list:
        padding = [(s1 - s2) for s1, s2 in zip(max_size, tuple(img.shape))]
        padded_img = torch.nn.functional.pad(img, (0, padding[2], 0, padding[1], 0, padding[0]))
        padded_imgs.append(padded_img)

        m = torch.zeros_like(img[0], dtype=torch.int, device=img.device)
        padded_mask = torch.nn.functional.pad(m, (0, padding[2], 0, padding[1]), "constant", 1)
        padded_masks.append(padded_mask.to(torch.bool))

    tensor = torch.stack(padded_imgs)
    mask = torch.stack(padded_masks)

    return NestedTensor(tensor, mask=mask)


def setup_for_distributed(is_master):
    """
    This function disables printing when not in the master process
    """
    import builtins as __builtin__
    builtin_print = __builtin__.print

    def print(*args, **kwargs):
        force = kwargs.pop('force', False)
        if is_master or force:
            builtin_print(*args, **kwargs)

    __builtin__.print = print


def is_dist_avail_and_initialized():
    if not dist.is_available():
        return False
    if not dist.is_initialized():
        return False
    return True


def get_world_size():
    if not is_dist_avail_and_initialized():
|
||||||
|
return 1
|
||||||
|
return dist.get_world_size()
|
||||||
|
|
||||||
|
|
||||||
|
def get_rank():
|
||||||
|
if not is_dist_avail_and_initialized():
|
||||||
|
return 0
|
||||||
|
return dist.get_rank()
|
||||||
|
|
||||||
|
|
||||||
|
def is_main_process():
|
||||||
|
return get_rank() == 0
|
||||||
|
|
||||||
|
|
||||||
|
def save_on_master(*args, **kwargs):
|
||||||
|
if is_main_process():
|
||||||
|
torch.save(*args, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
def init_distributed_mode(args):
|
||||||
|
if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
|
||||||
|
args.rank = int(os.environ["RANK"])
|
||||||
|
args.world_size = int(os.environ['WORLD_SIZE'])
|
||||||
|
args.gpu = int(os.environ['LOCAL_RANK'])
|
||||||
|
elif 'SLURM_PROCID' in os.environ:
|
||||||
|
args.rank = int(os.environ['SLURM_PROCID'])
|
||||||
|
args.gpu = args.rank % torch.cuda.device_count()
|
||||||
|
else:
|
||||||
|
print('Not using distributed mode')
|
||||||
|
args.distributed = False
|
||||||
|
return
|
||||||
|
|
||||||
|
args.distributed = True
|
||||||
|
|
||||||
|
torch.cuda.set_device(args.gpu)
|
||||||
|
args.dist_backend = 'nccl'
|
||||||
|
print('| distributed init (rank {}): {}'.format(
|
||||||
|
args.rank, args.dist_url), flush=True)
|
||||||
|
torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
|
||||||
|
world_size=args.world_size, rank=args.rank)
|
||||||
|
torch.distributed.barrier()
|
||||||
|
setup_for_distributed(args.rank == 0)
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def accuracy(output, target, topk=(1,)):
|
||||||
|
"""Computes the precision@k for the specified values of k"""
|
||||||
|
if target.numel() == 0:
|
||||||
|
return [torch.zeros([], device=output.device)]
|
||||||
|
maxk = max(topk)
|
||||||
|
batch_size = target.size(0)
|
||||||
|
|
||||||
|
_, pred = output.topk(maxk, 1, True, True)
|
||||||
|
pred = pred.t()
|
||||||
|
correct = pred.eq(target.view(1, -1).expand_as(pred))
|
||||||
|
|
||||||
|
res = []
|
||||||
|
for k in topk:
|
||||||
|
correct_k = correct[:k].view(-1).float().sum(0)
|
||||||
|
res.append(correct_k.mul_(100.0 / batch_size))
|
||||||
|
return res
|
||||||
|
|
||||||
|
|
||||||
|
def interpolate(input, size=None, scale_factor=None, mode="nearest", align_corners=None):
|
||||||
|
# type: (Tensor, Optional[List[int]], Optional[float], str, Optional[bool]) -> Tensor
|
||||||
|
"""
|
||||||
|
Equivalent to nn.functional.interpolate, but with support for empty batch sizes.
|
||||||
|
This will eventually be supported natively by PyTorch, and this
|
||||||
|
class can go away.
|
||||||
|
"""
|
||||||
|
if version.parse(torchvision.__version__) < version.parse('0.7'):
|
||||||
|
if input.numel() > 0:
|
||||||
|
return torch.nn.functional.interpolate(
|
||||||
|
input, size, scale_factor, mode, align_corners
|
||||||
|
)
|
||||||
|
|
||||||
|
output_shape = _output_size(2, input, size, scale_factor)
|
||||||
|
output_shape = list(input.shape[:-2]) + list(output_shape)
|
||||||
|
return _new_empty_tensor(input, output_shape)
|
||||||
|
else:
|
||||||
|
return torchvision.ops.misc.interpolate(input, size, scale_factor, mode, align_corners)
|
||||||
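The padding logic in `nested_tensor_from_tensor_list` above batches variable-sized images by padding each one to the per-axis maximum and recording a boolean mask over the padded region. A minimal dependency-free sketch of just the shape arithmetic (`max_by_axis` mirrors the `_max_by_axis` helper referenced above; the two example shapes are hypothetical):

```python
def max_by_axis(shapes):
    # element-wise maximum over a list of [C, H, W] shapes,
    # mirroring the _max_by_axis helper used above
    maxes = list(shapes[0])
    for shape in shapes[1:]:
        for i, s in enumerate(shape):
            maxes[i] = max(maxes[i], s)
    return maxes


def batch_shape(shapes):
    # padded batch shape [B, C, H_max, W_max] for a list of image shapes
    return [len(shapes)] + max_by_axis(shapes)


shapes = [[3, 480, 640], [3, 400, 720]]
print(batch_shape(shapes))  # [2, 3, 480, 720]
```

Every image is then copied into the top-left corner of its padded slot, and the mask is `False` exactly where real pixels live.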
107  roboimi/ddt/util/plot_utils.py  Normal file
@@ -0,0 +1,107 @@
"""
Plotting utilities to visualize training logs.
"""
import torch
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path, PurePath


def plot_logs(logs, fields=('class_error', 'loss_bbox_unscaled', 'mAP'), ewm_col=0, log_name='log.txt'):
    '''
    Function to plot specific fields from training log(s). Plots both training and test results.

    :: Inputs - logs = list containing Path objects, each pointing to individual dir with a log file
              - fields = which results to plot from each log file - plots both training and test for each field.
              - ewm_col = optional, which column to use as the exponential weighted smoothing of the plots
              - log_name = optional, name of log file if different than default 'log.txt'.

    :: Outputs - matplotlib plots of results in fields, color coded for each log file.
               - solid lines are training results, dashed lines are test results.

    '''
    func_name = "plot_utils.py::plot_logs"

    # verify logs is a list of Paths (list[Paths]) or single Pathlib object Path,
    # convert single Path to list to avoid 'not iterable' error

    if not isinstance(logs, list):
        if isinstance(logs, PurePath):
            logs = [logs]
            print(f"{func_name} info: logs param expects a list argument, converted to list[Path].")
        else:
            raise ValueError(f"{func_name} - invalid argument for logs parameter.\n \
            Expect list[Path] or single Path obj, received {type(logs)}")

    # Quality checks - verify valid dir(s), that every item in list is Path object, and that log_name exists in each dir
    for i, dir in enumerate(logs):
        if not isinstance(dir, PurePath):
            raise ValueError(f"{func_name} - non-Path object in logs argument of {type(dir)}: \n{dir}")
        if not dir.exists():
            raise ValueError(f"{func_name} - invalid directory in logs argument:\n{dir}")
        # verify log_name exists
        fn = Path(dir / log_name)
        if not fn.exists():
            print(f"-> missing {log_name}. Have you gotten to Epoch 1 in training?")
            print(f"--> full path of missing log file: {fn}")
            return

    # load log file(s) and plot
    dfs = [pd.read_json(Path(p) / log_name, lines=True) for p in logs]

    fig, axs = plt.subplots(ncols=len(fields), figsize=(16, 5))

    for df, color in zip(dfs, sns.color_palette(n_colors=len(logs))):
        for j, field in enumerate(fields):
            if field == 'mAP':
                coco_eval = pd.DataFrame(
                    np.stack(df.test_coco_eval_bbox.dropna().values)[:, 1]
                ).ewm(com=ewm_col).mean()
                axs[j].plot(coco_eval, c=color)
            else:
                df.interpolate().ewm(com=ewm_col).mean().plot(
                    y=[f'train_{field}', f'test_{field}'],
                    ax=axs[j],
                    color=[color] * 2,
                    style=['-', '--']
                )
    for ax, field in zip(axs, fields):
        ax.legend([Path(p).name for p in logs])
        ax.set_title(field)


def plot_precision_recall(files, naming_scheme='iter'):
    if naming_scheme == 'exp_id':
        # name becomes exp_id
        names = [f.parts[-3] for f in files]
    elif naming_scheme == 'iter':
        names = [f.stem for f in files]
    else:
        raise ValueError(f'not supported {naming_scheme}')
    fig, axs = plt.subplots(ncols=2, figsize=(16, 5))
    for f, color, name in zip(files, sns.color_palette("Blues", n_colors=len(files)), names):
        data = torch.load(f)
        # precision is n_iou, n_points, n_cat, n_area, max_det
        precision = data['precision']
        recall = data['params'].recThrs
        scores = data['scores']
        # take precision for all classes, all areas and 100 detections
        precision = precision[0, :, :, 0, -1].mean(1)
        scores = scores[0, :, :, 0, -1].mean(1)
        prec = precision.mean()
        rec = data['recall'][0, :, 0, -1].mean()
        print(f'{naming_scheme} {name}: mAP@50={prec * 100: 05.1f}, ' +
              f'score={scores.mean():0.3f}, ' +
              f'f1={2 * prec * rec / (prec + rec + 1e-8):0.3f}'
              )
        axs[0].plot(recall, precision, c=color)
        axs[1].plot(recall, scores, c=color)

    axs[0].set_title('Precision / Recall')
    axs[0].legend(names)
    axs[1].set_title('Scores / Recall')
    axs[1].legend(names)
    return fig, axs
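`plot_logs` smooths each curve with pandas' `.ewm(com=ewm_col).mean()`. A dependency-free sketch of that smoothing using the `adjust=False` recursion (pandas defaults to `adjust=True`, so values differ slightly for `com > 0`; with the default `com=0` both reduce to the raw series):

```python
def ewm_mean(values, com=0):
    # exponentially weighted mean with pandas-style "center of mass":
    # alpha = 1 / (1 + com); larger com means heavier smoothing
    alpha = 1.0 / (1.0 + com)
    out, avg = [], None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        out.append(avg)
    return out


print(ewm_mean([1.0, 2.0, 3.0], com=0))  # [1.0, 2.0, 3.0] (no smoothing)
print(ewm_mean([1.0, 2.0], com=1))       # [1.0, 1.5]
```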
@@ -8,7 +8,8 @@ temporal_agg: false

 # policy_class: "ACT"
 # backbone: 'resnet18'
-policy_class: "GR00T"
+policy_class: "ACTTV"
+# policy_class: "DDT"
 backbone: 'dino_v2'

 seed: 0
@@ -38,13 +39,8 @@ episode_len: # leave empty here by default
 camera_names: [] # leave empty here by default
 xml_dir: # leave empty here by default

-# action smoothing settings (for GR00T)
-use_action_smoothing: true
-smooth_method: "ema" # Options: "ema", "moving_avg", "lowpass", "none"
-smooth_alpha: 0.3 # Smoothing factor (0-1), smaller = smoother
-
 # transformer settings
-batch_size: 10
+batch_size: 32
 state_dim: 16
 action_dim: 16
 lr_backbone: 0.00001
@@ -56,25 +52,15 @@ nheads: 8
 qpos_noise_std: 0
 DT: 0.02

-gr00t:
-  action_dim: 16
-  state_dim: 16
-  embed_dim: 1536
-  hidden_dim: 1024
-  num_queries: 8
-
-  nheads: 32
-  mlp_ratio: 4
-  dropout: 0.2
-
-  num_layers: 16
-
 # DO NOT CHANGE IF UNNECESSARY
 lr: 0.00001
 kl_weight: 100
 chunk_size: 10
 hidden_dim: 512
 dim_feedforward: 3200

+# DDT-specific parameters
+num_blocks: 12 # number of Transformer blocks
+mlp_ratio: 4.0 # MLP dimension ratio
+num_inference_steps: 10 # number of diffusion inference steps
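The removed block above configured EMA action smoothing (`smooth_method: "ema"`, `smooth_alpha: 0.3`, smaller alpha meaning smoother output). A hypothetical sketch of such a smoother over successive action vectors, not the repository's implementation:

```python
class EMASmoother:
    # exponential moving average over successive action vectors;
    # alpha plays the role of smooth_alpha (smaller = smoother)
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.state = None

    def __call__(self, action):
        if self.state is None:
            # first action passes through unchanged
            self.state = list(action)
        else:
            self.state = [self.alpha * a + (1 - self.alpha) * s
                          for a, s in zip(action, self.state)]
        return self.state


smooth = EMASmoother(alpha=0.5)
print(smooth([0.0, 1.0]))  # [0.0, 1.0]
print(smooth([1.0, 1.0]))  # [0.5, 1.0]
```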
119  roboimi/demos/diana_eval.py  Normal file
@@ -0,0 +1,119 @@
import torch
import os
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from einops import rearrange
from roboimi.utils.utils import set_seed
from roboimi.utils.io_utils import IOUtils
from roboimi.utils.model_interface import ModelInterface
from roboimi.envs.double_pos_ctrl_env import make_sim_env
# from visualize_episodes import save_videos
from roboimi.utils.act_ex_utils import sample_transfer_pose


# should be added into IOUtils
def get_image(obs, camera_names):
    curr_images = []
    for cam_name in camera_names:
        curr_image = rearrange(obs['images'][cam_name], 'h w c -> c h w')
        curr_images.append(curr_image)
    curr_image = np.stack(curr_images, axis=0)
    curr_image = torch.from_numpy(curr_image / 255.0).float().cuda().unsqueeze(0)
    return curr_image


def eval_bc(config, ckpt_name='policy_best.ckpt', save_episode=True):
    set_seed(1)
    model_interface = ModelInterface(config)
    model_interface.setup()
    policy = IOUtils.load_policy(config, ckpt_name)
    stats = IOUtils.load_stats(config['ckpt_dir'])
    num_rollouts = 3
    episode_returns = []
    highest_rewards = []

    run_episode(config, policy, stats,
                save_episode, num_rollouts)
    # episode_return, episode_highest_reward = run_episode(config, policy, stats,
    #                                                      save_episode, num_rollouts)


def run_episode(config, policy, stats, save_episode, num_rollouts):

    if 'sim_transfer' in config['task_name']:
        task_name = 'sim_transfer'  # config['task_name']
        env = make_sim_env(task_name)

    max_timesteps = config['episode_len']
    max_timesteps = int(max_timesteps * 1)
    pre_process = lambda s_qpos: (s_qpos - stats['qpos_mean']) / stats['qpos_std']
    post_process = lambda a: a * stats['action_std'] + stats['action_mean']
    box_pos = sample_transfer_pose()
    for rollout_id in range(num_rollouts):
        print("\nrollout_id===", rollout_id, "\n")
        image_list = []
        rewards = []
        query_frequency = config['policy_config'].get('num_queries', 1)
        print("query_freq =====", query_frequency)
        env.reset(box_pos)
        with torch.inference_mode():
            for t in range(700):
                image_list.append(env._get_image_obs()['images'] if 'images' in env._get_image_obs() else {print("img error")})
                qpos_numpy = np.array(env._get_qpos_obs()['qpos'])
                qpos = pre_process(qpos_numpy)
                qpos = torch.from_numpy(qpos).float().cuda().unsqueeze(0)
                curr_image = get_image(env._get_image_obs(), config['camera_names'])
                if config['policy_class'] in ("ACT", "ACTTV"):
                    if t % query_frequency == 0:
                        all_actions = policy(qpos, curr_image)
                    raw_action = all_actions[:, t % query_frequency]
                    # raw_action = all_actions[:, t % 1]
                    raw_action = raw_action.squeeze(0).cpu().numpy()
                elif config['policy_class'] == "CNNMLP":
                    raw_action = policy(qpos, curr_image)
                else:
                    raise NotImplementedError

                action = post_process(raw_action)
                print("action == ", action)
                env.step_jnt(action)
                rewards.append(env.rew)
                env.render()

        rewards = np.array(rewards)
        # episode_return = np.sum(rewards[rewards != None])
        # episode_highest_reward = np.max(rewards)
        # env.viewer.close()

    # del env
    # return episode_return, episode_highest_reward


def test_env():
    try:
        env = make_sim_env('sim_transfer')
        env.reset()
        while True: pass
    except KeyboardInterrupt:
        del env
        print("stop")


if __name__ == '__main__':
    # test_env()
    io_utils = IOUtils()
    config = io_utils.load_config()
    eval_bc(config)
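The rollout loop above queries the policy only when `t % query_frequency == 0` and then replays one action per step from the returned chunk. A small sketch of that schedule (`chunked_policy_calls` is a hypothetical helper for illustration):

```python
def chunked_policy_calls(total_steps, query_frequency):
    # which timesteps trigger a fresh policy query, and which index of
    # the current action chunk is executed at every step
    calls, indices = [], []
    for t in range(total_steps):
        if t % query_frequency == 0:
            calls.append(t)
        indices.append(t % query_frequency)
    return calls, indices


calls, idx = chunked_policy_calls(6, 3)
print(calls)  # [0, 3]
print(idx)    # [0, 1, 2, 0, 1, 2]
```

With `query_frequency = 1` (the default when `num_queries` is absent), the policy is queried at every step and only the first action of each chunk is used.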
@@ -104,8 +104,8 @@ class TestPickAndTransferPolicy(PolicyBase):
 {"t": 1, "xyz": init_mocap_pose_right[:3], "quat": init_mocap_pose_right[3:], "gripper": -100}, # sleep
 {"t": 75, "xyz": np.array([(0.8+box_xyz[0])*0.5,(1.0+box_xyz[1])*0.5,init_mocap_pose_right[2]]), "quat": gripper_approach_quat.elements, "gripper": 100},
 {"t": 225, "xyz": box_xyz + np.array([0, 0, 0.3]), "quat": gripper_pick_quat.elements, "gripper": 100}, # approach the cube
-{"t": 275, "xyz": box_xyz + np.array([0, 0, 0.11]), "quat": gripper_pick_quat.elements, "gripper": 100}, # go down
-{"t": 280, "xyz": box_xyz + np.array([0, 0, 0.11]), "quat": gripper_pick_quat.elements, "gripper": -100}, # close gripper
+{"t": 275, "xyz": box_xyz + np.array([0, 0, 0.12]), "quat": gripper_pick_quat.elements, "gripper": 100}, # go down
+{"t": 280, "xyz": box_xyz + np.array([0, 0, 0.12]), "quat": gripper_pick_quat.elements, "gripper": -100}, # close gripper
 {"t": 450, "xyz": init_mocap_pose_right[:3], "quat": init_mocap_pose_right[3:], "gripper": -100}, # approach wait position
 {"t": 500, "xyz": meet_xyz + np.array([0.1, 0, 0.0]), "quat": meet_right_quat.elements, "gripper": -100}, # approach meet position
 {"t": 510, "xyz": meet_xyz + np.array([0.1, 0, 0.0]), "quat": meet_right_quat.elements, "gripper": 100}, # open gripper
@@ -116,8 +116,8 @@ class TestPickAndTransferPolicy(PolicyBase):
 self.left_trajectory = [
 {"t": 1, "xyz": init_mocap_pose_left[:3], "quat": init_mocap_pose_left[3:], "gripper": -100}, # sleep
 {"t": 250, "xyz": meet_xyz + np.array([-0.5, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": 100}, # approach meet position
-{"t": 500, "xyz": meet_xyz + np.array([-0.14, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": 100}, # move to meet position
-{"t": 505, "xyz": meet_xyz + np.array([-0.14, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100}, # close gripper
+{"t": 500, "xyz": meet_xyz + np.array([-0.15, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": 100}, # move to meet position
+{"t": 505, "xyz": meet_xyz + np.array([-0.15, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100}, # close gripper
 {"t": 675, "xyz": meet_xyz + np.array([-0.3, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100}, # move left
 {"t": 700, "xyz": meet_xyz + np.array([-0.3, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100}, # stay
 ]
@@ -1,11 +1,11 @@
 import time
-import os
+import os,collections,sys
 import numpy as np
+import h5py
 from roboimi.envs.double_pos_ctrl_env import make_sim_env
 from diana_policy import TestPickAndTransferPolicy
 import cv2
 from roboimi.utils.act_ex_utils import sample_transfer_pose
-from roboimi.utils.streaming_episode_writer import StreamingEpisodeWriter

 import pathlib
 HOME_PATH = str(pathlib.Path(__file__).parent.resolve())
@@ -16,12 +16,14 @@ def main():
     task_name = 'sim_transfer'
     dataset_dir = DATASET_DIR + '/sim_transfer' #SIM_TASK_CONFIGS[task_name]['dataset_dir']
     num_episodes = 100 #SIM_TASK_CONFIGS[task_name]['num_episodes']
+    onscreen_render = None #config['onscreen_render']
     inject_noise = False
+    render_cam_name = 'angle'

     episode_len = 700 #SIM_TASK_CONFIGS[task_name]['episode_len']
-    camera_names = ['angle','r_vis', 'top', 'front'] #SIM_TASK_CONFIGS[task_name]['camera_names']
-    image_size = (256, 256)
+    camera_names = ['angle','r_vis', 'top'] #SIM_TASK_CONFIGS[task_name]['camera_names']
     if task_name == 'sim_transfer':
+        policy = TestPickAndTransferPolicy(inject_noise)
         print(task_name)
     else:
         raise NotImplementedError
@@ -30,45 +32,63 @@ def main():

     env = make_sim_env(task_name)
     policy = TestPickAndTransferPolicy(inject_noise)

-    # wait for osmesa to finish starting up before collecting data
-    print("waiting for the osmesa thread to start...")
-    time.sleep(60)
-    print("osmesa is ready, starting data collection...")

     for episode_idx in range(num_episodes):
-        sum_reward = 0.0
-        max_reward = float('-inf')
+        obs = []
+        reward_ee = []
         print(f'\n{episode_idx=}')
         print('Rollout out EE space scripted policy')
         box_pos = sample_transfer_pose()
         env.reset(box_pos)
-        episode_writer = StreamingEpisodeWriter(
-            dataset_path=os.path.join(dataset_dir, f'episode_{episode_idx}.hdf5'),
-            max_timesteps=episode_len,
-            camera_names=camera_names,
-            image_size=image_size,
-        )
         for step in range(episode_len):
-            raw_action = policy.predict(box_pos,step)
-            env.step(raw_action)
+            action = policy.predict(box_pos,step)
+            env.step(action)
             env.render()
-            sum_reward += env.rew
-            max_reward = max(max_reward, env.rew)
-            episode_writer.append(
-                qpos=env.obs['qpos'],
-                action=raw_action,
-                images=env.obs['images'],
-            )
+            reward_ee.append(env.rew)
+            obs.append(env.obs)
+        sum_reward = np.sum(reward_ee)
+        max_reward = np.max(reward_ee)

         if max_reward == env.max_reward:
             success.append(1)
             print(f"{episode_idx=} Successful, {sum_reward=}")
-            episode_writer.commit()
+            t0 = time.time()
+            data_dict = {
+                '/observations/qpos': [],
+                '/action': [],
+            }
+
+            for cam_name in camera_names:
+                data_dict[f'/observations/images/{cam_name}'] = []
+            for i in range(episode_len):
+                print("type qpos==", obs[i]['qpos'])
+                data_dict['/observations/qpos'].append(obs[i]['qpos'])
+                data_dict['/action'].append(obs[i]['action'])
+                for cam_name in camera_names:
+                    data_dict[f'/observations/images/{cam_name}'].append(obs[i]['images'][cam_name])
+
+            dataset_path = os.path.join(dataset_dir, f'episode_{episode_idx}')
+
+            with h5py.File(dataset_path + '.hdf5', 'w', rdcc_nbytes=1024 ** 2 * 2) as root:
+                max_timesteps = episode_len
+                root.attrs['sim'] = True
+                obs_ = root.create_group('observations')
+                image = obs_.create_group('images')
+                for cam_name in camera_names:
+                    _ = image.create_dataset(cam_name, (max_timesteps, 480, 640, 3), dtype='uint8',
+                                             chunks=(1, 480, 640, 3), )
+                qpos = obs_.create_dataset('qpos', (max_timesteps, 16))
+                action = root.create_dataset('action', (max_timesteps, 16))
+                for name, array in data_dict.items():
+                    root[name][...] = np.array(array)
         else:
             success.append(0)
             print(f"{episode_idx=} Failed")
             print(max_reward)
-        episode_writer.discard()
+        del obs
+        del reward_ee
+        del sum_reward
+        del max_reward

     # del policy
     # env.viewer.close()
@@ -82,4 +102,4 @@ def main():


 if __name__ == '__main__':
     main()
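The added recording code accumulates one list per signal in `data_dict` and mirrors those keys as HDF5 dataset paths when the episode is saved. A small sketch of the resulting layout (`episode_layout` is a hypothetical helper for illustration):

```python
def episode_layout(camera_names):
    # dataset paths written for one successful episode,
    # mirroring the keys of data_dict in the recording loop
    keys = ['/observations/qpos', '/action']
    keys += [f'/observations/images/{cam}' for cam in camera_names]
    return keys


print(episode_layout(['angle', 'r_vis', 'top']))
# ['/observations/qpos', '/action', '/observations/images/angle',
#  '/observations/images/r_vis', '/observations/images/top']
```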
152  roboimi/demos/eval.py  Normal file
@@ -0,0 +1,152 @@
|
|||||||
|
import torch
|
||||||
|
import os
|
||||||
|
import numpy as np
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from tqdm import tqdm
|
||||||
|
from einops import rearrange
|
||||||
|
from roboimi.utils.utils import set_seed
|
||||||
|
from roboimi.utils.io_utils import IOUtils
|
||||||
|
from roboimi.utils.model_interface import ModelInterface
|
||||||
|
from roboimi.envs.vx300s_jnt import make_sim_env
|
||||||
|
import time
|
||||||
|
|
||||||
|
# from visualize_episodes import save_videos
|
||||||
|
from roboimi.utils.utils import sample_box_pose, sample_insertion_pose
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#should be added into IOUtils
|
||||||
|
def get_image(obs,camera_names):
|
||||||
|
curr_images = []
|
||||||
|
for cam_name in camera_names:
|
||||||
|
curr_image = rearrange(obs['images'][cam_name], 'h w c -> c h w')
|
||||||
|
curr_images.append(curr_image)
|
||||||
|
curr_image = np.stack(curr_images, axis=0)
|
||||||
|
curr_image = torch.from_numpy(curr_image / 255.0).float().cuda().unsqueeze(0)
|
||||||
|
return curr_image
|
||||||
|
|
||||||
|
|
||||||
|
def eval_bc(config, ckpt_name='policy_best.ckpt', save_episode=True):
|
||||||
|
set_seed(1)
|
||||||
|
model_interface = ModelInterface(config)
|
||||||
|
task_name = 'sim_insertion' #config['task_name']
|
||||||
|
model_interface.setup()
|
||||||
|
policy = IOUtils.load_policy(config, ckpt_name)
|
||||||
|
stats = IOUtils.load_stats(config['ckpt_dir'])
|
||||||
|
num_rollouts = 3
|
||||||
|
episode_returns = []
|
||||||
|
highest_rewards = []
|
||||||
|
for rollout_id in range(num_rollouts):
|
||||||
|
episode_return, episode_highest_reward = run_episode(config, policy, stats,
|
||||||
|
save_episode,rollout_id)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def run_episode(config, policy, stats, save_episode,rollout_id):
|
||||||
|
print("\nrollout_id===",rollout_id,"\n")
|
||||||
|
pre_process = lambda s_qpos: (s_qpos - stats['qpos_mean']) / stats['qpos_std']
|
||||||
|
post_process = lambda a: a * stats['action_std'] + stats['action_mean']
|
||||||
|
if 'sim_insertion' in config['task_name']:
|
||||||
|
peg_pose, socket_pose = sample_insertion_pose()
|
||||||
|
box_pose = np.hstack((peg_pose[:3],socket_pose[:3])) # used in sim reset
|
||||||
|
task_name = 'sim_insertion' #config['task_name']
|
||||||
|
env = make_sim_env(task_name)
|
||||||
|
env.reset(box_pose)
|
||||||
|
max_timesteps = config['episode_len']
|
||||||
|
max_timesteps = int(max_timesteps * 1)
|
||||||
|
|
||||||
|
image_list = []
|
||||||
|
rewards = []
|
||||||
|
query_frequency = config['policy_config'].get('num_queries', 1)
|
||||||
|
|
||||||
|
with torch.inference_mode():
|
||||||
|
for t in range(700):
|
||||||
|
# print("obs_img",env.obs['images'])
|
||||||
|
image_list.append(env.obs['images'] if 'images' in env.obs else {print("img error")})
|
||||||
|
qpos_numpy = np.array(env.obs['qpos'])
|
||||||
|
qpos = pre_process(qpos_numpy)
|
||||||
|
qpos = torch.from_numpy(qpos).float().cuda().unsqueeze(0)
|
||||||
|
curr_image = get_image(env.obs, config['camera_names'])
|
||||||
|
if config['policy_class'] == "ACT" or "ACTTV":
|
||||||
|
if t % query_frequency == 0:
|
||||||
|
```python
            all_actions = policy(qpos, curr_image)
        elif config['policy_class'] == "CNNMLP":
            raw_action = policy(qpos, curr_image)
        else:
            raise NotImplementedError
        raw_action = all_actions[:, t % query_frequency]

        raw_action = raw_action.squeeze(0).cpu().numpy()
        action = post_process(raw_action)

        env.step(action)
        rewards.append(env.rew)
        env.render()

    rewards = np.array(rewards)
    episode_return = np.sum(rewards[rewards != None])
    episode_highest_reward = np.max(rewards)
    env.viewer.close()

    del env
    return episode_return, episode_highest_reward


def test_env():
    try:
        env = make_sim_env('sim_insertion')
        box_pos = np.concatenate(sample_insertion_pose())
        env.reset(box_pos)
        while True:
            pass
    except KeyboardInterrupt:
        del env
        print("stop")


if __name__ == '__main__':
    test_env()
    # io_utils = IOUtils()
    # config = io_utils.load_config()
    # eval_bc(config)


# config===== {'onscreen_render': False,
# 'eval': 1,
# 'ckpt_dir': 'ckpt_models',
# 'num_epochs': 3000,
# 'temporal_agg': False,
# 'policy_class': 'ACT',
# 'backbone': 'resnet18',
# 'seed': 0, 'real_robot': 0,
# 'task_name': 'sim_insertion',
# 'images_render_height': 480,
# 'images_render_width': 640,
# 'left_arm_DOF_number': 6,
# 'right_arm_DOF_number': 6,
# 'left_qpos_raw': 8,
# 'right_qpos_raw': 8,
# 'left_qvel_raw': 8,
# 'right_qvel_raw': 8,
# 'dataset_dir': '/home/arm/lzd/act_env/dataset/sim_insertion',
# 'num_episodes': 7,
# 'episode_len': 400,
# 'camera_names': ['top'],
# 'xml_dir': None,
# 'batch_size': 8,
# 'state_dim': 14,
# 'action_dim': 14,
# 'lr_backbone': 1e-05,
# 'enc_layers': 4,
# 'dec_layers': 7,
# 'nheads': 8,
# 'qpos_noise_std': 0,
# 'DT': 0.02,
# 'lr': 1e-05,
# 'kl_weight': 10,
# 'chunk_size': 100,
# 'hidden_dim': 512,
# 'dim_feedforward': 3200,
# 'policy_config': {'lr': 1e-05, 'num_queries': 100, 'kl_weight': 10, 'hidden_dim': 512, 'dim_feedforward': 3200, 'lr_backbone': 1e-05, 'backbone': 'resnet18', 'enc_layers': 4, 'dec_layers': 7, 'nheads': 8, 'camera_names': ['top']}}
```
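The `t % query_frequency` indexing above replays a previously predicted action chunk between policy queries: the policy forward pass runs only when `t % query_frequency == 0`, and intermediate steps index into the cached chunk. A minimal standalone sketch of that schedule (hypothetical helper, not part of the repo):

```python
def rollout_schedule(num_steps, query_frequency):
    """Return (step, chunk_index, query_triggered) tuples for a rollout.

    query_triggered marks steps where the policy forward pass would run;
    chunk_index is the offset into the most recently predicted action chunk.
    """
    schedule = []
    for t in range(num_steps):
        triggered = (t % query_frequency == 0)  # policy(qpos, curr_image) runs here
        schedule.append((t, t % query_frequency, triggered))
    return schedule

# With query_frequency=3, the policy runs at t=0 and t=3; steps 1 and 2
# reuse actions 1 and 2 of the chunk predicted at t=0.
print(rollout_schedule(5, 3))
```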
roboimi/demos/training.py (Normal file, +179 lines)
@@ -0,0 +1,179 @@
```python
import torch
import os
from tqdm import tqdm
import numpy as np
from copy import deepcopy
from itertools import repeat
import matplotlib.pyplot as plt
import time

from roboimi.utils.utils import set_seed, compute_dict_mean, detach_dict, load_data
from roboimi.utils.io_utils import IOUtils
from roboimi.utils.model_interface import ModelInterface


def train_bc(config):
    num_epochs = config['num_epochs']
    ckpt_dir = config['ckpt_dir']
    seed = config['seed']

    os.makedirs(ckpt_dir, exist_ok=True)

    set_seed(seed)

    model_interface = ModelInterface(config)
    model_interface.setup()

    policy = model_interface.make_policy()
    policy.cuda()
    optimizer = model_interface.make_optimizer(policy)
    # print("cam names=====", config['camera_names'])
    train_dataloader, val_dataloader, stats, _ = load_data(
        config['dataset_dir'],
        config['num_episodes'],
        config['camera_names'],
        config['batch_size'],
        config['batch_size'])

    IOUtils.save_stats(ckpt_dir, stats)

    train_history = []
    validation_history = []
    min_val_loss = np.inf
    min_train_loss = np.inf
    best_ckpt_info = None

    plt.ion()
    fig, ax = plt.subplots()
    train_losses, val_losses = [], []
    train_line, = ax.plot([], [], label='Train Loss')
    val_line, = ax.plot([], [], label='Validation Loss')
    ax.autoscale_view()
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.legend()
    ax.grid(True)

    train_annotation = ax.annotate('', xy=(0, 0), textcoords='offset points')
    val_annotation = ax.annotate('', xy=(0, 0), textcoords='offset points')

    min_train_text = ax.text(0.85, 0.5, '', transform=ax.transAxes, fontsize=10,
                             verticalalignment='center', horizontalalignment='left',
                             bbox=dict(facecolor='white', alpha=0.5))
    min_val_text = ax.text(0.85, 0.45, '', transform=ax.transAxes, fontsize=10,
                           verticalalignment='center', horizontalalignment='left',
                           bbox=dict(facecolor='white', alpha=0.5))

    for epoch in tqdm(range(num_epochs)):
        print(f'\nEpoch {epoch}')

        # Validation
        epoch_val_loss, epoch_summary = validate(policy, val_dataloader)
        validation_history.append(epoch_summary)
        val_losses.append(epoch_val_loss.cpu().item())

        if epoch_val_loss < min_val_loss:
            min_val_loss = epoch_val_loss
            min_val_epoch = epoch
            best_ckpt_info = (epoch, min_val_loss,
                              deepcopy(policy.state_dict()))

        print(f'Val loss: {epoch_val_loss:.5f}')
        print_summary(epoch_summary)

        # Training
        epoch_train_loss, epoch_summary = train_epoch(
            policy, optimizer, train_dataloader)
        train_history.append(epoch_summary)
        train_losses.append(epoch_train_loss.cpu().item())

        if epoch_train_loss < min_train_loss:
            min_train_loss = epoch_train_loss
            min_train_epoch = epoch

        print(f'Train loss: {epoch_train_loss:.5f}')
        print_summary(epoch_summary)

        # Update the plot with the new data
        train_line.set_xdata(range(len(train_losses)))
        train_line.set_ydata(train_losses)
        val_line.set_xdata(range(len(val_losses)))
        val_line.set_ydata(val_losses)

        # Update annotations with the latest loss values at their respective positions
        train_annotation.set_position((len(train_losses) - 1, train_losses[-1]))
        train_annotation.xy = (len(train_losses) - 1, train_losses[-1])
        train_annotation.set_text(f'{train_losses[-1]:.5f}')

        val_annotation.set_position((len(val_losses) - 1, val_losses[-1]))
        val_annotation.xy = (len(val_losses) - 1, val_losses[-1])
        val_annotation.set_text(f'{val_losses[-1]:.5f}')

        # Update text objects with the minimum loss values, fixed on the right side
        min_train_text.set_text(f'Min Train Loss: {min_train_loss:.5f} (Epoch {min_train_epoch})')
        min_val_text.set_text(f'Min Val Loss: {min_val_loss:.5f} (Epoch {min_val_epoch})')

        ax.relim()
        ax.autoscale_view()
        plt.draw()
        plt.pause(0.1)

    plt.ioff()
    IOUtils.save_checkpoint(policy, 'last', ckpt_dir, seed, 'last')

    best_epoch, min_val_loss, best_state_dict = best_ckpt_info
    IOUtils.save_checkpoint(best_state_dict, best_epoch,
                            ckpt_dir, seed, 'best', min_val_loss)
    print(
        f'Training finished:\nSeed {seed}, val loss {min_val_loss:.6f} at epoch {best_epoch}')

    IOUtils.plot_history(train_history, validation_history,
                         num_epochs, ckpt_dir, seed)

    return best_ckpt_info


def validate(policy, dataloader):
    policy.eval()
    epoch_dicts = []
    with torch.inference_mode():
        for data in dataloader:
            forward_dict = forward_pass(data, policy)
            epoch_dicts.append(forward_dict)
    epoch_summary = compute_dict_mean(epoch_dicts)
    return epoch_summary['loss'], epoch_summary


def train_epoch(policy, optimizer, dataloader):
    policy.train()
    epoch_dicts = []
    for data in dataloader:
        optimizer.zero_grad()
        forward_dict = forward_pass(data, policy)
        loss = forward_dict['loss']
        loss.backward()
        optimizer.step()
        epoch_dicts.append(detach_dict(forward_dict))
    epoch_summary = compute_dict_mean(epoch_dicts)
    return epoch_summary['loss'], epoch_summary


def forward_pass(data, policy):
    image_data, qpos_data, action_data, is_pad = data
    image_data, qpos_data, action_data, is_pad = image_data.cuda(
    ), qpos_data.cuda(), action_data.cuda(), is_pad.cuda()
    return policy(qpos_data, image_data, action_data, is_pad)


def print_summary(summary):
    summary_string = ' '.join(
        [f'{k}: {v.item():.3f}' for k, v in summary.items()])
    print(summary_string)


if __name__ == '__main__':
    io_utils = IOUtils()
    config = io_utils.load_config()
    train_bc(config)
```
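Both `validate` and `train_epoch` reduce a list of per-batch loss dictionaries with `compute_dict_mean`, imported from `roboimi.utils.utils` but not shown in this diff. A minimal sketch of what such a reduction typically looks like, assuming plain float values (the real implementation operates on tensors):

```python
def compute_dict_mean_sketch(epoch_dicts):
    """Average each key across a list of dicts that share the same keys."""
    keys = epoch_dicts[0].keys()
    return {k: sum(d[k] for d in epoch_dicts) / len(epoch_dicts) for k in keys}

# Two batches worth of losses collapse into one epoch summary.
batches = [{'loss': 1.0, 'kl': 0.5}, {'loss': 3.0, 'kl': 1.5}]
print(compute_dict_mean_sketch(batches))  # {'loss': 2.0, 'kl': 1.0}
```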
@@ -1,36 +0,0 @@ (deleted file)
```python
import argparse

import numpy as np

from roboimi.utils.raw_action_trajectory_viewer import launch_raw_action_trajectory_viewer


def parse_args():
    parser = argparse.ArgumentParser(description="Launch an interactive MuJoCo viewer with raw-action trajectory overlay.")
    parser.add_argument("trajectory_path", help="Path to raw_action.npy or trajectory.npz")
    parser.add_argument("--task-name", default="sim_transfer")
    parser.add_argument("--line-radius", type=float, default=0.004)
    parser.add_argument("--max-markers", type=int, default=1500)
    parser.add_argument(
        "--box-pos",
        type=float,
        nargs=3,
        default=None,
        help="Optional box xyz to use when resetting the environment",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    box_pos = np.asarray(args.box_pos, dtype=np.float32) if args.box_pos is not None else None
    launch_raw_action_trajectory_viewer(
        args.trajectory_path,
        task_name=args.task_name,
        line_radius=args.line_radius,
        max_markers=args.max_markers,
        box_pos=box_pos,
    )


if __name__ == "__main__":
    main()
```
@@ -1,956 +0,0 @@ (deleted file)
```python
"""
VLA policy evaluation script (simplified version)

This script evaluates a trained VLA policy using the agent's built-in queue
management. No separate evaluator class is needed - the agent handles everything!

Usage:
    python roboimi/demos/eval_vla_simple.py
    python roboimi/demos/eval_vla_simple.py eval.ckpt_path=checkpoints/vla_model_final.pt
    python roboimi/demos/eval_vla_simple.py eval.ckpt_path=checkpoints/vla_model_best.pt
"""

import sys
import os
import json
import logging
import time
import torch
import numpy as np
import hydra
from pathlib import Path
from typing import Any, Dict, Optional
from tqdm import tqdm
from omegaconf import DictConfig, OmegaConf
from hydra.utils import instantiate
from einops import rearrange

from roboimi.utils.act_ex_utils import sample_transfer_pose
from roboimi.vla.eval_utils import execute_policy_action

sys.path.append(os.getcwd())
```
```python
log = logging.getLogger(__name__)

if not OmegaConf.has_resolver("len"):
    OmegaConf.register_new_resolver("len", lambda x: len(x))


def _configure_headless_mujoco_gl(eval_cfg: DictConfig) -> None:
    if not bool(eval_cfg.get('headless', False)):
        return
    if os.environ.get('MUJOCO_GL'):
        return
    os.environ['MUJOCO_GL'] = 'egl'
    log.info('headless eval detected; set MUJOCO_GL=egl')


def make_sim_env(task_name: str, headless: bool = False):
    from roboimi.envs.double_pos_ctrl_env import make_sim_env as _make_sim_env_impl
    return _make_sim_env_impl(task_name, headless=headless)
```
```python
def load_checkpoint(
    ckpt_path: str,
    agent_cfg: DictConfig,
    device: str = 'cuda'
) -> tuple[torch.nn.Module, Optional[Dict]]:
    """
    Load a trained VLA model from a checkpoint, using the Hydra agent config.

    Args:
        ckpt_path: Path to the checkpoint file (.pt)
        agent_cfg: Hydra agent config used for instantiation
        device: Device to load the model onto

    Returns:
        The loaded VLAAgent model and its dataset statistics
    """
    from pathlib import Path as PathLib

    ckpt_path = PathLib(ckpt_path).absolute()
    if not ckpt_path.exists():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt_path}")

    log.info(f"Loading checkpoint from {ckpt_path}")
    checkpoint = torch.load(ckpt_path, map_location=device, weights_only=False)
    log.info(f"Checkpoint keys: {checkpoint.keys()}")

    # Load dataset statistics used for normalization
    stats = checkpoint.get('dataset_stats', None)

    # Instantiate the agent from the Hydra config with the dataset statistics
    log.info("Instantiating agent from config...")
    agent = instantiate(agent_cfg, dataset_stats=stats)

    # Load the model state
    agent.load_state_dict(checkpoint['model_state_dict'])
    log.info(f"✅ Model state loaded (step: {checkpoint.get('step', 'unknown')})")

    if stats is not None:
        log.info(f"✅ Dataset statistics loaded (normalization: {stats.get('normalization_type', 'gaussian')})")
    else:
        # Fallback: try loading from an external JSON file (legacy checkpoints)
        stats_path = ckpt_path.parent / 'dataset_stats.json'
        if stats_path.exists():
            with open(stats_path, 'r') as f:
                stats = json.load(f)
            log.info("✅ Dataset statistics loaded from external JSON (legacy compatibility)")
        else:
            log.warning("⚠️ Dataset statistics not found. Actions cannot be de-normalized!")

    agent.eval()
    agent.to(device)

    log.info(f"✅ Model successfully loaded onto {device}")
    return agent, stats
```
```python
def prepare_observation(
    obs: Dict,
    camera_names: list,
    image_resize_shape: Optional[tuple[int, int]] = (224, 224),
) -> Dict:
    """
    Convert an environment observation into the agent's format.

    Args:
        obs: Environment observation dict containing images and qpos
        camera_names: List of camera names

    Returns:
        Observation dict in the agent's format
    """
    # Convert images: numpy -> tensor, HWC -> CHW
    images = {}
    for cam_name in camera_names:
        img = obs['images'][cam_name]
        if image_resize_shape is not None:
            import cv2
            img = cv2.resize(img, tuple(image_resize_shape), interpolation=cv2.INTER_LINEAR)
        img = rearrange(img, 'h w c -> c h w')
        img = torch.from_numpy(img / 255.0).float()
        images[cam_name] = img

    # Convert qpos: numpy -> tensor
    qpos = torch.from_numpy(obs['qpos']).float()

    return {'qpos': qpos, 'images': images}
```
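The image conversion in `prepare_observation` is the standard HWC-to-CHW transpose plus scaling to [0, 1]. A pure-numpy stand-in for the `rearrange` + `/255.0` step (illustrative helper name; the real code returns a torch tensor):

```python
import numpy as np

def prepare_image_sketch(img_hwc):
    """uint8 (H, W, C) image -> float32 (C, H, W) array scaled to [0, 1],
    mirroring rearrange(img, 'h w c -> c h w') followed by division by 255."""
    return np.transpose(img_hwc, (2, 0, 1)).astype(np.float32) / 255.0

img = np.full((4, 6, 3), 255, dtype=np.uint8)
out = prepare_image_sketch(img)
print(out.shape, float(out.max()))  # (3, 4, 6) 1.0
```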
```python
def _to_numpy_action(action: Any) -> np.ndarray:
    if isinstance(action, torch.Tensor):
        return action.detach().cpu().numpy().astype(np.float32, copy=True)
    return np.asarray(action, dtype=np.float32).copy()


def _mean_or_zero(values: list[float]) -> float:
    return float(np.mean(values)) if values else 0.0


def _stats_or_zero(values: list[float]) -> dict[str, float]:
    if not values:
        return {
            'mean': 0.0,
            'std': 0.0,
            'min': 0.0,
            'max': 0.0,
        }
    array = np.asarray(values, dtype=np.float64)
    return {
        'mean': float(array.mean()),
        'std': float(array.std()),
        'min': float(array.min()),
        'max': float(array.max()),
    }


def _summarize_timing_breakdown(
    all_timings: dict[str, list[float]],
    model_forward_flags: list[bool],
) -> dict[str, Any]:
    model_forward_flags = [bool(flag) for flag in model_forward_flags]
    return {
        'count': int(len(model_forward_flags)),
        'model_forward_count': int(sum(model_forward_flags)),
        'all_steps_ms': {
            stage: _stats_or_zero(values)
            for stage, values in all_timings.items()
        },
        'model_forward_steps_ms': {
            stage: _stats_or_zero(
                [value for value, should_keep in zip(values, model_forward_flags) if should_keep]
            )
            for stage, values in all_timings.items()
        },
    }


def _json_friendly(value: Any) -> Any:
    if isinstance(value, dict):
        return {str(key): _json_friendly(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_json_friendly(item) for item in value]
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, (np.integer, np.floating)):
        return value.item()
    return value


def _resolve_artifact_paths(eval_cfg: DictConfig) -> dict[str, Optional[str]]:
    save_timing = bool(eval_cfg.get('save_timing', False))
    save_trajectory = bool(
        eval_cfg.get('save_trajectory', False) or eval_cfg.get('save_trajectory_npz', False)
    )
    save_trajectory_image = bool(eval_cfg.get('save_trajectory_image', False))
    wants_artifacts = any([
        bool(eval_cfg.get('save_artifacts', False)),
        save_timing,
        save_trajectory,
        save_trajectory_image,
        bool(eval_cfg.get('record_video', False)),
    ])
    output_dir: Optional[Path] = None
    if wants_artifacts:
        artifact_dir = eval_cfg.get('artifact_dir', None)
        if artifact_dir:
            output_dir = Path(str(artifact_dir)).expanduser().resolve()
        else:
            ckpt_stem = Path(str(eval_cfg.ckpt_path)).stem or 'rollout'
            timestamp = time.strftime('%Y%m%d-%H%M%S')
            output_dir = (Path.cwd() / 'rollout_artifacts' / f'{ckpt_stem}-{timestamp}').resolve()
        output_dir.mkdir(parents=True, exist_ok=True)

    video_camera_name = None
    if bool(eval_cfg.get('record_video', False)):
        configured_camera_name = eval_cfg.get('video_camera_name', None)
        if configured_camera_name is None:
            configured_camera_name = eval_cfg.get('video_camera', None)
        if configured_camera_name is not None:
            video_camera_name = str(configured_camera_name)
        elif eval_cfg.get('camera_names'):
            video_camera_name = str(eval_cfg.camera_names[0])
        else:
            raise ValueError('record_video=true requires eval.video_camera_name or a non-empty eval.camera_names')

    trajectory_image_camera_name = None
    if save_trajectory_image:
        configured_camera_name = eval_cfg.get('trajectory_image_camera_name', None)
        if configured_camera_name is None:
            configured_camera_name = eval_cfg.get('trajectory_image_camera', None)
        if configured_camera_name is not None:
            trajectory_image_camera_name = str(configured_camera_name)
        elif eval_cfg.get('camera_names'):
            camera_names = [str(name) for name in eval_cfg.camera_names]
            trajectory_image_camera_name = 'front' if 'front' in camera_names else camera_names[0]
        else:
            raise ValueError(
                'save_trajectory_image=true requires eval.trajectory_image_camera_name '
                'or a non-empty eval.camera_names'
            )

    return {
        'output_dir': str(output_dir) if output_dir is not None else None,
        'summary_json': (
            str(output_dir / 'rollout_summary.json')
            if output_dir is not None and bool(eval_cfg.get('save_summary_json', False))
            else None
        ),
        'timing_json': (
            str(output_dir / 'timing.json')
            if output_dir is not None and save_timing
            else None
        ),
        'trajectory_npz': (
            str(output_dir / 'trajectory.npz')
            if output_dir is not None and save_trajectory
            else None
        ),
        'video_mp4': (
            str(output_dir / f'rollout_{video_camera_name}.mp4')
            if output_dir is not None and bool(eval_cfg.get('record_video', False))
            and video_camera_name is not None
            else None
        ),
        'video_camera_name': video_camera_name,
        'trajectory_image_camera_name': trajectory_image_camera_name,
    }
```
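`_stats_or_zero` exists so that timing summaries never crash on an episode with zero recorded steps: an empty list yields an all-zero summary instead of a numpy error on an empty array. A standalone copy of that behavior for illustration:

```python
import numpy as np

def stats_or_zero(values):
    """Summarize a list of floats; return zeros for an empty list instead of raising."""
    if not values:
        return {'mean': 0.0, 'std': 0.0, 'min': 0.0, 'max': 0.0}
    a = np.asarray(values, dtype=np.float64)
    return {'mean': float(a.mean()), 'std': float(a.std()),
            'min': float(a.min()), 'max': float(a.max())}

print(stats_or_zero([]))           # all zeros, no exception
print(stats_or_zero([2.0, 4.0]))   # mean 3.0, std 1.0, min 2.0, max 4.0
```

Note that `np.asarray([]).min()` would raise, which is exactly what the early return guards against.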
```python
def _get_video_frame(obs: Dict, camera_name: Optional[str]) -> Optional[np.ndarray]:
    if camera_name is None:
        return None
    frame = obs['images'][camera_name]
    frame = np.asarray(frame)
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError(
            f'Video frame for camera {camera_name} must have shape (H, W, 3), got {frame.shape}'
        )
    if frame.dtype != np.uint8:
        frame = np.clip(frame, 0, 255).astype(np.uint8)
    return frame


def _open_video_writer(output_path: str, frame_size: tuple[int, int], fps: int):
    import cv2

    output_path = str(output_path)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    writer = cv2.VideoWriter(output_path, fourcc, float(fps), frame_size)
    if not writer.isOpened():
        raise RuntimeError(f'Cannot open video output: {output_path}')
    return writer


def _episode_trajectory_image_path(
    artifact_paths: dict[str, Optional[str]],
    episode_idx: int,
) -> Optional[str]:
    output_dir = artifact_paths.get('output_dir')
    camera_name = artifact_paths.get('trajectory_image_camera_name')
    if output_dir is None or camera_name is None:
        return None
    return str(Path(output_dir) / f'rollout_{camera_name}_ep{episode_idx + 1:02d}_trajectory.png')
```
```python
def _build_action_trajectory_positions(raw_actions: list[np.ndarray]) -> dict[str, np.ndarray]:
    if not raw_actions:
        empty = np.zeros((0, 3), dtype=np.float32)
        return {'left': empty, 'right': empty}
    raw_action_array = np.asarray(raw_actions, dtype=np.float32)
    return {
        'left': raw_action_array[:, :3].astype(np.float32, copy=True),
        'right': raw_action_array[:, 7:10].astype(np.float32, copy=True),
    }


def _append_capsule_markers_to_scene(scene, markers: list[dict]) -> None:
    import mujoco

    for marker in markers:
        if scene.ngeom >= scene.maxgeom:
            break
        geom = scene.geoms[scene.ngeom]
        mujoco.mjv_initGeom(
            geom,
            mujoco.mjtGeom.mjGEOM_CAPSULE,
            np.zeros(3, dtype=np.float64),
            np.zeros(3, dtype=np.float64),
            np.eye(3, dtype=np.float64).reshape(-1),
            np.asarray(marker['rgba'], dtype=np.float32),
        )
        mujoco.mjv_connector(
            geom,
            mujoco.mjtGeom.mjGEOM_CAPSULE,
            float(marker['radius']),
            np.asarray(marker['from'], dtype=np.float64),
            np.asarray(marker['to'], dtype=np.float64),
        )
        scene.ngeom += 1


def _save_rollout_trajectory_image(
    env,
    output_path: Optional[str],
    raw_actions: list[np.ndarray],
    camera_name: Optional[str],
    *,
    line_radius: float = 0.004,
    max_markers: int = 1500,
) -> Optional[str]:
    if output_path is None or camera_name is None:
        return None

    # IMPORTANT:
    # Keep this import lazy so headless rollout can set MUJOCO_GL=egl before
    # anything imports mujoco. Importing this helper at module import time would
    # pull in mujoco too early on remote headless hosts and make rollout fail
    # with gladLoadGL / missing DISPLAY errors.
    from roboimi.utils.raw_action_trajectory_viewer import build_trajectory_capsule_markers

    output_path = str(output_path)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    frame = None
    owned_renderer = None
    positions = _build_action_trajectory_positions(raw_actions)
    markers = build_trajectory_capsule_markers(
        positions,
        max_markers=max_markers,
        radius=line_radius,
    )

    try:
        renderer = None
        if callable(getattr(env, '_get_or_create_offscreen_renderer', None)):
            renderer = env._get_or_create_offscreen_renderer()
        elif hasattr(env, 'mj_model') and hasattr(env, 'mj_data'):
            import mujoco

            renderer = mujoco.Renderer(env.mj_model, height=480, width=640)
            owned_renderer = renderer

        if renderer is not None and hasattr(env, 'mj_data'):
            renderer.update_scene(env.mj_data, camera=str(camera_name))
            if markers:
                _append_capsule_markers_to_scene(renderer.scene, markers)
            frame = renderer.render()[:, :, ::-1]
    finally:
        if owned_renderer is not None:
            owned_renderer.close()

    if frame is None and callable(getattr(env, '_get_image_obs', None)):
        obs = env._get_image_obs()
        frame = _get_video_frame(obs, str(camera_name))

    if frame is None:
        return None

    import cv2

    cv2.imwrite(output_path, frame)
    return output_path
```
```python
class _RolloutVideoRecorder:
    def __init__(self, output_path: Optional[str], fps: int):
        self.output_path = output_path
        self.fps = int(fps)
        self.writer = None

    def write(self, frame: Optional[np.ndarray]):
        if self.output_path is None or frame is None:
            return
        if self.writer is None:
            frame_size = (int(frame.shape[1]), int(frame.shape[0]))
            self.writer = _open_video_writer(self.output_path, frame_size, self.fps)
        self.writer.write(frame)

    def close(self):
        if self.writer is not None:
            self.writer.release()
            self.writer = None


def _read_body_pose(env, body_name: str):
    try:
        if callable(getattr(env, 'getBodyPos', None)) and callable(getattr(env, 'getBodyQuat', None)):
            pos = env.getBodyPos(body_name)
            quat = env.getBodyQuat(body_name)
        else:
            body = env.mj_data.body(body_name)
            pos = body.xpos
            quat = body.xquat
    except Exception:
        return None

    return {
        'pos': np.asarray(pos, dtype=np.float32).copy(),
        'quat': np.asarray(quat, dtype=np.float32).copy(),
    }


def _get_executed_ee_poses(env) -> dict[str, np.ndarray]:
    candidates = {
        'left_link7': ('left_link7', 'eef_left'),
        'right_link7': ('right_link7', 'eef_right'),
        'eef_left': ('eef_left', 'left_link7'),
        'eef_right': ('eef_right', 'right_link7'),
    }
    poses = {}
    for body_key, body_names in candidates.items():
        pose = None
        for body_name in body_names:
            pose = _read_body_pose(env, body_name)
            if pose is not None:
                break
        if pose is None:
            pose = {
                'pos': np.full(3, np.nan, dtype=np.float32),
                'quat': np.full(4, np.nan, dtype=np.float32),
            }
        poses[f'{body_key}_pos'] = pose['pos']
        poses[f'{body_key}_quat'] = pose['quat']
    return poses
```
def _empty_rollout_trajectory() -> dict[str, list]:
|
|
||||||
return {
|
|
||||||
'episode_index': [],
|
|
||||||
'step': [],
|
|
||||||
'reward': [],
|
|
||||||
'raw_action': [],
|
|
||||||
'applied_action': [],
|
|
||||||
'executed_left_link7_pos': [],
|
|
||||||
'executed_left_link7_quat': [],
|
|
||||||
'executed_right_link7_pos': [],
|
|
||||||
'executed_right_link7_quat': [],
|
|
||||||
'executed_eef_left_pos': [],
|
|
||||||
'executed_eef_left_quat': [],
|
|
||||||
'executed_eef_right_pos': [],
|
|
||||||
'executed_eef_right_quat': [],
|
|
||||||
'model_inference_triggered': [],
|
|
||||||
'obs_read_time_ms': [],
|
|
||||||
'preprocess_time_ms': [],
|
|
||||||
'inference_time_ms': [],
|
|
||||||
'env_step_time_ms': [],
|
|
||||||
'total_time_ms': [],
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def _append_rollout_step(
|
|
||||||
storage: dict[str, list],
|
|
||||||
episode_index: int,
|
|
||||||
timestep: int,
|
|
||||||
reward: Optional[float],
|
|
||||||
raw_action: np.ndarray,
|
|
||||||
executed_action: np.ndarray,
|
|
||||||
executed_poses: dict[str, np.ndarray],
|
|
||||||
timing_ms: dict[str, float],
|
|
||||||
model_inference_triggered: bool,
|
|
||||||
):
|
|
||||||
storage['episode_index'].append(int(episode_index))
|
|
||||||
storage['step'].append(int(timestep))
|
|
||||||
storage['reward'].append(float(reward) if reward is not None else np.nan)
|
|
||||||
storage['raw_action'].append(raw_action.astype(np.float32, copy=True))
|
|
||||||
storage['applied_action'].append(executed_action.astype(np.float32, copy=True))
|
|
||||||
storage['executed_left_link7_pos'].append(executed_poses['left_link7_pos'])
|
|
||||||
storage['executed_left_link7_quat'].append(executed_poses['left_link7_quat'])
|
|
||||||
storage['executed_right_link7_pos'].append(executed_poses['right_link7_pos'])
|
|
||||||
storage['executed_right_link7_quat'].append(executed_poses['right_link7_quat'])
|
|
||||||
storage['executed_eef_left_pos'].append(executed_poses['eef_left_pos'])
|
|
||||||
storage['executed_eef_left_quat'].append(executed_poses['eef_left_quat'])
|
|
||||||
storage['executed_eef_right_pos'].append(executed_poses['eef_right_pos'])
|
|
||||||
storage['executed_eef_right_quat'].append(executed_poses['eef_right_quat'])
|
|
||||||
storage['model_inference_triggered'].append(bool(model_inference_triggered))
|
|
||||||
for key, value in timing_ms.items():
|
|
||||||
storage[key].append(float(value))
|
|
||||||
|
|
||||||
|
|
||||||
def _save_rollout_trajectory_npz(output_path: str, storage: dict[str, list]):
    step = np.asarray(storage['step'], dtype=np.int32)
    raw_action = np.asarray(storage['raw_action'], dtype=np.float32)
    applied_action = np.asarray(storage['applied_action'], dtype=np.float32)
    executed_left_link7_pos = np.asarray(storage['executed_left_link7_pos'], dtype=np.float32)
    executed_left_link7_quat = np.asarray(storage['executed_left_link7_quat'], dtype=np.float32)
    executed_right_link7_pos = np.asarray(storage['executed_right_link7_pos'], dtype=np.float32)
    executed_right_link7_quat = np.asarray(storage['executed_right_link7_quat'], dtype=np.float32)
    executed_eef_left_pos = np.asarray(storage['executed_eef_left_pos'], dtype=np.float32)
    executed_eef_left_quat = np.asarray(storage['executed_eef_left_quat'], dtype=np.float32)
    executed_eef_right_pos = np.asarray(storage['executed_eef_right_pos'], dtype=np.float32)
    executed_eef_right_quat = np.asarray(storage['executed_eef_right_quat'], dtype=np.float32)
    np.savez_compressed(
        output_path,
        episode_index=np.asarray(storage['episode_index'], dtype=np.int32),
        step=step,
        timestep=step,
        reward=np.asarray(storage['reward'], dtype=np.float32),
        raw_action=raw_action,
        raw_predicted_ee_action=raw_action,
        applied_action=applied_action,
        executed_ee_action=applied_action,
        executed_left_link7_pos=executed_left_link7_pos,
        executed_left_link7_quat=executed_left_link7_quat,
        executed_right_link7_pos=executed_right_link7_pos,
        executed_right_link7_quat=executed_right_link7_quat,
        executed_eef_left_pos=executed_eef_left_pos,
        executed_eef_left_quat=executed_eef_left_quat,
        executed_eef_right_pos=executed_eef_right_pos,
        executed_eef_right_quat=executed_eef_right_quat,
        left_ee_pos=executed_eef_left_pos,
        left_ee_quat=executed_eef_left_quat,
        right_ee_pos=executed_eef_right_pos,
        right_ee_quat=executed_eef_right_quat,
        model_inference_triggered=np.asarray(storage['model_inference_triggered'], dtype=bool),
        obs_read_time_ms=np.asarray(storage['obs_read_time_ms'], dtype=np.float32),
        preprocess_time_ms=np.asarray(storage['preprocess_time_ms'], dtype=np.float32),
        inference_time_ms=np.asarray(storage['inference_time_ms'], dtype=np.float32),
        env_step_time_ms=np.asarray(storage['env_step_time_ms'], dtype=np.float32),
        total_time_ms=np.asarray(storage['total_time_ms'], dtype=np.float32),
    )

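The saver stores several arrays under aliased keys (`step`/`timestep`, `raw_action`/`raw_predicted_ee_action`, and so on) for downstream compatibility. A minimal sketch of the same `np.savez_compressed` round-trip, done in memory rather than to a file, to show the aliases are independent copies of equal data:

```python
import io
import numpy as np

# Write a compressed archive to an in-memory buffer, storing one array
# under two aliased keys, then load it back.
buf = io.BytesIO()
step = np.arange(3, dtype=np.int32)
np.savez_compressed(buf, step=step, timestep=step)
buf.seek(0)

data = np.load(buf)
print(data['step'].tolist())      # both keys decompress to equal arrays
print(data['timestep'].tolist())
```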
def _save_summary_json(output_path: str, summary: dict[str, Any]):
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(_json_friendly(summary), f, ensure_ascii=False, indent=2)

class ActionSmoother:
    """
    Action smoother (exponential moving average).

    Used to smooth executed actions for more stable control.
    """

    def __init__(self, alpha: float = 0.3):
        """
        Args:
            alpha: smoothing coefficient in (0, 1); larger values weight the current action more
        """
        self.alpha = alpha
        self.prev_action = None

    def smooth(self, action: np.ndarray) -> np.ndarray:
        """
        Smooth an action.

        Args:
            action: the current action

        Returns:
            The smoothed action
        """
        if self.prev_action is None:
            smoothed = action
        else:
            smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed
        return smoothed

    def reset(self):
        """Reset the smoother state."""
        self.prev_action = None

def _close_env(env):
    if env is None:
        return

    if hasattr(env, 'exit_flag'):
        env.exit_flag = True

    cam_thread = getattr(env, 'cam_thread', None)
    if cam_thread is not None and hasattr(cam_thread, 'join'):
        cam_thread.join(timeout=1.0)

    viewer = getattr(env, 'viewer', None)
    if viewer is not None and hasattr(viewer, 'close'):
        viewer.close()

def _run_eval(cfg: DictConfig):
    """
    Simplified VLA evaluation that relies on the agent's built-in queue management.

    All evaluation parameters come from vla/conf/eval.yaml and are merged into cfg.
    Command-line overrides: python eval_vla_simple.py eval.ckpt_path=... eval.num_episodes=5
    """

    # Print the configuration
    print("=" * 80)
    print("VLA evaluation configuration:")
    print("=" * 80)
    print(OmegaConf.to_yaml(cfg))
    print("=" * 80)

    eval_cfg = cfg.eval
    _configure_headless_mujoco_gl(eval_cfg)
    device = eval_cfg.device
    camera_names = list(eval_cfg.camera_names)
    artifact_paths = _resolve_artifact_paths(eval_cfg)
    video_recorder = _RolloutVideoRecorder(
        output_path=artifact_paths['video_mp4'],
        fps=int(eval_cfg.get('video_fps', 30)),
    )
    rollout_trajectory = _empty_rollout_trajectory()
    global_obs_read_times_ms = []
    global_preprocess_times_ms = []
    global_inference_times_ms = []
    global_env_step_times_ms = []
    global_total_times_ms = []
    global_model_forward_flags = []

    # =========================================================================
    # Load the model
    # =========================================================================
    log.info(f"🚀 Loading model from {eval_cfg.ckpt_path}...")
    agent, dataset_stats = load_checkpoint(
        ckpt_path=eval_cfg.ckpt_path,
        agent_cfg=cfg.agent,
        device=device
    )
    vision_encoder = getattr(agent, 'vision_encoder', None)
    image_resize_shape = getattr(vision_encoder, 'eval_image_resize_shape', (224, 224))

    # Reset the agent's queues
    agent.reset()

    # Optional action smoother
    smoother = ActionSmoother(alpha=eval_cfg.smooth_alpha) if eval_cfg.use_smoothing else None

    # =========================================================================
    # Create the environment
    # =========================================================================
    env = make_sim_env(eval_cfg.task_name, headless=eval_cfg.headless)

    # =========================================================================
    # Run evaluation episodes
    # =========================================================================
    all_stats = []
    episode_rewards = []
    episode_max_rewards = []
    try:
        for episode_idx in range(eval_cfg.num_episodes):
            print(f"\n{'='*60}")
            print(f"Episode {episode_idx + 1}/{eval_cfg.num_episodes}")
            print(f"{'='*60}\n")

            box_pos = sample_transfer_pose()
            env.reset(box_pos)

            # Reset the agent queues for the new episode
            agent.reset()
            if smoother:
                smoother.reset()

            # Timing statistics
            obs_read_times_ms = []
            preprocess_times_ms = []
            inference_times_ms = []
            env_step_times_ms = []
            total_times_ms = []
            model_forward_flags = []
            episode_reward = 0.0
            episode_max_reward = float('-inf')
            episode_raw_actions: list[np.ndarray] = []

            with torch.inference_mode():
                for t in tqdm(range(eval_cfg.max_timesteps), desc=f"Episode {episode_idx + 1}"):
                    start_total = time.perf_counter()

                    # Read observations from the environment
                    obs = env._get_image_obs()
                    qpos_obs = env._get_qpos_obs()
                    obs['qpos'] = qpos_obs['qpos']
                    end_obs_read = time.perf_counter()

                    video_frame = _get_video_frame(obs, artifact_paths['video_camera_name'])
                    video_recorder.write(video_frame)

                    # Prepare the observation for the agent
                    observation = prepare_observation(
                        obs,
                        camera_names,
                        image_resize_shape=image_resize_shape,
                    )
                    end_preprocess = time.perf_counter()

                    # Select an action (the agent handles queue management internally)
                    action_queue = getattr(agent, '_queues', {}).get('action', None)
                    model_inference_triggered = len(action_queue) == 0 if action_queue is not None else True
                    start_inference = time.perf_counter()
                    action = agent.select_action(observation)

                    if str(device).startswith('cuda') and torch.cuda.is_available():
                        torch.cuda.synchronize()
                    end_inference = time.perf_counter()

                    # Convert to numpy
                    raw_action = _to_numpy_action(action)
                    episode_raw_actions.append(raw_action.astype(np.float32, copy=True))

                    # Debug: print the action for the current timestep (config-controlled)
                    if eval_cfg.get('verbose_action', False):
                        print(f"\n[Step {t:3d}] predicted action: {raw_action}")
                        print(f"  - action shape: {raw_action.shape}")
                        print(f"  - action range: [{raw_action.min():.4f}, {raw_action.max():.4f}]")
                        print(f"  - action mean: {raw_action.mean():.4f}, std: {raw_action.std():.4f}")

                    # Optional action smoothing
                    executed_action = raw_action.copy()
                    if smoother:
                        executed_action = smoother.smooth(executed_action)

                    # Execute the action
                    start_env_step = time.perf_counter()
                    execute_policy_action(env, executed_action)
                    end_env_step = time.perf_counter()
                    executed_poses = _get_executed_ee_poses(env)
                    reward = getattr(env, 'rew', None)
                    if reward is not None:
                        reward = float(reward)
                        episode_reward += reward
                        episode_max_reward = max(episode_max_reward, reward)
                    if not eval_cfg.headless:
                        env.render()

                    end_total = time.perf_counter()

                    step_timing_ms = {
                        'obs_read_time_ms': (end_obs_read - start_total) * 1000.0,
                        'preprocess_time_ms': (end_preprocess - end_obs_read) * 1000.0,
                        'inference_time_ms': (end_inference - start_inference) * 1000.0,
                        'env_step_time_ms': (end_env_step - start_env_step) * 1000.0,
                        'total_time_ms': (end_total - start_total) * 1000.0,
                    }

                    # Record timings
                    obs_read_times_ms.append(step_timing_ms['obs_read_time_ms'])
                    preprocess_times_ms.append(step_timing_ms['preprocess_time_ms'])
                    inference_times_ms.append(step_timing_ms['inference_time_ms'])
                    env_step_times_ms.append(step_timing_ms['env_step_time_ms'])
                    total_times_ms.append(step_timing_ms['total_time_ms'])
                    model_forward_flags.append(bool(model_inference_triggered))
                    global_obs_read_times_ms.append(step_timing_ms['obs_read_time_ms'])
                    global_preprocess_times_ms.append(step_timing_ms['preprocess_time_ms'])
                    global_inference_times_ms.append(step_timing_ms['inference_time_ms'])
                    global_env_step_times_ms.append(step_timing_ms['env_step_time_ms'])
                    global_total_times_ms.append(step_timing_ms['total_time_ms'])
                    global_model_forward_flags.append(bool(model_inference_triggered))

                    if artifact_paths['trajectory_npz'] is not None:
                        _append_rollout_step(
                            rollout_trajectory,
                            episode_index=episode_idx,
                            timestep=t,
                            reward=reward,
                            raw_action=raw_action,
                            executed_action=executed_action,
                            executed_poses=executed_poses,
                            timing_ms=step_timing_ms,
                            model_inference_triggered=model_inference_triggered,
                        )

            # =================================================================
            # Print episode statistics
            # =================================================================
            avg_obs_read_time_ms = _mean_or_zero(obs_read_times_ms)
            avg_preprocess_time_ms = _mean_or_zero(preprocess_times_ms)
            avg_inference_time_ms = _mean_or_zero(inference_times_ms)
            avg_env_step_time_ms = _mean_or_zero(env_step_times_ms)
            avg_total_time_ms = _mean_or_zero(total_times_ms)
            timing_breakdown = _summarize_timing_breakdown(
                {
                    'obs_read': obs_read_times_ms,
                    'preprocess': preprocess_times_ms,
                    'inference': inference_times_ms,
                    'env_step': env_step_times_ms,
                    'loop_total': total_times_ms,
                },
                model_forward_flags,
            )
            episode_artifact_paths = {
                'video': artifact_paths['video_mp4'],
                'trajectory': artifact_paths['trajectory_npz'],
                'trajectory_image': _save_rollout_trajectory_image(
                    env,
                    _episode_trajectory_image_path(artifact_paths, episode_idx),
                    episode_raw_actions,
                    artifact_paths['trajectory_image_camera_name'],
                ),
                'timing': artifact_paths['timing_json'] or artifact_paths['summary_json'],
            }

            stats = {
                'inference_fps': 1000.0 / avg_inference_time_ms if avg_inference_time_ms > 0 else 0.0,
                'control_fps': 1000.0 / avg_total_time_ms if avg_total_time_ms > 0 else 0.0,
                'avg_obs_read_time_ms': avg_obs_read_time_ms,
                'avg_preprocess_time_ms': avg_preprocess_time_ms,
                'avg_inference_time_ms': avg_inference_time_ms,
                'avg_env_step_time_ms': avg_env_step_time_ms,
                'avg_total_time_ms': avg_total_time_ms,
                'num_inferences': int(sum(model_forward_flags)),
                'num_model_forwards': int(sum(model_forward_flags)),
                'num_steps': len(total_times_ms),
                'episode_reward': float(episode_reward),
                'episode_max_reward': (
                    float(episode_max_reward) if episode_max_reward != float('-inf') else None
                ),
                'artifact_paths': episode_artifact_paths,
                'timing_breakdown_ms': timing_breakdown['all_steps_ms'],
                'timing_summary': timing_breakdown,
            }
            all_stats.append(stats)
            episode_rewards.append(float(episode_reward))
            if episode_max_reward != float('-inf'):
                episode_max_rewards.append(float(episode_max_reward))

            print(f"\nEpisode {episode_idx + 1} finished ({eval_cfg.max_timesteps} timesteps)")
            print(f"  Model inference FPS: {stats['inference_fps']:.2f} Hz")
            print(f"  Control-loop FPS: {stats['control_fps']:.2f} Hz")
            print(f"  Avg observation-read time: {stats['avg_obs_read_time_ms']:.2f} ms")
            print(f"  Avg preprocess time: {stats['avg_preprocess_time_ms']:.2f} ms")
            print(f"  Avg inference time: {stats['avg_inference_time_ms']:.2f} ms")
            print(f"  Avg env-step time: {stats['avg_env_step_time_ms']:.2f} ms")
            print(f"  Avg total time: {stats['avg_total_time_ms']:.2f} ms")
            print(f"  Total inference count: {stats['num_inferences']}")
            print(f"  Episode cumulative reward: {stats['episode_reward']:.2f}")

        # =====================================================================
        # Overall statistics
        # =====================================================================
        print(f"\n{'='*60}")
        print("Evaluation complete!")
        print(f"{'='*60}")

        summary = {
            'num_episodes': int(eval_cfg.num_episodes),
            'episode_rewards': episode_rewards,
            'episode_max_rewards': episode_max_rewards,
            'avg_reward': float(np.mean(episode_rewards)) if episode_rewards else 0.0,
            'avg_max_reward': float(np.mean(episode_max_rewards)) if episode_max_rewards else 0.0,
            'episodes': all_stats,
            'artifact_dir': artifact_paths['output_dir'],
            'artifacts': artifact_paths,
        }

        if all_stats:
            avg_inference_fps = np.mean([s['inference_fps'] for s in all_stats])
            avg_control_fps = np.mean([s['control_fps'] for s in all_stats])
            avg_obs_read_time = _mean_or_zero(global_obs_read_times_ms)
            avg_preprocess_time = _mean_or_zero(global_preprocess_times_ms)
            avg_inference_time = _mean_or_zero(global_inference_times_ms)
            avg_env_step_time = _mean_or_zero(global_env_step_times_ms)
            avg_total_time = _mean_or_zero(global_total_times_ms)
            summary.update({
                'avg_inference_fps': float(avg_inference_fps),
                'avg_control_fps': float(avg_control_fps),
                'avg_obs_read_time_ms': float(avg_obs_read_time),
                'avg_preprocess_time_ms': float(avg_preprocess_time),
                'avg_inference_time_ms': float(avg_inference_time),
                'avg_env_step_time_ms': float(avg_env_step_time),
                'avg_total_time_ms': float(avg_total_time),
                'timing_summary': _summarize_timing_breakdown(
                    {
                        'obs_read': global_obs_read_times_ms,
                        'preprocess': global_preprocess_times_ms,
                        'inference': global_inference_times_ms,
                        'env_step': global_env_step_times_ms,
                        'loop_total': global_total_times_ms,
                    },
                    global_model_forward_flags,
                ),
            })

            print(f"\nOverall statistics ({eval_cfg.num_episodes} episodes):")
            print(f"  Avg model inference FPS: {avg_inference_fps:.2f} Hz")
            print(f"  Avg control-loop FPS: {avg_control_fps:.2f} Hz")
            print(f"  Avg observation-read time: {avg_obs_read_time:.2f} ms")
            print(f"  Avg preprocess time: {avg_preprocess_time:.2f} ms")
            print(f"  Avg inference time: {avg_inference_time:.2f} ms")
            print(f"  Avg env-step time: {avg_env_step_time:.2f} ms")
            print(f"  Avg total time: {avg_total_time:.2f} ms")
            print(f"  Avg cumulative reward: {summary['avg_reward']:.2f}")

        if artifact_paths['trajectory_npz'] is not None:
            _save_rollout_trajectory_npz(artifact_paths['trajectory_npz'], rollout_trajectory)
        if artifact_paths['summary_json'] is not None:
            _save_summary_json(artifact_paths['summary_json'], summary)
        if artifact_paths['timing_json'] is not None:
            _save_summary_json(artifact_paths['timing_json'], summary.get('timing_summary', {}))
        print()
        return _json_friendly(summary)
    finally:
        video_recorder.close()
        _close_env(env)


@hydra.main(version_base=None, config_path="../../vla/conf", config_name="config")
def main(cfg: DictConfig):
    return _run_eval(cfg)


if __name__ == '__main__':
    main()
@@ -1,986 +0,0 @@
import sys
import os
import logging
import json
import pickle
import importlib
import hydra
import torch
import re
from tqdm import tqdm
from omegaconf import DictConfig, OmegaConf
from torch.utils.data import DataLoader, random_split
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from pathlib import Path

# Ensure the correct import path (cannot rely on cwd, since Hydra switches cwd at runtime)
def _ensure_repo_root_on_syspath():
    repo_root = Path(__file__).resolve().parents[3]
    repo_root_str = str(repo_root)
    if repo_root_str in sys.path:
        sys.path.remove(repo_root_str)
    sys.path.insert(0, repo_root_str)
    return repo_root

_PROBLEMATIC_LD_PRELOAD_SUBSTRINGS = ('/usr/NX/lib/libnxegl.so', 'libnxegl.so')


def _clean_ld_preload_value(value: str | None):
    if not value:
        return value, False
    entries = [entry for entry in value.split() if entry]
    filtered = [
        entry for entry in entries
        if not any(marker in entry for marker in _PROBLEMATIC_LD_PRELOAD_SUBSTRINGS)
    ]
    changed = filtered != entries
    cleaned = ' '.join(filtered) if filtered else None
    return cleaned, changed

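A standalone sketch of the whitespace-splitting filter `_clean_ld_preload_value` performs; `MARKERS` here stands in for the `_PROBLEMATIC_LD_PRELOAD_SUBSTRINGS` constant above:

```python
MARKERS = ('libnxegl.so',)

def clean(value):
    # Drop entries containing any problematic marker; report whether anything changed.
    if not value:
        return value, False
    entries = value.split()
    kept = [e for e in entries if not any(m in e for m in MARKERS)]
    return (' '.join(kept) if kept else None), kept != entries

print(clean('libfoo.so /usr/NX/lib/libnxegl.so'))  # ('libfoo.so', True)
print(clean('libfoo.so'))                          # ('libfoo.so', False)
```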
def _maybe_reexec_without_problematic_ld_preload():
    if __name__ != '__main__':
        return False
    if os.environ.get('_ROBOIMI_LD_PRELOAD_SANITIZED') == '1':
        return False

    cleaned, changed = _clean_ld_preload_value(os.environ.get('LD_PRELOAD'))
    if not changed:
        return False

    new_env = dict(os.environ)
    new_env['_ROBOIMI_LD_PRELOAD_SANITIZED'] = '1'
    if cleaned:
        new_env['LD_PRELOAD'] = cleaned
    else:
        new_env.pop('LD_PRELOAD', None)

    print(
        'Detected problematic LD_PRELOAD entry for CUDA/cuDNN; re-executing train_vla.py without it.',
        file=sys.stderr,
        flush=True,
    )
    os.execvpe(sys.executable, [sys.executable, *sys.argv], new_env)

_REPO_ROOT = _ensure_repo_root_on_syspath()

from hydra.utils import instantiate

log = logging.getLogger(__name__)

# Register a list-length resolver (used in configs such as ${len:${data.camera_names}})
if not OmegaConf.has_resolver("len"):
    OmegaConf.register_new_resolver("len", lambda x: len(x))


def _resolve_run_output_dir() -> Path:
    try:
        from hydra.core.hydra_config import HydraConfig
        if HydraConfig.initialized():
            output_dir = HydraConfig.get().runtime.output_dir
            if output_dir:
                return Path(output_dir).resolve()
    except Exception:
        pass
    return Path.cwd().resolve()

_maybe_reexec_without_problematic_ld_preload()


def _configure_cuda_runtime(cfg):
    """Apply process-level CUDA runtime switches required by this environment."""
    if str(cfg.train.device).startswith('cuda') and bool(cfg.train.get('disable_cudnn', False)):
        torch.backends.cudnn.enabled = False
        log.warning('⚠️ cuDNN disabled per config; GPU convolutions will fall back to non-cuDNN implementations')


def recursive_to_device(data, device):
    """
    Recursively move tensors in a nested dict/list structure to the given device.

    Args:
        data: a dict, list, or tensor
        device: the target device (e.g. 'cuda', 'cpu')

    Returns:
        The same structure with every tensor moved to the device
    """
    if isinstance(data, torch.Tensor):
        return data.to(device)
    elif isinstance(data, dict):
        return {k: recursive_to_device(v, device) for k, v in data.items()}
    elif isinstance(data, list):
        return [recursive_to_device(v, device) for v in data]
    return data

def resolve_resume_checkpoint(resume_ckpt, checkpoint_dir):
    """
    Resolve the checkpoint path used to resume training.

    Args:
        resume_ckpt: the resume_ckpt config value; a path or "auto"
        checkpoint_dir: the default checkpoint directory

    Returns:
        Path or None
    """
    if resume_ckpt is None:
        return None

    if str(resume_ckpt).lower() != "auto":
        return Path(resume_ckpt)

    pattern = re.compile(r"vla_model_step_(\d+)\.pt$")
    candidates = []
    for ckpt_path in checkpoint_dir.glob("vla_model_step_*.pt"):
        match = pattern.search(ckpt_path.name)
        if match:
            candidates.append((int(match.group(1)), ckpt_path))

    if not candidates:
        return None
    return max(candidates, key=lambda x: x[0])[1]

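The "auto" branch above reduces to a regex scan for the largest step number. A sketch of the same selection over plain filenames, with no filesystem access (`latest_step_ckpt` is a hypothetical helper name):

```python
import re

def latest_step_ckpt(names):
    # Keep only names matching vla_model_step_<N>.pt and return the one with the largest N.
    pattern = re.compile(r"vla_model_step_(\d+)\.pt$")
    candidates = [(int(m.group(1)), n) for n in names if (m := pattern.search(n))]
    return max(candidates)[1] if candidates else None

names = ['vla_model_best.pt', 'vla_model_step_500.pt', 'vla_model_step_2000.pt']
print(latest_step_ckpt(names))  # vla_model_step_2000.pt
```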
def get_lr_schedule_with_warmup(optimizer, warmup_steps, max_steps, scheduler_type='cosine', min_lr=0):
    """
    Create a learning-rate scheduler with linear warmup.

    Args:
        optimizer: a PyTorch optimizer
        warmup_steps: number of warmup steps
        max_steps: total number of training steps
        scheduler_type: post-warmup schedule ('cosine' or 'constant')
        min_lr: minimum learning rate (for cosine decay)

    Returns:
        A LambdaLR scheduler
    """
    import math
    # Capture the initial learning rate before LambdaLR modifies it
    base_lr = optimizer.param_groups[0]['lr']
    min_lr_ratio = min_lr / base_lr if base_lr > 0 else 0.0

    def lr_lambda(step):
        # Warmup phase: increase linearly from 0 to 1
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))

        # Post-warmup phase
        if scheduler_type == 'cosine':
            # Cosine annealing from 1 down to min_lr_ratio
            progress = float(step - warmup_steps) / float(max(1, max_steps - warmup_steps))
            cosine_decay = 0.5 * (1.0 + math.cos(math.pi * progress))
            return max(min_lr_ratio, cosine_decay)
        else:
            # Constant learning rate
            return 1.0

    return LambdaLR(optimizer, lr_lambda)

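The multiplier `lr_lambda` returns can be evaluated without torch. A sketch of the warmup + cosine curve at a few representative steps, under the assumed settings warmup_steps=10, max_steps=110, min_lr_ratio=0:

```python
import math

def lr_multiplier(step, warmup_steps=10, max_steps=110, min_lr_ratio=0.0):
    # Linear warmup to 1.0, then cosine decay toward min_lr_ratio.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return max(min_lr_ratio, 0.5 * (1.0 + math.cos(math.pi * progress)))

print(lr_multiplier(0))    # 0.0 (start of warmup)
print(lr_multiplier(5))    # 0.5 (halfway through warmup)
print(lr_multiplier(10))   # 1.0 (peak: cos(0))
print(round(lr_multiplier(110), 6))  # 0.0 (end of schedule: cos(pi))
```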
def build_training_optimizer(agent, lr, weight_decay):
    """Build the optimizer for the training script, reusing any head-provided parameter groups."""
    trainable_params = [param for param in agent.parameters() if param.requires_grad]
    noise_pred_net = getattr(agent, 'noise_pred_net', None)
    get_optim_groups = getattr(noise_pred_net, 'get_optim_groups', None)
    use_head_groups = callable(get_optim_groups)

    if not use_head_groups:
        return AdamW(trainable_params, lr=lr, weight_decay=weight_decay)

    head_groups = []
    grouped_param_ids = set()
    for group in get_optim_groups(weight_decay=weight_decay):
        params = [param for param in group['params'] if param.requires_grad]
        if not params:
            continue
        normalized_group = dict(group)
        normalized_group['params'] = params
        head_groups.append(normalized_group)

        for param in params:
            param_id = id(param)
            if param_id in grouped_param_ids:
                raise ValueError('Head optimizer groups contain duplicate parameters')
            grouped_param_ids.add(param_id)

    head_trainable_param_ids = {
        id(param) for param in noise_pred_net.parameters() if param.requires_grad
    }
    missing_head_param_ids = head_trainable_param_ids - grouped_param_ids
    if missing_head_param_ids:
        raise ValueError('Head optimizer groups missed trainable head parameters')

    remaining_params = [
        param for param in trainable_params
        if id(param) not in grouped_param_ids
    ]

    optim_groups = head_groups
    if remaining_params:
        optim_groups = optim_groups + [{
            'params': remaining_params,
            'weight_decay': weight_decay,
        }]
        grouped_param_ids.update(id(param) for param in remaining_params)

    all_trainable_param_ids = {id(param) for param in trainable_params}
    if grouped_param_ids != all_trainable_param_ids:
        raise ValueError('Optimizer parameter groups must include each trainable parameter exactly once')

    return AdamW(optim_groups, lr=lr, weight_decay=weight_decay)

def _init_swanlab(cfg):
    """Initialize SwanLab on demand, failing fast when the dependency or login is missing."""
    if not bool(cfg.train.get('use_swanlab', False)):
        return None

    try:
        swanlab = importlib.import_module("swanlab")
    except ImportError as exc:
        raise RuntimeError(
            "SwanLab logging is enabled, but the 'swanlab' package could not be imported."
        ) from exc

    def _to_plain_config(value):
        if isinstance(value, dict):
            return {key: _to_plain_config(val) for key, val in value.items()}
        if isinstance(value, list):
            return [_to_plain_config(item) for item in value]
        if isinstance(value, tuple):
            return tuple(_to_plain_config(item) for item in value)

        items_method = getattr(value, 'items', None)
        if callable(items_method):
            try:
                return {key: _to_plain_config(val) for key, val in items_method()}
            except Exception:
                pass

        return value

    swanlab_config = {
        key: _to_plain_config(cfg[key])
        for key in ('train', 'data', 'agent')
        if key in cfg
    }

    init_kwargs = {
        'project': cfg.train.get('swanlab_project', 'roboimi-vla'),
        'config': swanlab_config,
    }
    run_name = cfg.train.get('swanlab_run_name', None)
    if run_name:
        init_kwargs['experiment_name'] = run_name

    try:
        swanlab.init(**init_kwargs)
    except Exception as exc:
        raise RuntimeError(
            f"SwanLab logging is enabled, but SwanLab init/login failed: {exc}"
        ) from exc

    return swanlab

def _log_to_swanlab(swanlab_module, payload, step=None):
    if swanlab_module is None:
        return
    try:
        swanlab_module.log(payload, step=step)
    except Exception as exc:
        log.warning(f"SwanLab log failed at step {step}: {exc}")


def _log_rollout_trajectory_images_to_swanlab(
    swanlab_module,
    rollout_stats,
    step=None,
    context_label: str = 'rollout',
):
    if swanlab_module is None or not rollout_stats:
        return

    image_factory = getattr(swanlab_module, 'Image', None)
    if image_factory is None:
        return

    payload = {}
    for fallback_episode_index, episode in enumerate(rollout_stats.get('episodes', [])):
        if not isinstance(episode, dict):
            continue
        artifact_paths = episode.get('artifact_paths', {})
        if not isinstance(artifact_paths, dict):
            continue
        trajectory_image = artifact_paths.get('trajectory_image')
        if not trajectory_image:
            continue
        episode_index = int(episode.get('episode_index', fallback_episode_index))
        caption = f'{context_label} trajectory image - episode {episode_index} (front)'
        try:
            payload[f'rollout/trajectory_image_episode_{episode_index}'] = image_factory(
                str(trajectory_image),
                caption=caption,
            )
        except Exception as exc:
            log.warning(
                f"SwanLab rollout trajectory image upload prep failed at step {step}: {exc}"
            )

    if payload:
        _log_to_swanlab(swanlab_module, payload, step=step)


def _finish_swanlab(swanlab_module):
    if swanlab_module is None:
        return
    try:
        swanlab_module.finish()
    except Exception as exc:
        log.warning(f"SwanLab finish failed: {exc}")

def _run_training(cfg: DictConfig):
|
|
||||||
"""
|
|
||||||
VLA 训练脚本(ResNet 骨干网络 + Diffusion 策略)
|
|
||||||
|
|
||||||
该脚本功能:
|
|
||||||
1. 从 HDF5 文件加载数据集
|
|
||||||
2. 实例化带 ResNet 视觉编码器的 VLAAgent
|
|
||||||
3. 训练基于扩散的动作预测模型
|
|
||||||
4. 定期保存检查点
|
|
||||||
"""
|
|
||||||
|
|
||||||
# 打印配置
|
|
||||||
print("=" * 80)
|
|
||||||
print("VLA 训练配置:")
|
|
||||||
print("=" * 80)
|
|
||||||
print(OmegaConf.to_yaml(cfg))
|
|
||||||
print("=" * 80)
|
|
||||||
|
|
||||||
log.info(f"🚀 开始 VLA 训练 (设备: {cfg.train.device})")
|
|
||||||
_configure_cuda_runtime(cfg)
|
|
||||||
swanlab_module = _init_swanlab(cfg)
|
|
||||||
try:
|
|
||||||
# 创建检查点目录
|
|
||||||
run_output_dir = _resolve_run_output_dir()
|
|
||||||
checkpoint_dir = run_output_dir / "checkpoints"
|
|
||||||
checkpoint_dir.mkdir(parents=True, exist_ok=True)
|
|
||||||
default_best_model_path = checkpoint_dir / "vla_model_best.pt"
|
|
||||||
|
|
||||||
# =========================================================================
|
|
||||||
# 1. 实例化数据集与 DataLoader
|
|
||||||
# =========================================================================
|
|
||||||
log.info("📦 加载数据集...")
|
|
||||||
try:
|
|
||||||
dataset_image_resize_shape = cfg.data.get('image_resize_shape', (224, 224))
|
|
||||||
vision_backbone_cfg = cfg.agent.get('vision_backbone', None)
|
|
||||||
if vision_backbone_cfg is not None and 'dataset_image_resize_shape' in vision_backbone_cfg:
|
|
||||||
dataset_image_resize_shape = vision_backbone_cfg.get('dataset_image_resize_shape')
|
|
||||||
dataset = instantiate(
|
|
||||||
cfg.data,
|
|
||||||
image_resize_shape=dataset_image_resize_shape,
|
|
||||||
)
|
|
||||||
log.info(f"✅ 数据集加载成功。总样本数: {len(dataset)}")
|
|
||||||
except Exception as e:
|
|
||||||
log.error(f"❌ 数据集加载失败: {e}")
|
|
||||||
raise
|
|
||||||
|
|
||||||
        # Train/validation split
        val_split = float(cfg.train.get('val_split', 0.1))
        seed = int(cfg.train.get('seed', 42))
        val_size = int(len(dataset) * val_split)
        train_size = len(dataset) - val_size
        if val_size > 0:
            train_dataset, val_dataset = random_split(
                dataset,
                [train_size, val_size],
                generator=torch.Generator().manual_seed(seed)
            )
            log.info(f"✅ Dataset split: train={train_size}, val={val_size} (val ratio={val_split})")
        else:
            train_dataset, val_dataset = dataset, None
            log.info("✅ Dataset split: all samples used for training, val=0 (val ratio=0)")

        train_batch_size = int(cfg.train.batch_size)
        train_drop_last = len(train_dataset) >= train_batch_size
        if not train_drop_last:
            log.warning(
                "⚠️ Training set size (%s) is smaller than batch_size (%s); keeping the last "
                "incomplete batch to avoid an empty training loader",
                len(train_dataset),
                train_batch_size,
            )
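The split above derives `val_size` from a ratio and drives `random_split` with a seeded generator so the partition is reproducible across runs. A minimal pure-Python sketch of the same sizing rule (the helper name and shuffle-based mechanics are illustrative, not from the codebase):

```python
import random

def split_indices(num_samples, val_split, seed):
    """Partition dataset indices into (train, val) using the same
    val_size = int(num_samples * val_split) rule, with a seeded shuffle
    so the split is identical on every run."""
    val_size = int(num_samples * val_split)
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[val_size:], indices[:val_size]

train_idx, val_idx = split_indices(100, 0.1, seed=42)
```

Because the generator is seeded, resuming a run re-creates exactly the same train/val membership, which keeps validation losses comparable across restarts.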
        train_loader = DataLoader(
            train_dataset,
            batch_size=train_batch_size,
            shuffle=True,
            num_workers=cfg.train.num_workers,
            pin_memory=(cfg.train.device != "cpu"),
            persistent_workers=False,
            drop_last=train_drop_last
        )

        val_loader = None
        if val_dataset is not None:
            val_loader = DataLoader(
                val_dataset,
                batch_size=train_batch_size,
                shuffle=False,
                num_workers=cfg.train.num_workers,
                pin_memory=(cfg.train.device != "cpu"),
                persistent_workers=False,
                drop_last=False
            )

        log.info(f"✅ Training loader batches per epoch: {len(train_loader)}")
        if val_loader is not None:
            log.info(f"✅ Validation loader batches per epoch: {len(val_loader)}")
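The `drop_last` guard above trades a possibly smaller final batch for a never-empty loader. The resulting batch count per epoch can be sketched as follows (hypothetical helper, not part of the codebase):

```python
def batches_per_epoch(num_samples, batch_size):
    """Mirror the loader setup: drop the last incomplete batch only when
    the dataset holds at least one full batch, so the loader is never empty."""
    drop_last = num_samples >= batch_size
    if drop_last:
        return num_samples // batch_size
    # Fewer samples than one batch: keep the single partial batch.
    return 1 if num_samples > 0 else 0
```

With `drop_last=True`, `len(train_loader)` equals `num_samples // batch_size`, which is what the epoch arithmetic later in the loop relies on.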
        # =========================================================================
        # 2. Load dataset statistics (passed to the agent)
        # =========================================================================
        log.info("💾 Loading dataset statistics...")
        dataset_stats = None
        try:
            dataset_dir = cfg.data.get('dataset_dir', 'roboimi/demos/dataset/sim_transfer')
            stats_path = Path(dataset_dir) / 'dataset_stats.pkl'

            if stats_path.exists():
                with open(stats_path, 'rb') as f:
                    stats = pickle.load(f)

                # Flatten the stats dict (nested structure → flat structure) to match
                # the format expected by NormalizationModule
                dataset_stats = {
                    'action_mean': stats['action_mean'].tolist(),
                    'action_std': stats['action_std'].tolist(),
                    'action_min': stats['action_min'].tolist(),
                    'action_max': stats['action_max'].tolist(),
                    'qpos_mean': stats['qpos_mean'].tolist(),
                    'qpos_std': stats['qpos_std'].tolist(),
                    'qpos_min': stats['qpos_min'].tolist(),
                    'qpos_max': stats['qpos_max'].tolist(),
                }
                log.info(f"✅ Dataset statistics loaded (normalization: {cfg.agent.normalization_type})")
            else:
                log.warning(f"⚠️ Statistics file not found: {stats_path}")
                log.warning("⚠️ Actions cannot be de-normalized at inference time!")

        except Exception as e:
            log.warning(f"⚠️ Failed to load statistics: {e}")
            log.warning("⚠️ Training will continue, but inference may not work correctly")
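The block above converts each array-valued statistic to a plain list before handing it to the agent. A minimal sketch of that flattening, assuming the pickle holds array-like values (plain sequences stand in here for arrays with `.tolist()`):

```python
# The eight statistic keys mirrored from the training script.
STAT_KEYS = (
    'action_mean', 'action_std', 'action_min', 'action_max',
    'qpos_mean', 'qpos_std', 'qpos_min', 'qpos_max',
)

def flatten_stats(raw_stats):
    """Turn each array-like statistic into a plain Python list so it can be
    serialized inside checkpoints and consumed by a normalization module."""
    return {key: list(raw_stats[key]) for key in STAT_KEYS}
```

Storing plain lists (rather than array objects) keeps the checkpoint payload portable across NumPy/Torch versions.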
        # =========================================================================
        # 3. Instantiate the VLA agent
        # =========================================================================
        log.info("🤖 Initializing VLA agent...")
        try:
            # Pass dataset_stats (and normalization_type via the config) to the agent
            agent = instantiate(cfg.agent, dataset_stats=dataset_stats)
            agent.to(cfg.train.device)
            agent.train()
            log.info(f"✅ Agent initialized and moved to {cfg.train.device}")

            # Count parameters
            total_params = sum(p.numel() for p in agent.parameters())
            trainable_params = sum(p.numel() for p in agent.parameters() if p.requires_grad)
            log.info(f"📊 Total parameters: {total_params:,}")
            log.info(f"📊 Trainable parameters: {trainable_params:,}")

        except Exception as e:
            log.error(f"❌ Agent initialization failed: {e}")
            raise
        # =========================================================================
        # 3.1 Load weights from a pretrained checkpoint (fine-tuning)
        # =========================================================================
        pretrained_ckpt = cfg.train.get('pretrained_ckpt', None)
        if pretrained_ckpt is not None:
            ckpt_path = Path(pretrained_ckpt)
            if ckpt_path.exists():
                log.info(f"🔄 [Finetune] Loading weights from pretrained checkpoint: {ckpt_path}")
                try:
                    checkpoint = torch.load(ckpt_path, map_location=cfg.train.device)

                    # Load model weights only (no optimizer or scheduler state)
                    missing_keys, unexpected_keys = agent.load_state_dict(
                        checkpoint['model_state_dict'],
                        strict=False  # allow partial loading when structures do not fully match
                    )

                    log.info("✅ [Finetune] Model weights loaded")

                    if missing_keys:
                        log.warning(f"⚠️ [Finetune] Missing keys ({len(missing_keys)}): {missing_keys[:5]}...")
                    if unexpected_keys:
                        log.warning(f"⚠️ [Finetune] Unexpected keys ({len(unexpected_keys)}): {unexpected_keys[:5]}...")

                    log.info(f"📊 [Finetune] Pretrained info: step={checkpoint.get('step', 'N/A')}, loss={checkpoint.get('loss', 'N/A')}")
                    log.info(f"📈 [Finetune] Using the new training config (lr={cfg.train.lr}, max_steps={cfg.train.max_steps})")

                except Exception as e:
                    log.error(f"❌ [Finetune] Failed to load checkpoint: {e}")
                    log.warning("⚠️ Training will start from scratch")
            else:
                log.error(f"❌ [Finetune] Checkpoint file does not exist: {ckpt_path}")
                log.warning("⚠️ Training will start from scratch")
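`load_state_dict(..., strict=False)` returns the keys it could not match, which is what the warnings above report. The bookkeeping can be sketched with plain dicts (illustrative names; real loading goes through PyTorch's `nn.Module.load_state_dict`):

```python
def partial_load(model_state, ckpt_state):
    """Mimic strict=False loading: copy matching keys into model_state and
    report the rest. Returns (missing_keys, unexpected_keys), the same pair
    nn.Module.load_state_dict returns."""
    missing = [k for k in model_state if k not in ckpt_state]       # in model, absent from checkpoint
    unexpected = [k for k in ckpt_state if k not in model_state]    # in checkpoint, absent from model
    for k in model_state:
        if k in ckpt_state:
            model_state[k] = ckpt_state[k]
    return missing, unexpected
```

This is why fine-tuning from a checkpoint with a different head still works: matching backbone weights are loaded, and the mismatched keys are merely logged.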
        # =========================================================================
        # 4. Set up the optimizer and learning-rate scheduler
        # =========================================================================
        weight_decay = float(cfg.train.get('weight_decay', 1e-5))
        grad_clip = float(cfg.train.get('grad_clip', 1.0))

        optimizer = build_training_optimizer(agent, lr=cfg.train.lr, weight_decay=weight_decay)
        log.info(f"🔧 Optimizer: AdamW (lr={cfg.train.lr}, weight_decay={weight_decay})")

        # Set up the learning-rate scheduler with warmup
        warmup_steps = int(cfg.train.get('warmup_steps', 500))
        scheduler_type = cfg.train.get('scheduler_type', 'cosine')
        min_lr = float(cfg.train.get('min_lr', 1e-6))

        scheduler = get_lr_schedule_with_warmup(
            optimizer,
            warmup_steps=warmup_steps,
            max_steps=cfg.train.max_steps,
            scheduler_type=scheduler_type,
            min_lr=min_lr
        )
        log.info(f"📈 LR scheduler: {scheduler_type} with {warmup_steps} warmup steps (min lr={min_lr})")
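`get_lr_schedule_with_warmup` is project-specific, but a common shape for a `'cosine'` schedule with warmup is linear ramp-up to the base rate followed by cosine decay to `min_lr`. A sketch of the per-step learning rate under that assumption (the exact project implementation may differ):

```python
import math

def warmup_cosine_lr(step, base_lr, warmup_steps, max_steps, min_lr):
    """Linear warmup to base_lr over warmup_steps, then cosine decay so the
    rate reaches min_lr at max_steps. A typical shape for a 'cosine' schedule
    with warmup; shown here only as an assumption about the helper above."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(max_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Warmup keeps early AdamW updates small while its moment estimates are still noisy, and the cosine tail lets the final steps settle near `min_lr`.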
        # =========================================================================
        # 4.1 Resume training (restore model, optimizer, scheduler, and step)
        # =========================================================================
        def extract_checkpoint_metric_baseline(checkpoint):
            checkpoint_loss = checkpoint.get('loss', None)
            checkpoint_val_loss = checkpoint.get('val_loss', None)
            checkpoint_rollout_reward = checkpoint.get('rollout_avg_reward', None)

            baseline_loss = float('inf')
            baseline_rollout_reward = float('-inf')
            if checkpoint_rollout_reward is not None:
                baseline_rollout_reward = float(checkpoint_rollout_reward)
            if checkpoint_val_loss is not None:
                baseline_loss = float(checkpoint_val_loss)
            elif checkpoint_loss is not None:
                baseline_loss = float(checkpoint_loss)
            return baseline_loss, baseline_rollout_reward

        start_step = 0
        resume_loss = None
        resume_best_loss = float('inf')
        resume_best_rollout_reward = float('-inf')
        best_model_path = None

        resume_ckpt = cfg.train.get('resume_ckpt', None)
        resume_path = resolve_resume_checkpoint(resume_ckpt, checkpoint_dir)
        if resume_ckpt is not None:
            if pretrained_ckpt is not None:
                log.warning("⚠️ [Resume] Both pretrained_ckpt and resume_ckpt are set; resume_ckpt takes precedence")
            if resume_path is None:
                log.warning("⚠️ [Resume] No resumable checkpoint found; training will start from scratch")
            elif not resume_path.exists():
                log.error(f"❌ [Resume] Checkpoint file does not exist: {resume_path}")
                log.warning("⚠️ Training will start from scratch")
            else:
                log.info(f"🔄 [Resume] Resuming training from checkpoint: {resume_path}")
                try:
                    checkpoint = torch.load(resume_path, map_location=cfg.train.device)

                    agent.load_state_dict(checkpoint['model_state_dict'], strict=True)
                    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
                    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

                    resume_step = int(checkpoint['step'])
                    start_step = resume_step + 1

                    loaded_loss = checkpoint.get('loss', None)
                    resume_loss = float(loaded_loss) if loaded_loss is not None else None
                    resume_best_loss, resume_best_rollout_reward = extract_checkpoint_metric_baseline(checkpoint)
                    if (
                        resume_best_rollout_reward != float('-inf')
                        or resume_best_loss != float('inf')
                    ):
                        best_model_path = resume_path

                    if default_best_model_path.exists():
                        try:
                            best_checkpoint = torch.load(default_best_model_path, map_location=cfg.train.device)
                            _, best_checkpoint_rollout_reward = (
                                extract_checkpoint_metric_baseline(best_checkpoint)
                            )
                            if best_checkpoint_rollout_reward != float('-inf'):
                                resume_best_rollout_reward = best_checkpoint_rollout_reward
                                best_model_path = default_best_model_path
                                log.info(
                                    "📈 [Resume] Restored the best rollout baseline from the best checkpoint: %s",
                                    default_best_model_path,
                                )
                        except Exception as e:
                            log.warning(
                                f"⚠️ [Resume] Failed to read the best checkpoint; falling back to the resumed checkpoint's validation baseline: {e}"
                            )

                    log.info(f"✅ [Resume] Resumed: last step={resume_step}, continuing from step {start_step}")
                    log.info(f"📈 [Resume] Current learning rate: {optimizer.param_groups[0]['lr']:.2e}")
                except Exception as e:
                    log.error(f"❌ [Resume] Resume failed: {e}")
                    log.warning("⚠️ Training will start from scratch")
                    start_step = 0
                    resume_loss = None
                    resume_best_loss = float('inf')
                    resume_best_rollout_reward = float('-inf')
        # =========================================================================
        # 5. Training loop
        # =========================================================================
        log.info("🏋️ Starting the training loop...")

        def build_agent_input(batch_data):
            """Build the agent's input format."""
            images = {}
            # SimpleRobotDataset returns keys in the observation.{cam_name} format
            for cam_name in cfg.data.camera_names:
                key = f"observation.{cam_name}"
                if key in batch_data:
                    images[cam_name] = batch_data[key]

            return {
                'images': images,
                'qpos': batch_data['observation.state'],  # SimpleRobotDataset uses observation.state
                'action': batch_data['action'],
                'action_is_pad': batch_data.get('action_is_pad', None)  # forward the padding mask
            }

        def save_checkpoint(checkpoint_path: Path, step: int, loss_value, val_loss=None, rollout_avg_reward=None):
            agent_stats = agent.get_normalization_stats()
            torch.save({
                'step': step,
                'model_state_dict': agent.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'loss': loss_value,
                'val_loss': val_loss,
                'rollout_avg_reward': rollout_avg_reward,
                'dataset_stats': agent_stats,  # save the agent's normalization statistics
                'current_lr': optimizer.param_groups[0]['lr'],
            }, checkpoint_path)
            return checkpoint_path

        def run_validation():
            """Run validation."""
            if val_loader is None:
                return None
            agent.eval()

            # Seed deterministically so the validation loss is reproducible;
            # this keeps validation losses comparable across steps
            torch.manual_seed(42)
            if torch.cuda.is_available():
                torch.cuda.manual_seed(42)

            total_loss = 0.0
            num_batches = 0
            with torch.no_grad():
                for val_batch in val_loader:
                    val_batch = recursive_to_device(val_batch, cfg.train.device)
                    val_input = build_agent_input(val_batch)
                    val_loss = agent.compute_loss(val_input)
                    total_loss += val_loss.item()
                    num_batches += 1
            agent.train()
            return total_loss / max(num_batches, 1)
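`run_validation` averages per-batch losses and guards the division with `max(num_batches, 1)`, so an exhausted or empty validation loader yields `0.0` instead of raising `ZeroDivisionError`. The accumulation in isolation:

```python
def mean_val_loss(batch_losses):
    """Average per-batch losses; an empty sequence yields 0.0 rather than
    raising ZeroDivisionError, matching the max(num_batches, 1) guard."""
    total = 0.0
    num_batches = 0
    for loss in batch_losses:
        total += loss
        num_batches += 1
    return total / max(num_batches, 1)
```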
        def run_rollout_validation(checkpoint_path: Path):
            from roboimi.demos.vla_scripts import eval_vla

            rollout_cfg = OmegaConf.create(OmegaConf.to_container(cfg, resolve=False))
            rollout_cfg.eval.ckpt_path = str(checkpoint_path)
            rollout_cfg.eval.num_episodes = int(cfg.train.get('rollout_num_episodes', 1))
            rollout_cfg.eval.headless = True
            rollout_cfg.eval.device = 'cpu'
            rollout_cfg.eval.verbose_action = False
            rollout_cfg.eval.record_video = False
            rollout_cfg.eval.save_trajectory_image = True
            rollout_cfg.eval.trajectory_image_camera_name = 'front'
            rollout_cfg.eval.save_summary_json = True
            rollout_cfg.eval.artifact_dir = str(
                (run_output_dir / 'rollout_artifacts' / checkpoint_path.stem).resolve()
            )

            log.info(
                "🎯 Starting checkpoint rollout validation: %s (episodes=%s, headless=True)",
                checkpoint_path,
                rollout_cfg.eval.num_episodes,
            )
            return eval_vla._run_eval(rollout_cfg)

        def run_checkpoint_rollout_validation(checkpoint_path: Path):
            if not bool(cfg.train.get('rollout_validate_on_checkpoint', False)):
                return None
            return run_rollout_validation(checkpoint_path)
        data_iter = iter(train_loader)
        pbar = tqdm(range(start_step, cfg.train.max_steps), desc="Training", ncols=100)

        steps_per_epoch = len(train_loader)
        rollout_val_freq_epochs = int(cfg.train.get('rollout_val_freq_epochs', 0) or 0)
        rollout_validation_enabled = rollout_val_freq_epochs > 0
        best_loss = resume_best_loss
        best_rollout_reward = resume_best_rollout_reward
        last_loss = resume_loss

        if start_step >= cfg.train.max_steps:
            log.warning(
                f"⚠️ [Resume] start_step={start_step} has reached/exceeded max_steps={cfg.train.max_steps}; skipping the training loop"
            )
        for step in pbar:
            try:
                batch = next(data_iter)
            except StopIteration:
                # Restart the iterator at the end of an epoch
                data_iter = iter(train_loader)
                batch = next(data_iter)

            # =====================================================================
            # Move the batch to the device
            # =====================================================================
            batch = recursive_to_device(batch, cfg.train.device)

            # =====================================================================
            # Prepare the agent input
            # =====================================================================
            # Dataset returns: {action, qpos, image_<cam_name>, ...}
            # Agent expects: {images: dict, qpos: tensor, action: tensor}
            agent_input = build_agent_input(batch)

            # =====================================================================
            # Forward pass and loss computation
            # =====================================================================
            try:
                loss = agent.compute_loss(agent_input)
            except Exception as e:
                log.error(f"❌ Forward pass failed at step {step}: {e}")
                raise

            last_loss = loss.item()

            # =====================================================================
            # Backward pass and optimization
            # =====================================================================
            optimizer.zero_grad()
            loss.backward()

            # Clip gradients to stabilize training
            torch.nn.utils.clip_grad_norm_(agent.parameters(), max_norm=grad_clip)

            optimizer.step()
            scheduler.step()
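The `StopIteration` handler at the top of the loop turns a finite DataLoader into an infinite stream: whenever an epoch ends, the iterator is simply rebuilt. The same pattern as a standalone generator (works with any re-iterable, not just a DataLoader):

```python
def cycle(loader):
    """Yield batches forever, rebuilding the iterator whenever the
    underlying loader is exhausted (i.e. at each epoch boundary)."""
    it = iter(loader)
    while True:
        try:
            yield next(it)
        except StopIteration:
            it = iter(loader)
            yield next(it)
```

This is why the loop can be driven by a global step counter (`range(start_step, max_steps)`) instead of nested epoch/batch loops.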
            # =====================================================================
            # Logging
            # =====================================================================
            if step % cfg.train.log_freq == 0:
                current_lr = optimizer.param_groups[0]['lr']
                best_loss_to_log = best_loss if best_loss != float('inf') else loss.item()
                pbar.set_postfix({
                    "loss": f"{loss.item():.4f}",
                    "lr": f"{current_lr:.2e}",
                    "best_loss": f"{best_loss_to_log:.4f}"
                })
                log.info(f"Step {step}/{cfg.train.max_steps} | loss: {loss.item():.4f} | lr: {current_lr:.2e}")
                _log_to_swanlab(
                    swanlab_module,
                    {
                        'train/loss': loss.item(),
                        'train/lr': current_lr,
                        'train/best_loss': best_loss_to_log,
                        'train/step': step,
                    },
                    step=step,
                )
            # =====================================================================
            # Checkpoint saving and validation
            # =====================================================================
            checkpoint_path = None
            val_loss = None
            if step > 0 and step % cfg.train.save_freq == 0:
                # Run validation
                val_loss = run_validation()
                if val_loss is not None:
                    log.info(f"Step {step}/{cfg.train.max_steps} | val loss: {val_loss:.4f}")
                    _log_to_swanlab(
                        swanlab_module,
                        {'val/loss': val_loss},
                        step=step,
                    )

                checkpoint_path = checkpoint_dir / f"vla_model_step_{step}.pt"
                save_checkpoint(
                    checkpoint_path,
                    step,
                    loss.item(),
                    val_loss=val_loss,
                )
                log.info(f"💾 Checkpoint saved: {checkpoint_path}")

                # Until the first rollout average reward is available, fall back to
                # the loss as the best-model metric
                if best_rollout_reward == float('-inf'):
                    eval_loss = val_loss if val_loss is not None else loss.item()
                    if eval_loss < best_loss:
                        best_loss = eval_loss
                        best_model_path = default_best_model_path
                        save_checkpoint(
                            best_model_path,
                            step,
                            loss.item(),
                            val_loss=val_loss,
                        )
                        log.info(f"🌟 Best model updated: {best_model_path} (val loss: {best_loss:.4f})")

                checkpoint_rollout_stats = run_checkpoint_rollout_validation(checkpoint_path)
                checkpoint_rollout_avg_reward = (
                    checkpoint_rollout_stats.get('avg_reward')
                    if checkpoint_rollout_stats is not None else None
                )
                if checkpoint_rollout_avg_reward is not None:
                    log.info(
                        f"Step {step}/{cfg.train.max_steps} | checkpoint rollout avg reward: "
                        f"{checkpoint_rollout_avg_reward:.4f}"
                    )
                    _log_to_swanlab(
                        swanlab_module,
                        {'rollout/avg_reward': checkpoint_rollout_avg_reward},
                        step=step,
                    )
                    if checkpoint_rollout_avg_reward > best_rollout_reward:
                        best_rollout_reward = checkpoint_rollout_avg_reward
                        best_model_path = default_best_model_path
                        save_checkpoint(
                            best_model_path,
                            step,
                            loss.item(),
                            val_loss=val_loss,
                            rollout_avg_reward=checkpoint_rollout_avg_reward,
                        )
                        log.info(
                            f"🌟 Best model updated: {best_model_path} "
                            f"(checkpoint rollout avg reward: {best_rollout_reward:.4f})"
                        )
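Best-model selection in this loop uses two metrics with a clear precedence: validation (or training) loss is consulted only until the first rollout reward exists; from then on, rollout reward decides exclusively. That policy in isolation (hypothetical helper, mirroring the conditions in the loop):

```python
def is_new_best(best_loss, best_reward, eval_loss=None, reward=None):
    """Return True when a checkpoint should replace the current best model.
    Rollout reward dominates; loss is only consulted while no reward has
    ever been observed (best_reward still at -inf)."""
    if reward is not None:
        return reward > best_reward
    if best_reward == float('-inf') and eval_loss is not None:
        return eval_loss < best_loss
    return False
```

Separating the two regimes avoids flip-flopping the "best" checkpoint between a loss-selected and a reward-selected model once rollout evaluation kicks in.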
            completed_steps = step + 1
            completed_epoch = (
                completed_steps // steps_per_epoch
                if steps_per_epoch > 0 else 0
            )
            should_run_epoch_rollout = (
                rollout_validation_enabled
                and steps_per_epoch > 0
                and completed_steps % steps_per_epoch == 0
                and completed_epoch > 0
                and completed_epoch % rollout_val_freq_epochs == 0
            )
            if should_run_epoch_rollout:
                if checkpoint_path is None:
                    checkpoint_path = checkpoint_dir / f"vla_model_step_{step}.pt"
                    save_checkpoint(
                        checkpoint_path,
                        step,
                        loss.item(),
                        val_loss=val_loss,
                    )
                    log.info(f"💾 Checkpoint saved before epoch rollout validation: {checkpoint_path}")

                rollout_stats = run_rollout_validation(checkpoint_path)
                rollout_avg_reward = (
                    rollout_stats.get('avg_reward')
                    if rollout_stats is not None else None
                )
                if rollout_avg_reward is not None:
                    log.info(
                        f"Step {step}/{cfg.train.max_steps} | epoch {completed_epoch} "
                        f"rollout avg reward: {rollout_avg_reward:.4f}"
                    )
                    _log_to_swanlab(
                        swanlab_module,
                        {
                            'rollout/avg_reward': rollout_avg_reward,
                            'rollout/epoch': completed_epoch,
                        },
                        step=step,
                    )
                    _log_rollout_trajectory_images_to_swanlab(
                        swanlab_module,
                        rollout_stats,
                        step=step,
                        context_label=f'epoch {completed_epoch} rollout',
                    )
                    if rollout_avg_reward > best_rollout_reward:
                        best_rollout_reward = rollout_avg_reward
                        best_model_path = default_best_model_path
                        save_checkpoint(
                            best_model_path,
                            step,
                            loss.item(),
                            val_loss=val_loss,
                            rollout_avg_reward=rollout_avg_reward,
                        )
                        log.info(
                            f"🌟 Best model updated: {best_model_path} "
                            f"(epoch {completed_epoch} rollout avg reward: {best_rollout_reward:.4f})"
                        )
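The epoch-boundary trigger above is pure integer arithmetic on the global step. Factored out for clarity (hypothetical helper; the loop inlines the same conditions):

```python
def should_run_epoch_rollout(step, steps_per_epoch, freq_epochs):
    """True only at the last step of an epoch whose index is a multiple of
    freq_epochs; disabled when either quantity is non-positive."""
    if freq_epochs <= 0 or steps_per_epoch <= 0:
        return False
    completed_steps = step + 1
    completed_epoch = completed_steps // steps_per_epoch
    return (
        completed_steps % steps_per_epoch == 0
        and completed_epoch > 0
        and completed_epoch % freq_epochs == 0
    )
```

Deriving the epoch from the step count (instead of tracking it separately) keeps resume logic simple: after restoring `start_step`, the next epoch boundary falls out of the same arithmetic.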
        # =========================================================================
        # 6. Save the final model
        # =========================================================================
        final_model_path = checkpoint_dir / "vla_model_final.pt"
        save_checkpoint(
            final_model_path,
            cfg.train.max_steps,
            last_loss,
        )
        log.info(f"💾 Final model saved: {final_model_path}")
        _log_to_swanlab(
            swanlab_module,
            {
                'final/checkpoint_path': str(final_model_path),
                'final/best_checkpoint_path': (
                    str(best_model_path) if best_model_path is not None else ''
                ),
            },
            step=cfg.train.max_steps,
        )

        log.info("✅ Training completed successfully!")
        if last_loss is not None:
            log.info(f"📊 Final loss: {last_loss:.4f}")
        else:
            log.info("📊 Final loss: N/A (no training steps were executed)")
        if best_rollout_reward != float('-inf'):
            log.info(f"📊 Best rollout avg reward: {best_rollout_reward:.4f}")
        elif best_loss != float('inf'):
            log.info(f"📊 Best loss: {best_loss:.4f}")
        else:
            log.info("📊 Best validation metric: N/A (no valid rollout/validation loss)")
    finally:
        _finish_swanlab(swanlab_module)


@hydra.main(version_base=None, config_path="../../vla/conf", config_name="config")
def main(cfg: DictConfig):
    _run_training(cfg)


if __name__ == "__main__":
    main()
201
roboimi/detr/LICENSE
Normal file
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
|
||||||
|
result of this License or out of the use or inability to use the
|
||||||
|
Work (including but not limited to damages for loss of goodwill,
|
||||||
|
work stoppage, computer failure or malfunction, or any and all
|
||||||
|
other commercial damages or losses), even if such Contributor
|
||||||
|
has been advised of the possibility of such damages.
|
||||||
|
|
||||||
|
9. Accepting Warranty or Additional Liability. While redistributing
|
||||||
|
the Work or Derivative Works thereof, You may choose to offer,
|
||||||
|
and charge a fee for, acceptance of support, warranty, indemnity,
|
||||||
|
or other liability obligations and/or rights consistent with this
|
||||||
|
License. However, in accepting such obligations, You may act only
|
||||||
|
on Your own behalf and on Your sole responsibility, not on behalf
|
||||||
|
of any other Contributor, and only if You agree to indemnify,
|
||||||
|
defend, and hold each Contributor harmless for any liability
|
||||||
|
incurred by, or claims asserted against, such Contributor by reason
|
||||||
|
of your accepting any such warranty or additional liability.
|
||||||
|
|
||||||
|
END OF TERMS AND CONDITIONS
|
||||||
|
|
||||||
|
APPENDIX: How to apply the Apache License to your work.
|
||||||
|
|
||||||
|
To apply the Apache License to your work, attach the following
|
||||||
|
boilerplate notice, with the fields enclosed by brackets "[]"
|
||||||
|
replaced with your own identifying information. (Don't include
|
||||||
|
the brackets!) The text should be enclosed in the appropriate
|
||||||
|
comment syntax for the file format. We also recommend that a
|
||||||
|
file or class name and description of purpose be included on the
|
||||||
|
same "printed page" as the copyright notice for easier
|
||||||
|
identification within third-party archives.
|
||||||
|
|
||||||
|
Copyright 2020 - present, Facebook, Inc
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
9
roboimi/detr/README.md
Normal file
@@ -0,0 +1,9 @@
This part of the codebase is modified from DETR https://github.com/facebookresearch/detr under APACHE 2.0.

@article{Carion2020EndtoEndOD,
  title={End-to-End Object Detection with Transformers},
  author={Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.12872}
}
Some files were not shown because too many files have changed in this diff.