diff --git a/.gitignore b/.gitignore
index 6e9a55d..7f6fcf4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -123,4 +123,9 @@ CLAUDE.md
GEMINI.md
# Copilot
-.github/copilot-instructions.md
\ No newline at end of file
+.github/copilot-instructions.md
+
+.hydra/
+
+# Local git worktrees
+.worktrees/
diff --git a/README.en.md b/README.en.md
deleted file mode 100644
index 024238a..0000000
--- a/README.en.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# robo-imi-act
-
-#### Description
-{**When you're done, you can delete the content in this README and update the file with details for others getting started with your repository**}
-
-#### Software Architecture
-Software architecture description
-
-#### Installation
-
-1. xxxx
-2. xxxx
-3. xxxx
-
-#### Instructions
-
-1. xxxx
-2. xxxx
-3. xxxx
-
-#### Contribution
-
-1. Fork the repository
-2. Create Feat_xxx branch
-3. Commit your code
-4. Create Pull Request
-
-
-#### Gitee Feature
-
-1. You can use Readme\_XXX.md to support different languages, such as Readme\_en.md, Readme\_zh.md
-2. Gitee blog [blog.gitee.com](https://blog.gitee.com)
-3. Explore open source project [https://gitee.com/explore](https://gitee.com/explore)
-4. The most valuable open source project [GVP](https://gitee.com/gvp)
-5. The manual of Gitee [https://gitee.com/help](https://gitee.com/help)
-6. The most popular members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
diff --git a/README.md b/README.md
index b72b112..b5d0744 100644
--- a/README.md
+++ b/README.md
@@ -1,39 +1,208 @@
-# robo-imi-act
+# RoboIMI
-#### 介绍
-{**以下是 Gitee 平台说明,您可以替换此简介**
-Gitee 是 OSCHINA 推出的基于 Git 的代码托管平台(同时支持 SVN)。专为开发者提供稳定、高效、安全的云端软件开发协作平台
-无论是个人、团队、或是企业,都能够用 Gitee 实现代码托管、项目管理、协作开发。企业项目请看 [https://gitee.com/enterprises](https://gitee.com/enterprises)}
+A MuJoCo-based robot simulation and imitation learning framework that implements a vision-language-action (VLA) model with diffusion policies for robotic manipulation tasks.
-#### 软件架构
-软件架构说明
+## Key Features
+- **Multi-robot support**: works with the Diana and vx300s arms and can be extended to other robots
+- **Diffusion policy**: uses diffusion models (DDPM/DDIM) for action-sequence prediction
+- **Vision-language-action model**: ResNet-18 visual backbone with spatial softmax for visual feature extraction
+- **Flexible control modes**: joint-space and end-effector (Cartesian) control
+- **Hydra configuration system**: modular configs that make experiments easy to run
+- **HDF5 dataset format**: efficient storage and loading of demonstration data
+- **Single-arm and dual-arm tasks**: supports both single-arm and bimanual manipulation
-#### 安装教程
+## Installation
-1. xxxx
-2. xxxx
-3. xxxx
+### Requirements
-#### 使用说明
+- Python 3.8+
+- CUDA-capable GPU (recommended for training)
+- Conda or Miniconda
-1. xxxx
-2. xxxx
-3. xxxx
+### Installation Steps
-#### 参与贡献
+```bash
+# Clone the repository
+git clone <repository-url>
+cd robo-imi-act
-1. Fork 本仓库
-2. 新建 Feat_xxx 分支
-3. 提交代码
-4. 新建 Pull Request
+# Create and activate the conda environment
+conda env create -f environment.yml
+conda activate roboimi
+# Install the package in development mode
+pip install -e .
+```
-#### 特技
+## Quick Start
-1. 使用 Readme\_XXX.md 来支持不同的语言,例如 Readme\_en.md, Readme\_zh.md
-2. Gitee 官方博客 [blog.gitee.com](https://blog.gitee.com)
-3. 你可以 [https://gitee.com/explore](https://gitee.com/explore) 这个地址来了解 Gitee 上的优秀开源项目
-4. [GVP](https://gitee.com/gvp) 全称是 Gitee 最有价值开源项目,是综合评定出的优秀开源项目
-5. Gitee 官方提供的使用手册 [https://gitee.com/help](https://gitee.com/help)
-6. Gitee 封面人物是一档用来展示 Gitee 会员风采的栏目 [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
+### 1. Data Collection
+
+Record demonstration trajectories in simulation:
+
+```bash
+# Record episodes for the vx300s robot
+python roboimi/demos/record_sim_episodes.py
+
+# Record episodes for the Diana robot
+python roboimi/demos/diana_record_sim_episodes.py
+```
+
+Episodes are saved as HDF5 files containing robot states, actions, and camera observations.
+
+### 2. Compute Dataset Statistics
+
+Normalization statistics must be computed before training:
+
+```bash
+python roboimi/vla/scripts/calculate_stats.py
+```
+
+This generates a `data_stats.pkl` file containing the mean/std or min/max of the actions and observations.
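+
+For illustration, a minimal sketch of what this statistics computation might look like (the actual `calculate_stats.py` may differ; the dataset path follows the inspection scripts in this repository, so adjust it to your own data directory):
+
+```python
+import glob
+import pickle
+
+import h5py
+import numpy as np
+
+episodes = sorted(glob.glob("roboimi/demos/dataset/sim_transfer/episode_*.hdf5"))
+actions, qpos = [], []
+for path in episodes:
+    with h5py.File(path, "r") as f:
+        actions.append(f["/action"][:])
+        qpos.append(f["/observations/qpos"][:])
+actions = np.concatenate(actions, axis=0)
+qpos = np.concatenate(qpos, axis=0)
+
+# Per-dimension statistics used for normalization / denormalization.
+stats = {
+    "action": {"min": actions.min(0), "max": actions.max(0), "mean": actions.mean(0), "std": actions.std(0)},
+    "qpos": {"min": qpos.min(0), "max": qpos.max(0), "mean": qpos.mean(0), "std": qpos.std(0)},
+}
+with open("data_stats.pkl", "wb") as f:
+    pickle.dump(stats, f)
+```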
+
+### 3. Train the VLA Model
+
+Train the vision-language-action model on the collected data:
+
+```bash
+# Train with the default configuration
+python roboimi/demos/vla_scripts/train_vla.py
+
+# Override specific parameters
+python roboimi/demos/vla_scripts/train_vla.py train.batch_size=32 train.lr=5e-5 train.max_steps=50000
+
+# Use a different model architecture
+python roboimi/demos/vla_scripts/train_vla.py agent=resnet_diffusion data=resnet_dataset
+```
+
+Training outputs are written to `outputs/<date>/<time>/`; model checkpoints are saved to `checkpoints/`.
+
+### 4. Evaluate the Model
+
+Evaluate a trained model in simulation:
+
+```bash
+# Evaluate with the default configuration (uses the best checkpoint)
+python roboimi/demos/vla_scripts/eval_vla.py
+
+# Specify a checkpoint and the number of evaluation episodes
+python roboimi/demos/vla_scripts/eval_vla.py eval.ckpt_path=checkpoints/vla_model_step_8000.pt eval.num_episodes=5
+
+# Enable action smoothing for smoother execution
+python roboimi/demos/vla_scripts/eval_vla.py eval.use_smoothing=true eval.smooth_alpha=0.5
+```
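+
+`eval.use_smoothing` blends each newly predicted action with the previously executed one. A minimal sketch of the assumed behaviour of `eval.smooth_alpha` (the actual `eval_vla.py` implementation may differ):
+
+```python
+import numpy as np
+
+def smooth_action(new_action, prev_action, alpha=0.5):
+    """Exponential smoothing: alpha weights the new action, (1 - alpha) weights the previous one."""
+    if prev_action is None:
+        return np.asarray(new_action)
+    return alpha * np.asarray(new_action) + (1.0 - alpha) * np.asarray(prev_action)
+```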
+
+## Project Structure
+
+```
+robo-imi-act/
+├── roboimi/
+│   ├── assets/                      # Robot models and assets
+│   │   ├── models/manipulators/     # URDF and MuJoCo XML files
+│   │   └── robots/                  # Robot abstraction classes
+│   ├── envs/                        # Simulation environments
+│   │   ├── mujoco_base.py           # MuJoCo environment base class
+│   │   ├── single_base.py           # Single-arm task base class
+│   │   └── double_base.py           # Dual-arm task base class
+│   ├── vla/                         # Vision-language-action model
+│   │   ├── agent.py                 # VLAAgent (training and inference)
+│   │   ├── models/
+│   │   │   ├── backbones/           # Vision encoders (ResNet, etc.)
+│   │   │   └── heads/               # Policy heads (diffusion UNet1D)
+│   │   ├── conf/                    # Hydra configuration files
+│   │   └── scripts/                 # Training and utility scripts
+│   └── demos/                       # Demo scripts and examples
+├── checkpoints/                     # Saved model checkpoints
+├── outputs/                         # Training outputs (Hydra)
+├── environment.yml                  # Conda environment definition
+└── CLAUDE.md                        # Claude Code development guide
+```
+
+## Architecture
+
+### VLA Training Pipeline
+
+```
+HDF5 episodes → Dataset → DataLoader → VLAAgent → model checkpoints
+```
+
+**Model components**:
+- **Vision backbone**: ResNet-18 with spatial softmax, extracts visual features from the camera images
+- **Diffusion head**: conditional UNet1D that predicts action sequences via DDPM/DDIM
+- **VLAAgent**: combines the vision encoder and the diffusion policy, and handles training and inference
+
+### Configuration System
+
+The Hydra configuration files live in `roboimi/vla/conf/`:
+- `config.yaml`: main training configuration (batch size, learning rate, device)
+- `agent/resnet_diffusion.yaml`: model architecture (action dim, observation dim, horizons)
+- `data/resnet_dataset.yaml`: dataset path, camera names, normalization type
+- `eval/eval.yaml`: evaluation settings (checkpoint path, number of episodes, smoothing parameters)
+
+Configuration interpolation keeps values consistent across files, e.g. `${agent.obs_horizon}`.
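+
+For illustration, a minimal OmegaConf example of how such an interpolation resolves (the key names mirror the configs listed above; the exact YAML structure of this project is an assumption):
+
+```python
+from omegaconf import OmegaConf
+
+# Hypothetical fragment mirroring the config layout described above.
+cfg = OmegaConf.create({
+    "agent": {"obs_horizon": 2, "pred_horizon": 16, "action_horizon": 8},
+    "data": {"obs_horizon": "${agent.obs_horizon}"},  # interpolation keeps the two values in sync
+})
+print(OmegaConf.to_container(cfg, resolve=True)["data"]["obs_horizon"])  # -> 2
+```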
+
+### Dataset Format
+
+Each episode HDF5 file (`episode_*.hdf5`) contains:
+- `action`: robot actions, `[T, action_dim]`
+- `observations/qpos`: joint positions, `[T, obs_dim]`
+- `observations/images/`: camera images, `[T, H, W, C]`
+
+The statistics file (`data_stats.pkl`) stores the normalization parameters (min/max/mean/std).
+
+## Development Guide
+
+### Adding a New Robot
+
+1. Create the URDF/XML files under `roboimi/assets/models/manipulators/<robot_name>/`
+2. Define the robot class in `roboimi/assets/robots/<robot_name>.py` (inheriting from `arm_base.py`)
+3. Create the environment classes in `roboimi/envs/<robot_name>_*.py`
+4. Register the robot in the constants if needed
+
+### Modifying the VLA Architecture
+
+1. **Custom backbone**: create a new class under `roboimi/vla/models/backbones/` that inherits from `VLABackbone`
+2. **Custom head**: create a new class under `roboimi/vla/models/heads/` that inherits from `VLAHead`
+3. **Update the config**: add a new YAML file under `roboimi/vla/conf/agent/`
+4. **Interface definitions**: see the abstract base classes in `roboimi/vla/core/interfaces.py`
+
+### Training Best Practices
+
+- Always run `calculate_stats.py` after collecting new data
+- Inputs/outputs are normalized during training; at inference time they are denormalized with the statistics saved in the checkpoint
+- The model predicts `pred_horizon` steps but only executes the first `action_horizon` steps (see the sketch below)
+- Inference uses DDIM (10 steps) for fast sampling; training uses DDPM (100 steps)
+- Monitor the validation loss to avoid overfitting
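+
+A minimal sketch of the receding-horizon execution loop implied by the `pred_horizon` / `action_horizon` split; the `env` and `policy.predict` interfaces here are illustrative, not the actual VLAAgent or environment API:
+
+```python
+def rollout(env, policy, max_steps=400, action_horizon=8):
+    """Receding-horizon execution: re-plan every action_horizon steps."""
+    obs = env.reset()
+    for _ in range(0, max_steps, action_horizon):
+        action_chunk = policy.predict(obs)            # shape: (pred_horizon, action_dim)
+        for action in action_chunk[:action_horizon]:  # execute only the first action_horizon steps
+            obs, reward, done, info = env.step(action)
+            if done:
+                return
+```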
+
+## Technical Details
+
+- **Coordinate spaces**: joint space (qpos) or end-effector space (xyz + rpy + gripper)
+- **Action horizons**: `obs_horizon` is the observation window, `pred_horizon` the prediction window, and `action_horizon` the execution window
+- **Normalization**: essential for stable training - always compute the statistics before training
+- **Inference speed-up**: the DDIM scheduler uses roughly 10x fewer denoising steps than the DDPM schedule used for training (see the snippet below)
+- **Device**: set via `train.device` (cuda/cpu)
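+
+For illustration, a minimal diffusers sketch of the DDPM-for-training / DDIM-for-inference split, mirroring the defaults in `diffusion/configuration_diffusion.py`:
+
+```python
+from diffusers import DDIMScheduler, DDPMScheduler
+
+# Training: the full 100-step DDPM noise schedule.
+train_scheduler = DDPMScheduler(num_train_timesteps=100, beta_schedule="squaredcos_cap_v2")
+
+# Inference: a DDIM scheduler over the same schedule, sampled with only 10 steps.
+infer_scheduler = DDIMScheduler(num_train_timesteps=100, beta_schedule="squaredcos_cap_v2")
+infer_scheduler.set_timesteps(10)
+print(infer_scheduler.timesteps)  # 10 evenly spaced timesteps out of 100
+```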
+
+## License
+
+[Add license information here]
+
+## Citation
+
+If you use this codebase in your research, please cite:
+
+```bibtex
+[Add citation information here]
+```
+
+## Contributing
+
+Contributions are welcome! Feel free to open a Pull Request or an Issue.
+
+## Acknowledgements
+
+This project builds on the following open-source projects:
+- [MuJoCo](https://mujoco.org/) - physics simulation engine
+- [PyTorch](https://pytorch.org/) - deep learning framework
+- [Hydra](https://hydra.cc/) - configuration management
+- [Diffusers](https://github.com/huggingface/diffusers) - diffusion model library
diff --git a/check_all_episodes.py b/check_all_episodes.py
new file mode 100644
index 0000000..2734216
--- /dev/null
+++ b/check_all_episodes.py
@@ -0,0 +1,91 @@
+#!/usr/bin/env python3
+"""
+Check every episode for duplicated frames.
+
+Identifies which episodes are problematic and need to be deleted or re-collected.
+"""
+import os
+import h5py
+import glob
+import numpy as np
+
+
+def check_all_episodes():
+ """检查所有 episode 的质量"""
+
+ dataset_dir = "roboimi/demos/dataset/sim_transfer"
+ episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
+ episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))
+
+ print("="*80)
+ print("所有 Episode 质量检查")
+ print("="*80)
+
+ good_episodes = []
+ bad_episodes = []
+
+ for ep_idx, ep_file in enumerate(episode_files):
+ ep_name = os.path.basename(ep_file).replace('.hdf5', '')
+
+ try:
+ with h5py.File(ep_file, 'r') as f:
+ img_path = '/observations/images/top'
+ if img_path not in f:
+ continue
+
+ images = f[img_path][:]
+
+                # Check the first 50 frames for duplicated consecutive frames
+                check_frames = min(50, len(images))
+                duplicate_count = 0
+
+                for i in range(check_frames - 1):
+                    img1 = images[i]
+                    img2 = images[i + 1]
+                    diff = np.mean(np.abs(img1.astype(float) - img2.astype(float)))
+
+                    if diff < 1.0:  # consecutive frames are (near-)identical
+ duplicate_count += 1
+
+ duplicate_rate = duplicate_count / check_frames * 100
+
+                # Classify the episode quality
+                if duplicate_rate > 10:  # more than 10% duplicated frames
+                    bad_episodes.append((ep_idx, ep_name, duplicate_rate, duplicate_count))
+                    status = "❌"
+                else:
+                    good_episodes.append((ep_idx, ep_name, duplicate_rate, duplicate_count))
+                    status = "✅"
+
+                print(f"{status} Episode {ep_idx:2d}: {duplicate_rate:5.1f}% duplicates ({duplicate_count:2d}/{check_frames}) - {ep_name}")
+
+ except Exception as e:
+ print(f"❌ Episode {ep_idx}: 错误 - {e}")
+
+    # Summary
+    print("\n" + "="*80)
+    print("Summary")
+    print("="*80)
+    print(f"Total checked: {len(episode_files)} episodes")
+    print(f"Good: {len(good_episodes)} ✅")
+    print(f"Problematic: {len(bad_episodes)} ❌")
+
+    if bad_episodes:
+        print("\nProblematic episodes:")
+        for ep_idx, ep_name, rate, count in bad_episodes:
+            print(f"  - {ep_name}.hdf5: {rate:.1f}% duplicates")
+
+        print("\nDelete command:")
+        ep_names = [name for _, name, _, _ in bad_episodes]
+        print("  rm " + " ".join([f"{dataset_dir}/{name}.hdf5" for name in ep_names]))
+
+    print("\nSuggestions:")
+    if len(bad_episodes) > 0:
+        print(f"  1. Delete the {len(bad_episodes)} problematic episodes")
+        print(f"  2. Re-collect data, or use the remaining {len(good_episodes)} good episodes")
+    else:
+        print("  ✅ All episodes look good and can be used as-is!")
+
+
+if __name__ == "__main__":
+ check_all_episodes()
diff --git a/check_specific_frames.py b/check_specific_frames.py
new file mode 100644
index 0000000..ce93d35
--- /dev/null
+++ b/check_specific_frames.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+"""
+Inspect the images of specific frames - used to verify data-recording issues.
+
+What it does:
+1. Extract frames 0, 1 and 2 of every episode.
+2. Compare the same frame index across different episodes.
+3. Save the images for manual inspection.
+"""
+import os
+import h5py
+import glob
+import cv2
+import numpy as np
+
+
+def check_specific_frames(frame_indices=[0, 1, 2], camera='top', num_episodes=10):
+ """
+    Inspect the images and qpos of specific frames.
+
+    Args:
+        frame_indices: list of frame indices to inspect
+        camera: camera name
+        num_episodes: number of episodes to inspect
+ """
+
+ dataset_dir = "roboimi/demos/dataset/sim_transfer"
+ episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
+    # Sort numerically by episode index
+    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))
+
+    # Create the output directory
+    output_dir = '/tmp/dataset_frames'
+    os.makedirs(output_dir, exist_ok=True)
+
+    print(f"Inspecting specific frames of the first {min(num_episodes, len(episode_files))} episodes")
+    print(f"Frame indices: {frame_indices}")
+    print(f"Camera: {camera}")
+    print("="*80)
+
+    # Collect data from every episode
+ for ep_idx in range(min(num_episodes, len(episode_files))):
+ ep_file = episode_files[ep_idx]
+ ep_name = os.path.basename(ep_file).replace('.hdf5', '')
+
+ try:
+ with h5py.File(ep_file, 'r') as f:
+                # Read qpos
+                qpos = f['/observations/qpos'][:]
+
+                # Read images
+                img_path = f'/observations/images/{camera}'
+                if img_path not in f:
+                    print(f"Episode {ep_name}: camera {camera} not found")
+                    continue
+
+                images = f[img_path][:]
+
+                print(f"\nEpisode {ep_name}:")
+                print(f"  Total frames: {len(images)}")
+
+                # Save the requested frames
+                for frame_idx in frame_indices:
+                    if frame_idx >= len(images):
+                        print(f"  Frame {frame_idx}: out of range")
+                        continue
+
+                    # Save the image
+                    img = images[frame_idx]
+                    filename = f"{output_dir}/ep{ep_idx:02d}_frame{frame_idx:03d}.png"
+                    cv2.imwrite(filename, img)
+
+                    # Print the corresponding qpos
+                    q = qpos[frame_idx]
+                    print(f"  Frame {frame_idx}: qpos[0:3]=[{q[0]:6.2f}, {q[1]:6.2f}, {q[2]:6.2f}], qpos[3]={q[3]:6.2f} → {filename}")
+
+        except Exception as e:
+            print(f"Episode {ep_name}: error - {e}")
+
+ print("\n" + "="*80)
+ print(f"✅ 所有图像已保存到: {output_dir}")
+ print(f"\n查看方法:")
+ print(f" eog {output_dir}/*.png")
+ print(f" ")
+ print(f" # 或对比特定帧:")
+ print(f" eog {output_dir}/*_frame000.png # 所有 episode 的第 0 帧")
+ print(f" eog {output_dir}/*_frame001.png # 所有 episode 的第 1 帧")
+ print(f" eog {output_dir}/*_frame002.png # 所有 episode 的第 2 帧")
+
+
+def compare_frame_across_episodes(frame_idx=0, camera='top', num_episodes=10):
+ """
+    Compare one frame across all episodes, side by side.
+
+    Produces a single large comparison image containing the given frame of every episode.
+ """
+
+ dataset_dir = "roboimi/demos/dataset/sim_transfer"
+ episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
+ episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))
+
+ num_compare = min(num_episodes, len(episode_files))
+    cols = 5  # 5 images per row
+ rows = (num_compare + cols - 1) // cols
+
+    # Create the output directory
+    output_dir = '/tmp/dataset_frames'
+    os.makedirs(output_dir, exist_ok=True)
+
+    print(f"Generating comparison image: frame {frame_idx} of every episode")
+    print("="*80)
+
+    # Collect the images
+ images_compare = []
+ qpos_list = []
+
+ for ep_idx in range(num_compare):
+ ep_file = episode_files[ep_idx]
+ ep_name = os.path.basename(ep_file).replace('.hdf5', '')
+
+ try:
+ with h5py.File(ep_file, 'r') as f:
+ qpos = f['/observations/qpos'][:]
+ img_path = f'/observations/images/{camera}'
+
+ if img_path in f and frame_idx < f[img_path].shape[0]:
+ img = f[img_path][frame_idx]
+ images_compare.append(img)
+ qpos_list.append(qpos[frame_idx])
+ print(f"Episode {ep_name}: qpos[0:3]=[{qpos[frame_idx][0]:.2f}, {qpos[frame_idx][1]:.2f}, {qpos[frame_idx][2]:.2f}]")
+
+ except Exception as e:
+ print(f"Episode {ep_name}: 错误 - {e}")
+
+ if not images_compare:
+ print("❌ 没有收集到图像")
+ return
+
+ # 获取图像尺寸
+ h, w = images_compare[0].shape[:2]
+
+ # 创建对比图
+ compare_img = np.zeros((rows * h + 50, cols * w, 3), dtype=np.uint8)
+
+ for i, (img, qpos) in enumerate(zip(images_compare, qpos_list)):
+ row = i // cols
+ col = i % cols
+
+ y_start = row * h + 30
+ y_end = y_start + h
+ x_start = col * w
+ x_end = x_start + w
+
+        # Resize if necessary
+        if img.shape[:2] != (h, w):
+            img = cv2.resize(img, (w, h))
+
+        compare_img[y_start:y_end, x_start:x_end] = img
+
+        # Overlay labels
+ ep_name = f"Ep {i}"
+ cv2.putText(compare_img, ep_name, (x_start + 10, row * h + 20),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
+ cv2.putText(compare_img, f"qpos[3]={qpos[3]:.2f}", (x_start + 10, y_end - 10),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
+
+    # Save the comparison image
+    output_path = f"{output_dir}/compare_frame{frame_idx:03d}.png"
+    cv2.imwrite(output_path, compare_img)
+
+    print(f"\n✅ Comparison image saved: {output_path}")
+    print(f"  View it with: eog {output_path}")
+
+
+if __name__ == "__main__":
+ import sys
+
+ print("="*80)
+ print("特定帧检查工具")
+ print("="*80)
+
+ if len(sys.argv) > 1:
+ frame_idx = int(sys.argv[1])
+ compare_frame_across_episodes(frame_idx=frame_idx, camera='top', num_episodes=10)
+    else:
+        # By default, inspect frames 0, 1 and 2
+        check_specific_frames(frame_indices=[0, 1, 2], camera='top', num_episodes=10)
+
+    print("\n" + "="*80)
+    print("Generating comparison images...")
+    print("="*80)
+
+    # Generate comparison images for frames 0, 1 and 2
+ compare_frame_across_episodes(frame_idx=0, camera='top', num_episodes=10)
+ compare_frame_across_episodes(frame_idx=1, camera='top', num_episodes=10)
+ compare_frame_across_episodes(frame_idx=2, camera='top', num_episodes=10)
+
+ print("\n" + "="*80)
+ print("其他用法:")
+ print(" python check_specific_frames.py 0 # 只检查第 0 帧")
+ print(" python check_specific_frames.py 1 # 只检查第 1 帧")
+ print("="*80)
diff --git a/diffusion/configuration_diffusion.py b/diffusion/configuration_diffusion.py
new file mode 100644
index 0000000..5456943
--- /dev/null
+++ b/diffusion/configuration_diffusion.py
@@ -0,0 +1,238 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from dataclasses import dataclass, field
+
+from lerobot.configs.policies import PreTrainedConfig
+from lerobot.configs.types import NormalizationMode
+from lerobot.optim.optimizers import AdamConfig
+from lerobot.optim.schedulers import DiffuserSchedulerConfig
+
+
+@PreTrainedConfig.register_subclass("diffusion")
+@dataclass
+class DiffusionConfig(PreTrainedConfig):
+ """Configuration class for DiffusionPolicy.
+
+ Defaults are configured for training with PushT providing proprioceptive and single camera observations.
+
+ The parameters you will most likely need to change are the ones which depend on the environment / sensors.
+ Those are: `input_shapes` and `output_shapes`.
+
+ Notes on the inputs and outputs:
+ - "observation.state" is required as an input key.
+ - Either:
+            - At least one key starting with "observation.image" is required as an input.
+ AND/OR
+ - The key "observation.environment_state" is required as input.
+ - If there are multiple keys beginning with "observation.image" they are treated as multiple camera
+ views. Right now we only support all images having the same shape.
+ - "action" is required as an output key.
+
+ Args:
+ n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
+ current step and additional steps going back).
+ horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
+ n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
+ See `DiffusionPolicy.select_action` for more details.
+ input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
+ the input data name, and the value is a list indicating the dimensions of the corresponding data.
+ For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
+ indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
+ include batch dimension or temporal dimension.
+ output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
+ the output data name, and the value is a list indicating the dimensions of the corresponding data.
+ For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
+ Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
+ input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
+ and the value specifies the normalization mode to apply. The two available modes are "mean_std"
+            which subtracts the mean and divides by the standard deviation, and "min_max" which rescales to a
+            [-1, 1] range.
+        output_normalization_modes: A dictionary similar to `input_normalization_modes`, but used to unnormalize to the
+ original scale. Note that this is also used for normalizing the training targets.
+ vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
+ crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
+ within the image size. If None, no cropping is done.
+ crop_is_random: Whether the crop should be random at training time (it's always a center crop in eval
+ mode).
+ pretrained_backbone_weights: Pretrained weights from torchvision to initialize the backbone.
+ `None` means no pretrained weights.
+ use_group_norm: Whether to replace batch normalization with group normalization in the backbone.
+ The group sizes are set to be about 16 (to be precise, feature_dim // 16).
+ spatial_softmax_num_keypoints: Number of keypoints for SpatialSoftmax.
+        use_separate_rgb_encoder_per_camera: Whether to use a separate RGB encoder for each camera view.
+ down_dims: Feature dimension for each stage of temporal downsampling in the diffusion modeling Unet.
+ You may provide a variable number of dimensions, therefore also controlling the degree of
+ downsampling.
+ kernel_size: The convolutional kernel size of the diffusion modeling Unet.
+ n_groups: Number of groups used in the group norm of the Unet's convolutional blocks.
+ diffusion_step_embed_dim: The Unet is conditioned on the diffusion timestep via a small non-linear
+ network. This is the output dimension of that network, i.e., the embedding dimension.
+ use_film_scale_modulation: FiLM (https://huggingface.co/papers/1709.07871) is used for the Unet conditioning.
+            Bias modulation is used by default, while this parameter indicates whether to also use scale
+ modulation.
+ noise_scheduler_type: Name of the noise scheduler to use. Supported options: ["DDPM", "DDIM"].
+ num_train_timesteps: Number of diffusion steps for the forward diffusion schedule.
+ beta_schedule: Name of the diffusion beta schedule as per DDPMScheduler from Hugging Face diffusers.
+ beta_start: Beta value for the first forward-diffusion step.
+ beta_end: Beta value for the last forward-diffusion step.
+ prediction_type: The type of prediction that the diffusion modeling Unet makes. Choose from "epsilon"
+ or "sample". These have equivalent outcomes from a latent variable modeling perspective, but
+ "epsilon" has been shown to work better in many deep neural network settings.
+ clip_sample: Whether to clip the sample to [-`clip_sample_range`, +`clip_sample_range`] for each
+ denoising step at inference time. WARNING: you will need to make sure your action-space is
+ normalized to fit within this range.
+ clip_sample_range: The magnitude of the clipping range as described above.
+ num_inference_steps: Number of reverse diffusion steps to use at inference time (steps are evenly
+ spaced). If not provided, this defaults to be the same as `num_train_timesteps`.
+ do_mask_loss_for_padding: Whether to mask the loss when there are copy-padded actions. See
+ `LeRobotDataset` and `load_previous_and_future_frames` for more information. Note, this defaults
+ to False as the original Diffusion Policy implementation does the same.
+ """
+
+ # Inputs / output structure.
+ n_obs_steps: int = 2
+ horizon: int = 16
+ n_action_steps: int = 8
+
+ normalization_mapping: dict[str, NormalizationMode] = field(
+ default_factory=lambda: {
+ "VISUAL": NormalizationMode.MEAN_STD,
+ "STATE": NormalizationMode.MIN_MAX,
+ "ACTION": NormalizationMode.MIN_MAX,
+ }
+ )
+
+ # The original implementation doesn't sample frames for the last 7 steps,
+ # which avoids excessive padding and leads to improved training results.
+ drop_n_last_frames: int = 7 # horizon - n_action_steps - n_obs_steps + 1
+
+ # Architecture / modeling.
+ # Vision backbone.
+ vision_backbone: str = "resnet18"
+ crop_shape: tuple[int, int] | None = (84, 84)
+ crop_is_random: bool = True
+ pretrained_backbone_weights: str | None = None
+ use_group_norm: bool = True
+ spatial_softmax_num_keypoints: int = 32
+ use_separate_rgb_encoder_per_camera: bool = False
+ # Unet.
+ down_dims: tuple[int, ...] = (512, 1024, 2048)
+ kernel_size: int = 5
+ n_groups: int = 8
+ diffusion_step_embed_dim: int = 128
+ use_film_scale_modulation: bool = True
+ # Noise scheduler.
+ noise_scheduler_type: str = "DDPM"
+ num_train_timesteps: int = 100
+ beta_schedule: str = "squaredcos_cap_v2"
+ beta_start: float = 0.0001
+ beta_end: float = 0.02
+ prediction_type: str = "epsilon"
+ clip_sample: bool = True
+ clip_sample_range: float = 1.0
+
+ # Inference
+ num_inference_steps: int | None = None
+
+ # Loss computation
+ do_mask_loss_for_padding: bool = False
+
+ # Training presets
+ optimizer_lr: float = 1e-4
+ optimizer_betas: tuple = (0.95, 0.999)
+ optimizer_eps: float = 1e-8
+ optimizer_weight_decay: float = 1e-6
+ scheduler_name: str = "cosine"
+ scheduler_warmup_steps: int = 500
+
+ def __post_init__(self):
+ super().__post_init__()
+
+ """Input validation (not exhaustive)."""
+ if not self.vision_backbone.startswith("resnet"):
+ raise ValueError(
+ f"`vision_backbone` must be one of the ResNet variants. Got {self.vision_backbone}."
+ )
+
+ supported_prediction_types = ["epsilon", "sample"]
+ if self.prediction_type not in supported_prediction_types:
+ raise ValueError(
+ f"`prediction_type` must be one of {supported_prediction_types}. Got {self.prediction_type}."
+ )
+ supported_noise_schedulers = ["DDPM", "DDIM"]
+ if self.noise_scheduler_type not in supported_noise_schedulers:
+ raise ValueError(
+ f"`noise_scheduler_type` must be one of {supported_noise_schedulers}. "
+ f"Got {self.noise_scheduler_type}."
+ )
+
+ # Check that the horizon size and U-Net downsampling is compatible.
+ # U-Net downsamples by 2 with each stage.
+ downsampling_factor = 2 ** len(self.down_dims)
+ if self.horizon % downsampling_factor != 0:
+ raise ValueError(
+ "The horizon should be an integer multiple of the downsampling factor (which is determined "
+ f"by `len(down_dims)`). Got {self.horizon=} and {self.down_dims=}"
+ )
+
+ def get_optimizer_preset(self) -> AdamConfig:
+ return AdamConfig(
+ lr=self.optimizer_lr,
+ betas=self.optimizer_betas,
+ eps=self.optimizer_eps,
+ weight_decay=self.optimizer_weight_decay,
+ )
+
+ def get_scheduler_preset(self) -> DiffuserSchedulerConfig:
+ return DiffuserSchedulerConfig(
+ name=self.scheduler_name,
+ num_warmup_steps=self.scheduler_warmup_steps,
+ )
+
+ def validate_features(self) -> None:
+ if len(self.image_features) == 0 and self.env_state_feature is None:
+ raise ValueError("You must provide at least one image or the environment state among the inputs.")
+
+ if self.crop_shape is not None:
+ for key, image_ft in self.image_features.items():
+ if self.crop_shape[0] > image_ft.shape[1] or self.crop_shape[1] > image_ft.shape[2]:
+ raise ValueError(
+ f"`crop_shape` should fit within the images shapes. Got {self.crop_shape} "
+ f"for `crop_shape` and {image_ft.shape} for "
+ f"`{key}`."
+ )
+
+ # Check that all input images have the same shape.
+ if len(self.image_features) > 0:
+ first_image_key, first_image_ft = next(iter(self.image_features.items()))
+ for key, image_ft in self.image_features.items():
+ if image_ft.shape != first_image_ft.shape:
+ raise ValueError(
+ f"`{key}` does not match `{first_image_key}`, but we expect all image shapes to match."
+ )
+
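+    # Delta indices are relative to the current timestep. With the defaults
+    # (n_obs_steps=2, horizon=16): observation_delta_indices == [-1, 0] and
+    # action_delta_indices == [-1, 0, ..., 14].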
+ @property
+ def observation_delta_indices(self) -> list:
+ return list(range(1 - self.n_obs_steps, 1))
+
+ @property
+ def action_delta_indices(self) -> list:
+ return list(range(1 - self.n_obs_steps, 1 - self.n_obs_steps + self.horizon))
+
+ @property
+ def reward_delta_indices(self) -> None:
+ return None
diff --git a/diffusion/modeling_diffusion.py b/diffusion/modeling_diffusion.py
new file mode 100644
index 0000000..1fdc76f
--- /dev/null
+++ b/diffusion/modeling_diffusion.py
@@ -0,0 +1,764 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
+
+TODO(alexander-soare):
+ - Remove reliance on diffusers for DDPMScheduler and LR scheduler.
+"""
+
+import math
+from collections import deque
+from collections.abc import Callable
+
+import einops
+import numpy as np
+import torch
+import torch.nn.functional as F # noqa: N812
+import torchvision
+from diffusers.schedulers.scheduling_ddim import DDIMScheduler
+from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
+from torch import Tensor, nn
+
+from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
+from lerobot.policies.pretrained import PreTrainedPolicy
+from lerobot.policies.utils import (
+ get_device_from_parameters,
+ get_dtype_from_parameters,
+ get_output_shape,
+ populate_queues,
+)
+from lerobot.utils.constants import ACTION, OBS_ENV_STATE, OBS_IMAGES, OBS_STATE
+
+
+class DiffusionPolicy(PreTrainedPolicy):
+ """
+ Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
+ (paper: https://huggingface.co/papers/2303.04137, code: https://github.com/real-stanford/diffusion_policy).
+ """
+
+ config_class = DiffusionConfig
+ name = "diffusion"
+
+ def __init__(
+ self,
+ config: DiffusionConfig,
+ **kwargs,
+ ):
+ """
+ Args:
+ config: Policy configuration class instance or None, in which case the default instantiation of
+ the configuration class is used.
+ dataset_stats: Dataset statistics to be used for normalization. If not passed here, it is expected
+ that they will be passed with a call to `load_state_dict` before the policy is used.
+ """
+ super().__init__(config)
+ config.validate_features()
+ self.config = config
+
+ # queues are populated during rollout of the policy, they contain the n latest observations and actions
+ self._queues = None
+
+ self.diffusion = DiffusionModel(config)
+
+ self.reset()
+
+ def get_optim_params(self) -> dict:
+ return self.diffusion.parameters()
+
+ def reset(self):
+ """Clear observation and action queues. Should be called on `env.reset()`"""
+ self._queues = {
+ OBS_STATE: deque(maxlen=self.config.n_obs_steps),
+ ACTION: deque(maxlen=self.config.n_action_steps),
+ }
+ if self.config.image_features:
+ self._queues[OBS_IMAGES] = deque(maxlen=self.config.n_obs_steps)
+ if self.config.env_state_feature:
+ self._queues[OBS_ENV_STATE] = deque(maxlen=self.config.n_obs_steps)
+
+ @torch.no_grad()
+ def predict_action_chunk(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
+ """Predict a chunk of actions given environment observations."""
+ # stack n latest observations from the queue
+ batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
+ actions = self.diffusion.generate_actions(batch, noise=noise)
+
+ return actions
+
+ @torch.no_grad()
+ def select_action(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
+ """Select a single action given environment observations.
+
+ This method handles caching a history of observations and an action trajectory generated by the
+ underlying diffusion model. Here's how it works:
+ - `n_obs_steps` steps worth of observations are cached (for the first steps, the observation is
+ copied `n_obs_steps` times to fill the cache).
+ - The diffusion model generates `horizon` steps worth of actions.
+ - `n_action_steps` worth of actions are actually kept for execution, starting from the current step.
+ Schematically this looks like:
+ ----------------------------------------------------------------------------------------------
+ (legend: o = n_obs_steps, h = horizon, a = n_action_steps)
+ |timestep | n-o+1 | n-o+2 | ..... | n | ..... | n+a-1 | n+a | ..... | n-o+h |
+ |observation is used | YES | YES | YES | YES | NO | NO | NO | NO | NO |
+ |action is generated | YES | YES | YES | YES | YES | YES | YES | YES | YES |
+ |action is used | NO | NO | NO | YES | YES | YES | NO | NO | NO |
+ ----------------------------------------------------------------------------------------------
+ Note that this means we require: `n_action_steps <= horizon - n_obs_steps + 1`. Also, note that
+ "horizon" may not the best name to describe what the variable actually means, because this period is
+ actually measured from the first observation which (if `n_obs_steps` > 1) happened in the past.
+ """
+ # NOTE: for offline evaluation, we have action in the batch, so we need to pop it out
+ if ACTION in batch:
+ batch.pop(ACTION)
+
+ if self.config.image_features:
+ batch = dict(batch) # shallow copy so that adding a key doesn't modify the original
+ batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
+ # NOTE: It's important that this happens after stacking the images into a single key.
+ self._queues = populate_queues(self._queues, batch)
+
+ if len(self._queues[ACTION]) == 0:
+ actions = self.predict_action_chunk(batch, noise=noise)
+ self._queues[ACTION].extend(actions.transpose(0, 1))
+
+ action = self._queues[ACTION].popleft()
+ return action
+
+ def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, None]:
+ """Run the batch through the model and compute the loss for training or validation."""
+ if self.config.image_features:
+ batch = dict(batch) # shallow copy so that adding a key doesn't modify the original
+ batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
+ loss = self.diffusion.compute_loss(batch)
+ # no output_dict so returning None
+ return loss, None
+
+
+def _make_noise_scheduler(name: str, **kwargs: dict) -> DDPMScheduler | DDIMScheduler:
+ """
+ Factory for noise scheduler instances of the requested type. All kwargs are passed
+ to the scheduler.
+ """
+ if name == "DDPM":
+ return DDPMScheduler(**kwargs)
+ elif name == "DDIM":
+ return DDIMScheduler(**kwargs)
+ else:
+ raise ValueError(f"Unsupported noise scheduler type {name}")
+
+
+class DiffusionModel(nn.Module):
+ def __init__(self, config: DiffusionConfig):
+ super().__init__()
+ self.config = config
+
+ # Build observation encoders (depending on which observations are provided).
+ global_cond_dim = self.config.robot_state_feature.shape[0]
+ if self.config.image_features:
+ num_images = len(self.config.image_features)
+ if self.config.use_separate_rgb_encoder_per_camera:
+ encoders = [DiffusionRgbEncoder(config) for _ in range(num_images)]
+ self.rgb_encoder = nn.ModuleList(encoders)
+ global_cond_dim += encoders[0].feature_dim * num_images
+ else:
+ self.rgb_encoder = DiffusionRgbEncoder(config)
+ global_cond_dim += self.rgb_encoder.feature_dim * num_images
+ if self.config.env_state_feature:
+ global_cond_dim += self.config.env_state_feature.shape[0]
+
+ self.unet = DiffusionConditionalUnet1d(config, global_cond_dim=global_cond_dim * config.n_obs_steps)
+
+ self.noise_scheduler = _make_noise_scheduler(
+ config.noise_scheduler_type,
+ num_train_timesteps=config.num_train_timesteps,
+ beta_start=config.beta_start,
+ beta_end=config.beta_end,
+ beta_schedule=config.beta_schedule,
+ clip_sample=config.clip_sample,
+ clip_sample_range=config.clip_sample_range,
+ prediction_type=config.prediction_type,
+ )
+
+ if config.num_inference_steps is None:
+ self.num_inference_steps = self.noise_scheduler.config.num_train_timesteps
+ else:
+ self.num_inference_steps = config.num_inference_steps
+
+ # ========= inference ============
+ def conditional_sample(
+ self,
+ batch_size: int,
+ global_cond: Tensor | None = None,
+ generator: torch.Generator | None = None,
+ noise: Tensor | None = None,
+ ) -> Tensor:
+ device = get_device_from_parameters(self)
+ dtype = get_dtype_from_parameters(self)
+
+ # Sample prior.
+ sample = (
+ noise
+ if noise is not None
+ else torch.randn(
+ size=(batch_size, self.config.horizon, self.config.action_feature.shape[0]),
+ dtype=dtype,
+ device=device,
+ generator=generator,
+ )
+ )
+
+ self.noise_scheduler.set_timesteps(self.num_inference_steps)
+
+ for t in self.noise_scheduler.timesteps:
+ # Predict model output.
+ model_output = self.unet(
+ sample,
+ torch.full(sample.shape[:1], t, dtype=torch.long, device=sample.device),
+ global_cond=global_cond,
+ )
+ # Compute previous image: x_t -> x_t-1
+ sample = self.noise_scheduler.step(model_output, t, sample, generator=generator).prev_sample
+
+ return sample
+
+ def _prepare_global_conditioning(self, batch: dict[str, Tensor]) -> Tensor:
+ """Encode image features and concatenate them all together along with the state vector."""
+ batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
+ global_cond_feats = [batch[OBS_STATE]]
+ # Extract image features.
+ if self.config.image_features:
+ if self.config.use_separate_rgb_encoder_per_camera:
+ # Combine batch and sequence dims while rearranging to make the camera index dimension first.
+ images_per_camera = einops.rearrange(batch[OBS_IMAGES], "b s n ... -> n (b s) ...")
+ img_features_list = torch.cat(
+ [
+ encoder(images)
+ for encoder, images in zip(self.rgb_encoder, images_per_camera, strict=True)
+ ]
+ )
+ # Separate batch and sequence dims back out. The camera index dim gets absorbed into the
+ # feature dim (effectively concatenating the camera features).
+ img_features = einops.rearrange(
+ img_features_list, "(n b s) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
+ )
+ else:
+ # Combine batch, sequence, and "which camera" dims before passing to shared encoder.
+ img_features = self.rgb_encoder(
+ einops.rearrange(batch[OBS_IMAGES], "b s n ... -> (b s n) ...")
+ )
+ # Separate batch dim and sequence dim back out. The camera index dim gets absorbed into the
+ # feature dim (effectively concatenating the camera features).
+ img_features = einops.rearrange(
+ img_features, "(b s n) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
+ )
+ global_cond_feats.append(img_features)
+
+ if self.config.env_state_feature:
+ global_cond_feats.append(batch[OBS_ENV_STATE])
+
+ # Concatenate features then flatten to (B, global_cond_dim).
+ return torch.cat(global_cond_feats, dim=-1).flatten(start_dim=1)
+
+ def generate_actions(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
+ """
+ This function expects `batch` to have:
+ {
+ "observation.state": (B, n_obs_steps, state_dim)
+
+ "observation.images": (B, n_obs_steps, num_cameras, C, H, W)
+ AND/OR
+ "observation.environment_state": (B, n_obs_steps, environment_dim)
+ }
+ """
+ batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
+ assert n_obs_steps == self.config.n_obs_steps
+
+ # Encode image features and concatenate them all together along with the state vector.
+ global_cond = self._prepare_global_conditioning(batch) # (B, global_cond_dim)
+
+ # run sampling
+ actions = self.conditional_sample(batch_size, global_cond=global_cond, noise=noise)
+
+ # Extract `n_action_steps` steps worth of actions (from the current observation).
+ start = n_obs_steps - 1
+ end = start + self.config.n_action_steps
+ actions = actions[:, start:end]
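+        # e.g. with the defaults n_obs_steps=2 and n_action_steps=8, this keeps actions[:, 1:9].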
+
+ return actions
+
+ def compute_loss(self, batch: dict[str, Tensor]) -> Tensor:
+ """
+ This function expects `batch` to have (at least):
+ {
+ "observation.state": (B, n_obs_steps, state_dim)
+
+ "observation.images": (B, n_obs_steps, num_cameras, C, H, W)
+ AND/OR
+ "observation.environment_state": (B, n_obs_steps, environment_dim)
+
+ "action": (B, horizon, action_dim)
+ "action_is_pad": (B, horizon)
+ }
+ """
+ # Input validation.
+ assert set(batch).issuperset({OBS_STATE, ACTION, "action_is_pad"})
+ assert OBS_IMAGES in batch or OBS_ENV_STATE in batch
+ n_obs_steps = batch[OBS_STATE].shape[1]
+ horizon = batch[ACTION].shape[1]
+ assert horizon == self.config.horizon
+ assert n_obs_steps == self.config.n_obs_steps
+
+ # Encode image features and concatenate them all together along with the state vector.
+ global_cond = self._prepare_global_conditioning(batch) # (B, global_cond_dim)
+
+ # Forward diffusion.
+ trajectory = batch[ACTION]
+ # Sample noise to add to the trajectory.
+ eps = torch.randn(trajectory.shape, device=trajectory.device)
+ # Sample a random noising timestep for each item in the batch.
+ timesteps = torch.randint(
+ low=0,
+ high=self.noise_scheduler.config.num_train_timesteps,
+ size=(trajectory.shape[0],),
+ device=trajectory.device,
+ ).long()
+ # Add noise to the clean trajectories according to the noise magnitude at each timestep.
+ noisy_trajectory = self.noise_scheduler.add_noise(trajectory, eps, timesteps)
+
+ # Run the denoising network (that might denoise the trajectory, or attempt to predict the noise).
+ pred = self.unet(noisy_trajectory, timesteps, global_cond=global_cond)
+
+ # Compute the loss.
+ # The target is either the original trajectory, or the noise.
+ if self.config.prediction_type == "epsilon":
+ target = eps
+ elif self.config.prediction_type == "sample":
+ target = batch[ACTION]
+ else:
+ raise ValueError(f"Unsupported prediction type {self.config.prediction_type}")
+
+ loss = F.mse_loss(pred, target, reduction="none")
+
+ # Mask loss wherever the action is padded with copies (edges of the dataset trajectory).
+ if self.config.do_mask_loss_for_padding:
+ if "action_is_pad" not in batch:
+ raise ValueError(
+ "You need to provide 'action_is_pad' in the batch when "
+ f"{self.config.do_mask_loss_for_padding=}."
+ )
+ in_episode_bound = ~batch["action_is_pad"]
+ loss = loss * in_episode_bound.unsqueeze(-1)
+
+ return loss.mean()
+
+
+class SpatialSoftmax(nn.Module):
+ """
+ Spatial Soft Argmax operation described in "Deep Spatial Autoencoders for Visuomotor Learning" by Finn et al.
+ (https://huggingface.co/papers/1509.06113). A minimal port of the robomimic implementation.
+
+ At a high level, this takes 2D feature maps (from a convnet/ViT) and returns the "center of mass"
+ of activations of each channel, i.e., keypoints in the image space for the policy to focus on.
+
+ Example: take feature maps of size (512x10x12). We generate a grid of normalized coordinates (10x12x2):
+ -----------------------------------------------------
+ | (-1., -1.) | (-0.82, -1.) | ... | (1., -1.) |
+ | (-1., -0.78) | (-0.82, -0.78) | ... | (1., -0.78) |
+ | ... | ... | ... | ... |
+ | (-1., 1.) | (-0.82, 1.) | ... | (1., 1.) |
+ -----------------------------------------------------
+ This is achieved by applying channel-wise softmax over the activations (512x120) and computing the dot
+ product with the coordinates (120x2) to get expected points of maximal activation (512x2).
+
+ The example above results in 512 keypoints (corresponding to the 512 input channels). We can optionally
+ provide num_kp != None to control the number of keypoints. This is achieved by a first applying a learnable
+ linear mapping (in_channels, H, W) -> (num_kp, H, W).
+ """
+
+ def __init__(self, input_shape, num_kp=None):
+ """
+ Args:
+ input_shape (list): (C, H, W) input feature map shape.
+ num_kp (int): number of keypoints in output. If None, output will have the same number of channels as input.
+ """
+ super().__init__()
+
+ assert len(input_shape) == 3
+ self._in_c, self._in_h, self._in_w = input_shape
+
+ if num_kp is not None:
+ self.nets = torch.nn.Conv2d(self._in_c, num_kp, kernel_size=1)
+ self._out_c = num_kp
+ else:
+ self.nets = None
+ self._out_c = self._in_c
+
+ # we could use torch.linspace directly but that seems to behave slightly differently than numpy
+ # and causes a small degradation in pc_success of pre-trained models.
+ pos_x, pos_y = np.meshgrid(np.linspace(-1.0, 1.0, self._in_w), np.linspace(-1.0, 1.0, self._in_h))
+ pos_x = torch.from_numpy(pos_x.reshape(self._in_h * self._in_w, 1)).float()
+ pos_y = torch.from_numpy(pos_y.reshape(self._in_h * self._in_w, 1)).float()
+ # register as buffer so it's moved to the correct device.
+ self.register_buffer("pos_grid", torch.cat([pos_x, pos_y], dim=1))
+
+ def forward(self, features: Tensor) -> Tensor:
+ """
+ Args:
+ features: (B, C, H, W) input feature maps.
+ Returns:
+ (B, K, 2) image-space coordinates of keypoints.
+ """
+ if self.nets is not None:
+ features = self.nets(features)
+
+ # [B, K, H, W] -> [B * K, H * W] where K is number of keypoints
+ features = features.reshape(-1, self._in_h * self._in_w)
+ # 2d softmax normalization
+ attention = F.softmax(features, dim=-1)
+ # [B * K, H * W] x [H * W, 2] -> [B * K, 2] for spatial coordinate mean in x and y dimensions
+ expected_xy = attention @ self.pos_grid
+ # reshape to [B, K, 2]
+ feature_keypoints = expected_xy.view(-1, self._out_c, 2)
+
+ return feature_keypoints
+
+
+class DiffusionRgbEncoder(nn.Module):
+ """Encodes an RGB image into a 1D feature vector.
+
+ Includes the ability to normalize and crop the image first.
+ """
+
+ def __init__(self, config: DiffusionConfig):
+ super().__init__()
+ # Set up optional preprocessing.
+ if config.crop_shape is not None:
+ self.do_crop = True
+ # Always use center crop for eval
+ self.center_crop = torchvision.transforms.CenterCrop(config.crop_shape)
+ if config.crop_is_random:
+ self.maybe_random_crop = torchvision.transforms.RandomCrop(config.crop_shape)
+ else:
+ self.maybe_random_crop = self.center_crop
+ else:
+ self.do_crop = False
+
+ # Set up backbone.
+ backbone_model = getattr(torchvision.models, config.vision_backbone)(
+ weights=config.pretrained_backbone_weights
+ )
+ # Note: This assumes that the layer4 feature map is children()[-3]
+ # TODO(alexander-soare): Use a safer alternative.
+ self.backbone = nn.Sequential(*(list(backbone_model.children())[:-2]))
+ if config.use_group_norm:
+ if config.pretrained_backbone_weights:
+ raise ValueError(
+ "You can't replace BatchNorm in a pretrained model without ruining the weights!"
+ )
+ self.backbone = _replace_submodules(
+ root_module=self.backbone,
+ predicate=lambda x: isinstance(x, nn.BatchNorm2d),
+ func=lambda x: nn.GroupNorm(num_groups=x.num_features // 16, num_channels=x.num_features),
+ )
+
+ # Set up pooling and final layers.
+ # Use a dry run to get the feature map shape.
+ # The dummy input should take the number of image channels from `config.image_features` and it should
+ # use the height and width from `config.crop_shape` if it is provided, otherwise it should use the
+ # height and width from `config.image_features`.
+
+ # Note: we have a check in the config class to make sure all images have the same shape.
+ images_shape = next(iter(config.image_features.values())).shape
+ dummy_shape_h_w = config.crop_shape if config.crop_shape is not None else images_shape[1:]
+ dummy_shape = (1, images_shape[0], *dummy_shape_h_w)
+ feature_map_shape = get_output_shape(self.backbone, dummy_shape)[1:]
+
+ self.pool = SpatialSoftmax(feature_map_shape, num_kp=config.spatial_softmax_num_keypoints)
+ self.feature_dim = config.spatial_softmax_num_keypoints * 2
+ self.out = nn.Linear(config.spatial_softmax_num_keypoints * 2, self.feature_dim)
+ self.relu = nn.ReLU()
+
+ def forward(self, x: Tensor) -> Tensor:
+ """
+ Args:
+ x: (B, C, H, W) image tensor with pixel values in [0, 1].
+ Returns:
+ (B, D) image feature.
+ """
+ # Preprocess: maybe crop (if it was set up in the __init__).
+ if self.do_crop:
+ if self.training: # noqa: SIM108
+ x = self.maybe_random_crop(x)
+ else:
+ # Always use center crop for eval.
+ x = self.center_crop(x)
+ # Extract backbone feature.
+ x = torch.flatten(self.pool(self.backbone(x)), start_dim=1)
+ # Final linear layer with non-linearity.
+ x = self.relu(self.out(x))
+ return x
+
+
+def _replace_submodules(
+ root_module: nn.Module, predicate: Callable[[nn.Module], bool], func: Callable[[nn.Module], nn.Module]
+) -> nn.Module:
+ """
+ Args:
+ root_module: The module for which the submodules need to be replaced
+        predicate: Takes a module as an argument and must return True if that module is to be replaced.
+ func: Takes a module as an argument and returns a new module to replace it with.
+ Returns:
+ The root module with its submodules replaced.
+ """
+ if predicate(root_module):
+ return func(root_module)
+
+ replace_list = [k.split(".") for k, m in root_module.named_modules(remove_duplicate=True) if predicate(m)]
+ for *parents, k in replace_list:
+ parent_module = root_module
+ if len(parents) > 0:
+ parent_module = root_module.get_submodule(".".join(parents))
+ if isinstance(parent_module, nn.Sequential):
+ src_module = parent_module[int(k)]
+ else:
+ src_module = getattr(parent_module, k)
+ tgt_module = func(src_module)
+ if isinstance(parent_module, nn.Sequential):
+ parent_module[int(k)] = tgt_module
+ else:
+ setattr(parent_module, k, tgt_module)
+ # verify that all BN are replaced
+ assert not any(predicate(m) for _, m in root_module.named_modules(remove_duplicate=True))
+ return root_module
+
+
+class DiffusionSinusoidalPosEmb(nn.Module):
+ """1D sinusoidal positional embeddings as in Attention is All You Need."""
+
+ def __init__(self, dim: int):
+ super().__init__()
+ self.dim = dim
+
+ def forward(self, x: Tensor) -> Tensor:
+ device = x.device
+ half_dim = self.dim // 2
+ emb = math.log(10000) / (half_dim - 1)
+ emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
+ emb = x.unsqueeze(-1) * emb.unsqueeze(0)
+ emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
+ return emb
+
+
+class DiffusionConv1dBlock(nn.Module):
+ """Conv1d --> GroupNorm --> Mish"""
+
+ def __init__(self, inp_channels, out_channels, kernel_size, n_groups=8):
+ super().__init__()
+
+ self.block = nn.Sequential(
+ nn.Conv1d(inp_channels, out_channels, kernel_size, padding=kernel_size // 2),
+ nn.GroupNorm(n_groups, out_channels),
+ nn.Mish(),
+ )
+
+ def forward(self, x):
+ return self.block(x)
+
+
+class DiffusionConditionalUnet1d(nn.Module):
+ """A 1D convolutional UNet with FiLM modulation for conditioning.
+
+ Note: this removes local conditioning as compared to the original diffusion policy code.
+ """
+
+ def __init__(self, config: DiffusionConfig, global_cond_dim: int):
+ super().__init__()
+
+ self.config = config
+
+ # Encoder for the diffusion timestep.
+ self.diffusion_step_encoder = nn.Sequential(
+ DiffusionSinusoidalPosEmb(config.diffusion_step_embed_dim),
+ nn.Linear(config.diffusion_step_embed_dim, config.diffusion_step_embed_dim * 4),
+ nn.Mish(),
+ nn.Linear(config.diffusion_step_embed_dim * 4, config.diffusion_step_embed_dim),
+ )
+
+ # The FiLM conditioning dimension.
+ cond_dim = config.diffusion_step_embed_dim + global_cond_dim
+
+ # In channels / out channels for each downsampling block in the Unet's encoder. For the decoder, we
+ # just reverse these.
+ in_out = [(config.action_feature.shape[0], config.down_dims[0])] + list(
+ zip(config.down_dims[:-1], config.down_dims[1:], strict=True)
+ )
+
+ # Unet encoder.
+ common_res_block_kwargs = {
+ "cond_dim": cond_dim,
+ "kernel_size": config.kernel_size,
+ "n_groups": config.n_groups,
+ "use_film_scale_modulation": config.use_film_scale_modulation,
+ }
+ self.down_modules = nn.ModuleList([])
+ for ind, (dim_in, dim_out) in enumerate(in_out):
+ is_last = ind >= (len(in_out) - 1)
+ self.down_modules.append(
+ nn.ModuleList(
+ [
+ DiffusionConditionalResidualBlock1d(dim_in, dim_out, **common_res_block_kwargs),
+ DiffusionConditionalResidualBlock1d(dim_out, dim_out, **common_res_block_kwargs),
+ # Downsample as long as it is not the last block.
+ nn.Conv1d(dim_out, dim_out, 3, 2, 1) if not is_last else nn.Identity(),
+ ]
+ )
+ )
+
+ # Processing in the middle of the auto-encoder.
+ self.mid_modules = nn.ModuleList(
+ [
+ DiffusionConditionalResidualBlock1d(
+ config.down_dims[-1], config.down_dims[-1], **common_res_block_kwargs
+ ),
+ DiffusionConditionalResidualBlock1d(
+ config.down_dims[-1], config.down_dims[-1], **common_res_block_kwargs
+ ),
+ ]
+ )
+
+ # Unet decoder.
+ self.up_modules = nn.ModuleList([])
+ for ind, (dim_out, dim_in) in enumerate(reversed(in_out[1:])):
+ is_last = ind >= (len(in_out) - 1)
+ self.up_modules.append(
+ nn.ModuleList(
+ [
+ # dim_in * 2, because it takes the encoder's skip connection as well
+ DiffusionConditionalResidualBlock1d(dim_in * 2, dim_out, **common_res_block_kwargs),
+ DiffusionConditionalResidualBlock1d(dim_out, dim_out, **common_res_block_kwargs),
+ # Upsample as long as it is not the last block.
+ nn.ConvTranspose1d(dim_out, dim_out, 4, 2, 1) if not is_last else nn.Identity(),
+ ]
+ )
+ )
+
+ self.final_conv = nn.Sequential(
+ DiffusionConv1dBlock(config.down_dims[0], config.down_dims[0], kernel_size=config.kernel_size),
+ nn.Conv1d(config.down_dims[0], config.action_feature.shape[0], 1),
+ )
+
+ def forward(self, x: Tensor, timestep: Tensor | int, global_cond=None) -> Tensor:
+ """
+ Args:
+ x: (B, T, input_dim) tensor for input to the Unet.
+ timestep: (B,) tensor of (timestep_we_are_denoising_from - 1).
+            global_cond: (B, global_cond_dim) global conditioning vector.
+ Returns:
+ (B, T, input_dim) diffusion model prediction.
+ """
+ # For 1D convolutions we'll need feature dimension first.
+ x = einops.rearrange(x, "b t d -> b d t")
+
+ timesteps_embed = self.diffusion_step_encoder(timestep)
+
+ # If there is a global conditioning feature, concatenate it to the timestep embedding.
+ if global_cond is not None:
+ global_feature = torch.cat([timesteps_embed, global_cond], axis=-1)
+ else:
+ global_feature = timesteps_embed
+
+ # Run encoder, keeping track of skip features to pass to the decoder.
+ encoder_skip_features: list[Tensor] = []
+ for resnet, resnet2, downsample in self.down_modules:
+ x = resnet(x, global_feature)
+ x = resnet2(x, global_feature)
+ encoder_skip_features.append(x)
+ x = downsample(x)
+
+ for mid_module in self.mid_modules:
+ x = mid_module(x, global_feature)
+
+ # Run decoder, using the skip features from the encoder.
+ for resnet, resnet2, upsample in self.up_modules:
+ x = torch.cat((x, encoder_skip_features.pop()), dim=1)
+ x = resnet(x, global_feature)
+ x = resnet2(x, global_feature)
+ x = upsample(x)
+
+ x = self.final_conv(x)
+
+ x = einops.rearrange(x, "b d t -> b t d")
+ return x
+
+
+class DiffusionConditionalResidualBlock1d(nn.Module):
+ """ResNet style 1D convolutional block with FiLM modulation for conditioning."""
+
+ def __init__(
+ self,
+ in_channels: int,
+ out_channels: int,
+ cond_dim: int,
+ kernel_size: int = 3,
+ n_groups: int = 8,
+ # Set to True to do scale modulation with FiLM as well as bias modulation (defaults to False meaning
+ # FiLM just modulates bias).
+ use_film_scale_modulation: bool = False,
+ ):
+ super().__init__()
+
+ self.use_film_scale_modulation = use_film_scale_modulation
+ self.out_channels = out_channels
+
+ self.conv1 = DiffusionConv1dBlock(in_channels, out_channels, kernel_size, n_groups=n_groups)
+
+ # FiLM modulation (https://huggingface.co/papers/1709.07871) outputs per-channel bias and (maybe) scale.
+ cond_channels = out_channels * 2 if use_film_scale_modulation else out_channels
+ self.cond_encoder = nn.Sequential(nn.Mish(), nn.Linear(cond_dim, cond_channels))
+
+ self.conv2 = DiffusionConv1dBlock(out_channels, out_channels, kernel_size, n_groups=n_groups)
+
+ # A final convolution for dimension matching the residual (if needed).
+ self.residual_conv = (
+ nn.Conv1d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()
+ )
+
+ def forward(self, x: Tensor, cond: Tensor) -> Tensor:
+ """
+ Args:
+ x: (B, in_channels, T)
+ cond: (B, cond_dim)
+ Returns:
+ (B, out_channels, T)
+ """
+ out = self.conv1(x)
+
+ # Get condition embedding. Unsqueeze for broadcasting to `out`, resulting in (B, out_channels, 1).
+ cond_embed = self.cond_encoder(cond).unsqueeze(-1)
+ if self.use_film_scale_modulation:
+ # Treat the embedding as a list of scales and biases.
+ scale = cond_embed[:, : self.out_channels]
+ bias = cond_embed[:, self.out_channels :]
+ out = scale * out + bias
+ else:
+ # Treat the embedding as biases.
+ out = out + cond_embed
+
+ out = self.conv2(out)
+ out = out + self.residual_conv(x)
+ return out
diff --git a/diffusion/processor_diffusion.py b/diffusion/processor_diffusion.py
new file mode 100644
index 0000000..a7799be
--- /dev/null
+++ b/diffusion/processor_diffusion.py
@@ -0,0 +1,92 @@
+#!/usr/bin/env python
+
+# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Any
+
+import torch
+
+from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
+from lerobot.processor import (
+ AddBatchDimensionProcessorStep,
+ DeviceProcessorStep,
+ NormalizerProcessorStep,
+ PolicyAction,
+ PolicyProcessorPipeline,
+ RenameObservationsProcessorStep,
+ UnnormalizerProcessorStep,
+)
+from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
+from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
+
+
+def make_diffusion_pre_post_processors(
+ config: DiffusionConfig,
+ dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
+) -> tuple[
+ PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+ PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+ """
+ Constructs pre-processor and post-processor pipelines for a diffusion policy.
+
+    The pre-processing pipeline prepares the input data for the model by:
+    1. Renaming features.
+    2. Adding a batch dimension.
+    3. Moving the data to the specified device.
+    4. Normalizing the input and output features based on dataset statistics.
+
+    The post-processing pipeline handles the model's output by:
+    1. Unnormalizing the output features to their original scale.
+    2. Moving the data to the CPU.
+
+ Args:
+ config: The configuration object for the diffusion policy,
+ containing feature definitions, normalization mappings, and device information.
+ dataset_stats: A dictionary of statistics used for normalization.
+ Defaults to None.
+
+ Returns:
+ A tuple containing the configured pre-processor and post-processor pipelines.
+ """
+
+ input_steps = [
+ RenameObservationsProcessorStep(rename_map={}),
+ AddBatchDimensionProcessorStep(),
+ DeviceProcessorStep(device=config.device),
+ NormalizerProcessorStep(
+ features={**config.input_features, **config.output_features},
+ norm_map=config.normalization_mapping,
+ stats=dataset_stats,
+ ),
+ ]
+ output_steps = [
+ UnnormalizerProcessorStep(
+ features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
+ ),
+ DeviceProcessorStep(device="cpu"),
+ ]
+ return (
+ PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+ steps=input_steps,
+ name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+ ),
+ PolicyProcessorPipeline[PolicyAction, PolicyAction](
+ steps=output_steps,
+ name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+ to_transition=policy_action_to_transition,
+ to_output=transition_to_policy_action,
+ ),
+ )
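+
+
+# A minimal usage sketch (illustrative only; the exact call convention of the
+# `PolicyProcessorPipeline` objects is assumed here, not confirmed by this file):
+#
+#   preprocessor, postprocessor = make_diffusion_pre_post_processors(config, dataset_stats)
+#   batch = preprocessor(raw_batch)       # rename -> add batch dim -> to device -> normalize
+#   action = policy.select_action(batch)  # hypothetical policy call
+#   action = postprocessor(action)        # unnormalize -> move to CPU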
diff --git a/docs/superpowers/plans/2026-03-30-streaming-hdf5-ee-action.md b/docs/superpowers/plans/2026-03-30-streaming-hdf5-ee-action.md
new file mode 100644
index 0000000..1e697c1
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-30-streaming-hdf5-ee-action.md
@@ -0,0 +1,42 @@
+# Streaming HDF5 EE Action Dataset Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Convert Diana simulation data collection to streaming HDF5 writes, save images from four camera views at 256x256, and change `/action` to store the raw end-effector pose action prior to IK.
+
+**Architecture:** Add a standalone streaming HDF5 episode writer that appends qpos, raw actions, and resized images frame by frame, commits atomically when an episode succeeds, and deletes the temporary file on failure. The collection script only handles the rollout and hands each step's observation/action to the writer, so a full episode never has to be buffered in memory.
+
+**Tech Stack:** Python, h5py, numpy, cv2, unittest, MuJoCo demo scripts
+
+---
+
+### Task 1: Establish a test boundary for the streaming writer
+
+**Files:**
+- Create: `tests/test_streaming_episode_writer.py`
+- Create: `roboimi/utils/streaming_episode_writer.py`
+
+- [ ] **Step 1: Write the failing test**
+- [ ] **Step 2: Run `python -m unittest tests.test_streaming_episode_writer -v` and confirm it fails because the writer module does not exist**
+- [ ] **Step 3: Implement the minimal streaming writer with temp-file commit/discard, per-frame append, and 256x256 image resize**
+- [ ] **Step 4: Re-run `python -m unittest tests.test_streaming_episode_writer -v` and confirm it passes**
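+
+For Step 3 above, a minimal sketch of the writer surface this plan assumes (illustrative names, not the final implementation; it relies on h5py resizable datasets and an atomic temp-file rename):
+
+```python
+import os
+
+import cv2
+import h5py
+import numpy as np
+
+
+class StreamingEpisodeWriter:
+    """Appends one frame at a time to a temp HDF5 file; commit renames it, discard deletes it."""
+
+    def __init__(self, final_path, cameras, image_hw=(256, 256)):
+        self.final_path = final_path
+        self.tmp_path = final_path + ".tmp"
+        self.cameras = cameras
+        self.image_hw = image_hw
+        self.file = h5py.File(self.tmp_path, "w")
+        self.datasets = {}
+
+    def _append(self, name, value):
+        value = np.asarray(value)
+        if name not in self.datasets:
+            self.datasets[name] = self.file.create_dataset(
+                name, shape=(0, *value.shape), maxshape=(None, *value.shape), dtype=value.dtype
+            )
+        ds = self.datasets[name]
+        ds.resize(ds.shape[0] + 1, axis=0)
+        ds[-1] = value
+
+    def add_frame(self, qpos, raw_action, images):
+        self._append("/observations/qpos", qpos)
+        self._append("/action", raw_action)
+        h, w = self.image_hw
+        for cam in self.cameras:
+            self._append(f"/observations/images/{cam}", cv2.resize(images[cam], (w, h)))
+
+    def commit(self):
+        self.file.close()
+        os.replace(self.tmp_path, self.final_path)  # atomic within one filesystem
+
+    def discard(self):
+        self.file.close()
+        if os.path.exists(self.tmp_path):
+            os.remove(self.tmp_path)
+```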
+
+### Task 2: Integrate with the Diana collection script
+
+**Files:**
+- Modify: `roboimi/demos/diana_record_sim_episodes.py`
+- Reuse: `roboimi/utils/streaming_episode_writer.py`
+
+- [ ] **Step 1: Replace in-memory `data_dict` / `obs` accumulation with per-episode streaming writer lifecycle**
+- [ ] **Step 2: Keep four cameras (`angle`, `r_vis`, `top`, `front`) and resize to 256x256 before persistence**
+- [ ] **Step 3: Capture raw policy output before IK and write that to `/action`**
+- [ ] **Step 4: On success commit to `episode_{idx}.hdf5`; on failure remove temp file**
+
+### Task 3: Verify the changes
+
+**Files:**
+- Verify only
+
+- [ ] **Step 1: Run unit tests for the writer**
+- [ ] **Step 2: Run one end-to-end collection episode and stop after `episode_0.hdf5` becomes readable**
+- [ ] **Step 3: Verify HDF5 keys and shapes: `action=(700,16)`, image datasets are `(700,256,256,3)`, and `/action` matches raw EE action semantics**
diff --git a/docs/superpowers/plans/2026-03-31-raw-action-trajectory-viewer.md b/docs/superpowers/plans/2026-03-31-raw-action-trajectory-viewer.md
new file mode 100644
index 0000000..93b42c7
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-31-raw-action-trajectory-viewer.md
@@ -0,0 +1,26 @@
+# Raw Action Trajectory Viewer Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** In an interactive MuJoCo simulation window, mark the raw EE action trajectory exported from a rollout as a red trace and launch the simulation for manual inspection.
+
+**Architecture:** Read the raw_action / step data from an existing trajectory artifact, generate end-effector trajectory points for the left and right arms, and continuously inject red markers in the viewer render loop. Keep the implementation as a small, reusable standalone script so the main training/evaluation paths are unaffected.
+
+**Tech Stack:** Python, NumPy, MuJoCo viewer, unittest/mock.
+
+---
+
+### Task 1: Extract the raw_action trajectory and build the visualization point sets
+- [ ] Write a failing test that verifies extraction of left/right arm trajectory points from trajectory.npz
+- [ ] Implement the minimal helper
+- [ ] Run the tests and confirm they pass
+
+### Task 2: Render the red trajectory in the viewer and support interactive inspection
+- [ ] Write a failing test that verifies the marker configuration/calls
+- [ ] Implement the viewer visualization script
+- [ ] Run the tests and confirm they pass
+
+### Task 3: Launch the real simulation window for manual inspection
+- [ ] Launch the viewer with an existing trajectory artifact
+- [ ] Confirm the window is interactive and the red trace appears
+- [ ] Report the launch procedure and the script path to the user
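+
+A minimal sketch of the Task 1 extraction helper (assuming the rollout artifact stores a `raw_action` array in the 16D layout `[left xyz+quat, right xyz+quat, left grip, right grip]`; names are illustrative):
+
+```python
+import numpy as np
+
+
+def extract_arm_trajectories(npz_path):
+    """Return (left_pts, right_pts) as (T, 3) arrays of raw EE positions."""
+    data = np.load(npz_path)
+    raw = np.asarray(data["raw_action"])  # (T, 16)
+    return raw[:, 0:3], raw[:, 7:10]
+```
+
+The viewer script would then re-add one small red marker per trajectory point on every render iteration, since viewer markers are typically cleared after each frame.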
diff --git a/docs/superpowers/plans/2026-03-31-rollout-artifacts.md b/docs/superpowers/plans/2026-03-31-rollout-artifacts.md
new file mode 100644
index 0000000..00f5aa8
--- /dev/null
+++ b/docs/superpowers/plans/2026-03-31-rollout-artifacts.md
@@ -0,0 +1,44 @@
+# Rollout Artifacts Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Extend rollout evaluation so one selected checkpoint can be run once with video capture, timing breakdown, and saved EE trajectory artifacts.
+
+**Architecture:** Keep the implementation centered in `eval_vla.py` so existing training-time rollout validation remains compatible. Add config-gated artifact capture helpers, serialize outputs under the eval run directory, and add lightweight tests for helper behavior and summary wiring; default eval behavior must remain unchanged when artifact capture is off.
+
+**Tech Stack:** Python, Hydra/OmegaConf, NumPy, OpenCV, JSON, PyTorch unittest/mocking.
+
+---
+
+### Task 1: Add artifact capture configuration and helper wiring
+
+**Files:**
+- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
+- Modify: `roboimi/vla/conf/eval/eval.yaml`
+- Test: `tests/test_eval_vla_rollout_artifacts.py`
+
+- [ ] **Step 1: Write failing tests for optional artifact config / summary wiring**
+- [ ] **Step 2: Implement config-backed artifact flags and output paths with defaults that write nothing**
+- [ ] **Step 3: Verify existing eval call sites still work with defaults**
+
+### Task 2: Add timing breakdown, video recording, and trajectory export
+
+**Files:**
+- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
+- Test: `tests/test_eval_vla_rollout_artifacts.py`
+
+- [ ] **Step 1: Write failing tests for timing aggregation, trajectory serialization, and summary schema**
+- [ ] **Step 2: Implement per-step timing capture for `obs_read_ms`, `preprocess_ms`, `inference_ms`, `env_step_ms`, `loop_total_ms`**
+- [ ] **Step 3: Implement MP4 recording from a chosen camera stream and canonical `trajectory.npz` export using `left_link7/right_link7` executed poses after `env.step`**
+- [ ] **Step 4: Run focused tests and fix issues**
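+
+A minimal sketch of the per-step timing capture described in Step 2 (illustrative; phase names follow the summary schema above):
+
+```python
+import time
+from collections import defaultdict
+from contextlib import contextmanager
+
+import numpy as np
+
+
+class StepTimer:
+    """Accumulates per-phase wall-clock timings in milliseconds."""
+
+    def __init__(self):
+        self.records = defaultdict(list)
+
+    @contextmanager
+    def measure(self, phase):
+        t0 = time.perf_counter()
+        yield
+        self.records[phase].append((time.perf_counter() - t0) * 1000.0)
+
+    def summary(self):
+        return {
+            phase: {
+                "mean": float(np.mean(v)), "std": float(np.std(v)),
+                "min": float(np.min(v)), "max": float(np.max(v)), "count": len(v),
+            }
+            for phase, v in self.records.items()
+        }
+
+
+# Inside the rollout loop (names are illustrative):
+#   with timer.measure("inference_ms"):
+#       action = policy.predict(obs)
+```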
+
+### Task 3: Stop training safely and execute one real rollout
+
+**Files:**
+- Use: `roboimi/demos/vla_scripts/eval_vla.py`
+- Output: `runs/.../eval_artifacts/...`
+
+- [ ] **Step 1: Stop the active training process, wait for exit, and confirm the target checkpoint is readable**
+- [ ] **Step 2: Select the latest completed checkpoint if an explicit one is not provided; fall back to prior completed / best checkpoint if needed**
+- [ ] **Step 3: Run one headless rollout with artifact capture enabled**
+- [ ] **Step 4: Verify the MP4 / timing summary / trajectory files exist and summarize findings**
diff --git a/docs/superpowers/plans/2026-04-23-sim-air-insert-ring-bar.md b/docs/superpowers/plans/2026-04-23-sim-air-insert-ring-bar.md
new file mode 100644
index 0000000..54c7f28
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-23-sim-air-insert-ring-bar.md
@@ -0,0 +1,311 @@
+# sim_air_insert_ring_bar Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add an independent dual-Diana MuJoCo task `sim_air_insert_ring_bar` with a square ring block, a square bar block, staged rewards, strict finite-geometry in-air insertion success detection, and a task-specific scripted policy.
+
+**Architecture:** Reuse the current dual-Diana EE-control stack and environment factory, but add a task-specific scene XML, robot asset entrypoint, sampling helpers, and a new task-specific environment module. Keep `sim_transfer` untouched while introducing pure-Python geometry helpers and focused tests so reward/success behavior can be regression tested without requiring a full MuJoCo rollout in every test.
+
+**Tech Stack:** Python, unittest, MuJoCo XML assets, existing dual-Diana environment classes, Hydra-compatible task naming/config patterns.
+
+---
+
+## File Structure / Responsibilities
+
+- **Create:** `roboimi/assets/models/manipulators/DianaMed/ring_bar_objects.xml`
+ - Defines the rigid ring body and bar body, each with a free joint and stable box-based geoms.
+- **Create:** `roboimi/assets/models/manipulators/DianaMed/bi_diana_ring_bar_ee.xml`
+ - Scene entrypoint that includes the shared world/table/robot assets plus the new object XML.
+- **Modify:** `roboimi/assets/robots/diana_med.py`
+ - Add a task-specific robot asset class for the new scene XML without changing existing `BiDianaMed` behavior.
+- **Modify:** `roboimi/utils/act_ex_utils.py`
+ - Add deterministic helpers to sample left/right planar placement regions for ring and bar objects.
+- **Modify:** `roboimi/utils/constants.py`
+ - Register the new task name and default metadata.
+- **Create:** `roboimi/envs/double_air_insert_env.py`
+ - New task-specific environment, finite-geometry success helpers, reset logic, reward logic, and task factory branch.
+- **Modify:** `roboimi/envs/double_pos_ctrl_env.py`
+ - Route `make_sim_env()` to the new task-specific environment while keeping current `sim_transfer` logic unchanged.
+- **Create:** `roboimi/demos/diana_air_insert_policy.py`
+ - Task-specific waypoint/open-loop scripted policy for grasp-lift-align-insert.
+- **Modify:** `roboimi/demos/vla_scripts/eval_vla.py`
+ - Reset the new task with the correct sampled task state instead of assuming a single transfer box pose.
+- **Create:** `tests/test_air_insert_env.py`
+ - Focused unit tests for sampling, reset helpers, reward progression, and strict success detection.
+- **Modify:** `tests/test_eval_vla_headless.py`
+ - Add coverage that headless evaluation dispatches the correct reset sampler for the new task.
+- **Modify:** `tests/test_robot_asset_paths.py`
+ - Verify the new robot asset class resolves its XML path correctly independent of cwd.
+
+---
+
+### Task 1: Add failing tests for task registration, samplers, and asset wiring
+
+**Files:**
+- Create: `tests/test_air_insert_env.py`
+- Modify: `tests/test_eval_vla_headless.py`
+- Modify: `tests/test_robot_asset_paths.py`
+- Modify: `roboimi/utils/act_ex_utils.py` (later in implementation)
+- Modify: `roboimi/utils/constants.py` (later in implementation)
+- Modify: `roboimi/assets/robots/diana_med.py` (later in implementation)
+- Modify: `roboimi/envs/double_pos_ctrl_env.py` (later in implementation)
+- Create: `roboimi/envs/double_air_insert_env.py` (minimal stub in this task)
+
+- [ ] **Step 1: Write failing tests for task config and sampling helpers**
+
+Add tests in `tests/test_air_insert_env.py` covering:
+- `SIM_TASK_CONFIGS['sim_air_insert_ring_bar']` exists
+- `sample_air_insert_ring_bar_pose()` (or equivalent helper) returns ring/bar positions with fixed z and correct left/right planar ranges
+- output structure is explicit and easy for reset/eval code to consume
+
+- [ ] **Step 2: Write failing tests for environment factory dispatch and robot asset resolution**
+
+Add tests covering:
+- `make_sim_env('sim_air_insert_ring_bar', headless=True)` dispatches to the new environment with rendering disabled
+- a new robot asset class resolves the new XML path independent of cwd, similar to the existing `BiDianaMed` test pattern
+
+- [ ] **Step 3: Write failing tests for eval reset helper dispatch**
+
+Extend `tests/test_eval_vla_headless.py` so headless eval can reset the new task using the new sampler instead of hard-coding `sample_transfer_pose()`.
+
+- [ ] **Step 4: Run the targeted tests to verify they fail for the expected missing-feature reasons**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env tests.test_eval_vla_headless tests.test_robot_asset_paths -v`
+
+Expected:
+- FAIL because the new task config/helper/class/dispatch branch does not exist yet
+
+- [ ] **Step 5: Implement the minimal production code to satisfy the new task registration and helper tests**
+
+Implement only enough to make the new tests pass:
+- add new task config entry
+- add the new placement sampler
+- add the new robot asset class
+- create a minimal importable `double_air_insert_env.py` stub and class/function surface needed for factory dispatch tests
+- add the factory dispatch branch / headless wiring
+- update eval reset dispatch for the new task
+
+- [ ] **Step 6: Re-run the targeted tests to verify they pass**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env tests.test_eval_vla_headless tests.test_robot_asset_paths -v`
+
+Expected:
+- PASS for the new registration/sampler/dispatch/asset tests
+
+- [ ] **Step 7: Commit Task 1**
+
+Run:
+`git add tests/test_air_insert_env.py tests/test_eval_vla_headless.py tests/test_robot_asset_paths.py roboimi/utils/act_ex_utils.py roboimi/utils/constants.py roboimi/assets/robots/diana_med.py roboimi/envs/double_pos_ctrl_env.py roboimi/envs/double_air_insert_env.py roboimi/demos/vla_scripts/eval_vla.py && git commit -m "feat(env): register sim air insert ring bar task"`
+
+---
+
+### Task 2: Add the MuJoCo ring+bar scene assets and reset helpers
+
+**Files:**
+- Create: `roboimi/assets/models/manipulators/DianaMed/ring_bar_objects.xml`
+- Create: `roboimi/assets/models/manipulators/DianaMed/bi_diana_ring_bar_ee.xml`
+- Create or Modify: `roboimi/envs/double_air_insert_env.py`
+- Modify: `tests/test_air_insert_env.py`
+
+- [ ] **Step 1: Write failing tests for object reset helpers and scene-specific joint naming assumptions**
+
+In `tests/test_air_insert_env.py`, add unit tests for helper functions that:
+- write ring pose to `ring_block_joint` from the named task-state mapping
+- write bar pose to `bar_block_joint` from the named task-state mapping
+- read back `env_state` as a stable 14D vector `[ring_pos, ring_quat, bar_pos, bar_quat]`
+
+Use fake `mj_data` objects so tests stay fast and deterministic.
+
+- [ ] **Step 2: Run the focused test slice and verify it fails**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env -v`
+
+Expected:
+- FAIL because reset/state helper functions and joint conventions are not implemented yet
+
+- [ ] **Step 3: Implement the scene XML files and reset/state helper code**
+
+Implement:
+- the object XML with one rigid ring body and one rigid bar body
+- the task scene XML entrypoint using the shared world/table/robot includes
+- reset helper(s) in `double_air_insert_env.py` that set qpos for both free joints with fixed quaternions
+- task-state accessor(s) returning both object poses in a stable structure
+
+- [ ] **Step 4: Re-run the focused test slice and verify it passes**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env -v`
+
+Expected:
+- PASS for reset/state helper tests
+
+- [ ] **Step 5: Commit Task 2**
+
+Run:
+`git add roboimi/assets/models/manipulators/DianaMed/ring_bar_objects.xml roboimi/assets/models/manipulators/DianaMed/bi_diana_ring_bar_ee.xml roboimi/envs/double_air_insert_env.py tests/test_air_insert_env.py && git commit -m "feat(scene): add ring and bar insertion scene assets"`
+
+---
+
+### Task 3: Implement strict reward and finite-geometry success detection
+
+**Files:**
+- Modify: `roboimi/envs/double_air_insert_env.py`
+- Modify: `tests/test_air_insert_env.py`
+
+- [ ] **Step 1: Write failing tests for reward stages and strict success detection**
+
+Add tests in `tests/test_air_insert_env.py` for:
+- left contact stage reward
+- right contact stage reward
+- ring lifted off table stage
+- bar lifted off table stage
+- positive success case where a finite bar truly passes through the aperture
+- negative case where the centerline would pass but the finite square body would clip
+- negative case where the bar has not crossed the ring thickness direction enough
+- negative case where one/both objects are still on the table
+
+Structure the tests around pure helper functions and light fake contact/state objects so the geometry logic is directly regression tested.
+
+- [ ] **Step 2: Run the focused tests and verify they fail for missing reward/success logic**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env -v`
+
+Expected:
+- FAIL because the staged reward and finite-geometry insertion logic are not implemented yet
+
+- [ ] **Step 3: Implement minimal strict success helpers and reward logic**
+
+Implement in `roboimi/envs/double_air_insert_env.py`:
+- pure helper(s) for transforming bar geometry into ring-local coordinates
+- finite-geometry insertion predicate (not centerline-only)
+- table-contact / airborne checks
+- staged reward function returning the highest achieved stage with `max_reward = 5`
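+
+A minimal sketch of the staged reward described above (illustrative; the real stage predicates come from contact and table-height checks in the environment):
+
+```python
+def staged_reward(left_touching_ring, right_touching_bar,
+                  ring_off_table, bar_off_table, bar_through_ring):
+    """Return the highest achieved stage, with max_reward = 5."""
+    reward = 0
+    if left_touching_ring:
+        reward = 1
+    if right_touching_bar:
+        reward = max(reward, 2)
+    if ring_off_table:
+        reward = max(reward, 3)
+    if bar_off_table:
+        reward = max(reward, 4)
+    if ring_off_table and bar_off_table and bar_through_ring:
+        reward = 5
+    return reward
+```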
+
+- [ ] **Step 4: Re-run the focused tests to verify the logic passes**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env -v`
+
+Expected:
+- PASS for reward and success-detection regression tests
+
+- [ ] **Step 5: Commit Task 3**
+
+Run:
+`git add roboimi/envs/double_air_insert_env.py tests/test_air_insert_env.py && git commit -m "feat(env): add strict air insertion reward and success logic"`
+
+---
+
+### Task 4: Add the scripted policy and integration smoke coverage
+
+**Files:**
+- Create: `roboimi/demos/diana_air_insert_policy.py`
+- Modify: `roboimi/demos/diana_record_sim_episodes.py`
+- Modify: `tests/test_air_insert_env.py`
+- Optionally Modify: `roboimi/demos/vla_scripts/eval_vla.py` (only if integration gaps remain after Task 1)
+
+- [ ] **Step 1: Write failing tests for scripted-policy action shape and basic generation**
+
+Add tests covering:
+- the new policy produces a 16D action
+- trajectory generation accepts sampled named task state without error
+- the first action is a valid open-gripper safe pose command
+- a deterministic nominal smoke path (with canonical sampled state or fake env shim) reaches the intended terminal interface contract without shape/reward mismatches
+
+Keep the tests unit-level; do not require a full MuJoCo rollout for every assertion.
+
+- [ ] **Step 2: Write failing tests for the scripted rollout entrypoint and a real headless smoke path**
+
+Add coverage for both:
+- the standard scripted rollout entrypoint (`roboimi/demos/diana_record_sim_episodes.py`) can select the new task sampler/policy instead of remaining sim_transfer-only
+- a deterministic integration/smoke test that instantiates `make_sim_env('sim_air_insert_ring_bar', headless=True)`, resets with sampled named task state, and steps a few actions or scripted-policy outputs using the real task XML and task-specific wiring
+
+- [ ] **Step 3: Run the scripted-policy tests and verify they fail**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env -v`
+
+Expected:
+- FAIL because the new scripted policy does not exist yet
+
+- [ ] **Step 4: Implement the waypoint-based scripted policy**
+
+Implement a conservative open-loop policy with phases:
+- safe wait pose
+- above-target approach
+- descend + grasp
+- dual lift
+- airborne meeting alignment
+- bar push-through insertion
+
+Use fixed orientations for version 1 and follow the existing repository style from `diana_policy.py`.
+
+- [ ] **Step 5: Re-run the scripted-policy tests to verify they pass**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env -v`
+
+Expected:
+- PASS for scripted-policy tests
+
+- [ ] **Step 6: Run the combined verification suite for this feature**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env tests.test_eval_vla_headless tests.test_eval_vla_rollout_artifacts tests.test_train_vla_rollout_validation tests.test_robot_asset_paths -v`
+
+Expected:
+- PASS with 0 failures
+
+- [ ] **Step 6b: Run the mandatory real headless smoke check**
+
+Run a focused smoke command that instantiates the real task, resets with sampled state, and steps a few actions using the new scripted policy or a deterministic action sequence.
+
+Example command (adjust module/test helper if needed):
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env.AirInsertEnvSmokeTest -v`
+
+Expected:
+- PASS, proving the real XML/assets/env wiring instantiate and step correctly in headless mode
+
+- [ ] **Step 7: Commit Task 4**
+
+Run:
+`git add roboimi/demos/diana_air_insert_policy.py tests/test_air_insert_env.py tests/test_eval_vla_headless.py tests/test_robot_asset_paths.py roboimi/demos/vla_scripts/eval_vla.py && git commit -m "feat(policy): add scripted air insertion policy"`
+
+---
+
+### Task 5: Final verification and implementation review
+
+**Files:**
+- Review all files touched above
+
+- [ ] **Step 1: Run fresh end-to-end verification before claiming completion**
+
+Run:
+`/home/droid/.conda/envs/roboimi/bin/python -m unittest tests.test_air_insert_env tests.test_eval_vla_headless tests.test_robot_asset_paths -v`
+
+Expected:
+- PASS with 0 failures
+
+- [ ] **Step 2: Inspect git status and recent commits**
+
+Run:
+`git status --short && git log --oneline --decorate -n 8`
+
+Expected:
+- only intended feature files modified / committed
+
+- [ ] **Step 3: Request final code review for the completed feature**
+
+Use the requesting-code-review skill against the full diff from the feature branch starting point to current HEAD.
+
+- [ ] **Step 4: Address any review findings and re-run verification if code changes**
+
+If fixes are made, repeat the unittest command from Step 1.
+
+- [ ] **Step 5: Hand off using finishing-a-development-branch**
+
+After verification and review, use the finishing-a-development-branch skill to decide merge / PR / cleanup.
diff --git a/docs/superpowers/specs/2026-03-30-vla-training-headless-swanlab-design.md b/docs/superpowers/specs/2026-03-30-vla-training-headless-swanlab-design.md
new file mode 100644
index 0000000..f232125
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-30-vla-training-headless-swanlab-design.md
@@ -0,0 +1,241 @@
+# VLA Training + Headless Rollout + SwanLab Design
+
+**Date:** 2026-03-30
+**Branch:** feat-align-dp-transformer-ee
+
+## Goal
+Fill in the missing training dependencies for the default `resnet_transformer` / `Transformer1D` pipeline in the current repository and start training on the dataset at `/home/droid/project/diana_sim/sim_transfer`. Also support uploading SwanLab scalar logs during training, and provide a headless mode for later rollout validation so no MuJoCo / OpenCV GUI windows pop up.
+
+## Non-Goals
+- Do not rewrite the entire training framework
+- Do not introduce a new workspace / callback framework
+- Do not add complex video/media log uploading in this round
+- Do not modify the dataset format itself
+
+## Current State
+- The default training config has already switched to `agent=resnet_transformer`, with `Transformer1D` as the head
+- The current environment is missing several Python dependencies required for training: `diffusers`, `torchvision`, `einops`, `swanlab`
+- The evaluation environment `make_sim_env(task_name)` currently hard-codes `is_render=True`
+- The camera thread `camera_viewer()` calls `cv2.namedWindow/imshow` by default, so a window pops up even when only images are needed
+- The training script currently supports train/val loss and checkpoints, but has no SwanLab integration
+- The dataset directory `/home/droid/project/diana_sim/sim_transfer` already contains 100 episodes but no `dataset_stats.pkl` yet
+
+## User Requirements
+1. Install the missing training dependencies in the existing mamba environment
+2. Start training on `/home/droid/project/diana_sim/sim_transfer`
+3. If rollout validation is needed during training, support headless operation with no GUI
+4. Upload training metrics to SwanLab
+5. Default SwanLab project name is `roboimi-vla`
+
+## Proposed Approach
+Adopt a "minimal necessary changes" approach:
+
+### 1. Dependency Layer
+Install the missing training dependencies in the existing `roboimi` environment, keeping the current environment name and script entrypoints unchanged.
+
+#### Install Plan
+- Environment: keep using the existing mamba environment `roboimi`
+- Installation method:
+  - Prefer `python -m pip install` from the current env
+  - Packages to install:
+    - `diffusers`
+    - `torchvision`
+    - `einops`
+    - `swanlab`
+  - Version policy:
+    - Prefer the newest installable versions compatible with the current `torch==2.4.0`
+    - On compatibility problems, fall back to stable versions aligned with `torch 2.4`
+  - Reproducibility policy:
+    - Write the **actually installed resolved versions** back into the repository's environment definition file this round, to avoid later environment drift
+
+Verify the following imports before training:
+- `torch`
+- `hydra`
+- `omegaconf`
+- `diffusers`
+- `torchvision`
+- `einops`
+- `swanlab`
+- `cv2`
+- `h5py`
+- `mujoco`
+
+### 2. Dataset Preparation
+Reuse the existing `SimpleRobotDataset` directly; only point `data.dataset_dir` to:
+- `/home/droid/project/diana_sim/sim_transfer`
+
+Before training, use the existing stats script to generate:
+- `/home/droid/project/diana_sim/sim_transfer/dataset_stats.pkl`
+
+The stats generation command should:
+- Run from the repository root
+- Write stats directly for `/home/droid/project/diana_sim/sim_transfer`
+- Remove the training script's reliance on the default data directory
+
+### 3. SwanLab Logging
+Add a lightweight logging integration layer to the training script:
+- Whether SwanLab is enabled is decided by configuration, enabled by default
+- Default project: `roboimi-vla`
+- The API key is never written into the repository or config files; it is only used via local login state or environment variables
+- When `train.use_swanlab=true`:
+  - If `swanlab` cannot be imported, training fails fast
+  - If not logged in or authentication fails, training fails fast
+- At every training log point, upload:
+  - `train/loss`
+  - `train/lr`
+  - `train/best_loss`
+  - `train/step`
+- At every validation point, upload:
+  - `val/loss`
+- At the end of training, record the final checkpoint path and the best checkpoint path
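+
+A minimal sketch of this integration (assuming the `swanlab.init` / `swanlab.log` API and the config keys from the Config Contract below; not the final implementation):
+
+```python
+def init_swanlab(cfg):
+    """Create a SwanLab run, failing fast when logging is requested but unusable."""
+    if not cfg.train.use_swanlab:
+        return None
+    import swanlab  # an ImportError here is the intended fail-fast behavior
+    return swanlab.init(project=cfg.train.swanlab_project)
+
+
+# At each training log point (names are illustrative):
+#   swanlab.log({"train/loss": loss, "train/lr": lr,
+#                "train/best_loss": best_loss, "train/step": step})
+# At each validation point:
+#   swanlab.log({"val/loss": val_loss})
+```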
+
+### 4. Headless Rollout Design
+The goal is for rollout validation to "obtain image observations without opening any window".
+
+Minimal-change strategy:
+- Add a `headless` / `is_render` parameter to `make_sim_env(...)`
+- Add a switch to the camera-thread display logic:
+  - In headless mode, keep updating the `r_vis/top/front/...` image caches
+  - But never call `cv2.namedWindow` / `cv2.imshow` / `cv2.waitKey`
+- In the evaluation script:
+  - Do not call `env.render()` in headless mode
+  - Still allow `env._get_image_obs()` and policy inference to run normally
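+
+A minimal sketch of the display switch (illustrative method and attribute names; the real change lives in the camera-thread code):
+
+```python
+import cv2
+
+
+def camera_viewer_step(self, frames):
+    """Update cached camera images every step; only touch the GUI when not headless."""
+    self.image_cache.update(frames)  # keep r_vis/top/front/... caches fresh in all modes
+    if self.headless:
+        return  # no cv2 windows, no waitKey in headless mode
+    for name, frame in frames.items():
+        cv2.imshow(name, frame)
+    cv2.waitKey(1)
+```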
+
+#### Training-Time Rollout Scope
+- This round **will provide an optional checkpoint-time rollout validation path**, disabled by default
+- When enabled, saving a checkpoint during training can invoke the repository's own rollout/eval logic to validate a small number of episodes
+- This path must honor the **single authoritative switch** `eval.headless=true`, meaning:
+  - No MuJoCo viewer is opened
+  - `cv2.namedWindow / cv2.imshow / cv2.waitKey` are never called
+  - Images can still be read and policy inference still completes
+- By default, no frequent rollouts are added, to avoid slowing training; only the capability and the config switch are provided
+
+If validation reveals that the camera thread hard-depends on the GUI, the fallback strategy is:
+- The main training flow + SwanLab must work first
+- Rollout validation stays an explicitly optional capability
+- But this round must still guarantee at least one callable headless validation execution path, not one that exists only on paper
+
+### 5. Training Execution Strategy
+Execute in two steps:
+
+#### Step A: Smoke Run
+Launch a smoke training run with a small step count to confirm that:
+- The dataset loads correctly
+- The stats file can be loaded
+- The model can be instantiated
+- A single forward/backward step works
+- Checkpoints are written correctly
+- SwanLab successfully uploads scalars
+
+#### Step B: Real Training Run
+After the smoke run succeeds, launch the real training run.
+
+## Execution Commands
+
+### A. Stats Generation
+Run from the repository root to generate:
+- `/home/droid/project/diana_sim/sim_transfer/dataset_stats.pkl`
+
+Command template:
+```bash
+/home/droid/.conda/envs/roboimi/bin/python roboimi/vla/scripts/calculate_stats.py \
+ --dataset_dir /home/droid/project/diana_sim/sim_transfer
+```
+
+### B. Smoke Training Command
+Run from the repository root; the key overrides include:
+- `data.dataset_dir=/home/droid/project/diana_sim/sim_transfer`
+- A small `train.max_steps`
+- A higher logging frequency
+- SwanLab enabled
+- Output directory set to `checkpoints/` under the current run directory
+
+Command template:
+```bash
+/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
+ data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
+ train.max_steps=20 \
+ train.log_freq=1 \
+ train.save_freq=10 \
+ train.use_swanlab=true \
+ train.swanlab_project=roboimi-vla \
+ train.rollout_validate_on_checkpoint=false
+```
+
+### C. Real Training Command
+Run from the repository root; the key overrides include:
+- `data.dataset_dir=/home/droid/project/diana_sim/sim_transfer`
+- The full `train.max_steps`
+- Default project=`roboimi-vla`
+- If rollout validation is enabled, pass `eval.headless=true` plus the training-side rollout switch
+
+Command template:
+```bash
+/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
+ data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
+ train.use_swanlab=true \
+ train.swanlab_project=roboimi-vla \
+ train.rollout_validate_on_checkpoint=true \
+ eval.headless=true
+```
+
+### D. Output Behavior
+- Checkpoint output directory: `checkpoints/` under the current working directory
+- Key files:
+ - `checkpoints/vla_model_step_.pt`
+ - `checkpoints/vla_model_best.pt`
+ - `checkpoints/vla_model_final.pt`
+
+## File-Level Changes
+- `environment.yml`
+  - Write the newly added training dependencies back in to keep the setup reproducible
+- `roboimi/demos/vla_scripts/train_vla.py`
+  - Add SwanLab integration
+  - Add clearer dataset-directory override support
+  - Add an optional checkpoint-time rollout validation entrypoint
+  - Keep the current optimizer alignment logic unchanged
+- `roboimi/vla/conf/config.yaml`
+  - Add/extend config entries for training logging, SwanLab, and rollout
+- `roboimi/vla/conf/eval/eval.yaml`
+  - Add evaluation controls such as `headless`
+- `roboimi/envs/double_pos_ctrl_env.py`
+  - `make_sim_env` supports headless / no-render
+- `roboimi/envs/double_base.py`
+  - Decouple camera capture from GUI display
+- `roboimi/vla/scripts/calculate_stats.py`
+  - Support passing an external `dataset_dir` directly on the command line
+- tests (new)
+  - Cover the optional SwanLab initialization path
+  - Cover the key headless behavior of "no windows, but images still retrievable"
+
+## Validation Plan
+1. After installing dependencies, verify that all imports pass
+2. Generate `dataset_stats.pkl`
+3. Run the training smoke run
+4. Confirm the SwanLab dashboard shows scalar updates under project `roboimi-vla`
+5. If rollout validation is enabled: confirm no GUI pops up in headless mode and the rollout path actually executes
+6. Then launch the real training run
+
+## Config Contract
+The configuration keys added/fixed in this round take the following form:
+- `train.use_swanlab: true|false`
+- `train.swanlab_project: roboimi-vla`
+- `train.rollout_validate_on_checkpoint: true|false`
+- `eval.headless: true|false`
+
+## Risks and Mitigations
+- **Risk:** The GUI/camera thread is coupled with off-screen rendering
+  - **Mitigation:** Decouple display from image updates first; if necessary, demote rollout validation to a second phase
+- **Risk:** The existing env dependencies are incomplete
+  - **Mitigation:** Verify imports first, then run the smoke run
+- **Risk:** The dataset is so large that even the smoke run is slow
+  - **Mitigation:** The smoke run only executes a very small number of steps
+- **Risk:** SwanLab API key leakage
+  - **Mitigation:** Never write it into code/config; keep it only in local login state or environment variables
+
+## Success Criteria
+- The training script starts on `/home/droid/project/diana_sim/sim_transfer`
+- Checkpoints are successfully written to `checkpoints/`
+- Train/val scalars are visible in SwanLab under the `roboimi-vla` project
+- Headless rollout has an execution path that opens no GUI
+- If training-side rollout validation is enabled, that path can actually be invoked in headless mode
diff --git a/docs/superpowers/specs/2026-03-31-rollout-artifacts-design.md b/docs/superpowers/specs/2026-03-31-rollout-artifacts-design.md
new file mode 100644
index 0000000..1d30446
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-31-rollout-artifacts-design.md
@@ -0,0 +1,16 @@
+# Rollout Artifacts Design
+
+**Goal:** Add a one-off evaluation path that can record rollout video, export per-step timing breakdowns, and save executed end-effector trajectories for a selected checkpoint while preserving default eval behavior when artifact capture is disabled.
+
+**Approach:** Extend `roboimi/demos/vla_scripts/eval_vla.py` with optional evaluation-time artifact capture that stays backward compatible when disabled. Reuse existing environment observation and camera streams, record one camera stream to MP4, collect per-step timing around observation read / preprocessing / model inference / env step / total loop, and save per-step raw predicted EE actions plus executed EE poses after stepping.
+
+**Artifact contract:**
+- `video.mp4`: optional MP4 encoded from a selected camera stream (`r_vis`, `top`, `front`, etc.), written only when recording is enabled.
+- `trajectory.npz`: canonical trajectory export containing at minimum `step`, `reward`, `raw_action`, `executed_left_link7_pos`, `executed_left_link7_quat`, `executed_right_link7_pos`, `executed_right_link7_quat`, and optional duplicated tool-body poses if captured.
+- `timing.json`: JSON-serializable per-episode timing summary with millisecond units for `obs_read_ms`, `preprocess_ms`, `inference_ms`, `env_step_ms`, `loop_total_ms`, plus aggregate mean/std/min/max and counts. Raw per-step timing arrays should also be persisted in the NPZ for later analysis.
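+
+A minimal sketch of serializing this contract (illustrative; array names mirror the keys above):
+
+```python
+import json
+
+import numpy as np
+
+
+def save_rollout_artifacts(out_dir, steps, rewards, raw_actions,
+                           left_pos, left_quat, right_pos, right_quat, timing_summary):
+    """Write trajectory.npz and timing.json per the artifact contract."""
+    np.savez(
+        f"{out_dir}/trajectory.npz",
+        step=np.asarray(steps),
+        reward=np.asarray(rewards),
+        raw_action=np.asarray(raw_actions),
+        executed_left_link7_pos=np.asarray(left_pos),
+        executed_left_link7_quat=np.asarray(left_quat),
+        executed_right_link7_pos=np.asarray(right_pos),
+        executed_right_link7_quat=np.asarray(right_quat),
+    )
+    with open(f"{out_dir}/timing.json", "w") as f:
+        json.dump(timing_summary, f, indent=2)
+```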
+
+**Checkpoint selection:** Prefer an explicitly requested checkpoint path. If the caller asks for “latest” or omits a path in the execution helper, select the newest fully written checkpoint file by mtime/name and fail clearly if none exists.
+
+**Stop-training / execution safety:** Before rollout, stop any active training process using the target run, wait for process exit, then verify the chosen checkpoint exists and is readable. If the most recent checkpoint is missing or mid-write, fall back to the previous completed checkpoint or `vla_model_best.pt` with the decision logged.
+
+**Backward compatibility:** With all new eval flags left at default values, `_run_eval` return shape must remain compatible with existing callers, training-time rollout validation should continue to work without passing new options, and no artifact files should be written.
diff --git a/docs/superpowers/specs/2026-04-23-sim-air-insert-ring-bar-design.md b/docs/superpowers/specs/2026-04-23-sim-air-insert-ring-bar-design.md
new file mode 100644
index 0000000..feb54b6
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-23-sim-air-insert-ring-bar-design.md
@@ -0,0 +1,316 @@
+# sim_air_insert_ring_bar Design
+
+## Summary
+
+Add a new independent MuJoCo simulation task named `sim_air_insert_ring_bar` that keeps the existing dual-Diana tabletop setup but replaces the single transfer box with two randomized objects:
+
+- a square ring block grasped by the left arm
+- a square bar block grasped by the right arm
+
+The task is to pick both objects off the table and complete an in-air insertion where the bar truly passes through the ring aperture. The existing `sim_transfer` task must remain unchanged.
+
+## Goals
+
+- Reuse the current dual-Diana EE-control simulation stack
+- Keep the same table/base robot arrangement as the existing transfer task
+- Add an independent task entrypoint and scene definition
+- Randomize planar placement of both objects within left/right task-specific regions
+- Implement reward staging for contact, lift, and successful in-air insertion
+- Add a scripted policy that performs pick, lift, align, and in-air insertion
+- Preserve compatibility with existing environment creation, evaluation, and rollout patterns
+
+## Non-Goals
+
+- No random yaw in the first version
+- No visual servoing or closed-loop insertion controller
+- No general multi-task environment framework refactor
+- No guarantee that the VLA training stack is immediately tuned for this new task
+- No replacement or behavior change for `sim_transfer`
+
+## Task Name
+
+Use a new task name:
+
+- `sim_air_insert_ring_bar`
+
+This task should be exposed alongside `sim_transfer`, not as a replacement.
+
+## Scene Geometry
+
+### Shared Base Scene
+
+Keep the dual Diana robot, the table, and the existing camera layout conceptually unchanged.
+
+### Ring Block
+
+Represent the square ring as a rigid free body composed from simple MuJoCo box geoms rather than an external mesh.
+
+Dimensions:
+
+- outer side length: 68 mm
+- inner aperture side length: 32 mm
+- thickness: 18 mm
+- ring wall width: 18 mm
+
+The ring should behave as a single object body with a single free joint.
+
+### Bar Block
+
+Represent the bar as a rigid free body with a single box geom.
+
+Dimensions:
+
+- length: 90 mm
+- cross-section: 18 mm x 18 mm
+
+The bar should also be a single free-joint body.
+
+## Initial Placement / Reset
+
+The first version uses position-only randomization with fixed orientation. Reset sampling stays **caller-driven**, matching the existing `sim_transfer` usage pattern in rollout/eval code: a helper samples task state, then callers pass that state into `env.reset(...)`.
+
+Use an explicit sampled task-state structure with named fields:
+
+- `ring_pos`: 3D position
+- `ring_quat`: fixed 4D quaternion for version 1
+- `bar_pos`: 3D position
+- `bar_quat`: fixed 4D quaternion for version 1
+
+Behavior:
+
+- ring block: randomized only in a left-side planar sampling region
+- bar block: randomized only in a right-side planar sampling region
+- both objects start flat on the table
+- both objects use fixed orientation at reset
+- no random yaw, tilt, or flip in this version
+
+The sampling regions should be chosen conservatively so that:
+
+- the left arm can comfortably reach and grasp the ring
+- the right arm can comfortably reach and grasp the bar
+- scripted open-loop pick trajectories remain feasible
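+
+A minimal sketch of the caller-driven sampler (the region bounds and table height here are placeholders rather than tuned values; the real ranges belong in `act_ex_utils`):
+
+```python
+import numpy as np
+
+RING_XY_RANGE = ((0.10, 0.25), (0.15, 0.30))   # placeholder left-side region: (x_lo, x_hi), (y_lo, y_hi)
+BAR_XY_RANGE = ((0.10, 0.25), (-0.30, -0.15))  # placeholder right-side region
+TABLE_Z = 0.76                                 # placeholder tabletop height
+FIXED_QUAT = np.array([1.0, 0.0, 0.0, 0.0])    # fixed orientation in version 1
+
+
+def sample_air_insert_ring_bar_pose(rng=np.random):
+    ring_xy = [rng.uniform(lo, hi) for lo, hi in RING_XY_RANGE]
+    bar_xy = [rng.uniform(lo, hi) for lo, hi in BAR_XY_RANGE]
+    return {
+        "ring_pos": np.array([*ring_xy, TABLE_Z]),
+        "ring_quat": FIXED_QUAT.copy(),
+        "bar_pos": np.array([*bar_xy, TABLE_Z]),
+        "bar_quat": FIXED_QUAT.copy(),
+    }
+```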
+
+## Control / Action Interface
+
+Reuse the current 16D EE-space action convention already used by the dual-Diana position-control environment:
+
+- left arm EE pose: 7D (`xyz + quat`)
+- right arm EE pose: 7D (`xyz + quat`)
+- left gripper command: 1D
+- right gripper command: 1D
+
+The new task should continue using EE targets transformed through the existing IK-based control path.
+
+## Environment Structure
+
+Implement this as a new task-specific environment path while reusing the existing dual-Diana simulation base where possible.
+
+Expected responsibilities:
+
+- scene instantiation for the ring+bar setup
+- task reset for randomized object placement
+- environment-state accessors for both objects
+- reward computation
+- in-air insertion success detection
+
+The environment factory must dispatch by task name and leave the `sim_transfer` branch unchanged.
+
+## Observation / Environment State
+
+The task should retain the current observation structure style used by the dual-Diana environment:
+
+- `qpos`
+- multi-camera images
+
+For task state access, the environment should expose a stable `env_state` vector with this exact order:
+
+- `ring_pos[0:3]`
+- `ring_quat[3:7]`
+- `bar_pos[7:10]`
+- `bar_quat[10:14]`
+
+This 14D state should be sufficient for scripted-policy debugging and future rollout analysis, while reset itself remains caller-driven via the named task-state helper structure above.
+
+## Reward Design
+
+Use staged rewards in the same spirit as the current task, returning the highest achieved stage rather than accumulating one-time sparse bonuses per event.
+
+Maximum reward:
+
+- `max_reward = 5`
+
+Reward stages:
+
+1. left gripper touches the ring block
+2. right gripper touches the bar block
+3. ring block is lifted off the table
+4. bar block is lifted off the table
+5. while both objects are off the table, the bar truly passes through the ring aperture
+
+Notes:
+
+- contact rewards are intended as grasp-progress stages
+- lift rewards require the object to be off the table, not merely touched
+- final success reward only applies when both objects are airborne
+
+## Success Detection
+
+Success must **not** be based on a centerline-only check.
+
+A centerline-only test is insufficient because:
+
+- the bar has thickness, so a centerline can pass through while the body cannot
+- a square bar with imperfect orientation can have its centerline inside the aperture while its corners still collide with the ring
+
+### Required Success Semantics
+
+A successful insertion requires all of the following:
+
+1. the ring is off the table
+2. the bar is off the table
+3. the bar has actually crossed through the ring thickness direction
+4. the bar’s finite square cross-section fits through the square aperture during that crossing
+
+### Recommended Detection Approach
+
+Use a task-level geometric check in Python rather than relying on contact alone.
+
+Implementation intent:
+
+- transform the bar geometry into the ring’s local frame
+- reason about the bar as a finite oriented box (not a line)
+- verify that the bar has crossed the ring thickness direction
+- verify that the portion of the bar passing the aperture fits within the inner square opening, accounting for the bar’s cross-section and orientation
+
+This geometric check is the primary success test.
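+
+A minimal sketch of such a check (dimensions taken from the Scene Geometry section above; the quaternion handling and edge-sampling density are illustrative, and SciPy from the existing environment is assumed):
+
+```python
+import numpy as np
+from scipy.spatial.transform import Rotation as R
+
+APERTURE_HALF = 0.016    # half of the 32 mm inner aperture
+RING_HALF_THICK = 0.009  # half of the 18 mm ring thickness (ring-local z is the insertion axis)
+BAR_HALF = np.array([0.045, 0.009, 0.009])  # half extents of the 90 x 18 x 18 mm bar
+
+
+def bar_through_ring(ring_pos, ring_quat, bar_pos, bar_quat, margin=0.001):
+    """Finite-geometry test: the bar must span the ring thickness and its cross-section must fit."""
+    ring_rot = R.from_quat(np.roll(ring_quat, -1))  # MuJoCo (w, x, y, z) -> SciPy (x, y, z, w)
+    bar_rot = R.from_quat(np.roll(bar_quat, -1))
+
+    def to_ring(p):
+        return ring_rot.inv().apply(bar_rot.apply(p) + bar_pos - ring_pos)
+
+    hx, hy, hz = BAR_HALF
+    spans, fits = False, True
+    # Walk the four long edges of the bar and test their footprint inside the ring slab.
+    for sy in (-1, 1):
+        for sz in (-1, 1):
+            a = to_ring(np.array([-hx, sy * hy, sz * hz]))
+            b = to_ring(np.array([+hx, sy * hy, sz * hz]))
+            lo, hi = sorted((a[2], b[2]))
+            if lo < -RING_HALF_THICK and hi > RING_HALF_THICK:
+                spans = True
+            for t in np.linspace(0.0, 1.0, 21):
+                p = a + t * (b - a)
+                if abs(p[2]) <= RING_HALF_THICK and np.any(np.abs(p[:2]) > APERTURE_HALF - margin):
+                    fits = False
+    return spans and fits
+```
+
+The airborne checks and the reward staging then gate this predicate so it only counts while both objects are off the table.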
+
+### Role of Contacts
+
+Contacts may still be used for:
+
+- grasp-stage rewards
+- debugging / diagnostics
+
+But contact alone should **not** be the sole criterion for insertion success, since:
+
+- a true clean insertion may have limited aperture-wall contact
+- persistent contact can also happen while the bar is jammed and not actually inserted
+
+## Scripted Policy
+
+Add a new task-specific scripted policy for `sim_air_insert_ring_bar`.
+
+### Policy Intent
+
+The first version prioritizes a conservative, reliable open-loop demonstration rather than an optimized trajectory.
+
+### Action Phases
+
+The scripted policy should follow these phases:
+
+1. move both arms to safe initial / waiting poses with grippers open
+2. move left arm above the ring and right arm above the bar
+3. descend and grasp the assigned objects
+4. lift both objects clear of the table
+5. move both objects to an airborne meeting region above the table
+6. hold the ring stably while aligning the bar with the aperture
+7. push the bar along the intended insertion direction until the geometric success condition is met
+
+### Grasp Assignment
+
+- left arm: ring only
+- right arm: bar only
+
+### Motion Style
+
+Keep the current repository style:
+
+- waypoint-based trajectory definition
+- open-loop interpolation between waypoints
+- fixed grasp orientation in the first version
+
+No adaptive replanning is required for the first version.
+
+## Files / Integration Scope
+
+The implementation is expected to add task-specific files rather than broadly refactoring the codebase.
+
+Likely additions / changes:
+
+- a new MuJoCo scene XML for the ring+bar task
+- one or more XML fragments defining the two new objects
+- a new task-specific dual-Diana environment file
+- robot asset wiring for the new scene XML
+- reset sampling helpers for the new task
+- task registration in constants / environment factory paths
+- a new scripted policy file
+- focused tests for task creation, reset, rewards, success detection, and scripted policy shape/smoke behavior
+
+## Testing Requirements
+
+At minimum, add regression coverage for:
+
+### Environment Creation
+
+- the new task can be created via the task factory
+- the existing `sim_transfer` task remains unchanged
+
+### Reset / Sampling
+
+- ring reset positions are inside the left sampling region
+- bar reset positions are inside the right sampling region
+- reset orientation is fixed as intended
+
+### Environment State
+
+- environment-state access returns both object poses in the expected structure
+
+### Success Detection
+
+Must include both positive and negative cases.
+
+Positive case:
+
+- a configuration where the finite bar truly passes through the ring aperture is detected as success
+
+Negative cases:
+
+- centerline-inside but finite body would clip the aperture
+- not enough depth / not actually crossing the ring thickness direction
+- one or both objects still on the table
+
+### Reward Logic
+
+- left contact stage
+- right contact stage
+- ring lift stage
+- bar lift stage
+- final success stage with `max_reward = 5`
+
+### Scripted Policy
+
+At minimum:
+
+- policy emits valid 16D actions
+- trajectory generation does not error
+- rollout smoke path can step through the new environment
+
+## Risks / Constraints
+
+- MuJoCo contact naming must remain stable enough for stage rewards
+- geometric insertion checks must be strict enough to avoid false positives but not so brittle that numerically valid insertions are missed
+- scripted open-loop insertion may require conservative alignment and lift heights to keep the first version reliable
+
+## Acceptance Criteria
+
+The feature is complete when all of the following are true:
+
+- `sim_air_insert_ring_bar` is creatable as an independent task
+- the scene contains the dual Diana, table, ring block, and bar block
+- reset randomizes ring and bar positions in left/right planar regions with fixed orientation
+- the environment exposes task state for both objects
+- staged rewards progress to `max_reward = 5`
+- final success is based on finite-geometry insertion semantics, not a centerline-only shortcut
+- a new scripted policy can execute the intended pick-lift-align-insert behavior in the new environment
+- a canonical nominal smoke path (unit-level or deterministic integration-level) exists for the new scripted-policy interface so success is not judged purely by interpretation
+- existing `sim_transfer` behavior is preserved
diff --git a/environment.yml b/environment.yml
new file mode 100644
index 0000000..7f1c879
--- /dev/null
+++ b/environment.yml
@@ -0,0 +1,458 @@
+name: roboimi
+channels:
+ - conda-forge
+dependencies:
+ - _libgcc_mutex=0.1
+ - _openmp_mutex=4.5
+ - _python_abi3_support=1.0
+ - aiohappyeyeballs=2.6.1
+ - aiohttp=3.13.3
+ - aiosignal=1.4.0
+ - alsa-lib=1.2.9
+ - anyio=4.12.1
+ - aom=3.5.0
+ - async-timeout=5.0.1
+ - attr=2.5.1
+ - attrs=25.4.0
+ - aws-c-auth=0.7.22
+ - aws-c-cal=0.6.15
+ - aws-c-common=0.9.23
+ - aws-c-compression=0.2.18
+ - aws-c-event-stream=0.4.2
+ - aws-c-http=0.8.2
+ - aws-c-io=0.14.9
+ - aws-c-mqtt=0.10.4
+ - aws-c-s3=0.5.10
+ - aws-c-sdkutils=0.1.16
+ - aws-checksums=0.1.18
+ - aws-crt-cpp=0.26.12
+ - aws-sdk-cpp=1.11.329
+ - box2d-py=2.3.8
+ - brotli=1.1.0
+ - brotli-bin=1.1.0
+ - brotli-python=1.1.0
+ - bzip2=1.0.8
+ - c-ares=1.34.6
+ - ca-certificates=2026.1.4
+ - cairo=1.16.0
+ - certifi=2026.1.4
+ - cffi=1.17.1
+ - charset-normalizer=3.4.4
+ - click=8.3.1
+ - cloudpickle=3.0.0
+ - contourpy=1.3.0
+ - cpython=3.10.19
+ - cuda-cudart=12.6.68
+ - cuda-cudart_linux-64=12.6.68
+ - cuda-nvrtc=12.6.68
+ - cuda-nvtx=12.6.68
+ - cuda-version=12.6
+ - cudnn=8.9.7.29
+ - cycler=0.12.1
+ - datasets=4.0.0
+ - dav1d=1.2.1
+ - dbus=1.13.6
+ - dill=0.3.8
+ - eigen=3.4.0
+ - exceptiongroup=1.3.1
+ - expat=2.6.3
+ - farama-notifications=0.0.4
+ - filelock=3.15.4
+ - fluidsynth=2.3.3
+ - font-ttf-dejavu-sans-mono=2.37
+ - font-ttf-inconsolata=3.000
+ - font-ttf-source-code-pro=2.038
+ - font-ttf-ubuntu=0.83
+ - fontconfig=2.14.2
+ - fonts-conda-ecosystem=1
+ - fonts-conda-forge=1
+ - fonttools=4.53.1
+ - freetype=2.12.1
+ - frozenlist=1.7.0
+ - fsspec=2024.6.1
+ - gettext=0.22.5
+ - gettext-tools=0.22.5
+ - gflags=2.2.2
+ - git-lfs=3.7.1
+ - glog=0.7.1
+ - gmp=6.3.0
+ - gmpy2=2.1.5
+ - graphite2=1.3.13
+ - gym=0.26.1
+ - gym-box2d=0.26.1
+ - gym-notices=0.0.8
+ - gymnasium=0.29.1
+ - h11=0.16.0
+ - h2=4.3.0
+ - harfbuzz=7.3.0
+ - hf-xet=1.2.1
+ - hpack=4.1.0
+ - httpcore=1.0.9
+ - httpx=0.28.1
+ - huggingface_hub=1.3.5
+ - hyperframe=6.1.0
+ - icu=72.1
+ - idna=3.11
+ - jack=1.9.22
+ - jax-jumpy=1.0.0
+ - jinja2=3.1.4
+ - jpeg=9e
+ - keyutils=1.6.3
+ - kiwisolver=1.4.9
+ - krb5=1.21.3
+ - lame=3.100
+ - lcms2=2.15
+ - ld_impl_linux-64=2.40
+ - lerc=4.0.0
+ - libabseil=20240116.2
+ - libarrow=16.1.0
+ - libarrow-acero=16.1.0
+ - libarrow-dataset=16.1.0
+ - libarrow-substrait=16.1.0
+ - libasprintf=0.22.5
+ - libasprintf-devel=0.22.5
+ - libavif=0.11.1
+ - libblas=3.9.0
+ - libbrotlicommon=1.1.0
+ - libbrotlidec=1.1.0
+ - libbrotlienc=1.1.0
+ - libcap=2.69
+ - libcblas=3.9.0
+ - libcrc32c=1.1.2
+ - libcublas=12.6.1.4
+ - libcufft=11.2.6.59
+ - libcurand=10.3.7.68
+ - libcurl=8.12.1
+ - libcusolver=11.6.4.69
+ - libcusparse=12.5.3.3
+ - libdb=6.2.32
+ - libdeflate=1.17
+ - libedit=3.1.20250104
+ - libev=4.33
+ - libevent=2.1.12
+ - libexpat=2.6.3
+ - libffi=3.4.2
+ - libflac=1.4.3
+ - libgcc=14.1.0
+ - libgcc-ng=14.1.0
+ - libgcrypt=1.11.0
+ - libgettextpo=0.22.5
+ - libgettextpo-devel=0.22.5
+ - libgfortran=14.1.0
+ - libgfortran-ng=14.1.0
+ - libgfortran5=14.1.0
+ - libglib=2.80.3
+ - libgoogle-cloud=2.25.0
+ - libgoogle-cloud-storage=2.25.0
+ - libgpg-error=1.50
+ - libgrpc=1.62.2
+ - libhwloc=2.9.3
+ - libiconv=1.17
+ - libjpeg-turbo=2.1.4
+ - liblapack=3.9.0
+ - libmad=0.15.1b
+ - libmagma=2.8.0
+ - libmagma_sparse=2.8.0
+ - libnghttp2=1.67.0
+ - libnsl=2.0.1
+ - libnvjitlink=12.6.68
+ - libogg=1.3.5
+ - libopenblas=0.3.27
+ - libopus=1.3.1
+ - libparquet=16.1.0
+ - libpng=1.6.43
+ - libprotobuf=4.25.3
+ - libre2-11=2023.09.01
+ - libsndfile=1.2.2
+ - libsqlite=3.46.0
+ - libssh2=1.11.1
+ - libstdcxx=14.1.0
+ - libstdcxx-ng=14.1.0
+ - libsystemd0=256.5
+ - libthrift=0.19.0
+ - libtiff=4.5.0
+ - libtorch=2.4.0
+ - libutf8proc=2.8.0
+ - libuuid=2.38.1
+ - libuv=1.48.0
+ - libvorbis=1.3.7
+ - libwebp-base=1.4.0
+ - libxcb=1.13
+ - libxcrypt=4.4.36
+ - libxml2=2.11.5
+ - libzlib=1.3.1
+ - llvm-openmp=18.1.8
+ - lz4-c=1.9.4
+ - markupsafe=2.1.5
+ - matplotlib-base=3.9.2
+ - mkl=2023.2.0
+ - mpc=1.3.1
+ - mpfr=4.2.1
+ - mpg123=1.31.3
+ - mpmath=1.3.0
+ - multidict=6.7.0
+ - multiprocess=0.70.16
+ - munkres=1.1.4
+ - nccl=2.22.3.1
+ - ncurses=6.5
+ - networkx=3.3
+ - numpy=1.26.4
+ - openjpeg=2.5.0
+ - openssl=3.6.1
+ - opusfile=0.12
+ - orc=2.0.1
+ - orocos-kdl=1.5.1
+ - packaging=24.1
+ - pandas=2.2.2
+ - pcre2=10.44
+ - pillow=9.4.0
+ - pip=24.2
+ - pixman=0.43.2
+ - portaudio=19.6.0
+ - portmidi=2.0.4
+ - propcache=0.3.1
+ - pthread-stubs=0.4
+ - pulseaudio-client=16.1
+ - pyarrow=16.1.0
+ - pyarrow-core=16.1.0
+ - pybind11=2.13.5
+ - pybind11-global=2.13.5
+ - pycparser=2.22
+ - pygame=2.1.3
+ - pyparsing=3.1.4
+ - pysocks=1.7.1
+ - python=3.10.14
+ - python-dateutil=2.9.0
+ - python-gil=3.10.19
+ - python-orocos-kdl=1.5.1
+ - python-tzdata=2024.1
+ - python-xxhash=3.6.0
+ - python_abi=3.10
+ - pytorch=2.4.0
+ - hydra-core=1.3.2
+ - omegaconf=2.3.0
+ - einops=0.8.2
+ - diffusers=0.36.0
+ - torchvision=0.19.0
+ - pytz=2024.1
+ - pyyaml=6.0.3
+ - qhull=2020.2
+ - re2=2023.09.01
+ - readline=8.2
+ - regex=2026.1.15
+ - requests=2.32.5
+ - s2n=1.4.16
+ - safetensors=0.7.0
+ - sdl2=2.26.5
+ - sdl2_image=2.6.3
+ - sdl2_mixer=2.6.3
+ - sdl2_ttf=2.20.2
+ - setuptools=72.2.0
+ - shellingham=1.5.4
+ - six=1.16.0
+ - sleef=3.6.1
+ - snappy=1.2.2
+ - sniffio=1.3.1
+ - stable-baselines3=2.3.2
+ - sympy=1.13.2
+ - tbb=2021.11.0
+ - tk=8.6.13
+ - tokenizers=0.22.2
+ - tqdm=4.67.2
+ - transformers=5.0.0
+ - typer-slim=0.21.1
+ - typing-extensions=4.12.2
+ - typing_extensions=4.12.2
+ - tzdata=2024a
+ - unicodedata2=15.1.0
+ - urllib3=2.5.0
+ - wheel=0.44.0
+ - xorg-kbproto=1.0.7
+ - xorg-libice=1.1.1
+ - xorg-libsm=1.2.4
+ - xorg-libx11=1.8.4
+ - xorg-libxau=1.0.11
+ - xorg-libxdmcp=1.1.3
+ - xorg-libxext=1.3.4
+ - xorg-libxrender=0.9.10
+ - xorg-renderproto=0.11.1
+ - xorg-xextproto=7.3.0
+ - xorg-xproto=7.0.31
+ - xxhash=0.8.3
+ - xz=5.2.6
+ - yaml=0.2.5
+ - yarl=1.22.0
+ - zlib=1.3.1
+ - zstandard=0.23.0
+ - zstd=1.5.6
+ - pip:
+ - GitPython==3.1.46
+ - Jinja2==3.1.6
+ - MarkupSafe==3.0.3
+ - PyOpenGL==3.1.7
+ - PyYAML==6.0.3
+ - Pygments==2.19.2
+ - absl-py==2.1.0
+ - accelerate==1.12.0
+ - aiofiles==24.1.0
+ - aiohappyeyeballs==2.6.1
+ - aiohttp==3.13.3
+ - aiosignal==1.4.0
+ - annotated-doc==0.0.4
+ - annotated-types==0.7.0
+ - antlr4-python3-runtime==4.9.3
+ - anyio==4.12.1
+ - asciitree==0.3.3
+ - asttokens==3.0.1
+ - async-timeout==5.0.1
+ - attrs==25.4.0
+ - av==15.1.0
+ - brotli==1.2.0
+ - charset-normalizer==3.4.4
+ - cmake==4.1.3
+ - cmeel==0.58.0
+ - cmeel-assimp==5.4.3.1
+ - cmeel-boost==1.87.0.1
+ - cmeel-console-bridge==1.0.2.3
+ - cmeel-octomap==1.10.0
+ - cmeel-qhull==8.0.2.1
+ - cmeel-tinyxml==2.6.2.3
+ - cmeel-tinyxml2==10.0.0
+ - cmeel-urdfdom==3.1.1.1
+ - cmeel-zlib==1.3.1
+ - coal==3.0.2
+ - coal-library==3.0.1
+ - colorama==0.4.6
+ - datasets==4.5.0
+ - decorator==5.2.1
+ - deepdiff==8.6.1
+ - dill==0.4.0
+ - docstring_parser==0.17.0
+ - draccus==0.10.0
+ - eigenpy==3.10.3
+ - etils==1.7.0
+ - evdev==1.9.2
+ - exceptiongroup==1.3.1
+ - executing==2.2.1
+ - fastapi==0.128.0
+ - fasteners==0.20
+ - ffmpy==1.0.0
+ - filelock==3.20.3
+ - frozenlist==1.8.0
+ - fsspec==2025.10.0
+ - gitdb==4.0.12
+ - glfw==2.7.0
+ - gradio==6.3.0
+ - gradio_client==2.0.3
+ - groovy==0.1.2
+ - gymnasium==1.2.3
+ - h11==0.16.0
+ - h5py==3.15.1
+ - hf-xet==1.2.0
+ - hf_transfer==0.1.9
+ - httpcore==1.0.9
+ - httpx==0.28.1
+ - huggingface_hub==1.3.2
+ - imageio==2.35.1
+ - imageio-ffmpeg==0.6.0
+ - importlib_metadata==8.7.1
+ - importlib_resources==6.5.2
+ - inquirerpy==0.3.4
+ - ipython==8.38.0
+ - jedi==0.19.2
+ - jsonargparse==4.45.0
+ - jsonlines==4.0.0
+ - kiwisolver==1.4.5
+ - lerobot==0.4.2
+ - libcoal==3.0.2
+ - libpinocchio==3.8.0
+ - lightning==2.5.0.post0
+ - lightning-utilities==0.15.2
+ - lxml==5.3.0
+ - markdown-it-py==4.0.0
+ - matplotlib-inline==0.2.1
+ - mdurl==0.1.2
+ - mergedeep==1.3.4
+ - mpmath==1.3.0
+ - mujoco==3.2.2
+ - mujoco-python-viewer==0.1.4
+ - multidict==6.7.0
+ - multiprocess==0.70.18
+ - mypy_extensions==1.1.0
+ - networkx==3.4.2
+ - numcodecs==0.13.1
+ - numpy==2.2.6
+ - opencv-contrib-python==4.10.0.84
+ - opencv-python==4.13.0.90
+ - orderly-set==5.5.0
+ - orjson==3.11.5
+ - packaging==24.2
+ - pandas==2.3.3
+ - parso==0.8.5
+ - pexpect==4.9.0
+ - pfzy==0.3.4
+ - pillow==12.1.0
+ - pin==3.3.1
+ - platformdirs==4.5.1
+ - prompt_toolkit==3.0.52
+ - propcache==0.4.1
+ - protobuf==6.33.4
+ - proxsuite==0.7.2
+ - psutil==7.2.1
+ - ptyprocess==0.7.0
+ - pure_eval==0.2.3
+ - pyarrow==22.0.0
+ - pydantic==2.12.5
+ - pydantic_core==2.41.5
+ - pydub==0.25.1
+ - pynput==1.8.1
+ - pyquaternion==0.9.9
+ - pyserial==3.5
+ - python-dateutil==2.9.0.post0
+ - python-multipart==0.0.21
+ - python-xlib==0.33
+ - pytorch-lightning==2.6.0
+ - pyyaml-include==1.4.1
+ - qwen-vl-utils==0.0.14
+ - regex==2026.1.15
+ - requests==2.32.5
+ - rerun-sdk==0.26.2
+ - rich==13.9.4
+ - ruckig==0.9.2
+ - safehttpx==0.1.7
+ - safetensors==0.7.0
+ - scipy==1.14.1
+ - semantic-version==2.10.0
+ - sentry-sdk==2.49.0
+ - shellingham==1.5.4
+ - smmap==5.0.2
+ - stack-data==0.6.3
+ - starlette==0.50.0
+ - sympy==1.13.1
+ - swanlab==0.7.13
+ - termcolor==3.3.0
+ - timm==1.0.24
+ - toml==0.10.2
+ - tomli==2.4.0
+ - tomlkit==0.13.3
+ - torchcodec==0.5
+ - torchmetrics==1.8.2
+ - tqdm==4.67.1
+ - traitlets==5.14.3
+ - typer==0.21.1
+ - typer-slim==0.21.1
+ - typeshed_client==2.8.2
+ - typing-inspect==0.9.0
+ - typing-inspection==0.4.2
+ - typing_extensions==4.15.0
+ - tzdata==2025.3
+ - urdf_parser_py==0.0.4
+ - urllib3==2.6.3
+ - uv==0.9.28
+ - uvicorn==0.40.0
+ - wandb==0.24.0
+ - wcwidth==0.2.14
+ - xxhash==3.6.0
+ - yarl==1.22.0
+ - zarr==2.18.3
+ - zipp==3.20.1
diff --git a/generate_dataset_videos.py b/generate_dataset_videos.py
new file mode 100644
index 0000000..0adae9f
--- /dev/null
+++ b/generate_dataset_videos.py
@@ -0,0 +1,324 @@
+#!/usr/bin/env python3
+"""
+将 HDF5 数据集转换为视频,用于可视化检查
+
+功能:
+1. 将单个 episode 转换为视频
+2. 对比多个 episode 的视频
+3. 放慢播放速度便于观察
+"""
+import os
+import h5py
+import glob
+import cv2
+import numpy as np
+
+
+def episode_to_video(episode_file, output_path, camera='top', fps=30, slow_factor=1):
+ """
+ 将单个 episode 转换为视频
+
+ Args:
+ episode_file: HDF5 文件路径
+ output_path: 输出视频路径
+ camera: 要使用的相机名称
+ fps: 帧率
+ slow_factor: 慢放倍数(1=正常,2=半速)
+ """
+ try:
+ with h5py.File(episode_file, 'r') as f:
+            # Read the image sequence
+ img_path = f'/observations/images/{camera}'
+
+ if img_path not in f:
+ print(f" ❌ 相机 {camera} 不存在")
+ return False
+
+ images = f[img_path][:] # shape: (T, H, W, C)
+ qpos = f['/observations/qpos'][:]
+ actions = f['/action'][:]
+
+ total_frames = len(images)
+ height, width = images.shape[1], images.shape[2]
+
+            # Create the video writer
+ fourcc = cv2.VideoWriter_fourcc(*'mp4v')
+ actual_fps = fps // slow_factor
+ out = cv2.VideoWriter(output_path, fourcc, actual_fps, (width, height))
+
+            # Write frame by frame
+ for i in range(total_frames):
+ frame = images[i].astype(np.uint8)
+
+                # Overlay info text on the frame
+ info_text = [
+ f"Episode: {os.path.basename(episode_file).replace('.hdf5', '')}",
+ f"Frame: {i}/{total_frames}",
+ f"qpos[0:3]: [{qpos[i, 0]:.2f}, {qpos[i, 1]:.2f}, {qpos[i, 2]:.2f}]",
+ ]
+
+ for j, text in enumerate(info_text):
+ cv2.putText(frame, text, (10, 30 + j*30),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
+
+ out.write(frame)
+
+ out.release()
+ print(f" ✅ 保存: {output_path}")
+ print(f" 帧数: {total_frames}, 尺寸: {width}x{height}, FPS: {actual_fps}")
+ return True
+
+ except Exception as e:
+ print(f" ❌ 错误: {e}")
+ return False
+
+
+def generate_all_videos(camera='top', num_episodes=5, slow_factor=1):
+ """生成前 N 个 episode 的视频"""
+
+ dataset_dir = "roboimi/demos/dataset/sim_transfer"
+ episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
+
+ if len(episode_files) == 0:
+ print(f"❌ 没有找到数据文件: {dataset_dir}")
+ return
+
+    # Create the output directory
+ output_dir = '/tmp/dataset_videos'
+ os.makedirs(output_dir, exist_ok=True)
+
+ print(f"找到 {len(episode_files)} 个 episode 文件")
+ print(f"将生成前 {min(num_episodes, len(episode_files))} 个 episode 的视频\n")
+
+    # Generate the videos
+ for i in range(min(num_episodes, len(episode_files))):
+ ep_file = episode_files[i]
+ ep_name = os.path.basename(ep_file).replace('.hdf5', '')
+ output_path = f"{output_dir}/{ep_name}_{camera}.mp4"
+
+ print(f"[{i+1}/{min(num_episodes, len(episode_files))}] {ep_name}")
+ episode_to_video(ep_file, output_path, camera=camera, slow_factor=slow_factor)
+ print()
+
+ print(f"✅ 所有视频已保存到: {output_dir}")
+ print(f"\n播放方法:")
+ print(f" # 播放单个视频")
+ print(f" vlc {output_dir}/*.mp4")
+ print(f" ")
+ print(f" # 或用文件管理器")
+ print(f" nautilus {output_dir}")
+
+
+def generate_multi_camera_video(episode_idx=0, slow_factor=1):
+ """生成包含多个相机的视频(分屏显示)"""
+
+ dataset_dir = "roboimi/demos/dataset/sim_transfer"
+ episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
+
+ if episode_idx >= len(episode_files):
+ print(f"❌ Episode {episode_idx} 不存在")
+ return
+
+ ep_file = episode_files[episode_idx]
+
+ try:
+ with h5py.File(ep_file, 'r') as f:
+            # Collect all camera names
+ cameras = []
+ for key in f.keys():
+ if 'images' in key:
+ for cam_name in f[key].keys():
+ if cam_name not in cameras:
+ cameras.append(cam_name)
+
+            print(f"Cameras in episode {episode_idx}: {cameras}")
+
+            # Read the images from every camera
+ all_images = {}
+ for cam in cameras:
+ img_path = f'/observations/images/{cam}'
+ if img_path in f:
+ all_images[cam] = f[img_path][:]
+
+ if not all_images:
+                print("❌ No image data found")
+ return
+
+            # Frame size from the first camera
+ first_cam = list(all_images.keys())[0]
+ total_frames = len(all_images[first_cam])
+ height, width = all_images[first_cam].shape[1], all_images[first_cam].shape[2]
+
+            # Build the tiled layout (at most 2 columns)
+ num_cams = len(all_images)
+ cols = min(2, num_cams)
+ rows = (num_cams + cols - 1) // cols
+
+ canvas_width = width * cols
+ canvas_height = height * rows
+
+            # Create the video writer (make sure the output directory exists)
+            output_path = f'/tmp/dataset_videos/episode_{episode_idx}_all_cameras.mp4'
+            os.makedirs(os.path.dirname(output_path), exist_ok=True)
+            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
+            out = cv2.VideoWriter(output_path, fourcc, 30 // slow_factor, (canvas_width, canvas_height))
+
+            # Compose the grid frame by frame
+ for i in range(total_frames):
+ canvas = np.zeros((canvas_height, canvas_width, 3), dtype=np.uint8)
+
+ for cam_idx, cam_name in enumerate(all_images.keys()):
+ img = all_images[cam_name][i]
+
+                    # Compute this camera's tile position on the canvas
+ row = cam_idx // cols
+ col = cam_idx % cols
+ y_start = row * height
+ y_end = y_start + height
+ x_start = col * width
+ x_end = x_start + width
+
+                    # Resize if needed
+ if img.shape[:2] != (height, width):
+ img = cv2.resize(img, (width, height))
+
+                    # Place the tile on the canvas
+ canvas[y_start:y_end, x_start:x_end] = img
+
+                    # Overlay the camera name
+ cv2.putText(canvas, cam_name, (x_start + 10, y_start + 30),
+ cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
+
+                # Overlay the frame counter
+ cv2.putText(canvas, f"Frame: {i}/{total_frames}", (10, canvas_height - 10),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
+
+ out.write(canvas)
+
+ out.release()
+            print(f"✅ Saved multi-camera video: {output_path}")
+
+ except Exception as e:
+        print(f"❌ Error: {e}")
+
+
+def compare_episodes(camera='top', slow_factor=2):
+    """Compare several episodes side by side in one video."""
+
+ dataset_dir = "roboimi/demos/dataset/sim_transfer"
+ episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
+
+    # Episodes to compare
+    episodes_to_compare = [0, 1, 2, 3, 4]  # the first five
+
+    print(f"Comparing episodes: {episodes_to_compare}")
+
+    # Load data for every selected episode
+ all_data = []
+ for ep_idx in episodes_to_compare:
+ if ep_idx >= len(episode_files):
+ continue
+
+ try:
+ with h5py.File(episode_files[ep_idx], 'r') as f:
+ img_path = f'/observations/images/{camera}'
+ if img_path in f:
+ all_data.append({
+ 'idx': ep_idx,
+ 'images': f[img_path][:],
+ 'qpos': f['/observations/qpos'][:]
+ })
+        except Exception as e:
+            print(f"  ⚠️ Skipping episode {ep_idx}: {e}")
+
+ if len(all_data) == 0:
+        print("❌ No data")
+ return
+
+    # Frame size and the common frame count
+ first_data = all_data[0]
+ height, width = first_data['images'].shape[1], first_data['images'].shape[2]
+ total_frames = min([d['images'].shape[0] for d in all_data])
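+    # Truncate to the shortest episode so every column has a frame at index i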
+
+    # Side-by-side layout
+ num_compare = len(all_data)
+ canvas_width = width * num_compare
+ canvas_height = height
+
+    # Create the video writer (make sure the output directory exists)
+    output_path = f'/tmp/dataset_videos/compare_{camera}.mp4'
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
+    out = cv2.VideoWriter(output_path, fourcc, 30 // slow_factor, (canvas_width, canvas_height))
+
+    print(f"Generating comparison video, {total_frames} frames in total...")
+
+    # Render the comparison frame by frame
+ for i in range(total_frames):
+ canvas = np.zeros((canvas_height, canvas_width, 3), dtype=np.uint8)
+
+ for j, data in enumerate(all_data):
+ img = data['images'][i]
+ qpos = data['qpos'][i]
+
+            # Resize if needed
+ if img.shape[:2] != (height, width):
+ img = cv2.resize(img, (width, height))
+
+            # Place this episode's frame on the canvas
+ x_start = j * width
+ x_end = x_start + width
+ canvas[:, x_start:x_end] = img
+
+            # Overlay episode info
+ ep_name = f"Ep {data['idx']}"
+ cv2.putText(canvas, ep_name, (x_start + 10, 30),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
+ cv2.putText(canvas, f"qpos[0:3]: [{qpos[0]:.2f}, {qpos[1]:.2f}, {qpos[2]:.2f}]",
+ (x_start + 10, height - 10),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
+
+        # Overlay the frame counter
+ cv2.putText(canvas, f"Frame: {i}/{total_frames}", (10, canvas_height - 30),
+ cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
+
+ out.write(canvas)
+
+ if i % 100 == 0:
+            print(f"  Progress: {i}/{total_frames}")
+
+ out.release()
+    print(f"✅ Saved comparison video: {output_path}")
+
+
+if __name__ == "__main__":
+ import sys
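+    # Usage (also printed at the end of this script):
+    #   python generate_dataset_videos.py              -> first 5 episodes, 'top' camera
+    #   python generate_dataset_videos.py compare top  -> side-by-side comparison
+    #   python generate_dataset_videos.py multi 0      -> all cameras of episode 0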
+
+ print("="*60)
+    print("Dataset video generation tool")
+ print("="*60)
+
+ if len(sys.argv) > 1:
+ command = sys.argv[1]
+
+ if command == 'compare':
+            # Compare several episodes side by side
+ camera = sys.argv[2] if len(sys.argv) > 2 else 'top'
+ compare_episodes(camera=camera, slow_factor=2)
+
+ elif command == 'multi':
+            # Multi-camera video for one episode
+ ep_idx = int(sys.argv[2]) if len(sys.argv) > 2 else 0
+ generate_multi_camera_video(episode_idx=ep_idx, slow_factor=1)
+
+ else:
+            print(f"Unknown command: {command}")
+ else:
+        # Default: generate videos for the first 5 episodes
+        print("\nGenerating videos for the first 5 episodes (top camera, 2x slow motion)...")
+ print("="*60 + "\n")
+ generate_all_videos(camera='top', num_episodes=5, slow_factor=2)
+
+ print("\n" + "="*60)
+    print("Other usages:")
+    print("  python generate_dataset_videos.py compare top   # compare several episodes")
+    print("  python generate_dataset_videos.py multi 0       # multi-camera video")
+ print("="*60)
diff --git a/roboimi/gr00t/main.py b/gr00t/main.py
similarity index 100%
rename from roboimi/gr00t/main.py
rename to gr00t/main.py
diff --git a/roboimi/gr00t/models/__init__.py b/gr00t/models/__init__.py
similarity index 100%
rename from roboimi/gr00t/models/__init__.py
rename to gr00t/models/__init__.py
diff --git a/roboimi/detr/models/backbone.py b/gr00t/models/backbone.py
similarity index 100%
rename from roboimi/detr/models/backbone.py
rename to gr00t/models/backbone.py
diff --git a/roboimi/gr00t/models/dit.py b/gr00t/models/dit.py
similarity index 100%
rename from roboimi/gr00t/models/dit.py
rename to gr00t/models/dit.py
diff --git a/roboimi/gr00t/models/gr00t.py b/gr00t/models/gr00t.py
similarity index 100%
rename from roboimi/gr00t/models/gr00t.py
rename to gr00t/models/gr00t.py
diff --git a/roboimi/gr00t/models/modules.py b/gr00t/models/modules.py
similarity index 100%
rename from roboimi/gr00t/models/modules.py
rename to gr00t/models/modules.py
diff --git a/roboimi/detr/models/position_encoding.py b/gr00t/models/position_encoding.py
similarity index 100%
rename from roboimi/detr/models/position_encoding.py
rename to gr00t/models/position_encoding.py
diff --git a/roboimi/gr00t/policy.py b/gr00t/policy.py
similarity index 100%
rename from roboimi/gr00t/policy.py
rename to gr00t/policy.py
diff --git a/roboimi/.gitattributes b/roboimi/.gitattributes
new file mode 100644
index 0000000..580d310
--- /dev/null
+++ b/roboimi/.gitattributes
@@ -0,0 +1 @@
+*.safetensors filter=lfs diff=lfs merge=lfs -text
diff --git a/roboimi/__init__.py b/roboimi/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/roboimi/assets/models/manipulators/DianaMed/BiDianaMed_rethink.xml b/roboimi/assets/models/manipulators/DianaMed/BiDianaMed_rethink.xml
index 1668bc8..7b5e55a 100644
--- a/roboimi/assets/models/manipulators/DianaMed/BiDianaMed_rethink.xml
+++ b/roboimi/assets/models/manipulators/DianaMed/BiDianaMed_rethink.xml
@@ -76,7 +76,7 @@
-
+
diff --git a/roboimi/assets/models/manipulators/DianaMed/bi_diana_socket_peg_ee.xml b/roboimi/assets/models/manipulators/DianaMed/bi_diana_socket_peg_ee.xml
new file mode 100644
index 0000000..e532054
--- /dev/null
+++ b/roboimi/assets/models/manipulators/DianaMed/bi_diana_socket_peg_ee.xml
@@ -0,0 +1,6 @@
+
+
+
+
+
+
diff --git a/roboimi/assets/models/manipulators/DianaMed/box.xml b/roboimi/assets/models/manipulators/DianaMed/box.xml
index f351cc3..c016926 100644
--- a/roboimi/assets/models/manipulators/DianaMed/box.xml
+++ b/roboimi/assets/models/manipulators/DianaMed/box.xml
@@ -3,7 +3,7 @@
-
+
diff --git a/roboimi/assets/models/manipulators/DianaMed/socket_peg_objects.xml b/roboimi/assets/models/manipulators/DianaMed/socket_peg_objects.xml
new file mode 100644
index 0000000..642bd78
--- /dev/null
+++ b/roboimi/assets/models/manipulators/DianaMed/socket_peg_objects.xml
@@ -0,0 +1,19 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/roboimi/assets/models/manipulators/DianaMed/table_square.xml b/roboimi/assets/models/manipulators/DianaMed/table_square.xml
index 4813a53..d1127d0 100644
--- a/roboimi/assets/models/manipulators/DianaMed/table_square.xml
+++ b/roboimi/assets/models/manipulators/DianaMed/table_square.xml
@@ -7,6 +7,7 @@