# Compare commits

73 commits in range `ddt...8d6060224a`:

| SHA1 |
|---|
| 8d6060224a |
| 8a8193fe7e |
| 1a92c5e8a6 |
| b76bcd8b37 |
| 2f9b99e0c4 |
| d5d5b53f71 |
| d84bc6876e |
| 424c265823 |
| cb79e00546 |
| 23088e5e33 |
| ca1716c67f |
| 642d41dd8f |
| 7d39933a5b |
| 8bcad5844e |
| cdb887c9bf |
| abb4f501e3 |
| 1d33db0ef0 |
| f27e397f98 |
| 4e0add4e1d |
| 40c40695dd |
| 3deeffb9fe |
| 0b05c01024 |
| 926a78eb66 |
| efbe4b6ac9 |
| acb1467473 |
| 624b926e33 |
| 926d8cf894 |
| 116ba13fb9 |
| 37a47ac2dd |
| ab971b3f96 |
| 83cd55e67b |
| eeb07cad15 |
| 83d11ab640 |
| aba8779671 |
| b42c1c68fd |
| 320369ffb8 |
| 130d4bb3c5 |
| 1e95d40bf9 |
| 3c27d6d793 |
| 88b9c10a75 |
| ac870f6110 |
| 8b700b6d99 |
| f833c6d9f1 |
| 4332530a5f |
| 456056347f |
| 05f3cc1e47 |
| a6fcb88203 |
| 3d0c2ec5b1 |
| ea49e63eb7 |
| 7a9ca06aa0 |
| f006d50814 |
| f4a5c77b7c |
| a43a2e3d18 |
| 31419a6fc1 |
| 66009473ad |
| b0a944f7aa |
| dd2749cb12 |
| 92660562fb |
| 03f10b0c22 |
| 8fce9c89ef |
| 30b8ff4d7d |
| 3f8c3dbf5d |
| 3465782256 |
| f5e2eca809 |
| 3b58760469 |
| bd8bbb0cfc |
| d3863ea1dd |
| 57acfd645f |
| c1ce560b32 |
| a977cc4f5e |
| fdf4dd8bed |
| fd1bd20c4f |
| ab1f50cc66 |
## .gitignore (vendored, 5 lines changed)

```
@@ -124,3 +124,8 @@ GEMINI.md

# Copilot
.github/copilot-instructions.md

.hydra/

# Local git worktrees
.worktrees/
```
## README.en.md (deleted, 36 lines)

```diff
@@ -1,36 +0,0 @@
-# robo-imi-act
-
-#### Description
-{**When you're done, you can delete the content in this README and update the file with details for others getting started with your repository**}
-
-#### Software Architecture
-Software architecture description
-
-#### Installation
-
-1. xxxx
-2. xxxx
-3. xxxx
-
-#### Instructions
-
-1. xxxx
-2. xxxx
-3. xxxx
-
-#### Contribution
-
-1. Fork the repository
-2. Create Feat_xxx branch
-3. Commit your code
-4. Create Pull Request
-
-
-#### Gitee Feature
-
-1. You can use Readme_XXX.md to support different languages, such as Readme_en.md, Readme_zh.md
-2. Gitee blog [blog.gitee.com](https://blog.gitee.com)
-3. Explore open source project [https://gitee.com/explore](https://gitee.com/explore)
-4. The most valuable open source project [GVP](https://gitee.com/gvp)
-5. The manual of Gitee [https://gitee.com/help](https://gitee.com/help)
-6. The most popular members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
```
## README.md (223 lines changed)

The hunk `@@ -1,39 +1,208 @@` removes the stock Gitee template (the Chinese counterpart of the deleted README.en.md above: the old `# robo-imi-act` title plus the introduction, software-architecture, installation-tutorial, instructions, contribution, and Gitee-features sections) and replaces it with the new project README:

# RoboIMI

A MuJoCo-based robot simulation and imitation-learning framework that implements a Vision-Language-Action (VLA) model with diffusion policies for robot manipulation tasks.

## Key Features

- **Multi-robot platform support**: supports the Diana and vx300s arms, extensible to other robots
- **Diffusion policy**: state-of-the-art diffusion models (DDPM/DDIM) for action-sequence prediction
- **Vision-Language-Action model**: ResNet-18 visual backbone with spatial softmax for visual feature extraction
- **Flexible control modes**: joint-space and end-effector (Cartesian) control
- **Hydra configuration system**: modular configuration for easy experimentation
- **HDF5 dataset format**: efficient storage and loading of demonstration data
- **Single- and dual-arm tasks**: supports both single-arm and bimanual manipulation

## Installation

### Requirements

- Python 3.8+
- CUDA-capable GPU (recommended for training)
- Conda or Miniconda

### Installation Steps

```bash
# Clone the repository
git clone <repository-url>
cd robo-imi-act

# Create and activate the conda environment
conda env create -f environment.yml
conda activate roboimi

# Install the package in development mode
pip install -e .
```

## Quick Start
### 1. Data Collection

Record demonstration trajectories in simulation:

```bash
# Record trajectories for the vx300s robot
python roboimi/demos/record_sim_episodes.py

# Record trajectories for the Diana robot
python roboimi/demos/diana_record_sim_episodes.py
```

Trajectories are saved as HDF5 files containing robot states, actions, and camera observations.

### 2. Compute Dataset Statistics

Normalization statistics must be computed before training:

```bash
python roboimi/vla/scripts/calculate_stats.py
```

This generates a `data_stats.pkl` file containing the mean/std or min/max of the actions and observations.
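The exact layout of `data_stats.pkl` is internal to the repo; as a minimal inspection sketch (assuming it is a plain pickled object, which the `.pkl` extension and the script name suggest):

```python
import pickle

# Hypothetical layout, e.g. {"action": {"min": ..., "max": ...}, "qpos": {...}};
# the real structure is whatever calculate_stats.py writes.
with open("data_stats.pkl", "rb") as f:
    stats = pickle.load(f)

print(stats)
```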
### 3. Train the VLA Model

Train the Vision-Language-Action model on the collected data:

```bash
# Train with the default configuration
python roboimi/demos/vla_scripts/train_vla.py

# Override specific parameters
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=32 train.lr=5e-5 train.max_steps=50000

# Use a different model architecture
python roboimi/demos/vla_scripts/train_vla.py agent=resnet_diffusion data=resnet_dataset
```

Training output is saved to `outputs/<date>/<time>/`; model checkpoints go to `checkpoints/`.

### 4. Evaluate the Model

Evaluate a trained model in simulation:

```bash
# Evaluate with the default configuration (uses the best checkpoint)
python roboimi/demos/vla_scripts/eval_vla.py

# Specify a checkpoint and the number of evaluation episodes
python roboimi/demos/vla_scripts/eval_vla.py eval.ckpt_path=checkpoints/vla_model_step_8000.pt eval.num_episodes=5

# Enable action smoothing for smoother execution
python roboimi/demos/vla_scripts/eval_vla.py eval.use_smoothing=true eval.smooth_alpha=0.5
```
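The README does not spell out how `eval.use_smoothing` is implemented; a plausible reading of `smooth_alpha`, shown here as a sketch only, is an exponential moving average over successively executed actions:

```python
import numpy as np

def smooth_action(prev_action, new_action, alpha=0.5):
    """EMA smoothing sketch: alpha weights the newest prediction.

    This is one common smoothing rule, not necessarily the repo's exact code.
    """
    if prev_action is None:
        return np.asarray(new_action)
    return alpha * np.asarray(new_action) + (1.0 - alpha) * np.asarray(prev_action)
```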
## Project Structure

```
robo-imi-act/
├── roboimi/
│   ├── assets/                  # Robot models and resources
│   │   ├── models/manipulators/ # URDF and MuJoCo XML files
│   │   └── robots/              # Robot abstraction classes
│   ├── envs/                    # Simulation environments
│   │   ├── mujoco_base.py       # MuJoCo environment base class
│   │   ├── single_base.py       # Single-arm task base class
│   │   └── double_base.py       # Dual-arm task base class
│   ├── vla/                     # Vision-Language-Action model
│   │   ├── agent.py             # VLAAgent (training and inference)
│   │   ├── models/
│   │   │   ├── backbones/       # Visual encoders (ResNet, etc.)
│   │   │   └── heads/           # Policy heads (diffusion UNet1D)
│   │   ├── conf/                # Hydra configuration files
│   │   └── scripts/             # Training and utility scripts
│   └── demos/                   # Demo scripts and examples
├── checkpoints/                 # Saved model checkpoints
├── outputs/                     # Training outputs (Hydra)
├── environment.yml              # Conda environment specification
└── CLAUDE.md                    # Claude Code development guide
```
## Architecture

### VLA Training Pipeline

```
HDF5 trajectories → Dataset → DataLoader → VLAAgent → model checkpoints
```

**Model components**:
- **Visual backbone**: ResNet-18 + spatial softmax, extracting visual features from camera images
- **Diffusion head**: conditional UNet1D that predicts action sequences with DDPM/DDIM
- **VLAAgent**: combines the visual encoder and the diffusion policy; handles training and inference
### Configuration System

Hydra configuration files live in `roboimi/vla/conf/`:
- `config.yaml`: main training configuration (batch size, learning rate, device)
- `agent/resnet_diffusion.yaml`: model architecture (action dim, observation dim, horizons)
- `data/resnet_dataset.yaml`: dataset paths, camera names, normalization type
- `eval/eval.yaml`: evaluation settings (checkpoint path, episode count, smoothing parameters)

Config interpolation keeps values consistent, e.g. `${agent.obs_horizon}`; a minimal entry-point sketch follows below.
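For orientation, this is roughly how a Hydra entry point such as `train_vla.py` consumes these files (a sketch; the relative `config_path` and the `train.*` keys are assumptions based on the override examples above):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../../vla/conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Interpolations such as ${agent.obs_horizon} are resolved on access.
    print(OmegaConf.to_yaml(cfg))
    print(cfg.train.batch_size, cfg.train.lr)

if __name__ == "__main__":
    main()
```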
### Dataset Format

HDF5 trajectory files (`episode_*.hdf5`) contain:
- `action`: robot actions `[T, action_dim]`
- `observations/qpos`: joint positions `[T, obs_dim]`
- `observations/images/<cam_name>`: camera images `[T, H, W, C]`

The statistics file (`data_stats.pkl`) stores the normalization parameters (min/max/mean/std).
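The keys above are enough to read an episode directly with `h5py` (a minimal sketch; the camera name `top` matches the check scripts added below):

```python
import h5py

with h5py.File("roboimi/demos/dataset/sim_transfer/episode_0.hdf5", "r") as f:
    actions = f["action"][:]                  # [T, action_dim]
    qpos = f["observations/qpos"][:]          # [T, obs_dim]
    images = f["observations/images/top"][:]  # [T, H, W, C]

print(actions.shape, qpos.shape, images.shape)
```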
## Development Guide

### Adding a New Robot

1. Create URDF/XML files under `roboimi/assets/models/manipulators/<robot_name>/`
2. Define a robot class in `roboimi/assets/robots/<robot_name>.py` (inheriting from `arm_base.py`)
3. Create environment classes in `roboimi/envs/<robot_name>_*.py`
4. Register the robot in the constants if needed

### Modifying the VLA Architecture

1. **Custom backbone**: create a new class in `roboimi/vla/models/backbones/` that inherits `VLABackbone`
2. **Custom head**: create a new class in `roboimi/vla/models/heads/` that inherits `VLAHead`
3. **Update the config**: add a new YAML file under `roboimi/vla/conf/agent/`
4. **Interfaces**: see the abstract base classes in `roboimi/vla/core/interfaces.py`
### Training Best Practices

- Always run `calculate_stats.py` after collecting new data
- Inputs/outputs are normalized during training; at inference, predictions are de-normalized with the statistics saved in the checkpoint
- The model predicts `pred_horizon` steps but executes only the first `action_horizon` steps
- Inference uses DDIM (10 steps) for fast sampling; training uses DDPM (100 steps)
- Monitor the validation loss to prevent overfitting

## Technical Details

- **Coordinate spaces**: joint space (qpos) or end-effector space (xyz + rpy + gripper)
- **Action horizons**: `obs_horizon` is the observation window, `pred_horizon` the prediction window, `action_horizon` the execution window
- **Normalization**: critical for stable training; always compute statistics before training (see the sketch below)
- **Inference speed-up**: the DDIM scheduler is about 10x faster than the DDPM used during training
- **Device**: configured via `train.device` (cuda/cpu)
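As a concrete instance of the min/max normalization mentioned above (a sketch; the repo's own normalization code may differ in detail), mapping values into [-1, 1] and back:

```python
import numpy as np

def normalize(x, x_min, x_max):
    # Map raw values into [-1, 1] for stable diffusion training.
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def unnormalize(x, x_min, x_max):
    # Invert at inference time using the statistics saved with the checkpoint.
    return (x + 1.0) / 2.0 * (x_max - x_min) + x_min

a = np.array([0.3, -0.7])
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
assert np.allclose(unnormalize(normalize(a, lo, hi), lo, hi), a)
```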
## License

[Add license information here]

## Citation

If you use this codebase in your research, please cite:

```bibtex
[Add citation information here]
```

## Contributing

Contributions are welcome! Feel free to open a Pull Request or an Issue.

## Acknowledgements

This project builds on the following open-source projects:
- [MuJoCo](https://mujoco.org/) - physics simulation engine
- [PyTorch](https://pytorch.org/) - deep learning framework
- [Hydra](https://hydra.cc/) - configuration management
- [Diffusers](https://github.com/huggingface/diffusers) - diffusion model library
91
check_all_episodes.py
Normal file
91
check_all_episodes.py
Normal file
```python
#!/usr/bin/env python3
"""
Check all episodes for duplicated frames.

Identifies which episodes are problematic and need to be deleted or re-collected.
"""
import glob
import os

import h5py
import numpy as np


def check_all_episodes():
    """Check the quality of every episode."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))

    print("=" * 80)
    print("Quality check of all episodes")
    print("=" * 80)

    good_episodes = []
    bad_episodes = []

    for ep_idx, ep_file in enumerate(episode_files):
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')

        try:
            with h5py.File(ep_file, 'r') as f:
                img_path = '/observations/images/top'
                if img_path not in f:
                    continue

                images = f[img_path][:]

                # Check the first 50 frames for duplicates.
                check_frames = min(50, len(images))
                duplicate_count = 0

                for i in range(check_frames - 1):
                    img1 = images[i]
                    img2 = images[i + 1]
                    diff = np.mean(np.abs(img1.astype(float) - img2.astype(float)))

                    if diff < 1.0:  # duplicate
                        duplicate_count += 1

                duplicate_rate = duplicate_count / check_frames * 100

                # Judge the quality.
                if duplicate_rate > 10:  # more than 10% duplicates
                    bad_episodes.append((ep_idx, ep_name, duplicate_rate, duplicate_count))
                    status = "❌"
                else:
                    good_episodes.append((ep_idx, ep_name, duplicate_rate, duplicate_count))
                    status = "✅"

                print(f"{status} Episode {ep_idx:2d}: {duplicate_rate:5.1f}% duplicates ({duplicate_count:2d}/{check_frames}) - {ep_name}")

        except Exception as e:
            print(f"❌ Episode {ep_idx}: error - {e}")

    # Summary
    print("\n" + "=" * 80)
    print("Summary")
    print("=" * 80)
    print(f"Checked:     {len(episode_files)} episodes")
    print(f"Good:        {len(good_episodes)} ✅")
    print(f"Problematic: {len(bad_episodes)} ❌")

    if bad_episodes:
        print("\nProblematic episodes:")
        for ep_idx, ep_name, rate, count in bad_episodes:
            print(f"  - {ep_name}.hdf5: {rate:.1f}% duplicates")

        print("\nDeletion command:")
        ep_names = [name for _, name, _, _ in bad_episodes]
        print("  rm " + " ".join(f"{dataset_dir}/{name}.hdf5" for name in ep_names))

    print("\nRecommendation:")
    if len(bad_episodes) > 0:
        print(f"  1. Delete the {len(bad_episodes)} problematic episodes")
        print(f"  2. Re-collect data, or use the remaining {len(good_episodes)} good episodes")
    else:
        print("  ✅ All episodes are fine and can be used directly!")


if __name__ == "__main__":
    check_all_episodes()
```
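Run it from the repository root with `python check_all_episodes.py`; the dataset directory (`roboimi/demos/dataset/sim_transfer`), the 50-frame window, and the 10% duplicate-rate threshold are hard-coded inside `check_all_episodes()`.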
## check_specific_frames.py (new file, 202 lines)
```python
#!/usr/bin/env python3
"""
Inspect specific frames - used to verify data-recording problems.

Features:
1. Extract frames 0, 1, and 2 of each episode
2. Compare the same frame index across episodes
3. Save the images for manual inspection
"""
import glob
import os

import cv2
import h5py
import numpy as np


def check_specific_frames(frame_indices=(0, 1, 2), camera='top', num_episodes=10):
    """
    Inspect the images and qpos of specific frames.

    Args:
        frame_indices: frame indices to inspect
        camera: camera name
        num_episodes: number of episodes to inspect
    """

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
    # Sort numerically.
    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))

    # Create the output directory.
    output_dir = '/tmp/dataset_frames'
    os.makedirs(output_dir, exist_ok=True)

    print(f"Inspecting specific frames of the first {min(num_episodes, len(episode_files))} episodes")
    print(f"Frame indices: {frame_indices}")
    print(f"Camera: {camera}")
    print("=" * 80)

    # Collect all the data.
    for ep_idx in range(min(num_episodes, len(episode_files))):
        ep_file = episode_files[ep_idx]
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')

        try:
            with h5py.File(ep_file, 'r') as f:
                # Read qpos.
                qpos = f['/observations/qpos'][:]

                # Read the images.
                img_path = f'/observations/images/{camera}'
                if img_path not in f:
                    print(f"Episode {ep_name}: camera {camera} does not exist")
                    continue

                images = f[img_path][:]

                print(f"\nEpisode {ep_name}:")
                print(f"  Total frames: {len(images)}")

                # Save the requested frames.
                for frame_idx in frame_indices:
                    if frame_idx >= len(images):
                        print(f"  Frame {frame_idx}: out of range")
                        continue

                    # Save the image.
                    img = images[frame_idx]
                    filename = f"{output_dir}/ep{ep_idx:02d}_frame{frame_idx:03d}.png"
                    cv2.imwrite(filename, img)

                    # Print qpos.
                    q = qpos[frame_idx]
                    print(f"  Frame {frame_idx}: qpos[0:3]=[{q[0]:6.2f}, {q[1]:6.2f}, {q[2]:6.2f}], qpos[3]={q[3]:6.2f} → {filename}")

        except Exception as e:
            print(f"Episode {ep_name}: error - {e}")

    print("\n" + "=" * 80)
    print(f"✅ All images saved to: {output_dir}")
    print("\nHow to view:")
    print(f"  eog {output_dir}/*.png")
    print("  ")
    print("  # Or compare a specific frame:")
    print(f"  eog {output_dir}/*_frame000.png  # frame 0 of all episodes")
    print(f"  eog {output_dir}/*_frame001.png  # frame 1 of all episodes")
    print(f"  eog {output_dir}/*_frame002.png  # frame 2 of all episodes")


def compare_frame_across_episodes(frame_idx=0, camera='top', num_episodes=10):
    """
    Compare one frame of all episodes side by side.

    Produces one large comparison image containing the given frame of every episode.
    """

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))
    episode_files = sorted(episode_files, key=lambda x: int(x.split('_')[-1].replace('.hdf5', '')))

    num_compare = min(num_episodes, len(episode_files))
    cols = 5  # 5 per row
    rows = (num_compare + cols - 1) // cols

    # Create the output directory.
    output_dir = '/tmp/dataset_frames'
    os.makedirs(output_dir, exist_ok=True)

    print(f"Generating comparison image: frame {frame_idx} of all episodes")
    print("=" * 80)

    # Collect the images.
    images_compare = []
    qpos_list = []

    for ep_idx in range(num_compare):
        ep_file = episode_files[ep_idx]
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')

        try:
            with h5py.File(ep_file, 'r') as f:
                qpos = f['/observations/qpos'][:]
                img_path = f'/observations/images/{camera}'

                if img_path in f and frame_idx < f[img_path].shape[0]:
                    img = f[img_path][frame_idx]
                    images_compare.append(img)
                    qpos_list.append(qpos[frame_idx])
                    print(f"Episode {ep_name}: qpos[0:3]=[{qpos[frame_idx][0]:.2f}, {qpos[frame_idx][1]:.2f}, {qpos[frame_idx][2]:.2f}]")

        except Exception as e:
            print(f"Episode {ep_name}: error - {e}")

    if not images_compare:
        print("❌ No images collected")
        return

    # Get the image size.
    h, w = images_compare[0].shape[:2]

    # Create the comparison canvas.
    compare_img = np.zeros((rows * h + 50, cols * w, 3), dtype=np.uint8)

    for i, (img, qpos) in enumerate(zip(images_compare, qpos_list)):
        row = i // cols
        col = i % cols

        y_start = row * h + 30
        y_end = y_start + h
        x_start = col * w
        x_end = x_start + w

        # Resize if needed.
        if img.shape[:2] != (h, w):
            img = cv2.resize(img, (w, h))

        compare_img[y_start:y_end, x_start:x_end] = img

        # Add labels.
        ep_name = f"Ep {i}"
        cv2.putText(compare_img, ep_name, (x_start + 10, row * h + 20),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
        cv2.putText(compare_img, f"qpos[3]={qpos[3]:.2f}", (x_start + 10, y_end - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

    # Save the comparison image.
    output_path = f"{output_dir}/compare_frame{frame_idx:03d}.png"
    cv2.imwrite(output_path, compare_img)

    print(f"\n✅ Comparison image saved: {output_path}")
    print(f"   How to view: eog {output_path}")


if __name__ == "__main__":
    import sys

    print("=" * 80)
    print("Specific-frame inspection tool")
    print("=" * 80)

    if len(sys.argv) > 1:
        frame_idx = int(sys.argv[1])
        compare_frame_across_episodes(frame_idx=frame_idx, camera='top', num_episodes=10)
    else:
        # By default, inspect frames 0, 1, and 2.
        check_specific_frames(frame_indices=(0, 1, 2), camera='top', num_episodes=10)

        print("\n" + "=" * 80)
        print("Generating comparison images...")
        print("=" * 80)

        # Generate comparison grids for frames 0-2.
        compare_frame_across_episodes(frame_idx=0, camera='top', num_episodes=10)
        compare_frame_across_episodes(frame_idx=1, camera='top', num_episodes=10)
        compare_frame_across_episodes(frame_idx=2, camera='top', num_episodes=10)

    print("\n" + "=" * 80)
    print("Other usage:")
    print("  python check_specific_frames.py 0  # only inspect frame 0")
    print("  python check_specific_frames.py 1  # only inspect frame 1")
    print("=" * 80)
```
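Without arguments, the script saves frames 0-2 of the first ten episodes to `/tmp/dataset_frames` and then builds one comparison grid per frame; with a single integer argument (`python check_specific_frames.py 0`) it only builds the grid for that frame.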
## diffusion/configuration_diffusion.py (new file, 238 lines)
```python
#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field

from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import NormalizationMode
from lerobot.optim.optimizers import AdamConfig
from lerobot.optim.schedulers import DiffuserSchedulerConfig


@PreTrainedConfig.register_subclass("diffusion")
@dataclass
class DiffusionConfig(PreTrainedConfig):
    """Configuration class for DiffusionPolicy.

    Defaults are configured for training with PushT providing proprioceptive and single camera observations.

    The parameters you will most likely need to change are the ones which depend on the environment / sensors.
    Those are: `input_shapes` and `output_shapes`.

    Notes on the inputs and outputs:
        - "observation.state" is required as an input key.
        - Either:
            - At least one key starting with "observation.image" is required as an input.
              AND/OR
            - The key "observation.environment_state" is required as input.
        - If there are multiple keys beginning with "observation.image" they are treated as multiple camera
          views. Right now we only support all images having the same shape.
        - "action" is required as an output key.

    Args:
        n_obs_steps: Number of environment steps worth of observations to pass to the policy (takes the
            current step and additional steps going back).
        horizon: Diffusion model action prediction size as detailed in `DiffusionPolicy.select_action`.
        n_action_steps: The number of action steps to run in the environment for one invocation of the policy.
            See `DiffusionPolicy.select_action` for more details.
        input_shapes: A dictionary defining the shapes of the input data for the policy. The key represents
            the input data name, and the value is a list indicating the dimensions of the corresponding data.
            For example, "observation.image" refers to an input from a camera with dimensions [3, 96, 96],
            indicating it has three color channels and 96x96 resolution. Importantly, `input_shapes` doesn't
            include batch dimension or temporal dimension.
        output_shapes: A dictionary defining the shapes of the output data for the policy. The key represents
            the output data name, and the value is a list indicating the dimensions of the corresponding data.
            For example, "action" refers to an output shape of [14], indicating 14-dimensional actions.
            Importantly, `output_shapes` doesn't include batch dimension or temporal dimension.
        input_normalization_modes: A dictionary with key representing the modality (e.g. "observation.state"),
            and the value specifies the normalization mode to apply. The two available modes are "mean_std"
            which subtracts the mean and divides by the standard deviation and "min_max" which rescales in a
            [-1, 1] range.
        output_normalization_modes: Similar dictionary as `normalize_input_modes`, but to unnormalize to the
            original scale. Note that this is also used for normalizing the training targets.
        vision_backbone: Name of the torchvision resnet backbone to use for encoding images.
        crop_shape: (H, W) shape to crop images to as a preprocessing step for the vision backbone. Must fit
            within the image size. If None, no cropping is done.
        crop_is_random: Whether the crop should be random at training time (it's always a center crop in eval
            mode).
        pretrained_backbone_weights: Pretrained weights from torchvision to initialize the backbone.
            `None` means no pretrained weights.
        use_group_norm: Whether to replace batch normalization with group normalization in the backbone.
            The group sizes are set to be about 16 (to be precise, feature_dim // 16).
        spatial_softmax_num_keypoints: Number of keypoints for SpatialSoftmax.
        use_separate_rgb_encoder_per_camera: Whether to use a separate RGB encoder for each camera view.
        down_dims: Feature dimension for each stage of temporal downsampling in the diffusion modeling Unet.
            You may provide a variable number of dimensions, therefore also controlling the degree of
            downsampling.
        kernel_size: The convolutional kernel size of the diffusion modeling Unet.
        n_groups: Number of groups used in the group norm of the Unet's convolutional blocks.
        diffusion_step_embed_dim: The Unet is conditioned on the diffusion timestep via a small non-linear
            network. This is the output dimension of that network, i.e., the embedding dimension.
        use_film_scale_modulation: FiLM (https://huggingface.co/papers/1709.07871) is used for the Unet conditioning.
            Bias modulation is used by default, while this parameter indicates whether to also use scale
            modulation.
        noise_scheduler_type: Name of the noise scheduler to use. Supported options: ["DDPM", "DDIM"].
        num_train_timesteps: Number of diffusion steps for the forward diffusion schedule.
        beta_schedule: Name of the diffusion beta schedule as per DDPMScheduler from Hugging Face diffusers.
        beta_start: Beta value for the first forward-diffusion step.
        beta_end: Beta value for the last forward-diffusion step.
        prediction_type: The type of prediction that the diffusion modeling Unet makes. Choose from "epsilon"
            or "sample". These have equivalent outcomes from a latent variable modeling perspective, but
            "epsilon" has been shown to work better in many deep neural network settings.
        clip_sample: Whether to clip the sample to [-`clip_sample_range`, +`clip_sample_range`] for each
            denoising step at inference time. WARNING: you will need to make sure your action-space is
            normalized to fit within this range.
        clip_sample_range: The magnitude of the clipping range as described above.
        num_inference_steps: Number of reverse diffusion steps to use at inference time (steps are evenly
            spaced). If not provided, this defaults to be the same as `num_train_timesteps`.
        do_mask_loss_for_padding: Whether to mask the loss when there are copy-padded actions. See
            `LeRobotDataset` and `load_previous_and_future_frames` for more information. Note, this defaults
            to False as the original Diffusion Policy implementation does the same.
    """

    # Inputs / output structure.
    n_obs_steps: int = 2
    horizon: int = 16
    n_action_steps: int = 8

    normalization_mapping: dict[str, NormalizationMode] = field(
        default_factory=lambda: {
            "VISUAL": NormalizationMode.MEAN_STD,
            "STATE": NormalizationMode.MIN_MAX,
            "ACTION": NormalizationMode.MIN_MAX,
        }
    )

    # The original implementation doesn't sample frames for the last 7 steps,
    # which avoids excessive padding and leads to improved training results.
    drop_n_last_frames: int = 7  # horizon - n_action_steps - n_obs_steps + 1

    # Architecture / modeling.
    # Vision backbone.
    vision_backbone: str = "resnet18"
    crop_shape: tuple[int, int] | None = (84, 84)
    crop_is_random: bool = True
    pretrained_backbone_weights: str | None = None
    use_group_norm: bool = True
    spatial_softmax_num_keypoints: int = 32
    use_separate_rgb_encoder_per_camera: bool = False
    # Unet.
    down_dims: tuple[int, ...] = (512, 1024, 2048)
    kernel_size: int = 5
    n_groups: int = 8
    diffusion_step_embed_dim: int = 128
    use_film_scale_modulation: bool = True
    # Noise scheduler.
    noise_scheduler_type: str = "DDPM"
    num_train_timesteps: int = 100
    beta_schedule: str = "squaredcos_cap_v2"
    beta_start: float = 0.0001
    beta_end: float = 0.02
    prediction_type: str = "epsilon"
    clip_sample: bool = True
    clip_sample_range: float = 1.0

    # Inference
    num_inference_steps: int | None = None

    # Loss computation
    do_mask_loss_for_padding: bool = False

    # Training presets
    optimizer_lr: float = 1e-4
    optimizer_betas: tuple = (0.95, 0.999)
    optimizer_eps: float = 1e-8
    optimizer_weight_decay: float = 1e-6
    scheduler_name: str = "cosine"
    scheduler_warmup_steps: int = 500

    def __post_init__(self):
        super().__post_init__()

        """Input validation (not exhaustive)."""
        if not self.vision_backbone.startswith("resnet"):
            raise ValueError(
                f"`vision_backbone` must be one of the ResNet variants. Got {self.vision_backbone}."
            )

        supported_prediction_types = ["epsilon", "sample"]
        if self.prediction_type not in supported_prediction_types:
            raise ValueError(
                f"`prediction_type` must be one of {supported_prediction_types}. Got {self.prediction_type}."
            )
        supported_noise_schedulers = ["DDPM", "DDIM"]
        if self.noise_scheduler_type not in supported_noise_schedulers:
            raise ValueError(
                f"`noise_scheduler_type` must be one of {supported_noise_schedulers}. "
                f"Got {self.noise_scheduler_type}."
            )

        # Check that the horizon size and U-Net downsampling is compatible.
        # U-Net downsamples by 2 with each stage.
        downsampling_factor = 2 ** len(self.down_dims)
        if self.horizon % downsampling_factor != 0:
            raise ValueError(
                "The horizon should be an integer multiple of the downsampling factor (which is determined "
                f"by `len(down_dims)`). Got {self.horizon=} and {self.down_dims=}"
            )

    def get_optimizer_preset(self) -> AdamConfig:
        return AdamConfig(
            lr=self.optimizer_lr,
            betas=self.optimizer_betas,
            eps=self.optimizer_eps,
            weight_decay=self.optimizer_weight_decay,
        )

    def get_scheduler_preset(self) -> DiffuserSchedulerConfig:
        return DiffuserSchedulerConfig(
            name=self.scheduler_name,
            num_warmup_steps=self.scheduler_warmup_steps,
        )

    def validate_features(self) -> None:
        if len(self.image_features) == 0 and self.env_state_feature is None:
            raise ValueError("You must provide at least one image or the environment state among the inputs.")

        if self.crop_shape is not None:
            for key, image_ft in self.image_features.items():
                if self.crop_shape[0] > image_ft.shape[1] or self.crop_shape[1] > image_ft.shape[2]:
                    raise ValueError(
                        f"`crop_shape` should fit within the images shapes. Got {self.crop_shape} "
                        f"for `crop_shape` and {image_ft.shape} for "
                        f"`{key}`."
                    )

        # Check that all input images have the same shape.
        if len(self.image_features) > 0:
            first_image_key, first_image_ft = next(iter(self.image_features.items()))
            for key, image_ft in self.image_features.items():
                if image_ft.shape != first_image_ft.shape:
                    raise ValueError(
                        f"`{key}` does not match `{first_image_key}`, but we expect all image shapes to match."
                    )

    @property
    def observation_delta_indices(self) -> list:
        return list(range(1 - self.n_obs_steps, 1))

    @property
    def action_delta_indices(self) -> list:
        return list(range(1 - self.n_obs_steps, 1 - self.n_obs_steps + self.horizon))

    @property
    def reward_delta_indices(self) -> None:
        return None
```
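A quick sanity check of the defaults (a sketch; it assumes the parent `PreTrainedConfig` in your lerobot version can be constructed without extra arguments, and the import path depends on where this file is installed):

```python
from diffusion.configuration_diffusion import DiffusionConfig

cfg = DiffusionConfig()
# With n_obs_steps=2 and horizon=16:
print(cfg.observation_delta_indices)  # [-1, 0]
print(cfg.action_delta_indices)       # [-1, 0, 1, ..., 14]
# __post_init__ enforces horizon % 2**len(down_dims) == 0 (16 % 8 == 0 here).
```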
## diffusion/modeling_diffusion.py (new file, 764 lines)
@@ -0,0 +1,764 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
|
||||
# and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
|
||||
|
||||
TODO(alexander-soare):
|
||||
- Remove reliance on diffusers for DDPMScheduler and LR scheduler.
|
||||
"""
|
||||
|
||||
import math
|
||||
from collections import deque
|
||||
from collections.abc import Callable
|
||||
|
||||
import einops
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.nn.functional as F # noqa: N812
|
||||
import torchvision
|
||||
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
|
||||
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
|
||||
from torch import Tensor, nn
|
||||
|
||||
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
|
||||
from lerobot.policies.pretrained import PreTrainedPolicy
|
||||
from lerobot.policies.utils import (
|
||||
get_device_from_parameters,
|
||||
get_dtype_from_parameters,
|
||||
get_output_shape,
|
||||
populate_queues,
|
||||
)
|
||||
from lerobot.utils.constants import ACTION, OBS_ENV_STATE, OBS_IMAGES, OBS_STATE
|
||||
|
||||
|
||||
class DiffusionPolicy(PreTrainedPolicy):
|
||||
"""
|
||||
Diffusion Policy as per "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
|
||||
(paper: https://huggingface.co/papers/2303.04137, code: https://github.com/real-stanford/diffusion_policy).
|
||||
"""
|
||||
|
||||
config_class = DiffusionConfig
|
||||
name = "diffusion"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
config: DiffusionConfig,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
Args:
|
||||
config: Policy configuration class instance or None, in which case the default instantiation of
|
||||
the configuration class is used.
|
||||
dataset_stats: Dataset statistics to be used for normalization. If not passed here, it is expected
|
||||
that they will be passed with a call to `load_state_dict` before the policy is used.
|
||||
"""
|
||||
super().__init__(config)
|
||||
config.validate_features()
|
||||
self.config = config
|
||||
|
||||
# queues are populated during rollout of the policy, they contain the n latest observations and actions
|
||||
self._queues = None
|
||||
|
||||
self.diffusion = DiffusionModel(config)
|
||||
|
||||
self.reset()
|
||||
|
||||
def get_optim_params(self) -> dict:
|
||||
return self.diffusion.parameters()
|
||||
|
||||
def reset(self):
|
||||
"""Clear observation and action queues. Should be called on `env.reset()`"""
|
||||
self._queues = {
|
||||
OBS_STATE: deque(maxlen=self.config.n_obs_steps),
|
||||
ACTION: deque(maxlen=self.config.n_action_steps),
|
||||
}
|
||||
if self.config.image_features:
|
||||
self._queues[OBS_IMAGES] = deque(maxlen=self.config.n_obs_steps)
|
||||
if self.config.env_state_feature:
|
||||
self._queues[OBS_ENV_STATE] = deque(maxlen=self.config.n_obs_steps)
|
||||
|
||||
@torch.no_grad()
|
||||
def predict_action_chunk(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
|
||||
"""Predict a chunk of actions given environment observations."""
|
||||
# stack n latest observations from the queue
|
||||
batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
|
||||
actions = self.diffusion.generate_actions(batch, noise=noise)
|
||||
|
||||
return actions
|
||||
|
||||
@torch.no_grad()
|
||||
def select_action(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
|
||||
"""Select a single action given environment observations.
|
||||
|
||||
This method handles caching a history of observations and an action trajectory generated by the
|
||||
underlying diffusion model. Here's how it works:
|
||||
- `n_obs_steps` steps worth of observations are cached (for the first steps, the observation is
|
||||
copied `n_obs_steps` times to fill the cache).
|
||||
- The diffusion model generates `horizon` steps worth of actions.
|
||||
- `n_action_steps` worth of actions are actually kept for execution, starting from the current step.
|
||||
Schematically this looks like:
|
||||
----------------------------------------------------------------------------------------------
|
||||
(legend: o = n_obs_steps, h = horizon, a = n_action_steps)
|
||||
|timestep | n-o+1 | n-o+2 | ..... | n | ..... | n+a-1 | n+a | ..... | n-o+h |
|
||||
|observation is used | YES | YES | YES | YES | NO | NO | NO | NO | NO |
|
||||
|action is generated | YES | YES | YES | YES | YES | YES | YES | YES | YES |
|
||||
|action is used | NO | NO | NO | YES | YES | YES | NO | NO | NO |
|
||||
----------------------------------------------------------------------------------------------
|
||||
Note that this means we require: `n_action_steps <= horizon - n_obs_steps + 1`. Also, note that
|
||||
"horizon" may not the best name to describe what the variable actually means, because this period is
|
||||
actually measured from the first observation which (if `n_obs_steps` > 1) happened in the past.
|
||||
"""
|
||||
# NOTE: for offline evaluation, we have action in the batch, so we need to pop it out
|
||||
if ACTION in batch:
|
||||
batch.pop(ACTION)
|
||||
|
||||
if self.config.image_features:
|
||||
batch = dict(batch) # shallow copy so that adding a key doesn't modify the original
|
||||
batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
|
||||
# NOTE: It's important that this happens after stacking the images into a single key.
|
||||
self._queues = populate_queues(self._queues, batch)
|
||||
|
||||
if len(self._queues[ACTION]) == 0:
|
||||
actions = self.predict_action_chunk(batch, noise=noise)
|
||||
self._queues[ACTION].extend(actions.transpose(0, 1))
|
||||
|
||||
action = self._queues[ACTION].popleft()
|
||||
return action
|
||||
|
||||
def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, None]:
|
||||
"""Run the batch through the model and compute the loss for training or validation."""
|
||||
if self.config.image_features:
|
||||
batch = dict(batch) # shallow copy so that adding a key doesn't modify the original
|
||||
batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
|
||||
loss = self.diffusion.compute_loss(batch)
|
||||
# no output_dict so returning None
|
||||
return loss, None
|
||||
|
||||
|
||||
def _make_noise_scheduler(name: str, **kwargs: dict) -> DDPMScheduler | DDIMScheduler:
|
||||
"""
|
||||
Factory for noise scheduler instances of the requested type. All kwargs are passed
|
||||
to the scheduler.
|
||||
"""
|
||||
if name == "DDPM":
|
||||
return DDPMScheduler(**kwargs)
|
||||
elif name == "DDIM":
|
||||
return DDIMScheduler(**kwargs)
|
||||
else:
|
||||
raise ValueError(f"Unsupported noise scheduler type {name}")
|
||||
|
||||
|
||||
class DiffusionModel(nn.Module):
|
||||
def __init__(self, config: DiffusionConfig):
|
||||
super().__init__()
|
||||
self.config = config
|
||||
|
||||
# Build observation encoders (depending on which observations are provided).
|
||||
global_cond_dim = self.config.robot_state_feature.shape[0]
|
||||
if self.config.image_features:
|
||||
num_images = len(self.config.image_features)
|
||||
if self.config.use_separate_rgb_encoder_per_camera:
|
||||
encoders = [DiffusionRgbEncoder(config) for _ in range(num_images)]
|
||||
self.rgb_encoder = nn.ModuleList(encoders)
|
||||
global_cond_dim += encoders[0].feature_dim * num_images
|
||||
else:
|
||||
self.rgb_encoder = DiffusionRgbEncoder(config)
|
||||
global_cond_dim += self.rgb_encoder.feature_dim * num_images
|
||||
if self.config.env_state_feature:
|
||||
global_cond_dim += self.config.env_state_feature.shape[0]
|
||||
|
||||
self.unet = DiffusionConditionalUnet1d(config, global_cond_dim=global_cond_dim * config.n_obs_steps)
|
||||
|
||||
self.noise_scheduler = _make_noise_scheduler(
|
||||
config.noise_scheduler_type,
|
||||
num_train_timesteps=config.num_train_timesteps,
|
||||
beta_start=config.beta_start,
|
||||
beta_end=config.beta_end,
|
||||
beta_schedule=config.beta_schedule,
|
||||
clip_sample=config.clip_sample,
|
||||
clip_sample_range=config.clip_sample_range,
|
||||
prediction_type=config.prediction_type,
|
||||
)
|
||||
|
||||
if config.num_inference_steps is None:
|
||||
self.num_inference_steps = self.noise_scheduler.config.num_train_timesteps
|
||||
else:
|
||||
self.num_inference_steps = config.num_inference_steps
|
||||
|
||||
# ========= inference ============
|
||||
def conditional_sample(
|
||||
self,
|
||||
batch_size: int,
|
||||
global_cond: Tensor | None = None,
|
||||
generator: torch.Generator | None = None,
|
||||
noise: Tensor | None = None,
|
||||
) -> Tensor:
|
||||
device = get_device_from_parameters(self)
|
||||
dtype = get_dtype_from_parameters(self)
|
||||
|
||||
# Sample prior.
|
||||
sample = (
|
||||
noise
|
||||
if noise is not None
|
||||
else torch.randn(
|
||||
size=(batch_size, self.config.horizon, self.config.action_feature.shape[0]),
|
||||
dtype=dtype,
|
||||
device=device,
|
||||
generator=generator,
|
||||
)
|
||||
)
|
||||
|
||||
self.noise_scheduler.set_timesteps(self.num_inference_steps)
|
||||
|
||||
for t in self.noise_scheduler.timesteps:
|
||||
# Predict model output.
|
||||
model_output = self.unet(
|
||||
sample,
|
||||
torch.full(sample.shape[:1], t, dtype=torch.long, device=sample.device),
|
||||
global_cond=global_cond,
|
||||
)
|
||||
# Compute previous image: x_t -> x_t-1
|
||||
sample = self.noise_scheduler.step(model_output, t, sample, generator=generator).prev_sample
|
||||
|
||||
return sample
|
||||
|
||||
def _prepare_global_conditioning(self, batch: dict[str, Tensor]) -> Tensor:
|
||||
"""Encode image features and concatenate them all together along with the state vector."""
|
||||
batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
|
||||
global_cond_feats = [batch[OBS_STATE]]
|
||||
# Extract image features.
|
||||
if self.config.image_features:
|
||||
if self.config.use_separate_rgb_encoder_per_camera:
|
||||
# Combine batch and sequence dims while rearranging to make the camera index dimension first.
|
||||
images_per_camera = einops.rearrange(batch[OBS_IMAGES], "b s n ... -> n (b s) ...")
|
||||
img_features_list = torch.cat(
|
||||
[
|
||||
encoder(images)
|
||||
for encoder, images in zip(self.rgb_encoder, images_per_camera, strict=True)
|
||||
]
|
||||
)
|
||||
# Separate batch and sequence dims back out. The camera index dim gets absorbed into the
|
||||
# feature dim (effectively concatenating the camera features).
|
||||
img_features = einops.rearrange(
|
||||
img_features_list, "(n b s) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
|
||||
)
|
||||
else:
|
||||
# Combine batch, sequence, and "which camera" dims before passing to shared encoder.
|
||||
img_features = self.rgb_encoder(
|
||||
einops.rearrange(batch[OBS_IMAGES], "b s n ... -> (b s n) ...")
|
||||
)
|
||||
# Separate batch dim and sequence dim back out. The camera index dim gets absorbed into the
|
||||
# feature dim (effectively concatenating the camera features).
|
||||
img_features = einops.rearrange(
|
||||
img_features, "(b s n) ... -> b s (n ...)", b=batch_size, s=n_obs_steps
|
||||
)
|
||||
global_cond_feats.append(img_features)
|
||||
|
||||
if self.config.env_state_feature:
|
||||
global_cond_feats.append(batch[OBS_ENV_STATE])
|
||||
|
||||
# Concatenate features then flatten to (B, global_cond_dim).
|
||||
return torch.cat(global_cond_feats, dim=-1).flatten(start_dim=1)
|
||||
|
||||
def generate_actions(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
|
||||
"""
|
||||
This function expects `batch` to have:
|
||||
{
|
||||
"observation.state": (B, n_obs_steps, state_dim)
|
||||
|
||||
"observation.images": (B, n_obs_steps, num_cameras, C, H, W)
|
||||
AND/OR
|
||||
"observation.environment_state": (B, n_obs_steps, environment_dim)
|
||||
}
|
||||
"""
|
||||
batch_size, n_obs_steps = batch[OBS_STATE].shape[:2]
|
||||
assert n_obs_steps == self.config.n_obs_steps
|
||||
|
||||
# Encode image features and concatenate them all together along with the state vector.
|
||||
global_cond = self._prepare_global_conditioning(batch) # (B, global_cond_dim)
|
||||
|
||||
# run sampling
|
||||
actions = self.conditional_sample(batch_size, global_cond=global_cond, noise=noise)
|
||||
|
||||
# Extract `n_action_steps` steps worth of actions (from the current observation).
|
||||
start = n_obs_steps - 1
|
||||
end = start + self.config.n_action_steps
|
||||
actions = actions[:, start:end]
|
||||
|
||||
return actions
|
||||
|
||||
def compute_loss(self, batch: dict[str, Tensor]) -> Tensor:
|
||||
"""
|
||||
This function expects `batch` to have (at least):
|
||||
{
|
||||
"observation.state": (B, n_obs_steps, state_dim)
|
||||
|
||||
"observation.images": (B, n_obs_steps, num_cameras, C, H, W)
|
||||
AND/OR
|
||||
"observation.environment_state": (B, n_obs_steps, environment_dim)
|
||||
|
||||
"action": (B, horizon, action_dim)
|
||||
"action_is_pad": (B, horizon)
|
||||
}
|
||||
"""
|
||||
# Input validation.
|
||||
assert set(batch).issuperset({OBS_STATE, ACTION, "action_is_pad"})
|
||||
assert OBS_IMAGES in batch or OBS_ENV_STATE in batch
|
||||
n_obs_steps = batch[OBS_STATE].shape[1]
|
||||
horizon = batch[ACTION].shape[1]
|
||||
assert horizon == self.config.horizon
|
||||
assert n_obs_steps == self.config.n_obs_steps
|
||||
|
||||
# Encode image features and concatenate them all together along with the state vector.
|
||||
global_cond = self._prepare_global_conditioning(batch) # (B, global_cond_dim)
|
||||
|
||||
# Forward diffusion.
|
||||
trajectory = batch[ACTION]
|
||||
# Sample noise to add to the trajectory.
|
||||
eps = torch.randn(trajectory.shape, device=trajectory.device)
|
||||
# Sample a random noising timestep for each item in the batch.
|
||||
timesteps = torch.randint(
|
||||
low=0,
|
||||
high=self.noise_scheduler.config.num_train_timesteps,
|
||||
size=(trajectory.shape[0],),
|
||||
device=trajectory.device,
|
||||
).long()
|
||||
# Add noise to the clean trajectories according to the noise magnitude at each timestep.
|
||||
noisy_trajectory = self.noise_scheduler.add_noise(trajectory, eps, timesteps)
|
||||
|
||||
# Run the denoising network (that might denoise the trajectory, or attempt to predict the noise).
|
||||
pred = self.unet(noisy_trajectory, timesteps, global_cond=global_cond)
|
||||
|
||||
# Compute the loss.
|
||||
# The target is either the original trajectory, or the noise.
|
||||
if self.config.prediction_type == "epsilon":
|
||||
target = eps
|
||||
elif self.config.prediction_type == "sample":
|
||||
target = batch[ACTION]
|
||||
else:
|
||||
raise ValueError(f"Unsupported prediction type {self.config.prediction_type}")
|
||||
|
||||
loss = F.mse_loss(pred, target, reduction="none")
|
||||
|
||||
# Mask loss wherever the action is padded with copies (edges of the dataset trajectory).
|
||||
if self.config.do_mask_loss_for_padding:
|
||||
if "action_is_pad" not in batch:
|
||||
raise ValueError(
|
||||
"You need to provide 'action_is_pad' in the batch when "
|
||||
f"{self.config.do_mask_loss_for_padding=}."
|
||||
)
|
||||
in_episode_bound = ~batch["action_is_pad"]
|
||||
loss = loss * in_episode_bound.unsqueeze(-1)
|
||||
|
||||
return loss.mean()
|
||||
|
||||
|
||||
class SpatialSoftmax(nn.Module):
|
||||
"""
|
||||
Spatial Soft Argmax operation described in "Deep Spatial Autoencoders for Visuomotor Learning" by Finn et al.
|
||||
(https://huggingface.co/papers/1509.06113). A minimal port of the robomimic implementation.
|
||||
|
||||
At a high level, this takes 2D feature maps (from a convnet/ViT) and returns the "center of mass"
|
||||
of activations of each channel, i.e., keypoints in the image space for the policy to focus on.
|
||||
|
||||
Example: take feature maps of size (512x10x12). We generate a grid of normalized coordinates (10x12x2):
|
||||
-----------------------------------------------------
|
||||
| (-1., -1.) | (-0.82, -1.) | ... | (1., -1.) |
|
||||
| (-1., -0.78) | (-0.82, -0.78) | ... | (1., -0.78) |
|
||||
| ... | ... | ... | ... |
|
||||
| (-1., 1.) | (-0.82, 1.) | ... | (1., 1.) |
|
||||
-----------------------------------------------------
|
||||
This is achieved by applying channel-wise softmax over the activations (512x120) and computing the dot
|
||||
product with the coordinates (120x2) to get expected points of maximal activation (512x2).
|
||||
|
||||
The example above results in 512 keypoints (corresponding to the 512 input channels). We can optionally
|
||||
provide num_kp != None to control the number of keypoints. This is achieved by a first applying a learnable
|
||||
linear mapping (in_channels, H, W) -> (num_kp, H, W).
|
||||
"""
|
||||
|
||||
def __init__(self, input_shape, num_kp=None):
|
||||
"""
|
||||
Args:
|
||||
input_shape (list): (C, H, W) input feature map shape.
|
||||
num_kp (int): number of keypoints in output. If None, output will have the same number of channels as input.
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
assert len(input_shape) == 3
|
||||
self._in_c, self._in_h, self._in_w = input_shape
|
||||
|
||||
if num_kp is not None:
|
||||
self.nets = torch.nn.Conv2d(self._in_c, num_kp, kernel_size=1)
|
||||
self._out_c = num_kp
|
||||
else:
|
||||
self.nets = None
|
||||
self._out_c = self._in_c
|
||||
|
||||
# we could use torch.linspace directly but that seems to behave slightly differently than numpy
|
||||
# and causes a small degradation in pc_success of pre-trained models.
|
||||
pos_x, pos_y = np.meshgrid(np.linspace(-1.0, 1.0, self._in_w), np.linspace(-1.0, 1.0, self._in_h))
|
||||
pos_x = torch.from_numpy(pos_x.reshape(self._in_h * self._in_w, 1)).float()
|
||||
pos_y = torch.from_numpy(pos_y.reshape(self._in_h * self._in_w, 1)).float()
|
||||
# register as buffer so it's moved to the correct device.
|
||||
self.register_buffer("pos_grid", torch.cat([pos_x, pos_y], dim=1))
|
||||
|
||||
def forward(self, features: Tensor) -> Tensor:
|
||||
"""
|
||||
Args:
|
||||
features: (B, C, H, W) input feature maps.
|
||||
Returns:
|
||||
(B, K, 2) image-space coordinates of keypoints.
|
||||
"""
|
||||
if self.nets is not None:
|
||||
features = self.nets(features)
|
||||
|
||||
# [B, K, H, W] -> [B * K, H * W] where K is number of keypoints
|
||||
features = features.reshape(-1, self._in_h * self._in_w)
|
||||
# 2d softmax normalization
|
||||
attention = F.softmax(features, dim=-1)
|
||||
# [B * K, H * W] x [H * W, 2] -> [B * K, 2] for spatial coordinate mean in x and y dimensions
|
||||
expected_xy = attention @ self.pos_grid
|
||||
# reshape to [B, K, 2]
|
||||
feature_keypoints = expected_xy.view(-1, self._out_c, 2)
|
||||
|
||||
return feature_keypoints
|
||||
|
||||
|
||||
class DiffusionRgbEncoder(nn.Module):
|
||||
"""Encodes an RGB image into a 1D feature vector.
|
||||
|
||||
Includes the ability to normalize and crop the image first.
|
||||
"""
|
||||
|
||||
def __init__(self, config: DiffusionConfig):
|
||||
super().__init__()
|
||||
# Set up optional preprocessing.
|
||||
if config.crop_shape is not None:
|
||||
self.do_crop = True
|
||||
# Always use center crop for eval
|
||||
self.center_crop = torchvision.transforms.CenterCrop(config.crop_shape)
|
||||
if config.crop_is_random:
|
||||
self.maybe_random_crop = torchvision.transforms.RandomCrop(config.crop_shape)
|
||||
else:
|
||||
self.maybe_random_crop = self.center_crop
|
||||
else:
|
||||
self.do_crop = False
|
||||
|
||||
# Set up backbone.
|
||||
backbone_model = getattr(torchvision.models, config.vision_backbone)(
|
||||
weights=config.pretrained_backbone_weights
|
||||
)
|
||||
# Note: This assumes that the layer4 feature map is children()[-3]
|
||||
# TODO(alexander-soare): Use a safer alternative.
|
||||
self.backbone = nn.Sequential(*(list(backbone_model.children())[:-2]))
|
||||
if config.use_group_norm:
|
||||
if config.pretrained_backbone_weights:
|
||||
raise ValueError(
|
||||
"You can't replace BatchNorm in a pretrained model without ruining the weights!"
|
||||
)
|
||||
self.backbone = _replace_submodules(
|
||||
root_module=self.backbone,
|
||||
predicate=lambda x: isinstance(x, nn.BatchNorm2d),
|
||||
func=lambda x: nn.GroupNorm(num_groups=x.num_features // 16, num_channels=x.num_features),
|
||||
)
|
||||
|
||||
# Set up pooling and final layers.
|
||||
# Use a dry run to get the feature map shape.
|
||||
# The dummy input should take the number of image channels from `config.image_features` and it should
|
||||
# use the height and width from `config.crop_shape` if it is provided, otherwise it should use the
|
||||
# height and width from `config.image_features`.
|
||||
|
||||
# Note: we have a check in the config class to make sure all images have the same shape.
|
||||
images_shape = next(iter(config.image_features.values())).shape
|
||||
dummy_shape_h_w = config.crop_shape if config.crop_shape is not None else images_shape[1:]
|
||||
dummy_shape = (1, images_shape[0], *dummy_shape_h_w)
|
||||
feature_map_shape = get_output_shape(self.backbone, dummy_shape)[1:]
|
||||
|
||||
self.pool = SpatialSoftmax(feature_map_shape, num_kp=config.spatial_softmax_num_keypoints)
|
||||
self.feature_dim = config.spatial_softmax_num_keypoints * 2
|
||||
self.out = nn.Linear(config.spatial_softmax_num_keypoints * 2, self.feature_dim)
|
||||
self.relu = nn.ReLU()
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
"""
|
||||
Args:
|
||||
x: (B, C, H, W) image tensor with pixel values in [0, 1].
|
||||
Returns:
|
||||
(B, D) image feature.
|
||||
"""
|
||||
# Preprocess: maybe crop (if it was set up in the __init__).
|
||||
if self.do_crop:
|
||||
if self.training: # noqa: SIM108
|
||||
x = self.maybe_random_crop(x)
|
||||
else:
|
||||
# Always use center crop for eval.
|
||||
x = self.center_crop(x)
|
||||
# Extract backbone feature.
|
||||
x = torch.flatten(self.pool(self.backbone(x)), start_dim=1)
|
||||
# Final linear layer with non-linearity.
|
||||
x = self.relu(self.out(x))
|
||||
return x
|
||||
|
||||
|
||||
def _replace_submodules(
|
||||
root_module: nn.Module, predicate: Callable[[nn.Module], bool], func: Callable[[nn.Module], nn.Module]
|
||||
) -> nn.Module:
|
||||
"""
|
||||
Args:
|
||||
root_module: The module for which the submodules need to be replaced
|
||||
predicate: Takes a module as an argument and must return True if the that module is to be replaced.
|
||||
func: Takes a module as an argument and returns a new module to replace it with.
|
||||
Returns:
|
||||
The root module with its submodules replaced.
|
||||
"""
|
||||
if predicate(root_module):
|
||||
return func(root_module)
|
||||
|
||||
replace_list = [k.split(".") for k, m in root_module.named_modules(remove_duplicate=True) if predicate(m)]
|
||||
for *parents, k in replace_list:
|
||||
parent_module = root_module
|
||||
if len(parents) > 0:
|
||||
parent_module = root_module.get_submodule(".".join(parents))
|
||||
if isinstance(parent_module, nn.Sequential):
|
||||
src_module = parent_module[int(k)]
|
||||
else:
|
||||
src_module = getattr(parent_module, k)
|
||||
tgt_module = func(src_module)
|
||||
if isinstance(parent_module, nn.Sequential):
|
||||
parent_module[int(k)] = tgt_module
|
||||
else:
|
||||
setattr(parent_module, k, tgt_module)
|
||||
# verify that all BN are replaced
|
||||
assert not any(predicate(m) for _, m in root_module.named_modules(remove_duplicate=True))
|
||||
return root_module
|
||||
|
||||
|
||||
class DiffusionSinusoidalPosEmb(nn.Module):
|
||||
"""1D sinusoidal positional embeddings as in Attention is All You Need."""
|
||||
|
||||
def __init__(self, dim: int):
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
device = x.device
|
||||
half_dim = self.dim // 2
|
||||
emb = math.log(10000) / (half_dim - 1)
|
||||
emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
|
||||
emb = x.unsqueeze(-1) * emb.unsqueeze(0)
|
||||
emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
|
||||
return emb
|
||||
|
||||
|
||||
class DiffusionConv1dBlock(nn.Module):
|
||||
"""Conv1d --> GroupNorm --> Mish"""
|
||||
|
||||
def __init__(self, inp_channels, out_channels, kernel_size, n_groups=8):
|
||||
super().__init__()
|
||||
|
||||
self.block = nn.Sequential(
|
||||
nn.Conv1d(inp_channels, out_channels, kernel_size, padding=kernel_size // 2),
|
||||
nn.GroupNorm(n_groups, out_channels),
|
||||
nn.Mish(),
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.block(x)
|
||||
|
||||
|
||||
class DiffusionConditionalUnet1d(nn.Module):
    """A 1D convolutional UNet with FiLM modulation for conditioning.

    Note: this removes local conditioning as compared to the original diffusion policy code.
    """

    def __init__(self, config: DiffusionConfig, global_cond_dim: int):
        super().__init__()

        self.config = config

        # Encoder for the diffusion timestep.
        self.diffusion_step_encoder = nn.Sequential(
            DiffusionSinusoidalPosEmb(config.diffusion_step_embed_dim),
            nn.Linear(config.diffusion_step_embed_dim, config.diffusion_step_embed_dim * 4),
            nn.Mish(),
            nn.Linear(config.diffusion_step_embed_dim * 4, config.diffusion_step_embed_dim),
        )

        # The FiLM conditioning dimension.
        cond_dim = config.diffusion_step_embed_dim + global_cond_dim

        # In channels / out channels for each downsampling block in the Unet's encoder. For the decoder, we
        # just reverse these.
        in_out = [(config.action_feature.shape[0], config.down_dims[0])] + list(
            zip(config.down_dims[:-1], config.down_dims[1:], strict=True)
        )

        # Unet encoder.
        common_res_block_kwargs = {
            "cond_dim": cond_dim,
            "kernel_size": config.kernel_size,
            "n_groups": config.n_groups,
            "use_film_scale_modulation": config.use_film_scale_modulation,
        }
        self.down_modules = nn.ModuleList([])
        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= (len(in_out) - 1)
            self.down_modules.append(
                nn.ModuleList(
                    [
                        DiffusionConditionalResidualBlock1d(dim_in, dim_out, **common_res_block_kwargs),
                        DiffusionConditionalResidualBlock1d(dim_out, dim_out, **common_res_block_kwargs),
                        # Downsample as long as it is not the last block.
                        nn.Conv1d(dim_out, dim_out, 3, 2, 1) if not is_last else nn.Identity(),
                    ]
                )
            )

        # Processing in the middle of the auto-encoder.
        self.mid_modules = nn.ModuleList(
            [
                DiffusionConditionalResidualBlock1d(
                    config.down_dims[-1], config.down_dims[-1], **common_res_block_kwargs
                ),
                DiffusionConditionalResidualBlock1d(
                    config.down_dims[-1], config.down_dims[-1], **common_res_block_kwargs
                ),
            ]
        )

        # Unet decoder.
        self.up_modules = nn.ModuleList([])
        for ind, (dim_out, dim_in) in enumerate(reversed(in_out[1:])):
            is_last = ind >= (len(in_out) - 1)
            self.up_modules.append(
                nn.ModuleList(
                    [
                        # dim_in * 2, because it takes the encoder's skip connection as well.
                        DiffusionConditionalResidualBlock1d(dim_in * 2, dim_out, **common_res_block_kwargs),
                        DiffusionConditionalResidualBlock1d(dim_out, dim_out, **common_res_block_kwargs),
                        # Upsample as long as it is not the last block.
                        nn.ConvTranspose1d(dim_out, dim_out, 4, 2, 1) if not is_last else nn.Identity(),
                    ]
                )
            )

        self.final_conv = nn.Sequential(
            DiffusionConv1dBlock(config.down_dims[0], config.down_dims[0], kernel_size=config.kernel_size),
            nn.Conv1d(config.down_dims[0], config.action_feature.shape[0], 1),
        )

    def forward(self, x: Tensor, timestep: Tensor | int, global_cond=None) -> Tensor:
        """
        Args:
            x: (B, T, input_dim) tensor for input to the Unet.
            timestep: (B,) tensor of (timestep_we_are_denoising_from - 1).
            global_cond: (B, global_cond_dim)
        Returns:
            (B, T, input_dim) diffusion model prediction.
        """
        # For 1D convolutions we'll need feature dimension first.
        x = einops.rearrange(x, "b t d -> b d t")

        timesteps_embed = self.diffusion_step_encoder(timestep)

        # If there is a global conditioning feature, concatenate it to the timestep embedding.
        if global_cond is not None:
            global_feature = torch.cat([timesteps_embed, global_cond], axis=-1)
        else:
            global_feature = timesteps_embed

        # Run encoder, keeping track of skip features to pass to the decoder.
        encoder_skip_features: list[Tensor] = []
        for resnet, resnet2, downsample in self.down_modules:
            x = resnet(x, global_feature)
            x = resnet2(x, global_feature)
            encoder_skip_features.append(x)
            x = downsample(x)

        for mid_module in self.mid_modules:
            x = mid_module(x, global_feature)

        # Run decoder, using the skip features from the encoder.
        for resnet, resnet2, upsample in self.up_modules:
            x = torch.cat((x, encoder_skip_features.pop()), dim=1)
            x = resnet(x, global_feature)
            x = resnet2(x, global_feature)
            x = upsample(x)

        x = self.final_conv(x)

        x = einops.rearrange(x, "b d t -> b t d")
        return x

class DiffusionConditionalResidualBlock1d(nn.Module):
    """ResNet style 1D convolutional block with FiLM modulation for conditioning."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        cond_dim: int,
        kernel_size: int = 3,
        n_groups: int = 8,
        # Set to True to do scale modulation with FiLM as well as bias modulation (defaults to False meaning
        # FiLM just modulates bias).
        use_film_scale_modulation: bool = False,
    ):
        super().__init__()

        self.use_film_scale_modulation = use_film_scale_modulation
        self.out_channels = out_channels

        self.conv1 = DiffusionConv1dBlock(in_channels, out_channels, kernel_size, n_groups=n_groups)

        # FiLM modulation (https://huggingface.co/papers/1709.07871) outputs per-channel bias and (maybe) scale.
        cond_channels = out_channels * 2 if use_film_scale_modulation else out_channels
        self.cond_encoder = nn.Sequential(nn.Mish(), nn.Linear(cond_dim, cond_channels))

        self.conv2 = DiffusionConv1dBlock(out_channels, out_channels, kernel_size, n_groups=n_groups)

        # A final convolution for dimension matching the residual (if needed).
        self.residual_conv = (
            nn.Conv1d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()
        )

    def forward(self, x: Tensor, cond: Tensor) -> Tensor:
        """
        Args:
            x: (B, in_channels, T)
            cond: (B, cond_dim)
        Returns:
            (B, out_channels, T)
        """
        out = self.conv1(x)

        # Get condition embedding. Unsqueeze for broadcasting to `out`, resulting in (B, out_channels, 1).
        cond_embed = self.cond_encoder(cond).unsqueeze(-1)
        if self.use_film_scale_modulation:
            # Treat the embedding as a list of scales and biases.
            scale = cond_embed[:, : self.out_channels]
            bias = cond_embed[:, self.out_channels :]
            out = scale * out + bias
        else:
            # Treat the embedding as biases.
            out = out + cond_embed

        out = self.conv2(out)
        out = out + self.residual_conv(x)
        return out
92
diffusion/processor_diffusion.py
Normal file
@@ -0,0 +1,92 @@
#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any

import torch

from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
from lerobot.processor import (
    AddBatchDimensionProcessorStep,
    DeviceProcessorStep,
    NormalizerProcessorStep,
    PolicyAction,
    PolicyProcessorPipeline,
    RenameObservationsProcessorStep,
    UnnormalizerProcessorStep,
)
from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME


def make_diffusion_pre_post_processors(
    config: DiffusionConfig,
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
    PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
    """
    Constructs pre-processor and post-processor pipelines for a diffusion policy.

    The pre-processing pipeline prepares the input data for the model by:
    1. Renaming features.
    2. Adding a batch dimension.
    3. Moving the data to the specified device.
    4. Normalizing the input and output features based on dataset statistics.

    The post-processing pipeline handles the model's output by:
    1. Unnormalizing the output features to their original scale.
    2. Moving the data to the CPU.

    Args:
        config: The configuration object for the diffusion policy,
            containing feature definitions, normalization mappings, and device information.
        dataset_stats: A dictionary of statistics used for normalization.
            Defaults to None.

    Returns:
        A tuple containing the configured pre-processor and post-processor pipelines.
    """

    input_steps = [
        RenameObservationsProcessorStep(rename_map={}),
        AddBatchDimensionProcessorStep(),
        DeviceProcessorStep(device=config.device),
        NormalizerProcessorStep(
            features={**config.input_features, **config.output_features},
            norm_map=config.normalization_mapping,
            stats=dataset_stats,
        ),
    ]
    output_steps = [
        UnnormalizerProcessorStep(
            features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
        ),
        DeviceProcessorStep(device="cpu"),
    ]
    return (
        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
            steps=input_steps,
            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
        ),
        PolicyProcessorPipeline[PolicyAction, PolicyAction](
            steps=output_steps,
            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
            to_transition=policy_action_to_transition,
            to_output=transition_to_policy_action,
        ),
    )
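A minimal usage sketch (hedged: `cfg` and `stats` stand in for a real `DiffusionConfig` and a dataset statistics dict, and it assumes the pipelines are invoked as callables, as processor pipelines are elsewhere in lerobot):

```python
# Hypothetical wiring: build the two pipelines and run them around the policy.
preprocessor, postprocessor = make_diffusion_pre_post_processors(cfg, dataset_stats=stats)
batch = preprocessor(raw_observation)   # rename -> batch dim -> device -> normalize
action = policy.select_action(batch)
action = postprocessor(action)          # unnormalize -> move to CPU
```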
@@ -0,0 +1,42 @@

# Streaming HDF5 EE Action Dataset Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Switch Diana simulation collection to streaming HDF5 writes, save images as 256x256 frames from the four camera views, and change `/action` to the raw pre-IK end-effector pose action.

**Architecture:** Add a standalone streaming HDF5 episode writer that appends qpos, raw actions, and resized images frame by frame, commits atomically when an episode succeeds, and deletes the temp file on failure. The collection script is only responsible for the rollout and for handing each step's observation/action to the writer, so whole episodes never accumulate in memory.

**Tech Stack:** Python, h5py, numpy, cv2, unittest, MuJoCo demo scripts

---

### Task 1: Establish the test boundary for the streaming writer

**Files:**
- Create: `tests/test_streaming_episode_writer.py`
- Create: `roboimi/utils/streaming_episode_writer.py`

- [ ] **Step 1: Write the failing test**
- [ ] **Step 2: Run `python -m unittest tests.test_streaming_episode_writer -v` and confirm it fails because the writer module does not exist**
- [ ] **Step 3: Implement the minimal streaming writer with temp-file commit/discard, per-frame append, and 256x256 image resize (see the sketch after this list)**
- [ ] **Step 4: Re-run `python -m unittest tests.test_streaming_episode_writer -v` and confirm it passes**
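As a reference for Step 3, a minimal sketch of the writer (hedged: the class and method names are illustrative rather than the final API, and the `/observations/...` key layout is an assumption based on the repo's usual HDF5 conventions):

```python
import os

import cv2
import h5py
import numpy as np


class StreamingEpisodeWriter:
    """Append one frame at a time; commit atomically on success, discard on failure."""

    def __init__(self, final_path: str, camera_names: list[str], image_size: int = 256):
        self.final_path = final_path
        self.tmp_path = final_path + ".tmp"
        self.camera_names = camera_names
        self.image_size = image_size
        self.f = h5py.File(self.tmp_path, "w")
        self._datasets = {}

    def _append(self, key: str, value: np.ndarray):
        if key not in self._datasets:
            # Resizable dataset so frames can be appended one by one.
            self._datasets[key] = self.f.create_dataset(
                key, shape=(0, *value.shape), maxshape=(None, *value.shape), dtype=value.dtype
            )
        ds = self._datasets[key]
        ds.resize(ds.shape[0] + 1, axis=0)
        ds[-1] = value

    def add_frame(self, qpos: np.ndarray, raw_action: np.ndarray, images: dict[str, np.ndarray]):
        self._append("/observations/qpos", qpos)
        self._append("/action", raw_action)  # raw pre-IK EE action, per the plan
        for cam in self.camera_names:
            img = cv2.resize(images[cam], (self.image_size, self.image_size))
            self._append(f"/observations/images/{cam}", img)

    def commit(self):
        self.f.close()
        os.replace(self.tmp_path, self.final_path)  # atomic rename on POSIX

    def discard(self):
        self.f.close()
        os.remove(self.tmp_path)
```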
### Task 2: Wire into the Diana collection script

**Files:**
- Modify: `roboimi/demos/diana_record_sim_episodes.py`
- Reuse: `roboimi/utils/streaming_episode_writer.py`

- [ ] **Step 1: Replace the in-memory `data_dict` / `obs` accumulation with a per-episode streaming writer lifecycle**
- [ ] **Step 2: Keep four cameras (`angle`, `r_vis`, `top`, `front`) and resize to 256x256 before persistence**
- [ ] **Step 3: Capture raw policy output before IK and write that to `/action`**
- [ ] **Step 4: On success commit to `episode_{idx}.hdf5`; on failure remove the temp file**

### Task 3: Verify the changes

**Files:**
- Verify only

- [ ] **Step 1: Run unit tests for the writer**
- [ ] **Step 2: Run one end-to-end collection episode and stop after `episode_0.hdf5` becomes readable**
- [ ] **Step 3: Verify HDF5 keys and shapes: `action=(700,16)`, image datasets are `(700,256,256,3)`, and `/action` matches raw EE action semantics (a verification sketch follows)**
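For Step 3 of Task 3, a quick verification sketch (hedged: the image key layout mirrors the writer sketch above and is an assumption, not a confirmed schema):

```python
import h5py

with h5py.File("episode_0.hdf5", "r") as f:
    assert f["/action"].shape == (700, 16)
    for cam in ("angle", "r_vis", "top", "front"):
        assert f[f"/observations/images/{cam}"].shape == (700, 256, 256, 3)
    print("keys:", list(f.keys()))
```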
@@ -0,0 +1,26 @@

# Raw Action Trajectory Viewer Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** In an interactive MuJoCo simulation window, draw the raw EE action trajectory exported from a rollout as a red trace, and launch the simulation for manual inspection.

**Architecture:** Read the raw_action / step data from the existing trajectory artifact, generate end-effector trajectory points for the left and right arms, and continuously inject red markers in the viewer render loop. Keep the implementation as an independent, reusable small script so it does not touch the training/eval main path.

**Tech Stack:** Python, NumPy, MuJoCo viewer, unittest/mock.

---

### Task 1: Extract the raw_action trajectory and generate the visualization point set
- [ ] Write a failing test that validates extracting left/right arm trajectory points from trajectory.npz
- [ ] Implement the minimal helper
- [ ] Run the test and confirm it passes

### Task 2: Render the red trajectory in the viewer with interactive inspection (see the sketch after this plan)
- [ ] Write a failing test that validates the marker configuration/calls
- [ ] Implement the viewer visualization script
- [ ] Run the test and confirm it passes

### Task 3: Launch the real simulation window for manual inspection
- [ ] Launch the viewer with the existing trajectory artifact
- [ ] Confirm the window is interactive and the red trace appears
- [ ] Report the launch command and script path to the user
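For Task 2, a minimal marker-injection sketch using the MuJoCo passive viewer (hedged: `model`/`data` loading and the exact artifact keys are assumptions; `user_scn` geometry injection is the standard pattern in mujoco >= 2.3.3):

```python
import mujoco
import mujoco.viewer
import numpy as np

# Assumed artifact layout: the first three raw_action dims are one arm's EE position.
points = np.load("trajectory.npz")["raw_action"][:, :3]

with mujoco.viewer.launch_passive(model, data) as viewer:
    while viewer.is_running():
        mujoco.mj_step(model, data)
        viewer.user_scn.ngeom = 0
        for i, p in enumerate(points[: viewer.user_scn.maxgeom]):
            mujoco.mjv_initGeom(
                viewer.user_scn.geoms[i],
                type=mujoco.mjtGeom.mjGEOM_SPHERE,
                size=np.array([0.005, 0, 0]),
                pos=p,
                mat=np.eye(3).flatten(),
                rgba=np.array([1.0, 0.0, 0.0, 1.0]),  # red trace
            )
            viewer.user_scn.ngeom = i + 1
        viewer.sync()
```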
44
docs/superpowers/plans/2026-03-31-rollout-artifacts.md
Normal file
@@ -0,0 +1,44 @@
# Rollout Artifacts Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Extend rollout evaluation so one selected checkpoint can be run once with video capture, timing breakdown, and saved EE trajectory artifacts.

**Architecture:** Keep the implementation centered in `eval_vla.py` so existing training-time rollout validation remains compatible. Add config-gated artifact capture helpers, serialize outputs under the eval run directory, and add lightweight tests for helper behavior and summary wiring; default eval behavior must remain unchanged when artifact capture is off.

**Tech Stack:** Python, Hydra/OmegaConf, NumPy, OpenCV, JSON, PyTorch unittest/mocking.

---

### Task 1: Add artifact capture configuration and helper wiring

**Files:**
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
- Modify: `roboimi/vla/conf/eval/eval.yaml`
- Test: `tests/test_eval_vla_rollout_artifacts.py`

- [ ] **Step 1: Write failing tests for optional artifact config / summary wiring**
- [ ] **Step 2: Implement config-backed artifact flags and output paths with defaults that write nothing**
- [ ] **Step 3: Verify existing eval call sites still work with defaults**

### Task 2: Add timing breakdown, video recording, and trajectory export

**Files:**
- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
- Test: `tests/test_eval_vla_rollout_artifacts.py`

- [ ] **Step 1: Write failing tests for timing aggregation, trajectory serialization, and summary schema**
- [ ] **Step 2: Implement per-step timing capture for `obs_read_ms`, `preprocess_ms`, `inference_ms`, `env_step_ms`, `loop_total_ms` (see the timing sketch after this task)**
- [ ] **Step 3: Implement MP4 recording from a chosen camera stream and canonical `trajectory.npz` export using `left_link7/right_link7` executed poses after `env.step`**
- [ ] **Step 4: Run focused tests and fix issues**
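A minimal per-step timing sketch for Step 2 (hedged: `env.get_observation`, `preprocess`, and the policy call are illustrative stand-ins for the real loop body):

```python
import time
from collections import defaultdict

timings = defaultdict(list)  # per-step arrays, later persisted to the NPZ

loop_start = time.perf_counter()

t0 = time.perf_counter()
obs = env.get_observation()                      # assumed observation accessor
timings["obs_read_ms"].append((time.perf_counter() - t0) * 1e3)

t0 = time.perf_counter()
batch = preprocess(obs)                          # image/qpos preprocessing
timings["preprocess_ms"].append((time.perf_counter() - t0) * 1e3)

t0 = time.perf_counter()
action = policy.select_action(batch)
timings["inference_ms"].append((time.perf_counter() - t0) * 1e3)

t0 = time.perf_counter()
env.step(action)
timings["env_step_ms"].append((time.perf_counter() - t0) * 1e3)

timings["loop_total_ms"].append((time.perf_counter() - loop_start) * 1e3)
```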
### Task 3: Stop training safely and execute one real rollout

**Files:**
- Use: `roboimi/demos/vla_scripts/eval_vla.py`
- Output: `runs/.../eval_artifacts/...`

- [ ] **Step 1: Stop the active training process, wait for exit, and confirm the target checkpoint is readable**
- [ ] **Step 2: Select the latest completed checkpoint if an explicit one is not provided; fall back to prior completed / best checkpoint if needed**
- [ ] **Step 3: Run one headless rollout with artifact capture enabled**
- [ ] **Step 4: Verify the MP4 / timing summary / trajectory files exist and summarize findings**
@@ -0,0 +1,268 @@

# IMF-AttnRes Policy Migration Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Migrate the IMF-AttnRes model, training objective, and one-step inference mechanism from the external `diffusion_policy@185ed659` into RoboIMI, and launch training with the same hyperparameters while keeping the three-camera visual conditioning input and the existing training/rollout workflow.

**Architecture:** Keep RoboIMI's existing ResNet three-camera observation encoding, normalization, queue-based online rollout, and training script; add the AttnRes components and the IMF transformer head, plus an IMF-specific agent that overrides the DDPM loss / DDIM inference semantics. The training script gets only minimal wiring changes so the new head/agent can use the existing optimizer, checkpointing, SwanLab, and headless rollout.

**Tech Stack:** PyTorch, Hydra, diffusers schedulers (kept only for compatible initialization), MuJoCo rollout, unittest, SwanLab

---

## File Map

### New files
- `roboimi/vla/models/heads/attnres_transformer_components.py` — local IMF AttnRes base components
- `roboimi/vla/models/heads/imf_transformer1d.py` — IMF transformer head exposing `forward(sample, r, t, cond=None)`
- `roboimi/vla/agent_imf.py` — IMF-specific VLA agent that reuses the existing observation/queue/normalization logic and overrides loss / inference
- `roboimi/vla/conf/head/imf_transformer1d.yaml` — IMF head config
- `roboimi/vla/conf/agent/resnet_imf_attnres.yaml` — IMF agent + backbone/head composition config
- `tests/test_imf_transformer1d_external_alignment.py` — alignment tests against external `185ed659`
- `tests/test_imf_vla_agent.py` — tests for the IMF agent's loss / inference / queue semantics

### Modified files
- `roboimi/demos/vla_scripts/train_vla.py` — optimizer parameter-group wiring; make sure the new agent trains seamlessly
- `roboimi/vla/conf/config.yaml` — keep the default config unchanged; enable the IMF agent only via override
- `tests/test_train_vla_transformer_optimizer.py` — cover optimizer-group behavior for the IMF head
- (if needed) `roboimi/vla/models/heads/__init__.py` or a nearby export file — expose the new head

---

### Task 1: Write the IMF transformer alignment tests

**Files:**
- Create: `tests/test_imf_transformer1d_external_alignment.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/attnres_transformer_components.py`
- Reference: `/home/droid/project/diffusion_policy/diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`

- [ ] **Step 1: Write failing tests verifying that the local IMF head aligns with external `185ed659` on state-dict keys, forward shapes, forward values, and optim groups**

```python
with torch.no_grad():
    external_out = external_model(sample=sample, r=r, t=t, cond=cond)
    local_out = local_model(sample=sample, r=r, t=t, cond=cond)
    assert torch.allclose(local_out, external_out, atol=1e-6, rtol=1e-5)
```

- [ ] **Step 2: Run the unit test and confirm it currently fails**

Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: FAIL, reporting that the `imf_transformer1d` / `attnres` modules do not exist

- [ ] **Step 3: If the tests need the existing external-loader logic, copy the minimal necessary helpers from `tests/test_transformer1d_external_alignment.py` to avoid depending on session context**

- [ ] **Step 4: Commit the test skeleton**

```bash
git add tests/test_imf_transformer1d_external_alignment.py
git commit -m "test: add IMF transformer external alignment coverage"
```

### Task 2: Implement the AttnRes components and the IMF transformer head

**Files:**
- Create: `roboimi/vla/models/heads/attnres_transformer_components.py`
- Create: `roboimi/vla/models/heads/imf_transformer1d.py`
- Modify: `tests/test_imf_transformer1d_external_alignment.py`

- [ ] **Step 1: Port the AttnRes base components from external `185ed659`, keeping names and parameter semantics identical**

Must include:
- `RMSNorm`
- `RMSNormNoWeight`
- `precompute_rope_freqs`
- `apply_rope`
- `GroupedQuerySelfAttention`
- `SwiGLUFFN`
- `AttnResOperator`
- `AttnResSubLayer`
- `AttnResTransformerBackbone`

- [ ] **Step 2: Implement the local IMF head in `imf_transformer1d.py`**

Must satisfy:
- `forward(sample, r, t, cond=None)`
- supports `backbone_type='attnres_full'` by default
- the token sequence is `[r_token, t_token, cond_tokens..., sample_tokens...]`
- the output slices back only the sample-token segment
- keeps `get_optim_groups()` for AdamW parameter grouping

- [ ] **Step 3: Run the alignment tests and fix any state-dict key / init / no-decay parameter-grouping mismatches**

Run: `python -m unittest tests.test_imf_transformer1d_external_alignment -v`
Expected: PASS

- [ ] **Step 4: Commit the model component implementation**

```bash
git add roboimi/vla/models/heads/attnres_transformer_components.py \
    roboimi/vla/models/heads/imf_transformer1d.py \
    tests/test_imf_transformer1d_external_alignment.py
git commit -m "feat: add IMF AttnRes transformer head"
```

### Task 3: Write the IMF agent behavior tests

**Files:**
- Create: `tests/test_imf_vla_agent.py`
- Reference: `roboimi/vla/agent.py`
- Reference: `tests/test_resnet_transformer_agent_wiring.py`

- [ ] **Step 1: Write failing tests covering the IMF agent's core contract**

Must cover:
1. `compute_loss()` accepts the current batch structure and returns a scalar loss
2. `predict_action()` outputs `(B, pred_horizon, action_dim)`
3. `select_action()` still works with queue/chunk semantics
4. `predict_action()` does not run the multi-step DDIM loop; it triggers exactly one IMF sampling step
5. When `action_is_pad` is present, the loss is computed only over valid actions

- [ ] **Step 2: Use a stub backbone / stub head to record call arguments, verifying that `r,t,cond` are passed through and the observation conditioning dimensions are correct**

```python
self.assertEqual(recorded['cond'].shape, (B, obs_horizon, expected_cond_dim))
self.assertTrue(torch.allclose(recorded['r'], torch.zeros(B)))
self.assertTrue(torch.allclose(recorded['t'], torch.ones(B)))
```

- [ ] **Step 3: Run the tests and confirm they currently fail**

Run: `python -m unittest tests.test_imf_vla_agent -v`
Expected: FAIL, reporting that `roboimi.vla.agent_imf` does not exist

- [ ] **Step 4: Commit the test skeleton**

```bash
git add tests/test_imf_vla_agent.py
git commit -m "test: add IMF VLA agent behavior coverage"
```

### Task 4: Implement the IMF agent and the Hydra wiring

**Files:**
- Create: `roboimi/vla/agent_imf.py`
- Create: `roboimi/vla/conf/head/imf_transformer1d.yaml`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`
- Modify: `roboimi/demos/vla_scripts/train_vla.py`
- Modify: `tests/test_train_vla_transformer_optimizer.py`
- Modify: `tests/test_imf_vla_agent.py`

- [ ] **Step 1: Implement `IMFVLAAgent` on top of `VLAAgent`**

Implementation strategy:
- Reuse `VLAAgent.__init__`, `_build_cond()`, `reset()`, `_populate_queues()`, `_prepare_observation_batch()`, `select_action()`, `get_normalization_stats()`
- Override:
  - `compute_loss()` -> IMF objective
  - `predict_action()` -> one-step sample
- Provide internal helpers:
  - `_broadcast_batch_time`
  - `_apply_conditioning` (if needed)
  - `_compute_u_and_du_dt`
  - `_compound_velocity`
  - `_sample_one_step`

- [ ] **Step 2: Add the CUDA math SDPA fallback on the JVP path, preserving the external repo's stability strategy**

- [ ] **Step 3: Add the Hydra configs so `agent=resnet_imf_attnres` can be instantiated**

Key defaults:
- `_target_: roboimi.vla.agent_imf.IMFVLAAgent`
- `head._target_: roboimi.vla.models.heads.imf_transformer1d.IMFTransformer1D`
- `head.backbone_type: attnres_full`
- `head.causal_attn: false`
- `head.time_as_cond: true`
- `head.n_cond_layers: 0`
- `inference_steps: 1`
- `camera_names: ${data.camera_names}`
- `vision_backbone.camera_names: ${agent.camera_names}`

- [ ] **Step 4: Make the training script reuse parameter grouping for any head exposing `get_optim_groups()`, instead of hard-coding the old transformer head_type**

Recommended minimal change (expanded in the sketch after this code block):
```python
use_head_groups = callable(getattr(noise_pred_net, 'get_optim_groups', None))
```
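Expanding that one-liner, a hedged sketch of the optimizer construction (it assumes `get_optim_groups(weight_decay=...)` keeps the external repo's signature; the `cfg` fields are illustrative):

```python
import torch

use_head_groups = callable(getattr(noise_pred_net, "get_optim_groups", None))
if use_head_groups:
    # The head decides which parameters get weight decay (no-decay for norms/biases).
    param_groups = noise_pred_net.get_optim_groups(weight_decay=cfg.train.weight_decay)
    optimizer = torch.optim.AdamW(param_groups, lr=cfg.train.lr)
else:
    optimizer = torch.optim.AdamW(
        noise_pred_net.parameters(), lr=cfg.train.lr, weight_decay=cfg.train.weight_decay
    )
```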
- [ ] **Step 5: Run the tests and fix wiring issues**

Run:
- `python -m unittest tests.test_imf_vla_agent -v`
- `python -m unittest tests.test_train_vla_transformer_optimizer -v`

Expected: PASS

- [ ] **Step 6: Commit the agent / config / train-script wiring**

```bash
git add roboimi/vla/agent_imf.py \
    roboimi/vla/conf/head/imf_transformer1d.yaml \
    roboimi/vla/conf/agent/resnet_imf_attnres.yaml \
    roboimi/demos/vla_scripts/train_vla.py \
    tests/test_imf_vla_agent.py \
    tests/test_train_vla_transformer_optimizer.py
git commit -m "feat: add IMF VLA agent and training wiring"
```

### Task 5: Integration verification and training launch

**Files:**
- Modify: none required unless verification exposes a real problem
- Use run artifacts under: `runs/`

- [ ] **Step 1: Run the focused test set**

Run:
```bash
python -m unittest \
    tests.test_imf_transformer1d_external_alignment \
    tests.test_imf_vla_agent \
    tests.test_resnet_transformer_agent_wiring \
    tests.test_train_vla_transformer_optimizer -v
```
Expected: PASS

- [ ] **Step 2: Run a minimal GPU training smoke task (a long run is not required)**

Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
    agent=resnet_imf_attnres \
    data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
    data.camera_names=[r_vis,top,front] \
    train.device=cuda train.max_steps=2 train.batch_size=4 train.num_workers=2 \
    train.use_swanlab=false train.rollout_val_freq_epochs=0
```
Expected: completes 2 steps successfully and produces a checkpoint / log, with no shape or JVP errors

- [ ] **Step 3: Launch IMF training with the production parameters**

Run:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
    agent=resnet_imf_attnres \
    data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
    data.camera_names=[r_vis,top,front] \
    train.device=cuda train.val_split=0.0 train.seed=42 \
    train.batch_size=80 train.lr=5e-4 train.num_workers=12 train.max_steps=150000 \
    train.log_freq=100 train.save_freq=10000 train.use_swanlab=true \
    train.swanlab_project=roboimi-vla \
    train.rollout_val_freq_epochs=5 train.rollout_validate_on_checkpoint=false \
    train.rollout_num_episodes=5 train.warmup_steps=2000 \
    train.scheduler_type=cosine train.min_lr=1e-6 train.weight_decay=1e-5 train.grad_clip=1.0 \
    agent.pred_horizon=16 agent.inference_steps=1 \
    agent.head.n_emb=384 agent.head.n_layer=18 agent.head.n_head=1 agent.head.n_kv_head=1 \
    agent.vision_backbone.pretrained_backbone_weights=null \
    agent.vision_backbone.freeze_backbone=false \
    agent.vision_backbone.use_separate_rgb_encoder_per_camera=true
```
Expected: training launches successfully, SwanLab records the full config, and a headless rollout runs every 5 epochs

- [ ] **Step 4: Record the run path, training PID, and SwanLab run name, and report them to the user**

- [ ] **Step 5: Commit the final cleanup changes (if the smoke fix needs extra patches)**

```bash
git add <changed files>
git commit -m "chore: verify IMF AttnRes training launch"
```
@@ -0,0 +1,241 @@

# VLA Training + Headless Rollout + SwanLab Design

**Date:** 2026-03-30
**Branch:** feat-align-dp-transformer-ee

## Goal
Fill in the training dependencies for the default `resnet_transformer` / `Transformer1D` route in the current repository and launch training on the dataset `/home/droid/project/diana_sim/sim_transfer`; support uploading SwanLab scalar logs during training; and provide a headless mode for later rollout validation so no MuJoCo / OpenCV GUI windows pop up.

## Non-Goals
- Do not rewrite the whole training framework
- Do not introduce a new workspace / callback framework
- Do not do complex video/media log uploads in this round
- Do not modify the dataset format itself

## Current State
- The default training config has been switched to `agent=resnet_transformer` with a `Transformer1D` head
- The current environment is missing several Python dependencies needed for training: `diffusers`, `torchvision`, `einops`, `swanlab`
- The evaluation environment `make_sim_env(task_name)` currently hard-codes `is_render=True`
- The camera thread `camera_viewer()` calls `cv2.namedWindow/imshow` by default, so a window pops up even when only the images are wanted
- The training script currently supports train/val loss and checkpoints, but has no SwanLab integration
- The dataset directory `/home/droid/project/diana_sim/sim_transfer` already holds 100 episodes, but has no `dataset_stats.pkl` yet

## User Requirements
1. Fill in the training dependencies in the existing mamba environment
2. Start training on `/home/droid/project/diana_sim/sim_transfer`
3. If rollout validation is needed during training, it should run headless with no GUI
4. Upload training metrics to SwanLab
5. The default SwanLab project name is `roboimi-vla`

## Proposed Approach
Take a "minimal necessary changes" approach:

### 1. Dependency Layer
Fill in the missing training dependencies in the existing `roboimi` environment, keeping the current environment name and script entry points unchanged.

#### Install Plan
- Environment: keep using the existing mamba environment `roboimi`
- Install method:
  - Prefer `python -m pip install` from the current env
- Packages to install:
  - `diffusers`
  - `torchvision`
  - `einops`
  - `swanlab`
- Version strategy:
  - Prefer the newest installable versions compatible with the current `torch==2.4.0`
  - If compatibility issues appear, fall back to stable versions aligned with `torch 2.4`
- Reproducibility strategy:
  - This round writes the **actually installed resolved versions** back into the repo's environment definition file to avoid environment drift

Verify the following imports before training:
- `torch`
- `hydra`
- `omegaconf`
- `diffusers`
- `torchvision`
- `einops`
- `swanlab`
- `cv2`
- `h5py`
- `mujoco`

### 2. Dataset Preparation
Reuse the existing `SimpleRobotDataset` directly, only pointing `data.dataset_dir` at:
- `/home/droid/project/diana_sim/sim_transfer`

Before training, use the existing stats script to generate:
- `/home/droid/project/diana_sim/sim_transfer/dataset_stats.pkl`

The stats-generation command should:
- run from the repository root
- write stats directly for `/home/droid/project/diana_sim/sim_transfer`
- leave the training script with no dependency on the default data directory

### 3. SwanLab Logging
Add a lightweight logging integration layer to the training script (a sketch follows this list):
- Whether SwanLab is enabled is decided by config; enabled by default
- Default project: `roboimi-vla`
- The API key is never written into the repo or config files; it is used only via local login state or environment variables
- When `train.use_swanlab=true`:
  - If `swanlab` cannot be imported, training fails fast
  - If not logged in or authentication fails, training fails fast
- At each training log point, upload:
  - `train/loss`
  - `train/lr`
  - `train/best_loss`
  - `train/step`
- At each validation, upload:
  - `val/loss`
- At the end of training, record the final checkpoint path and the best checkpoint path
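A minimal sketch of that integration (hedged: `swanlab.init`/`swanlab.log` follow SwanLab's wandb-style public API; the fail-fast wiring and `cfg` fields are assumptions about this repo's config layout):

```python
def init_swanlab_or_fail(cfg):
    # Fail fast if SwanLab is requested but unavailable or unauthenticated.
    if not cfg.train.use_swanlab:
        return None
    import swanlab  # raises ImportError immediately if the package is missing
    return swanlab.init(project=cfg.train.swanlab_project, config=dict(cfg))

# At each training log point:
# swanlab.log({"train/loss": loss, "train/lr": lr,
#              "train/best_loss": best_loss, "train/step": step})
```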
### 4. Headless Rollout Design
The goal is to let rollout validation "get image observations without opening any window".

Minimal change strategy:
- Add a `headless` / `is_render` parameter to `make_sim_env(...)`
- Add a switch to the camera thread's display logic (see the sketch after this list):
  - In headless mode, keep updating the `r_vis/top/front/...` image caches
  - But skip `cv2.namedWindow` / `cv2.imshow` / `cv2.waitKey`
- In the evaluation script:
  - Do not call `env.render()` in headless mode
  - Still allow `env._get_image_obs()` and policy inference to run normally
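A sketch of the camera-thread gating (hedged: the loop body and buffer names illustrate `camera_viewer()`'s intended structure, not its exact code):

```python
import cv2

def camera_viewer(env, image_cache, headless=False):
    while env.running:
        frames = env.capture_camera_frames()  # assumed frame accessor
        image_cache.update(frames)            # observations stay available either way
        if headless:
            continue                          # skip all GUI calls
        for name, frame in frames.items():
            cv2.imshow(name, frame)
        cv2.waitKey(1)
```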
#### Training-Time Rollout Scope
- This round **provides an optional checkpoint-time rollout validation path**, disabled by default
- When enabled, saving a checkpoint during training can invoke the repo's own rollout/eval logic for a small number of validation episodes
- This path must honor the **single authoritative switch** `eval.headless=true`, meaning:
  - no MuJoCo viewer pops up
  - no `cv2.namedWindow / cv2.imshow / cv2.waitKey` calls run
  - images can still be read and policy inference still completes
- By default no extra frequent rollouts are added, to avoid slowing training; only the capability and the config switch are provided

If verification shows the camera thread has a hard GUI dependency, the fallback strategy is:
- The training main path + SwanLab must work first
- Rollout validation stays an explicitly optional capability
- But this round must still ship a callable headless validation execution path, not just documentation

### 5. Training Execution Strategy
Execute in two steps:

#### Step A: Smoke Run
Launch one smoke training run with a small step count, confirming:
- the dataset reads correctly
- the stats file loads
- the model instantiates
- a single forward/backward step works
- checkpoints are written out correctly
- SwanLab uploads scalars successfully

#### Step B: Real Training Run
After the smoke run succeeds, launch the real training run.

## Execution Commands

### A. Stats Generation
Run from the repository root to generate:
- `/home/droid/project/diana_sim/sim_transfer/dataset_stats.pkl`

Command template:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/vla/scripts/calculate_stats.py \
    --dataset_dir /home/droid/project/diana_sim/sim_transfer
```

### B. Smoke Training Command
Run from the repository root; the key overrides include:
- `data.dataset_dir=/home/droid/project/diana_sim/sim_transfer`
- a small `train.max_steps`
- a high logging frequency
- SwanLab enabled
- output directory `checkpoints/` under the current run directory

Command template:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
    data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
    train.max_steps=20 \
    train.log_freq=1 \
    train.save_freq=10 \
    train.use_swanlab=true \
    train.swanlab_project=roboimi-vla \
    train.rollout_validate_on_checkpoint=false
```

### C. Real Training Command
Run from the repository root; the key overrides include:
- `data.dataset_dir=/home/droid/project/diana_sim/sim_transfer`
- the production `train.max_steps`
- default project=`roboimi-vla`
- if rollout validation is enabled, pass `eval.headless=true` plus the training-side rollout switch

Command template:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
    data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
    train.use_swanlab=true \
    train.swanlab_project=roboimi-vla \
    train.rollout_validate_on_checkpoint=true \
    eval.headless=true
```

### D. Output Behavior
- Checkpoint output directory: `checkpoints/` under the current working directory
- Key files:
  - `checkpoints/vla_model_step_<N>.pt`
  - `checkpoints/vla_model_best.pt`
  - `checkpoints/vla_model_final.pt`

## File-Level Changes
- `environment.yml`
  - write back the newly added training dependencies for reproducibility
- `roboimi/demos/vla_scripts/train_vla.py`
  - add SwanLab integration
  - add clearer dataset-directory override support
  - add the optional checkpoint-time rollout validation entry point
  - keep the current optimizer alignment logic unchanged
- `roboimi/vla/conf/config.yaml`
  - add/extend config keys for training logs, SwanLab, and rollout
- `roboimi/vla/conf/eval/eval.yaml`
  - add evaluation controls such as `headless`
- `roboimi/envs/double_pos_ctrl_env.py`
  - `make_sim_env` supports headless / no-render
- `roboimi/envs/double_base.py`
  - decouple camera capture from GUI display
- `roboimi/vla/scripts/calculate_stats.py`
  - support passing an external `dataset_dir` directly on the command line
- tests (new)
  - cover the optional SwanLab initialization path
  - cover the key "no window, but images still readable" logic in headless mode

## Validation Plan
1. After filling in the dependencies, verify all imports pass
2. Generate `dataset_stats.pkl`
3. Run the training smoke run
4. Confirm the SwanLab dashboard shows scalar updates under project `roboimi-vla`
5. If rollout validation is enabled: confirm no GUI pops up in headless mode and the rollout path actually executes
6. Then launch the real training run

## Config Contract
The config keys added/fixed in this round take the following form:
- `train.use_swanlab: true|false`
- `train.swanlab_project: roboimi-vla`
- `train.rollout_validate_on_checkpoint: true|false`
- `eval.headless: true|false`

## Risks and Mitigations
- **Risk:** The GUI/camera thread is coupled to off-screen rendering
  - **Mitigation:** Decouple display from image updates first; if necessary, demote rollout validation to a second phase
- **Risk:** The existing env dependencies are incomplete
  - **Mitigation:** Verify imports first, then run the smoke run
- **Risk:** The dataset is so large that even the smoke run is slow
  - **Mitigation:** Keep the smoke run to a tiny step count
- **Risk:** SwanLab API key leakage
  - **Mitigation:** Never write it into code/config; keep it only in local login state or environment variables

## Success Criteria
- The training script launches on `/home/droid/project/diana_sim/sim_transfer`
- Checkpoints are successfully written to `checkpoints/`
- SwanLab shows train/val scalars under the `roboimi-vla` project
- Headless rollout has an execution path that opens no GUI
- If training-side rollout validation is enabled, that path can actually be invoked in headless mode
@@ -0,0 +1,16 @@

# Rollout Artifacts Design

**Goal:** Add a one-off evaluation path that can record rollout video, export per-step timing breakdowns, and save executed end-effector trajectories for a selected checkpoint while preserving default eval behavior when artifact capture is disabled.

**Approach:** Extend `roboimi/demos/vla_scripts/eval_vla.py` with optional evaluation-time artifact capture that stays backward compatible when disabled. Reuse existing environment observation and camera streams, record one camera stream to MP4, collect per-step timing around observation read / preprocessing / model inference / env step / total loop, and save per-step raw predicted EE actions plus executed EE poses after stepping.

**Artifact contract:**
- `video.mp4`: optional MP4 encoded from a selected camera stream (`r_vis`, `top`, `front`, etc.), written only when recording is enabled.
- `trajectory.npz`: canonical trajectory export containing at minimum `step`, `reward`, `raw_action`, `executed_left_link7_pos`, `executed_left_link7_quat`, `executed_right_link7_pos`, `executed_right_link7_quat`, and optional duplicated tool-body poses if captured.
- `timing.json`: JSON-serializable per-episode timing summary in milliseconds for `obs_read_ms`, `preprocess_ms`, `inference_ms`, `env_step_ms`, `loop_total_ms`, plus aggregate mean/std/min/max and counts. Raw per-step timing arrays should also be persisted in the NPZ for later analysis.

**Checkpoint selection:** Prefer an explicitly requested checkpoint path. If the caller asks for "latest" or omits a path in the execution helper, select the newest fully written checkpoint file by mtime/name and fail clearly if none exists. A selection sketch follows.
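A sketch of that selection rule (hedged: the checkpoint filename pattern follows the training design's Output Behavior section; completeness checking is simplified to mtime ordering):

```python
from pathlib import Path

def select_latest_checkpoint(ckpt_dir: str) -> Path:
    # Newest step checkpoint by mtime; fail clearly if none exists.
    candidates = sorted(
        Path(ckpt_dir).glob("vla_model_step_*.pt"), key=lambda p: p.stat().st_mtime
    )
    if not candidates:
        raise FileNotFoundError(f"no completed checkpoints under {ckpt_dir}")
    return candidates[-1]
```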
**Stop-training / execution safety:** Before rollout, stop any active training process using the target run, wait for process exit, then verify the chosen checkpoint exists and is readable. If the most recent checkpoint is missing or mid-write, fall back to the previous completed checkpoint or `vla_model_best.pt`, with the decision logged.

**Backward compatibility:** With all new eval flags left at default values, the `_run_eval` return shape must remain compatible with existing callers, training-time rollout validation should continue to work without passing new options, and no artifact files should be written.
272
docs/superpowers/specs/2026-04-01-imf-attnres-policy-design.md
Normal file
@@ -0,0 +1,272 @@

# IMF-AttnRes Policy Migration Design

**Date:** 2026-04-01
**Status:** Approved in chat, written spec pending review

## Goal

Migrate the IMF-AttnRes diffusion policy from commit `185ed659` of `/home/droid/project/diffusion_policy` into the current `roboimi` repository as an alternative training option to the current DiT / Transformer diffusion policy. Migrate its training objective and one-step inference mechanism as well, while keeping RoboIMI's existing simulation environment, three-camera visual input, dataset format, training scripts, and rollout validation workflow usable.

## Non-Goals

- Do not migrate the external repo's obs encoders, datasets, env wrappers, or PushT-specific logic unrelated to the current task.
- Do not replicate the external repo's full directory structure; migrate only the model, loss, and inference semantics that RoboIMI training needs.
- Do not also keep the old DiT as the default training target in this work; the old configs remain usable, but the new model gets its own config entry point.

## User-Confirmed Requirements

1. The migration target is the **IMF-AttnRes model code** in `185ed659`.
2. Migrate more than the skeleton; also migrate:
   - **the training objective**
   - **the one-step inference mechanism**
3. Visual input matches the current RoboIMI diffusion policy:
   - three camera images serve as conditioning input
   - image observations must be conditioning, not concatenated into the prediction target
4. In the current task, the IMF policy replaces the existing DiT/Transformer diffusion policy for training.
5. Training parameters largely follow the most recent run (to be explicitly overridden by the training command later), but inference switches to IMF's one-step mechanism.
6. The user accepts IMF's "full attention / non-causal attention" implementation constraint.

## External Source of Truth

Migration semantics are defined by the following files in the external repo:

- `diffusion_policy/model/diffusion/attnres_transformer_components.py`
- `diffusion_policy/model/diffusion/imf_transformer_for_diffusion.py`
- `diffusion_policy/policy/imf_transformer_hybrid_image_policy.py`
- Reference config: `image_pusht_diffusion_policy_dit_imf_attnres_full.yaml`

The most important difference: this policy is not multi-step DDPM/DDIM denoising, but an IMF training objective plus one-step inference.

## Current RoboIMI Baseline

The RoboIMI baseline directly relevant to this task:

- Visual encoding: `ResNetDiffusionBackbone`
  - three cameras: `r_vis`, `top`, `front`
  - at each timestep, camera features are concatenated with `qpos` into a per-step condition
- Policy body: `VLAAgent`
  - `compute_loss()` uses the DDPM noise-prediction loss
  - `predict_action()` uses multi-step DDIM sampling
  - online control triggers chunked predictions via the action queue in `select_action()`
- Training script: `roboimi/demos/vla_scripts/train_vla.py`
  - supports GPU training, SwanLab logging, headless rollout validation

The core of this migration is therefore not swapping the visual backbone, but replacing the **head + loss + inference semantics**.

## Recommended Integration Approach

Use a **minimally invasive integration**:

1. **Keep RoboIMI's visual encoding, data loading, rollout/eval, and the training script's main framework.**
2. **Add an IMF-specific head module**, implemented locally in RoboIMI:
   - the AttnRes components
   - the IMF transformer body
3. **Add an IMF-specific agent** that reuses the current `VLAAgent`'s:
   - normalization logic
   - camera-order management
   - observation cache / action chunk cache
   - rollout interface
   but overrides:
   - `compute_loss()`
   - `predict_action()`
4. **Add a standalone Hydra config** so the IMF policy becomes a new agent option without breaking the existing resnet_transformer / gr00t_dit configs.

The reasons:

- migrating the IMF semantics should not disturb the current DDPM agent;
- rollout / eval / checkpoint logic stays reusable;
- it enables direct A/B training comparisons against the existing Transformer / DiT.

## Architecture

### 1. Observation / Conditioning Path

Keep RoboIMI's current visual path:

- Input observations: `images={r_vis, top, front}` + `qpos`
- `ResNetDiffusionBackbone` encodes each camera into a per-camera feature
- `state_encoder` encodes `qpos`
- Per timestep, the three camera features and the state feature are concatenated into `per_step_cond`

The external repo's obs_encoder implementation is not migrated; we only align the semantics of **"images enter the transformer as condition tokens"**.

### 2. Condition Tokenization

Align with the external IMF transformer's token usage:

- Action trajectory tokens: `(B, pred_horizon, action_dim)` mapped to `n_emb` via a linear layer
- Time tokens: two scalars `r` and `t`, each turned into a token via sinusoidal embedding + linear projection
- Observation tokens: `per_step_cond` mapped to `n_emb` via a linear layer
- The final token sequence is:
  - `[r_token, t_token, obs_cond_tokens..., action_tokens...]`

In the current task, the number of obs tokens equals `obs_horizon`, and image observations are always conditioning input.

### 3. IMF-AttnRes Backbone

Add an AttnRes backbone implementation inside RoboIMI, keeping the external commit's key semantics:

- `RMSNorm` / `RMSNormNoWeight`
- RoPE
- Grouped Query Self-Attention
- SwiGLU FFN
- AttnRes operator / residual source aggregation
- `AttnResTransformerBackbone`

And preserve:

- **full attention** (no causal attention)
- `backbone_type='attnres_full'`
- the output slices back only the action-token segment, then passes through the final norm + head to produce a velocity-like output

(A reference RMSNorm sketch follows.)
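For orientation, the standard RMSNorm formulation the component list refers to (a hedged sketch; the external `185ed659` implementation may differ in details such as eps or dtype handling):

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: scale by 1/RMS(x), then a learned per-dim gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```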
### 4. Training Objective

The training objective changes from the current DDPM epsilon prediction to the external IMF objective.

Given a ground-truth trajectory `x` and random noise `e`:

1. Sample `t ~ U(0,1)` and `r ~ U(0,1)`, sorted so that `t >= r`
2. Construct the interpolated state:
   - `z_t = (1 - t) x + t e`
3. Compute with the model:
   - `v = f(z_t, t, t, cond)`
4. Take the JVP of `g(z, r, t) = f(z, r, t, cond)` to obtain:
   - `u, du_dt`
5. Construct the compound velocity:
   - `V = u + (t - r) * du_dt`
6. The target is:
   - `target = e - x`
7. The final loss is the MSE over the action dimensions

RoboIMI's existing `action_is_pad` batch field must remain supported; when padding exists, the loss is computed only over valid actions. (A JVP sketch for steps 4-5 follows.)
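A hedged sketch of steps 4-5 using `torch.func.jvp` (shapes assume `(B, T, D)` action tensors; how `u`, `du_dt`, and `v` combine into the final loss must follow the external source exactly, and this is not the migrated code):

```python
import torch


def compute_u_and_du_dt(model, z_t, r, t, cond):
    # Differentiate g(z, r, t) along the t direction only:
    # the tangent is 1 for t and 0 for the other primals.
    def g(z, r_, t_):
        return model(z, r_, t_, cond)

    u, du_dt = torch.func.jvp(
        g,
        (z_t, r, t),
        (torch.zeros_like(z_t), torch.zeros_like(r), torch.ones_like(t)),
    )
    return u, du_dt


def compound_velocity(u, du_dt, r, t):
    # V = u + (t - r) * du/dt, broadcast over the (T, D) action dimensions.
    return u + (t - r).view(-1, 1, 1) * du_dt
```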
### 5. One-Step Inference

Inference switches to the external IMF one-step sampling semantics:

1. Initialize the action trajectory `z_t` from a standard Gaussian
2. Compute `u = f(z_t, r=0, t=1, cond)`
3. Update in one step:
   - `x_hat = z_t - (t - r) * u = z_t - u`
4. Unnormalize to obtain the action sequence

This means:

- `num_inference_steps` is fixed at `1` for the IMF policy
- the DDIM scheduler's multi-step `step()` is no longer called
- online control keeps the current chunk mechanism:
  - when the action queue is empty, one `predict_action_chunk()` is triggered
  - the slice `[obs_horizon-1 : obs_horizon-1+num_action_steps]` of the predicted sequence is enqueued

In other words, **the rule that triggers a model forward pass is unchanged; what changes is how each triggered pass generates the action sequence** (see the sketch below).
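A minimal sketch of that one-step sampler (hedged: the shapes and model signature follow the design above, not a confirmed final API):

```python
import torch


@torch.no_grad()
def imf_one_step_sample(model, cond, batch_size, pred_horizon, action_dim, device):
    z = torch.randn(batch_size, pred_horizon, action_dim, device=device)  # z_t at t = 1
    r = torch.zeros(batch_size, device=device)
    t = torch.ones(batch_size, device=device)
    u = model(z, r, t, cond)  # velocity-like prediction
    return z - u              # x_hat = z_t - (t - r) * u, with t - r = 1
```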
## API / Code Structure

The main planned code boundaries:

- `roboimi/vla/models/heads/attnres_transformer_components.py`
  - IMF AttnRes base components
- `roboimi/vla/models/heads/imf_transformer1d.py`
  - the RoboIMI version of the IMF transformer head
  - exposes `forward(sample, r, t, cond=None)`
  - exposes `get_optim_groups()` for AdamW parameter grouping
- `roboimi/vla/agent_imf.py`
  - reuses `VLAAgent`'s observation handling / normalization / queue infrastructure
  - overrides the IMF training loss and one-step prediction logic
- Hydra config
  - `roboimi/vla/conf/head/imf_transformer1d.yaml`
  - `roboimi/vla/conf/agent/resnet_imf_attnres.yaml`

The training script's main flow changes as little as possible; it only needs to instantiate the new agent and keep using the current rollout / checkpoint / swanlab logic.

## Compatibility Decisions

## Initial Config Defaults To Preserve

To avoid semantic drift during migration, the first IMF config explicitly fixes these defaults:

- `backbone_type: attnres_full`
- `n_head: 1`
- `n_kv_head: 1`
- `n_cond_layers: 0`
- `time_as_cond: true`
- `causal_attn: false`
- `num_inference_steps: 1`

These defaults match how external `185ed659` uses IMF-AttnRes; later tuning can override them, but the first migration must first run with exactly these semantics.

### Reuse From RoboIMI

Keep:

- the three-camera data loading
- the ResNet visual backbone
- qpos / action normalization
- the training loop, optimizer, scheduler, SwanLab, headless rollout
- `select_action()`'s online chunked execution

### Replace With External IMF Semantics

Replace:

- the transformer head implementation
- the diffusion training objective
- the inference sampling semantics

### Intentionally Not Mirrored 1:1

Parts deliberately not kept identical to the external repo:

- the external repo's overall policy base-class hierarchy
- the external repo's obs encoder module tree
- the external repo's normalizer / mask generator framework

The reason: RoboIMI already has stable data interfaces and rollout flows, so grafting directly into them is more robust.

## Testing / Verification Strategy

After the migration, verify at least the following:

1. **Unit / smoke validation**
   - the IMF head's forward shapes are correct
   - the IMF agent's `compute_loss()` runs forward and backward on a real batch
   - the IMF agent's `predict_action()` outputs `(B, pred_horizon, action_dim)`
2. **Training pipeline validation**
   - run a short GPU training job, confirming:
     - the dataloader works
     - the optimizer / lr scheduler work
     - SwanLab records the config and training metrics
3. **Rollout validation**
   - periodic headless rollouts during training run through
   - the environment still receives actions through the EE-style `step()`
4. **Final delivery**
   - launch the real training run with the user-specified class of hyperparameters

## Risks and Mitigations

### Risk 1: JVP is unstable on CUDA attention kernels

Mitigation: follow the external repo's strategy of switching to the math SDP kernel on the JVP path, falling back to `torch.autograd.functional.jvp` if necessary. The tangent construction for the JVP and the `u, du_dt` computation must strictly match the external source; this migration does not rewrite their mathematical semantics.

### Risk 2: The optimizer parameter grouping misses new modules

Mitigation: the IMF head provides `get_optim_groups()`, and the training script adopts the uniform policy "use it whenever the head exposes that interface" instead of binding to the old `head_type`.

### Risk 3: The existing rollout logic assumes multi-step DDIM sampling

Mitigation: keep the `select_action()` / `predict_action_chunk()` interfaces unchanged and replace only the internals of `predict_action()`, so eval code never needs to understand IMF details.

### Risk 4: Training command parameters diverge from the new config

Mitigation: add a standalone agent config and keep the previous training parameters as an explicit CLI override template.

## Success Criteria

The migration is considered successful when all of the following hold:

1. RoboIMI gains an IMF-AttnRes policy that can be enabled on its own via Hydra config.
2. Training uses the external IMF loss rather than the current DDPM epsilon loss.
3. Inference uses one-step IMF sampling rather than multi-step DDIM sampling.
4. The three camera images always participate in the model forward pass as conditioning input.
5. Online rollout runs through in the headless simulation environment.
6. Training can be launched from the most recent experiment's parameter template.
458
environment.yml
Normal file
@@ -0,0 +1,458 @@
name: roboimi
channels:
  - conda-forge
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _python_abi3_support=1.0
  - aiohappyeyeballs=2.6.1
  - aiohttp=3.13.3
  - aiosignal=1.4.0
  - alsa-lib=1.2.9
  - anyio=4.12.1
  - aom=3.5.0
  - async-timeout=5.0.1
  - attr=2.5.1
  - attrs=25.4.0
  - aws-c-auth=0.7.22
  - aws-c-cal=0.6.15
  - aws-c-common=0.9.23
  - aws-c-compression=0.2.18
  - aws-c-event-stream=0.4.2
  - aws-c-http=0.8.2
  - aws-c-io=0.14.9
  - aws-c-mqtt=0.10.4
  - aws-c-s3=0.5.10
  - aws-c-sdkutils=0.1.16
  - aws-checksums=0.1.18
  - aws-crt-cpp=0.26.12
  - aws-sdk-cpp=1.11.329
  - box2d-py=2.3.8
  - brotli=1.1.0
  - brotli-bin=1.1.0
  - brotli-python=1.1.0
  - bzip2=1.0.8
  - c-ares=1.34.6
  - ca-certificates=2026.1.4
  - cairo=1.16.0
  - certifi=2026.1.4
  - cffi=1.17.1
  - charset-normalizer=3.4.4
  - click=8.3.1
  - cloudpickle=3.0.0
  - contourpy=1.3.0
  - cpython=3.10.19
  - cuda-cudart=12.6.68
  - cuda-cudart_linux-64=12.6.68
  - cuda-nvrtc=12.6.68
  - cuda-nvtx=12.6.68
  - cuda-version=12.6
  - cudnn=8.9.7.29
  - cycler=0.12.1
  - datasets=4.0.0
  - dav1d=1.2.1
  - dbus=1.13.6
  - dill=0.3.8
  - eigen=3.4.0
  - exceptiongroup=1.3.1
  - expat=2.6.3
  - farama-notifications=0.0.4
  - filelock=3.15.4
  - fluidsynth=2.3.3
  - font-ttf-dejavu-sans-mono=2.37
  - font-ttf-inconsolata=3.000
  - font-ttf-source-code-pro=2.038
  - font-ttf-ubuntu=0.83
  - fontconfig=2.14.2
  - fonts-conda-ecosystem=1
  - fonts-conda-forge=1
  - fonttools=4.53.1
  - freetype=2.12.1
  - frozenlist=1.7.0
  - fsspec=2024.6.1
  - gettext=0.22.5
  - gettext-tools=0.22.5
  - gflags=2.2.2
  - git-lfs=3.7.1
  - glog=0.7.1
  - gmp=6.3.0
  - gmpy2=2.1.5
  - graphite2=1.3.13
  - gym=0.26.1
  - gym-box2d=0.26.1
  - gym-notices=0.0.8
  - gymnasium=0.29.1
  - h11=0.16.0
  - h2=4.3.0
  - harfbuzz=7.3.0
  - hf-xet=1.2.1
  - hpack=4.1.0
  - httpcore=1.0.9
  - httpx=0.28.1
  - huggingface_hub=1.3.5
  - hyperframe=6.1.0
  - icu=72.1
  - idna=3.11
  - jack=1.9.22
  - jax-jumpy=1.0.0
  - jinja2=3.1.4
  - jpeg=9e
  - keyutils=1.6.3
  - kiwisolver=1.4.9
  - krb5=1.21.3
  - lame=3.100
  - lcms2=2.15
  - ld_impl_linux-64=2.40
  - lerc=4.0.0
  - libabseil=20240116.2
  - libarrow=16.1.0
  - libarrow-acero=16.1.0
  - libarrow-dataset=16.1.0
  - libarrow-substrait=16.1.0
  - libasprintf=0.22.5
  - libasprintf-devel=0.22.5
  - libavif=0.11.1
  - libblas=3.9.0
  - libbrotlicommon=1.1.0
  - libbrotlidec=1.1.0
  - libbrotlienc=1.1.0
  - libcap=2.69
  - libcblas=3.9.0
  - libcrc32c=1.1.2
  - libcublas=12.6.1.4
  - libcufft=11.2.6.59
  - libcurand=10.3.7.68
  - libcurl=8.12.1
  - libcusolver=11.6.4.69
  - libcusparse=12.5.3.3
  - libdb=6.2.32
  - libdeflate=1.17
  - libedit=3.1.20250104
  - libev=4.33
  - libevent=2.1.12
  - libexpat=2.6.3
  - libffi=3.4.2
  - libflac=1.4.3
  - libgcc=14.1.0
  - libgcc-ng=14.1.0
  - libgcrypt=1.11.0
  - libgettextpo=0.22.5
  - libgettextpo-devel=0.22.5
  - libgfortran=14.1.0
  - libgfortran-ng=14.1.0
  - libgfortran5=14.1.0
  - libglib=2.80.3
  - libgoogle-cloud=2.25.0
  - libgoogle-cloud-storage=2.25.0
  - libgpg-error=1.50
  - libgrpc=1.62.2
  - libhwloc=2.9.3
  - libiconv=1.17
  - libjpeg-turbo=2.1.4
  - liblapack=3.9.0
  - libmad=0.15.1b
  - libmagma=2.8.0
  - libmagma_sparse=2.8.0
  - libnghttp2=1.67.0
  - libnsl=2.0.1
  - libnvjitlink=12.6.68
  - libogg=1.3.5
  - libopenblas=0.3.27
  - libopus=1.3.1
  - libparquet=16.1.0
  - libpng=1.6.43
  - libprotobuf=4.25.3
  - libre2-11=2023.09.01
  - libsndfile=1.2.2
  - libsqlite=3.46.0
  - libssh2=1.11.1
  - libstdcxx=14.1.0
  - libstdcxx-ng=14.1.0
  - libsystemd0=256.5
  - libthrift=0.19.0
  - libtiff=4.5.0
  - libtorch=2.4.0
  - libutf8proc=2.8.0
  - libuuid=2.38.1
  - libuv=1.48.0
  - libvorbis=1.3.7
  - libwebp-base=1.4.0
  - libxcb=1.13
  - libxcrypt=4.4.36
  - libxml2=2.11.5
  - libzlib=1.3.1
  - llvm-openmp=18.1.8
  - lz4-c=1.9.4
  - markupsafe=2.1.5
  - matplotlib-base=3.9.2
  - mkl=2023.2.0
  - mpc=1.3.1
  - mpfr=4.2.1
  - mpg123=1.31.3
  - mpmath=1.3.0
  - multidict=6.7.0
  - multiprocess=0.70.16
  - munkres=1.1.4
  - nccl=2.22.3.1
  - ncurses=6.5
  - networkx=3.3
  - numpy=1.26.4
  - openjpeg=2.5.0
  - openssl=3.6.1
  - opusfile=0.12
  - orc=2.0.1
  - orocos-kdl=1.5.1
  - packaging=24.1
  - pandas=2.2.2
  - pcre2=10.44
  - pillow=9.4.0
  - pip=24.2
  - pixman=0.43.2
  - portaudio=19.6.0
  - portmidi=2.0.4
  - propcache=0.3.1
  - pthread-stubs=0.4
  - pulseaudio-client=16.1
  - pyarrow=16.1.0
  - pyarrow-core=16.1.0
  - pybind11=2.13.5
  - pybind11-global=2.13.5
  - pycparser=2.22
  - pygame=2.1.3
  - pyparsing=3.1.4
  - pysocks=1.7.1
  - python=3.10.14
  - python-dateutil=2.9.0
  - python-gil=3.10.19
  - python-orocos-kdl=1.5.1
  - python-tzdata=2024.1
  - python-xxhash=3.6.0
  - python_abi=3.10
  - pytorch=2.4.0
  - hydra-core=1.3.2
  - omegaconf=2.3.0
  - einops=0.8.2
  - diffusers=0.36.0
  - torchvision=0.19.0
  - pytz=2024.1
  - pyyaml=6.0.3
  - qhull=2020.2
  - re2=2023.09.01
  - readline=8.2
  - regex=2026.1.15
  - requests=2.32.5
  - s2n=1.4.16
  - safetensors=0.7.0
  - sdl2=2.26.5
  - sdl2_image=2.6.3
  - sdl2_mixer=2.6.3
  - sdl2_ttf=2.20.2
  - setuptools=72.2.0
  - shellingham=1.5.4
  - six=1.16.0
  - sleef=3.6.1
  - snappy=1.2.2
  - sniffio=1.3.1
  - stable-baselines3=2.3.2
  - sympy=1.13.2
  - tbb=2021.11.0
  - tk=8.6.13
  - tokenizers=0.22.2
  - tqdm=4.67.2
  - transformers=5.0.0
  - typer-slim=0.21.1
  - typing-extensions=4.12.2
  - typing_extensions=4.12.2
  - tzdata=2024a
  - unicodedata2=15.1.0
  - urllib3=2.5.0
  - wheel=0.44.0
  - xorg-kbproto=1.0.7
  - xorg-libice=1.1.1
  - xorg-libsm=1.2.4
  - xorg-libx11=1.8.4
  - xorg-libxau=1.0.11
  - xorg-libxdmcp=1.1.3
  - xorg-libxext=1.3.4
  - xorg-libxrender=0.9.10
  - xorg-renderproto=0.11.1
  - xorg-xextproto=7.3.0
  - xorg-xproto=7.0.31
  - xxhash=0.8.3
  - xz=5.2.6
  - yaml=0.2.5
  - yarl=1.22.0
  - zlib=1.3.1
  - zstandard=0.23.0
  - zstd=1.5.6
  - pip:
      - GitPython==3.1.46
      - Jinja2==3.1.6
      - MarkupSafe==3.0.3
      - PyOpenGL==3.1.7
      - PyYAML==6.0.3
      - Pygments==2.19.2
      - absl-py==2.1.0
      - accelerate==1.12.0
      - aiofiles==24.1.0
      - aiohappyeyeballs==2.6.1
      - aiohttp==3.13.3
      - aiosignal==1.4.0
      - annotated-doc==0.0.4
      - annotated-types==0.7.0
      - antlr4-python3-runtime==4.9.3
      - anyio==4.12.1
      - asciitree==0.3.3
      - asttokens==3.0.1
      - async-timeout==5.0.1
      - attrs==25.4.0
      - av==15.1.0
      - brotli==1.2.0
      - charset-normalizer==3.4.4
      - cmake==4.1.3
      - cmeel==0.58.0
      - cmeel-assimp==5.4.3.1
      - cmeel-boost==1.87.0.1
      - cmeel-console-bridge==1.0.2.3
      - cmeel-octomap==1.10.0
      - cmeel-qhull==8.0.2.1
      - cmeel-tinyxml==2.6.2.3
      - cmeel-tinyxml2==10.0.0
      - cmeel-urdfdom==3.1.1.1
      - cmeel-zlib==1.3.1
      - coal==3.0.2
      - coal-library==3.0.1
      - colorama==0.4.6
      - datasets==4.5.0
      - decorator==5.2.1
      - deepdiff==8.6.1
      - dill==0.4.0
      - docstring_parser==0.17.0
      - draccus==0.10.0
      - eigenpy==3.10.3
      - etils==1.7.0
      - evdev==1.9.2
      - exceptiongroup==1.3.1
      - executing==2.2.1
      - fastapi==0.128.0
      - fasteners==0.20
      - ffmpy==1.0.0
      - filelock==3.20.3
      - frozenlist==1.8.0
      - fsspec==2025.10.0
      - gitdb==4.0.12
      - glfw==2.7.0
      - gradio==6.3.0
      - gradio_client==2.0.3
      - groovy==0.1.2
      - gymnasium==1.2.3
      - h11==0.16.0
      - h5py==3.15.1
      - hf-xet==1.2.0
      - hf_transfer==0.1.9
      - httpcore==1.0.9
      - httpx==0.28.1
      - huggingface_hub==1.3.2
      - imageio==2.35.1
      - imageio-ffmpeg==0.6.0
      - importlib_metadata==8.7.1
      - importlib_resources==6.5.2
      - inquirerpy==0.3.4
      - ipython==8.38.0
      - jedi==0.19.2
      - jsonargparse==4.45.0
      - jsonlines==4.0.0
      - kiwisolver==1.4.5
      - lerobot==0.4.2
      - libcoal==3.0.2
      - libpinocchio==3.8.0
      - lightning==2.5.0.post0
      - lightning-utilities==0.15.2
      - lxml==5.3.0
      - markdown-it-py==4.0.0
      - matplotlib-inline==0.2.1
      - mdurl==0.1.2
      - mergedeep==1.3.4
      - mpmath==1.3.0
      - mujoco==3.2.2
      - mujoco-python-viewer==0.1.4
      - multidict==6.7.0
      - multiprocess==0.70.18
      - mypy_extensions==1.1.0
      - networkx==3.4.2
      - numcodecs==0.13.1
      - numpy==2.2.6
      - opencv-contrib-python==4.10.0.84
      - opencv-python==4.13.0.90
      - orderly-set==5.5.0
      - orjson==3.11.5
      - packaging==24.2
      - pandas==2.3.3
      - parso==0.8.5
      - pexpect==4.9.0
      - pfzy==0.3.4
      - pillow==12.1.0
      - pin==3.3.1
      - platformdirs==4.5.1
      - prompt_toolkit==3.0.52
      - propcache==0.4.1
      - protobuf==6.33.4
      - proxsuite==0.7.2
      - psutil==7.2.1
      - ptyprocess==0.7.0
      - pure_eval==0.2.3
      - pyarrow==22.0.0
      - pydantic==2.12.5
      - pydantic_core==2.41.5
      - pydub==0.25.1
      - pynput==1.8.1
      - pyquaternion==0.9.9
      - pyserial==3.5
      - python-dateutil==2.9.0.post0
      - python-multipart==0.0.21
      - python-xlib==0.33
      - pytorch-lightning==2.6.0
      - pyyaml-include==1.4.1
      - qwen-vl-utils==0.0.14
      - regex==2026.1.15
      - requests==2.32.5
      - rerun-sdk==0.26.2
      - rich==13.9.4
      - ruckig==0.9.2
      - safehttpx==0.1.7
      - safetensors==0.7.0
      - scipy==1.14.1
      - semantic-version==2.10.0
      - sentry-sdk==2.49.0
      - shellingham==1.5.4
      - smmap==5.0.2
      - stack-data==0.6.3
      - starlette==0.50.0
      - sympy==1.13.1
      - swanlab==0.7.13
      - termcolor==3.3.0
      - timm==1.0.24
      - toml==0.10.2
      - tomli==2.4.0
      - tomlkit==0.13.3
      - torchcodec==0.5
      - torchmetrics==1.8.2
      - tqdm==4.67.1
      - traitlets==5.14.3
      - typer==0.21.1
      - typer-slim==0.21.1
      - typeshed_client==2.8.2
      - typing-inspect==0.9.0
      - typing-inspection==0.4.2
      - typing_extensions==4.15.0
      - tzdata==2025.3
      - urdf_parser_py==0.0.4
      - urllib3==2.6.3
      - uv==0.9.28
      - uvicorn==0.40.0
      - wandb==0.24.0
      - wcwidth==0.2.14
      - xxhash==3.6.0
      - yarl==1.22.0
      - zarr==2.18.3
      - zipp==3.20.1
324 generate_dataset_videos.py Normal file
@@ -0,0 +1,324 @@
#!/usr/bin/env python3
"""
Convert HDF5 datasets into videos for visual inspection.

Features:
1. Convert a single episode into a video
2. Compare videos of multiple episodes
3. Slow down playback for easier inspection
"""
import os
import h5py
import glob
import cv2
import numpy as np


def episode_to_video(episode_file, output_path, camera='top', fps=30, slow_factor=1):
    """
    Convert a single episode into a video.

    Args:
        episode_file: path to the HDF5 file
        output_path: output video path
        camera: name of the camera to use
        fps: frame rate
        slow_factor: slow-down factor (1 = normal speed, 2 = half speed)
    """
    try:
        with h5py.File(episode_file, 'r') as f:
            # Read the image sequence
            img_path = f'/observations/images/{camera}'

            if img_path not in f:
                print(f"  ❌ Camera {camera} does not exist")
                return False

            images = f[img_path][:]  # shape: (T, H, W, C)
            qpos = f['/observations/qpos'][:]
            actions = f['/action'][:]

            total_frames = len(images)
            height, width = images.shape[1], images.shape[2]

            # Create the video writer
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            actual_fps = fps // slow_factor
            out = cv2.VideoWriter(output_path, fourcc, actual_fps, (width, height))

            # Write frame by frame
            for i in range(total_frames):
                frame = images[i].astype(np.uint8)

                # Overlay info text on the frame
                info_text = [
                    f"Episode: {os.path.basename(episode_file).replace('.hdf5', '')}",
                    f"Frame: {i}/{total_frames}",
                    f"qpos[0:3]: [{qpos[i, 0]:.2f}, {qpos[i, 1]:.2f}, {qpos[i, 2]:.2f}]",
                ]

                for j, text in enumerate(info_text):
                    cv2.putText(frame, text, (10, 30 + j*30),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

                out.write(frame)

            out.release()
            print(f"  ✅ Saved: {output_path}")
            print(f"  Frames: {total_frames}, size: {width}x{height}, FPS: {actual_fps}")
            return True

    except Exception as e:
        print(f"  ❌ Error: {e}")
        return False


def generate_all_videos(camera='top', num_episodes=5, slow_factor=1):
    """Generate videos for the first N episodes."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))

    if len(episode_files) == 0:
        print(f"❌ No data files found in: {dataset_dir}")
        return

    # Create the output directory
    output_dir = '/tmp/dataset_videos'
    os.makedirs(output_dir, exist_ok=True)

    print(f"Found {len(episode_files)} episode files")
    print(f"Generating videos for the first {min(num_episodes, len(episode_files))} episodes\n")

    # Generate the videos
    for i in range(min(num_episodes, len(episode_files))):
        ep_file = episode_files[i]
        ep_name = os.path.basename(ep_file).replace('.hdf5', '')
        output_path = f"{output_dir}/{ep_name}_{camera}.mp4"

        print(f"[{i+1}/{min(num_episodes, len(episode_files))}] {ep_name}")
        episode_to_video(ep_file, output_path, camera=camera, slow_factor=slow_factor)
        print()

    print(f"✅ All videos saved to: {output_dir}")
    print("\nHow to play them:")
    print("  # Play the videos")
    print(f"  vlc {output_dir}/*.mp4")
    print("  ")
    print("  # Or use a file manager")
    print(f"  nautilus {output_dir}")


def generate_multi_camera_video(episode_idx=0, slow_factor=1):
    """Generate a video with all cameras tiled in a split-screen layout."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))

    if episode_idx >= len(episode_files):
        print(f"❌ Episode {episode_idx} does not exist")
        return

    ep_file = episode_files[episode_idx]

    try:
        with h5py.File(ep_file, 'r') as f:
            # Collect all cameras: images live under /observations/images,
            # so read that group directly (scanning f.keys() at the root
            # would never find them)
            images_group = f.get('/observations/images')
            cameras = list(images_group.keys()) if images_group is not None else []

            print(f"Cameras in episode {episode_idx}: {cameras}")

            # Read the images of every camera
            all_images = {}
            for cam in cameras:
                img_path = f'/observations/images/{cam}'
                if img_path in f:
                    all_images[cam] = f[img_path][:]

            if not all_images:
                print("❌ No image data found")
                return

            # Take the size from the first camera
            first_cam = list(all_images.keys())[0]
            total_frames = len(all_images[first_cam])
            height, width = all_images[first_cam].shape[1], all_images[first_cam].shape[2]

            # Build the multi-camera layout
            num_cams = len(all_images)
            cols = min(2, num_cams)
            rows = (num_cams + cols - 1) // cols

            canvas_width = width * cols
            canvas_height = height * rows

            # Create the video writer (make sure the output directory exists)
            output_path = f'/tmp/dataset_videos/episode_{episode_idx}_all_cameras.mp4'
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            out = cv2.VideoWriter(output_path, fourcc, 30 // slow_factor, (canvas_width, canvas_height))

            # Compose frame by frame
            for i in range(total_frames):
                canvas = np.zeros((canvas_height, canvas_width, 3), dtype=np.uint8)

                for cam_idx, cam_name in enumerate(all_images.keys()):
                    img = all_images[cam_name][i]

                    # Compute the tile position on the canvas
                    row = cam_idx // cols
                    col = cam_idx % cols
                    y_start = row * height
                    y_end = y_start + height
                    x_start = col * width
                    x_end = x_start + width

                    # Resize if needed
                    if img.shape[:2] != (height, width):
                        img = cv2.resize(img, (width, height))

                    # Paste onto the canvas
                    canvas[y_start:y_end, x_start:x_end] = img

                    # Add the camera name
                    cv2.putText(canvas, cam_name, (x_start + 10, y_start + 30),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)

                # Add the frame counter
                cv2.putText(canvas, f"Frame: {i}/{total_frames}", (10, canvas_height - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

                out.write(canvas)

            out.release()
            print(f"✅ Saved multi-camera video: {output_path}")

    except Exception as e:
        print(f"❌ Error: {e}")


def compare_episodes(camera='top', slow_factor=2):
    """Compare videos of several episodes side by side."""

    dataset_dir = "roboimi/demos/dataset/sim_transfer"
    episode_files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.hdf5")))

    # Episodes to compare
    episodes_to_compare = [0, 1, 2, 3, 4]  # compare the first 5

    print(f"Comparing episodes: {episodes_to_compare}")

    # Read the data of every episode
    all_data = []
    for ep_idx in episodes_to_compare:
        if ep_idx >= len(episode_files):
            continue

        try:
            with h5py.File(episode_files[ep_idx], 'r') as f:
                img_path = f'/observations/images/{camera}'
                if img_path in f:
                    all_data.append({
                        'idx': ep_idx,
                        'images': f[img_path][:],
                        'qpos': f['/observations/qpos'][:]
                    })
        except (OSError, KeyError) as e:
            # Skip unreadable or incomplete episode files instead of failing silently
            print(f"  ⚠️ Skipping episode {ep_idx}: {e}")

    if len(all_data) == 0:
        print("❌ No data")
        return

    # Gather parameters
    first_data = all_data[0]
    height, width = first_data['images'].shape[1], first_data['images'].shape[2]
    total_frames = min([d['images'].shape[0] for d in all_data])

    # Build the side-by-side layout
    num_compare = len(all_data)
    canvas_width = width * num_compare
    canvas_height = height

    # Create the video
    output_path = f'/tmp/dataset_videos/compare_{camera}.mp4'
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, 30 // slow_factor, (canvas_width, canvas_height))

    print(f"Generating comparison video, {total_frames} frames in total...")

    # Compare frame by frame
    for i in range(total_frames):
        canvas = np.zeros((canvas_height, canvas_width, 3), dtype=np.uint8)

        for j, data in enumerate(all_data):
            img = data['images'][i]
            qpos = data['qpos'][i]

            # Resize if needed
            if img.shape[:2] != (height, width):
                img = cv2.resize(img, (width, height))

            # Paste onto the canvas
            x_start = j * width
            x_end = x_start + width
            canvas[:, x_start:x_end] = img

            # Add info text
            ep_name = f"Ep {data['idx']}"
            cv2.putText(canvas, ep_name, (x_start + 10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
            cv2.putText(canvas, f"qpos[0:3]: [{qpos[0]:.2f}, {qpos[1]:.2f}, {qpos[2]:.2f}]",
                        (x_start + 10, height - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

        # Add the frame number
        cv2.putText(canvas, f"Frame: {i}/{total_frames}", (10, canvas_height - 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)

        out.write(canvas)

        if i % 100 == 0:
            print(f"  progress: {i}/{total_frames}")

    out.release()
    print(f"✅ Saved comparison video: {output_path}")


if __name__ == "__main__":
    import sys

    print("="*60)
    print("Dataset video generation tool")
    print("="*60)

    if len(sys.argv) > 1:
        command = sys.argv[1]

        if command == 'compare':
            # Compare several episodes
            camera = sys.argv[2] if len(sys.argv) > 2 else 'top'
            compare_episodes(camera=camera, slow_factor=2)

        elif command == 'multi':
            # Multi-camera video
            ep_idx = int(sys.argv[2]) if len(sys.argv) > 2 else 0
            generate_multi_camera_video(episode_idx=ep_idx, slow_factor=1)

        else:
            print("Unknown command")
    else:
        # Default: generate videos for the first 5 episodes
        print("\nGenerating videos for the first 5 episodes (top camera, 2x slow motion)...")
        print("="*60 + "\n")
        generate_all_videos(camera='top', num_episodes=5, slow_factor=2)

        print("\n" + "="*60)
        print("Other usage:")
        print("  python generate_dataset_videos.py compare top  # compare several episodes")
        print("  python generate_dataset_videos.py multi 0      # multi-camera video")
        print("="*60)
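For a quick programmatic check without the CLI, the functions can also be imported directly. A minimal sketch, assuming the hard-coded `sim_transfer` dataset path above exists:

```python
# Sketch: render episode 0 of the sim_transfer dataset at half speed.
from generate_dataset_videos import episode_to_video

ok = episode_to_video(
    "roboimi/demos/dataset/sim_transfer/episode_0.hdf5",
    "/tmp/episode_0_top.mp4",
    camera='top', fps=30, slow_factor=2,
)
print("wrote video" if ok else "failed")
```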
125 gr00t/main.py Normal file
@@ -0,0 +1,125 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
GR00T (diffusion-based DiT policy) model builder.

This module provides functions to build GR00T models and optimizers
from configuration dictionaries (typically from config.yaml's 'gr00t:' section).
"""
import argparse

import torch

from .models import build_gr00t_model


def get_args_parser():
    """
    Create the argument parser for GR00T model configuration.

    All parameters can be overridden via the args_override dictionary in
    build_gr00t_model_and_optimizer(). This allows loading from config.yaml.
    """
    parser = argparse.ArgumentParser('GR00T training and evaluation script', add_help=False)

    # Training parameters
    parser.add_argument('--lr', default=1e-5, type=float,
                        help='Learning rate for main parameters')
    parser.add_argument('--lr_backbone', default=1e-5, type=float,
                        help='Learning rate for backbone parameters')
    parser.add_argument('--weight_decay', default=1e-4, type=float,
                        help='Weight decay for optimizer')

    # GR00T model architecture parameters
    parser.add_argument('--embed_dim', default=1536, type=int,
                        help='Embedding dimension for transformer')
    parser.add_argument('--hidden_dim', default=1024, type=int,
                        help='Hidden dimension for MLP layers')
    parser.add_argument('--state_dim', default=16, type=int,
                        help='State (qpos) dimension')
    parser.add_argument('--action_dim', default=16, type=int,
                        help='Action dimension')
    parser.add_argument('--num_queries', default=16, type=int,
                        help='Number of action queries (chunk size)')

    # DiT (Diffusion Transformer) parameters
    parser.add_argument('--num_layers', default=16, type=int,
                        help='Number of transformer layers')
    parser.add_argument('--nheads', default=32, type=int,
                        help='Number of attention heads')
    parser.add_argument('--mlp_ratio', default=4, type=int,
                        help='MLP hidden dimension ratio')
    parser.add_argument('--dropout', default=0.2, type=float,
                        help='Dropout rate')

    # Backbone parameters
    parser.add_argument('--backbone', default='dino_v2', type=str,
                        help='Backbone architecture (dino_v2, resnet18, resnet34)')
    parser.add_argument('--position_embedding', default='sine', type=str,
                        choices=('sine', 'learned'),
                        help='Type of positional encoding')

    # Camera configuration
    parser.add_argument('--camera_names', default=[], nargs='+',
                        help='List of camera names for observations')

    # Other parameters (not directly used but kept for compatibility)
    parser.add_argument('--batch_size', default=15, type=int)
    parser.add_argument('--epochs', default=20000, type=int)
    parser.add_argument('--masks', action='store_true',
                        help='Use intermediate layer features')
    parser.add_argument('--dilation', action='store_false',
                        help='Use dilated convolution in backbone')

    return parser


def build_gr00t_model_and_optimizer(args_override):
    """
    Build the GR00T model and optimizer from a config dictionary.

    This function is designed to work with config.yaml loading:
    1. Parse default arguments
    2. Override with values from args_override (typically from config['gr00t'])
    3. Build model and optimizer

    Args:
        args_override: Dictionary of config values, typically from config.yaml's 'gr00t:' section.
            Expected keys: embed_dim, hidden_dim, state_dim, action_dim,
            num_queries, nheads, mlp_ratio, dropout, num_layers,
            lr, lr_backbone, camera_names, backbone, etc.

    Returns:
        model: GR00T model on CUDA
        optimizer: AdamW optimizer with separate learning rates for backbone and other params
    """
    parser = argparse.ArgumentParser('GR00T training and evaluation script',
                                     parents=[get_args_parser()])
    # Parse defaults only (empty argv): every real value comes from args_override,
    # and parsing sys.argv here would clash with the caller's own CLI flags.
    args = parser.parse_args([])

    # Override with config values
    for k, v in args_override.items():
        setattr(args, k, v)

    # Build model
    model = build_gr00t_model(args)
    model.cuda()

    # Create parameter groups with different learning rates
    param_dicts = [
        {
            "params": [p for n, p in model.named_parameters()
                       if "backbone" not in n and p.requires_grad]
        },
        {
            "params": [p for n, p in model.named_parameters()
                       if "backbone" in n and p.requires_grad],
            "lr": args.lr_backbone,
        },
    ]

    optimizer = torch.optim.AdamW(param_dicts,
                                  lr=args.lr,
                                  weight_decay=args.weight_decay)

    return model, optimizer
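A minimal sketch of driving this builder from the YAML config. The `yaml.safe_load` call and the merging of the top-level `camera_names` key are assumptions about the surrounding training script, which is not part of this diff:

```python
# Sketch only: the real training entry point is not shown in this diff.
# The import path follows the one used in gr00t/policy.py below.
import yaml
from roboimi.gr00t.main import build_gr00t_model_and_optimizer

with open('config.yaml') as fh:
    config = yaml.safe_load(fh)

gr00t_cfg = dict(config['gr00t'])
# camera_names sits at the top level of config.yaml, so merge it in explicitly
gr00t_cfg['camera_names'] = config.get('camera_names') or ['top']

model, optimizer = build_gr00t_model_and_optimizer(gr00t_cfg)
```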
3 gr00t/models/__init__.py Normal file
@@ -0,0 +1,3 @@
from .gr00t import build_gr00t_model

__all__ = ['build_gr00t_model']
142 gr00t/models/dit.py Normal file
@@ -0,0 +1,142 @@
from typing import Optional

from diffusers.models.embeddings import TimestepEmbedding, Timesteps
import torch
from torch import nn
import torch.nn.functional as F


class TimestepEncoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        embedding_dim = args.embed_dim
        self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=1)
        self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)

    def forward(self, timesteps):
        dtype = next(self.parameters()).dtype
        timesteps_proj = self.time_proj(timesteps).to(dtype)
        timesteps_emb = self.timestep_embedder(timesteps_proj)  # (N, D)
        return timesteps_emb


class AdaLayerNorm(nn.Module):
    def __init__(self, embedding_dim, norm_eps=1e-5, norm_elementwise_affine=False):
        super().__init__()

        output_dim = embedding_dim * 2
        self.silu = nn.SiLU()
        self.linear = nn.Linear(embedding_dim, output_dim)
        self.norm = nn.LayerNorm(output_dim // 2, norm_eps, norm_elementwise_affine)

    def forward(
        self,
        x: torch.Tensor,
        temb: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        temb = self.linear(self.silu(temb))
        scale, shift = temb.chunk(2, dim=1)
        x = self.norm(x) * (1 + scale[:, None]) + shift[:, None]
        return x


class BasicTransformerBlock(nn.Module):
    def __init__(self, args, cross_attention_dim, use_self_attn=False):
        super().__init__()
        dim = args.embed_dim
        num_heads = args.nheads
        mlp_ratio = args.mlp_ratio
        dropout = args.dropout
        self.norm1 = AdaLayerNorm(dim)

        if not use_self_attn:
            self.attn = nn.MultiheadAttention(
                embed_dim=dim,
                num_heads=num_heads,
                dropout=dropout,
                kdim=cross_attention_dim,
                vdim=cross_attention_dim,
                batch_first=True,
            )
        else:
            self.attn = nn.MultiheadAttention(
                embed_dim=dim,
                num_heads=num_heads,
                dropout=dropout,
                batch_first=True,
            )

        self.norm2 = nn.LayerNorm(dim, eps=1e-5, elementwise_affine=False)

        # mlp_ratio may arrive as a float from the config; nn.Linear needs int sizes
        mlp_dim = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, hidden_states, temb, context=None):
        norm_hidden_states = self.norm1(hidden_states, temb)

        attn_output = self.attn(
            norm_hidden_states,
            context if context is not None else norm_hidden_states,
            context if context is not None else norm_hidden_states,
        )[0]

        hidden_states = attn_output + hidden_states

        norm_hidden_states = self.norm2(hidden_states)

        ff_output = self.mlp(norm_hidden_states)

        hidden_states = ff_output + hidden_states

        return hidden_states


class DiT(nn.Module):
    def __init__(self, args, cross_attention_dim):
        super().__init__()
        inner_dim = args.embed_dim
        num_layers = args.num_layers
        output_dim = args.hidden_dim

        self.timestep_encoder = TimestepEncoder(args)

        # Alternate cross-attention (even layers) and self-attention (odd layers)
        all_blocks = []
        for idx in range(num_layers):
            use_self_attn = idx % 2 == 1
            if use_self_attn:
                block = BasicTransformerBlock(args, cross_attention_dim=None, use_self_attn=True)
            else:
                block = BasicTransformerBlock(args, cross_attention_dim=cross_attention_dim, use_self_attn=False)
            all_blocks.append(block)

        self.transformer_blocks = nn.ModuleList(all_blocks)

        self.norm_out = nn.LayerNorm(inner_dim, eps=1e-6, elementwise_affine=False)
        self.proj_out_1 = nn.Linear(inner_dim, 2 * inner_dim)
        self.proj_out_2 = nn.Linear(inner_dim, output_dim)

    def forward(self, hidden_states, timestep, encoder_hidden_states):
        temb = self.timestep_encoder(timestep)

        hidden_states = hidden_states.contiguous()
        encoder_hidden_states = encoder_hidden_states.contiguous()

        for idx, block in enumerate(self.transformer_blocks):
            if idx % 2 == 1:
                hidden_states = block(hidden_states, temb)
            else:
                hidden_states = block(hidden_states, temb, context=encoder_hidden_states)

        conditioning = temb
        shift, scale = self.proj_out_1(F.silu(conditioning)).chunk(2, dim=1)
        hidden_states = self.norm_out(hidden_states) * (1 + scale[:, None]) + shift[:, None]
        return self.proj_out_2(hidden_states)


def build_dit(args, cross_attention_dim):
    return DiT(args, cross_attention_dim)
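A minimal shape check for the DiT above. The toy dimensions are assumptions chosen for a quick CPU run, not values used anywhere in this repo:

```python
# Smoke test: state/action tokens cross-attend to image tokens, conditioned on t.
from types import SimpleNamespace
import torch
from gr00t.models.dit import DiT

args = SimpleNamespace(embed_dim=64, nheads=4, mlp_ratio=4, dropout=0.0,
                       num_layers=2, hidden_dim=32)
dit = DiT(args, cross_attention_dim=48)

B, S, L = 2, 9, 100                 # batch, state+action tokens, image tokens
tokens = torch.randn(B, S, 64)      # state/action embeddings
context = torch.randn(B, L, 48)     # flattened backbone features
t = torch.randint(0, 1000, (B,))    # discretized flow-matching time

out = dit(tokens, t, context)
assert out.shape == (B, S, 32)      # projected down to hidden_dim
```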
124 gr00t/models/gr00t.py Normal file
@@ -0,0 +1,124 @@
from .modules import (
    build_action_decoder,
    build_action_encoder,
    build_state_encoder,
    build_time_sampler,
    build_noise_scheduler,
)
from .backbone import build_backbone
from .dit import build_dit
import torch
import torch.nn as nn
import torch.nn.functional as F


class gr00t(nn.Module):
    def __init__(
        self,
        backbones,
        dit,
        state_encoder,
        action_encoder,
        action_decoder,
        time_sampler,
        noise_scheduler,
        num_queries,
        camera_names,
    ):
        super().__init__()
        self.num_queries = num_queries
        self.camera_names = camera_names
        self.dit = dit
        self.state_encoder = state_encoder
        self.action_encoder = action_encoder
        self.action_decoder = action_decoder
        self.time_sampler = time_sampler
        self.noise_scheduler = noise_scheduler

        if backbones is not None:
            self.backbones = nn.ModuleList(backbones)
        else:
            raise NotImplementedError

    def forward(self, qpos, image, actions=None, is_pad=None):
        is_training = actions is not None  # train or val
        bs, _ = qpos.shape

        # One backbone per camera; flatten spatial features into token sequences
        all_cam_features = []
        for cam_id, cam_name in enumerate(self.camera_names):
            features, pos = self.backbones[cam_id](image[:, cam_id])
            features = features[0]  # take the last layer feature
            B, C, H, W = features.shape
            features_seq = features.permute(0, 2, 3, 1).reshape(B, H * W, C)
            all_cam_features.append(features_seq)
        encoder_hidden_states = torch.cat(all_cam_features, dim=1)

        state_features = self.state_encoder(qpos)  # [B, 1, emb_dim]

        if is_training:
            # Training: noise the actions at a sampled flow time and regress velocity
            timesteps = self.time_sampler(bs, actions.device, actions.dtype)
            noisy_actions, target_velocity = self.noise_scheduler.add_noise(
                actions, timesteps
            )
            t_discretized = (timesteps[:, 0, 0] * 1000).long()
            action_features = self.action_encoder(noisy_actions, t_discretized)
            sa_embs = torch.cat((state_features, action_features), dim=1)
            model_output = self.dit(sa_embs, t_discretized, encoder_hidden_states)
            pred = self.action_decoder(model_output)
            pred_actions = pred[:, -actions.shape[1]:]
            action_loss = F.mse_loss(pred_actions, target_velocity, reduction='none')
            return pred_actions, action_loss
        else:
            # Inference: integrate the predicted velocity field with k Euler steps
            actions = torch.randn(bs, self.num_queries, qpos.shape[-1], device=qpos.device, dtype=qpos.dtype)
            k = 5
            dt = 1.0 / k
            for t in range(k):
                t_cont = t / float(k)
                t_discretized = int(t_cont * 1000)
                # Tensor of shape [B] for the DiT (consistent with the training path)
                timesteps = torch.full((bs,), t_discretized, device=qpos.device, dtype=qpos.dtype)
                action_features = self.action_encoder(actions, timesteps)
                sa_embs = torch.cat((state_features, action_features), dim=1)
                model_output = self.dit(sa_embs, timesteps, encoder_hidden_states)
                pred = self.action_decoder(model_output)
                pred_velocity = pred[:, -self.num_queries:]
                actions = actions + pred_velocity * dt
            return actions, None  # was `return actions, _`, which raised a NameError


def build_gr00t_model(args):
    state_dim = args.state_dim
    action_dim = args.action_dim

    backbones = []
    for _ in args.camera_names:
        backbone = build_backbone(args)
        backbones.append(backbone)

    cross_attention_dim = backbones[0].num_channels

    dit = build_dit(args, cross_attention_dim)

    state_encoder = build_state_encoder(args)
    action_encoder = build_action_encoder(args)
    action_decoder = build_action_decoder(args)
    time_sampler = build_time_sampler(args)
    noise_scheduler = build_noise_scheduler(args)
    model = gr00t(
        backbones,
        dit,
        state_encoder,
        action_encoder,
        action_decoder,
        time_sampler,
        noise_scheduler,
        args.num_queries,
        args.camera_names,
    )

    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("number of parameters: %.2fM" % (n_parameters / 1e6,))
    return model
179 gr00t/models/modules.py Normal file
@@ -0,0 +1,179 @@
import torch
import torch.nn as nn
import torch.nn.functional as F


# ActionEncoder
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, args):
        super().__init__()
        self.embed_dim = args.embed_dim

    def forward(self, timesteps):
        timesteps = timesteps.float()
        B, T = timesteps.shape
        device = timesteps.device

        half_dim = self.embed_dim // 2

        exponent = -torch.arange(half_dim, dtype=torch.float, device=device) * (
            torch.log(torch.tensor(10000.0)) / half_dim
        )

        freqs = timesteps.unsqueeze(-1) * exponent.exp()

        sin = torch.sin(freqs)
        cos = torch.cos(freqs)
        enc = torch.cat([sin, cos], dim=-1)  # (B, T, w)

        return enc


class ActionEncoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        action_dim = args.action_dim
        embed_dim = args.embed_dim

        self.W1 = nn.Linear(action_dim, embed_dim)
        self.W2 = nn.Linear(2 * embed_dim, embed_dim)
        self.W3 = nn.Linear(embed_dim, embed_dim)

        self.pos_encoder = SinusoidalPositionalEncoding(args)

    def forward(self, actions, timesteps):
        B, T, _ = actions.shape

        # 1) Expand each batch's single scalar time 'tau' across all T steps
        #    so that shape => (B, T). Callers must pass timesteps of shape (B,).
        if timesteps.dim() == 1 and timesteps.shape[0] == B:
            timesteps = timesteps.unsqueeze(1).expand(-1, T)
        else:
            raise ValueError(
                "Expected `timesteps` to have shape (B,) so we can replicate across T."
            )

        # 2) Standard action MLP step for shape => (B, T, w)
        a_emb = self.W1(actions)

        # 3) Get the sinusoidal encoding (B, T, w)
        tau_emb = self.pos_encoder(timesteps).to(dtype=a_emb.dtype)

        # 4) Concat along last dim => (B, T, 2w), then W2 => (B, T, w), swish
        x = torch.cat([a_emb, tau_emb], dim=-1)
        x = F.silu(self.W2(x))

        # 5) Finally W3 => (B, T, w)
        x = self.W3(x)

        return x


def build_action_encoder(args):
    return ActionEncoder(args)


# StateEncoder
class StateEncoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        input_dim = args.state_dim
        hidden_dim = args.hidden_dim
        output_dim = args.embed_dim

        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, states):
        state_emb = self.mlp(states)  # [B, emb_dim]
        state_emb = state_emb.unsqueeze(1)
        return state_emb  # [B, 1, emb_dim]


def build_state_encoder(args):
    return StateEncoder(args)


# ActionDecoder
class ActionDecoder(nn.Module):
    def __init__(self, args):
        super().__init__()
        input_dim = args.hidden_dim
        hidden_dim = args.hidden_dim
        output_dim = args.action_dim

        self.num_queries = args.num_queries

        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, model_output):
        pred_actions = self.mlp(model_output)
        return pred_actions[:, -self.num_queries:]


def build_action_decoder(args):
    return ActionDecoder(args)


# TimeSampler
class TimeSampler(nn.Module):
    def __init__(self, noise_s=0.999, noise_beta_alpha=1.5, noise_beta_beta=1.0):
        super().__init__()
        self.noise_s = noise_s
        self.beta_dist = torch.distributions.Beta(noise_beta_alpha, noise_beta_beta)

    def forward(self, batch_size, device, dtype):
        sample = self.beta_dist.sample([batch_size]).to(device, dtype=dtype)
        sample = (1 - sample) * self.noise_s
        return sample[:, None, None]


def build_time_sampler(args):
    return TimeSampler()


# NoiseScheduler
class FlowMatchingScheduler(nn.Module):
    def __init__(self):
        super().__init__()

    # --- Training logic: add noise and compute the regression target ---
    def add_noise(self, actions, timesteps):
        noise = torch.randn_like(actions)
        noisy_samples = actions * timesteps + noise * (1 - timesteps)
        target_velocity = actions - noise

        return noisy_samples, target_velocity

    # --- Inference logic: Euler step ---
    def step(self, model_output, sample, dt):
        prev_sample = sample + model_output * dt
        return prev_sample


def build_noise_scheduler(args):
    return FlowMatchingScheduler()
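The convention these pieces share: t = 1 is clean data, t = 0 is pure noise, and the training target is the constant velocity `actions - noise` along the straight interpolation line. A self-contained check of that roundtrip (a sketch, not repo code):

```python
# With the *true* velocity (actions - noise), k Euler steps of the scheduler
# integrate exactly from pure noise (t=0) back to the actions (t=1).
import torch
from gr00t.models.modules import FlowMatchingScheduler

sched = FlowMatchingScheduler()
actions = torch.randn(4, 8, 16)              # [B, chunk, action_dim]
noise = torch.randn_like(actions)
velocity = actions - noise                   # the training target

x, k = noise.clone(), 5
for _ in range(k):
    x = sched.step(velocity, x, dt=1.0 / k)  # x <- x + v * dt

assert torch.allclose(x, actions, atol=1e-5)

# add_noise interpolates along the same straight line (with its own fresh
# noise): at time t, noisy = actions * t + noise * (1 - t).
t = torch.full((4, 1, 1), 0.3)
noisy, target = sched.add_noise(actions, t)
```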
90 gr00t/policy.py Normal file
@@ -0,0 +1,90 @@
"""
|
||||
GR00T Policy wrapper for imitation learning.
|
||||
|
||||
This module provides the gr00tPolicy class that wraps the GR00T model
|
||||
for training and evaluation in the imitation learning framework.
|
||||
"""
|
||||
import torch.nn as nn
|
||||
from torch.nn import functional as F
|
||||
from torchvision.transforms import v2
|
||||
import torch
|
||||
from roboimi.gr00t.main import build_gr00t_model_and_optimizer
|
||||
|
||||
|
||||
class gr00tPolicy(nn.Module):
|
||||
"""
|
||||
GR00T Policy for action prediction using diffusion-based DiT architecture.
|
||||
|
||||
This policy wraps the GR00T model and handles:
|
||||
- Image resizing to match DINOv2 patch size requirements
|
||||
- Image normalization (ImageNet stats)
|
||||
- Training with action chunks and loss computation
|
||||
- Inference with diffusion sampling
|
||||
"""
|
||||
def __init__(self, args_override):
|
||||
super().__init__()
|
||||
model, optimizer = build_gr00t_model_and_optimizer(args_override)
|
||||
self.model = model
|
||||
self.optimizer = optimizer
|
||||
|
||||
# DINOv2 requires image dimensions to be multiples of patch size (14)
|
||||
# Common sizes: 224x224, 336x336, etc. (14*16=224, 14*24=336)
|
||||
self.patch_h = 16 # Number of patches vertically
|
||||
self.patch_w = 22 # Number of patches horizontally
|
||||
target_size = (self.patch_h * 14, self.patch_w * 14) # (224, 308)
|
||||
|
||||
# Training transform with data augmentation
|
||||
self.train_transform = v2.Compose([
|
||||
v2.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
|
||||
v2.RandomPerspective(distortion_scale=0.5),
|
||||
v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
|
||||
v2.GaussianBlur(kernel_size=(9, 9), sigma=(0.1, 2.0)),
|
||||
v2.Resize(target_size),
|
||||
v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
|
||||
])
|
||||
|
||||
# Inference transform (no augmentation)
|
||||
self.inference_transform = v2.Compose([
|
||||
v2.Resize(target_size),
|
||||
v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
|
||||
])
|
||||
|
||||
def __call__(self, qpos, image, actions=None, is_pad=None):
|
||||
"""
|
||||
Forward pass for training or inference.
|
||||
|
||||
Args:
|
||||
qpos: Joint positions [B, state_dim]
|
||||
image: Camera images [B, num_cameras, C, H, W]
|
||||
actions: Ground truth actions [B, chunk_size, action_dim] (training only)
|
||||
is_pad: Padding mask [B, chunk_size] (training only)
|
||||
|
||||
Returns:
|
||||
Training: dict with 'mse' loss
|
||||
Inference: predicted actions [B, num_queries, action_dim]
|
||||
"""
|
||||
# Apply transforms (resize + normalization)
|
||||
if actions is not None: # training time
|
||||
image = self.train_transform(image)
|
||||
else: # inference time
|
||||
image = self.inference_transform(image)
|
||||
|
||||
if actions is not None: # training time
|
||||
actions = actions[:, :self.model.num_queries]
|
||||
is_pad = is_pad[:, :self.model.num_queries]
|
||||
_, action_loss = self.model(qpos, image, actions, is_pad)
|
||||
|
||||
# Mask out padded positions
|
||||
mse_loss = (action_loss * ~is_pad.unsqueeze(-1)).mean()
|
||||
|
||||
loss_dict = {
|
||||
'loss': mse_loss
|
||||
}
|
||||
return loss_dict
|
||||
else: # inference time
|
||||
a_hat, _ = self.model(qpos, image)
|
||||
return a_hat
|
||||
|
||||
def configure_optimizers(self):
|
||||
"""Return the optimizer for training."""
|
||||
return self.optimizer
|
||||
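A sketch of driving the policy at inference time. The tensor shapes follow the docstring above; the config dict mirrors config.yaml's `gr00t:` section, and running it requires a CUDA device plus the DINOv2 backbone weights:

```python
# Sketch only; adjust the import path to wherever gr00tPolicy actually lives.
import torch
from roboimi.gr00t.policy import gr00tPolicy

policy = gr00tPolicy({
    'state_dim': 16, 'action_dim': 16, 'num_queries': 8,
    'embed_dim': 1536, 'hidden_dim': 1024, 'num_layers': 16,
    'nheads': 32, 'mlp_ratio': 4, 'dropout': 0.2,
    'camera_names': ['top'], 'backbone': 'dino_v2',
})

qpos = torch.zeros(1, 16).cuda()
image = torch.rand(1, 1, 3, 256, 256).cuda()  # [B, num_cameras, C, H, W], already /255

with torch.inference_mode():
    actions = policy(qpos, image)              # [1, num_queries, action_dim]
```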
1 roboimi/.gitattributes vendored Normal file
@@ -0,0 +1 @@
*.safetensors filter=lfs diff=lfs merge=lfs -text
0 roboimi/__init__.py Normal file
@@ -3,7 +3,7 @@
 <body name="box" pos="0.2 1.0 0.47">
 <joint name="red_box_joint" type="free" frictionloss="0.01" />
 <inertial pos="0 0 0" mass="0.05" diaginertia="0.002 0.002 0.002" />
-<geom contype="1" conaffinity="1" condim="4" solimp="2 1 0.01" solref="0.01 1" friction="1 0.005 0.0001" pos="0 0 0" size="0.02 0.02 0.02" type="box" name="red_box" rgba="1 0 0 1" />
+<geom contype="1" conaffinity="1" condim="4" solimp="2 1 0.01" solref="0.01 1" friction="1 0.005 0.0001" pos="0 0 0" size="0.018 0.018 0.02" type="box" name="red_box" rgba="1 0 0 1" />
 </body>
 </worldbody>
 </mujoco>
@@ -8,5 +8,6 @@
 </body>
 <camera name="top" pos="0.0 1.0 2.0" fovy="44" mode="targetbody" target="table"/>
 <camera name="angle" pos="0.0 0.0 2.0" fovy="37" mode="targetbody" target="table"/>
+<camera name="front" pos="0 0 0.8" fovy="65" mode="fixed" quat="0.7071 0.7071 0 0"/>
 </worldbody>
 </mujoco>
@@ -1,8 +1,46 @@
 import mujoco
 import numpy as np
+from pathlib import Path
 from roboimi.utils.KDL_utils import KDL_utils
 
 
+def resolve_robot_asset_path(asset_path):
+    if asset_path is None:
+        return None
+
+    raw_path = Path(asset_path).expanduser()
+    if raw_path.is_absolute():
+        return str(raw_path.resolve())
+
+    current_dir = Path(__file__).resolve().parent
+    package_root = current_dir.parents[1]
+    repo_root = current_dir.parents[2]
+
+    candidates = []
+    if raw_path.parts and raw_path.parts[0] == 'roboimi':
+        candidates.append(repo_root / raw_path)
+
+    candidates.extend([
+        current_dir / raw_path,
+        package_root / raw_path,
+        repo_root / raw_path,
+    ])
+
+    normalized_candidates = []
+    seen = set()
+    for candidate in candidates:
+        resolved = candidate.resolve()
+        if resolved not in seen:
+            normalized_candidates.append(resolved)
+            seen.add(resolved)
+
+    for candidate in normalized_candidates:
+        if candidate.exists():
+            return str(candidate)
+
+    return str(normalized_candidates[0])
+
+
 class ArmBase(object):
     def __init__(self,
                  name=None,
@@ -11,8 +49,8 @@ class ArmBase(object):
                  gripper=None
                  ):
         self.name = name
-        self.urdf_path = urdf_path
-        self.xml_path = xml_path
+        self.urdf_path = resolve_robot_asset_path(urdf_path)
+        self.xml_path = resolve_robot_asset_path(xml_path)
         self.gripper = gripper
         self.robot_model = mujoco.MjModel.from_xml_path(filename=self.xml_path, assets=None)
         self.robot_data = mujoco.MjData(self.robot_model)
@@ -58,8 +58,8 @@ class BiDianaMed(ArmBase):
     def __init__(self):
         super().__init__(
             name="Bidiana",
-            urdf_path="./assets/models/manipulators/DianaMed/DualDianaMed.urdf",
-            xml_path="./assets/models/manipulators/DianaMed/bi_diana_transfer_ee.xml",
+            urdf_path="roboimi/assets/models/manipulators/DianaMed/DualDianaMed.urdf",
+            xml_path="roboimi/assets/models/manipulators/DianaMed/bi_diana_transfer_ee.xml",
             gripper=None
         )
         self.left_arm = self.Arm(self, 'single', self.urdf_path)
@@ -8,7 +8,7 @@ temporal_agg: false
 
 # policy_class: "ACT"
 # backbone: 'resnet18'
-policy_class: "ACTTV"
+policy_class: "GR00T"
 backbone: 'dino_v2'
 
 seed: 0
@@ -38,8 +38,13 @@ episode_len: # leave empty here by default
 camera_names: [] # leave empty here by default
 xml_dir: # leave empty here by default
 
+# action smoothing settings (for GR00T)
+use_action_smoothing: true
+smooth_method: "ema" # Options: "ema", "moving_avg", "lowpass", "none"
+smooth_alpha: 0.3 # Smoothing factor (0-1), smaller = smoother
+
 # transformer settings
-batch_size: 15
+batch_size: 10
 state_dim: 16
 action_dim: 16
 lr_backbone: 0.00001
@@ -51,6 +56,21 @@ nheads: 8
 qpos_noise_std: 0
 DT: 0.02
 
+gr00t:
+  action_dim: 16
+  state_dim: 16
+  embed_dim: 1536
+  hidden_dim: 1024
+  num_queries: 8
+
+  nheads: 32
+  mlp_ratio: 4
+  dropout: 0.2
+
+  num_layers: 16
+
+
 # DO NOT CHANGE IF UNNECESSARY
 lr: 0.00001
 kl_weight: 100
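The smoothing implementation behind the keys added above is not part of this diff. As a hedged illustration only, `smooth_method: "ema"` with `smooth_alpha` conventionally means an exponential moving average over successive action commands:

```python
# Hypothetical sketch of EMA action smoothing matching the config keys above;
# the repo's actual smoother may differ.
import numpy as np

class EMASmoother:
    def __init__(self, alpha=0.3):
        self.alpha = alpha   # smaller alpha -> smoother but laggier actions
        self.state = None

    def __call__(self, action):
        action = np.asarray(action, dtype=np.float64)
        if self.state is None:
            self.state = action
        else:
            self.state = self.alpha * action + (1 - self.alpha) * self.state
        return self.state

smoother = EMASmoother(alpha=0.3)
# smoothed = smoother(raw_action)  # call once per control step
```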
@@ -1,119 +0,0 @@
import torch
import os
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from einops import rearrange
from roboimi.utils.utils import set_seed
from roboimi.utils.io_utils import IOUtils
from roboimi.utils.model_interface import ModelInterface
from roboimi.envs.double_pos_ctrl_env import make_sim_env
# from visualize_episodes import save_videos
from roboimi.utils.act_ex_utils import sample_transfer_pose


# should be added into IOUtils
def get_image(obs, camera_names):
    curr_images = []
    for cam_name in camera_names:
        curr_image = rearrange(obs['images'][cam_name], 'h w c -> c h w')
        curr_images.append(curr_image)
    curr_image = np.stack(curr_images, axis=0)
    curr_image = torch.from_numpy(curr_image / 255.0).float().cuda().unsqueeze(0)
    return curr_image


def eval_bc(config, ckpt_name='policy_best.ckpt', save_episode=True):
    set_seed(1)
    model_interface = ModelInterface(config)
    model_interface.setup()
    policy = IOUtils.load_policy(config, ckpt_name)
    stats = IOUtils.load_stats(config['ckpt_dir'])
    num_rollouts = 3
    episode_returns = []
    highest_rewards = []

    run_episode(config, policy, stats,
                save_episode, num_rollouts)
    # episode_return, episode_highest_reward = run_episode(config, policy, stats,
    #                                                      save_episode, num_rollouts)


def run_episode(config, policy, stats, save_episode, num_rollouts):
    if 'sim_transfer' in config['task_name']:
        task_name = 'sim_transfer'  # config['task_name']
        env = make_sim_env(task_name)

    max_timesteps = config['episode_len']
    max_timesteps = int(max_timesteps * 1)
    pre_process = lambda s_qpos: (s_qpos - stats['qpos_mean']) / stats['qpos_std']
    post_process = lambda a: a * stats['action_std'] + stats['action_mean']
    box_pos = sample_transfer_pose()
    for rollout_id in range(num_rollouts):
        print("\nrollout_id===", rollout_id, "\n")
        image_list = []
        rewards = []
        query_frequency = config['policy_config'].get('num_queries', 1)
        print("query_freq =====", query_frequency)
        env.reset(box_pos)
        with torch.inference_mode():
            for t in range(700):
                image_list.append(env._get_image_obs()['images'] if 'images' in env._get_image_obs() else {print("img error")})
                qpos_numpy = np.array(env._get_qpos_obs()['qpos'])
                qpos = pre_process(qpos_numpy)
                qpos = torch.from_numpy(qpos).float().cuda().unsqueeze(0)
                curr_image = get_image(env._get_image_obs(), config['camera_names'])
                if config['policy_class'] == "ACT" or "ACTTV":
                    if t % query_frequency == 0:
                        all_actions = policy(qpos, curr_image)
                    raw_action = all_actions[:, t % query_frequency]
                    # raw_action = all_actions[:, t % 1]
                    raw_action = raw_action.squeeze(0).cpu().numpy()
                elif config['policy_class'] == "CNNMLP":
                    raw_action = policy(qpos, curr_image)
                else:
                    raise NotImplementedError

                action = post_process(raw_action)
                print("action == ", action)
                env.step_jnt(action)
                rewards.append(env.rew)
                env.render()

        rewards = np.array(rewards)
        # episode_return = np.sum(rewards[rewards != None])
        # episode_highest_reward = np.max(rewards)
    # env.viewer.close()

    # del env
    # return episode_return, episode_highest_reward


def test_env():
    try:
        env = make_sim_env('sim_transfer')
        env.reset()
        while True: pass
    except KeyboardInterrupt:
        del env
        print("stop")


if __name__ == '__main__':
    # test_env()
    io_utils = IOUtils()
    config = io_utils.load_config()
    eval_bc(config)
@@ -104,8 +104,8 @@ class TestPickAndTransferPolicy(PolicyBase):
             {"t": 1, "xyz": init_mocap_pose_right[:3], "quat": init_mocap_pose_right[3:], "gripper": -100},  # sleep
             {"t": 75, "xyz": np.array([(0.8+box_xyz[0])*0.5, (1.0+box_xyz[1])*0.5, init_mocap_pose_right[2]]), "quat": gripper_approach_quat.elements, "gripper": 100},
             {"t": 225, "xyz": box_xyz + np.array([0, 0, 0.3]), "quat": gripper_pick_quat.elements, "gripper": 100},  # approach the cube
-            {"t": 275, "xyz": box_xyz + np.array([0, 0, 0.12]), "quat": gripper_pick_quat.elements, "gripper": 100},  # go down
-            {"t": 280, "xyz": box_xyz + np.array([0, 0, 0.12]), "quat": gripper_pick_quat.elements, "gripper": -100},  # close gripper
+            {"t": 275, "xyz": box_xyz + np.array([0, 0, 0.11]), "quat": gripper_pick_quat.elements, "gripper": 100},  # go down
+            {"t": 280, "xyz": box_xyz + np.array([0, 0, 0.11]), "quat": gripper_pick_quat.elements, "gripper": -100},  # close gripper
             {"t": 450, "xyz": init_mocap_pose_right[:3], "quat": init_mocap_pose_right[3:], "gripper": -100},  # approach wait position
             {"t": 500, "xyz": meet_xyz + np.array([0.1, 0, 0.0]), "quat": meet_right_quat.elements, "gripper": -100},  # approach meet position
             {"t": 510, "xyz": meet_xyz + np.array([0.1, 0, 0.0]), "quat": meet_right_quat.elements, "gripper": 100},  # open gripper
@@ -116,8 +116,8 @@ class TestPickAndTransferPolicy(PolicyBase):
         self.left_trajectory = [
             {"t": 1, "xyz": init_mocap_pose_left[:3], "quat": init_mocap_pose_left[3:], "gripper": -100},  # sleep
             {"t": 250, "xyz": meet_xyz + np.array([-0.5, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": 100},  # approach meet position
-            {"t": 500, "xyz": meet_xyz + np.array([-0.15, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": 100},  # move to meet position
-            {"t": 505, "xyz": meet_xyz + np.array([-0.15, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100},  # close gripper
+            {"t": 500, "xyz": meet_xyz + np.array([-0.14, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": 100},  # move to meet position
+            {"t": 505, "xyz": meet_xyz + np.array([-0.14, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100},  # close gripper
             {"t": 675, "xyz": meet_xyz + np.array([-0.3, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100},  # move left
             {"t": 700, "xyz": meet_xyz + np.array([-0.3, 0, 0.0]), "quat": meet_left_quat.elements, "gripper": -100},  # stay
         ]
@@ -1,11 +1,11 @@
 import time
-import os,collections,sys
+import os
 import numpy as np
 import h5py
 from roboimi.envs.double_pos_ctrl_env import make_sim_env
 from diana_policy import TestPickAndTransferPolicy
 import cv2
 from roboimi.utils.act_ex_utils import sample_transfer_pose
+from roboimi.utils.streaming_episode_writer import StreamingEpisodeWriter
 
 import pathlib
 HOME_PATH = str(pathlib.Path(__file__).parent.resolve())
@@ -16,14 +16,12 @@ def main():
     task_name = 'sim_transfer'
     dataset_dir = DATASET_DIR + '/sim_transfer'  # SIM_TASK_CONFIGS[task_name]['dataset_dir']
     num_episodes = 100  # SIM_TASK_CONFIGS[task_name]['num_episodes']
     onscreen_render = None  # config['onscreen_render']
     inject_noise = False
     render_cam_name = 'angle'
 
     episode_len = 700  # SIM_TASK_CONFIGS[task_name]['episode_len']
-    camera_names = ['angle','r_vis', 'top']  # SIM_TASK_CONFIGS[task_name]['camera_names']
+    camera_names = ['angle','r_vis', 'top', 'front']  # SIM_TASK_CONFIGS[task_name]['camera_names']
+    image_size = (256, 256)
     if task_name == 'sim_transfer':
         policy = TestPickAndTransferPolicy(inject_noise)
         print(task_name)
     else:
         raise NotImplementedError
@@ -32,63 +30,45 @@ def main():
 
     env = make_sim_env(task_name)
     policy = TestPickAndTransferPolicy(inject_noise)
 
+    # Wait for osmesa to fully start before collecting data
+    print("Waiting for the osmesa thread to start...")
+    time.sleep(60)
+    print("osmesa is ready, starting data collection...")
+
     for episode_idx in range(num_episodes):
-        obs = []
-        reward_ee = []
+        sum_reward = 0.0
+        max_reward = float('-inf')
         print(f'\n{episode_idx=}')
        print('Rollout out EE space scripted policy')
         box_pos = sample_transfer_pose()
         env.reset(box_pos)
+        episode_writer = StreamingEpisodeWriter(
+            dataset_path=os.path.join(dataset_dir, f'episode_{episode_idx}.hdf5'),
+            max_timesteps=episode_len,
+            camera_names=camera_names,
+            image_size=image_size,
+        )
         for step in range(episode_len):
-            action = policy.predict(box_pos,step)
-            env.step(action)
+            raw_action = policy.predict(box_pos,step)
+            env.step(raw_action)
             env.render()
-            reward_ee.append(env.rew)
-            obs.append(env.obs)
-        sum_reward = np.sum(reward_ee)
-        max_reward = np.max(reward_ee)
+            sum_reward += env.rew
+            max_reward = max(max_reward, env.rew)
+            episode_writer.append(
+                qpos=env.obs['qpos'],
+                action=raw_action,
+                images=env.obs['images'],
+            )
         if max_reward == env.max_reward:
             success.append(1)
             print(f"{episode_idx=} Successful, {sum_reward=}")
-            t0 = time.time()
-            data_dict = {
-                '/observations/qpos': [],
-                '/action': [],
-            }
-
-            for cam_name in camera_names:
-                data_dict[f'/observations/images/{cam_name}'] = []
-            for i in range(episode_len):
-                print("type qpos==",obs[i]['qpos'])
-                data_dict['/observations/qpos'].append(obs[i]['qpos'])
-                data_dict['/action'].append(obs[i]['action'])
-                for cam_name in camera_names:
-                    data_dict[f'/observations/images/{cam_name}'].append(obs[i]['images'][cam_name])
-
-            dataset_path = os.path.join(dataset_dir, f'episode_{episode_idx}')
-
-            with h5py.File(dataset_path + '.hdf5', 'w', rdcc_nbytes=1024 ** 2 * 2) as root:
-                max_timesteps = episode_len
-                root.attrs['sim'] = True
-                obs_ = root.create_group('observations')
-                image = obs_.create_group('images')
-                for cam_name in camera_names:
-                    _ = image.create_dataset(cam_name, (max_timesteps, 480, 640, 3), dtype='uint8',
-                                             chunks=(1, 480, 640, 3), )
-                qpos = obs_.create_dataset('qpos', (max_timesteps, 16))
-                action = root.create_dataset('action', (max_timesteps, 16))
-                for name, array in data_dict.items():
-                    root[name][...] = np.array(array)
+            episode_writer.commit()
         else:
             success.append(0)
             print(f"{episode_idx=} Failed")
             print(max_reward)
-            del obs
-            del reward_ee
-            del sum_reward
-            del max_reward
+            episode_writer.discard()
 
     # del policy
     # env.viewer.close()
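StreamingEpisodeWriter itself is not part of this diff. Judging only from the constructor arguments and the append/commit/discard calls above, a minimal writer consistent with that interface could look like the following; every detail is an assumption and the real roboimi.utils.streaming_episode_writer may differ:

```python
# Hypothetical sketch matching the calls in record_episodes above.
import os
import h5py
import numpy as np

class StreamingEpisodeWriter:
    """Write one episode to HDF5 frame by frame instead of buffering it all."""

    def __init__(self, dataset_path, max_timesteps, camera_names, image_size):
        os.makedirs(os.path.dirname(dataset_path), exist_ok=True)
        self.path = dataset_path
        self.t = 0
        self.root = h5py.File(dataset_path, 'w', rdcc_nbytes=1024 ** 2 * 2)
        self.root.attrs['sim'] = True
        obs = self.root.create_group('observations')
        images = obs.create_group('images')
        h, w = image_size
        self.img_dsets = {
            cam: images.create_dataset(cam, (max_timesteps, h, w, 3),
                                       dtype='uint8', chunks=(1, h, w, 3))
            for cam in camera_names
        }
        # 16-dim qpos/action mirror the old inline writer above
        self.qpos_dset = obs.create_dataset('qpos', (max_timesteps, 16))
        self.action_dset = self.root.create_dataset('action', (max_timesteps, 16))

    def append(self, qpos, action, images):
        self.qpos_dset[self.t] = np.asarray(qpos)
        self.action_dset[self.t] = np.asarray(action)
        for cam, frame in images.items():
            if cam in self.img_dsets:
                self.img_dsets[cam][self.t] = frame
        self.t += 1

    def commit(self):
        # Keep the file: flush and close
        self.root.close()

    def discard(self):
        # Drop the file for failed episodes
        self.root.close()
        os.remove(self.path)
```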
@@ -1,152 +0,0 @@
|
||||
import torch
|
||||
import os
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
from tqdm import tqdm
|
||||
from einops import rearrange
|
||||
from roboimi.utils.utils import set_seed
|
||||
from roboimi.utils.io_utils import IOUtils
|
||||
from roboimi.utils.model_interface import ModelInterface
|
||||
from roboimi.envs.vx300s_jnt import make_sim_env
|
||||
import time
|
||||
|
||||
# from visualize_episodes import save_videos
|
||||
from roboimi.utils.utils import sample_box_pose, sample_insertion_pose
|
||||
|
||||
|
||||
|
||||
#should be added into IOUtils
|
||||
def get_image(obs,camera_names):
|
||||
curr_images = []
|
||||
for cam_name in camera_names:
|
||||
curr_image = rearrange(obs['images'][cam_name], 'h w c -> c h w')
|
||||
curr_images.append(curr_image)
|
||||
curr_image = np.stack(curr_images, axis=0)
|
||||
curr_image = torch.from_numpy(curr_image / 255.0).float().cuda().unsqueeze(0)
|
||||
return curr_image
|
||||
|
||||
|
||||
def eval_bc(config, ckpt_name='policy_best.ckpt', save_episode=True):
|
||||
set_seed(1)
|
||||
model_interface = ModelInterface(config)
|
||||
task_name = 'sim_insertion' #config['task_name']
|
||||
model_interface.setup()
|
||||
policy = IOUtils.load_policy(config, ckpt_name)
|
||||
stats = IOUtils.load_stats(config['ckpt_dir'])
|
||||
num_rollouts = 3
|
||||
episode_returns = []
|
||||
highest_rewards = []
|
||||
for rollout_id in range(num_rollouts):
|
||||
episode_return, episode_highest_reward = run_episode(config, policy, stats,
|
||||
save_episode,rollout_id)
|
||||
|
||||
|
||||
|
||||
|
||||
def run_episode(config, policy, stats, save_episode,rollout_id):
|
||||
print("\nrollout_id===",rollout_id,"\n")
|
||||
pre_process = lambda s_qpos: (s_qpos - stats['qpos_mean']) / stats['qpos_std']
|
||||
post_process = lambda a: a * stats['action_std'] + stats['action_mean']
|
||||
if 'sim_insertion' in config['task_name']:
|
||||
peg_pose, socket_pose = sample_insertion_pose()
|
||||
box_pose = np.hstack((peg_pose[:3],socket_pose[:3])) # used in sim reset
|
||||
task_name = 'sim_insertion' #config['task_name']
|
||||
env = make_sim_env(task_name)
|
||||
env.reset(box_pose)
|
||||
max_timesteps = config['episode_len']
|
||||
max_timesteps = int(max_timesteps * 1)
|
||||
|
||||
image_list = []
|
||||
rewards = []
|
||||
query_frequency = config['policy_config'].get('num_queries', 1)
|
||||
|
||||
with torch.inference_mode():
|
||||
for t in range(700):
|
||||
# print("obs_img",env.obs['images'])
|
||||
image_list.append(env.obs['images'] if 'images' in env.obs else {print("img error")})
|
||||
qpos_numpy = np.array(env.obs['qpos'])
|
||||
qpos = pre_process(qpos_numpy)
|
||||
qpos = torch.from_numpy(qpos).float().cuda().unsqueeze(0)
|
||||
curr_image = get_image(env.obs, config['camera_names'])
|
||||
if config['policy_class'] == "ACT" or "ACTTV":
|
||||
if t % query_frequency == 0:
|
||||
all_actions = policy(qpos, curr_image)
|
||||
elif config['policy_class'] == "CNNMLP":
|
||||
raw_action = policy(qpos, curr_image)
|
||||
else:
|
||||
raise NotImplementedError
|
||||
raw_action = all_actions[:, t % query_frequency]
|
||||
raw_action = raw_action.squeeze(0).cpu().numpy()
|
||||
action = post_process(raw_action)
|
||||
|
||||
env.step(action)
|
||||
rewards.append(env.rew)
|
||||
env.render()
|
||||
|
||||
|
||||
rewards = np.array(rewards)
|
||||
episode_return = np.sum(rewards[rewards != None])
|
||||
episode_highest_reward = np.max(rewards)
|
||||
env.viewer.close()
|
||||
|
||||
del env
|
||||
return episode_return, episode_highest_reward
|
||||
|
||||
|
||||


def test_env():
    env = None
    try:
        env = make_sim_env('sim_insertion')
        box_pos = np.concatenate(sample_insertion_pose())
        env.reset(box_pos)
        while True:
            time.sleep(0.1)  # keep the viewer alive without busy-waiting
    except KeyboardInterrupt:
        del env
        print("stop")


if __name__ == '__main__':
    test_env()
    # io_utils = IOUtils()
    # config = io_utils.load_config()
    # eval_bc(config)


# config===== {'onscreen_render': False,
# 'eval': 1,
# 'ckpt_dir': 'ckpt_models',
# 'num_epochs': 3000,
# 'temporal_agg': False,
# 'policy_class': 'ACT',
# 'backbone': 'resnet18',
# 'seed': 0, 'real_robot': 0,
# 'task_name': 'sim_insertion',
# 'images_render_height': 480,
# 'images_render_width': 640,
# 'left_arm_DOF_number': 6,
# 'right_arm_DOF_number': 6,
# 'left_qpos_raw': 8,
# 'right_qpos_raw': 8,
# 'left_qvel_raw': 8,
# 'right_qvel_raw': 8,
# 'dataset_dir': '/home/arm/lzd/act_env/dataset/sim_insertion',
# 'num_episodes': 7,
# 'episode_len': 400,
# 'camera_names': ['top'],
# 'xml_dir': None,
# 'batch_size': 8,
# 'state_dim': 14,
# 'action_dim': 14,
# 'lr_backbone': 1e-05,
# 'enc_layers': 4,
# 'dec_layers': 7,
# 'nheads': 8,
# 'qpos_noise_std': 0,
# 'DT': 0.02,
# 'lr': 1e-05,
# 'kl_weight': 10,
# 'chunk_size': 100,
# 'hidden_dim': 512,
# 'dim_feedforward': 3200,
# 'policy_config': {'lr': 1e-05, 'num_queries': 100, 'kl_weight': 10, 'hidden_dim': 512, 'dim_feedforward': 3200, 'lr_backbone': 1e-05, 'backbone': 'resnet18', 'enc_layers': 4, 'dec_layers': 7, 'nheads': 8, 'camera_names': ['top']}}
@@ -1,179 +0,0 @@
import torch
import os
from tqdm import tqdm
import numpy as np
from copy import deepcopy
from itertools import repeat
import matplotlib.pyplot as plt
import time
from roboimi.utils.utils import set_seed, compute_dict_mean, detach_dict, load_data
from roboimi.utils.io_utils import IOUtils
from roboimi.utils.model_interface import ModelInterface


def train_bc(config):
    num_epochs = config['num_epochs']
    ckpt_dir = config['ckpt_dir']
    seed = config['seed']

    os.makedirs(ckpt_dir, exist_ok=True)

    set_seed(seed)

    model_interface = ModelInterface(config)
    model_interface.setup()

    policy = model_interface.make_policy()
    policy.cuda()
    optimizer = model_interface.make_optimizer(policy)
    # print("cam names=====", config['camera_names'])
    train_dataloader, val_dataloader, stats, _ = load_data(
        config['dataset_dir'],
        config['num_episodes'],
        config['camera_names'],
        config['batch_size'],
        config['batch_size'])

    IOUtils.save_stats(ckpt_dir, stats)

    train_history = []
    validation_history = []
    min_val_loss = np.inf
    min_train_loss = np.inf
    best_ckpt_info = None

    plt.ion()
    fig, ax = plt.subplots()
    train_losses, val_losses = [], []
    train_line, = ax.plot([], [], label='Train Loss')
    val_line, = ax.plot([], [], label='Validation Loss')
    ax.autoscale_view()
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.legend()
    ax.grid(True)

    train_annotation = ax.annotate('', xy=(0, 0), textcoords='offset points')
    val_annotation = ax.annotate('', xy=(0, 0), textcoords='offset points')

    min_train_text = ax.text(0.85, 0.5, '', transform=ax.transAxes, fontsize=10, verticalalignment='center', horizontalalignment='left', bbox=dict(facecolor='white', alpha=0.5))
    min_val_text = ax.text(0.85, 0.45, '', transform=ax.transAxes, fontsize=10, verticalalignment='center', horizontalalignment='left', bbox=dict(facecolor='white', alpha=0.5))

    for epoch in tqdm(range(num_epochs)):
        print(f'\nEpoch {epoch}')

        # Validation
        epoch_val_loss, epoch_summary = validate(policy, val_dataloader)
        validation_history.append(epoch_summary)
        val_losses.append(epoch_val_loss.cpu().item())

        if epoch_val_loss < min_val_loss:
            min_val_loss = epoch_val_loss
            min_val_epoch = epoch
            best_ckpt_info = (epoch, min_val_loss,
                              deepcopy(policy.state_dict()))

        print(f'Val loss: {epoch_val_loss:.5f}')
        print_summary(epoch_summary)

        # Training
        epoch_train_loss, epoch_summary = train_epoch(
            policy, optimizer, train_dataloader)
        train_history.append(epoch_summary)
        train_losses.append(epoch_train_loss.cpu().item())

        if epoch_train_loss < min_train_loss:
            min_train_loss = epoch_train_loss
            min_train_epoch = epoch

        print(f'Train loss: {epoch_train_loss:.5f}')
        print_summary(epoch_summary)

        # Update the plot with the new data
        train_line.set_xdata(range(len(train_losses)))
        train_line.set_ydata(train_losses)
        val_line.set_xdata(range(len(val_losses)))
        val_line.set_ydata(val_losses)

        # Update annotations with the latest loss values at their respective positions
        train_annotation.set_position((len(train_losses)-1, train_losses[-1]))
        train_annotation.xy = (len(train_losses)-1, train_losses[-1])
        train_annotation.set_text(f'{train_losses[-1]:.5f}')

        val_annotation.set_position((len(val_losses)-1, val_losses[-1]))
        val_annotation.xy = (len(val_losses)-1, val_losses[-1])
        val_annotation.set_text(f'{val_losses[-1]:.5f}')

        # Update text objects with the minimum loss values, fixed on the right side
        min_train_text.set_text(f'Min Train Loss: {min_train_loss:.5f} (Epoch {min_train_epoch})')
        min_val_text.set_text(f'Min Val Loss: {min_val_loss:.5f} (Epoch {min_val_epoch})')

        ax.relim()
        ax.autoscale_view()
        plt.draw()
        plt.pause(0.1)

    plt.ioff()
    IOUtils.save_checkpoint(policy, 'last', ckpt_dir, seed, 'last')

    best_epoch, min_val_loss, best_state_dict = best_ckpt_info
    IOUtils.save_checkpoint(best_state_dict, best_epoch,
                            ckpt_dir, seed, 'best', min_val_loss)
    print(
        f'Training finished:\nSeed {seed}, val loss {min_val_loss:.6f} at epoch {best_epoch}')

    IOUtils.plot_history(train_history, validation_history,
                         num_epochs, ckpt_dir, seed)

    return best_ckpt_info


def validate(policy, dataloader):
    policy.eval()
    epoch_dicts = []
    with torch.inference_mode():
        for data in dataloader:
            forward_dict = forward_pass(data, policy)
            epoch_dicts.append(forward_dict)
    epoch_summary = compute_dict_mean(epoch_dicts)
    return epoch_summary['loss'], epoch_summary


def train_epoch(policy, optimizer, dataloader):
    policy.train()
    epoch_dicts = []
    for data in dataloader:
        optimizer.zero_grad()
        forward_dict = forward_pass(data, policy)
        loss = forward_dict['loss']
        loss.backward()
        optimizer.step()
        epoch_dicts.append(detach_dict(forward_dict))
    epoch_summary = compute_dict_mean(epoch_dicts)
    return epoch_summary['loss'], epoch_summary


def forward_pass(data, policy):
    image_data, qpos_data, action_data, is_pad = data
    image_data = image_data.cuda()
    qpos_data = qpos_data.cuda()
    action_data = action_data.cuda()
    is_pad = is_pad.cuda()
    return policy(qpos_data, image_data, action_data, is_pad)


def print_summary(summary):
    summary_string = ' '.join(
        [f'{k}: {v.item():.3f}' for k, v in summary.items()])
    print(summary_string)


if __name__ == '__main__':
    io_utils = IOUtils()
    config = io_utils.load_config()
    train_bc(config)
36
roboimi/demos/view_raw_action_trajectory.py
Normal file
@@ -0,0 +1,36 @@
import argparse
import numpy as np

from roboimi.utils.raw_action_trajectory_viewer import launch_raw_action_trajectory_viewer


def parse_args():
    parser = argparse.ArgumentParser(description="Launch an interactive MuJoCo viewer with raw-action trajectory overlay.")
    parser.add_argument("trajectory_path", help="Path to raw_action.npy or trajectory.npz")
    parser.add_argument("--task-name", default="sim_transfer")
    parser.add_argument("--line-radius", type=float, default=0.004)
    parser.add_argument("--max-markers", type=int, default=1500)
    parser.add_argument(
        "--box-pos",
        type=float,
        nargs=3,
        default=None,
        help="Optional box xyz to use when resetting the environment",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    box_pos = np.asarray(args.box_pos, dtype=np.float32) if args.box_pos is not None else None
    launch_raw_action_trajectory_viewer(
        args.trajectory_path,
        task_name=args.task_name,
        line_radius=args.line_radius,
        max_markers=args.max_markers,
        box_pos=box_pos,
    )


if __name__ == "__main__":
    main()
796
roboimi/demos/vla_scripts/eval_vla.py
Normal file
@@ -0,0 +1,796 @@
"""
VLA policy evaluation script (simplified).

This script evaluates a trained VLA policy using the agent's built-in queue
management. No separate evaluator class is needed - the agent handles everything.

Usage:
    python roboimi/demos/eval_vla_simple.py
    python roboimi/demos/eval_vla_simple.py eval.ckpt_path=checkpoints/vla_model_final.pt
    python roboimi/demos/eval_vla_simple.py eval.ckpt_path=checkpoints/vla_model_best.pt
"""

import sys
import os
import json
import logging
import time
import torch
import numpy as np
import hydra
from pathlib import Path
from typing import Any, Dict, Optional
from tqdm import tqdm
from omegaconf import DictConfig, OmegaConf
from hydra.utils import instantiate
from einops import rearrange

from roboimi.envs.double_pos_ctrl_env import make_sim_env
from roboimi.utils.act_ex_utils import sample_transfer_pose
from roboimi.vla.eval_utils import execute_policy_action

sys.path.append(os.getcwd())

log = logging.getLogger(__name__)

if not OmegaConf.has_resolver("len"):
    OmegaConf.register_new_resolver("len", lambda x: len(x))


def load_checkpoint(
    ckpt_path: str,
    agent_cfg: DictConfig,
    device: str = 'cuda'
) -> tuple:
    """
    Load a trained VLA model from a checkpoint using the Hydra agent config.

    Args:
        ckpt_path: Path to the checkpoint file (.pt)
        agent_cfg: Hydra agent config used for instantiation
        device: Device to load the model on

    Returns:
        Tuple of (loaded VLAAgent model, dataset stats or None)
    """
    ckpt_path = Path(ckpt_path).absolute()
    if not ckpt_path.exists():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt_path}")

    log.info(f"Loading checkpoint from {ckpt_path}")
    checkpoint = torch.load(ckpt_path, map_location=device, weights_only=False)
    log.info(f"Checkpoint keys: {checkpoint.keys()}")

    # Load dataset statistics used for normalization
    stats = checkpoint.get('dataset_stats', None)

    # Instantiate the agent from the Hydra config with the dataset statistics
    log.info("Instantiating agent from config...")
    agent = instantiate(agent_cfg, dataset_stats=stats)

    # Load the model state
    agent.load_state_dict(checkpoint['model_state_dict'])
    log.info(f"✅ Model state loaded (step: {checkpoint.get('step', 'unknown')})")

    if stats is not None:
        log.info(f"✅ Dataset stats loaded (normalization: {stats.get('normalization_type', 'gaussian')})")
    else:
        # Fallback: try to load from an external JSON file (legacy checkpoints)
        stats_path = ckpt_path.parent / 'dataset_stats.json'
        if stats_path.exists():
            with open(stats_path, 'r') as f:
                stats = json.load(f)
            log.info("✅ Dataset stats loaded from external JSON (legacy compatibility)")
        else:
            log.warning("⚠️ Dataset stats not found. Actions cannot be denormalized!")

    agent.eval()
    agent.to(device)

    log.info(f"✅ Model loaded successfully onto {device}")
    return agent, stats
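

# A minimal usage sketch, assuming `cfg` is an already-composed Hydra config
# (e.g. from hydra.compose) and the checkpoint path is only an example:
# agent, stats = load_checkpoint('checkpoints/vla_model_best.pt', cfg.agent)
# agent.reset()  # clear the internal action queue before a new episode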


def prepare_observation(obs: Dict, camera_names: list) -> Dict:
    """
    Convert an environment observation into the agent's format.

    Args:
        obs: Environment observation dict containing images and qpos
        camera_names: List of camera names

    Returns:
        Observation dict in the agent's format
    """
    import cv2

    # Convert images: numpy -> tensor, HWC -> CHW
    images = {}
    for cam_name in camera_names:
        img = obs['images'][cam_name]
        # Resize to 224x224 (consistent with training)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
        img = rearrange(img, 'h w c -> c h w')
        img = torch.from_numpy(img / 255.0).float()
        images[cam_name] = img

    # Convert qpos: numpy -> tensor
    qpos = torch.from_numpy(obs['qpos']).float()

    return {'qpos': qpos, 'images': images}
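

# A shape sketch (input values are placeholders): a single 480x640 RGB camera
# and a 14-dim qpos produce
# obs = {'images': {'top': np.zeros((480, 640, 3), np.uint8)}, 'qpos': np.zeros(14, np.float32)}
# out = prepare_observation(obs, ['top'])
# out['images']['top'].shape  # torch.Size([3, 224, 224]), values in [0, 1]
# out['qpos'].shape           # torch.Size([14])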


def _to_numpy_action(action: Any) -> np.ndarray:
    if isinstance(action, torch.Tensor):
        return action.detach().cpu().numpy().astype(np.float32, copy=True)
    return np.asarray(action, dtype=np.float32).copy()


def _mean_or_zero(values: list[float]) -> float:
    return float(np.mean(values)) if values else 0.0


def _stats_or_zero(values: list[float]) -> dict[str, float]:
    if not values:
        return {
            'mean': 0.0,
            'std': 0.0,
            'min': 0.0,
            'max': 0.0,
        }
    array = np.asarray(values, dtype=np.float64)
    return {
        'mean': float(array.mean()),
        'std': float(array.std()),
        'min': float(array.min()),
        'max': float(array.max()),
    }


def _summarize_timing_breakdown(
    all_timings: dict[str, list[float]],
    model_forward_flags: list[bool],
) -> dict[str, Any]:
    model_forward_flags = [bool(flag) for flag in model_forward_flags]
    return {
        'count': int(len(model_forward_flags)),
        'model_forward_count': int(sum(model_forward_flags)),
        'all_steps_ms': {
            stage: _stats_or_zero(values)
            for stage, values in all_timings.items()
        },
        'model_forward_steps_ms': {
            stage: _stats_or_zero(
                [value for value, should_keep in zip(values, model_forward_flags) if should_keep]
            )
            for stage, values in all_timings.items()
        },
    }
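

# Illustrative shape of the summary produced above (all numbers made up):
# {
#   'count': 400, 'model_forward_count': 4,
#   'all_steps_ms': {'inference': {'mean': 1.2, 'std': 9.8, 'min': 0.1, 'max': 95.0}, ...},
#   'model_forward_steps_ms': {'inference': {'mean': 93.5, ...}, ...},  # only steps where the model ran
# }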


def _json_friendly(value: Any) -> Any:
    if isinstance(value, dict):
        return {str(key): _json_friendly(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_json_friendly(item) for item in value]
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, (np.integer, np.floating)):
        return value.item()
    return value


def _resolve_artifact_paths(eval_cfg: DictConfig) -> dict[str, Optional[str]]:
    save_timing = bool(eval_cfg.get('save_timing', False))
    save_trajectory = bool(
        eval_cfg.get('save_trajectory', False) or eval_cfg.get('save_trajectory_npz', False)
    )
    wants_artifacts = any([
        bool(eval_cfg.get('save_artifacts', False)),
        save_timing,
        save_trajectory,
        bool(eval_cfg.get('record_video', False)),
    ])
    output_dir: Optional[Path] = None
    if wants_artifacts:
        artifact_dir = eval_cfg.get('artifact_dir', None)
        if artifact_dir:
            output_dir = Path(str(artifact_dir)).expanduser().resolve()
        else:
            ckpt_stem = Path(str(eval_cfg.ckpt_path)).stem or 'rollout'
            timestamp = time.strftime('%Y%m%d-%H%M%S')
            output_dir = (Path.cwd() / 'rollout_artifacts' / f'{ckpt_stem}-{timestamp}').resolve()
        output_dir.mkdir(parents=True, exist_ok=True)

    video_camera_name = None
    if bool(eval_cfg.get('record_video', False)):
        configured_camera_name = eval_cfg.get('video_camera_name', None)
        if configured_camera_name is None:
            configured_camera_name = eval_cfg.get('video_camera', None)
        if configured_camera_name is not None:
            video_camera_name = str(configured_camera_name)
        elif eval_cfg.get('camera_names'):
            video_camera_name = str(eval_cfg.camera_names[0])
        else:
            raise ValueError('record_video=true requires eval.video_camera_name or a non-empty eval.camera_names')

    return {
        'output_dir': str(output_dir) if output_dir is not None else None,
        'summary_json': (
            str(output_dir / 'rollout_summary.json')
            if output_dir is not None and bool(eval_cfg.get('save_summary_json', False))
            else None
        ),
        'timing_json': (
            str(output_dir / 'timing.json')
            if output_dir is not None and save_timing
            else None
        ),
        'trajectory_npz': (
            str(output_dir / 'trajectory.npz')
            if output_dir is not None and save_trajectory
            else None
        ),
        'video_mp4': (
            str(output_dir / f'rollout_{video_camera_name}.mp4')
            if output_dir is not None and bool(eval_cfg.get('record_video', False))
            and video_camera_name is not None
            else None
        ),
        'video_camera_name': video_camera_name,
    }


def _get_video_frame(obs: Dict, camera_name: Optional[str]) -> Optional[np.ndarray]:
    if camera_name is None:
        return None
    frame = obs['images'][camera_name]
    frame = np.asarray(frame)
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError(
            f'Video frame for camera {camera_name} must have shape (H, W, 3), got {frame.shape}'
        )
    if frame.dtype != np.uint8:
        frame = np.clip(frame, 0, 255).astype(np.uint8)
    return frame


def _open_video_writer(output_path: str, frame_size: tuple[int, int], fps: int):
    import cv2

    output_path = str(output_path)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    writer = cv2.VideoWriter(output_path, fourcc, float(fps), frame_size)
    if not writer.isOpened():
        raise RuntimeError(f'Failed to open video output: {output_path}')
    return writer


class _RolloutVideoRecorder:
    def __init__(self, output_path: Optional[str], fps: int):
        self.output_path = output_path
        self.fps = int(fps)
        self.writer = None

    def write(self, frame: Optional[np.ndarray]):
        if self.output_path is None or frame is None:
            return
        if self.writer is None:
            frame_size = (int(frame.shape[1]), int(frame.shape[0]))
            self.writer = _open_video_writer(self.output_path, frame_size, self.fps)
        self.writer.write(frame)

    def close(self):
        if self.writer is not None:
            self.writer.release()
            self.writer = None


def _read_body_pose(env, body_name: str):
    try:
        if callable(getattr(env, 'getBodyPos', None)) and callable(getattr(env, 'getBodyQuat', None)):
            pos = env.getBodyPos(body_name)
            quat = env.getBodyQuat(body_name)
        else:
            body = env.mj_data.body(body_name)
            pos = body.xpos
            quat = body.xquat
    except Exception:
        return None

    return {
        'pos': np.asarray(pos, dtype=np.float32).copy(),
        'quat': np.asarray(quat, dtype=np.float32).copy(),
    }


def _get_executed_ee_poses(env) -> dict[str, np.ndarray]:
    candidates = {
        'left_link7': ('left_link7', 'eef_left'),
        'right_link7': ('right_link7', 'eef_right'),
        'eef_left': ('eef_left', 'left_link7'),
        'eef_right': ('eef_right', 'right_link7'),
    }
    poses = {}
    for body_key, body_names in candidates.items():
        pose = None
        for body_name in body_names:
            pose = _read_body_pose(env, body_name)
            if pose is not None:
                break
        if pose is None:
            pose = {
                'pos': np.full(3, np.nan, dtype=np.float32),
                'quat': np.full(4, np.nan, dtype=np.float32),
            }
        poses[f'{body_key}_pos'] = pose['pos']
        poses[f'{body_key}_quat'] = pose['quat']
    return poses


def _empty_rollout_trajectory() -> dict[str, list]:
    return {
        'episode_index': [],
        'step': [],
        'reward': [],
        'raw_action': [],
        'applied_action': [],
        'executed_left_link7_pos': [],
        'executed_left_link7_quat': [],
        'executed_right_link7_pos': [],
        'executed_right_link7_quat': [],
        'executed_eef_left_pos': [],
        'executed_eef_left_quat': [],
        'executed_eef_right_pos': [],
        'executed_eef_right_quat': [],
        'model_inference_triggered': [],
        'obs_read_time_ms': [],
        'preprocess_time_ms': [],
        'inference_time_ms': [],
        'env_step_time_ms': [],
        'total_time_ms': [],
    }


def _append_rollout_step(
    storage: dict[str, list],
    episode_index: int,
    timestep: int,
    reward: Optional[float],
    raw_action: np.ndarray,
    executed_action: np.ndarray,
    executed_poses: dict[str, np.ndarray],
    timing_ms: dict[str, float],
    model_inference_triggered: bool,
):
    storage['episode_index'].append(int(episode_index))
    storage['step'].append(int(timestep))
    storage['reward'].append(float(reward) if reward is not None else np.nan)
    storage['raw_action'].append(raw_action.astype(np.float32, copy=True))
    storage['applied_action'].append(executed_action.astype(np.float32, copy=True))
    storage['executed_left_link7_pos'].append(executed_poses['left_link7_pos'])
    storage['executed_left_link7_quat'].append(executed_poses['left_link7_quat'])
    storage['executed_right_link7_pos'].append(executed_poses['right_link7_pos'])
    storage['executed_right_link7_quat'].append(executed_poses['right_link7_quat'])
    storage['executed_eef_left_pos'].append(executed_poses['eef_left_pos'])
    storage['executed_eef_left_quat'].append(executed_poses['eef_left_quat'])
    storage['executed_eef_right_pos'].append(executed_poses['eef_right_pos'])
    storage['executed_eef_right_quat'].append(executed_poses['eef_right_quat'])
    storage['model_inference_triggered'].append(bool(model_inference_triggered))
    for key, value in timing_ms.items():
        storage[key].append(float(value))


def _save_rollout_trajectory_npz(output_path: str, storage: dict[str, list]):
    step = np.asarray(storage['step'], dtype=np.int32)
    raw_action = np.asarray(storage['raw_action'], dtype=np.float32)
    applied_action = np.asarray(storage['applied_action'], dtype=np.float32)
    executed_left_link7_pos = np.asarray(storage['executed_left_link7_pos'], dtype=np.float32)
    executed_left_link7_quat = np.asarray(storage['executed_left_link7_quat'], dtype=np.float32)
    executed_right_link7_pos = np.asarray(storage['executed_right_link7_pos'], dtype=np.float32)
    executed_right_link7_quat = np.asarray(storage['executed_right_link7_quat'], dtype=np.float32)
    executed_eef_left_pos = np.asarray(storage['executed_eef_left_pos'], dtype=np.float32)
    executed_eef_left_quat = np.asarray(storage['executed_eef_left_quat'], dtype=np.float32)
    executed_eef_right_pos = np.asarray(storage['executed_eef_right_pos'], dtype=np.float32)
    executed_eef_right_quat = np.asarray(storage['executed_eef_right_quat'], dtype=np.float32)
    np.savez_compressed(
        output_path,
        episode_index=np.asarray(storage['episode_index'], dtype=np.int32),
        step=step,
        timestep=step,
        reward=np.asarray(storage['reward'], dtype=np.float32),
        raw_action=raw_action,
        raw_predicted_ee_action=raw_action,
        applied_action=applied_action,
        executed_ee_action=applied_action,
        executed_left_link7_pos=executed_left_link7_pos,
        executed_left_link7_quat=executed_left_link7_quat,
        executed_right_link7_pos=executed_right_link7_pos,
        executed_right_link7_quat=executed_right_link7_quat,
        executed_eef_left_pos=executed_eef_left_pos,
        executed_eef_left_quat=executed_eef_left_quat,
        executed_eef_right_pos=executed_eef_right_pos,
        executed_eef_right_quat=executed_eef_right_quat,
        left_ee_pos=executed_eef_left_pos,
        left_ee_quat=executed_eef_left_quat,
        right_ee_pos=executed_eef_right_pos,
        right_ee_quat=executed_eef_right_quat,
        model_inference_triggered=np.asarray(storage['model_inference_triggered'], dtype=bool),
        obs_read_time_ms=np.asarray(storage['obs_read_time_ms'], dtype=np.float32),
        preprocess_time_ms=np.asarray(storage['preprocess_time_ms'], dtype=np.float32),
        inference_time_ms=np.asarray(storage['inference_time_ms'], dtype=np.float32),
        env_step_time_ms=np.asarray(storage['env_step_time_ms'], dtype=np.float32),
        total_time_ms=np.asarray(storage['total_time_ms'], dtype=np.float32),
    )
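

# A minimal sketch of reading the archive back (the path is an assumed example
# of what _resolve_artifact_paths produces, not a fixed location):
# data = np.load('rollout_artifacts/<ckpt>-<timestamp>/trajectory.npz')
# raw = data['raw_action']          # (num_steps, action_dim) policy outputs
# applied = data['applied_action']  # actions actually sent to the env
# print(raw.shape, float(np.nanmean(data['reward'])))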


def _save_summary_json(output_path: str, summary: dict[str, Any]):
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(_json_friendly(summary), f, ensure_ascii=False, indent=2)


class ActionSmoother:
    """
    Action smoother (exponential moving average).

    Smooths executed actions for more stable control.
    """

    def __init__(self, alpha: float = 0.3):
        """
        Args:
            alpha: Smoothing coefficient in (0, 1); larger values weight the current action more heavily
        """
        self.alpha = alpha
        self.prev_action = None

    def smooth(self, action: np.ndarray) -> np.ndarray:
        """
        Smooth an action.

        Args:
            action: Current action

        Returns:
            The smoothed action
        """
        if self.prev_action is None:
            smoothed = action
        else:
            smoothed = self.alpha * action + (1 - self.alpha) * self.prev_action
        self.prev_action = smoothed
        return smoothed

    def reset(self):
        """Reset the smoother state."""
        self.prev_action = None
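

# A worked sketch of the EMA above (values illustrative): with alpha = 0.3,
# prev = 0.0 and a new action 1.0,
#   smoothed = 0.3 * 1.0 + 0.7 * 0.0 = 0.3
# so repeated identical inputs converge geometrically toward the target,
# trading a small lag for stability.
# smoother = ActionSmoother(alpha=0.3)
# a1 = smoother.smooth(np.zeros(14))  # first call returns the input unchanged
# a2 = smoother.smooth(np.ones(14))   # elementwise 0.3 afterwards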


def _close_env(env):
    if env is None:
        return

    if hasattr(env, 'exit_flag'):
        env.exit_flag = True

    cam_thread = getattr(env, 'cam_thread', None)
    if cam_thread is not None and hasattr(cam_thread, 'join'):
        cam_thread.join(timeout=1.0)

    viewer = getattr(env, 'viewer', None)
    if viewer is not None and hasattr(viewer, 'close'):
        viewer.close()


def _run_eval(cfg: DictConfig):
    """
    Simplified VLA evaluation using the agent's built-in queue management.

    All evaluation parameters come from vla/conf/eval.yaml, merged into cfg.
    Command-line overrides: python eval_vla_simple.py eval.ckpt_path=... eval.num_episodes=5
    """

    # Print the config
    print("=" * 80)
    print("VLA evaluation config:")
    print("=" * 80)
    print(OmegaConf.to_yaml(cfg))
    print("=" * 80)

    eval_cfg = cfg.eval
    device = eval_cfg.device
    camera_names = list(eval_cfg.camera_names)
    artifact_paths = _resolve_artifact_paths(eval_cfg)
    video_recorder = _RolloutVideoRecorder(
        output_path=artifact_paths['video_mp4'],
        fps=int(eval_cfg.get('video_fps', 30)),
    )
    rollout_trajectory = _empty_rollout_trajectory()
    global_obs_read_times_ms = []
    global_preprocess_times_ms = []
    global_inference_times_ms = []
    global_env_step_times_ms = []
    global_total_times_ms = []
    global_model_forward_flags = []

    # =========================================================================
    # Load the model
    # =========================================================================
    log.info(f"🚀 Loading model from {eval_cfg.ckpt_path}...")
    agent, dataset_stats = load_checkpoint(
        ckpt_path=eval_cfg.ckpt_path,
        agent_cfg=cfg.agent,
        device=device
    )

    # Reset the agent's action queue
    agent.reset()

    # Optional: action smoother
    smoother = ActionSmoother(alpha=eval_cfg.smooth_alpha) if eval_cfg.use_smoothing else None

    # =========================================================================
    # Create the environment
    # =========================================================================
    env = make_sim_env(eval_cfg.task_name, headless=eval_cfg.headless)

    # =========================================================================
    # Run evaluation episodes
    # =========================================================================
    all_stats = []
    episode_rewards = []
    episode_max_rewards = []
    try:
        for episode_idx in range(eval_cfg.num_episodes):
            print(f"\n{'='*60}")
            print(f"Episode {episode_idx + 1}/{eval_cfg.num_episodes}")
            print(f"{'='*60}\n")

            box_pos = sample_transfer_pose()
            env.reset(box_pos)

            # Reset the agent queues for the new episode
            agent.reset()
            if smoother:
                smoother.reset()

            # Timing statistics
            obs_read_times_ms = []
            preprocess_times_ms = []
            inference_times_ms = []
            env_step_times_ms = []
            total_times_ms = []
            model_forward_flags = []
            episode_reward = 0.0
            episode_max_reward = float('-inf')

            with torch.inference_mode():
                for t in tqdm(range(eval_cfg.max_timesteps), desc=f"Episode {episode_idx + 1}"):
                    start_total = time.perf_counter()

                    # Read the observation from the environment
                    obs = env._get_image_obs()
                    qpos_obs = env._get_qpos_obs()
                    obs['qpos'] = qpos_obs['qpos']
                    end_obs_read = time.perf_counter()

                    video_frame = _get_video_frame(obs, artifact_paths['video_camera_name'])
                    video_recorder.write(video_frame)

                    # Prepare the observation for the agent
                    observation = prepare_observation(obs, camera_names)
                    end_preprocess = time.perf_counter()

                    # Select an action (the agent handles queue management internally)
                    action_queue = getattr(agent, '_queues', {}).get('action', None)
                    model_inference_triggered = len(action_queue) == 0 if action_queue is not None else True
                    start_inference = time.perf_counter()
                    action = agent.select_action(observation)

                    if str(device).startswith('cuda') and torch.cuda.is_available():
                        torch.cuda.synchronize()
                    end_inference = time.perf_counter()

                    # Convert to numpy
                    raw_action = _to_numpy_action(action)

                    # Debug: print the action at the current timestep (config-controlled)
                    if eval_cfg.get('verbose_action', False):
                        print(f"\n[Step {t:3d}] predicted action: {raw_action}")
                        print(f"  - action shape: {raw_action.shape}")
                        print(f"  - action range: [{raw_action.min():.4f}, {raw_action.max():.4f}]")
                        print(f"  - action mean: {raw_action.mean():.4f}, std: {raw_action.std():.4f}")

                    # Optional: smooth the action
                    executed_action = raw_action.copy()
                    if smoother:
                        executed_action = smoother.smooth(executed_action)

                    # Execute the action
                    start_env_step = time.perf_counter()
                    execute_policy_action(env, executed_action)
                    end_env_step = time.perf_counter()
                    executed_poses = _get_executed_ee_poses(env)
                    reward = getattr(env, 'rew', None)
                    if reward is not None:
                        reward = float(reward)
                        episode_reward += reward
                        episode_max_reward = max(episode_max_reward, reward)
                    if not eval_cfg.headless:
                        env.render()

                    end_total = time.perf_counter()

                    step_timing_ms = {
                        'obs_read_time_ms': (end_obs_read - start_total) * 1000.0,
                        'preprocess_time_ms': (end_preprocess - end_obs_read) * 1000.0,
                        'inference_time_ms': (end_inference - start_inference) * 1000.0,
                        'env_step_time_ms': (end_env_step - start_env_step) * 1000.0,
                        'total_time_ms': (end_total - start_total) * 1000.0,
                    }

                    # Record timings
                    obs_read_times_ms.append(step_timing_ms['obs_read_time_ms'])
                    preprocess_times_ms.append(step_timing_ms['preprocess_time_ms'])
                    inference_times_ms.append(step_timing_ms['inference_time_ms'])
                    env_step_times_ms.append(step_timing_ms['env_step_time_ms'])
                    total_times_ms.append(step_timing_ms['total_time_ms'])
                    model_forward_flags.append(bool(model_inference_triggered))
                    global_obs_read_times_ms.append(step_timing_ms['obs_read_time_ms'])
                    global_preprocess_times_ms.append(step_timing_ms['preprocess_time_ms'])
                    global_inference_times_ms.append(step_timing_ms['inference_time_ms'])
                    global_env_step_times_ms.append(step_timing_ms['env_step_time_ms'])
                    global_total_times_ms.append(step_timing_ms['total_time_ms'])
                    global_model_forward_flags.append(bool(model_inference_triggered))

                    if artifact_paths['trajectory_npz'] is not None:
                        _append_rollout_step(
                            rollout_trajectory,
                            episode_index=episode_idx,
                            timestep=t,
                            reward=reward,
                            raw_action=raw_action,
                            executed_action=executed_action,
                            executed_poses=executed_poses,
                            timing_ms=step_timing_ms,
                            model_inference_triggered=model_inference_triggered,
                        )

            # =========================================================================
            # Print episode statistics
            # =========================================================================
            avg_obs_read_time_ms = _mean_or_zero(obs_read_times_ms)
            avg_preprocess_time_ms = _mean_or_zero(preprocess_times_ms)
            avg_inference_time_ms = _mean_or_zero(inference_times_ms)
            avg_env_step_time_ms = _mean_or_zero(env_step_times_ms)
            avg_total_time_ms = _mean_or_zero(total_times_ms)
            timing_breakdown = _summarize_timing_breakdown(
                {
                    'obs_read': obs_read_times_ms,
                    'preprocess': preprocess_times_ms,
                    'inference': inference_times_ms,
                    'env_step': env_step_times_ms,
                    'loop_total': total_times_ms,
                },
                model_forward_flags,
            )
            episode_artifact_paths = {
                'video': artifact_paths['video_mp4'],
                'trajectory': artifact_paths['trajectory_npz'],
                'timing': artifact_paths['timing_json'] or artifact_paths['summary_json'],
            }

            stats = {
                'inference_fps': 1000.0 / avg_inference_time_ms if avg_inference_time_ms > 0 else 0.0,
                'control_fps': 1000.0 / avg_total_time_ms if avg_total_time_ms > 0 else 0.0,
                'avg_obs_read_time_ms': avg_obs_read_time_ms,
                'avg_preprocess_time_ms': avg_preprocess_time_ms,
                'avg_inference_time_ms': avg_inference_time_ms,
                'avg_env_step_time_ms': avg_env_step_time_ms,
                'avg_total_time_ms': avg_total_time_ms,
                'num_inferences': int(sum(model_forward_flags)),
                'num_model_forwards': int(sum(model_forward_flags)),
                'num_steps': len(total_times_ms),
                'episode_reward': float(episode_reward),
                'episode_max_reward': (
                    float(episode_max_reward) if episode_max_reward != float('-inf') else None
                ),
                'artifact_paths': episode_artifact_paths,
                'timing_breakdown_ms': timing_breakdown['all_steps_ms'],
                'timing_summary': timing_breakdown,
            }
            all_stats.append(stats)
            episode_rewards.append(float(episode_reward))
            if episode_max_reward != float('-inf'):
                episode_max_rewards.append(float(episode_max_reward))

            print(f"\nEpisode {episode_idx + 1} finished ({eval_cfg.max_timesteps} timesteps)")
            print(f"  model inference FPS: {stats['inference_fps']:.2f} Hz")
            print(f"  control loop FPS: {stats['control_fps']:.2f} Hz")
            print(f"  avg obs read time: {stats['avg_obs_read_time_ms']:.2f} ms")
            print(f"  avg preprocess time: {stats['avg_preprocess_time_ms']:.2f} ms")
            print(f"  avg inference time: {stats['avg_inference_time_ms']:.2f} ms")
            print(f"  avg env step time: {stats['avg_env_step_time_ms']:.2f} ms")
            print(f"  avg total time: {stats['avg_total_time_ms']:.2f} ms")
            print(f"  total inference count: {stats['num_inferences']}")
            print(f"  episode cumulative reward: {stats['episode_reward']:.2f}")

        # =========================================================================
        # Overall statistics
        # =========================================================================
        print(f"\n{'='*60}")
        print("Evaluation finished!")
        print(f"{'='*60}")

        summary = {
            'num_episodes': int(eval_cfg.num_episodes),
            'episode_rewards': episode_rewards,
            'episode_max_rewards': episode_max_rewards,
            'avg_reward': float(np.mean(episode_rewards)) if episode_rewards else 0.0,
            'avg_max_reward': float(np.mean(episode_max_rewards)) if episode_max_rewards else 0.0,
            'episodes': all_stats,
            'artifact_dir': artifact_paths['output_dir'],
            'artifacts': artifact_paths,
        }

        if all_stats:
            avg_inference_fps = np.mean([s['inference_fps'] for s in all_stats])
            avg_control_fps = np.mean([s['control_fps'] for s in all_stats])
            avg_obs_read_time = _mean_or_zero(global_obs_read_times_ms)
            avg_preprocess_time = _mean_or_zero(global_preprocess_times_ms)
            avg_inference_time = _mean_or_zero(global_inference_times_ms)
            avg_env_step_time = _mean_or_zero(global_env_step_times_ms)
            avg_total_time = _mean_or_zero(global_total_times_ms)
            summary.update({
                'avg_inference_fps': float(avg_inference_fps),
                'avg_control_fps': float(avg_control_fps),
                'avg_obs_read_time_ms': float(avg_obs_read_time),
                'avg_preprocess_time_ms': float(avg_preprocess_time),
                'avg_inference_time_ms': float(avg_inference_time),
                'avg_env_step_time_ms': float(avg_env_step_time),
                'avg_total_time_ms': float(avg_total_time),
                'timing_summary': _summarize_timing_breakdown(
                    {
                        'obs_read': global_obs_read_times_ms,
                        'preprocess': global_preprocess_times_ms,
                        'inference': global_inference_times_ms,
                        'env_step': global_env_step_times_ms,
                        'loop_total': global_total_times_ms,
                    },
                    global_model_forward_flags,
                ),
            })

            print(f"\nOverall statistics ({eval_cfg.num_episodes} episodes):")
            print(f"  avg model inference FPS: {avg_inference_fps:.2f} Hz")
            print(f"  avg control loop FPS: {avg_control_fps:.2f} Hz")
            print(f"  avg obs read time: {avg_obs_read_time:.2f} ms")
            print(f"  avg preprocess time: {avg_preprocess_time:.2f} ms")
            print(f"  avg inference time: {avg_inference_time:.2f} ms")
            print(f"  avg env step time: {avg_env_step_time:.2f} ms")
            print(f"  avg total time: {avg_total_time:.2f} ms")
            print(f"  avg cumulative reward: {summary['avg_reward']:.2f}")

        if artifact_paths['trajectory_npz'] is not None:
            _save_rollout_trajectory_npz(artifact_paths['trajectory_npz'], rollout_trajectory)
        if artifact_paths['summary_json'] is not None:
            _save_summary_json(artifact_paths['summary_json'], summary)
        if artifact_paths['timing_json'] is not None:
            _save_summary_json(artifact_paths['timing_json'], summary.get('timing_summary', {}))
        print()
        return _json_friendly(summary)
    finally:
        video_recorder.close()
        _close_env(env)


@hydra.main(version_base=None, config_path="../../vla/conf", config_name="config")
def main(cfg: DictConfig):
    return _run_eval(cfg)


if __name__ == '__main__':
    main()
856
roboimi/demos/vla_scripts/train_vla.py
Normal file
@@ -0,0 +1,856 @@
import sys
import os
import logging
import json
import pickle
import importlib
import hydra
import torch
import re
from tqdm import tqdm
from omegaconf import DictConfig, OmegaConf
from torch.utils.data import DataLoader, random_split
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from pathlib import Path

# Make sure the import path is correct
sys.path.append(os.getcwd())

from hydra.utils import instantiate

log = logging.getLogger(__name__)

# Register a list-length resolver (used in configs like ${len:${data.camera_names}})
if not OmegaConf.has_resolver("len"):
    OmegaConf.register_new_resolver("len", lambda x: len(x))


def recursive_to_device(data, device):
    """
    Recursively move tensors in nested dicts/lists to the given device.

    Args:
        data: Dict, list, or tensor
        device: Target device (e.g. 'cuda', 'cpu')

    Returns:
        The same structure with all tensors moved to the device
    """
    if isinstance(data, torch.Tensor):
        return data.to(device)
    elif isinstance(data, dict):
        return {k: recursive_to_device(v, device) for k, v in data.items()}
    elif isinstance(data, list):
        return [recursive_to_device(v, device) for v in data]
    return data


def resolve_resume_checkpoint(resume_ckpt, checkpoint_dir):
    """
    Resolve the checkpoint path used to resume training.

    Args:
        resume_ckpt: resume_ckpt from the config; either a path or "auto"
        checkpoint_dir: Default checkpoint directory

    Returns:
        Path or None
    """
    if resume_ckpt is None:
        return None

    if str(resume_ckpt).lower() != "auto":
        return Path(resume_ckpt)

    pattern = re.compile(r"vla_model_step_(\d+)\.pt$")
    candidates = []
    for ckpt_path in checkpoint_dir.glob("vla_model_step_*.pt"):
        match = pattern.search(ckpt_path.name)
        if match:
            candidates.append((int(match.group(1)), ckpt_path))

    if not candidates:
        return None
    return max(candidates, key=lambda x: x[0])[1]
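

# A hypothetical checkpoint layout illustrating "auto" resolution (paths are
# examples only):
#   checkpoints/vla_model_step_1000.pt
#   checkpoints/vla_model_step_2500.pt
# resolve_resume_checkpoint("auto", Path("checkpoints"))
#   -> Path("checkpoints/vla_model_step_2500.pt")   # the highest step wins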


def get_lr_schedule_with_warmup(optimizer, warmup_steps, max_steps, scheduler_type='cosine', min_lr=0):
    """
    Create a learning-rate scheduler with warmup.

    Args:
        optimizer: PyTorch optimizer
        warmup_steps: Number of warmup steps
        max_steps: Total number of training steps
        scheduler_type: Scheduler type after warmup ('cosine' or 'constant')
        min_lr: Minimum learning rate (for cosine decay)

    Returns:
        A LambdaLR scheduler
    """
    import math
    # Capture the initial learning rate before LambdaLR modifies it
    base_lr = optimizer.param_groups[0]['lr']
    min_lr_ratio = min_lr / base_lr if base_lr > 0 else 0.0

    def lr_lambda(step):
        # Warmup phase: increase linearly from 0 to 1
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))

        # Post-warmup phase
        if scheduler_type == 'cosine':
            # Cosine annealing from 1 down to min_lr_ratio
            progress = float(step - warmup_steps) / float(max(1, max_steps - warmup_steps))
            cosine_decay = 0.5 * (1.0 + math.cos(math.pi * progress))
            return max(min_lr_ratio, cosine_decay)
        else:
            # Constant learning rate
            return 1.0

    return LambdaLR(optimizer, lr_lambda)
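

# A worked sketch of the schedule above (numbers illustrative): with
# base_lr=1e-4, warmup_steps=500, max_steps=10000, scheduler_type='cosine':
#   step 250   -> factor 250/500 = 0.5                -> lr = 5.0e-5 (warmup)
#   step 500   -> factor 1.0                          -> lr = 1.0e-4
#   step 5250  -> progress 0.5, 0.5*(1+cos(pi/2))=0.5 -> lr = 5.0e-5
#   step 10000 -> factor max(min_lr_ratio, 0.0)       -> lr floors at min_lr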


def build_training_optimizer(agent, lr, weight_decay):
    """Build the optimizer for the training script, preferring the transformer head's own parameter groups."""
    trainable_params = [param for param in agent.parameters() if param.requires_grad]
    noise_pred_net = getattr(agent, 'noise_pred_net', None)
    get_optim_groups = getattr(noise_pred_net, 'get_optim_groups', None)
    use_head_groups = (
        getattr(agent, 'head_type', None) == 'transformer'
        and callable(get_optim_groups)
    )

    if not use_head_groups:
        return AdamW(trainable_params, lr=lr, weight_decay=weight_decay)

    head_groups = []
    grouped_param_ids = set()
    for group in get_optim_groups(weight_decay=weight_decay):
        params = [param for param in group['params'] if param.requires_grad]
        if not params:
            continue
        normalized_group = dict(group)
        normalized_group['params'] = params
        head_groups.append(normalized_group)

        for param in params:
            param_id = id(param)
            if param_id in grouped_param_ids:
                raise ValueError('Transformer optimizer groups contain duplicate parameters')
            grouped_param_ids.add(param_id)

    head_trainable_param_ids = {
        id(param) for param in noise_pred_net.parameters() if param.requires_grad
    }
    missing_head_param_ids = head_trainable_param_ids - grouped_param_ids
    if missing_head_param_ids:
        raise ValueError('Transformer optimizer groups missed trainable head parameters')

    remaining_params = [
        param for param in trainable_params
        if id(param) not in grouped_param_ids
    ]

    optim_groups = head_groups
    if remaining_params:
        optim_groups = optim_groups + [{
            'params': remaining_params,
            'weight_decay': weight_decay,
        }]
        grouped_param_ids.update(id(param) for param in remaining_params)

    all_trainable_param_ids = {id(param) for param in trainable_params}
    if grouped_param_ids != all_trainable_param_ids:
        raise ValueError('Optimizer parameter groups must include each trainable parameter exactly once')

    return AdamW(optim_groups, lr=lr, weight_decay=weight_decay)
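

# A design note on the function above (a sketch of intent, not new behavior):
# when the diffusion head is a transformer exposing get_optim_groups(), its
# decay/no-decay parameter groups (e.g. biases and LayerNorm weights) are
# reused verbatim, every other trainable parameter falls into one default
# group, and the id()-based bookkeeping guarantees nothing is optimized twice.
# optimizer = build_training_optimizer(agent, lr=1e-4, weight_decay=1e-5)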


def _init_swanlab(cfg):
    """Initialize SwanLab on demand, failing fast when the dependency is missing or authentication fails."""
    if not bool(cfg.train.get('use_swanlab', False)):
        return None

    try:
        swanlab = importlib.import_module("swanlab")
    except ImportError as exc:
        raise RuntimeError(
            "SwanLab logging is enabled, but the 'swanlab' package could not be imported."
        ) from exc

    def _to_plain_config(value):
        if isinstance(value, dict):
            return {key: _to_plain_config(val) for key, val in value.items()}
        if isinstance(value, list):
            return [_to_plain_config(item) for item in value]
        if isinstance(value, tuple):
            return tuple(_to_plain_config(item) for item in value)

        items_method = getattr(value, 'items', None)
        if callable(items_method):
            try:
                return {key: _to_plain_config(val) for key, val in items_method()}
            except Exception:
                pass

        return value

    swanlab_config = {
        key: _to_plain_config(cfg[key])
        for key in ('train', 'data', 'agent')
        if key in cfg
    }

    init_kwargs = {
        'project': cfg.train.get('swanlab_project', 'roboimi-vla'),
        'config': swanlab_config,
    }
    run_name = cfg.train.get('swanlab_run_name', None)
    if run_name:
        init_kwargs['experiment_name'] = run_name

    try:
        swanlab.init(**init_kwargs)
    except Exception as exc:
        raise RuntimeError(
            f"SwanLab logging is enabled, but SwanLab init/login failed: {exc}"
        ) from exc

    return swanlab


def _log_to_swanlab(swanlab_module, payload, step=None):
    if swanlab_module is None:
        return
    try:
        swanlab_module.log(payload, step=step)
    except Exception as exc:
        log.warning(f"SwanLab log failed at step {step}: {exc}")


def _finish_swanlab(swanlab_module):
    if swanlab_module is None:
        return
    try:
        swanlab_module.finish()
    except Exception as exc:
        log.warning(f"SwanLab finish failed: {exc}")


def _run_training(cfg: DictConfig):
    """
    VLA training script (ResNet backbone + diffusion policy).

    This script:
    1. Loads the dataset from HDF5 files
    2. Instantiates a VLAAgent with a ResNet vision encoder
    3. Trains the diffusion-based action prediction model
    4. Saves checkpoints periodically
    """

    # Print the config
    print("=" * 80)
    print("VLA training config:")
    print("=" * 80)
    print(OmegaConf.to_yaml(cfg))
    print("=" * 80)

    log.info(f"🚀 Starting VLA training (device: {cfg.train.device})")
    swanlab_module = _init_swanlab(cfg)
    try:
        # Create the checkpoint directory
        checkpoint_dir = Path("checkpoints")
        checkpoint_dir.mkdir(exist_ok=True)
        default_best_model_path = checkpoint_dir / "vla_model_best.pt"

        # =========================================================================
        # 1. Instantiate the dataset and DataLoader
        # =========================================================================
        log.info("📦 Loading dataset...")
        try:
            dataset = instantiate(cfg.data)
            log.info(f"✅ Dataset loaded. Total samples: {len(dataset)}")
        except Exception as e:
            log.error(f"❌ Dataset loading failed: {e}")
            raise

        # Train/validation split
        val_split = float(cfg.train.get('val_split', 0.1))
        seed = int(cfg.train.get('seed', 42))
        val_size = int(len(dataset) * val_split)
        train_size = len(dataset) - val_size
        if val_size > 0:
            train_dataset, val_dataset = random_split(
                dataset,
                [train_size, val_size],
                generator=torch.Generator().manual_seed(seed)
            )
            log.info(f"✅ Dataset split: train={train_size}, val={val_size} (val ratio={val_split})")
        else:
            train_dataset, val_dataset = dataset, None
            log.info("✅ Dataset split: all samples used for training, val=0 (val ratio=0)")

        train_batch_size = int(cfg.train.batch_size)
        train_drop_last = len(train_dataset) >= train_batch_size
        if not train_drop_last:
            log.warning(
                "⚠️ Training set size (%s) is smaller than batch_size (%s); keeping the last incomplete batch to avoid an empty train loader",
                len(train_dataset),
                train_batch_size,
            )

        train_loader = DataLoader(
            train_dataset,
            batch_size=train_batch_size,
            shuffle=True,
            num_workers=cfg.train.num_workers,
            pin_memory=(cfg.train.device != "cpu"),
            persistent_workers=False,
            drop_last=train_drop_last
        )

        val_loader = None
        if val_dataset is not None:
            val_loader = DataLoader(
                val_dataset,
                batch_size=train_batch_size,
                shuffle=False,
                num_workers=cfg.train.num_workers,
                pin_memory=(cfg.train.device != "cpu"),
                persistent_workers=False,
                drop_last=False
            )

        log.info(f"✅ Train loader batches per epoch: {len(train_loader)}")
        if val_loader is not None:
            log.info(f"✅ Val loader batches per epoch: {len(val_loader)}")

        # =========================================================================
        # 2. Load dataset statistics (passed on to the agent)
        # =========================================================================
        log.info("💾 Loading dataset statistics...")
        dataset_stats = None
        try:
            dataset_dir = cfg.data.get('dataset_dir', 'roboimi/demos/dataset/sim_transfer')
            stats_path = Path(dataset_dir) / 'dataset_stats.pkl'

            if stats_path.exists():
                with open(stats_path, 'rb') as f:
                    stats = pickle.load(f)

                # Flatten the stats dict (nested -> flat) to match the format NormalizationModule expects
                dataset_stats = {
                    'action_mean': stats['action_mean'].tolist(),
                    'action_std': stats['action_std'].tolist(),
                    'action_min': stats['action_min'].tolist(),
                    'action_max': stats['action_max'].tolist(),
                    'qpos_mean': stats['qpos_mean'].tolist(),
                    'qpos_std': stats['qpos_std'].tolist(),
                    'qpos_min': stats['qpos_min'].tolist(),
                    'qpos_max': stats['qpos_max'].tolist(),
                }
                log.info(f"✅ Dataset statistics loaded (normalization: {cfg.agent.normalization_type})")
            else:
                log.warning(f"⚠️ Stats file not found: {stats_path}")
                log.warning("⚠️ Actions cannot be denormalized at inference time!")

        except Exception as e:
            log.warning(f"⚠️ Failed to load statistics: {e}")
            log.warning("⚠️ Training will continue, but inference may not work correctly")

        # =========================================================================
        # 3. Instantiate the VLA agent
        # =========================================================================
        log.info("🤖 Initializing VLA agent...")
        try:
            # Pass dataset_stats (and normalization_type) on to the agent
            agent = instantiate(cfg.agent, dataset_stats=dataset_stats)
            agent.to(cfg.train.device)
            agent.train()
            log.info(f"✅ Agent initialized and moved to {cfg.train.device}")

            # Count parameters
            total_params = sum(p.numel() for p in agent.parameters())
            trainable_params = sum(p.numel() for p in agent.parameters() if p.requires_grad)
            log.info(f"📊 Total parameters: {total_params:,}")
            log.info(f"📊 Trainable parameters: {trainable_params:,}")

        except Exception as e:
            log.error(f"❌ Agent initialization failed: {e}")
            raise

        # =========================================================================
        # 3.1 Load weights from a pretrained checkpoint (finetuning)
        # =========================================================================
        pretrained_ckpt = cfg.train.get('pretrained_ckpt', None)
        if pretrained_ckpt is not None:
            ckpt_path = Path(pretrained_ckpt)
            if ckpt_path.exists():
                log.info(f"🔄 [Finetune] Loading weights from pretrained checkpoint: {ckpt_path}")
                try:
                    checkpoint = torch.load(ckpt_path, map_location=cfg.train.device)

                    # Load model weights only (not the optimizer or scheduler)
                    missing_keys, unexpected_keys = agent.load_state_dict(
                        checkpoint['model_state_dict'],
                        strict=False  # allow partial loading when the architectures do not fully match
                    )

                    log.info("✅ [Finetune] Model weights loaded successfully")

                    if missing_keys:
                        log.warning(f"⚠️ [Finetune] Missing keys ({len(missing_keys)}): {missing_keys[:5]}...")
                    if unexpected_keys:
                        log.warning(f"⚠️ [Finetune] Unexpected keys ({len(unexpected_keys)}): {unexpected_keys[:5]}...")

                    log.info(f"📊 [Finetune] Pretrained info: step={checkpoint.get('step', 'N/A')}, loss={checkpoint.get('loss', 'N/A')}")
                    log.info(f"📈 [Finetune] Using the new training config (lr={cfg.train.lr}, max_steps={cfg.train.max_steps})")

                except Exception as e:
                    log.error(f"❌ [Finetune] Failed to load checkpoint: {e}")
                    log.warning("⚠️ Training will start from scratch")
            else:
                log.error(f"❌ [Finetune] Checkpoint file does not exist: {ckpt_path}")
                log.warning("⚠️ Training will start from scratch")

        # =========================================================================
        # 4. Set up the optimizer and learning-rate scheduler
        # =========================================================================
        weight_decay = float(cfg.train.get('weight_decay', 1e-5))
        grad_clip = float(cfg.train.get('grad_clip', 1.0))

        optimizer = build_training_optimizer(agent, lr=cfg.train.lr, weight_decay=weight_decay)
        log.info(f"🔧 Optimizer: AdamW (lr={cfg.train.lr}, weight_decay={weight_decay})")

        # Set up the learning-rate scheduler with warmup
        warmup_steps = int(cfg.train.get('warmup_steps', 500))
        scheduler_type = cfg.train.get('scheduler_type', 'cosine')
        min_lr = float(cfg.train.get('min_lr', 1e-6))

        scheduler = get_lr_schedule_with_warmup(
            optimizer,
            warmup_steps=warmup_steps,
            max_steps=cfg.train.max_steps,
            scheduler_type=scheduler_type,
            min_lr=min_lr
        )
        log.info(f"📈 LR scheduler: {scheduler_type} with {warmup_steps} warmup steps (min lr={min_lr})")

        # =========================================================================
        # 4.1 Resume training (restore model, optimizer, scheduler, step count)
        # =========================================================================
        def extract_checkpoint_metric_baseline(checkpoint):
            checkpoint_loss = checkpoint.get('loss', None)
            checkpoint_val_loss = checkpoint.get('val_loss', None)
            checkpoint_rollout_reward = checkpoint.get('rollout_avg_reward', None)

            baseline_loss = float('inf')
            baseline_rollout_reward = float('-inf')
            if checkpoint_rollout_reward is not None:
                baseline_rollout_reward = float(checkpoint_rollout_reward)
            if checkpoint_val_loss is not None:
                baseline_loss = float(checkpoint_val_loss)
            elif checkpoint_loss is not None:
                baseline_loss = float(checkpoint_loss)
            return baseline_loss, baseline_rollout_reward

        start_step = 0
        resume_loss = None
        resume_best_loss = float('inf')
        resume_best_rollout_reward = float('-inf')
        best_model_path = None

        resume_ckpt = cfg.train.get('resume_ckpt', None)
        resume_path = resolve_resume_checkpoint(resume_ckpt, checkpoint_dir)
        if resume_ckpt is not None:
            if pretrained_ckpt is not None:
                log.warning("⚠️ [Resume] Both pretrained_ckpt and resume_ckpt are set; resume_ckpt takes precedence")
            if resume_path is None:
                log.warning("⚠️ [Resume] No resumable checkpoint found; training will start from scratch")
            elif not resume_path.exists():
                log.error(f"❌ [Resume] Checkpoint file does not exist: {resume_path}")
                log.warning("⚠️ Training will start from scratch")
            else:
                log.info(f"🔄 [Resume] Resuming training from checkpoint: {resume_path}")
                try:
                    checkpoint = torch.load(resume_path, map_location=cfg.train.device)

                    agent.load_state_dict(checkpoint['model_state_dict'], strict=True)
                    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
                    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

                    resume_step = int(checkpoint['step'])
                    start_step = resume_step + 1

                    loaded_loss = checkpoint.get('loss', None)
                    resume_loss = float(loaded_loss) if loaded_loss is not None else None
                    resume_best_loss, resume_best_rollout_reward = extract_checkpoint_metric_baseline(checkpoint)
                    if (
                        resume_best_rollout_reward != float('-inf')
                        or resume_best_loss != float('inf')
                    ):
                        best_model_path = resume_path

                    if default_best_model_path.exists():
                        try:
                            best_checkpoint = torch.load(default_best_model_path, map_location=cfg.train.device)
                            _, best_checkpoint_rollout_reward = (
                                extract_checkpoint_metric_baseline(best_checkpoint)
                            )
                            if best_checkpoint_rollout_reward != float('-inf'):
                                resume_best_rollout_reward = best_checkpoint_rollout_reward
                                best_model_path = default_best_model_path
                                log.info(
                                    "📈 [Resume] Restored best rollout baseline from best checkpoint: %s",
                                    default_best_model_path,
                                )
                        except Exception as e:
                            log.warning(
                                f"⚠️ [Resume] Failed to read best checkpoint; falling back to the resumed checkpoint's validation baseline: {e}"
                            )

                    log.info(f"✅ [Resume] Resumed successfully: last step={resume_step}, starting from step {start_step}")
                    log.info(f"📈 [Resume] Current learning rate: {optimizer.param_groups[0]['lr']:.2e}")
                except Exception as e:
                    log.error(f"❌ [Resume] Resume failed: {e}")
                    log.warning("⚠️ Training will start from scratch")
                    start_step = 0
                    resume_loss = None
                    resume_best_loss = float('inf')
                    resume_best_rollout_reward = float('-inf')
        # =====================================================================
        # 5. Training loop
        # =====================================================================
        log.info("🏋️ Starting training loop...")

        def build_agent_input(batch_data):
            """Build the agent input format."""
            images = {}
            # SimpleRobotDataset returns keys in the observation.{cam_name} format
            for cam_name in cfg.data.camera_names:
                key = f"observation.{cam_name}"
                if key in batch_data:
                    images[cam_name] = batch_data[key]

            return {
                'images': images,
                'qpos': batch_data['observation.state'],  # SimpleRobotDataset uses observation.state
                'action': batch_data['action'],
                'action_is_pad': batch_data.get('action_is_pad', None)  # pass through the padding mask
            }

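        # For reference, build_agent_input expects a batch shaped roughly like this
        # (keys follow SimpleRobotDataset; the camera name and sizes are illustrative):
        #   batch = {
        #       'observation.top':   float (B, C, H, W),    # one key per camera in cfg.data.camera_names
        #       'observation.state': float (B, state_dim),  # qpos
        #       'action':            float (B, chunk, action_dim),
        #       'action_is_pad':     bool  (B, chunk),      # True where the action chunk was padded
        #   }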
        def save_checkpoint(checkpoint_path: Path, step: int, loss_value, val_loss=None, rollout_avg_reward=None):
            agent_stats = agent.get_normalization_stats()
            torch.save({
                'step': step,
                'model_state_dict': agent.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'loss': loss_value,
                'val_loss': val_loss,
                'rollout_avg_reward': rollout_avg_reward,
                'dataset_stats': agent_stats,  # save the agent's normalization statistics
                'current_lr': optimizer.param_groups[0]['lr'],
            }, checkpoint_path)
            return checkpoint_path

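        # A checkpoint written by save_checkpoint() is restored exactly as in
        # section 4.1 above:
        #   ckpt = torch.load(path, map_location=device)
        #   agent.load_state_dict(ckpt['model_state_dict'])
        #   optimizer.load_state_dict(ckpt['optimizer_state_dict'])
        #   scheduler.load_state_dict(ckpt['scheduler_state_dict'])
        #   start = int(ckpt['step']) + 1  # resume at the step after the saved one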
        def run_validation():
            """Run validation."""
            if val_loader is None:
                return None
            agent.eval()

            # Use a fixed seed so the validation loss is deterministic and
            # therefore comparable across steps
            torch.manual_seed(42)
            if torch.cuda.is_available():
                torch.cuda.manual_seed(42)

            total_loss = 0.0
            num_batches = 0
            with torch.no_grad():
                for val_batch in val_loader:
                    val_batch = recursive_to_device(val_batch, cfg.train.device)
                    val_input = build_agent_input(val_batch)
                    val_loss = agent.compute_loss(val_input)
                    total_loss += val_loss.item()
                    num_batches += 1
            agent.train()
            return total_loss / max(num_batches, 1)

        def run_rollout_validation(checkpoint_path: Path):
            from roboimi.demos.vla_scripts import eval_vla

            rollout_cfg = OmegaConf.create(OmegaConf.to_container(cfg, resolve=False))
            rollout_cfg.eval.ckpt_path = str(checkpoint_path)
            rollout_cfg.eval.num_episodes = int(cfg.train.get('rollout_num_episodes', 1))
            rollout_cfg.eval.headless = True
            rollout_cfg.eval.device = 'cpu'
            rollout_cfg.eval.verbose_action = False

            log.info(
                "🎯 Starting checkpoint rollout validation: %s (episodes=%s, headless=True)",
                checkpoint_path,
                rollout_cfg.eval.num_episodes,
            )
            return eval_vla._run_eval(rollout_cfg)

        def run_checkpoint_rollout_validation(checkpoint_path: Path):
            if not bool(cfg.train.get('rollout_validate_on_checkpoint', False)):
                return None
            return run_rollout_validation(checkpoint_path)

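        # Manual equivalent of the helper above (sketch; the checkpoint path is a
        # placeholder). _run_eval returns a stats dict that includes 'avg_reward':
        #   cfg_copy = OmegaConf.create(OmegaConf.to_container(cfg, resolve=False))
        #   cfg_copy.eval.ckpt_path = 'checkpoints/vla_model_step_1000.pt'
        #   cfg_copy.eval.headless = True
        #   stats = eval_vla._run_eval(cfg_copy)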
        data_iter = iter(train_loader)
        pbar = tqdm(range(start_step, cfg.train.max_steps), desc="Training", ncols=100)

        steps_per_epoch = len(train_loader)
        rollout_val_freq_epochs = int(cfg.train.get('rollout_val_freq_epochs', 0) or 0)
        rollout_validation_enabled = rollout_val_freq_epochs > 0
        best_loss = resume_best_loss
        best_rollout_reward = resume_best_rollout_reward
        last_loss = resume_loss

        if start_step >= cfg.train.max_steps:
            log.warning(
                f"⚠️ [Resume] start_step={start_step} has reached/exceeded max_steps={cfg.train.max_steps}; skipping the training loop"
            )

        for step in pbar:
            try:
                batch = next(data_iter)
            except StopIteration:
                # restart the iterator at the end of an epoch
                data_iter = iter(train_loader)
                batch = next(data_iter)

            # =================================================================
            # Move the batch to the device
            # =================================================================
            batch = recursive_to_device(batch, cfg.train.device)

            # =================================================================
            # Prepare the agent input
            # =================================================================
            # The dataset returns: {action, qpos, image_<cam_name>, ...}
            # The agent expects:   {images: dict, qpos: tensor, action: tensor}
            agent_input = build_agent_input(batch)

            # =================================================================
            # Forward pass and loss computation
            # =================================================================
            try:
                loss = agent.compute_loss(agent_input)
            except Exception as e:
                log.error(f"❌ Forward pass failed at step {step}: {e}")
                raise

            last_loss = loss.item()

            # =================================================================
            # Backward pass and optimization
            # =================================================================
            optimizer.zero_grad()
            loss.backward()

            # Clip gradients to stabilize training
            torch.nn.utils.clip_grad_norm_(agent.parameters(), max_norm=grad_clip)

            optimizer.step()
            scheduler.step()

            # =================================================================
            # Logging
            # =================================================================
            if step % cfg.train.log_freq == 0:
                current_lr = optimizer.param_groups[0]['lr']
                best_loss_to_log = best_loss if best_loss != float('inf') else loss.item()
                pbar.set_postfix({
                    "loss": f"{loss.item():.4f}",
                    "lr": f"{current_lr:.2e}",
                    "best_loss": f"{best_loss_to_log:.4f}"
                })
                log.info(f"Step {step}/{cfg.train.max_steps} | loss: {loss.item():.4f} | lr: {current_lr:.2e}")
                _log_to_swanlab(
                    swanlab_module,
                    {
                        'train/loss': loss.item(),
                        'train/lr': current_lr,
                        'train/best_loss': best_loss_to_log,
                        'train/step': step,
                    },
                    step=step,
                )

            # =================================================================
            # Checkpoint saving and validation
            # =================================================================
            checkpoint_path = None
            val_loss = None
            if step > 0 and step % cfg.train.save_freq == 0:
                # Run validation
                val_loss = run_validation()
                if val_loss is not None:
                    log.info(f"Step {step}/{cfg.train.max_steps} | validation loss: {val_loss:.4f}")
                    _log_to_swanlab(
                        swanlab_module,
                        {'val/loss': val_loss},
                        step=step,
                    )

                checkpoint_path = checkpoint_dir / f"vla_model_step_{step}.pt"
                save_checkpoint(
                    checkpoint_path,
                    step,
                    loss.item(),
                    val_loss=val_loss,
                )
                log.info(f"💾 Checkpoint saved: {checkpoint_path}")

                # Until the first rollout average reward is available, fall back
                # to the loss as the best-model metric
                if best_rollout_reward == float('-inf'):
                    eval_loss = val_loss if val_loss is not None else loss.item()
                    if eval_loss < best_loss:
                        best_loss = eval_loss
                        best_model_path = default_best_model_path
                        save_checkpoint(
                            best_model_path,
                            step,
                            loss.item(),
                            val_loss=val_loss,
                        )
                        log.info(f"🌟 Best model updated: {best_model_path} (validation loss: {best_loss:.4f})")

                checkpoint_rollout_stats = run_checkpoint_rollout_validation(checkpoint_path)
                checkpoint_rollout_avg_reward = (
                    checkpoint_rollout_stats.get('avg_reward')
                    if checkpoint_rollout_stats is not None else None
                )
                if checkpoint_rollout_avg_reward is not None:
                    log.info(
                        f"Step {step}/{cfg.train.max_steps} | checkpoint rollout average reward: "
                        f"{checkpoint_rollout_avg_reward:.4f}"
                    )
                    _log_to_swanlab(
                        swanlab_module,
                        {'rollout/avg_reward': checkpoint_rollout_avg_reward},
                        step=step,
                    )
                    if checkpoint_rollout_avg_reward > best_rollout_reward:
                        best_rollout_reward = checkpoint_rollout_avg_reward
                        best_model_path = default_best_model_path
                        save_checkpoint(
                            best_model_path,
                            step,
                            loss.item(),
                            val_loss=val_loss,
                            rollout_avg_reward=checkpoint_rollout_avg_reward,
                        )
                        log.info(
                            f"🌟 Best model updated: {best_model_path} "
                            f"(checkpoint rollout average reward: {best_rollout_reward:.4f})"
                        )

            completed_steps = step + 1
            completed_epoch = (
                completed_steps // steps_per_epoch
                if steps_per_epoch > 0 else 0
            )
            should_run_epoch_rollout = (
                rollout_validation_enabled
                and steps_per_epoch > 0
                and completed_steps % steps_per_epoch == 0
                and completed_epoch > 0
                and completed_epoch % rollout_val_freq_epochs == 0
            )
            if should_run_epoch_rollout:
                if checkpoint_path is None:
                    checkpoint_path = checkpoint_dir / f"vla_model_step_{step}.pt"
                    save_checkpoint(
                        checkpoint_path,
                        step,
                        loss.item(),
                        val_loss=val_loss,
                    )
                    log.info(f"💾 Checkpoint saved before epoch rollout validation: {checkpoint_path}")

                rollout_stats = run_rollout_validation(checkpoint_path)
                rollout_avg_reward = (
                    rollout_stats.get('avg_reward')
                    if rollout_stats is not None else None
                )
                if rollout_avg_reward is not None:
                    log.info(
                        f"Step {step}/{cfg.train.max_steps} | epoch {completed_epoch} "
                        f"rollout average reward: {rollout_avg_reward:.4f}"
                    )
                    _log_to_swanlab(
                        swanlab_module,
                        {
                            'rollout/avg_reward': rollout_avg_reward,
                            'rollout/epoch': completed_epoch,
                        },
                        step=step,
                    )
                    if rollout_avg_reward > best_rollout_reward:
                        best_rollout_reward = rollout_avg_reward
                        best_model_path = default_best_model_path
                        save_checkpoint(
                            best_model_path,
                            step,
                            loss.item(),
                            val_loss=val_loss,
                            rollout_avg_reward=rollout_avg_reward,
                        )
                        log.info(
                            f"🌟 Best model updated: {best_model_path} "
                            f"(epoch {completed_epoch} rollout average reward: {best_rollout_reward:.4f})"
                        )

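        # Worked example of the epoch-rollout gating above (illustrative numbers):
        # with steps_per_epoch = 250 and rollout_val_freq_epochs = 2,
        #   step 499 -> completed_steps = 500,  completed_epoch = 2 -> rollout runs
        #   step 624 -> completed_steps = 625,  625 % 250 != 0      -> no rollout
        #   step 749 -> completed_steps = 750,  completed_epoch = 3 -> no rollout (3 % 2 != 0)
        #   step 999 -> completed_steps = 1000, completed_epoch = 4 -> rollout runs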
        # =====================================================================
        # 6. Save the final model
        # =====================================================================
        final_model_path = checkpoint_dir / "vla_model_final.pt"
        save_checkpoint(
            final_model_path,
            cfg.train.max_steps,
            last_loss,
        )
        log.info(f"💾 Final model saved: {final_model_path}")
        _log_to_swanlab(
            swanlab_module,
            {
                'final/checkpoint_path': str(final_model_path),
                'final/best_checkpoint_path': (
                    str(best_model_path) if best_model_path is not None else ''
                ),
            },
            step=cfg.train.max_steps,
        )

        log.info("✅ Training completed successfully!")
        if last_loss is not None:
            log.info(f"📊 Final loss: {last_loss:.4f}")
        else:
            log.info("📊 Final loss: N/A (no training steps were executed)")
        if best_rollout_reward != float('-inf'):
            log.info(f"📊 Best rollout average reward: {best_rollout_reward:.4f}")
        elif best_loss != float('inf'):
            log.info(f"📊 Best loss: {best_loss:.4f}")
        else:
            log.info("📊 Best validation metric: N/A (no valid rollout/validation loss)")
    finally:
        _finish_swanlab(swanlab_module)


@hydra.main(version_base=None, config_path="../../vla/conf", config_name="config")
def main(cfg: DictConfig):
    _run_training(cfg)


if __name__ == "__main__":
    main()
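# Typical launches via Hydra overrides (the script filename is a placeholder;
# the config keys match the usage above, and the checkpoint path is illustrative):
#   python train_vla.py train.max_steps=20000 train.lr=1e-4
#   python train_vla.py train.resume_ckpt=checkpoints/vla_model_step_5000.pt
#   python train_vla.py train.rollout_val_freq_epochs=2 train.rollout_num_episodes=3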
@@ -1,201 +0,0 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2020 - present, Facebook, Inc

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
@@ -1,9 +0,0 @@
This part of the codebase is modified from DETR https://github.com/facebookresearch/detr under APACHE 2.0.

@article{Carion2020EndtoEndOD,
  title={End-to-End Object Detection with Transformers},
  author={Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.12872}
}
@@ -1,106 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import argparse
from pathlib import Path

import numpy as np
import torch
from .models import build_ACT_model, build_CNNMLP_model


def get_args_parser():
    parser = argparse.ArgumentParser('Set transformer detector', add_help=False)
    parser.add_argument('--lr', default=1e-4, type=float)  # will be overridden
    parser.add_argument('--lr_backbone', default=1e-5, type=float)  # will be overridden
    parser.add_argument('--batch_size', default=2, type=int)  # not used
    parser.add_argument('--weight_decay', default=1e-4, type=float)
    parser.add_argument('--epochs', default=300, type=int)  # not used
    parser.add_argument('--lr_drop', default=200, type=int)  # not used
    parser.add_argument('--clip_max_norm', default=0.1, type=float,  # not used
                        help='gradient clipping max norm')
    parser.add_argument('--qpos_noise_std', action='store', default=0, type=float, required=False,
                        help='magnitude of the random noise added to qpos during training')

    # Model parameters
    # * Backbone
    parser.add_argument('--backbone', default='resnet18', type=str,  # will be overridden
                        help="Name of the convolutional backbone to use")
    parser.add_argument('--dilation', action='store_true',
                        help="If true, we replace stride with dilation in the last convolutional block (DC5)")
    parser.add_argument('--position_embedding', default='sine', type=str, choices=('sine', 'learned'),
                        help="Type of positional embedding to use on top of the image features")
    parser.add_argument('--camera_names', default=[], type=list,  # will be overridden
                        help="A list of camera names")

    # * Transformer
    parser.add_argument('--enc_layers', default=4, type=int,  # will be overridden
                        help="Number of encoding layers in the transformer")
    parser.add_argument('--dec_layers', default=6, type=int,  # will be overridden
                        help="Number of decoding layers in the transformer")
    parser.add_argument('--dim_feedforward', default=2048, type=int,  # will be overridden
                        help="Intermediate size of the feedforward layers in the transformer blocks")
    parser.add_argument('--hidden_dim', default=256, type=int,  # will be overridden
                        help="Size of the embeddings (dimension of the transformer)")
    parser.add_argument('--dropout', default=0.1, type=float,
                        help="Dropout applied in the transformer")
    parser.add_argument('--nheads', default=8, type=int,  # will be overridden
                        help="Number of attention heads inside the transformer's attentions")
    parser.add_argument('--num_queries', default=400, type=int,  # will be overridden
                        help="Number of query slots")
    parser.add_argument('--pre_norm', action='store_true')
    parser.add_argument('--state_dim', default=14, type=int)
    parser.add_argument('--action_dim', default=14, type=int)

    # * Segmentation
    parser.add_argument('--masks', action='store_true',
                        help="Train segmentation head if the flag is provided")

    return parser


def build_ACT_model_and_optimizer(args_override):
    parser = argparse.ArgumentParser('DETR training and evaluation script', parents=[get_args_parser()])
    args = parser.parse_args()

    for k, v in args_override.items():
        setattr(args, k, v)

    model = build_ACT_model(args)
    model.cuda()

    param_dicts = [
        {"params": [p for n, p in model.named_parameters() if "backbone" not in n and p.requires_grad]},
        {
            "params": [p for n, p in model.named_parameters() if "backbone" in n and p.requires_grad],
            "lr": args.lr_backbone,
        },
    ]
    optimizer = torch.optim.AdamW(param_dicts, lr=args.lr,
                                  weight_decay=args.weight_decay)

    return model, optimizer


def build_CNNMLP_model_and_optimizer(args_override):
    parser = argparse.ArgumentParser('DETR training and evaluation script', parents=[get_args_parser()])
    args = parser.parse_args()

    for k, v in args_override.items():
        setattr(args, k, v)

    model = build_CNNMLP_model(args)
    model.cuda()

    param_dicts = [
        {"params": [p for n, p in model.named_parameters() if "backbone" not in n and p.requires_grad]},
        {
            "params": [p for n, p in model.named_parameters() if "backbone" in n and p.requires_grad],
            "lr": args.lr_backbone,
        },
    ]
    optimizer = torch.optim.AdamW(param_dicts, lr=args.lr,
                                  weight_decay=args.weight_decay)

    return model, optimizer

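# Usage sketch for the builders above: they parse the default args and then apply
# a plain dict of overrides before constructing the model. Note that parse_args()
# reads sys.argv, so unrecognized CLI flags can interfere. All values below are
# illustrative assumptions, not defaults from this repo:
#   override = {
#       'lr': 1e-5, 'lr_backbone': 1e-5, 'num_queries': 100,
#       'camera_names': ['top'], 'state_dim': 14, 'action_dim': 14,
#   }
#   model, optimizer = build_ACT_model_and_optimizer(override)  # model is moved to CUDA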
@@ -1,9 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
from .detr_vae import build as build_vae
from .detr_vae import build_cnnmlp as build_cnnmlp


def build_ACT_model(args):
    return build_vae(args)


def build_CNNMLP_model(args):
    return build_cnnmlp(args)
@@ -1,300 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR model and criterion classes.
"""
import torch
from torch import nn
from torch.autograd import Variable
from .backbone import build_backbone
from .transformer import build_transformer, TransformerEncoder, TransformerEncoderLayer

import numpy as np


def reparametrize(mu, logvar):
    std = logvar.div(2).exp()
    eps = Variable(std.data.new(std.size()).normal_())
    return mu + std * eps


def get_sinusoid_encoding_table(n_position, d_hid):
    def get_position_angle_vec(position):
        return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

    sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

    return torch.FloatTensor(sinusoid_table).unsqueeze(0)

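# Quick shape check for the table above (numbers illustrative): the DETRVAE encoder
# below registers a table covering [CLS] + qpos + num_queries action tokens, e.g.
#   get_sinusoid_encoding_table(1 + 1 + 100, 256).shape -> torch.Size([1, 102, 256])
# Even feature indices hold sin, odd indices hold cos of the same angles.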
class DETRVAE(nn.Module):
    """ DETR-style CVAE module that predicts action sequences (adapted from DETR's detection model) """
    def __init__(self, backbones, transformer, encoder, state_dim, action_dim, num_queries, camera_names):
        """ Initializes the model.
        Parameters:
            backbones: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            state_dim: robot state dimension of the environment
            num_queries: number of query slots, i.e. the length of the predicted action chunk
                         (in DETR this was the maximal number of detected objects)
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.camera_names = camera_names
        self.transformer = transformer
        self.encoder = encoder
        hidden_dim = transformer.d_model
        self.action_head = nn.Linear(hidden_dim, action_dim)
        self.is_pad_head = nn.Linear(hidden_dim, 1)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        if backbones is not None:
            self.input_proj = nn.Conv2d(backbones[0].num_channels, hidden_dim, kernel_size=1)
            self.backbones = nn.ModuleList(backbones)
            self.input_proj_robot_state = nn.Linear(state_dim, hidden_dim)
        else:
            raise NotImplementedError
            # input_dim = 14 + 7  # robot_state + env_state
            # self.input_proj_robot_state = nn.Linear(state_dim, hidden_dim)
            # self.input_proj_env_state = nn.Linear(7, hidden_dim)
            # self.pos = torch.nn.Embedding(2, hidden_dim)
            # self.backbones = None

        # encoder extra parameters
        self.latent_dim = 32  # final size of latent z  # TODO tune
        self.cls_embed = nn.Embedding(1, hidden_dim)  # extra cls token embedding
        self.encoder_action_proj = nn.Linear(action_dim, hidden_dim)  # project action to embedding
        self.encoder_joint_proj = nn.Linear(state_dim, hidden_dim)  # project qpos to embedding
        self.latent_proj = nn.Linear(hidden_dim, self.latent_dim*2)  # project hidden state to latent mean and log-variance
        self.register_buffer('pos_table', get_sinusoid_encoding_table(1+1+num_queries, hidden_dim))  # [CLS], qpos, a_seq

        # decoder extra parameters
        self.latent_out_proj = nn.Linear(self.latent_dim, hidden_dim)  # project latent sample to embedding
        self.additional_pos_embed = nn.Embedding(2, hidden_dim)  # learned position embedding for proprio and latent

    def forward(self, qpos, image, env_state, actions=None, is_pad=None):
        """
        qpos: batch, qpos_dim
        image: batch, num_cam, channel, height, width
        env_state: None
        actions: batch, seq, action_dim
        """
        is_training = actions is not None  # train or val
        bs, _ = qpos.shape
        ### Obtain latent z from action sequence
        if is_training:
            # project action sequence to embedding dim, and concat with a CLS token
            action_embed = self.encoder_action_proj(actions)  # (bs, seq, hidden_dim)
            qpos_embed = self.encoder_joint_proj(qpos)  # (bs, hidden_dim)
            qpos_embed = torch.unsqueeze(qpos_embed, axis=1)  # (bs, 1, hidden_dim)
            cls_embed = self.cls_embed.weight  # (1, hidden_dim)
            cls_embed = torch.unsqueeze(cls_embed, axis=0).repeat(bs, 1, 1)  # (bs, 1, hidden_dim)
            encoder_input = torch.cat([cls_embed, qpos_embed, action_embed], axis=1)  # (bs, seq+1, hidden_dim)
            encoder_input = encoder_input.permute(1, 0, 2)  # (seq+1, bs, hidden_dim)
            # do not mask cls token
            cls_joint_is_pad = torch.full((bs, 2), False).to(qpos.device)  # False: not a padding
            is_pad = torch.cat([cls_joint_is_pad, is_pad], axis=1)  # (bs, seq+1)
            # obtain position embedding
            pos_embed = self.pos_table.clone().detach()
            pos_embed = pos_embed.permute(1, 0, 2)  # (seq+1, 1, hidden_dim)
            # query model
            encoder_output = self.encoder(encoder_input, pos=pos_embed, src_key_padding_mask=is_pad)
            encoder_output = encoder_output[0]  # take cls output only
            latent_info = self.latent_proj(encoder_output)
            mu = latent_info[:, :self.latent_dim]
            logvar = latent_info[:, self.latent_dim:]
            latent_sample = reparametrize(mu, logvar)
            latent_input = self.latent_out_proj(latent_sample)
        else:
            mu = logvar = None
            latent_sample = torch.zeros([bs, self.latent_dim], dtype=torch.float32).to(qpos.device)
            latent_input = self.latent_out_proj(latent_sample)

        if self.backbones is not None:
            # Image observation features and position embeddings
            all_cam_features = []
            all_cam_pos = []

            # print(f"Image shape: {image.shape}, Number of cameras: {len(self.camera_names)}")

            for cam_id, cam_name in enumerate(self.camera_names):
                # features, pos = self.backbones[0](image[:, cam_id])  # HARDCODED
                features, pos = self.backbones[cam_id](image[:, cam_id])
                features = features[0]  # take the last layer feature
                pos = pos[0]
                all_cam_features.append(self.input_proj(features))
                all_cam_pos.append(pos)

            # proprioception features
            proprio_input = self.input_proj_robot_state(qpos)
            # fold camera dimension into width dimension
            src = torch.cat(all_cam_features, axis=3)
            pos = torch.cat(all_cam_pos, axis=3)
            hs = self.transformer(src, None, self.query_embed.weight, pos, latent_input, proprio_input, self.additional_pos_embed.weight)[0]
        else:
            qpos = self.input_proj_robot_state(qpos)
            env_state = self.input_proj_env_state(env_state)
            transformer_input = torch.cat([qpos, env_state], axis=1)  # seq length = 2
            hs = self.transformer(transformer_input, None, self.query_embed.weight, self.pos.weight)[0]
        a_hat = self.action_head(hs)
        is_pad_hat = self.is_pad_head(hs)
        return a_hat, is_pad_hat, [mu, logvar]


class CNNMLP(nn.Module):
    def __init__(self, backbones, state_dim, camera_names):
        """ Initializes the model.
        Parameters:
            backbones: torch module of the backbone to be used. See backbone.py
            state_dim: robot state dimension of the environment
        """
        super().__init__()
        self.camera_names = camera_names
        self.action_head = nn.Linear(1000, state_dim)  # TODO add more
        if backbones is not None:
            self.backbones = nn.ModuleList(backbones)
            backbone_down_projs = []
            for backbone in backbones:
                down_proj = nn.Sequential(
                    nn.Conv2d(backbone.num_channels, 128, kernel_size=5),
                    nn.Conv2d(128, 64, kernel_size=5),
                    nn.Conv2d(64, 32, kernel_size=5)
                )
                backbone_down_projs.append(down_proj)
            self.backbone_down_projs = nn.ModuleList(backbone_down_projs)

            mlp_in_dim = 768 * len(backbones) + 14
            self.mlp = mlp(input_dim=mlp_in_dim, hidden_dim=1024, output_dim=14, hidden_depth=2)
        else:
            raise NotImplementedError

    def forward(self, qpos, image, env_state, actions=None):
        """
        qpos: batch, qpos_dim
        image: batch, num_cam, channel, height, width
        env_state: None
        actions: batch, seq, action_dim
        """
        is_training = actions is not None  # train or val
        bs, _ = qpos.shape
        # Image observation features and position embeddings
        all_cam_features = []
        for cam_id, cam_name in enumerate(self.camera_names):
            features, pos = self.backbones[cam_id](image[:, cam_id])
            features = features[0]  # take the last layer feature
            pos = pos[0]  # not used
            all_cam_features.append(self.backbone_down_projs[cam_id](features))
        # flatten everything
        flattened_features = []
        for cam_feature in all_cam_features:
            flattened_features.append(cam_feature.reshape([bs, -1]))
        flattened_features = torch.cat(flattened_features, axis=1)  # 768 each
        features = torch.cat([flattened_features, qpos], axis=1)  # qpos: 14
        a_hat = self.mlp(features)
        return a_hat


def mlp(input_dim, hidden_dim, output_dim, hidden_depth):
    if hidden_depth == 0:
        mods = [nn.Linear(input_dim, output_dim)]
    else:
        mods = [nn.Linear(input_dim, hidden_dim), nn.ReLU(inplace=True)]
        for _ in range(hidden_depth - 1):
            mods += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU(inplace=True)]
        mods.append(nn.Linear(hidden_dim, output_dim))
    trunk = nn.Sequential(*mods)
    return trunk

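# Example expansion of the helper above: with two cameras the CNNMLP trunk is
#   mlp(input_dim=768 * 2 + 14, hidden_dim=1024, output_dim=14, hidden_depth=2)
#   -> Linear(1550, 1024) -> ReLU -> Linear(1024, 1024) -> ReLU -> Linear(1024, 14)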
def build_encoder(args):
    d_model = args.hidden_dim  # 256
    dropout = args.dropout  # 0.1
    nhead = args.nheads  # 8
    dim_feedforward = args.dim_feedforward  # 2048
    num_encoder_layers = args.enc_layers  # 4  # TODO shared with VAE decoder
    normalize_before = args.pre_norm  # False
    activation = "relu"

    encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                            dropout, activation, normalize_before)
    encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
    encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

    return encoder


def build(args):
    state_dim = args.state_dim
    action_dim = args.action_dim

    # From state
    # backbone = None  # from state for now, no need for conv nets
    # From image
    backbones = []
    # backbone = build_backbone(args)
    # backbones.append(backbone)
    for _ in args.camera_names:
        backbone = build_backbone(args)
        backbones.append(backbone)

    transformer = build_transformer(args)

    encoder = build_encoder(args)

    model = DETRVAE(
        backbones,
        transformer,
        encoder,
        state_dim=state_dim,
        action_dim=action_dim,
        num_queries=args.num_queries,
        camera_names=args.camera_names,
    )

    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("number of parameters: %.2fM" % (n_parameters/1e6,))

    return model


def build_cnnmlp(args):
    state_dim = 14  # TODO hardcoded

    # From state
    # backbone = None  # from state for now, no need for conv nets
    # From image
    backbones = []
    for _ in args.camera_names:
        backbone = build_backbone(args)
        backbones.append(backbone)

    model = CNNMLP(
        backbones,
        state_dim=state_dim,
        camera_names=args.camera_names,
    )

    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("number of parameters: %.2fM" % (n_parameters/1e6,))

    return model

@@ -1,312 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR Transformer class.

Copy-paste from torch.nn.Transformer with modifications:
    * positional encodings are passed in MHattention
    * extra LN at the end of encoder is removed
    * decoder returns a stack of activations from all decoding layers
"""
import copy
from typing import Optional, List

import torch
import torch.nn.functional as F
from torch import nn, Tensor


class Transformer(nn.Module):

    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed, latent_input=None, proprio_input=None, additional_pos_embed=None):
        # TODO flatten only when input has H and W
        if len(src.shape) == 4:  # has H and W
            # flatten NxCxHxW to HWxNxC
            bs, c, h, w = src.shape
            src = src.flatten(2).permute(2, 0, 1)
            pos_embed = pos_embed.flatten(2).permute(2, 0, 1).repeat(1, bs, 1)
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
            # mask = mask.flatten(1)

            additional_pos_embed = additional_pos_embed.unsqueeze(1).repeat(1, bs, 1)  # seq, bs, dim
            pos_embed = torch.cat([additional_pos_embed, pos_embed], axis=0)

            addition_input = torch.stack([latent_input, proprio_input], axis=0)
            src = torch.cat([addition_input, src], axis=0)
        else:
            assert len(src.shape) == 3
            # flatten NxHWxC to HWxNxC
            bs, hw, c = src.shape
            src = src.permute(1, 0, 2)
            pos_embed = pos_embed.unsqueeze(1).repeat(1, bs, 1)
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)

        tgt = torch.zeros_like(query_embed)
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)
        hs = hs.transpose(1, 2)
        return hs


class TransformerEncoder(nn.Module):

    def __init__(self, encoder_layer, num_layers, norm=None):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src

        for layer in self.layers:
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)

        if self.norm is not None:
            output = self.norm(output)

        return output


class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []

        for layer in self.layers:
            output = layer(output, memory, tgt_mask=tgt_mask,
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:
                intermediate.append(self.norm(output))

        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        if self.return_intermediate:
            return torch.stack(intermediate)

        return output.unsqueeze(0)


class TransformerEncoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(self,
                     src,
                     src_mask: Optional[Tensor] = None,
                     src_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(src, pos)
        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src

    def forward_pre(self, src,
                    src_mask: Optional[Tensor] = None,
                    src_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None):
        src2 = self.norm1(src)
        q = k = self.with_pos_embed(src2, pos)
        src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
        src = src + self.dropout2(src2)
        return src

    def forward(self, src,
                src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        return self.forward_post(src, src_mask, src_key_padding_mask, pos)


class TransformerDecoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(self, tgt, memory,
                     tgt_mask: Optional[Tensor] = None,
                     memory_mask: Optional[Tensor] = None,
                     tgt_key_padding_mask: Optional[Tensor] = None,
                     memory_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None,
                     query_pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(tgt, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
                                    tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
        return self.forward_post(tgt, memory, tgt_mask, memory_mask,
                                 tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)


def _get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


def build_transformer(args):
    return Transformer(
        d_model=args.hidden_dim,
        dropout=args.dropout,
        nhead=args.nheads,
        dim_feedforward=args.dim_feedforward,
        num_encoder_layers=args.enc_layers,
        num_decoder_layers=args.dec_layers,
        normalize_before=args.pre_norm,
        return_intermediate_dec=True,
    )

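# Usage sketch (an argparse.Namespace stands in for the parsed args; the values
# are illustrative, matching the defaults quoted in the comments above):
#   from argparse import Namespace
#   args = Namespace(hidden_dim=256, dropout=0.1, nheads=8, dim_feedforward=2048,
#                    enc_layers=4, dec_layers=6, pre_norm=False)
#   transformer = build_transformer(args)  # decoder returns intermediate activations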
def _get_activation_fn(activation):
    """Return an activation function given a string"""
    if activation == "relu":
        return F.relu
    if activation == "gelu":
        return F.gelu
    if activation == "glu":
        return F.glu
    raise RuntimeError(f"activation should be relu/gelu/glu, not {activation}.")
@@ -1,163 +0,0 @@
import torch.nn as nn
from torch.nn import functional as F
import torchvision.transforms as transforms
from torchvision.transforms import v2
import torch
from roboimi.detr.main import build_ACT_model_and_optimizer, build_CNNMLP_model_and_optimizer


class ACTPolicy(nn.Module):
    def __init__(self, args_override):
        super().__init__()
        model, optimizer = build_ACT_model_and_optimizer(args_override)
        self.model = model  # CVAE decoder
        self.optimizer = optimizer
        self.kl_weight = args_override['kl_weight']
        print(f'KL Weight {self.kl_weight}')

    def __call__(self, qpos, image, actions=None, is_pad=None):
        env_state = None
        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
        image = normalize(image)
        if actions is not None:  # training time
            actions = actions[:, :self.model.num_queries]
            is_pad = is_pad[:, :self.model.num_queries]

            a_hat, is_pad_hat, (mu, logvar) = self.model(qpos, image, env_state, actions, is_pad)
            total_kld, dim_wise_kld, mean_kld = kl_divergence(mu, logvar)
            loss_dict = dict()
            all_l1 = F.l1_loss(actions, a_hat, reduction='none')
            l1 = (all_l1 * ~is_pad.unsqueeze(-1)).mean()
            loss_dict['l1'] = l1
            loss_dict['kl'] = total_kld[0]
            loss_dict['loss'] = loss_dict['l1'] + loss_dict['kl'] * self.kl_weight
            return loss_dict
        else:  # inference time
            a_hat, _, (_, _) = self.model(qpos, image, env_state)  # no action, sample from prior
            return a_hat

    def configure_optimizers(self):
        return self.optimizer

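# The 'kl' term above is the closed-form KL divergence between the CVAE posterior
# N(mu, exp(logvar)) and a standard normal prior, computed by the kl_divergence
# helper at the bottom of this file:
#   KL = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
# e.g. mu = 0, logvar = 0 gives KL = 0 (the posterior equals the prior).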
class ACTTVPolicy(nn.Module):
    def __init__(self, args_override):
        super().__init__()
        model, optimizer = build_ACT_model_and_optimizer(args_override)
        self.model = model  # CVAE decoder
        self.optimizer = optimizer
        self.kl_weight = args_override['kl_weight']
        self.qpos_noise_std = args_override['qpos_noise_std']
        print(f'KL Weight {self.kl_weight}')

    def __call__(self, qpos, image, actions=None, is_pad=None):
        env_state = None

        # normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
        #                                  std=[0.229, 0.224, 0.225])
        # image = normalize(image)

        patch_h = 16
        patch_w = 22
        if actions is not None:
            transform = v2.Compose([
                v2.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
                v2.RandomPerspective(distortion_scale=0.5),
                v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
                v2.GaussianBlur(kernel_size=(9, 9), sigma=(0.1, 2.0)),
                v2.Resize((patch_h * 14, patch_w * 14)),
                # v2.CenterCrop((patch_h * 14, patch_w * 14)),
                v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
            ])
            # NOTE: despite its name, qpos_noise_std is used as a variance here
            # (its square root scales the noise)
            qpos += (self.qpos_noise_std**0.5) * torch.randn_like(qpos)
        else:  # inference time
            transform = v2.Compose([
                v2.Resize((patch_h * 14, patch_w * 14)),
                # v2.CenterCrop((patch_h * 14, patch_w * 14)),
                v2.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
            ])

        image = transform(image)

        if actions is not None:  # training time
            actions = actions[:, :self.model.num_queries]
            is_pad = is_pad[:, :self.model.num_queries]

            a_hat, is_pad_hat, (mu, logvar) = self.model(qpos, image, env_state, actions, is_pad)
            total_kld, dim_wise_kld, mean_kld = kl_divergence(mu, logvar)
            loss_dict = dict()
            all_l1 = F.l1_loss(actions, a_hat, reduction='none')
            l1 = (all_l1 * ~is_pad.unsqueeze(-1)).mean()
            loss_dict['l1'] = l1
            loss_dict['kl'] = total_kld[0]
            loss_dict['loss'] = loss_dict['l1'] + loss_dict['kl'] * self.kl_weight
            return loss_dict
        else:  # inference time
            a_hat, _, (_, _) = self.model(qpos, image, env_state)  # no action, sample from prior
            return a_hat

    def configure_optimizers(self):
        return self.optimizer


class CNNMLPPolicy(nn.Module):
    def __init__(self, args_override):
        super().__init__()
        model, optimizer = build_CNNMLP_model_and_optimizer(args_override)
        self.model = model  # decoder
        self.optimizer = optimizer

    def __call__(self, qpos, image, actions=None, is_pad=None):
        env_state = None  # TODO
        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
        image = normalize(image)
        if actions is not None:  # training time
            actions = actions[:, 0]
            a_hat = self.model(qpos, image, env_state, actions)
            mse = F.mse_loss(actions, a_hat)
            loss_dict = dict()
            loss_dict['mse'] = mse
            loss_dict['loss'] = loss_dict['mse']
            return loss_dict
        else:  # inference time
            a_hat = self.model(qpos, image, env_state)  # no action, sample from prior
            return a_hat

    def configure_optimizers(self):
        return self.optimizer


def kl_divergence(mu, logvar):
    batch_size = mu.size(0)
    assert batch_size != 0
    if mu.data.ndimension() == 4:
        mu = mu.view(mu.size(0), mu.size(1))
    if logvar.data.ndimension() == 4:
        logvar = logvar.view(logvar.size(0), logvar.size(1))

    klds = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
|
||||
total_kld = klds.sum(1).mean(0, True)
|
||||
dimension_wise_kld = klds.mean(0)
|
||||
mean_kld = klds.mean(1).mean(0, True)
|
||||
|
||||
return total_kld, dimension_wise_kld, mean_kld
|
||||
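# Example (illustrative, not from this commit): the closed-form term in
# kl_divergence is KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian, i.e.
# -0.5 * (1 + logvar - mu^2 - exp(logvar)) per dimension. A quick numerical
# check against torch.distributions:
import torch
from torch.distributions import Normal, kl_divergence as torch_kl

mu, logvar = torch.randn(8, 32), torch.randn(8, 32)
total_kld, _, _ = kl_divergence(mu, logvar)
q = Normal(mu, (0.5 * logvar).exp())  # std = exp(logvar / 2)
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
assert torch.allclose(total_kld[0], torch_kl(q, p).sum(1).mean(), atol=1e-5)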
@@ -1,10 +0,0 @@
from distutils.core import setup
from setuptools import find_packages

setup(
    name='detr',
    version='0.0.0',
    packages=find_packages(),
    license='MIT License',
    long_description=open('README.md').read(),
)
@@ -1 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
@@ -1,88 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Utilities for bounding box manipulation and GIoU.
"""
import torch
from torchvision.ops.boxes import box_area


def box_cxcywh_to_xyxy(x):
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
         (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=-1)


def box_xyxy_to_cxcywh(x):
    x0, y0, x1, y1 = x.unbind(-1)
    b = [(x0 + x1) / 2, (y0 + y1) / 2,
         (x1 - x0), (y1 - y0)]
    return torch.stack(b, dim=-1)


# modified from torchvision to also return the union
def box_iou(boxes1, boxes2):
    area1 = box_area(boxes1)
    area2 = box_area(boxes2)

    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]

    wh = (rb - lt).clamp(min=0)  # [N,M,2]
    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

    union = area1[:, None] + area2 - inter

    iou = inter / union
    return iou, union


def generalized_box_iou(boxes1, boxes2):
    """
    Generalized IoU from https://giou.stanford.edu/

    The boxes should be in [x0, y0, x1, y1] format

    Returns a [N, M] pairwise matrix, where N = len(boxes1)
    and M = len(boxes2)
    """
    # degenerate boxes give inf / nan results,
    # so do an early check
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
    iou, union = box_iou(boxes1, boxes2)

    lt = torch.min(boxes1[:, None, :2], boxes2[:, :2])
    rb = torch.max(boxes1[:, None, 2:], boxes2[:, 2:])

    wh = (rb - lt).clamp(min=0)  # [N,M,2]
    area = wh[:, :, 0] * wh[:, :, 1]

    return iou - (area - union) / area


def masks_to_boxes(masks):
    """Compute the bounding boxes around the provided masks

    The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.

    Returns a [N, 4] tensor, with the boxes in xyxy format
    """
    if masks.numel() == 0:
        return torch.zeros((0, 4), device=masks.device)

    h, w = masks.shape[-2:]

    y = torch.arange(0, h, dtype=torch.float)
    x = torch.arange(0, w, dtype=torch.float)
    y, x = torch.meshgrid(y, x)

    x_mask = (masks * x.unsqueeze(0))
    x_max = x_mask.flatten(1).max(-1)[0]
    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]

    y_mask = (masks * y.unsqueeze(0))
    y_max = y_mask.flatten(1).max(-1)[0]
    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]

    return torch.stack([x_min, y_min, x_max, y_max], 1)
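# Example (illustrative, not from this commit): a worked case for
# generalized_box_iou. The boxes below overlap with IoU = 1/7; the smallest
# enclosing box has area 9 and the union is 7, so GIoU = 1/7 - 2/9 ≈ -0.079.
import torch

a = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
b = torch.tensor([[1.0, 1.0, 3.0, 3.0]])
print(generalized_box_iou(a, b))  # tensor([[-0.0794]])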
@@ -1,468 +0,0 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Misc functions, including distributed helpers.

Mostly copy-paste from torchvision references.
"""
import os
import subprocess
import time
from collections import defaultdict, deque
import datetime
import pickle
from packaging import version
from typing import Optional, List

import torch
import torch.distributed as dist
from torch import Tensor

# needed due to empty tensor bug in pytorch and torchvision 0.5
import torchvision
if version.parse(torchvision.__version__) < version.parse('0.7'):
    from torchvision.ops import _new_empty_tensor
    from torchvision.ops.misc import _output_size


class SmoothedValue(object):
    """Track a series of values and provide access to smoothed values over a
    window or the global series average.
    """

    def __init__(self, window_size=20, fmt=None):
        if fmt is None:
            fmt = "{median:.4f} ({global_avg:.4f})"
        self.deque = deque(maxlen=window_size)
        self.total = 0.0
        self.count = 0
        self.fmt = fmt

    def update(self, value, n=1):
        self.deque.append(value)
        self.count += n
        self.total += value * n

    def synchronize_between_processes(self):
        """
        Warning: does not synchronize the deque!
        """
        if not is_dist_avail_and_initialized():
            return
        t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
        dist.barrier()
        dist.all_reduce(t)
        t = t.tolist()
        self.count = int(t[0])
        self.total = t[1]

    @property
    def median(self):
        d = torch.tensor(list(self.deque))
        return d.median().item()

    @property
    def avg(self):
        d = torch.tensor(list(self.deque), dtype=torch.float32)
        return d.mean().item()

    @property
    def global_avg(self):
        return self.total / self.count

    @property
    def max(self):
        return max(self.deque)

    @property
    def value(self):
        return self.deque[-1]

    def __str__(self):
        return self.fmt.format(
            median=self.median,
            avg=self.avg,
            global_avg=self.global_avg,
            max=self.max,
            value=self.value)


def all_gather(data):
    """
    Run all_gather on arbitrary picklable data (not necessarily tensors)
    Args:
        data: any picklable object
    Returns:
        list[data]: list of data gathered from each rank
    """
    world_size = get_world_size()
    if world_size == 1:
        return [data]

    # serialized to a Tensor
    buffer = pickle.dumps(data)
    storage = torch.ByteStorage.from_buffer(buffer)
    tensor = torch.ByteTensor(storage).to("cuda")

    # obtain Tensor size of each rank
    local_size = torch.tensor([tensor.numel()], device="cuda")
    size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
    dist.all_gather(size_list, local_size)
    size_list = [int(size.item()) for size in size_list]
    max_size = max(size_list)

    # receiving Tensor from all ranks
    # we pad the tensor because torch all_gather does not support
    # gathering tensors of different shapes
    tensor_list = []
    for _ in size_list:
        tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
    if local_size != max_size:
        padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
        tensor = torch.cat((tensor, padding), dim=0)
    dist.all_gather(tensor_list, tensor)

    data_list = []
    for size, tensor in zip(size_list, tensor_list):
        buffer = tensor.cpu().numpy().tobytes()[:size]
        data_list.append(pickle.loads(buffer))

    return data_list


def reduce_dict(input_dict, average=True):
    """
    Args:
        input_dict (dict): all the values will be reduced
        average (bool): whether to do average or sum
    Reduce the values in the dictionary from all processes so that all processes
    have the averaged results. Returns a dict with the same fields as
    input_dict, after reduction.
    """
    world_size = get_world_size()
    if world_size < 2:
        return input_dict
    with torch.no_grad():
        names = []
        values = []
        # sort the keys so that they are consistent across processes
        for k in sorted(input_dict.keys()):
            names.append(k)
            values.append(input_dict[k])
        values = torch.stack(values, dim=0)
        dist.all_reduce(values)
        if average:
            values /= world_size
        reduced_dict = {k: v for k, v in zip(names, values)}
    return reduced_dict


class MetricLogger(object):
    def __init__(self, delimiter="\t"):
        self.meters = defaultdict(SmoothedValue)
        self.delimiter = delimiter

    def update(self, **kwargs):
        for k, v in kwargs.items():
            if isinstance(v, torch.Tensor):
                v = v.item()
            assert isinstance(v, (float, int))
            self.meters[k].update(v)

    def __getattr__(self, attr):
        if attr in self.meters:
            return self.meters[attr]
        if attr in self.__dict__:
            return self.__dict__[attr]
        raise AttributeError("'{}' object has no attribute '{}'".format(
            type(self).__name__, attr))

    def __str__(self):
        loss_str = []
        for name, meter in self.meters.items():
            loss_str.append(
                "{}: {}".format(name, str(meter))
            )
        return self.delimiter.join(loss_str)

    def synchronize_between_processes(self):
        for meter in self.meters.values():
            meter.synchronize_between_processes()

    def add_meter(self, name, meter):
        self.meters[name] = meter

    def log_every(self, iterable, print_freq, header=None):
        i = 0
        if not header:
            header = ''
        start_time = time.time()
        end = time.time()
        iter_time = SmoothedValue(fmt='{avg:.4f}')
        data_time = SmoothedValue(fmt='{avg:.4f}')
        space_fmt = ':' + str(len(str(len(iterable)))) + 'd'
        if torch.cuda.is_available():
            log_msg = self.delimiter.join([
                header,
                '[{0' + space_fmt + '}/{1}]',
                'eta: {eta}',
                '{meters}',
                'time: {time}',
                'data: {data}',
                'max mem: {memory:.0f}'
            ])
        else:
            log_msg = self.delimiter.join([
                header,
                '[{0' + space_fmt + '}/{1}]',
                'eta: {eta}',
                '{meters}',
                'time: {time}',
                'data: {data}'
            ])
        MB = 1024.0 * 1024.0
        for obj in iterable:
            data_time.update(time.time() - end)
            yield obj
            iter_time.update(time.time() - end)
            if i % print_freq == 0 or i == len(iterable) - 1:
                eta_seconds = iter_time.global_avg * (len(iterable) - i)
                eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
                if torch.cuda.is_available():
                    print(log_msg.format(
                        i, len(iterable), eta=eta_string,
                        meters=str(self),
                        time=str(iter_time), data=str(data_time),
                        memory=torch.cuda.max_memory_allocated() / MB))
                else:
                    print(log_msg.format(
                        i, len(iterable), eta=eta_string,
                        meters=str(self),
                        time=str(iter_time), data=str(data_time)))
            i += 1
            end = time.time()
        total_time = time.time() - start_time
        total_time_str = str(datetime.timedelta(seconds=int(total_time)))
        print('{} Total time: {} ({:.4f} s / it)'.format(
            header, total_time_str, total_time / len(iterable)))


def get_sha():
    cwd = os.path.dirname(os.path.abspath(__file__))

    def _run(command):
        return subprocess.check_output(command, cwd=cwd).decode('ascii').strip()
    sha = 'N/A'
    diff = "clean"
    branch = 'N/A'
    try:
        sha = _run(['git', 'rev-parse', 'HEAD'])
        subprocess.check_output(['git', 'diff'], cwd=cwd)
        diff = _run(['git', 'diff-index', 'HEAD'])
        diff = "has uncommitted changes" if diff else "clean"
        branch = _run(['git', 'rev-parse', '--abbrev-ref', 'HEAD'])
    except Exception:
        pass
    message = f"sha: {sha}, status: {diff}, branch: {branch}"
    return message


def collate_fn(batch):
    batch = list(zip(*batch))
    batch[0] = nested_tensor_from_tensor_list(batch[0])
    return tuple(batch)


def _max_by_axis(the_list):
    # type: (List[List[int]]) -> List[int]
    maxes = the_list[0]
    for sublist in the_list[1:]:
        for index, item in enumerate(sublist):
            maxes[index] = max(maxes[index], item)
    return maxes


class NestedTensor(object):
    def __init__(self, tensors, mask: Optional[Tensor]):
        self.tensors = tensors
        self.mask = mask

    def to(self, device):
        # type: (Device) -> NestedTensor  # noqa
        cast_tensor = self.tensors.to(device)
        mask = self.mask
        if mask is not None:
            assert mask is not None
            cast_mask = mask.to(device)
        else:
            cast_mask = None
        return NestedTensor(cast_tensor, cast_mask)

    def decompose(self):
        return self.tensors, self.mask

    def __repr__(self):
        return str(self.tensors)


def nested_tensor_from_tensor_list(tensor_list: List[Tensor]):
    # TODO make this more general
    if tensor_list[0].ndim == 3:
        if torchvision._is_tracing():
            # nested_tensor_from_tensor_list() does not export well to ONNX
            # call _onnx_nested_tensor_from_tensor_list() instead
            return _onnx_nested_tensor_from_tensor_list(tensor_list)

        # TODO make it support different-sized images
        max_size = _max_by_axis([list(img.shape) for img in tensor_list])
        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))
        batch_shape = [len(tensor_list)] + max_size
        b, c, h, w = batch_shape
        dtype = tensor_list[0].dtype
        device = tensor_list[0].device
        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)
        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)
        for img, pad_img, m in zip(tensor_list, tensor, mask):
            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
            m[: img.shape[1], :img.shape[2]] = False
    else:
        raise ValueError('not supported')
    return NestedTensor(tensor, mask)


# _onnx_nested_tensor_from_tensor_list() is an implementation of
# nested_tensor_from_tensor_list() that is supported by ONNX tracing.
@torch.jit.unused
def _onnx_nested_tensor_from_tensor_list(tensor_list: List[Tensor]) -> NestedTensor:
    max_size = []
    for i in range(tensor_list[0].dim()):
        max_size_i = torch.max(torch.stack([img.shape[i] for img in tensor_list]).to(torch.float32)).to(torch.int64)
        max_size.append(max_size_i)
    max_size = tuple(max_size)

    # work around for
    # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
    # m[: img.shape[1], :img.shape[2]] = False
    # which is not yet supported in onnx
    padded_imgs = []
    padded_masks = []
    for img in tensor_list:
        padding = [(s1 - s2) for s1, s2 in zip(max_size, tuple(img.shape))]
        padded_img = torch.nn.functional.pad(img, (0, padding[2], 0, padding[1], 0, padding[0]))
        padded_imgs.append(padded_img)

        m = torch.zeros_like(img[0], dtype=torch.int, device=img.device)
        padded_mask = torch.nn.functional.pad(m, (0, padding[2], 0, padding[1]), "constant", 1)
        padded_masks.append(padded_mask.to(torch.bool))

    tensor = torch.stack(padded_imgs)
    mask = torch.stack(padded_masks)

    return NestedTensor(tensor, mask=mask)


def setup_for_distributed(is_master):
    """
    This function disables printing when not in master process
    """
    import builtins as __builtin__
    builtin_print = __builtin__.print

    def print(*args, **kwargs):
        force = kwargs.pop('force', False)
        if is_master or force:
            builtin_print(*args, **kwargs)

    __builtin__.print = print


def is_dist_avail_and_initialized():
    if not dist.is_available():
        return False
    if not dist.is_initialized():
        return False
    return True


def get_world_size():
    if not is_dist_avail_and_initialized():
        return 1
    return dist.get_world_size()


def get_rank():
    if not is_dist_avail_and_initialized():
        return 0
    return dist.get_rank()


def is_main_process():
    return get_rank() == 0


def save_on_master(*args, **kwargs):
    if is_main_process():
        torch.save(*args, **kwargs)


def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True

    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()
    setup_for_distributed(args.rank == 0)


@torch.no_grad()
def accuracy(output, target, topk=(1,)):
    """Computes the precision@k for the specified values of k"""
    if target.numel() == 0:
        return [torch.zeros([], device=output.device)]
    maxk = max(topk)
    batch_size = target.size(0)

    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        correct_k = correct[:k].view(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res


def interpolate(input, size=None, scale_factor=None, mode="nearest", align_corners=None):
    # type: (Tensor, Optional[List[int]], Optional[float], str, Optional[bool]) -> Tensor
    """
    Equivalent to nn.functional.interpolate, but with support for empty batch sizes.
    This will eventually be supported natively by PyTorch, and this
    class can go away.
    """
    if version.parse(torchvision.__version__) < version.parse('0.7'):
        if input.numel() > 0:
            return torch.nn.functional.interpolate(
                input, size, scale_factor, mode, align_corners
            )

        output_shape = _output_size(2, input, size, scale_factor)
        output_shape = list(input.shape[:-2]) + list(output_shape)
        return _new_empty_tensor(input, output_shape)
    else:
        return torchvision.ops.misc.interpolate(input, size, scale_factor, mode, align_corners)
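# Example (illustrative, not from this commit): nested_tensor_from_tensor_list
# pads a mixed-size image batch to the per-axis maximum and records the padding
# in a boolean mask (True marks padded pixels).
import torch

imgs = [torch.rand(3, 480, 640), torch.rand(3, 320, 480)]
tensors, mask = nested_tensor_from_tensor_list(imgs).decompose()
print(tensors.shape)  # torch.Size([2, 3, 480, 640])
print(mask.shape)     # torch.Size([2, 480, 640])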
@@ -1,107 +0,0 @@
"""
Plotting utilities to visualize training logs.
"""
import torch
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path, PurePath


def plot_logs(logs, fields=('class_error', 'loss_bbox_unscaled', 'mAP'), ewm_col=0, log_name='log.txt'):
    '''
    Function to plot specific fields from training log(s). Plots both training and test results.

    :: Inputs - logs = list containing Path objects, each pointing to individual dir with a log file
              - fields = which results to plot from each log file - plots both training and test for each field.
              - ewm_col = optional, which column to use as the exponential weighted smoothing of the plots
              - log_name = optional, name of log file if different than default 'log.txt'.

    :: Outputs - matplotlib plots of results in fields, color coded for each log file.
               - solid lines are training results, dashed lines are test results.

    '''
    func_name = "plot_utils.py::plot_logs"

    # verify logs is a list of Paths (list[Paths]) or single Pathlib object Path,
    # convert single Path to list to avoid 'not iterable' error

    if not isinstance(logs, list):
        if isinstance(logs, PurePath):
            logs = [logs]
            print(f"{func_name} info: logs param expects a list argument, converted to list[Path].")
        else:
            raise ValueError(f"{func_name} - invalid argument for logs parameter.\n \
            Expect list[Path] or single Path obj, received {type(logs)}")

    # Quality checks - verify valid dir(s), that every item in list is Path object, and that log_name exists in each dir
    for i, dir in enumerate(logs):
        if not isinstance(dir, PurePath):
            raise ValueError(f"{func_name} - non-Path object in logs argument of {type(dir)}: \n{dir}")
        if not dir.exists():
            raise ValueError(f"{func_name} - invalid directory in logs argument:\n{dir}")
        # verify log_name exists
        fn = Path(dir / log_name)
        if not fn.exists():
            print(f"-> missing {log_name}. Have you gotten to Epoch 1 in training?")
            print(f"--> full path of missing log file: {fn}")
            return

    # load log file(s) and plot
    dfs = [pd.read_json(Path(p) / log_name, lines=True) for p in logs]

    fig, axs = plt.subplots(ncols=len(fields), figsize=(16, 5))

    for df, color in zip(dfs, sns.color_palette(n_colors=len(logs))):
        for j, field in enumerate(fields):
            if field == 'mAP':
                coco_eval = pd.DataFrame(
                    np.stack(df.test_coco_eval_bbox.dropna().values)[:, 1]
                ).ewm(com=ewm_col).mean()
                axs[j].plot(coco_eval, c=color)
            else:
                df.interpolate().ewm(com=ewm_col).mean().plot(
                    y=[f'train_{field}', f'test_{field}'],
                    ax=axs[j],
                    color=[color] * 2,
                    style=['-', '--']
                )
    for ax, field in zip(axs, fields):
        ax.legend([Path(p).name for p in logs])
        ax.set_title(field)


def plot_precision_recall(files, naming_scheme='iter'):
    if naming_scheme == 'exp_id':
        # name becomes exp_id
        names = [f.parts[-3] for f in files]
    elif naming_scheme == 'iter':
        names = [f.stem for f in files]
    else:
        raise ValueError(f'not supported {naming_scheme}')
    fig, axs = plt.subplots(ncols=2, figsize=(16, 5))
    for f, color, name in zip(files, sns.color_palette("Blues", n_colors=len(files)), names):
        data = torch.load(f)
        # precision is n_iou, n_points, n_cat, n_area, max_det
        precision = data['precision']
        recall = data['params'].recThrs
        scores = data['scores']
        # take precision for all classes, all areas and 100 detections
        precision = precision[0, :, :, 0, -1].mean(1)
        scores = scores[0, :, :, 0, -1].mean(1)
        prec = precision.mean()
        rec = data['recall'][0, :, 0, -1].mean()
        print(f'{naming_scheme} {name}: mAP@50={prec * 100: 05.1f}, ' +
              f'score={scores.mean():0.3f}, ' +
              f'f1={2 * prec * rec / (prec + rec + 1e-8):0.3f}'
              )
        axs[0].plot(recall, precision, c=color)
        axs[1].plot(recall, scores, c=color)

    axs[0].set_title('Precision / Recall')
    axs[0].legend(names)
    axs[1].set_title('Scores / Recall')
    axs[1].legend(names)
    return fig, axs
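# Example (illustrative, not from this commit): plot_logs expects directories
# that each contain a line-delimited JSON log. The paths and field names below
# are placeholders and must match keys present in your log.txt.
from pathlib import Path

plot_logs([Path('output/exp1'), Path('output/exp2')], fields=('loss', 'l1'), ewm_col=5)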
@@ -53,6 +53,7 @@ class DualDianaMed(MujocoEnv):
        self.l_vis = None
        self.top = None
        self.angle = None
+       self.front = None
        self.obs = None

        self.rew = None
@@ -168,6 +169,7 @@ class DualDianaMed(MujocoEnv):
        obs['images']['angle'] = self.angle
        obs['images']['r_vis'] = self.r_vis
        obs['images']['l_vis'] = self.l_vis
+       obs['images']['front'] = self.front
        return obs

    def _get_image_obs(self):
@@ -177,6 +179,7 @@ class DualDianaMed(MujocoEnv):
        obs['images']['angle'] = self.angle
        obs['images']['r_vis'] = self.r_vis
        obs['images']['l_vis'] = self.l_vis
+       obs['images']['front'] = self.front
        return obs

    def _get_qpos_obs(self):
@@ -202,12 +205,16 @@ class DualDianaMed(MujocoEnv):
            return self.r_vis
        elif self.cam == 'l_vis':
            return self.l_vis
+       elif self.cam == 'front':
+           return self.front
        else:
            raise AttributeError("please input right name")

    def camera_viewer(self):
        img_renderer = mj.Renderer(self.mj_model,height=480,width=640)
+       show_gui = self.is_render
+       if show_gui:
+           cv2.namedWindow('Cam view',cv2.WINDOW_NORMAL)
        while not self.exit_flag:
            img_renderer.update_scene(self.mj_data,camera="rs_cam_right")
@@ -222,6 +229,11 @@ class DualDianaMed(MujocoEnv):
            img_renderer.update_scene(self.mj_data,camera="angle")
            self.angle = img_renderer.render()
            self.angle = self.angle[:, :, ::-1]
+           img_renderer.update_scene(self.mj_data,camera="front")
+           self.front = img_renderer.render()
+           self.front = self.front[:, :, ::-1]
+           if show_gui:
+               if self.cam_view is not None:
                    cv2.imshow('Cam view', self.cam_view)
                    cv2.waitKey(1)

@@ -72,12 +72,17 @@ class DualDianaMed_Pos_Ctrl(DualDianaMed):
        self.mj_data.joint('red_box_joint').qpos[5] = 0.0
        self.mj_data.joint('red_box_joint').qpos[6] = 0.0
        super().reset()
        self.top = None
        self.angle = None
        self.r_vis = None
+       self.front = None
        self.cam_flage = True
        t=0
        while self.cam_flage:
            if(type(self.top)==type(None)
               or type(self.angle)==type(None)
-              or type(self.r_vis)==type(None)):
+              or type(self.r_vis)==type(None)
+              or type(self.front)==type(None)):
                time.sleep(0.001)
                t+=1
            else:
@@ -128,12 +133,12 @@ class DualDianaMed_Pos_Ctrl(DualDianaMed):
        return reward


-def make_sim_env(task_name):
+def make_sim_env(task_name, headless=False):
    if 'sim_transfer' in task_name:
        from roboimi.assets.robots.diana_med import BiDianaMed
        env = DualDianaMed_Pos_Ctrl(
            robot=BiDianaMed(),
-           is_render=True,
+           is_render=not headless,
            control_freq=30,
            is_interpolate=True,
            cam_view='angle'
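# Example (illustrative, not from this commit): with the new headless flag, the
# same task can be rolled out without opening a render window.
env = make_sim_env('sim_transfer', headless=True)
env.reset()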
@@ -27,8 +27,8 @@ def sample_insertion_pose():

def sample_transfer_pose():
    # Box
-   x_range = [0.0, 0.05]
-   y_range = [0.95, 1.05]
+   x_range = [-0.2, 0.2]
+   y_range = [0.7, 1.1]
    z_range = [0.47, 0.47]

    ranges = np.vstack([x_range, y_range, z_range])

@@ -18,9 +18,9 @@ SIM_TASK_CONFIGS = {
    # },
    'sim_transfer': {
        'dataset_dir': DATASET_DIR + '/sim_transfer',
-       'num_episodes': 7,
+       'num_episodes': 20,
        'episode_len': 700,
-       'camera_names': ['angle','r_vis'],
+       'camera_names': ['top','r_vis','front'],
        'xml_dir': HOME_PATH + '/assets'
    },
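# Example (illustrative, not from this commit): with the widened ranges above,
# the red box now spawns in x in [-0.2, 0.2], y in [0.7, 1.1], z = 0.47.
# Assuming a uniform draw over the stacked ranges, as in the ACT reference code:
import numpy as np

ranges = np.vstack([[-0.2, 0.2], [0.7, 1.1], [0.47, 0.47]])
box_pos = np.random.uniform(ranges[:, 0], ranges[:, 1])  # one random box position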
@@ -2,6 +2,7 @@ import os
import torch
from roboimi.utils.utils import load_data, set_seed
from roboimi.detr.policy import ACTPolicy, CNNMLPPolicy, ACTTVPolicy
+from roboimi.gr00t.policy import gr00tPolicy

class ModelInterface:
    def __init__(self, config):
@@ -65,6 +66,26 @@ class ModelInterface:
                'num_queries': 1,
                'camera_names': self.config['camera_names'],
            }
+        elif self.config['policy_class'] == 'GR00T':
+            # GR00T uses its own config section from config.yaml
+            gr00t_config = self.config.get('gr00t', {})
+            self.config['policy_config'] = {
+                'lr': gr00t_config.get('lr', self.config['lr']),
+                'lr_backbone': gr00t_config.get('lr_backbone', self.config['lr_backbone']),
+                'weight_decay': gr00t_config.get('weight_decay', 1e-4),
+                'embed_dim': gr00t_config.get('embed_dim', 1536),
+                'hidden_dim': gr00t_config.get('hidden_dim', 1024),
+                'state_dim': gr00t_config.get('state_dim', 16),
+                'action_dim': gr00t_config.get('action_dim', 16),
+                'num_queries': gr00t_config.get('num_queries', 16),
+                'num_layers': gr00t_config.get('num_layers', 16),
+                'nheads': gr00t_config.get('nheads', 32),
+                'mlp_ratio': gr00t_config.get('mlp_ratio', 4),
+                'dropout': gr00t_config.get('dropout', 0.2),
+                'backbone': gr00t_config.get('backbone', 'dino_v2'),
+                'position_embedding': gr00t_config.get('position_embedding', 'sine'),
+                'camera_names': self.config['camera_names'],
+            }
        else:
            raise NotImplementedError

@@ -75,6 +96,8 @@ class ModelInterface:
            return ACTTVPolicy(self.config['policy_config'])
        elif self.config['policy_class'] == 'CNNMLP':
            return CNNMLPPolicy(self.config['policy_config'])
+        elif self.config['policy_class'] == 'GR00T':
+            return gr00tPolicy(self.config['policy_config'])
        else:
            raise NotImplementedError
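# Example (illustrative, not from this commit): a hypothetical config sketch for
# the new GR00T branch; keys mirror the defaults read via gr00t_config.get above.
config = {
    'policy_class': 'GR00T',
    'lr': 1e-4,
    'lr_backbone': 1e-5,
    'camera_names': ['top', 'r_vis', 'front'],
    'gr00t': {'num_layers': 16, 'nheads': 32, 'backbone': 'dino_v2'},
}
# ModelInterface(config) can then construct gr00tPolicy(config['policy_config']).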
176 roboimi/utils/raw_action_trajectory_viewer.py Normal file
@@ -0,0 +1,176 @@
from __future__ import annotations

import math
import time
from pathlib import Path
from typing import Iterable

import cv2
import mujoco
import numpy as np

from roboimi.assets.robots.diana_med import BiDianaMed
from roboimi.envs.mujoco_base import MujocoEnv
from roboimi.envs.double_pos_ctrl_env import make_sim_env
from roboimi.utils.act_ex_utils import sample_transfer_pose


def _load_raw_action_array(path: str | Path) -> np.ndarray:
    path = Path(path)
    if path.suffix == ".npy":
        raw_action = np.load(path)
    elif path.suffix == ".npz":
        archive = np.load(path)
        if "raw_action" in archive:
            raw_action = archive["raw_action"]
        elif "raw_predicted_ee_action" in archive:
            raw_action = archive["raw_predicted_ee_action"]
        else:
            raise KeyError(f"{path} does not contain raw_action")
    else:
        raise ValueError(f"unsupported trajectory file: {path}")
    raw_action = np.asarray(raw_action, dtype=np.float32)
    if raw_action.ndim != 2 or raw_action.shape[1] < 10:
        raise ValueError(f"raw_action must have shape (T, 16)-like, got {raw_action.shape}")
    return raw_action


def disable_cv2_highgui(cv2_module=cv2):
    original = {
        "namedWindow": cv2_module.namedWindow,
        "imshow": cv2_module.imshow,
        "waitKey": cv2_module.waitKey,
    }

    cv2_module.namedWindow = lambda *args, **kwargs: None
    cv2_module.imshow = lambda *args, **kwargs: None
    cv2_module.waitKey = lambda *args, **kwargs: 1

    def restore():
        cv2_module.namedWindow = original["namedWindow"]
        cv2_module.imshow = original["imshow"]
        cv2_module.waitKey = original["waitKey"]

    return restore


def set_transfer_box_pose(mj_data, box_pos: np.ndarray) -> None:
    box_pos = np.asarray(box_pos, dtype=np.float64)
    if box_pos.shape != (3,):
        raise ValueError(f"box_pos must have shape (3,), got {box_pos.shape}")
    joint = mj_data.joint("red_box_joint")
    joint.qpos[0] = box_pos[0]
    joint.qpos[1] = box_pos[1]
    joint.qpos[2] = box_pos[2]
    joint.qpos[3] = 1.0
    joint.qpos[4] = 0.0
    joint.qpos[5] = 0.0
    joint.qpos[6] = 0.0


def load_raw_action_positions(path: str | Path) -> dict[str, np.ndarray]:
    raw_action = _load_raw_action_array(path)
    return {
        "left": raw_action[:, :3].astype(np.float32, copy=True),
        "right": raw_action[:, 7:10].astype(np.float32, copy=True),
    }


def _downsample_points(points: np.ndarray, stride: int) -> np.ndarray:
    sampled = points[::stride]
    if len(sampled) == 0:
        return points
    if not np.array_equal(sampled[-1], points[-1]):
        sampled = np.concatenate([sampled, points[-1:]], axis=0)
    return sampled


def build_trajectory_capsule_markers(
    positions: dict[str, np.ndarray],
    *,
    max_markers: int,
    radius: float = 0.003,
    rgba: tuple[float, float, float, float] = (1.0, 0.0, 0.0, 1.0),
) -> list[dict]:
    total_segments = sum(max(len(points) - 1, 0) for points in positions.values())
    if total_segments == 0:
        return []
    stride = max(1, math.ceil(total_segments / max_markers))
    markers = []
    for points in positions.values():
        sampled = _downsample_points(np.asarray(points, dtype=np.float64), stride)
        for idx in range(len(sampled) - 1):
            markers.append(
                {
                    "from": sampled[idx],
                    "to": sampled[idx + 1],
                    "rgba": rgba,
                    "radius": float(radius),
                }
            )
    return markers[:max_markers]


def apply_capsule_markers_to_scene(user_scn, markers: Iterable[dict]) -> None:
    user_scn.ngeom = 0
    for marker in markers:
        if user_scn.ngeom >= user_scn.maxgeom:
            break
        geom = user_scn.geoms[user_scn.ngeom]
        mujoco.mjv_initGeom(
            geom,
            mujoco.mjtGeom.mjGEOM_CAPSULE,
            np.zeros(3, dtype=np.float64),
            np.zeros(3, dtype=np.float64),
            np.eye(3, dtype=np.float64).reshape(-1),
            np.asarray(marker["rgba"], dtype=np.float32),
        )
        mujoco.mjv_connector(
            geom,
            mujoco.mjtGeom.mjGEOM_CAPSULE,
            float(marker["radius"]),
            np.asarray(marker["from"], dtype=np.float64),
            np.asarray(marker["to"], dtype=np.float64),
        )
        user_scn.ngeom += 1


def launch_raw_action_trajectory_viewer(
    trajectory_path: str | Path,
    *,
    task_name: str = "sim_transfer",
    line_radius: float = 0.004,
    max_markers: int = 1500,
    box_pos: np.ndarray | None = None,
    disable_camera_window: bool = True,
):
    positions = load_raw_action_positions(trajectory_path)
    if task_name != "sim_transfer":
        raise NotImplementedError(f"unsupported task_name: {task_name}")
    if box_pos is None:
        box_pos = sample_transfer_pose()

    robot = BiDianaMed()
    viewer_env = MujocoEnv(robot=robot, is_render=True, renderer="viewer", control_freq=30)
    viewer_env.reset()
    set_transfer_box_pose(viewer_env.mj_data, box_pos)
    mujoco.mj_forward(viewer_env.mj_model, viewer_env.mj_data)
    markers = build_trajectory_capsule_markers(
        positions,
        max_markers=max_markers,
        radius=line_radius,
    )

    if viewer_env.viewer is None or getattr(viewer_env.viewer, "user_scn", None) is None:
        raise RuntimeError("viewer does not expose user_scn; cannot render trajectory overlay")

    try:
        while viewer_env.viewer.is_running() and not viewer_env.exit_flag:
            with viewer_env.viewer.lock():
                apply_capsule_markers_to_scene(viewer_env.viewer.user_scn, markers)
            viewer_env.render()
            time.sleep(1 / 60.0)
    finally:
        viewer_env.exit_flag = True
        if getattr(viewer_env, "viewer", None) is not None:
            viewer_env.viewer.close()
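# Example (illustrative, not from this commit): typical invocation; the
# trajectory path is a placeholder. raw_action rows are (T, >=10) end-effector
# actions with left xyz at columns 0:3 and right xyz at columns 7:10.
launch_raw_action_trajectory_viewer(
    'outputs/eval/episode_0.npz',
    line_radius=0.004,
    max_markers=1500,
)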
113 roboimi/utils/streaming_episode_writer.py Normal file
@@ -0,0 +1,113 @@
from __future__ import annotations

import os
from pathlib import Path

import cv2
import h5py
import numpy as np


class StreamingEpisodeWriter:
    """Write episode data frame by frame; commit on success, discard the temp file on failure."""

    def __init__(
        self,
        dataset_path: str | os.PathLike[str],
        max_timesteps: int,
        camera_names: list[str],
        image_size: tuple[int, int] = (256, 256),
    ) -> None:
        self.dataset_path = Path(dataset_path)
        self.tmp_path = Path(f"{self.dataset_path}.tmp")
        self.max_timesteps = int(max_timesteps)
        self.camera_names = list(camera_names)
        self.image_height = int(image_size[0])
        self.image_width = int(image_size[1])
        self.frame_index = 0
        self._committed = False
        self._closed = False

        self.dataset_path.parent.mkdir(parents=True, exist_ok=True)
        if self.tmp_path.exists():
            self.tmp_path.unlink()

        self._file = h5py.File(self.tmp_path, "w", rdcc_nbytes=1024**2 * 2)
        self._file.attrs["sim"] = True
        self._file.attrs["action_repr"] = "ee_pose_xyz_quat_gripper"
        self._file.attrs["image_height"] = self.image_height
        self._file.attrs["image_width"] = self.image_width
        self._file.attrs["camera_names"] = np.asarray(self.camera_names, dtype="S")

        observations = self._file.create_group("observations")
        images = observations.create_group("images")
        for cam_name in self.camera_names:
            images.create_dataset(
                cam_name,
                (self.max_timesteps, self.image_height, self.image_width, 3),
                dtype="uint8",
                chunks=(1, self.image_height, self.image_width, 3),
            )
        observations.create_dataset(
            "qpos",
            (self.max_timesteps, 16),
            dtype="float32",
            chunks=(min(128, self.max_timesteps), 16),
        )
        self._file.create_dataset(
            "action",
            (self.max_timesteps, 16),
            dtype="float32",
            chunks=(min(128, self.max_timesteps), 16),
        )

    def append(self, qpos: np.ndarray, action: np.ndarray, images: dict[str, np.ndarray]) -> None:
        if self._closed:
            raise RuntimeError("writer is already closed")
        if self.frame_index >= self.max_timesteps:
            raise IndexError("frame index exceeds max_timesteps")

        qpos = np.asarray(qpos, dtype=np.float32)
        action = np.asarray(action, dtype=np.float32)
        if qpos.shape != (16,):
            raise ValueError(f"qpos shape must be (16,), got {qpos.shape}")
        if action.shape != (16,):
            raise ValueError(f"action shape must be (16,), got {action.shape}")

        self._file["observations/qpos"][self.frame_index] = qpos
        self._file["action"][self.frame_index] = action

        for cam_name in self.camera_names:
            if cam_name not in images:
                raise KeyError(f"missing image for camera '{cam_name}'")
            self._file[f"observations/images/{cam_name}"][self.frame_index] = self._resize_image(images[cam_name])

        self.frame_index += 1

    def commit(self) -> None:
        if self._closed:
            return
        self._file.flush()
        self._file.close()
        self._closed = True
        os.replace(self.tmp_path, self.dataset_path)
        self._committed = True

    def discard(self) -> None:
        if not self._closed:
            self._file.close()
        self._closed = True
        if self.tmp_path.exists():
            self.tmp_path.unlink()

    def _resize_image(self, image: np.ndarray) -> np.ndarray:
        image = np.asarray(image, dtype=np.uint8)
        if image.ndim != 3 or image.shape[2] != 3:
            raise ValueError(f"image shape must be HxWx3, got {image.shape}")
        if image.shape[:2] == (self.image_height, self.image_width):
            return image

        interpolation = cv2.INTER_AREA
        if image.shape[0] < self.image_height or image.shape[1] < self.image_width:
            interpolation = cv2.INTER_LINEAR
        return cv2.resize(image, (self.image_width, self.image_height), interpolation=interpolation)
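# Example (illustrative, not from this commit): the intended commit/discard
# pattern, so a crashed rollout never leaves a half-written HDF5 file. The
# rollout() generator is hypothetical; qpos/action must be shape (16,).
writer = StreamingEpisodeWriter('data/sim_transfer/episode_0.hdf5',
                                max_timesteps=700,
                                camera_names=['top', 'r_vis', 'front'])
try:
    for qpos, action, images in rollout():  # hypothetical data source
        writer.append(qpos, action, images)
    writer.commit()    # flush + atomic rename of .tmp to the final path
except Exception:
    writer.discard()   # close and delete the .tmp file
    raise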
1 roboimi/vla/__init__.py Normal file
@@ -0,0 +1 @@
# export VLAAgent, VLAModelConfig
446 roboimi/vla/agent.py Normal file
@@ -0,0 +1,446 @@
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import numpy as np
|
||||
from collections import deque
|
||||
from typing import Dict, Optional, Any, Tuple
|
||||
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
|
||||
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
|
||||
from roboimi.vla.models.normalization import NormalizationModule
|
||||
|
||||
class VLAAgent(nn.Module):
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vision_backbone, # 视觉编码器(ResNet 等)
|
||||
state_encoder,
|
||||
action_encoder,
|
||||
head,
|
||||
action_dim, # 机器人动作维度 (例如 7: xyz + rpy + gripper)
|
||||
obs_dim, # 本体感知维度 (例如 关节角度)
|
||||
pred_horizon=16, # 预测未来多少步动作
|
||||
obs_horizon=4, # 使用多少步历史观测
|
||||
diffusion_steps=100, # DDPM 加噪步数
|
||||
inference_steps=10, # DDIM 推理步数
|
||||
num_cams=3, # 视觉输入的摄像头数量
|
||||
camera_names: Optional[Tuple[str, ...]] = None, # 条件相机顺序
|
||||
dataset_stats=None, # 数据集统计信息,用于归一化
|
||||
normalization_type='min_max', # 归一化类型: 'gaussian' 或 'min_max'
|
||||
num_action_steps=8, # 每次推理实际执行多少步动作
|
||||
head_type='unet', # Policy head类型: 'unet' 或 'transformer'
|
||||
):
|
||||
super().__init__()
|
||||
# 保存参数
|
||||
self.action_dim = action_dim
|
||||
self.obs_dim = obs_dim
|
||||
self.pred_horizon = pred_horizon
|
||||
self.obs_horizon = obs_horizon
|
||||
self.num_cams = num_cams
|
||||
self.num_action_steps = num_action_steps
|
||||
self.inference_steps = inference_steps
|
||||
self.head_type = head_type # 'unet' 或 'transformer'
|
||||
agent_camera_names = tuple(camera_names) if camera_names is not None else None
|
||||
backbone_camera_names = getattr(vision_backbone, 'camera_names', None)
|
||||
backbone_camera_names = tuple(backbone_camera_names) if backbone_camera_names is not None else None
|
||||
backbone_num_cameras = getattr(vision_backbone, 'num_cameras', None)
|
||||
if backbone_num_cameras is not None and backbone_num_cameras != self.num_cams:
|
||||
raise ValueError(
|
||||
f"agent.num_cams({self.num_cams}) 与 "
|
||||
f"vision_backbone.num_cameras({backbone_num_cameras}) 不一致"
|
||||
)
|
||||
if (
|
||||
agent_camera_names is not None
|
||||
and backbone_camera_names is not None
|
||||
and agent_camera_names != backbone_camera_names
|
||||
):
|
||||
raise ValueError(
|
||||
f"agent.camera_names({list(agent_camera_names)}) 与 "
|
||||
f"vision_backbone.camera_names({list(backbone_camera_names)}) 不一致"
|
||||
)
|
||||
self.camera_names = (
|
||||
agent_camera_names if agent_camera_names is not None else backbone_camera_names
|
||||
)
|
||||
if self.camera_names is not None and len(self.camera_names) != self.num_cams:
|
||||
raise ValueError(
|
||||
f"camera_names 长度({len(self.camera_names)})与 num_cams({self.num_cams})不一致"
|
||||
)
|
||||
|
||||
|
||||
# 归一化模块 - 统一训练和推理的归一化逻辑
|
||||
self.normalization = NormalizationModule(
|
||||
stats=dataset_stats,
|
||||
normalization_type=normalization_type
|
||||
)
|
||||
|
||||
self.vision_encoder = vision_backbone
|
||||
if self.camera_names is not None:
|
||||
self.vision_encoder.camera_names = self.camera_names
|
||||
single_cam_feat_dim = self.vision_encoder.output_dim
|
||||
# global_cond_dim: 展平后的总维度(用于UNet)
|
||||
total_vision_dim = single_cam_feat_dim * num_cams * obs_horizon
|
||||
total_prop_dim = obs_dim * obs_horizon
|
||||
self.global_cond_dim = total_vision_dim + total_prop_dim
|
||||
|
||||
# per_step_cond_dim: 每步的条件维度(用于Transformer)
|
||||
# 注意:这里不乘以obs_horizon,因为Transformer的输入是序列形式
|
||||
self.per_step_cond_dim = single_cam_feat_dim * num_cams + obs_dim
|
||||
|
||||
self.noise_scheduler = DDPMScheduler(
|
||||
num_train_timesteps=diffusion_steps,
|
||||
beta_schedule='squaredcos_cap_v2', # 机器人任务常用的 schedule
|
||||
clip_sample=True,
|
||||
prediction_type='epsilon' # 预测噪声
|
||||
)
|
||||
|
||||
# DDIM 调度器用于快速推理
|
||||
self.infer_scheduler = DDIMScheduler(
|
||||
num_train_timesteps=diffusion_steps,
|
||||
beta_schedule='squaredcos_cap_v2',
|
||||
clip_sample=True,
|
||||
prediction_type='epsilon'
|
||||
)
|
||||
|
||||
# 根据head类型初始化不同的参数
|
||||
if head_type == 'transformer':
|
||||
# 如果head已经是nn.Module实例,直接使用;否则需要初始化
|
||||
if isinstance(head, nn.Module):
|
||||
# 已经是实例化的模块(测试时直接传入<E4BCA0><E585A5>
|
||||
self.noise_pred_net = head
|
||||
else:
|
||||
# Hydra部分初始化的对象,调用时传入参数
|
||||
self.noise_pred_net = head(
|
||||
input_dim=action_dim,
|
||||
output_dim=action_dim,
|
||||
horizon=pred_horizon,
|
||||
n_obs_steps=obs_horizon,
|
||||
cond_dim=self.per_step_cond_dim # 每步的条件维度
|
||||
)
|
||||
else: # 'unet' (default)
|
||||
# UNet接口: input_dim, global_cond_dim
|
||||
self.noise_pred_net = head(
|
||||
input_dim=action_dim,
|
||||
global_cond_dim=self.global_cond_dim
|
||||
)
|
||||
|
||||
self.state_encoder = state_encoder
|
||||
self.action_encoder = action_encoder
|
||||
|
||||
# 初始化队列(用于在线推理)
|
||||
self.reset()
|
||||
|
||||
def _get_model_device(self) -> torch.device:
|
||||
"""获取模型当前所在设备。"""
|
||||
return next(self.parameters()).device
|
||||
|
||||
def _move_to_device(self, data, device: torch.device):
|
||||
"""递归地将张量数据移动到指定设备。"""
|
||||
if torch.is_tensor(data):
|
||||
return data.to(device)
|
||||
if isinstance(data, dict):
|
||||
return {k: self._move_to_device(v, device) for k, v in data.items()}
|
||||
if isinstance(data, list):
|
||||
return [self._move_to_device(v, device) for v in data]
|
||||
if isinstance(data, tuple):
|
||||
return tuple(self._move_to_device(v, device) for v in data)
|
||||
return data
|
||||
|
||||
def _order_images(self, images: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
|
||||
"""按显式配置的相机顺序返回图像字典。"""
|
||||
if self.camera_names is None:
|
||||
camera_names = tuple(sorted(images.keys()))
|
||||
if len(camera_names) != self.num_cams:
|
||||
raise ValueError(
|
||||
f"图像条件相机数量({len(camera_names)})与 num_cams({self.num_cams})不一致"
|
||||
)
|
||||
return {cam_name: images[cam_name] for cam_name in camera_names}
|
||||
|
||||
missing = [cam_name for cam_name in self.camera_names if cam_name not in images]
|
||||
if missing:
|
||||
raise ValueError(
|
||||
f"图像条件缺少必需相机。missing={missing}, expected={list(self.camera_names)}"
|
||||
)
|
||||
return {cam_name: images[cam_name] for cam_name in self.camera_names}
|
||||
|
||||
def _build_cond(self, images: Dict[str, torch.Tensor], states: torch.Tensor) -> torch.Tensor:
|
||||
"""构造每步条件,确保图像条件顺序稳定。"""
|
||||
ordered_images = self._order_images(images)
|
||||
visual_features = self.vision_encoder(ordered_images)
|
||||
state_features = self.state_encoder(states)
|
||||
cond = torch.cat([visual_features, state_features], dim=-1)
|
||||
if cond.shape[-1] != self.per_step_cond_dim:
|
||||
raise RuntimeError(
|
||||
f"条件维度不匹配: got {cond.shape[-1]}, expected {self.per_step_cond_dim}"
|
||||
)
|
||||
return cond
|
||||
|
||||
# ==========================
|
||||
# 训练阶段 (Training)
|
||||
# ==========================
|
||||
def compute_loss(self, batch):
|
||||
"""
|
||||
计算训练损失
|
||||
|
||||
Args:
|
||||
batch: 包含 images, qpos (本体感知), action, action_is_pad 的字典
|
||||
"""
|
||||
actions, states, images = batch['action'], batch['qpos'], batch['images']
|
||||
action_is_pad = batch.get('action_is_pad', None) # 获取padding mask
|
||||
B = actions.shape[0]
|
||||
|
||||
# 归一化 states (qpos) 和 actions
|
||||
states = self.normalization.normalize_qpos(states)
|
||||
actions = self.normalization.normalize_action(actions)
|
||||
|
||||
# 1. 提取视觉特征
|
||||
per_step_cond = self._build_cond(images, states)
|
||||
action_features = self.action_encoder(actions)
|
||||
|
||||
# 2. 采样噪声
|
||||
noise = torch.randn_like(action_features)
|
||||
|
||||
# 3. 随机采样时间步 (Timesteps)
|
||||
timesteps = torch.randint(
|
||||
0, self.noise_scheduler.config.num_train_timesteps,
|
||||
(B,), device=action_features.device
|
||||
).long()
|
||||
|
||||
# 4. 给动作加噪 (Forward Diffusion)
|
||||
noisy_actions = self.noise_scheduler.add_noise(
|
||||
action_features, noise, timesteps
|
||||
)
|
||||
|
||||
# 拼接全局条件并展平
|
||||
# per_step_cond: (B, obs_horizon, vision_dim * num_cams + obs_dim)
|
||||
# 展平后用于 UNet,全序列形式用于 Transformer
|
||||
global_cond = per_step_cond.flatten(start_dim=1)
|
||||
|
||||
# 5. 网络预测噪声(根据head类型选择接口)
|
||||
if self.head_type == 'transformer':
|
||||
pred_noise = self.noise_pred_net(
|
||||
sample=noisy_actions,
|
||||
timestep=timesteps,
|
||||
cond=per_step_cond
|
||||
)
|
||||
else: # 'unet'
|
||||
pred_noise = self.noise_pred_net(
|
||||
sample=noisy_actions,
|
||||
timestep=timesteps,
|
||||
global_cond=global_cond
|
||||
)
|
||||
|
||||
# 6. 计算 Loss (MSE),支持 padding mask
|
||||
loss = nn.functional.mse_loss(pred_noise, noise, reduction='none')
|
||||
|
||||
# 如果提供了 action_is_pad,对padding位置进行mask
|
||||
if action_is_pad is not None:
|
||||
# action_is_pad: (B, pred_horizon),扩展到 (B, pred_horizon, action_dim)
|
||||
mask = (~action_is_pad).unsqueeze(-1).to(loss.dtype) # 1.0表示有效数据
|
||||
valid_count = mask.sum() * loss.shape[-1]
|
||||
loss = (loss * mask).sum() / valid_count.clamp_min(1.0)
|
||||
else:
|
||||
loss = loss.mean()
|
||||
|
||||
return loss
|
||||
|
||||
# ==========================
|
||||
# 队列管理 (Queue Management)
|
||||
# ==========================
|
||||
def reset(self):
|
||||
"""清空观测和动作队列。应在 env.reset() 时调用"""
|
||||
self._queues = {
|
||||
'qpos': deque(maxlen=self.obs_horizon),
|
||||
'images': deque(maxlen=self.obs_horizon),
|
||||
'action': deque(maxlen=self.pred_horizon - self.obs_horizon + 1), # 可执行的动作缓存
|
||||
}
|
||||
|
    def _populate_queues(self, observation: Dict[str, torch.Tensor]) -> None:
        """
        Append a new observation to the queues.

        Args:
            observation: dict containing 'qpos' and 'images'
        """
        # Append proprioception
        if 'qpos' in observation:
            self._queues['qpos'].append(observation['qpos'].clone())

        # Append images
        if 'images' in observation:
            ordered_images = self._order_images(observation['images'])
            self._queues['images'].append({k: v.clone() for k, v in ordered_images.items()})

    def _prepare_observation_batch(self) -> Dict[str, torch.Tensor]:
        """
        Build a batched observation for inference from the queues.
        If a queue is not yet full (on the first calls), pad it by repeating the latest observation.

        Returns:
            batch: dict containing the stacked observation history
        """
        # Stack the proprioception history
        qpos_list = list(self._queues['qpos'])
        if len(qpos_list) == 0:
            raise ValueError("Observation queue is empty; call _populate_queues first")
        # Pad with the last observation if the queue is not full
        while len(qpos_list) < self.obs_horizon:
            qpos_list.append(qpos_list[-1])
        batch_qpos = torch.stack(qpos_list, dim=0).unsqueeze(0)  # (1, obs_horizon, obs_dim)

        # Stack the image history
        images_list = list(self._queues['images'])
        if len(images_list) == 0:
            raise ValueError("Image queue is empty; call _populate_queues first")
        # Pad with the last observation if the queue is not full
        while len(images_list) < self.obs_horizon:
            images_list.append(images_list[-1])

        batch_images = {}
        camera_names = self.camera_names if self.camera_names is not None else tuple(sorted(images_list[0].keys()))
        for cam_name in camera_names:
            batch_images[cam_name] = torch.stack([img[cam_name] for img in images_list], dim=0).unsqueeze(0)

        return {'qpos': batch_qpos, 'images': batch_images}

    # ==========================
    # Online Inference
    # ==========================
    @torch.no_grad()
    def select_action(self, observation: Dict[str, torch.Tensor]) -> torch.Tensor:
        """
        Select a single action given the current observation.

        This method maintains a cache of past observations and generated action
        trajectories. Workflow:
        - Cache `obs_horizon` steps of observation history
        - The diffusion model generates `pred_horizon` steps of actions
        - Only `num_action_steps` of them are actually executed

        Diagram:
        --------------------------------------------------------------
        (legend: o = obs_horizon, h = pred_horizon, a = num_action_steps)
        | timestep         | 0   | 1   | ... | o-1 | o   | ... | h-1 |
        | observation used | yes | yes | yes | yes | no  | no  | no  |
        | action generated | yes | yes | yes | yes | yes | yes | yes |
        | action executed  | no  | no  | no  | no  | yes | yes | yes |
        --------------------------------------------------------------

        Args:
            observation: dict containing 'qpos' and 'images'

        Returns:
            action: (action_dim,) a single action
        """
        # Use the model's current device as the single source of truth and move
        # the inputs there, instead of moving the model back to CPU just because
        # the observation happens to live on CPU.
        device = self._get_model_device()
        observation = self._move_to_device(observation, device)

        # Append the new observation to the queues
        self._populate_queues(observation)

        # If the action queue is empty, generate a new action sequence
        if len(self._queues['action']) == 0:
            # Build a batched observation from the queues
            batch = self._prepare_observation_batch()

            # Generate an action chunk
            actions = self.predict_action_chunk(batch)  # (1, pred_horizon, action_dim)

            # Extract the executable part of the chunk.
            # Start at obs_horizon - 1, because earlier actions correspond to past observations.
            start = self.obs_horizon - 1
            end = start + self.num_action_steps
            executable_actions = actions[:, start:end]  # (1, num_action_steps, action_dim)

            # Push the actions onto the queue
            for i in range(executable_actions.shape[1]):
                self._queues['action'].append(executable_actions[:, i].squeeze(0))  # (action_dim,)

        # Pop one action from the queue
        action = self._queues['action'].popleft()  # (action_dim,)

        return action

    @torch.no_grad()
    def predict_action_chunk(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor:
        """
        Predict an action chunk (for online inference).

        Args:
            batch: dict containing 'qpos' and 'images'
                - qpos: (B, obs_horizon, obs_dim)
                - images: Dict[str, (B, obs_horizon, C, H, W)]

        Returns:
            actions: (B, pred_horizon, action_dim) predicted action sequence
        """
        return self.predict_action(batch['images'], batch['qpos'])

    # ==========================
    # Batch Inference (pre-existing method)
    # ==========================
    @torch.no_grad()
    def predict_action(self, images, proprioception):
        """
        Predict action sequences in batch (for training and offline evaluation).

        Args:
            images: dict of image observations
            proprioception: proprioceptive observation (qpos)

        Returns:
            denormalized_actions: the denormalized action sequence
        """
        B = proprioception.shape[0]

        # Normalize proprioception (qpos)
        proprioception = self.normalization.normalize_qpos(proprioception)

        # 1. Extract current observation features (computed once)
        per_step_cond = self._build_cond(images, proprioception)

        # Flatten the conditioning (computed once)
        global_cond_flat = per_step_cond.flatten(start_dim=1)
        if self.head_type == 'transformer':
            cond = per_step_cond
        else:
            cond = None

        # 2. Initialize the actions as pure Gaussian noise
        # Shape: (B, pred_horizon, action_dim)
        device = per_step_cond.device
        current_actions = torch.randn(
            (B, self.pred_horizon, self.action_dim), device=device
        )

        # 3. Iterative denoising loop (reverse diffusion)
        self.infer_scheduler.set_timesteps(self.inference_steps)  # number of DDIM inference steps

        for t in self.infer_scheduler.timesteps:
            model_input = current_actions

            # Predict the noise (the interface depends on the head type)
            if self.head_type == 'transformer':
                noise_pred = self.noise_pred_net(
                    sample=model_input,
                    timestep=t,
                    cond=cond
                )
            else:  # 'unet'
                noise_pred = self.noise_pred_net(
                    sample=model_input,
                    timestep=t,
                    global_cond=global_cond_flat
                )

            # Remove the predicted noise and update current_actions
            current_actions = self.infer_scheduler.step(
                noise_pred, t, current_actions
            ).prev_sample

        # 4. Denormalize the action sequence
        denormalized_actions = self.normalization.denormalize_action(current_actions)

        return denormalized_actions

    def get_normalization_stats(self):
        """Return the normalization statistics (for saving into a checkpoint)."""
        return self.normalization.get_stats()
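A minimal rollout sketch for the queue-based interface above. This is a hedged example: `env` and `agent` are assumed to exist, and only the observation keys ('qpos' plus per-camera 'images') follow this file's conventions.

import torch

agent.reset()  # clears the qpos/image/action queues defined in this file
obs = env.reset()
for _ in range(700):  # max_timesteps, matching eval.yaml below
    observation = {
        "qpos": torch.as_tensor(obs["qpos"], dtype=torch.float32),
        "images": {k: torch.as_tensor(v, dtype=torch.float32) for k, v in obs["images"].items()},
    }
    action = agent.select_action(observation)  # (action_dim,)
    obs = env.step(action.cpu().numpy())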
217  roboimi/vla/agent_gr00t_dit.py  (new file)
@@ -0,0 +1,217 @@
import torch
import torch.nn as nn
from collections import deque
from typing import Dict

from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
from diffusers.schedulers.scheduling_ddim import DDIMScheduler

from roboimi.vla.models.normalization import NormalizationModule


class VLAAgentGr00tDiT(nn.Module):
    """
    VLA Agent variant that swaps the Transformer1D head for the gr00t DiT head.
    Other components (backbone/encoders/scheduler/queue logic) stay aligned
    with the existing VLAAgent implementation.
    """

    def __init__(
        self,
        vision_backbone,
        state_encoder,
        action_encoder,
        head,
        action_dim,
        obs_dim,
        pred_horizon=16,
        obs_horizon=4,
        diffusion_steps=100,
        inference_steps=10,
        num_cams=3,
        dataset_stats=None,
        normalization_type="min_max",
        num_action_steps=8,
    ):
        super().__init__()
        self.action_dim = action_dim
        self.obs_dim = obs_dim
        self.pred_horizon = pred_horizon
        self.obs_horizon = obs_horizon
        self.num_cams = num_cams
        self.num_action_steps = num_action_steps
        self.inference_steps = inference_steps

        self.normalization = NormalizationModule(
            stats=dataset_stats,
            normalization_type=normalization_type,
        )

        self.vision_encoder = vision_backbone
        single_cam_feat_dim = self.vision_encoder.output_dim
        self.per_step_cond_dim = single_cam_feat_dim * num_cams + obs_dim

        self.noise_scheduler = DDPMScheduler(
            num_train_timesteps=diffusion_steps,
            beta_schedule="squaredcos_cap_v2",
            clip_sample=True,
            prediction_type="epsilon",
        )
        self.infer_scheduler = DDIMScheduler(
            num_train_timesteps=diffusion_steps,
            beta_schedule="squaredcos_cap_v2",
            clip_sample=True,
            prediction_type="epsilon",
        )

        if isinstance(head, nn.Module):
            self.noise_pred_net = head
        else:
            self.noise_pred_net = head(
                input_dim=action_dim,
                output_dim=action_dim,
                horizon=pred_horizon,
                n_obs_steps=obs_horizon,
                cond_dim=self.per_step_cond_dim,
            )

        self.state_encoder = state_encoder
        self.action_encoder = action_encoder
        self.reset()

    def _get_model_device(self) -> torch.device:
        return next(self.parameters()).device

    def _move_to_device(self, data, device: torch.device):
        if torch.is_tensor(data):
            return data.to(device)
        if isinstance(data, dict):
            return {k: self._move_to_device(v, device) for k, v in data.items()}
        if isinstance(data, list):
            return [self._move_to_device(v, device) for v in data]
        if isinstance(data, tuple):
            return tuple(self._move_to_device(v, device) for v in data)
        return data

    def _build_cond(self, images: Dict[str, torch.Tensor], states: torch.Tensor) -> torch.Tensor:
        visual_features = self.vision_encoder(images)
        state_features = self.state_encoder(states)
        return torch.cat([visual_features, state_features], dim=-1)

    def compute_loss(self, batch):
        actions, states, images = batch["action"], batch["qpos"], batch["images"]
        action_is_pad = batch.get("action_is_pad", None)
        bsz = actions.shape[0]

        states = self.normalization.normalize_qpos(states)
        actions = self.normalization.normalize_action(actions)

        action_features = self.action_encoder(actions)
        cond = self._build_cond(images, states)

        noise = torch.randn_like(action_features)
        timesteps = torch.randint(
            0,
            self.noise_scheduler.config.num_train_timesteps,
            (bsz,),
            device=action_features.device,
        ).long()
        noisy_actions = self.noise_scheduler.add_noise(action_features, noise, timesteps)

        pred_noise = self.noise_pred_net(
            sample=noisy_actions,
            timestep=timesteps,
            cond=cond,
        )
        loss = nn.functional.mse_loss(pred_noise, noise, reduction="none")

        if action_is_pad is not None:
            mask = (~action_is_pad).unsqueeze(-1).to(loss.dtype)
            valid_count = mask.sum() * loss.shape[-1]
            loss = (loss * mask).sum() / valid_count.clamp_min(1.0)
        else:
            loss = loss.mean()

        return loss

    def reset(self):
        self._queues = {
            "qpos": deque(maxlen=self.obs_horizon),
            "images": deque(maxlen=self.obs_horizon),
            "action": deque(maxlen=self.pred_horizon - self.obs_horizon + 1),
        }

    def _populate_queues(self, observation: Dict[str, torch.Tensor]) -> None:
        if "qpos" in observation:
            self._queues["qpos"].append(observation["qpos"].clone())
        if "images" in observation:
            self._queues["images"].append({k: v.clone() for k, v in observation["images"].items()})

    def _prepare_observation_batch(self) -> Dict[str, torch.Tensor]:
        qpos_list = list(self._queues["qpos"])
        if len(qpos_list) == 0:
            raise ValueError("observation queue is empty.")
        while len(qpos_list) < self.obs_horizon:
            qpos_list.append(qpos_list[-1])
        batch_qpos = torch.stack(qpos_list, dim=0).unsqueeze(0)

        images_list = list(self._queues["images"])
        if len(images_list) == 0:
            raise ValueError("image queue is empty.")
        while len(images_list) < self.obs_horizon:
            images_list.append(images_list[-1])

        batch_images = {}
        for cam_name in images_list[0].keys():
            batch_images[cam_name] = torch.stack(
                [img[cam_name] for img in images_list], dim=0
            ).unsqueeze(0)

        return {"qpos": batch_qpos, "images": batch_images}

    @torch.no_grad()
    def select_action(self, observation: Dict[str, torch.Tensor]) -> torch.Tensor:
        device = self._get_model_device()
        observation = self._move_to_device(observation, device)
        self._populate_queues(observation)

        if len(self._queues["action"]) == 0:
            batch = self._prepare_observation_batch()
            actions = self.predict_action_chunk(batch)
            start = self.obs_horizon - 1
            end = start + self.num_action_steps
            executable_actions = actions[:, start:end]
            for i in range(executable_actions.shape[1]):
                self._queues["action"].append(executable_actions[:, i].squeeze(0))

        return self._queues["action"].popleft()

    @torch.no_grad()
    def predict_action_chunk(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor:
        return self.predict_action(batch["images"], batch["qpos"])

    @torch.no_grad()
    def predict_action(self, images, proprioception):
        bsz = proprioception.shape[0]
        proprioception = self.normalization.normalize_qpos(proprioception)
        cond = self._build_cond(images, proprioception)

        device = cond.device
        current_actions = torch.randn((bsz, self.pred_horizon, self.action_dim), device=device)
        self.infer_scheduler.set_timesteps(self.inference_steps)

        for t in self.infer_scheduler.timesteps:
            noise_pred = self.noise_pred_net(
                sample=current_actions,
                timestep=t,
                cond=cond,
            )
            current_actions = self.infer_scheduler.step(
                noise_pred, t, current_actions
            ).prev_sample

        return self.normalization.denormalize_action(current_actions)

    def get_normalization_stats(self):
        return self.normalization.get_stats()
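A minimal training-step sketch against compute_loss above. The batch keys ('action', 'qpos', 'images', 'action_is_pad') follow this file's conventions; the optimizer choice and gradient clipping value are assumptions (config.yaml below uses grad_clip: 1.0).

import torch

optimizer = torch.optim.AdamW(agent.parameters(), lr=1e-4, weight_decay=1e-5)
for batch in dataloader:  # hypothetical DataLoader yielding the keys above
    loss = agent.compute_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(agent.parameters(), 1.0)
    optimizer.step()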
39  roboimi/vla/conf/agent/resnet_diffusion.yaml  (new file)
@@ -0,0 +1,39 @@
# @package agent
defaults:
  # - /backbone@vision_backbone: resnet
  - /backbone@vision_backbone: resnet_diffusion
  - /modules@state_encoder: identity_state_encoder
  - /modules@action_encoder: identity_action_encoder
  - /head: conditional_unet1d
  - _self_

_target_: roboimi.vla.agent.VLAAgent

# ====================
# Model dimensions
# ====================
action_dim: 16  # action dimension (number of robot joints)
obs_dim: 16     # proprioception dimension (joint positions)

# ====================
# Normalization
# ====================
normalization_type: "min_max"  # "min_max" or "gaussian"

# ====================
# Horizons
# ====================
pred_horizon: 16     # how many future action steps to predict
obs_horizon: 2       # how many past observation steps to use
num_action_steps: 8  # how many actions to actually execute per inference (should be <= pred_horizon - obs_horizon + 1)

# ====================
# Cameras
# ====================
num_cams: 3  # number of cameras (r_vis, top, front)

# ====================
# Diffusion process
# ====================
diffusion_steps: 100  # number of diffusion training steps (DDPM)
inference_steps: 10   # number of denoising steps at inference (DDIM, fixed at 10)
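With the horizons above, the executable slice of each predicted chunk works out as follows (same indexing as select_action in roboimi/vla/agent.py):

pred_horizon, obs_horizon, num_action_steps = 16, 2, 8
start = obs_horizon - 1         # 1: actions before this align with past observations
end = start + num_action_steps  # 9
# actions[:, 1:9] are executed; 8 <= pred_horizon - obs_horizon + 1 == 15 holds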
37  roboimi/vla/conf/agent/resnet_gr00t_dit.yaml  (new file)
@@ -0,0 +1,37 @@
# @package agent
defaults:
  - /backbone@vision_backbone: resnet_diffusion
  - /modules@state_encoder: identity_state_encoder
  - /modules@action_encoder: identity_action_encoder
  - /head: gr00t_dit1d
  - _self_

_target_: roboimi.vla.agent_gr00t_dit.VLAAgentGr00tDiT

# Model dimensions
action_dim: 16
obs_dim: 16

# Normalization
normalization_type: "min_max"

# Horizons
pred_horizon: 16
obs_horizon: 2
num_action_steps: 8

# Cameras
num_cams: 3

# Diffusion
diffusion_steps: 100
inference_steps: 10

# Head overrides
head:
  input_dim: ${agent.action_dim}
  output_dim: ${agent.action_dim}
  horizon: ${agent.pred_horizon}
  n_obs_steps: ${agent.obs_horizon}
  cond_dim: 208
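The cond_dim value of 208 follows from the backbone configuration (32 SpatialSoftmax keypoints give 64 features per camera):

keypoints = 32
feat_per_cam = keypoints * 2      # 64: an (x, y) coordinate per keypoint
cond_dim = feat_per_cam * 3 + 16  # 3 cameras + obs_dim = 208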
62  roboimi/vla/conf/agent/resnet_transformer.yaml  (new file)
@@ -0,0 +1,62 @@
# @package agent
defaults:
  - /backbone@vision_backbone: resnet_diffusion
  - /modules@state_encoder: identity_state_encoder
  - /modules@action_encoder: identity_action_encoder
  - /head: transformer1d
  - _self_

_target_: roboimi.vla.agent.VLAAgent

# ====================
# Model dimensions
# ====================
action_dim: 16  # action dimension (number of robot joints)
obs_dim: 16     # proprioception dimension (joint positions)

# ====================
# Normalization
# ====================
normalization_type: "min_max"  # "min_max" or "gaussian"

# ====================
# Horizons
# ====================
pred_horizon: 16     # how many future action steps to predict
obs_horizon: 2       # how many past observation steps to use
num_action_steps: 8  # how many actions to actually execute per inference (should be <= pred_horizon - obs_horizon + 1)

# ====================
# Cameras
# ====================
camera_names: ${data.camera_names}  # the conditioning camera order is fixed to r_vis, top, front
num_cams: 3                         # number of cameras (r_vis, top, front)

vision_backbone:
  num_cameras: ${agent.num_cams}
  camera_names: ${agent.camera_names}

# ====================
# Diffusion process
# ====================
diffusion_steps: 100  # number of diffusion training steps (DDPM)
inference_steps: 10   # number of denoising steps at inference (DDIM, fixed at 10)

# ====================
# Head type flag (lets VLAAgent pick the call convention)
# ====================
head_type: "transformer"  # "unet" or "transformer"

# Head parameter overrides
head:
  input_dim: ${agent.action_dim}
  output_dim: ${agent.action_dim}
  horizon: ${agent.pred_horizon}
  n_obs_steps: ${agent.obs_horizon}
  # For the Transformer, cond_dim is the per-step dimension.
  # ResNet18 + SpatialSoftmax (32 keypoints) = 64 dims per camera.
  # Computation: per-camera features (64) * cameras (3) + obs_dim (16) = 208
  cond_dim: 208
  causal_attn: false
  time_as_cond: true
  obs_as_cond: true
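Since the agent configs are selected through Hydra defaults, a sketch of composing one programmatically (this assumes Hydra >= 1.2 and a working directory at the repo root; the actual training entry point is not shown in this diff):

from hydra import compose, initialize
from hydra.utils import instantiate

with initialize(config_path="roboimi/vla/conf", version_base=None):
    cfg = compose(config_name="config", overrides=["agent=resnet_transformer"])
agent = instantiate(cfg.agent)  # builds VLAAgent with the Transformer1D head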
33  roboimi/vla/conf/backbone/resnet_diffusion.yaml  (new file)
@@ -0,0 +1,33 @@
_target_: roboimi.vla.models.backbones.resnet_diffusion.ResNetDiffusionBackbone

# ====================
# Backbone selection
# ====================
vision_backbone: "resnet18"                   # torchvision model name: resnet18, resnet34, resnet50
pretrained_backbone_weights: "IMAGENET1K_V1"  # ImageNet pretrained weights (torchvision >= 0.13)

# ====================
# Freezing
# ====================
freeze_backbone: true  # freeze the ResNet parameters and only train the later pool and out layers (recommended: true)

# ====================
# Input
# ====================
input_shape: [3, 224, 224]  # input image shape (C, H, W) - standard ImageNet size
crop_shape: null            # cropped image shape (H, W) - set to null to disable cropping
crop_is_random: true        # random crop during training, center crop during evaluation (no effect when crop_shape is null)

# ====================
# Normalization and feature extraction
# ====================
use_group_norm: true               # replace BatchNorm with GroupNorm (better suited to small-batch training)
spatial_softmax_num_keypoints: 32  # number of Spatial Softmax keypoints

# ====================
# Encoder mode
# ====================
# false: shared encoder (all cameras share one ResNet; fewer parameters but limited capacity) - recommended!
# true:  separate encoders (one ResNet per camera; more parameters but higher capacity)
use_separate_rgb_encoder_per_camera: true
num_cameras: 3  # number of cameras
50  roboimi/vla/conf/config.yaml  (new file)
@@ -0,0 +1,50 @@
defaults:
  - agent: resnet_transformer
  - data: simpe_robot_dataset
  - eval: eval
  - _self_

# ====================
# Training
# ====================
train:
  # Basic training parameters
  batch_size: 16     # batch size
  lr: 1e-4           # learning rate
  max_steps: 100000  # maximum number of training steps
  device: "cuda"     # device: "cuda" or "cpu"

  # Data loading
  num_workers: 12  # number of DataLoader worker processes (set to 0 when debugging)
  val_split: 0.0   # validation split ratio; trains on the full dataset by default
  seed: 42         # random seed (used for the data split)

  # Logging and checkpoints
  log_freq: 100                   # logging frequency (steps)
  save_freq: 2000                 # checkpoint saving frequency (steps)
  use_swanlab: false              # whether to enable SwanLab scalar logging
  swanlab_project: "roboimi-vla"  # SwanLab project name
  swanlab_run_name: null          # optional SwanLab run name
  rollout_val_freq_epochs: 50     # run rollout validation every N epochs
  rollout_validate_on_checkpoint: false  # run rollout validation right after saving a checkpoint
  rollout_num_episodes: 3         # number of episodes per rollout validation

  # Learning-rate scheduler (with warmup)
  warmup_steps: 2000        # warmup steps (longer is recommended for Transformers)
  scheduler_type: "cosine"  # scheduler after warmup: "constant" or "cosine"
  min_lr: 1e-6              # minimum learning rate (for cosine annealing)

  # Optimizer
  weight_decay: 1e-5  # weight decay (L2 regularization)
  grad_clip: 1.0      # gradient clipping threshold

  # Fine-tuning
  pretrained_ckpt: null  # pretrained checkpoint path (for fine-tuning), e.g. "checkpoints/vla_model_step_8000.pt"

# ====================
# Experiment
# ====================
experiment:
  name: "vla_diffusion"  # experiment name
  notes: ""              # experiment notes
  tags: []               # experiment tags
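A sketch of the warmup + cosine schedule these keys describe, expressed as a LambdaLR multiplier. The exact scheduler implementation lives elsewhere in the repo; this is the usual formulation, stated here as an assumption.

import math
import torch

def warmup_cosine(step, warmup_steps=2000, max_steps=100_000, lr=1e-4, min_lr=1e-6):
    # Linear warmup to 1.0, then cosine decay toward min_lr / lr.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    floor = min_lr / lr
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * progress))

# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)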
21  roboimi/vla/conf/data/simpe_robot_dataset.yaml  (new file)
@@ -0,0 +1,21 @@
# @package data
_target_: roboimi.vla.data.simpe_robot_dataset.SimpleRobotDataset

# ====================
# Dataset path
# ====================
dataset_dir: "roboimi/demos/dataset/sim_transfer"

# ====================
# Horizon parameters (referenced from the agent config)
# ====================
pred_horizon: ${agent.pred_horizon}  # number of prediction steps
obs_horizon: ${agent.obs_horizon}    # number of observation steps

# ====================
# Cameras
# ====================
camera_names:
  - r_vis  # robot-view camera
  - top    # top camera
  - front  # front camera
47  roboimi/vla/conf/eval/eval.yaml  (new file)
@@ -0,0 +1,47 @@
# @package eval
# Evaluation configuration
ckpt_path: "checkpoints/vla_model_best.pt"  # model checkpoint path
num_episodes: 3            # number of evaluation episodes
max_timesteps: 700         # maximum timesteps per episode
device: ${train.device}    # keep consistent with training
task_name: "sim_transfer"  # environment task name

# ====================
# Policy execution
# ====================
# num_queries is deprecated; the agent's select_action() now manages the queue automatically.
# The parameters below exist only for backward compatibility; agent.num_action_steps is what is actually used.
num_queries: ${agent.num_action_steps}
obs_horizon: ${agent.obs_horizon}

# ====================
# Cameras
# ====================
camera_names: ${data.camera_names}

# ====================
# Action smoothing
# ====================
use_smoothing: false
smooth_method: "ema"
smooth_alpha: 0.3

# ====================
# Debug options
# ====================
headless: false       # disable MuJoCo / OpenCV GUI rendering
verbose_action: true  # print action info at every timestep

# ====================
# Rollout artifact export
# ====================
artifact_dir: null          # optional output directory; auto-created when export is enabled and this is empty
save_artifacts: false       # master switch; the individual export flags below must also be set
save_timing: false          # save timing.json (per-stage timing statistics)
save_trajectory: false      # save trajectory.npz (raw EE actions + post-execution EE poses)
save_summary_json: false    # save a JSON-friendly rollout summary
save_trajectory_npz: false  # save per-step trajectories/timing/EE poses as NPZ
record_video: false         # record a rollout mp4 from a single camera stream
video_camera: null          # alias for video_camera_name
video_camera_name: null     # camera used for video recording; defaults to camera_names[0] when empty
video_fps: 30               # target frame rate of the exported mp4
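For smooth_method: "ema", a sketch of what smooth_alpha: 0.3 means. The smoothing code itself is not part of this diff; this is the standard exponential-moving-average formulation, assumed here.

smooth_alpha = 0.3
smoothed = None
for action in action_stream:  # hypothetical: successive select_action() outputs
    # Higher alpha follows the newest action more closely (less smoothing).
    smoothed = action if smoothed is None else smooth_alpha * action + (1 - smooth_alpha) * smoothed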
15  roboimi/vla/conf/head/conditional_unet1d.yaml  (new file)
@@ -0,0 +1,15 @@
_target_: roboimi.vla.models.heads.conditional_unet1d.ConditionalUnet1D
_partial_: true

# ====================
# UNet1D configuration
# ====================
kernel_size: 3             # convolution kernel size
cond_predict_scale: false  # whether FiLM conditioning also predicts a scale (bias + scale vs. bias only)

# ====================
# Architecture (defaults, can be overridden)
# ====================
# diffusion_step_embed_dim: 256  # diffusion timestep embedding dimension
# down_dims: [256, 512, 1024]    # channel count of each downsampling level
# n_groups: 8                    # number of GroupNorm groups
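The cond_predict_scale flag selects between the two FiLM variants implemented in ConditionalResidualBlock1D later in this diff; an illustrative tensor-level sketch:

import torch

out = torch.randn(4, 256, 16)       # (B, channels, horizon) feature map
embed = torch.randn(4, 2, 256, 1)   # cond_predict_scale=True: scale and bias channels
scale, bias = embed[:, 0], embed[:, 1]
out_scaled = scale * out + bias     # bias + scale variant
out_biased = out + embed[:, 1]      # cond_predict_scale=False: bias only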
22  roboimi/vla/conf/head/gr00t_dit1d.yaml  (new file)
@@ -0,0 +1,22 @@
_target_: roboimi.vla.models.heads.gr00t_dit1d.Gr00tDiT1D
_partial_: true

# DiT architecture
n_layer: 6
n_head: 8
n_emb: 256
hidden_dim: 256
mlp_ratio: 4
dropout: 0.1

# Positional embeddings
add_action_pos_emb: true
add_cond_pos_emb: true

# Supplied by agent interpolation:
# - input_dim
# - output_dim
# - horizon
# - n_obs_steps
# - cond_dim
30  roboimi/vla/conf/head/transformer1d.yaml  (new file)
@@ -0,0 +1,30 @@
# Transformer-based Diffusion Policy Head
_target_: roboimi.vla.models.heads.transformer1d.Transformer1D
_partial_: true

# ====================
# Transformer architecture
# ====================
n_layer: 4         # number of Transformer layers (keeps the current small-model configuration)
n_head: 4          # number of attention heads
n_emb: 128         # embedding dimension
p_drop_emb: 0.05   # embedding dropout
p_drop_attn: 0.05  # attention dropout

# ====================
# Conditioning
# ====================
causal_attn: false  # matches the full-attention / non-causal variant of the external TransformerForDiffusion
time_as_cond: true  # consistent with the external implementation: the timestep is a conditioning token
obs_as_cond: true   # API alignment; whether it is actually used is decided by cond_dim > 0
n_cond_layers: 1    # number of conditioning encoder layers (keeps the current configuration)

# ====================
# Notes
# ====================
# The following parameters are supplied via interpolation in the agent config:
# - input_dim: ${agent.action_dim}
# - output_dim: ${agent.action_dim}
# - horizon: ${agent.pred_horizon}
# - n_obs_steps: ${agent.obs_horizon}
# - cond_dim: computed from global_cond_dim in the agent
1  roboimi/vla/conf/modules/identity_action_encoder.yaml  (new file)
@@ -0,0 +1 @@
_target_: roboimi.vla.modules.encoders.IdentityActionEncoder

1  roboimi/vla/conf/modules/identity_state_encoder.yaml  (new file)
@@ -0,0 +1 @@
_target_: roboimi.vla.modules.encoders.IdentityStateEncoder
0  roboimi/vla/core/__init__.py  (new file, empty)

46  roboimi/vla/core/interfaces.py  (new file)
@@ -0,0 +1,46 @@
import abc
import torch
import torch.nn as nn
from typing import Dict, Any, Optional


class VLABackbone(nn.Module, abc.ABC):
    """
    Contract for Vision/Language Backbones.
    Must return a feature tensor of shape (B, Seq, Embed_Dim).
    """
    @abc.abstractmethod
    def forward(self, obs: Dict[str, torch.Tensor]) -> torch.Tensor:
        """
        Args:
            obs: Dictionary containing 'image' and optionally 'text'.
        Returns:
            features: (B, S, D) embedding.
        """
        pass


class VLAProjector(nn.Module, abc.ABC):
    """
    Contract for the adaptation layer (Projector).
    Connects Backbone features to the Policy Head.
    """
    @abc.abstractmethod
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass


class VLAHead(nn.Module, abc.ABC):
    """
    Contract for Action Generation Heads (Policies).
    Handles both training (loss calculation) and inference (action generation).
    """
    @abc.abstractmethod
    def forward(self, embeddings: torch.Tensor, actions: Optional[torch.Tensor] = None) -> Dict[str, torch.Tensor]:
        """
        Args:
            embeddings: (B, S, Hidden) from Projector.
            actions: (B, Pred_Horizon, Action_Dim) - Ground truth for training.
        Returns:
            Dict containing 'loss' (if actions provided) or 'pred_actions'.
        """
        pass
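A minimal concrete class satisfying the VLAProjector contract above. This is an illustrative sketch, not a module that exists in this diff.

import torch
import torch.nn as nn

class LinearProjector(VLAProjector):
    """Illustrative projector: maps (B, S, D_in) backbone features to (B, S, D_out)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)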
0  roboimi/vla/data/__init__.py  (new file, empty)

243  roboimi/vla/data/simpe_robot_dataset.py  (new file)
@@ -0,0 +1,243 @@
import torch
import h5py
import cv2  # moved to module level from _load_frame for idiomatic imports
from torch.utils.data import Dataset
from typing import List, Dict, Union
from pathlib import Path
from collections import OrderedDict


class SimpleRobotDataset(Dataset):
    """
    Lazily-loading HDF5 dataset in LeRobotDataset format.

    Returned format:
    - observation.state: (obs_horizon, state_dim)
    - observation.{cam_name}: (obs_horizon, C, H, W)
    - action: (pred_horizon, action_dim)
    """

    def __init__(
        self,
        dataset_dir: Union[str, Path],
        obs_horizon: int = 2,
        pred_horizon: int = 8,
        camera_names: List[str] = None,
        max_open_files: int = 64,
    ):
        """
        Args:
            dataset_dir: directory containing the HDF5 files
            obs_horizon: how many past frames to observe
            pred_horizon: how many future action frames to predict
            camera_names: list of camera names, e.g. ["r_vis", "top", "front"]
            max_open_files: maximum number of HDF5 file handles cached per worker

        HDF5 file layout:
        - action: [T, action_dim]
        - observations/qpos: [T, obs_dim]
        - observations/images/{cam_name}: [T, H, W, C]
        """
        self.obs_horizon = obs_horizon
        self.pred_horizon = pred_horizon
        self.camera_names = camera_names or []
        self.max_open_files = max(1, int(max_open_files))
        self._file_cache: "OrderedDict[str, h5py.File]" = OrderedDict()

        self.dataset_dir = Path(dataset_dir)
        if not self.dataset_dir.exists():
            raise FileNotFoundError(f"Dataset directory does not exist: {dataset_dir}")

        # Locate the HDF5 files
        self.hdf5_files = sorted(self.dataset_dir.glob("*.hdf5"))
        if not self.hdf5_files:
            self.hdf5_files = sorted(self.dataset_dir.glob("episode_*.hdf5"))
        if not self.hdf5_files:
            raise FileNotFoundError(f"No HDF5 files found in {dataset_dir}")

        # Build the episode index (metadata only; no data is loaded here)
        self.episodes = {}
        self.frame_meta = []  # stores (ep_idx, frame_idx, hdf5_path)
        for ep_idx, hdf5_path in enumerate(self.hdf5_files):
            with h5py.File(hdf5_path, 'r') as f:
                T = f['action'].shape[0]
                start_idx = len(self.frame_meta)
                for t in range(T):
                    self.frame_meta.append({
                        "ep_idx": ep_idx,
                        "frame_idx": t,
                        "hdf5_path": hdf5_path,
                    })
                self.episodes[ep_idx] = list(range(start_idx, len(self.frame_meta)))

        print(f"Lazy-loading mode: {len(self.hdf5_files)} episodes, {len(self.frame_meta)} frames total")

    def __len__(self):
        return len(self.frame_meta)

    def _close_all_files(self) -> None:
        """Close every HDF5 file handle cached by the current worker."""
        for f in self._file_cache.values():
            try:
                f.close()
            except Exception:
                pass
        self._file_cache.clear()

    def _get_h5_file(self, hdf5_path: Union[str, Path]) -> h5py.File:
        """
        Get an HDF5 file handle (per-worker LRU cache).
        Note: the cache holds file handles, not the frame data itself.
        """
        key = str(hdf5_path)
        if key in self._file_cache:
            self._file_cache.move_to_end(key)
            return self._file_cache[key]

        # Evict the least recently used handle once the limit is exceeded
        if len(self._file_cache) >= self.max_open_files:
            _, old_file = self._file_cache.popitem(last=False)
            try:
                old_file.close()
            except Exception:
                pass

        f = h5py.File(key, 'r')
        self._file_cache[key] = f
        return f

    def _load_frame(self, idx: int, *, load_images: bool = True) -> Dict:
        """Lazily load a single frame from its HDF5 file."""
        meta = self.frame_meta[idx]
        f = self._get_h5_file(meta["hdf5_path"])
        frame = {
            "episode_index": meta["ep_idx"],
            "frame_index": meta["frame_idx"],
            "task": f.get('task', [b"unknown"])[0].decode() if 'task' in f else "unknown",
            "observation.state": torch.from_numpy(f['observations/qpos'][meta["frame_idx"]]).float(),
            "action": torch.from_numpy(f['action'][meta["frame_idx"]]).float(),
        }

        # Load image data: observations/images/{cam_name} -> observation.{cam_name}
        if load_images:
            for cam_name in self.camera_names:
                h5_path = f'observations/images/{cam_name}'
                if h5_path in f:
                    img = f[h5_path][meta["frame_idx"]]
                    # Resize images to 224x224 (reduces memory and I/O load)
                    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
                    # Convert to float and normalize to [0, 1]
                    img = torch.from_numpy(img).float() / 255.0
                    frame[f"observation.{cam_name}"] = img.permute(2, 0, 1)  # HWC -> CHW

        return frame

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        frame = self._load_frame(idx, load_images=False)
        ep_idx = frame["episode_index"]

        # Frame index range of the current episode
        ep_indices = self.episodes[ep_idx]
        ep_start = ep_indices[0]
        ep_end = ep_indices[-1]

        # ============================================
        # 1. Load observations (past obs_horizon frames)
        # ============================================
        observations = {
            "state": [],  # state data
        }
        # One independent list per camera
        for cam_name in self.camera_names:
            observations[f"observation.{cam_name}"] = []

        observation_is_pad = []

        for delta in range(-self.obs_horizon + 1, 1):  # [-1, 0] for obs_horizon=2
            target_idx = idx + delta

            # Boundary check
            if ep_start <= target_idx <= ep_end:
                target_frame = self._load_frame(target_idx)
                is_pad = False
            else:
                # Out of bounds: pad with the boundary frame
                if target_idx < ep_start:
                    target_frame = self._load_frame(ep_start)
                else:
                    target_frame = self._load_frame(ep_end)
                is_pad = True

            # Collect the state
            observations["state"].append(target_frame["observation.state"])

            # Collect every camera's image
            for cam_name in self.camera_names:
                observations[f"observation.{cam_name}"].append(target_frame[f"observation.{cam_name}"])

            observation_is_pad.append(is_pad)

        # ============================================
        # 2. Load actions (future pred_horizon frames)
        # ============================================
        actions = []
        action_is_pad = []

        for delta in range(self.pred_horizon):
            target_idx = idx + delta

            if target_idx <= ep_end:
                actions.append(self._load_frame(target_idx, load_images=False)["action"])
                action_is_pad.append(False)
            else:
                actions.append(self._load_frame(ep_end, load_images=False)["action"])
                action_is_pad.append(True)

        # ============================================
        # 3. Assemble the output (LeRobotDataset format)
        # ============================================
        result = {
            # State observation: (obs_horizon, state_dim)
            "observation.state": torch.stack(observations["state"]),
            "observation_is_pad": torch.tensor(observation_is_pad, dtype=torch.bool),

            # Actions: (pred_horizon, action_dim)
            "action": torch.stack(actions),
            "action_is_pad": torch.tensor(action_is_pad, dtype=torch.bool),

            # Task
            "task": frame["task"],
        }

        # Images: one key per camera
        # Shape: (obs_horizon, C, H, W)
        for cam_name in self.camera_names:
            result[f"observation.{cam_name}"] = torch.stack(observations[f"observation.{cam_name}"])

        return result

    @property
    def camera_keys(self) -> list[str]:
        """All camera keys (LeRobotDataset format)."""
        return [f"observation.{cam_name}" for cam_name in self.camera_names]

    @property
    def camera_info(self) -> dict:
        """Camera info."""
        if not self.camera_names:
            return {}

        # Get the shapes from the first sample
        sample = self[0]
        info = {}
        for cam_name in self.camera_names:
            key = f"observation.{cam_name}"
            if key in sample:
                info[key] = {
                    "shape": sample[key].shape,
                    "dtype": str(sample[key].dtype),
                }
        return info

    def __del__(self):
        self._close_all_files()
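A usage sketch for this dataset. The dataset_dir and camera names match the data config above; batch shapes follow the class docstring (state_dim/action_dim depend on the recorded data).

from torch.utils.data import DataLoader

dataset = SimpleRobotDataset(
    dataset_dir="roboimi/demos/dataset/sim_transfer",
    obs_horizon=2, pred_horizon=16,
    camera_names=["r_vis", "top", "front"],
)
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=12)
batch = next(iter(loader))
print(batch["observation.state"].shape)  # (16, 2, state_dim)
print(batch["observation.r_vis"].shape)  # (16, 2, 3, 224, 224)
print(batch["action"].shape)             # (16, 16, action_dim)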
3  roboimi/vla/eval_utils.py  (new file)
@@ -0,0 +1,3 @@
def execute_policy_action(env, action):
    """Execute policy outputs using EE-action semantics."""
    env.step(action)
0  roboimi/vla/models/__init__.py  (new file, empty)

4  roboimi/vla/models/backbones/__init__.py  (new file)
@@ -0,0 +1,4 @@
# Backbone models
from .resnet_diffusion import ResNetDiffusionBackbone

__all__ = ["ResNetDiffusionBackbone"]  # only exports what is actually imported here
394  roboimi/vla/models/backbones/resnet_diffusion.py  (new file)
@@ -0,0 +1,394 @@
from roboimi.vla.core.interfaces import VLABackbone
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import numpy as np
from typing import Callable, Optional, Tuple, Union


def _replace_submodules(
    root_module: nn.Module, predicate: Callable[[nn.Module], bool], func: Callable[[nn.Module], nn.Module]
) -> nn.Module:
    """
    Args:
        root_module: the root module whose submodules should be replaced
        predicate: takes a module and returns True if it should be replaced.
        func: takes a module and returns the new module that replaces it.
    Returns:
        The root module with its submodules replaced.
    """
    if predicate(root_module):
        return func(root_module)

    replace_list = [k.split(".") for k, m in root_module.named_modules(remove_duplicate=True) if predicate(m)]
    for *parents, k in replace_list:
        parent_module = root_module
        if len(parents) > 0:
            parent_module = root_module.get_submodule(".".join(parents))
        if isinstance(parent_module, nn.Sequential):
            src_module = parent_module[int(k)]
        else:
            src_module = getattr(parent_module, k)
        tgt_module = func(src_module)
        if isinstance(parent_module, nn.Sequential):
            parent_module[int(k)] = tgt_module
        else:
            setattr(parent_module, k, tgt_module)
    # Verify that every matching module (e.g. BatchNorm) has been replaced
    assert not any(predicate(m) for _, m in root_module.named_modules(remove_duplicate=True))
    return root_module


class SpatialSoftmax(nn.Module):
    """
    Spatial soft argmax as described by Finn et al. in "Deep Spatial Autoencoders for
    Visuomotor Learning" (https://huggingface.co/papers/1509.06113).
    A minimal port of the robomimic implementation.
    """

    def __init__(self, input_shape, num_kp=None):
        """
        Args:
            input_shape (list): (C, H, W) input feature map shape.
            num_kp (int): number of keypoints in the output. If None, the output keeps
                the same number of channels as the input.
        """
        super().__init__()

        assert len(input_shape) == 3
        self._in_c, self._in_h, self._in_w = input_shape

        if num_kp is not None:
            self.nets = torch.nn.Conv2d(self._in_c, num_kp, kernel_size=1)
            self._out_c = num_kp
        else:
            self.nets = None
            self._out_c = self._in_c

        # We could use torch.linspace directly, but it appears to behave slightly
        # differently from numpy and causes a small pc_success drop for pretrained models.
        pos_x, pos_y = np.meshgrid(np.linspace(-1.0, 1.0, self._in_w), np.linspace(-1.0, 1.0, self._in_h))
        pos_x = torch.from_numpy(pos_x.reshape(self._in_h * self._in_w, 1)).float()
        pos_y = torch.from_numpy(pos_y.reshape(self._in_h * self._in_w, 1)).float()
        # Register as a buffer so it is moved to the right device.
        self.register_buffer("pos_grid", torch.cat([pos_x, pos_y], dim=1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """
        Args:
            features: (B, C, H, W) input feature map.
        Returns:
            (B, K, 2) image-space coordinates of the keypoints.
        """
        if self.nets is not None:
            features = self.nets(features)

        # [B, K, H, W] -> [B * K, H * W], where K is the number of keypoints
        features = features.reshape(-1, self._in_h * self._in_w)
        # 2d softmax normalization
        attention = F.softmax(features, dim=-1)
        # [B * K, H * W] x [H * W, 2] -> [B * K, 2] spatial coordinate means over x and y
        expected_xy = attention @ self.pos_grid
        # Reshape to [B, K, 2]
        feature_keypoints = expected_xy.view(-1, self._out_c, 2)

        return feature_keypoints


class _SingleRgbEncoder(nn.Module):
    """RGB encoder for a single camera; usable standalone or shared."""
    def __init__(
        self,
        vision_backbone: str,
        pretrained_backbone_weights: str | None,
        input_shape: Tuple[int, int, int],
        crop_shape: Optional[Tuple[int, int]],
        crop_is_random: bool,
        use_group_norm: bool,
        spatial_softmax_num_keypoints: int,
        freeze_backbone: bool = True,  # new: whether to freeze the backbone
    ):
        super().__init__()

        # Optional preprocessing
        if crop_shape is not None:
            self.do_crop = True
            # Always use a center crop during evaluation
            self.center_crop = torchvision.transforms.CenterCrop(crop_shape)
            if crop_is_random:
                self.maybe_random_crop = torchvision.transforms.RandomCrop(crop_shape)
            else:
                self.maybe_random_crop = self.center_crop
        else:
            self.do_crop = False
            crop_shape = input_shape[1:]

        # Set up the backbone
        backbone_model = getattr(torchvision.models, vision_backbone)(
            weights=pretrained_backbone_weights
        )

        # Remove AvgPool and FC (assumes layer4 is children()[-3])
        self.backbone = nn.Sequential(*(list(backbone_model.children())[:-2]))

        if use_group_norm:
            self.backbone = _replace_submodules(
                root_module=self.backbone,
                predicate=lambda x: isinstance(x, nn.BatchNorm2d),
                func=lambda x: nn.GroupNorm(num_groups=x.num_features // 16, num_channels=x.num_features),
            )

        # Optionally freeze the backbone parameters
        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False

        # Set up pooling and the final layers.
        # Use a dry run to get the feature map shape.
        dummy_shape = (1, input_shape[0], *crop_shape)
        with torch.no_grad():
            dummy_out = self.backbone(torch.zeros(dummy_shape))
        feature_map_shape = dummy_out.shape[1:]  # (C, H, W)

        self.pool = SpatialSoftmax(feature_map_shape, num_kp=spatial_softmax_num_keypoints)
        self.feature_dim = spatial_softmax_num_keypoints * 2
        self.out = nn.Linear(spatial_softmax_num_keypoints * 2, self.feature_dim)
        self.relu = nn.ReLU()

        # Register the ImageNet normalization constants as buffers (moved to GPU automatically)
        self.register_buffer('mean', torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer('std', torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward_single_image(self, x: torch.Tensor) -> torch.Tensor:
        if self.do_crop:
            x = self.maybe_random_crop(x) if self.training else self.center_crop(x)

        # ImageNet normalization (the input distribution the pretrained weights expect)
        x = (x - self.mean) / self.std

        x = self.relu(self.out(torch.flatten(self.pool(self.backbone(x)), start_dim=1)))
        return x


class ResNetDiffusionBackbone(VLABackbone):
    def __init__(
        self,
        vision_backbone: str = "resnet18",
        pretrained_backbone_weights: str | None = None,
        input_shape: Tuple[int, int, int] = (3, 84, 84),  # (C, H, W)
        crop_shape: Optional[Tuple[int, int]] = None,
        crop_is_random: bool = True,
        use_group_norm: bool = True,
        spatial_softmax_num_keypoints: int = 32,
        use_separate_rgb_encoder_per_camera: bool = False,  # new: one independent encoder per camera
        num_cameras: int = 1,  # new: number of cameras (only used in separate-encoder mode)
        camera_names: Optional[Tuple[str, ...]] = None,  # explicit camera order
        freeze_backbone: bool = True,  # new: whether to freeze the ResNet backbone (recommended: True)
    ):
        super().__init__()

        self.use_separate_rgb_encoder_per_camera = use_separate_rgb_encoder_per_camera
        self.num_cameras = num_cameras
        self.camera_names = tuple(camera_names) if camera_names is not None else None
        if self.camera_names is not None and len(self.camera_names) != self.num_cameras:
            raise ValueError(
                f"camera_names length ({len(self.camera_names)}) does not match num_cameras ({self.num_cameras})"
            )

        if use_separate_rgb_encoder_per_camera:
            # Separate-encoder mode: one independent encoder per camera
            encoders = [
                _SingleRgbEncoder(
                    vision_backbone=vision_backbone,
                    pretrained_backbone_weights=pretrained_backbone_weights,
                    input_shape=input_shape,
                    crop_shape=crop_shape,
                    crop_is_random=crop_is_random,
                    use_group_norm=use_group_norm,
                    spatial_softmax_num_keypoints=spatial_softmax_num_keypoints,
                    freeze_backbone=freeze_backbone,
                )
                for _ in range(num_cameras)
            ]
            self.rgb_encoder = nn.ModuleList(encoders)
            # Important: output_dim always refers to a single encoder's feature dimension (matching lerobot)
            self.feature_dim = encoders[0].feature_dim
        else:
            # Shared-encoder mode: all cameras share one encoder
            self.rgb_encoder = _SingleRgbEncoder(
                vision_backbone=vision_backbone,
                pretrained_backbone_weights=pretrained_backbone_weights,
                input_shape=input_shape,
                crop_shape=crop_shape,
                crop_is_random=crop_is_random,
                use_group_norm=use_group_norm,
                spatial_softmax_num_keypoints=spatial_softmax_num_keypoints,
                freeze_backbone=freeze_backbone,
            )
            self.feature_dim = self.rgb_encoder.feature_dim

    def _ordered_camera_names(self, images) -> Tuple[str, ...]:
        if self.camera_names is None:
            camera_names = tuple(sorted(images.keys()))
            if len(camera_names) != self.num_cameras:
                raise ValueError(
                    f"Number of input cameras ({len(camera_names)}) does not match num_cameras ({self.num_cameras})"
                )
            return camera_names

        missing = [cam_name for cam_name in self.camera_names if cam_name not in images]
        if missing:
            raise ValueError(
                f"Input images are missing required cameras. missing={missing}, expected={list(self.camera_names)}"
            )
        return self.camera_names

    def forward(self, images):
        """
        Args:
            images: Dict[str, Tensor], one image per camera
                shapes: {cam_name: (B, T, C, H, W)}

        Returns:
            Tensor: (B, T, total_feature_dim)
        """
        any_tensor = next(iter(images.values()))
        B, T = any_tensor.shape[:2]
        cam_names = self._ordered_camera_names(images)

        if self.use_separate_rgb_encoder_per_camera:
            # Separate-encoder mode: each camera goes through its own encoder
            features_all = []
            for cam_idx, cam_name in enumerate(cam_names):
                img = images[cam_name]
                encoder = self.rgb_encoder[cam_idx]
                features = encoder.forward_single_image(img.reshape(B * T, *img.shape[2:]))
                features_all.append(features)
            return torch.cat(features_all, dim=1).view(B, T, -1)
        else:
            # Shared-encoder mode: all cameras share the same encoder
            features_all = []
            for cam_name in cam_names:
                img = images[cam_name]
                features = self.rgb_encoder.forward_single_image(img.reshape(B * T, *img.shape[2:]))
                features_all.append(features)
            return torch.cat(features_all, dim=1).view(B, T, -1)

    @property
    def output_dim(self):
        return self.feature_dim


if __name__ == "__main__":
    print("=" * 60)
    print("🚀 Testing ResNetDiffusionBackbone")
    print("=" * 60)

    # Configuration
    B, T = 2, 5
    C, H, W = 3, 96, 96
    crop_h, crop_w = 84, 84
    num_keypoints = 32
    feature_dim_per_cam = num_keypoints * 2

    # Create dummy input (2 cameras)
    images = {
        "cam_high": torch.randn(B, T, C, H, W),
        "cam_wrist": torch.randn(B, T, C, H, W)
    }
    num_cameras = len(images)

    # ============================================================================
    # Test 1: Shared encoder (default mode)
    # ============================================================================
    print("\n[Test 1] Shared Encoder Mode")
    print("-" * 60)
    backbone_shared = ResNetDiffusionBackbone(
        vision_backbone="resnet18",
        pretrained_backbone_weights=None,  # Speed up test
        input_shape=(C, H, W),
        crop_shape=(crop_h, crop_w),
        crop_is_random=True,
        use_group_norm=True,
        spatial_softmax_num_keypoints=num_keypoints,
        use_separate_rgb_encoder_per_camera=False,  # shared encoder
    )

    print(f"✅ Shared encoder model instantiated")
    print(f"   Output dim per camera: {feature_dim_per_cam}")
    print(f"   Number of cameras: {num_cameras}")
    print(f"   Expected total dim: {num_cameras * feature_dim_per_cam}")

    output = backbone_shared(images)
    print(f"\n🔄 Forward pass completed")
    print(f"   Input shapes: {[v.shape for v in images.values()]}")
    print(f"   Output shape: {output.shape}")

    expected_dim = num_cameras * feature_dim_per_cam
    assert output.shape == (B, T, expected_dim), f"Expected shape {(B, T, expected_dim)}, got {output.shape}"
    print(f"✨ Test passed!")

    # ============================================================================
    # Test 2: Separate encoders
    # ============================================================================
    print("\n[Test 2] Separate Encoders Mode")
    print("-" * 60)
    backbone_separate = ResNetDiffusionBackbone(
        vision_backbone="resnet18",
        pretrained_backbone_weights=None,  # Speed up test
        input_shape=(C, H, W),
        crop_shape=(crop_h, crop_w),
        crop_is_random=True,
        use_group_norm=True,
        spatial_softmax_num_keypoints=num_keypoints,
        use_separate_rgb_encoder_per_camera=True,  # separate encoders
        num_cameras=num_cameras,
    )

    print(f"✅ Separate encoders model instantiated")
    print(f"   Output dim per camera: {feature_dim_per_cam}")
    print(f"   Number of cameras: {num_cameras}")
    print(f"   Number of encoders: {len(backbone_separate.rgb_encoder)}")

    output = backbone_separate(images)
    print(f"\n🔄 Forward pass completed")
    print(f"   Input shapes: {[v.shape for v in images.values()]}")
    print(f"   Output shape: {output.shape}")

    expected_dim = num_cameras * feature_dim_per_cam
    assert output.shape == (B, T, expected_dim), f"Expected shape {(B, T, expected_dim)}, got {output.shape}"
    print(f"✨ Test passed!")

    # ============================================================================
    # Test 3: Verify parameter counts
    # ============================================================================
    print("\n[Test 3] Parameter Count Comparison")
    print("-" * 60)
    shared_params = sum(p.numel() for p in backbone_shared.parameters())
    separate_params = sum(p.numel() for p in backbone_separate.parameters())

    print(f"   Shared encoder parameters: {shared_params:,}")
    print(f"   Separate encoders parameters: {separate_params:,}")
    print(f"   Ratio: {separate_params / shared_params:.2f}x")

    assert separate_params > shared_params, "Separate encoders should have more parameters"
    print(f"✨ Verification passed!")

    # ============================================================================
    # Test 4: Verify independent parameters
    # ============================================================================
    print("\n[Test 4] Verify Independent Parameters")
    print("-" * 60)
    # Check that the encoders have independent parameters
    encoder_0_first_param = list(backbone_separate.rgb_encoder[0].parameters())[0]
    encoder_1_first_param = list(backbone_separate.rgb_encoder[1].parameters())[0]

    # Modify the first encoder's parameter
    with torch.no_grad():
        encoder_0_first_param += 1.0

    # Verify they are not the same tensor
    assert not torch.allclose(encoder_0_first_param, encoder_1_first_param), \
        "Encoders should have independent parameters"

    print(f"✅ Encoders have independent parameters")
    print(f"✨ All tests passed!")

    print("\n" + "=" * 60)
    print("🎉 All tests completed successfully!")
    print("=" * 60)
5  roboimi/vla/models/heads/__init__.py  (new file)
@@ -0,0 +1,5 @@
# Action Head models
from .conditional_unet1d import ConditionalUnet1D
from .transformer1d import Transformer1D

__all__ = ["ConditionalUnet1D", "Transformer1D"]
256  roboimi/vla/models/heads/conditional_unet1d.py  (new file)
@@ -0,0 +1,256 @@
|
||||
# Diffusion Policy Action Head 实现
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from typing import Dict, Optional
|
||||
from diffusers import DDPMScheduler
|
||||
from roboimi.vla.core.interfaces import VLAHead
|
||||
|
||||
from typing import Union
|
||||
import logging
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import einops
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
from einops.layers.torch import Rearrange
|
||||
import math
|
||||
|
||||
|
||||
class SinusoidalPosEmb(nn.Module):
|
||||
def __init__(self, dim):
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, x):
|
||||
device = x.device
|
||||
half_dim = self.dim // 2
|
||||
emb = math.log(10000) / (half_dim - 1)
|
||||
emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
|
||||
emb = x[:, None] * emb[None, :]
|
||||
emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
|
||||
return emb
|
||||
|
||||
class Downsample1d(nn.Module):
|
||||
def __init__(self, dim):
|
||||
super().__init__()
|
||||
self.conv = nn.Conv1d(dim, dim, 3, 2, 1)
|
||||
|
||||
def forward(self, x):
|
||||
return self.conv(x)
|
||||
|
||||
class Upsample1d(nn.Module):
|
||||
def __init__(self, dim):
|
||||
super().__init__()
|
||||
self.conv = nn.ConvTranspose1d(dim, dim, 4, 2, 1)
|
||||
|
||||
def forward(self, x):
|
||||
return self.conv(x)
|
||||
|
||||
class Conv1dBlock(nn.Module):
|
||||
'''
|
||||
Conv1d --> GroupNorm --> Mish
|
||||
'''
|
||||
|
||||
def __init__(self, inp_channels, out_channels, kernel_size, n_groups=8):
|
||||
super().__init__()
|
||||
|
||||
self.block = nn.Sequential(
|
||||
nn.Conv1d(inp_channels, out_channels, kernel_size, padding=kernel_size // 2),
|
||||
# Rearrange('batch channels horizon -> batch channels 1 horizon'),
|
||||
nn.GroupNorm(n_groups, out_channels),
|
||||
# Rearrange('batch channels 1 horizon -> batch channels horizon'),
|
||||
nn.Mish(),
|
||||
)
|
||||
|
||||
def forward(self, x):
|
||||
return self.block(x)
|
||||
|
||||
class ConditionalResidualBlock1D(nn.Module):
|
||||
def __init__(self,
|
||||
in_channels,
|
||||
out_channels,
|
||||
cond_dim,
|
||||
kernel_size=3,
|
||||
n_groups=8,
|
||||
cond_predict_scale=False):
|
||||
super().__init__()
|
||||
self.blocks = nn.ModuleList([
|
||||
Conv1dBlock(in_channels, out_channels, kernel_size, n_groups=n_groups),
|
||||
Conv1dBlock(out_channels, out_channels, kernel_size, n_groups=n_groups),
|
||||
])
|
||||
|
||||
|
||||
|
||||
cond_channels = out_channels
|
||||
if cond_predict_scale:
|
||||
cond_channels = out_channels * 2
|
||||
self.cond_predict_scale = cond_predict_scale
|
||||
self.out_channels = out_channels
|
||||
self.cond_encoder = nn.Sequential(
|
||||
nn.Mish(),
|
||||
nn.Linear(cond_dim, cond_channels),
|
||||
Rearrange('batch t -> batch t 1'),
|
||||
)
|
||||
|
||||
# make sure dimensions compatible
|
||||
self.residual_conv = nn.Conv1d(in_channels, out_channels, 1) \
|
||||
if in_channels != out_channels else nn.Identity()
|
||||
|
||||
def forward(self, x, cond):
|
||||
'''
|
||||
x : [ batch_size x in_channels x horizon ]
|
||||
cond : [ batch_size x cond_dim]
|
||||
|
||||
returns:
|
||||
out : [ batch_size x out_channels x horizon ]
|
||||
'''
|
||||
out = self.blocks[0](x)
|
||||
embed = self.cond_encoder(cond)
|
||||
if self.cond_predict_scale:
|
||||
            embed = embed.reshape(
                embed.shape[0], 2, self.out_channels, 1)
            scale = embed[:, 0, ...]
            bias = embed[:, 1, ...]
            out = scale * out + bias
        else:
            out = out + embed
        out = self.blocks[1](out)
        out = out + self.residual_conv(x)
        return out


class ConditionalUnet1D(nn.Module):

    def __init__(self,
                 input_dim,
                 global_cond_dim=None,
                 diffusion_step_embed_dim=256,
                 down_dims=[256, 512, 1024],
                 kernel_size=3,
                 n_groups=8,
                 cond_predict_scale=False
                 ):
        super().__init__()
        all_dims = [input_dim] + list(down_dims)
        start_dim = down_dims[0]

        dsed = diffusion_step_embed_dim
        diffusion_step_encoder = nn.Sequential(
            SinusoidalPosEmb(dsed),
            nn.Linear(dsed, dsed * 4),
            nn.Mish(),
            nn.Linear(dsed * 4, dsed),
        )
        cond_dim = dsed
        if global_cond_dim is not None:
            cond_dim += global_cond_dim

        in_out = list(zip(all_dims[:-1], all_dims[1:]))

        mid_dim = all_dims[-1]
        self.mid_modules = nn.ModuleList([
            ConditionalResidualBlock1D(
                mid_dim, mid_dim, cond_dim=cond_dim,
                kernel_size=kernel_size, n_groups=n_groups,
                cond_predict_scale=cond_predict_scale
            ),
            ConditionalResidualBlock1D(
                mid_dim, mid_dim, cond_dim=cond_dim,
                kernel_size=kernel_size, n_groups=n_groups,
                cond_predict_scale=cond_predict_scale
            ),
        ])

        down_modules = nn.ModuleList([])
        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= (len(in_out) - 1)
            down_modules.append(nn.ModuleList([
                ConditionalResidualBlock1D(
                    dim_in, dim_out, cond_dim=cond_dim,
                    kernel_size=kernel_size, n_groups=n_groups,
                    cond_predict_scale=cond_predict_scale),
                ConditionalResidualBlock1D(
                    dim_out, dim_out, cond_dim=cond_dim,
                    kernel_size=kernel_size, n_groups=n_groups,
                    cond_predict_scale=cond_predict_scale),
                Downsample1d(dim_out) if not is_last else nn.Identity()
            ]))

        up_modules = nn.ModuleList([])
        for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
            is_last = ind >= (len(in_out) - 1)
            up_modules.append(nn.ModuleList([
                ConditionalResidualBlock1D(
                    dim_out * 2, dim_in, cond_dim=cond_dim,
                    kernel_size=kernel_size, n_groups=n_groups,
                    cond_predict_scale=cond_predict_scale),
                ConditionalResidualBlock1D(
                    dim_in, dim_in, cond_dim=cond_dim,
                    kernel_size=kernel_size, n_groups=n_groups,
                    cond_predict_scale=cond_predict_scale),
                Upsample1d(dim_in) if not is_last else nn.Identity()
            ]))

        final_conv = nn.Sequential(
            Conv1dBlock(start_dim, start_dim, kernel_size=kernel_size),
            nn.Conv1d(start_dim, input_dim, 1),
        )

        self.diffusion_step_encoder = diffusion_step_encoder
        self.up_modules = up_modules
        self.down_modules = down_modules
        self.final_conv = final_conv

    def forward(self,
                sample: torch.Tensor,
                timestep: Union[torch.Tensor, float, int],
                global_cond=None,
                **kwargs):
        """
        sample: (B, T, input_dim)
        timestep: (B,) or int, diffusion step
        global_cond: (B, global_cond_dim)
        output: (B, T, input_dim)
        """
        sample = einops.rearrange(sample, 'b h t -> b t h')

        # 1. time
        timesteps = timestep
        if not torch.is_tensor(timesteps):
            # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
            timesteps = torch.tensor([timesteps], dtype=torch.long, device=sample.device)
        elif torch.is_tensor(timesteps) and len(timesteps.shape) == 0:
            timesteps = timesteps[None].to(sample.device)
        # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
        timesteps = timesteps.expand(sample.shape[0])

        global_feature = self.diffusion_step_encoder(timesteps)

        if global_cond is not None:
            global_feature = torch.cat([
                global_feature, global_cond
            ], axis=-1)

        x = sample
        h = []
        for idx, (resnet, resnet2, downsample) in enumerate(self.down_modules):
            x = resnet(x, global_feature)
            x = resnet2(x, global_feature)
            h.append(x)
            x = downsample(x)

        for mid_module in self.mid_modules:
            x = mid_module(x, global_feature)

        for idx, (resnet, resnet2, upsample) in enumerate(self.up_modules):
            x = torch.cat((x, h.pop()), dim=1)
            x = resnet(x, global_feature)
            x = resnet2(x, global_feature)
            x = upsample(x)

        x = self.final_conv(x)

        x = einops.rearrange(x, 'b t h -> b h t')
        return x
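For orientation, here is a minimal sketch of the denoising loop this head is designed for. The `DDPMScheduler` wiring and all dimensions are illustrative assumptions (not code from this commit), and it assumes the module above is importable:

```python
import torch
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler

# Hypothetical shapes: 16-dim actions, horizon 8, 512-dim observation feature.
net = ConditionalUnet1D(input_dim=16, global_cond_dim=512)
scheduler = DDPMScheduler(num_train_timesteps=100)
scheduler.set_timesteps(10)

obs_cond = torch.randn(2, 512)   # (B, global_cond_dim)
action = torch.randn(2, 8, 16)   # start from pure noise: (B, T, input_dim)
for t in scheduler.timesteps:
    # the head predicts the noise at step t, conditioned on the observation
    noise_pred = net(action, t, global_cond=obs_cond)
    action = scheduler.step(noise_pred, t, action).prev_sample
```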
146 roboimi/vla/models/heads/gr00t_dit1d.py (Normal file)
@@ -0,0 +1,146 @@
import torch
import torch.nn as nn
from types import SimpleNamespace
from typing import Optional, Union
from pathlib import Path
import importlib.util


def _load_gr00t_dit():
    repo_root = Path(__file__).resolve().parents[4]
    dit_path = repo_root / "gr00t" / "models" / "dit.py"
    spec = importlib.util.spec_from_file_location("gr00t_dit_standalone", dit_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Unable to load DiT from {dit_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.DiT


DiT = _load_gr00t_dit()


class Gr00tDiT1D(nn.Module):
    """
    Adapter that wraps gr00t DiT with the same call signature used by VLA heads.

    Expected forward interface:
    - sample: (B, T_action, input_dim)
    - timestep: (B,) or scalar diffusion timestep
    - cond: (B, T_obs, cond_dim)
    """

    def __init__(
        self,
        input_dim: int,
        output_dim: int,
        horizon: int,
        n_obs_steps: int,
        cond_dim: int,
        n_layer: int = 8,
        n_head: int = 8,
        n_emb: int = 256,
        hidden_dim: int = 256,
        mlp_ratio: int = 4,
        dropout: float = 0.1,
        add_action_pos_emb: bool = True,
        add_cond_pos_emb: bool = True,
    ):
        super().__init__()
        if cond_dim <= 0:
            raise ValueError("Gr00tDiT1D requires cond_dim > 0.")

        self.horizon = horizon
        self.n_obs_steps = n_obs_steps

        self.input_proj = nn.Linear(input_dim, n_emb)
        self.cond_proj = nn.Linear(cond_dim, n_emb)
        self.output_proj = nn.Linear(hidden_dim, output_dim)

        self.action_pos_emb = (
            nn.Parameter(torch.zeros(1, horizon, n_emb))
            if add_action_pos_emb
            else None
        )
        self.cond_pos_emb = (
            nn.Parameter(torch.zeros(1, n_obs_steps, n_emb))
            if add_cond_pos_emb
            else None
        )

        args = SimpleNamespace(
            embed_dim=n_emb,
            nheads=n_head,
            mlp_ratio=mlp_ratio,
            dropout=dropout,
            num_layers=n_layer,
            hidden_dim=hidden_dim,
        )
        self.dit = DiT(args, cross_attention_dim=n_emb)

        self._init_weights()

    def _init_weights(self):
        if self.action_pos_emb is not None:
            nn.init.normal_(self.action_pos_emb, mean=0.0, std=0.02)
        if self.cond_pos_emb is not None:
            nn.init.normal_(self.cond_pos_emb, mean=0.0, std=0.02)

    def _normalize_timesteps(
        self,
        timestep: Union[torch.Tensor, float, int],
        batch_size: int,
        device: torch.device,
    ) -> torch.Tensor:
        if not torch.is_tensor(timestep):
            timesteps = torch.tensor([timestep], device=device)
        else:
            timesteps = timestep.to(device)

        if timesteps.ndim == 0:
            timesteps = timesteps[None]
        if timesteps.shape[0] != batch_size:
            timesteps = timesteps.expand(batch_size)

        return timesteps.long()

    def forward(
        self,
        sample: torch.Tensor,
        timestep: Union[torch.Tensor, float, int],
        cond: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        if cond is None:
            raise ValueError("`cond` is required for Gr00tDiT1D forward.")

        bsz, t_act, _ = sample.shape
        if t_act > self.horizon:
            raise ValueError(
                f"sample length {t_act} exceeds configured horizon {self.horizon}"
            )

        hidden_states = self.input_proj(sample)
        if self.action_pos_emb is not None:
            hidden_states = hidden_states + self.action_pos_emb[:, :t_act, :]

        encoder_hidden_states = self.cond_proj(cond)
        if self.cond_pos_emb is not None:
            t_obs = encoder_hidden_states.shape[1]
            if t_obs > self.n_obs_steps:
                raise ValueError(
                    f"cond length {t_obs} exceeds configured n_obs_steps {self.n_obs_steps}"
                )
            encoder_hidden_states = (
                encoder_hidden_states + self.cond_pos_emb[:, :t_obs, :]
            )

        timesteps = self._normalize_timesteps(
            timestep, batch_size=bsz, device=sample.device
        )
        dit_output = self.dit(
            hidden_states=hidden_states,
            timestep=timesteps,
            encoder_hidden_states=encoder_hidden_states,
        )
        return self.output_proj(dit_output)
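A quick sketch of the adapter's expected call pattern, matching the docstring above. All dimensions are illustrative assumptions, and the sketch only runs if `gr00t/models/dit.py` actually exists at the repo root so `_load_gr00t_dit` succeeds:

```python
import torch

# Illustrative dimensions only; not the project's configured values.
head = Gr00tDiT1D(
    input_dim=16, output_dim=16,
    horizon=8, n_obs_steps=2, cond_dim=512,
)
sample = torch.randn(4, 8, 16)   # (B, T_action, input_dim)
cond = torch.randn(4, 2, 512)    # (B, T_obs, cond_dim)
noise_pred = head(sample, timestep=torch.randint(0, 100, (4,)), cond=cond)
# noise_pred: (B, T_action, output_dim)
```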
337 roboimi/vla/models/heads/transformer1d.py (Normal file)
@@ -0,0 +1,337 @@
"""Transformer-based diffusion head aligned with diffusion_policy's TransformerForDiffusion."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import math
|
||||
from typing import Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ModuleAttrMixin(nn.Module):
|
||||
"""Minimal local copy of diffusion_policy's ModuleAttrMixin for state-dict parity."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
self._dummy_variable = nn.Parameter()
|
||||
|
||||
@property
|
||||
def device(self):
|
||||
return next(iter(self.parameters())).device
|
||||
|
||||
@property
|
||||
def dtype(self):
|
||||
return next(iter(self.parameters())).dtype
|
||||
|
||||
|
||||
class SinusoidalPosEmb(nn.Module):
|
||||
def __init__(self, dim: int) -> None:
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
||||
device = x.device
|
||||
half_dim = self.dim // 2
|
||||
emb = math.log(10000) / (half_dim - 1)
|
||||
emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
|
||||
emb = x[:, None] * emb[None, :]
|
||||
emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
|
||||
return emb
|
||||
|
||||
|
||||
class Transformer1D(ModuleAttrMixin):
|
||||
def __init__(
|
||||
self,
|
||||
input_dim: int,
|
||||
output_dim: int,
|
||||
horizon: int,
|
||||
n_obs_steps: Optional[int] = None,
|
||||
cond_dim: int = 0,
|
||||
n_layer: int = 8,
|
||||
n_head: int = 8,
|
||||
n_emb: int = 256,
|
||||
p_drop_emb: float = 0.1,
|
||||
p_drop_attn: float = 0.1,
|
||||
causal_attn: bool = False,
|
||||
time_as_cond: bool = True,
|
||||
obs_as_cond: bool = False,
|
||||
n_cond_layers: int = 0,
|
||||
) -> None:
|
||||
super().__init__()
|
||||
|
||||
if n_obs_steps is None:
|
||||
n_obs_steps = horizon
|
||||
|
||||
T = horizon
|
||||
T_cond = 1
|
||||
if not time_as_cond:
|
||||
T += 1
|
||||
T_cond -= 1
|
||||
obs_as_cond = cond_dim > 0
|
||||
if obs_as_cond:
|
||||
assert time_as_cond
|
||||
T_cond += n_obs_steps
|
||||
|
||||
self.input_emb = nn.Linear(input_dim, n_emb)
|
||||
self.pos_emb = nn.Parameter(torch.zeros(1, T, n_emb))
|
||||
self.drop = nn.Dropout(p_drop_emb)
|
||||
|
||||
self.time_emb = SinusoidalPosEmb(n_emb)
|
||||
self.cond_obs_emb = None
|
||||
if obs_as_cond:
|
||||
self.cond_obs_emb = nn.Linear(cond_dim, n_emb)
|
||||
|
||||
self.cond_pos_emb = None
|
||||
self.encoder = None
|
||||
self.decoder = None
|
||||
encoder_only = False
|
||||
|
||||
if T_cond > 0:
|
||||
self.cond_pos_emb = nn.Parameter(torch.zeros(1, T_cond, n_emb))
|
||||
if n_cond_layers > 0:
|
||||
encoder_layer = nn.TransformerEncoderLayer(
|
||||
d_model=n_emb,
|
||||
nhead=n_head,
|
||||
dim_feedforward=4 * n_emb,
|
||||
dropout=p_drop_attn,
|
||||
activation='gelu',
|
||||
batch_first=True,
|
||||
norm_first=True,
|
||||
)
|
||||
self.encoder = nn.TransformerEncoder(
|
||||
encoder_layer=encoder_layer,
|
||||
num_layers=n_cond_layers,
|
||||
)
|
||||
else:
|
||||
self.encoder = nn.Sequential(
|
||||
nn.Linear(n_emb, 4 * n_emb),
|
||||
nn.Mish(),
|
||||
nn.Linear(4 * n_emb, n_emb),
|
||||
)
|
||||
|
||||
decoder_layer = nn.TransformerDecoderLayer(
|
||||
d_model=n_emb,
|
||||
nhead=n_head,
|
||||
dim_feedforward=4 * n_emb,
|
||||
dropout=p_drop_attn,
|
||||
activation='gelu',
|
||||
batch_first=True,
|
||||
norm_first=True,
|
||||
)
|
||||
self.decoder = nn.TransformerDecoder(
|
||||
decoder_layer=decoder_layer,
|
||||
num_layers=n_layer,
|
||||
)
|
||||
else:
|
||||
encoder_only = True
|
||||
encoder_layer = nn.TransformerEncoderLayer(
|
||||
d_model=n_emb,
|
||||
nhead=n_head,
|
||||
dim_feedforward=4 * n_emb,
|
||||
dropout=p_drop_attn,
|
||||
activation='gelu',
|
||||
batch_first=True,
|
||||
norm_first=True,
|
||||
)
|
||||
self.encoder = nn.TransformerEncoder(
|
||||
encoder_layer=encoder_layer,
|
||||
num_layers=n_layer,
|
||||
)
|
||||
|
||||
if causal_attn:
|
||||
sz = T
|
||||
mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
|
||||
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
|
||||
self.register_buffer('mask', mask)
|
||||
|
||||
if time_as_cond and obs_as_cond:
|
||||
S = T_cond
|
||||
t, s = torch.meshgrid(torch.arange(T), torch.arange(S), indexing='ij')
|
||||
mask = t >= (s - 1)
|
||||
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
|
||||
self.register_buffer('memory_mask', mask)
|
||||
else:
|
||||
self.memory_mask = None
|
||||
else:
|
||||
self.mask = None
|
||||
self.memory_mask = None
|
||||
|
||||
self.ln_f = nn.LayerNorm(n_emb)
|
||||
self.head = nn.Linear(n_emb, output_dim)
|
||||
|
||||
self.T = T
|
||||
self.T_cond = T_cond
|
||||
self.horizon = horizon
|
||||
self.time_as_cond = time_as_cond
|
||||
self.obs_as_cond = obs_as_cond
|
||||
self.encoder_only = encoder_only
|
||||
|
||||
self.apply(self._init_weights)
|
||||
logger.info('number of parameters: %e', sum(p.numel() for p in self.parameters()))
|
||||
|
||||
def _init_weights(self, module):
|
||||
ignore_types = (
|
||||
nn.Dropout,
|
||||
SinusoidalPosEmb,
|
||||
nn.TransformerEncoderLayer,
|
||||
nn.TransformerDecoderLayer,
|
||||
nn.TransformerEncoder,
|
||||
nn.TransformerDecoder,
|
||||
nn.ModuleList,
|
||||
nn.Mish,
|
||||
nn.Sequential,
|
||||
)
|
||||
if isinstance(module, (nn.Linear, nn.Embedding)):
|
||||
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
|
||||
if isinstance(module, nn.Linear) and module.bias is not None:
|
||||
torch.nn.init.zeros_(module.bias)
|
||||
elif isinstance(module, nn.MultiheadAttention):
|
||||
for name in ('in_proj_weight', 'q_proj_weight', 'k_proj_weight', 'v_proj_weight'):
|
||||
weight = getattr(module, name)
|
||||
if weight is not None:
|
||||
torch.nn.init.normal_(weight, mean=0.0, std=0.02)
|
||||
|
||||
for name in ('in_proj_bias', 'bias_k', 'bias_v'):
|
||||
bias = getattr(module, name)
|
||||
if bias is not None:
|
||||
torch.nn.init.zeros_(bias)
|
||||
elif isinstance(module, nn.LayerNorm):
|
||||
torch.nn.init.zeros_(module.bias)
|
||||
torch.nn.init.ones_(module.weight)
|
||||
elif isinstance(module, Transformer1D):
|
||||
torch.nn.init.normal_(module.pos_emb, mean=0.0, std=0.02)
|
||||
if module.cond_obs_emb is not None:
|
||||
torch.nn.init.normal_(module.cond_pos_emb, mean=0.0, std=0.02)
|
||||
elif isinstance(module, ignore_types):
|
||||
pass
|
||||
else:
|
||||
raise RuntimeError(f'Unaccounted module {module}')
|
||||
|
||||
def get_optim_groups(self, weight_decay: float = 1e-3):
|
||||
decay = set()
|
||||
no_decay = set()
|
||||
whitelist_weight_modules = (torch.nn.Linear, torch.nn.MultiheadAttention)
|
||||
blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
|
||||
|
||||
for module_name, module in self.named_modules():
|
||||
for param_name, _ in module.named_parameters():
|
||||
full_param_name = f'{module_name}.{param_name}' if module_name else param_name
|
||||
|
||||
if param_name.endswith('bias'):
|
||||
no_decay.add(full_param_name)
|
||||
elif param_name.startswith('bias'):
|
||||
no_decay.add(full_param_name)
|
||||
elif param_name.endswith('weight') and isinstance(module, whitelist_weight_modules):
|
||||
decay.add(full_param_name)
|
||||
elif param_name.endswith('weight') and isinstance(module, blacklist_weight_modules):
|
||||
no_decay.add(full_param_name)
|
||||
|
||||
no_decay.add('pos_emb')
|
||||
no_decay.add('_dummy_variable')
|
||||
if self.cond_pos_emb is not None:
|
||||
no_decay.add('cond_pos_emb')
|
||||
|
||||
param_dict = {name: param for name, param in self.named_parameters()}
|
||||
inter_params = decay & no_decay
|
||||
union_params = decay | no_decay
|
||||
assert len(inter_params) == 0, f'parameters {inter_params} made it into both decay/no_decay sets!'
|
||||
assert len(param_dict.keys() - union_params) == 0, (
|
||||
f'parameters {param_dict.keys() - union_params} were not separated into either decay/no_decay sets!'
|
||||
)
|
||||
|
||||
return [
|
||||
{
|
||||
'params': [param_dict[name] for name in sorted(decay)],
|
||||
'weight_decay': weight_decay,
|
||||
},
|
||||
{
|
||||
'params': [param_dict[name] for name in sorted(no_decay)],
|
||||
'weight_decay': 0.0,
|
||||
},
|
||||
]
|
||||
|
||||
def configure_optimizers(
|
||||
self,
|
||||
learning_rate: float = 1e-4,
|
||||
weight_decay: float = 1e-3,
|
||||
betas: Tuple[float, float] = (0.9, 0.95),
|
||||
):
|
||||
optim_groups = self.get_optim_groups(weight_decay=weight_decay)
|
||||
return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
sample: torch.Tensor,
|
||||
timestep: Union[torch.Tensor, float, int],
|
||||
cond: Optional[torch.Tensor] = None,
|
||||
**kwargs,
|
||||
):
|
||||
timesteps = timestep
|
||||
if not torch.is_tensor(timesteps):
|
||||
timesteps = torch.tensor([timesteps], dtype=torch.long, device=sample.device)
|
||||
elif torch.is_tensor(timesteps) and len(timesteps.shape) == 0:
|
||||
timesteps = timesteps[None].to(sample.device)
|
||||
timesteps = timesteps.expand(sample.shape[0])
|
||||
time_emb = self.time_emb(timesteps).unsqueeze(1)
|
||||
|
||||
input_emb = self.input_emb(sample)
|
||||
|
||||
if self.encoder_only:
|
||||
token_embeddings = torch.cat([time_emb, input_emb], dim=1)
|
||||
t = token_embeddings.shape[1]
|
||||
position_embeddings = self.pos_emb[:, :t, :]
|
||||
x = self.drop(token_embeddings + position_embeddings)
|
||||
x = self.encoder(src=x, mask=self.mask)
|
||||
x = x[:, 1:, :]
|
||||
else:
|
||||
cond_embeddings = time_emb
|
||||
if self.obs_as_cond:
|
||||
cond_obs_emb = self.cond_obs_emb(cond)
|
||||
cond_embeddings = torch.cat([cond_embeddings, cond_obs_emb], dim=1)
|
||||
tc = cond_embeddings.shape[1]
|
||||
position_embeddings = self.cond_pos_emb[:, :tc, :]
|
||||
x = self.drop(cond_embeddings + position_embeddings)
|
||||
memory = self.encoder(x)
|
||||
|
||||
token_embeddings = input_emb
|
||||
t = token_embeddings.shape[1]
|
||||
position_embeddings = self.pos_emb[:, :t, :]
|
||||
x = self.drop(token_embeddings + position_embeddings)
|
||||
x = self.decoder(
|
||||
tgt=x,
|
||||
memory=memory,
|
||||
tgt_mask=self.mask,
|
||||
memory_mask=self.memory_mask,
|
||||
)
|
||||
|
||||
x = self.ln_f(x)
|
||||
x = self.head(x)
|
||||
return x
|
||||
|
||||
|
||||
def create_transformer1d(
|
||||
input_dim: int,
|
||||
output_dim: int,
|
||||
horizon: int,
|
||||
n_obs_steps: int,
|
||||
cond_dim: int,
|
||||
n_layer: int = 8,
|
||||
n_head: int = 8,
|
||||
n_emb: int = 256,
|
||||
**kwargs,
|
||||
) -> Transformer1D:
|
||||
return Transformer1D(
|
||||
input_dim=input_dim,
|
||||
output_dim=output_dim,
|
||||
horizon=horizon,
|
||||
n_obs_steps=n_obs_steps,
|
||||
cond_dim=cond_dim,
|
||||
n_layer=n_layer,
|
||||
n_head=n_head,
|
||||
n_emb=n_emb,
|
||||
**kwargs,
|
||||
)
|
||||
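As a usage sketch, the factory mirrors diffusion_policy's `TransformerForDiffusion` call pattern: with `cond_dim > 0` the head runs in encoder-decoder mode, cross-attending the noisy action tokens to the time embedding plus per-step observation features. Shapes below are illustrative assumptions:

```python
import torch

head = create_transformer1d(
    input_dim=16, output_dim=16,
    horizon=8, n_obs_steps=2, cond_dim=512,
)
sample = torch.randn(4, 8, 16)  # noisy action sequence (B, T, input_dim)
cond = torch.randn(4, 2, 512)   # observation features (B, n_obs_steps, cond_dim)
noise_pred = head(sample, timestep=10, cond=cond)  # -> (B, T, output_dim)
```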
126 roboimi/vla/models/normalization.py (Normal file)
@@ -0,0 +1,126 @@
"""
|
||||
归一化模块 - 统一训练和推理的归一化逻辑
|
||||
|
||||
支持两种归一化方式:
|
||||
1. Gaussian (z-score): (x - mean) / std
|
||||
2. MinMax: 2 * (x - min) / (max - min) - 1 -> [-1, 1]
|
||||
"""
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from typing import Optional, Dict, Literal
|
||||
|
||||
|
||||
class NormalizationModule(nn.Module):
|
||||
"""
|
||||
统一的归一化模块
|
||||
用于在 Agent 内部对 qpos 和 action 进行归一化/反归一化
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
stats: Optional[Dict] = None,
|
||||
normalization_type: Literal['gaussian', 'min_max'] = None,
|
||||
):
|
||||
"""
|
||||
Args:
|
||||
stats: 数据集统计信息字典,格式:
|
||||
{
|
||||
'qpos_mean': [...],
|
||||
'qpos_std': [...],
|
||||
'qpos_min': [...], # 仅 min_max 需要
|
||||
'qpos_max': [...], # 仅 min_max 需要
|
||||
'action_mean': [...],
|
||||
'action_std': [...],
|
||||
'action_min': [...], # 仅 min_max 需要
|
||||
'action_max': [...], # 仅 min_max 需要
|
||||
}
|
||||
normalization_type: 归一化类型 ('gaussian' 或 'min_max')
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
self.normalization_type = normalization_type
|
||||
self.enabled = stats is not None
|
||||
|
||||
if self.enabled:
|
||||
if self.normalization_type == 'gaussian':
|
||||
self.register_buffer('qpos_mean', torch.tensor(stats['qpos_mean'], dtype=torch.float32))
|
||||
self.register_buffer('qpos_std', torch.tensor(stats['qpos_std'], dtype=torch.float32))
|
||||
self.register_buffer('action_mean', torch.tensor(stats['action_mean'], dtype=torch.float32))
|
||||
self.register_buffer('action_std', torch.tensor(stats['action_std'], dtype=torch.float32))
|
||||
|
||||
elif self.normalization_type == 'min_max':
|
||||
self.register_buffer('qpos_min', torch.tensor(stats['qpos_min'], dtype=torch.float32))
|
||||
self.register_buffer('qpos_max', torch.tensor(stats['qpos_max'], dtype=torch.float32))
|
||||
self.register_buffer('action_min', torch.tensor(stats['action_min'], dtype=torch.float32))
|
||||
self.register_buffer('action_max', torch.tensor(stats['action_max'], dtype=torch.float32))
|
||||
|
||||
def normalize_qpos(self, qpos: torch.Tensor) -> torch.Tensor:
|
||||
"""归一化 qpos"""
|
||||
if not self.enabled:
|
||||
return qpos
|
||||
|
||||
if self.normalization_type == 'gaussian':
|
||||
return (qpos - self.qpos_mean) / self.qpos_std
|
||||
elif self.normalization_type == 'min_max':
|
||||
return 2 * (qpos - self.qpos_min) / (self.qpos_max - self.qpos_min) - 1
|
||||
else:
|
||||
raise ValueError(f"Unknown normalization type: {self.normalization_type}")
|
||||
|
||||
def denormalize_qpos(self, qpos: torch.Tensor) -> torch.Tensor:
|
||||
"""反归一化 qpos"""
|
||||
if not self.enabled:
|
||||
return qpos
|
||||
|
||||
if self.normalization_type == 'gaussian':
|
||||
return qpos * self.qpos_std + self.qpos_mean
|
||||
elif self.normalization_type == 'min_max':
|
||||
return (qpos + 1) / 2 * (self.qpos_max - self.qpos_min) + self.qpos_min
|
||||
else:
|
||||
raise ValueError(f"Unknown normalization type: {self.normalization_type}")
|
||||
|
||||
def normalize_action(self, action: torch.Tensor) -> torch.Tensor:
|
||||
"""归一化 action"""
|
||||
if not self.enabled:
|
||||
return action
|
||||
|
||||
if self.normalization_type == 'gaussian':
|
||||
return (action - self.action_mean) / self.action_std
|
||||
elif self.normalization_type == 'min_max':
|
||||
return 2 * (action - self.action_min) / (self.action_max - self.action_min) - 1
|
||||
else:
|
||||
raise ValueError(f"Unknown normalization type: {self.normalization_type}")
|
||||
|
||||
def denormalize_action(self, action: torch.Tensor) -> torch.Tensor:
|
||||
"""反归一化 action"""
|
||||
if not self.enabled:
|
||||
return action
|
||||
|
||||
if self.normalization_type == 'gaussian':
|
||||
return action * self.action_std + self.action_mean
|
||||
elif self.normalization_type == 'min_max':
|
||||
return (action + 1) / 2 * (self.action_max - self.action_min) + self.action_min
|
||||
else:
|
||||
raise ValueError(f"Unknown normalization type: {self.normalization_type}")
|
||||
|
||||
def get_stats(self) -> Optional[Dict]:
|
||||
"""导出统计信息(用于保存到 checkpoint)"""
|
||||
if not self.enabled:
|
||||
return None
|
||||
|
||||
stats = {
|
||||
'normalization_type': self.normalization_type,
|
||||
}
|
||||
|
||||
if self.normalization_type == 'gaussian':
|
||||
stats['qpos_mean'] = self.qpos_mean.cpu().tolist()
|
||||
stats['qpos_std'] = self.qpos_std.cpu().tolist()
|
||||
stats['action_mean'] = self.action_mean.cpu().tolist()
|
||||
stats['action_std'] = self.action_std.cpu().tolist()
|
||||
elif self.normalization_type == 'min_max':
|
||||
stats['qpos_min'] = self.qpos_min.cpu().tolist()
|
||||
stats['qpos_max'] = self.qpos_max.cpu().tolist()
|
||||
stats['action_min'] = self.action_min.cpu().tolist()
|
||||
stats['action_max'] = self.action_max.cpu().tolist()
|
||||
|
||||
return stats
|
||||
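A round-trip sketch of the Gaussian scheme described in the module docstring (the stats values here are made up for illustration):

```python
import torch

stats = {
    'qpos_mean': [0.0, 0.5], 'qpos_std': [1.0, 2.0],
    'action_mean': [0.0, 0.0], 'action_std': [1.0, 1.0],
}
norm = NormalizationModule(stats=stats, normalization_type='gaussian')

qpos = torch.tensor([[1.0, 2.5]])
z = norm.normalize_qpos(qpos)  # (x - mean) / std -> [[1.0, 1.0]]
# denormalization inverts the transform exactly
assert torch.allclose(norm.denormalize_qpos(z), qpos)
```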
18 roboimi/vla/modules/encoders.py (Normal file)
@@ -0,0 +1,18 @@
from torch import nn


class IdentityStateEncoder(nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, state):
        return state


class IdentityActionEncoder(nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, action):
        return action
114 roboimi/vla/scripts/calculate_stats.py (Normal file)
@@ -0,0 +1,114 @@
import argparse
import glob
import os
import pickle
from pathlib import Path

import h5py
import numpy as np


DEFAULT_DATASET_DIR = str(
    Path(__file__).resolve().parents[2] / "demos" / "dataset" / "sim_transfer"
)


def get_data_stats(dataset_dir):
    """
    Compute the min, max, mean, and std of action and qpos.

    Outputs a flat structure (matching what NormalizationModule expects):
    {
        'action_mean': [...],
        'action_std': [...],
        'action_min': [...],
        'action_max': [...],
        'qpos_mean': [...],
        'qpos_std': [...],
        'qpos_min': [...],
        'qpos_max': [...],
    }
    """
    files = sorted(glob.glob(os.path.join(dataset_dir, 'episode_*.hdf5')))
    print(f"Found {len(files)} episodes in {dataset_dir}")

    if not files:
        raise ValueError(
            f"No episode_*.hdf5 files found in dataset_dir: {dataset_dir}"
        )

    all_actions = []
    all_qpos = []

    print("Reading data...")
    for file_path in files:
        with h5py.File(file_path, 'r') as f:
            action = f['action'][:]
            qpos = f['observations']['qpos'][:]
            all_actions.append(action)
            all_qpos.append(qpos)

    # Concatenate all episodes along the time axis
    all_actions = np.concatenate(all_actions, axis=0)
    all_qpos = np.concatenate(all_qpos, axis=0)

    print(f"Total steps: {all_actions.shape[0]}")

    # --- Core computation ---
    # Per-dimension statistics
    action_mean = np.mean(all_actions, axis=0)
    action_std = np.std(all_actions, axis=0)
    action_min = np.min(all_actions, axis=0)
    action_max = np.max(all_actions, axis=0)

    qpos_mean = np.mean(all_qpos, axis=0)
    qpos_std = np.std(all_qpos, axis=0)
    qpos_min = np.min(all_qpos, axis=0)
    qpos_max = np.max(all_qpos, axis=0)

    # Clamp the standard deviation (avoid division by zero)
    eps = 1e-8
    action_std_corrected = np.where(action_std < eps, eps, action_std)
    qpos_std_corrected = np.where(qpos_std < eps, eps, qpos_std)

    # Assemble the flat structure expected by NormalizationModule
    stats_flat = {
        'action_mean': action_mean,
        'action_std': action_std_corrected,
        'action_min': action_min,
        'action_max': action_max,
        'qpos_mean': qpos_mean,
        'qpos_std': qpos_std_corrected,
        'qpos_min': qpos_min,
        'qpos_max': qpos_max
    }
    return stats_flat


def write_dataset_stats(dataset_dir):
    output_path = os.path.join(dataset_dir, "dataset_stats.pkl")
    stats_flat = get_data_stats(dataset_dir)

    # Sanity-check printout
    print("\n--- Stats Computed ---")
    print(f"Action Mean shape: {stats_flat['action_mean'].shape}")
    print(f"Action Std shape: {stats_flat['action_std'].shape}")

    with open(output_path, 'wb') as f:
        pickle.dump(stats_flat, f)
    print(f"\nStats saved to {output_path}")

    return output_path


def main(argv=None):
    parser = argparse.ArgumentParser(description="Calculate dataset statistics.")
    parser.add_argument(
        "--dataset_dir",
        default=DEFAULT_DATASET_DIR,
        help="Directory containing episode_*.hdf5 files.",
    )
    args = parser.parse_args(argv)
    write_dataset_stats(args.dataset_dir)


if __name__ == "__main__":
    main()
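The script is meant to be run once per dataset before training. A sketch of the intended workflow, feeding the saved stats into `NormalizationModule` (the dataset path is an example, and the import path for the normalization module is assumed from this commit's file layout):

```python
import pickle

from roboimi.vla.scripts.calculate_stats import write_dataset_stats
from roboimi.vla.models.normalization import NormalizationModule  # assumed import path

# Any directory containing episode_*.hdf5 files works.
stats_path = write_dataset_stats("demos/dataset/sim_transfer")
with open(stats_path, "rb") as f:
    stats = pickle.load(f)

norm = NormalizationModule(stats=stats, normalization_type="min_max")
```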
1 tests/__init__.py (Normal file)
@@ -0,0 +1 @@

88 tests/test_calculate_stats_cli.py (Normal file)
@@ -0,0 +1,88 @@
import pickle
import tempfile
import unittest
from pathlib import Path

import h5py
import numpy as np

from roboimi.vla.scripts import calculate_stats


class CalculateStatsCliTest(unittest.TestCase):
    def test_default_dataset_dir_is_absolute_and_package_relative(self):
        expected = (
            Path(calculate_stats.__file__).resolve().parents[2]
            / "demos"
            / "dataset"
            / "sim_transfer"
        )

        self.assertEqual(Path(calculate_stats.DEFAULT_DATASET_DIR), expected)
        self.assertTrue(Path(calculate_stats.DEFAULT_DATASET_DIR).is_absolute())

    def test_main_writes_dataset_stats_pkl_to_dataset_dir(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            dataset_dir = Path(tmpdir)
            episode_path = dataset_dir / "episode_0.hdf5"

            with h5py.File(episode_path, "w") as root:
                root.create_dataset(
                    "action",
                    data=np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32),
                )
                observations = root.create_group("observations")
                observations.create_dataset(
                    "qpos",
                    data=np.array([[5.0, 6.0], [7.0, 8.0]], dtype=np.float32),
                )

            calculate_stats.main(["--dataset_dir", str(dataset_dir)])

            stats_path = dataset_dir / "dataset_stats.pkl"
            self.assertTrue(stats_path.exists())

            with stats_path.open("rb") as f:
                stats = pickle.load(f)

            self.assertEqual(
                set(stats),
                {
                    "action_mean",
                    "action_std",
                    "action_min",
                    "action_max",
                    "qpos_mean",
                    "qpos_std",
                    "qpos_min",
                    "qpos_max",
                },
            )
            np.testing.assert_allclose(stats["action_mean"], np.array([2.0, 3.0]))
            np.testing.assert_allclose(stats["qpos_mean"], np.array([6.0, 7.0]))

    def test_main_raises_clear_error_for_empty_dataset_dir(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            dataset_dir = Path(tmpdir)

            with self.assertRaisesRegex(
                ValueError, r"No episode_\*\.hdf5 files found"
            ) as ctx:
                calculate_stats.main(["--dataset_dir", str(dataset_dir)])

            self.assertIn(str(dataset_dir), str(ctx.exception))

    def test_main_raises_clear_error_for_missing_dataset_dir(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            dataset_dir = Path(tmpdir) / "missing"

            with self.assertRaisesRegex(
                ValueError, r"No episode_\*\.hdf5 files found"
            ) as ctx:
                calculate_stats.main(["--dataset_dir", str(dataset_dir)])

            self.assertIn(str(dataset_dir), str(ctx.exception))


if __name__ == "__main__":
    unittest.main()
28 tests/test_eval_vla_execution.py (Normal file)
@@ -0,0 +1,28 @@
import unittest

from roboimi.vla.eval_utils import execute_policy_action


class _FakeEnv:
    def __init__(self):
        self.calls = []

    def step(self, action):
        self.calls.append(("step", action))

    def step_jnt(self, action):
        self.calls.append(("step_jnt", action))


class EvalVLAExecutionTest(unittest.TestCase):
    def test_execute_policy_action_uses_ee_step(self):
        env = _FakeEnv()
        action = [1, 2, 3]

        execute_policy_action(env, action)

        self.assertEqual(env.calls, [("step", action)])


if __name__ == "__main__":
    unittest.main()
259 tests/test_eval_vla_headless.py (Normal file)
@@ -0,0 +1,259 @@
import unittest
from pathlib import Path
from unittest import mock

import numpy as np
import torch
from omegaconf import OmegaConf

from roboimi.demos.vla_scripts import eval_vla
from roboimi.envs.double_base import DualDianaMed
from roboimi.envs.double_pos_ctrl_env import make_sim_env


class _FakeAgent:
    def __init__(self):
        self.reset_calls = 0
        self.last_observation = None

    def eval(self):
        return self

    def to(self, _device):
        return self

    def reset(self):
        self.reset_calls += 1

    def select_action(self, observation):
        self.last_observation = observation
        return torch.zeros(16)


class _FakeEnv:
    def __init__(self):
        self.image_obs_calls = 0
        self.render_calls = 0
        self.reset_calls = []

    def reset(self, box_pos):
        self.reset_calls.append(np.array(box_pos))

    def _get_image_obs(self):
        self.image_obs_calls += 1
        return {
            "images": {
                "front": np.zeros((8, 8, 3), dtype=np.uint8),
            }
        }

    def _get_qpos_obs(self):
        return {"qpos": np.zeros(16, dtype=np.float32)}

    def render(self):
        self.render_calls += 1
        raise AssertionError("env.render() should be skipped when eval.headless=true")


class _RewardTrackingEnv(_FakeEnv):
    def __init__(self, reward_sequences):
        super().__init__()
        self.reward_sequences = reward_sequences
        self.episode_index = -1
        self.step_index = 0
        self.rew = 0.0

    def reset(self, box_pos):
        super().reset(box_pos)
        self.episode_index += 1
        self.step_index = 0


class _FakeRenderer:
    def __init__(self, env):
        self._env = env
        self._frames = [
            np.full((4, 4, 3), fill_value=index, dtype=np.uint8)
            for index in range(5)
        ]
        self._index = 0

    def update_scene(self, _mj_data, camera=None):
        self._camera = camera

    def render(self):
        frame = self._frames[self._index]
        self._index += 1
        if self._index >= len(self._frames):
            self._env.exit_flag = True
        return frame


class EvalVLAHeadlessTest(unittest.TestCase):
    def test_eval_config_exposes_headless_default(self):
        eval_cfg = OmegaConf.load(Path("roboimi/vla/conf/eval/eval.yaml"))

        self.assertIn("headless", eval_cfg)
        self.assertFalse(eval_cfg.headless)

    def test_make_sim_env_accepts_headless_and_disables_render(self):
        fake_env = object()

        with mock.patch(
            "roboimi.assets.robots.diana_med.BiDianaMed",
            return_value="robot",
        ), mock.patch(
            "roboimi.envs.double_pos_ctrl_env.DualDianaMed_Pos_Ctrl",
            return_value=fake_env,
        ) as env_cls:
            env = make_sim_env("sim_transfer", headless=True)

        self.assertIs(env, fake_env)
        env_cls.assert_called_once_with(
            robot="robot",
            is_render=False,
            control_freq=30,
            is_interpolate=True,
            cam_view="angle",
        )

    def test_camera_viewer_headless_updates_images_without_gui_calls(self):
        env = DualDianaMed.__new__(DualDianaMed)
        env.mj_model = object()
        env.mj_data = object()
        env.exit_flag = False
        env.is_render = False
        env.cam = "angle"
        env.r_vis = None
        env.l_vis = None
        env.top = None
        env.angle = None
        env.front = None

        with mock.patch(
            "roboimi.envs.double_base.mj.Renderer",
            side_effect=lambda *args, **kwargs: _FakeRenderer(env),
        ), mock.patch("roboimi.envs.double_base.cv2.namedWindow") as named_window, mock.patch(
            "roboimi.envs.double_base.cv2.imshow"
        ) as imshow, mock.patch("roboimi.envs.double_base.cv2.waitKey") as wait_key:
            env.camera_viewer()

        named_window.assert_not_called()
        imshow.assert_not_called()
        wait_key.assert_not_called()
        self.assertIsNotNone(env.r_vis)
        self.assertIsNotNone(env.l_vis)
        self.assertIsNotNone(env.top)
        self.assertIsNotNone(env.angle)
        self.assertIsNotNone(env.front)

    def test_eval_main_headless_skips_render_and_still_executes_policy(self):
        fake_env = _FakeEnv()
        fake_agent = _FakeAgent()
        cfg = OmegaConf.create(
            {
                "agent": {},
                "eval": {
                    "ckpt_path": "checkpoints/vla_model_best.pt",
                    "num_episodes": 1,
                    "max_timesteps": 1,
                    "device": "cpu",
                    "task_name": "sim_transfer",
                    "camera_names": ["front"],
                    "use_smoothing": False,
                    "smooth_alpha": 0.3,
                    "verbose_action": False,
                    "headless": True,
                },
            }
        )

        with mock.patch.object(
            eval_vla,
            "load_checkpoint",
            return_value=(fake_agent, None),
        ), mock.patch.object(
            eval_vla,
            "make_sim_env",
            return_value=fake_env,
        ) as make_env, mock.patch.object(
            eval_vla,
            "sample_transfer_pose",
            return_value=np.array([0.1, 0.2, 0.3]),
        ), mock.patch.object(
            eval_vla,
            "execute_policy_action",
        ) as execute_policy_action, mock.patch.object(
            eval_vla,
            "tqdm",
            side_effect=lambda iterable, **kwargs: iterable,
        ):
            eval_vla.main.__wrapped__(cfg)

        make_env.assert_called_once_with("sim_transfer", headless=True)
        execute_policy_action.assert_called_once()
        self.assertEqual(fake_env.image_obs_calls, 1)
        self.assertEqual(fake_env.render_calls, 0)
        self.assertIsNotNone(fake_agent.last_observation)
        self.assertIn("front", fake_agent.last_observation["images"])

    def test_run_eval_returns_average_reward_summary(self):
        reward_sequences = [
            [1.0, 2.0],
            [0.5, 4.0],
        ]
        fake_env = _RewardTrackingEnv(reward_sequences)
        fake_agent = _FakeAgent()
        cfg = OmegaConf.create(
            {
                "agent": {},
                "eval": {
                    "ckpt_path": "checkpoints/vla_model_best.pt",
                    "num_episodes": 2,
                    "max_timesteps": 2,
                    "device": "cpu",
                    "task_name": "sim_transfer",
                    "camera_names": ["front"],
                    "use_smoothing": False,
                    "smooth_alpha": 0.3,
                    "verbose_action": False,
                    "headless": True,
                },
            }
        )

        def fake_execute_policy_action(env, action):
            del action
            env.rew = env.reward_sequences[env.episode_index][env.step_index]
            env.step_index += 1

        with mock.patch.object(
            eval_vla,
            "load_checkpoint",
            return_value=(fake_agent, None),
        ), mock.patch.object(
            eval_vla,
            "make_sim_env",
            return_value=fake_env,
        ), mock.patch.object(
            eval_vla,
            "sample_transfer_pose",
            return_value=np.array([0.1, 0.2, 0.3]),
        ), mock.patch.object(
            eval_vla,
            "execute_policy_action",
            side_effect=fake_execute_policy_action,
        ), mock.patch.object(
            eval_vla,
            "tqdm",
            side_effect=lambda iterable, **kwargs: iterable,
        ):
            summary = eval_vla._run_eval(cfg)

        self.assertEqual(summary["episode_rewards"], [3.0, 4.5])
        self.assertAlmostEqual(summary["avg_reward"], 3.75)
        self.assertEqual(summary["num_episodes"], 2)


if __name__ == "__main__":
    unittest.main()
228 tests/test_eval_vla_rollout_artifacts.py (Normal file)
@@ -0,0 +1,228 @@
import json
import tempfile
import unittest
from pathlib import Path
from unittest import mock

import numpy as np
import torch
from omegaconf import OmegaConf

from roboimi.demos.vla_scripts import eval_vla


class _FakeAgent:
    def __init__(self, actions):
        self._actions = [torch.tensor(action, dtype=torch.float32) for action in actions]
        self.reset_calls = 0

    def eval(self):
        return self

    def to(self, _device):
        return self

    def reset(self):
        self.reset_calls += 1

    def select_action(self, observation):
        del observation
        return self._actions.pop(0)


class _FakeEnv:
    def __init__(self):
        self.step_count = 0
        self.rew = 0.0
        self.render_calls = 0
        self.reset_calls = []

    def reset(self, box_pos):
        self.reset_calls.append(np.array(box_pos, copy=True))
        self.step_count = 0
        self.rew = 0.0

    def _get_image_obs(self):
        frame_value = self.step_count
        front = np.full((6, 8, 3), fill_value=frame_value, dtype=np.uint8)
        top = np.full((6, 8, 3), fill_value=frame_value + 20, dtype=np.uint8)
        return {"images": {"front": front, "top": top}}

    def _get_qpos_obs(self):
        return {"qpos": np.arange(16, dtype=np.float32)}

    def step(self, action):
        del action
        self.step_count += 1
        self.rew = float(self.step_count)

    def render(self):
        self.render_calls += 1

    def getBodyPos(self, name):
        base = float(self.step_count)
        if name == 'eef_left':
            return np.array([base, base + 0.1, base + 0.2], dtype=np.float32)
        if name == 'eef_right':
            return np.array([base + 1.0, base + 1.1, base + 1.2], dtype=np.float32)
        raise KeyError(name)

    def getBodyQuat(self, name):
        base = float(self.step_count)
        if name == 'eef_left':
            return np.array([1.0, base, 0.0, 0.0], dtype=np.float32)
        if name == 'eef_right':
            return np.array([1.0, 0.0, base, 0.0], dtype=np.float32)
        raise KeyError(name)


class _FakeVideoWriter:
    def __init__(self, output_path):
        self.output_path = Path(output_path)
        self.output_path.parent.mkdir(parents=True, exist_ok=True)
        self.output_path.write_bytes(b'')
        self.frames = []
        self.released = False

    def isOpened(self):
        return True

    def write(self, frame):
        self.frames.append(np.array(frame, copy=True))

    def release(self):
        self.released = True
        self.output_path.write_bytes(b'fake-mp4')


class EvalVLARolloutArtifactsTest(unittest.TestCase):
    def test_eval_config_exposes_rollout_artifact_defaults(self):
        eval_cfg = OmegaConf.load(Path('roboimi/vla/conf/eval/eval.yaml'))

        self.assertIn('artifact_dir', eval_cfg)
        self.assertFalse(eval_cfg.save_summary_json)
        self.assertFalse(eval_cfg.save_trajectory_npz)
        self.assertFalse(eval_cfg.record_video)
        self.assertIsNone(eval_cfg.artifact_dir)
        self.assertIsNone(eval_cfg.video_camera_name)
        self.assertEqual(eval_cfg.video_fps, 30)

    def test_run_eval_exports_npz_summary_and_video_artifacts(self):
        actions = [
            np.arange(16, dtype=np.float32),
            np.arange(16, dtype=np.float32) + 10.0,
        ]
        fake_agent = _FakeAgent(actions)
        fake_env = _FakeEnv()

        with tempfile.TemporaryDirectory() as tmpdir:
            cfg = OmegaConf.create(
                {
                    'agent': {},
                    'eval': {
                        'ckpt_path': 'checkpoints/vla_model_best.pt',
                        'num_episodes': 1,
                        'max_timesteps': 2,
                        'device': 'cpu',
                        'task_name': 'sim_transfer',
                        'camera_names': ['front', 'top'],
                        'use_smoothing': True,
                        'smooth_alpha': 0.5,
                        'verbose_action': False,
                        'headless': True,
                        'artifact_dir': tmpdir,
                        'save_summary_json': True,
                        'save_trajectory_npz': True,
                        'record_video': True,
                        'video_camera_name': 'front',
                        'video_fps': 12,
                    },
                }
            )

            writer_holder = {}

            def fake_open_video_writer(output_path, frame_size, fps):
                self.assertEqual(frame_size, (8, 6))
                self.assertEqual(fps, 12)
                writer = _FakeVideoWriter(output_path)
                writer_holder['writer'] = writer
                return writer

            with mock.patch.object(
                eval_vla,
                'load_checkpoint',
                return_value=(fake_agent, None),
            ), mock.patch.object(
                eval_vla,
                'make_sim_env',
                return_value=fake_env,
            ), mock.patch.object(
                eval_vla,
                'sample_transfer_pose',
                return_value=np.array([0.1, 0.2, 0.3], dtype=np.float32),
            ), mock.patch.object(
                eval_vla,
                'tqdm',
                side_effect=lambda iterable, **kwargs: iterable,
            ), mock.patch.object(
                eval_vla,
                '_open_video_writer',
                side_effect=fake_open_video_writer,
            ):
                summary = eval_vla._run_eval(cfg)

            artifacts = summary['artifacts']
            trajectory_path = Path(artifacts['trajectory_npz'])
            summary_path = Path(artifacts['summary_json'])
            video_path = Path(artifacts['video_mp4'])

            self.assertEqual(Path(artifacts['output_dir']), Path(tmpdir))
            self.assertEqual(artifacts['video_camera_name'], 'front')
            self.assertTrue(trajectory_path.exists())
            self.assertTrue(summary_path.exists())
            self.assertTrue(video_path.exists())

            rollout_npz = np.load(trajectory_path)
            np.testing.assert_array_equal(rollout_npz['episode_index'], np.array([0, 0]))
            np.testing.assert_array_equal(rollout_npz['timestep'], np.array([0, 1]))
            np.testing.assert_array_equal(rollout_npz['reward'], np.array([1.0, 2.0], dtype=np.float32))
            np.testing.assert_array_equal(rollout_npz['raw_predicted_ee_action'][0], actions[0])
            np.testing.assert_array_equal(rollout_npz['raw_predicted_ee_action'][1], actions[1])
            np.testing.assert_array_equal(rollout_npz['executed_ee_action'][0], actions[0])
            np.testing.assert_array_equal(
                rollout_npz['executed_ee_action'][1],
                (actions[0] + actions[1]) / 2.0,
            )
            np.testing.assert_array_equal(
                rollout_npz['left_ee_pos'],
                np.array([[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]], dtype=np.float32),
            )
            np.testing.assert_array_equal(
                rollout_npz['right_ee_pos'],
                np.array([[2.0, 2.1, 2.2], [3.0, 3.1, 3.2]], dtype=np.float32),
            )
            self.assertEqual(rollout_npz['obs_read_time_ms'].shape, (2,))
            self.assertEqual(rollout_npz['preprocess_time_ms'].shape, (2,))
            self.assertEqual(rollout_npz['inference_time_ms'].shape, (2,))
            self.assertEqual(rollout_npz['env_step_time_ms'].shape, (2,))
            self.assertEqual(rollout_npz['total_time_ms'].shape, (2,))

            writer = writer_holder['writer']
            self.assertTrue(writer.released)
            self.assertEqual(len(writer.frames), 2)
            np.testing.assert_array_equal(writer.frames[0], np.zeros((6, 8, 3), dtype=np.uint8))
            np.testing.assert_array_equal(writer.frames[1], np.full((6, 8, 3), 1, dtype=np.uint8))

            with summary_path.open('r', encoding='utf-8') as fh:
                saved_summary = json.load(fh)
            self.assertEqual(saved_summary['artifacts']['trajectory_npz'], str(trajectory_path))
            self.assertEqual(saved_summary['artifacts']['video_mp4'], str(video_path))
            self.assertEqual(saved_summary['episode_rewards'], [3.0])
            self.assertAlmostEqual(summary['avg_reward'], 3.0)
            self.assertIn('avg_obs_read_time_ms', summary)
            self.assertIn('avg_env_step_time_ms', summary)


if __name__ == '__main__':
    unittest.main()
119 tests/test_raw_action_trajectory_viewer.py (Normal file)
@@ -0,0 +1,119 @@
import tempfile
import unittest
from pathlib import Path
from types import SimpleNamespace
from unittest import mock

import numpy as np

from roboimi.utils import raw_action_trajectory_viewer as traj_view


class RawActionTrajectoryViewerTest(unittest.TestCase):
    def test_set_transfer_box_pose_writes_joint_qpos(self):
        joint_qpos = np.zeros(7, dtype=np.float64)

        class _FakeJoint:
            def __init__(self, qpos):
                self.qpos = qpos

        class _FakeData:
            def joint(self, name):
                assert name == "red_box_joint"
                return _FakeJoint(joint_qpos)

        traj_view.set_transfer_box_pose(_FakeData(), np.array([0.2, -0.1, 1.05], dtype=np.float64))

        np.testing.assert_array_equal(
            joint_qpos,
            np.array([0.2, -0.1, 1.05, 1.0, 0.0, 0.0, 0.0], dtype=np.float64),
        )

    def test_disable_cv2_highgui_temporarily_replaces_gui_calls(self):
        fake_cv2 = SimpleNamespace(
            namedWindow=lambda *args, **kwargs: "named",
            imshow=lambda *args, **kwargs: "imshow",
            waitKey=lambda *args, **kwargs: "wait",
        )

        restore = traj_view.disable_cv2_highgui(fake_cv2)
        self.assertIsNone(fake_cv2.namedWindow("x"))
        self.assertIsNone(fake_cv2.imshow("x", None))
        self.assertEqual(fake_cv2.waitKey(1), 1)

        restore()
        self.assertEqual(fake_cv2.namedWindow("x"), "named")
        self.assertEqual(fake_cv2.imshow("x", None), "imshow")
        self.assertEqual(fake_cv2.waitKey(1), "wait")

    def test_load_raw_action_positions_from_npz(self):
        raw_action = np.array(
            [
                [1.0, 2.0, 3.0, 0, 0, 0, 1, 11.0, 12.0, 13.0, 0, 0, 0, 1, -1, -1],
                [4.0, 5.0, 6.0, 0, 0, 0, 1, 14.0, 15.0, 16.0, 0, 0, 0, 1, -1, -1],
            ],
            dtype=np.float32,
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            path = Path(tmpdir) / "trajectory.npz"
            np.savez(path, raw_action=raw_action)

            positions = traj_view.load_raw_action_positions(path)

        np.testing.assert_array_equal(
            positions["left"],
            np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32),
        )
        np.testing.assert_array_equal(
            positions["right"],
            np.array([[11.0, 12.0, 13.0], [14.0, 15.0, 16.0]], dtype=np.float32),
        )

    def test_build_red_capsule_segments_downsamples_to_fit_scene_limit(self):
        left = np.stack([np.array([float(i), 0.0, 0.0], dtype=np.float32) for i in range(6)])
        right = np.stack([np.array([float(i), 1.0, 0.0], dtype=np.float32) for i in range(6)])

        markers = traj_view.build_trajectory_capsule_markers(
            {"left": left, "right": right},
            max_markers=4,
            radius=0.01,
        )

        self.assertLessEqual(len(markers), 4)
        self.assertTrue(all(marker["rgba"] == (1.0, 0.0, 0.0, 1.0) for marker in markers))
        self.assertTrue(all(marker["radius"] == 0.01 for marker in markers))

    def test_apply_capsule_markers_populates_user_scene(self):
        fake_scene = SimpleNamespace(
            maxgeom=3,
            ngeom=99,
            geoms=[object(), object(), object()],
        )
        markers = [
            {
                "from": np.array([0.0, 0.0, 0.0], dtype=np.float64),
                "to": np.array([1.0, 0.0, 0.0], dtype=np.float64),
                "rgba": (1.0, 0.0, 0.0, 1.0),
                "radius": 0.01,
            },
            {
                "from": np.array([0.0, 1.0, 0.0], dtype=np.float64),
                "to": np.array([1.0, 1.0, 0.0], dtype=np.float64),
                "rgba": (1.0, 0.0, 0.0, 1.0),
                "radius": 0.01,
            },
        ]

        with mock.patch.object(traj_view.mujoco, "mjv_initGeom") as init_geom, mock.patch.object(
            traj_view.mujoco,
            "mjv_connector",
        ) as connector:
            traj_view.apply_capsule_markers_to_scene(fake_scene, markers)

        self.assertEqual(fake_scene.ngeom, 2)
        self.assertEqual(init_geom.call_count, 2)
        self.assertEqual(connector.call_count, 2)


if __name__ == "__main__":
    unittest.main()
387 tests/test_resnet_transformer_agent_wiring.py (Normal file)
@@ -0,0 +1,387 @@
import contextlib
import sys
import types
import unittest
from pathlib import Path

import torch
from hydra import compose, initialize_config_dir
from hydra.errors import InstantiationException
from hydra.core.global_hydra import GlobalHydra
from hydra.utils import instantiate
from omegaconf import OmegaConf


_REPO_ROOT = Path(__file__).resolve().parents[1]
_CONFIG_DIR = str((_REPO_ROOT / 'roboimi/vla/conf').resolve())
_EXPECTED_CAMERA_NAMES = ['r_vis', 'top', 'front']
_MISSING = object()


class _FakeScheduler:
    def __init__(self, num_train_timesteps=100, **kwargs):
        self.config = types.SimpleNamespace(num_train_timesteps=num_train_timesteps)
        self.timesteps = []

    def add_noise(self, sample, noise, timestep):
        return sample + noise

    def set_timesteps(self, num_inference_steps):
        self.timesteps = list(range(num_inference_steps - 1, -1, -1))

    def step(self, noise_pred, timestep, sample):
        return types.SimpleNamespace(prev_sample=sample)


class _IdentityCrop:
    def __init__(self, size):
        self.size = size

    def __call__(self, x):
        return x


class _FakeResNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu1 = torch.nn.ReLU()
        self.conv2 = torch.nn.Conv2d(8, 16, kernel_size=3, padding=1, stride=2)
        self.relu2 = torch.nn.ReLU()
        self.avgpool = torch.nn.AdaptiveAvgPool2d((1, 1))
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        x = self.relu1(self.conv1(x))
        x = self.relu2(self.conv2(x))
        x = self.avgpool(x)
        x = torch.flatten(x, start_dim=1)
        return self.fc(x)


class _FakeRearrange(torch.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()

    def forward(self, x):
        return x


class _CondCapturingHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.last_cond = None

    def forward(self, sample, timestep, cond):
        self.last_cond = cond.detach().clone()
        return torch.zeros_like(sample)


@contextlib.contextmanager
def _stub_optional_modules():
    previous_modules = {}

    def inject(name, module):
        if name not in previous_modules:
            previous_modules[name] = sys.modules.get(name, _MISSING)
        sys.modules[name] = module

    diffusers_module = types.ModuleType('diffusers')
    schedulers_module = types.ModuleType('diffusers.schedulers')
    ddpm_module = types.ModuleType('diffusers.schedulers.scheduling_ddpm')
    ddim_module = types.ModuleType('diffusers.schedulers.scheduling_ddim')
    ddpm_module.DDPMScheduler = _FakeScheduler
    ddim_module.DDIMScheduler = _FakeScheduler
    diffusers_module.DDPMScheduler = _FakeScheduler
    diffusers_module.DDIMScheduler = _FakeScheduler
    diffusers_module.schedulers = schedulers_module
    schedulers_module.scheduling_ddpm = ddpm_module
    schedulers_module.scheduling_ddim = ddim_module

    torchvision_module = types.ModuleType('torchvision')
    models_module = types.ModuleType('torchvision.models')
    transforms_module = types.ModuleType('torchvision.transforms')
    models_module.resnet18 = lambda weights=None: _FakeResNet()
    transforms_module.CenterCrop = _IdentityCrop
    transforms_module.RandomCrop = _IdentityCrop
    torchvision_module.models = models_module
    torchvision_module.transforms = transforms_module

    einops_module = types.ModuleType('einops')
    einops_module.rearrange = lambda x, *args, **kwargs: x
    einops_layers_module = types.ModuleType('einops.layers')
    einops_layers_torch_module = types.ModuleType('einops.layers.torch')
    einops_layers_torch_module.Rearrange = _FakeRearrange
    einops_module.layers = einops_layers_module
    einops_layers_module.torch = einops_layers_torch_module

    try:
        inject('diffusers', diffusers_module)
        inject('diffusers.schedulers', schedulers_module)
        inject('diffusers.schedulers.scheduling_ddpm', ddpm_module)
        inject('diffusers.schedulers.scheduling_ddim', ddim_module)
        inject('torchvision', torchvision_module)
        inject('torchvision.models', models_module)
        inject('torchvision.transforms', transforms_module)
        inject('einops', einops_module)
        inject('einops.layers', einops_layers_module)
        inject('einops.layers.torch', einops_layers_torch_module)
        yield
    finally:
        for name, previous in reversed(list(previous_modules.items())):
            if previous is _MISSING:
                sys.modules.pop(name, None)
            else:
                sys.modules[name] = previous


def _compose_cfg(overrides=None):
    if not OmegaConf.has_resolver('len'):
        OmegaConf.register_new_resolver('len', lambda x: len(x))

    GlobalHydra.instance().clear()
    with initialize_config_dir(version_base=None, config_dir=_CONFIG_DIR):
        return compose(config_name='config', overrides=list(overrides or []))


def _make_images(batch_size, obs_horizon, image_shape, per_camera_fill=None):
    channels, height, width = image_shape
    per_camera_fill = per_camera_fill or {
        'front': 30.0,
        'top': 20.0,
        'r_vis': 10.0,
    }
    return {
        name: torch.full(
            (batch_size, obs_horizon, channels, height, width),
            fill_value=fill_value,
            dtype=torch.float32,
        )
        for name, fill_value in per_camera_fill.items()
    }


def _patch_backbone_for_order_tracking(backbone):
    feature_dim = backbone.output_dim

    def encode_mean(image_batch):
        mean_feature = image_batch.mean(dim=(1, 2, 3)).unsqueeze(-1)
        return mean_feature.repeat(1, feature_dim)

    if backbone.use_separate_rgb_encoder_per_camera:
        for encoder in backbone.rgb_encoder:
            encoder.forward_single_image = encode_mean
    else:
        backbone.rgb_encoder.forward_single_image = encode_mean


def _extract_camera_markers(cond, feature_dim, num_cams):
    camera_block = cond[0, 0, : feature_dim * num_cams].view(num_cams, feature_dim)
    return camera_block[:, 0]


class ResNetTransformerAgentWiringTest(unittest.TestCase):
    def test_hydra_wiring_uses_required_three_camera_transformer_conditioning_in_agent_order_and_ignores_extra_keys(self):
        cfg = _compose_cfg(
            overrides=[
                'agent.vision_backbone.pretrained_backbone_weights=null',
                'agent.vision_backbone.input_shape=[3,16,16]',
                'agent.inference_steps=1',
                'agent.head.n_layer=1',
                'agent.head.n_cond_layers=0',
                'agent.head.n_emb=32',
                'agent.head.n_head=4',
            ]
        )

        self.assertEqual(list(cfg.data.camera_names), _EXPECTED_CAMERA_NAMES)
        self.assertEqual(list(cfg.eval.camera_names), _EXPECTED_CAMERA_NAMES)
        self.assertEqual(list(cfg.agent.camera_names), _EXPECTED_CAMERA_NAMES)
        self.assertEqual(list(cfg.agent.vision_backbone.camera_names), _EXPECTED_CAMERA_NAMES)
        self.assertEqual(cfg.agent.head_type, 'transformer')
        self.assertEqual(cfg.agent.num_cams, 3)
        self.assertTrue(cfg.agent.head.obs_as_cond)
        self.assertFalse(cfg.agent.head.causal_attn)

        with _stub_optional_modules():
            agent = instantiate(cfg.agent)
            expected_cond_dim = agent.vision_encoder.output_dim * agent.num_cams + agent.obs_dim
            self.assertEqual(cfg.agent.head.cond_dim, expected_cond_dim)
            self.assertEqual(agent.per_step_cond_dim, expected_cond_dim)
            self.assertEqual(agent.noise_pred_net.cond_obs_emb.in_features, expected_cond_dim)

            batch_size = 2
            image_shape = tuple(cfg.agent.vision_backbone.input_shape)
            images = _make_images(
                batch_size,
                cfg.agent.obs_horizon,
                image_shape,
                per_camera_fill={
                    'front': 30.0,
                    'top': 20.0,
                    'r_vis': 10.0,
                    'left_wrist': 99.0,
                },
            )
            proprioception = torch.randn(batch_size, cfg.agent.obs_horizon, cfg.agent.obs_dim)
            _patch_backbone_for_order_tracking(agent.vision_encoder)
            capturing_head = _CondCapturingHead()
            agent.noise_pred_net = capturing_head
            predicted_actions = agent.predict_action(images, proprioception)
            self.assertEqual(
                predicted_actions.shape,
                (batch_size, cfg.agent.pred_horizon, cfg.agent.action_dim),
            )
            self.assertIsNotNone(capturing_head.last_cond)
            self.assertEqual(capturing_head.last_cond.shape[-1], expected_cond_dim)
            camera_markers = _extract_camera_markers(
                capturing_head.last_cond,
                agent.vision_encoder.output_dim,
                agent.num_cams,
            )
            self.assertTrue(torch.allclose(camera_markers, torch.tensor([10.0, 20.0, 30.0])))

            missing_images = dict(images)
            missing_images.pop('top')
            with self.assertRaisesRegex(ValueError, 'missing=.*top'):
                agent.predict_action(missing_images, proprioception)

    def test_agent_rejects_conflicting_explicit_backbone_camera_names(self):
        cfg = _compose_cfg(
            overrides=[
                'agent.vision_backbone.pretrained_backbone_weights=null',
                'agent.vision_backbone.input_shape=[3,16,16]',
            ]
        )
        cfg.agent.vision_backbone.camera_names = ['front', 'top', 'r_vis']

        with _stub_optional_modules():
            with self.assertRaisesRegex(InstantiationException, 'camera_names'):
                instantiate(cfg.agent)

    def test_backbone_uses_sorted_fallback_order_when_camera_names_unset(self):
        cfg = _compose_cfg(
            overrides=[
                'agent.vision_backbone.pretrained_backbone_weights=null',
                'agent.vision_backbone.input_shape=[3,16,16]',
|
||||
]
|
||||
)
|
||||
cfg.agent.vision_backbone.camera_names = None
|
||||
|
||||
with _stub_optional_modules():
|
||||
backbone = instantiate(cfg.agent.vision_backbone)
|
||||
_patch_backbone_for_order_tracking(backbone)
|
||||
images = _make_images(
|
||||
batch_size=1,
|
||||
obs_horizon=cfg.agent.obs_horizon,
|
||||
image_shape=tuple(cfg.agent.vision_backbone.input_shape),
|
||||
per_camera_fill={
|
||||
'top': 20.0,
|
||||
'front': 30.0,
|
||||
'r_vis': 10.0,
|
||||
},
|
||||
)
|
||||
ordered_features = backbone(images)
|
||||
camera_markers = _extract_camera_markers(
|
||||
ordered_features,
|
||||
backbone.output_dim,
|
||||
len(images),
|
||||
)
|
||||
self.assertTrue(torch.allclose(camera_markers, torch.tensor([30.0, 10.0, 20.0])))
|
||||
|
||||
def test_agent_queue_fallback_order_is_deterministic_when_camera_names_unset(self):
|
||||
cfg = _compose_cfg(
|
||||
overrides=[
|
||||
'agent.vision_backbone.pretrained_backbone_weights=null',
|
||||
'agent.vision_backbone.input_shape=[3,16,16]',
|
||||
]
|
||||
)
|
||||
cfg.agent.camera_names = None
|
||||
cfg.agent.vision_backbone.camera_names = None
|
||||
|
||||
with _stub_optional_modules():
|
||||
agent = instantiate(cfg.agent)
|
||||
observation = {
|
||||
'qpos': torch.randn(cfg.agent.obs_dim),
|
||||
'images': {
|
||||
'top': torch.full(tuple(cfg.agent.vision_backbone.input_shape), 20.0),
|
||||
'front': torch.full(tuple(cfg.agent.vision_backbone.input_shape), 30.0),
|
||||
'r_vis': torch.full(tuple(cfg.agent.vision_backbone.input_shape), 10.0),
|
||||
},
|
||||
}
|
||||
agent._populate_queues(observation)
|
||||
batch = agent._prepare_observation_batch()
|
||||
self.assertEqual(list(batch['images'].keys()), ['front', 'r_vis', 'top'])
|
||||
|
||||
def test_backbone_rejects_camera_count_mismatch_when_camera_names_unset(self):
|
||||
cfg = _compose_cfg(
|
||||
overrides=[
|
||||
'agent.vision_backbone.pretrained_backbone_weights=null',
|
||||
'agent.vision_backbone.input_shape=[3,16,16]',
|
||||
]
|
||||
)
|
||||
cfg.agent.vision_backbone.camera_names = None
|
||||
|
||||
with _stub_optional_modules():
|
||||
backbone = instantiate(cfg.agent.vision_backbone)
|
||||
images = _make_images(
|
||||
batch_size=1,
|
||||
obs_horizon=cfg.agent.obs_horizon,
|
||||
image_shape=tuple(cfg.agent.vision_backbone.input_shape),
|
||||
per_camera_fill={
|
||||
'front': 30.0,
|
||||
'r_vis': 10.0,
|
||||
},
|
||||
)
|
||||
with self.assertRaisesRegex(ValueError, 'num_cameras'):
|
||||
backbone(images)
|
||||
|
||||
def test_agent_rejects_camera_count_mismatch_when_camera_names_unset(self):
|
||||
cfg = _compose_cfg(
|
||||
overrides=[
|
||||
'agent.vision_backbone.pretrained_backbone_weights=null',
|
||||
'agent.vision_backbone.input_shape=[3,16,16]',
|
||||
'agent.inference_steps=1',
|
||||
'agent.head.n_layer=1',
|
||||
'agent.head.n_cond_layers=0',
|
||||
'agent.head.n_emb=32',
|
||||
'agent.head.n_head=4',
|
||||
]
|
||||
)
|
||||
cfg.agent.camera_names = None
|
||||
cfg.agent.vision_backbone.camera_names = None
|
||||
|
||||
with _stub_optional_modules():
|
||||
agent = instantiate(cfg.agent)
|
||||
images = _make_images(
|
||||
batch_size=1,
|
||||
obs_horizon=cfg.agent.obs_horizon,
|
||||
image_shape=tuple(cfg.agent.vision_backbone.input_shape),
|
||||
per_camera_fill={
|
||||
'front': 30.0,
|
||||
'r_vis': 10.0,
|
||||
},
|
||||
)
|
||||
proprioception = torch.randn(1, cfg.agent.obs_horizon, cfg.agent.obs_dim)
|
||||
with self.assertRaisesRegex(ValueError, 'num_cams'):
|
||||
agent.predict_action(images, proprioception)
|
||||
|
||||
def test_agent_rejects_num_cams_mismatch_with_backbone_when_camera_names_unset(self):
|
||||
cfg = _compose_cfg(
|
||||
overrides=[
|
||||
'agent.vision_backbone.pretrained_backbone_weights=null',
|
||||
'agent.vision_backbone.input_shape=[3,16,16]',
|
||||
]
|
||||
)
|
||||
cfg.agent.camera_names = None
|
||||
cfg.agent.vision_backbone.camera_names = None
|
||||
cfg.agent.num_cams = 2
|
||||
cfg.agent.vision_backbone.num_cameras = 3
|
||||
|
||||
with _stub_optional_modules():
|
||||
with self.assertRaisesRegex(InstantiationException, 'num_cams'):
|
||||
instantiate(cfg.agent)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
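The `inject`/`finally` bookkeeping above is the standard save-and-restore pattern for stubbing optional imports in `sys.modules`. A minimal, self-contained sketch of the same idea; the names `stub_modules` and `_MISSING` below are illustrative and not this file's exact definitions:

import contextlib
import sys
import types

_MISSING = object()  # sentinel: distinguishes "module absent" from "module is None"

@contextlib.contextmanager
def stub_modules(stubs):
    # Remember what was installed, overwrite with fakes, restore on exit.
    previous = {name: sys.modules.get(name, _MISSING) for name in stubs}
    sys.modules.update(stubs)
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is _MISSING:
                sys.modules.pop(name, None)
            else:
                sys.modules[name] = old

with stub_modules({'einops': types.ModuleType('einops')}):
    import einops  # resolves to the stub while the context is active
    assert isinstance(einops, types.ModuleType)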
tests/test_robot_asset_paths.py (new file, 63 lines)
@@ -0,0 +1,63 @@
import os
import tempfile
import unittest
from pathlib import Path
from unittest import mock

from roboimi.assets.robots.diana_med import BiDianaMed


class _FakeKDL:
    init_calls = []
    reset_calls = []

    def __init__(self, urdf_path):
        self.__class__.init_calls.append(urdf_path)

    def resetChain(self, base, end):
        self.__class__.reset_calls.append((base, end))


class RobotAssetPathResolutionTest(unittest.TestCase):
    def setUp(self):
        _FakeKDL.init_calls = []
        _FakeKDL.reset_calls = []

    def test_bidianamed_resolves_robot_asset_paths_independent_of_cwd(self):
        repo_root = Path(__file__).resolve().parents[1]
        expected_xml = repo_root / 'roboimi/assets/models/manipulators/DianaMed/bi_diana_transfer_ee.xml'
        expected_urdf = repo_root / 'roboimi/assets/models/manipulators/DianaMed/DualDianaMed.urdf'
        xml_calls = []

        def fake_from_xml_path(*, filename, assets=None):
            xml_calls.append((filename, assets))
            return object()

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch(
                    'roboimi.assets.robots.arm_base.mujoco.MjModel.from_xml_path',
                    side_effect=fake_from_xml_path,
                ), mock.patch(
                    'roboimi.assets.robots.arm_base.mujoco.MjData',
                    return_value=object(),
                ), mock.patch(
                    'roboimi.assets.robots.arm_base.KDL_utils',
                    _FakeKDL,
                ):
                    BiDianaMed()
            finally:
                os.chdir(previous_cwd)

        self.assertEqual(len(xml_calls), 1)
        self.assertEqual(Path(xml_calls[0][0]), expected_xml)
        self.assertTrue(Path(xml_calls[0][0]).is_absolute())
        self.assertGreaterEqual(len(_FakeKDL.init_calls), 2)
        self.assertEqual({Path(path) for path in _FakeKDL.init_calls}, {expected_urdf})
        self.assertTrue(all(Path(path).is_absolute() for path in _FakeKDL.init_calls))


if __name__ == '__main__':
    unittest.main()
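The test above pins a cwd-independence property: asset paths must be anchored to the package, not to `os.getcwd()`. A minimal sketch of the usual way to get that property; the constant and function names below are illustrative, not the repo's actual definitions:

from pathlib import Path

# Anchor asset lookups to this module's own location on disk, so resolved
# paths are absolute and unaffected by the caller's working directory.
_ASSET_ROOT = Path(__file__).resolve().parent / 'models' / 'manipulators' / 'DianaMed'

def bi_diana_xml_path() -> str:
    return str(_ASSET_ROOT / 'bi_diana_transfer_ee.xml')

def dual_diana_urdf_path() -> str:
    return str(_ASSET_ROOT / 'DualDianaMed.urdf')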
tests/test_simple_robot_dataset_image_loading.py (new file, 58 lines)
@@ -0,0 +1,58 @@
import sys
import tempfile
import types
import unittest
from pathlib import Path
from unittest import mock

import h5py
import numpy as np

from roboimi.vla.data.simpe_robot_dataset import SimpleRobotDataset


class SimpleRobotDatasetImageLoadingTest(unittest.TestCase):
    def _write_episode(self, dataset_dir: Path) -> None:
        episode_path = dataset_dir / "episode_0.hdf5"
        with h5py.File(episode_path, "w") as root:
            root.create_dataset("action", data=np.arange(8, dtype=np.float32).reshape(4, 2))
            root.create_dataset(
                "observations/qpos",
                data=np.arange(16, dtype=np.float32).reshape(4, 4),
            )
            root.create_dataset("task", data=np.array([b"sim_transfer"]))
            root.create_dataset(
                "observations/images/front",
                data=np.arange(4 * 8 * 8 * 3, dtype=np.uint8).reshape(4, 8, 8, 3),
            )

    def test_getitem_only_resizes_observation_horizon_images(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            dataset_dir = Path(tmpdir)
            self._write_episode(dataset_dir)
            dataset = SimpleRobotDataset(
                dataset_dir,
                obs_horizon=2,
                pred_horizon=3,
                camera_names=["front"],
            )

            resize_calls = []

            def fake_resize(image, size, interpolation=None):
                resize_calls.append(
                    {
                        "shape": tuple(image.shape),
                        "size": size,
                        "interpolation": interpolation,
                    }
                )
                return image

            fake_cv2 = types.SimpleNamespace(INTER_LINEAR=1, resize=fake_resize)

            with mock.patch.dict(sys.modules, {"cv2": fake_cv2}):
                sample = dataset[1]

            self.assertEqual(len(resize_calls), 2)
            self.assertEqual(tuple(sample["observation.front"].shape), (2, 3, 8, 8))
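The `mock.patch.dict(sys.modules, ...)` trick above only works because the dataset resolves `cv2` at call time rather than at module import time, which the test's patching strategy implies: any `import cv2` executed while the patch is active returns the fake. A stripped-down illustration (all names local to this sketch):

import sys
import types
from unittest import mock

sizes = []

def fake_resize(image, size, interpolation=None):
    # Record the requested size and pass the image through unchanged.
    sizes.append(size)
    return image

fake_cv2 = types.SimpleNamespace(INTER_LINEAR=1, resize=fake_resize)

with mock.patch.dict(sys.modules, {'cv2': fake_cv2}):
    import cv2  # resolves to fake_cv2 while the patch is active
    assert cv2.resize('img', (8, 8)) == 'img'
    assert sizes == [(8, 8)]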
tests/test_streaming_episode_writer.py (new file, 79 lines)
@@ -0,0 +1,79 @@
import tempfile
import unittest
from pathlib import Path

import h5py
import numpy as np

from roboimi.utils.streaming_episode_writer import StreamingEpisodeWriter


class StreamingEpisodeWriterTest(unittest.TestCase):
    def test_commit_persists_raw_action_and_resized_images(self):
        camera_names = ["angle", "r_vis", "top", "front"]
        raw_action_0 = np.arange(16, dtype=np.float32)
        raw_action_1 = np.arange(16, dtype=np.float32) + 100.0
        qpos_0 = np.arange(16, dtype=np.float32) + 200.0
        qpos_1 = np.arange(16, dtype=np.float32) + 300.0

        with tempfile.TemporaryDirectory() as tmpdir:
            episode_path = Path(tmpdir) / "episode_0.hdf5"
            writer = StreamingEpisodeWriter(
                dataset_path=episode_path,
                max_timesteps=2,
                camera_names=camera_names,
                image_size=(256, 256),
            )

            writer.append(
                qpos=qpos_0,
                action=raw_action_0,
                images={
                    cam: np.full((480, 640, 3), fill_value=idx + 1, dtype=np.uint8)
                    for idx, cam in enumerate(camera_names)
                },
            )
            writer.append(
                qpos=qpos_1,
                action=raw_action_1,
                images={
                    cam: np.full((480, 640, 3), fill_value=idx + 11, dtype=np.uint8)
                    for idx, cam in enumerate(camera_names)
                },
            )
            writer.commit()

            self.assertTrue(episode_path.exists())
            self.assertFalse(Path(str(episode_path) + ".tmp").exists())

            with h5py.File(episode_path, "r") as root:
                self.assertEqual(root["action"].shape, (2, 16))
                self.assertEqual(root["observations/qpos"].shape, (2, 16))
                np.testing.assert_allclose(root["action"][0], raw_action_0)
                np.testing.assert_allclose(root["action"][1], raw_action_1)
                np.testing.assert_allclose(root["observations/qpos"][0], qpos_0)
                np.testing.assert_allclose(root["observations/qpos"][1], qpos_1)
                for idx, cam_name in enumerate(camera_names):
                    dataset = root[f"observations/images/{cam_name}"]
                    self.assertEqual(dataset.shape, (2, 256, 256, 3))
                    self.assertEqual(dataset.dtype, np.uint8)
                    self.assertTrue(np.all(dataset[0] == idx + 1))
                    self.assertTrue(np.all(dataset[1] == idx + 11))

    def test_discard_removes_temporary_file(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            episode_path = Path(tmpdir) / "episode_0.hdf5"
            writer = StreamingEpisodeWriter(
                dataset_path=episode_path,
                max_timesteps=1,
                camera_names=["angle", "r_vis", "top", "front"],
                image_size=(256, 256),
            )
            writer.discard()

            self.assertFalse(episode_path.exists())
            self.assertFalse(Path(str(episode_path) + ".tmp").exists())


if __name__ == "__main__":
    unittest.main()
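The pair of assertions about `episode_path` and its `.tmp` sibling is consistent with the classic write-to-temp-then-rename recipe, which guarantees readers never observe a half-written episode file. A minimal sketch of that recipe under that assumption, not `StreamingEpisodeWriter`'s actual implementation:

import os
from pathlib import Path

class AtomicWriter:
    """Stream into `<path>.tmp`; publish atomically on commit, delete on discard."""

    def __init__(self, path):
        self.path = Path(path)
        self.tmp_path = Path(str(path) + '.tmp')
        self._handle = open(self.tmp_path, 'wb')

    def write(self, chunk: bytes):
        self._handle.write(chunk)

    def commit(self):
        self._handle.close()
        os.replace(self.tmp_path, self.path)  # atomic rename on POSIX

    def discard(self):
        self._handle.close()
        self.tmp_path.unlink(missing_ok=True)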
tests/test_train_vla_rollout_validation.py (new file, 779 lines)
@@ -0,0 +1,779 @@
import os
import tempfile
import unittest
from copy import deepcopy
from pathlib import Path
from unittest import mock

import numpy as np
import torch
from omegaconf import OmegaConf
from torch import nn

from roboimi.demos.vla_scripts import eval_vla, train_vla


class _FakeDataset:
    def __len__(self):
        return 4


class _FakeLoader:
    def __init__(self, batch, length=1):
        self._batches = [batch] * length

    def __len__(self):
        return len(self._batches)

    def __iter__(self):
        return iter(self._batches)


class _FakeOptimizer:
    def __init__(self, lr=1e-3):
        self.param_groups = [{'lr': lr}]

    def zero_grad(self):
        return None

    def step(self):
        return None

    def state_dict(self):
        return {}

    def load_state_dict(self, state_dict):
        del state_dict
        return None


class _FakeScheduler:
    def __init__(self):
        self.step_calls = 0

    def step(self):
        self.step_calls += 1

    def state_dict(self):
        return {}

    def load_state_dict(self, state_dict):
        del state_dict
        return None


class _FakeProgressBar:
    def __init__(self, iterable):
        self._items = list(iterable)
        self.postfix_calls = []

    def __iter__(self):
        return iter(self._items)

    def set_postfix(self, values):
        self.postfix_calls.append(values)


class _FakeAgent(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor(0.0))

    def to(self, device):
        del device
        return self

    def compute_loss(self, agent_input):
        del agent_input
        return (self.weight - torch.tensor(0.5)).pow(2)

    def get_normalization_stats(self):
        return {}


class _SequentialLossAgent(nn.Module):
    def __init__(self, losses):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor(0.0))
        self._losses = list(losses)
        self._index = 0

    def to(self, device):
        del device
        return self

    def compute_loss(self, agent_input):
        del agent_input
        loss_value = self._losses[self._index]
        self._index += 1
        return (self.weight * 0) + torch.tensor(float(loss_value))

    def get_normalization_stats(self):
        return {}


class _FakeEvalAgent:
    def __init__(self):
        self.reset_calls = 0

    def eval(self):
        return self

    def to(self, device):
        del device
        return self

    def reset(self):
        self.reset_calls += 1

    def select_action(self, observation):
        del observation
        return torch.zeros(2)


class _FakeEvalEnv:
    def reset(self, box_pos):
        self.box_pos = box_pos

    def _get_image_obs(self):
        return {
            'images': {
                'front': np.zeros((8, 8, 3), dtype=np.uint8),
            }
        }

    def _get_qpos_obs(self):
        return {'qpos': np.zeros(4, dtype=np.float32)}

    def render(self):
        raise AssertionError('render should not be called in this helper delegation test')

class TrainVLARolloutValidationTest(unittest.TestCase):
    def test_default_train_config_uses_full_dataset_and_epoch_rollout_validation(self):
        cfg = OmegaConf.load(Path('roboimi/vla/conf/config.yaml'))

        self.assertEqual(cfg.train.val_split, 0.0)
        self.assertGreater(cfg.train.batch_size, 8)
        self.assertGreater(float(cfg.train.lr), 5e-5)
        self.assertGreater(cfg.train.num_workers, 8)
        self.assertEqual(cfg.train.rollout_val_freq_epochs, 50)

    def test_eval_main_delegates_to_plain_run_eval_helper(self):
        cfg = OmegaConf.create(
            {
                'agent': {},
                'eval': {
                    'ckpt_path': 'checkpoints/vla_model_step_1.pt',
                    'num_episodes': 1,
                    'max_timesteps': 1,
                    'device': 'cpu',
                    'task_name': 'sim_transfer',
                    'camera_names': ['front'],
                    'use_smoothing': False,
                    'smooth_alpha': 0.3,
                    'verbose_action': False,
                    'headless': True,
                },
            }
        )
        run_eval_mock = mock.Mock()

        with mock.patch.object(eval_vla, '_run_eval', run_eval_mock, create=True), \
                mock.patch.object(eval_vla, 'load_checkpoint', return_value=(_FakeEvalAgent(), None)), \
                mock.patch.object(eval_vla, 'make_sim_env', return_value=_FakeEvalEnv()), \
                mock.patch.object(eval_vla, 'sample_transfer_pose', return_value=np.zeros(3)), \
                mock.patch.object(eval_vla, 'execute_policy_action'), \
                mock.patch.object(eval_vla, 'tqdm', side_effect=lambda iterable, **kwargs: iterable):
            eval_vla.main.__wrapped__(cfg)

        run_eval_mock.assert_called_once_with(cfg)

    def test_run_training_rollout_validation_runs_every_50_epochs_and_uses_avg_reward_metric(self):
        cfg = OmegaConf.create(
            {
                'train': {
                    'device': 'cpu',
                    'batch_size': 1,
                    'num_workers': 0,
                    'val_split': 0.0,
                    'seed': 0,
                    'lr': 1e-3,
                    'max_steps': 100,
                    'log_freq': 1,
                    'save_freq': 1000,
                    'warmup_steps': 1,
                    'scheduler_type': 'constant',
                    'min_lr': 0.0,
                    'grad_clip': 1.0,
                    'weight_decay': 0.0,
                    'pretrained_ckpt': None,
                    'resume_ckpt': None,
                    'use_swanlab': False,
                    'rollout_val_freq_epochs': 50,
                    'rollout_num_episodes': 3,
                },
                'data': {
                    'camera_names': ['front'],
                },
                'agent': {
                    '_target_': 'fake.agent',
                },
                'eval': {
                    'ckpt_path': 'unused.pt',
                    'num_episodes': 99,
                    'max_timesteps': 1,
                    'device': 'cpu',
                    'task_name': 'sim_transfer',
                    'camera_names': ['front'],
                    'use_smoothing': False,
                    'smooth_alpha': 0.3,
                    'verbose_action': False,
                    'headless': False,
                },
            }
        )
        agent = _FakeAgent()
        rollout_mock = mock.Mock(side_effect=[{'avg_reward': 2.0}, {'avg_reward': 1.0}])
        swanlab_log_mock = mock.Mock()
        saved_checkpoints = []

        def fake_instantiate(config_node, **_kwargs):
            if config_node is cfg.data:
                return _FakeDataset()
            if config_node is cfg.agent:
                return agent
            raise AssertionError(f'unexpected instantiate config: {config_node!r}')

        def fake_dataloader(_dataset, *, shuffle, **_kwargs):
            del shuffle, _kwargs
            return _FakeLoader(
                {
                    'observation.front': torch.zeros(1, 3, 2, 2),
                    'observation.state': torch.zeros(1, 4),
                    'action': torch.zeros(1, 2),
                    'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
                },
                length=1,
            )

        def fake_torch_save(payload, path):
            saved_checkpoints.append((str(path), deepcopy(payload)))
            return None

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
                        mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
                        mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
                        mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
                        mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
                        mock.patch.object(train_vla, '_log_to_swanlab', swanlab_log_mock), \
                        mock.patch.object(train_vla.torch, 'save', side_effect=fake_torch_save), \
                        mock.patch.object(eval_vla, '_run_eval', rollout_mock, create=True), \
                        mock.patch.object(eval_vla.main, '__wrapped__', side_effect=AssertionError('training hook should call eval_vla._run_eval')):
                    train_vla._run_training(cfg)
            finally:
                os.chdir(previous_cwd)

        self.assertEqual(rollout_mock.call_count, 2)
        first_rollout_cfg = rollout_mock.call_args_list[0].args[0]
        second_rollout_cfg = rollout_mock.call_args_list[1].args[0]
        self.assertEqual(first_rollout_cfg.eval.ckpt_path, 'checkpoints/vla_model_step_49.pt')
        self.assertEqual(second_rollout_cfg.eval.ckpt_path, 'checkpoints/vla_model_step_99.pt')
        self.assertEqual(first_rollout_cfg.eval.num_episodes, 3)
        self.assertTrue(first_rollout_cfg.eval.headless)
        self.assertEqual(first_rollout_cfg.eval.device, 'cpu')
        self.assertFalse(first_rollout_cfg.eval.verbose_action)
        self.assertEqual(cfg.eval.ckpt_path, 'unused.pt')
        self.assertEqual(cfg.eval.num_episodes, 99)
        self.assertFalse(cfg.eval.headless)
        self.assertEqual(cfg.eval.device, 'cpu')
        self.assertFalse(cfg.eval.verbose_action)

        rollout_reward_logs = [
            call.args[1]['rollout/avg_reward']
            for call in swanlab_log_mock.call_args_list
            if len(call.args) >= 2 and 'rollout/avg_reward' in call.args[1]
        ]
        self.assertEqual(rollout_reward_logs, [2.0, 1.0])

        best_model_saves = [
            payload for path, payload in saved_checkpoints
            if path.endswith('checkpoints/vla_model_best.pt')
        ]
        self.assertEqual(len(best_model_saves), 1)
        self.assertEqual(best_model_saves[0]['rollout_avg_reward'], 2.0)

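For orientation, the cadence the assertions above encode: the stubbed train loader yields one batch per epoch, so `max_steps=100` means 100 epochs, and a 50-epoch rollout frequency fires twice, at 0-indexed steps 49 and 99, exactly the checkpoint names checked. A tiny model of that arithmetic, illustrative only and not `train_vla`'s code:

def rollout_steps(max_steps, steps_per_epoch, freq_epochs):
    # Steps are 0-indexed; a rollout fires at the last step of every
    # freq_epochs-th epoch.
    freq_steps = steps_per_epoch * freq_epochs
    return [step for step in range(max_steps) if (step + 1) % freq_steps == 0]

assert rollout_steps(max_steps=100, steps_per_epoch=1, freq_epochs=50) == [49, 99]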
    def test_run_training_keeps_loss_based_best_checkpoint_until_first_rollout_metric_exists(self):
        cfg = OmegaConf.create(
            {
                'train': {
                    'device': 'cpu',
                    'batch_size': 1,
                    'num_workers': 0,
                    'val_split': 0.0,
                    'seed': 0,
                    'lr': 1e-3,
                    'max_steps': 5,
                    'log_freq': 1,
                    'save_freq': 2,
                    'warmup_steps': 1,
                    'scheduler_type': 'constant',
                    'min_lr': 0.0,
                    'grad_clip': 1.0,
                    'weight_decay': 0.0,
                    'pretrained_ckpt': None,
                    'resume_ckpt': None,
                    'use_swanlab': False,
                    'rollout_val_freq_epochs': 50,
                    'rollout_num_episodes': 3,
                },
                'data': {
                    'camera_names': ['front'],
                },
                'agent': {
                    '_target_': 'fake.agent',
                },
                'eval': {
                    'ckpt_path': 'unused.pt',
                    'num_episodes': 99,
                    'max_timesteps': 1,
                    'device': 'cpu',
                    'task_name': 'sim_transfer',
                    'camera_names': ['front'],
                    'use_smoothing': False,
                    'smooth_alpha': 0.3,
                    'verbose_action': False,
                    'headless': False,
                },
            }
        )
        saved_checkpoints = []
        rollout_mock = mock.Mock()

        def fake_instantiate(config_node, **_kwargs):
            if config_node is cfg.data:
                return _FakeDataset()
            if config_node is cfg.agent:
                return _FakeAgent()
            raise AssertionError(f'unexpected instantiate config: {config_node!r}')

        def fake_dataloader(_dataset, *, shuffle, **_kwargs):
            del shuffle, _kwargs
            return _FakeLoader(
                {
                    'observation.front': torch.zeros(1, 3, 2, 2),
                    'observation.state': torch.zeros(1, 4),
                    'action': torch.zeros(1, 2),
                    'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
                },
                length=5,
            )

        def fake_torch_save(payload, path):
            saved_checkpoints.append((str(path), deepcopy(payload)))
            return None

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
                        mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
                        mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
                        mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
                        mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
                        mock.patch.object(train_vla.torch, 'save', side_effect=fake_torch_save), \
                        mock.patch.object(eval_vla, '_run_eval', rollout_mock, create=True):
                    train_vla._run_training(cfg)
            finally:
                os.chdir(previous_cwd)

        self.assertEqual(rollout_mock.call_count, 0)
        best_model_saves = [
            payload for path, payload in saved_checkpoints
            if path.endswith('checkpoints/vla_model_best.pt')
        ]
        self.assertEqual(len(best_model_saves), 1)
        self.assertIsNone(best_model_saves[0]['rollout_avg_reward'])

    def test_run_training_disables_drop_last_when_train_set_is_smaller_than_batch_size(self):
        cfg = OmegaConf.create(
            {
                'train': {
                    'device': 'cpu',
                    'batch_size': 8,
                    'num_workers': 0,
                    'val_split': 0.0,
                    'seed': 0,
                    'lr': 1e-3,
                    'max_steps': 1,
                    'log_freq': 1,
                    'save_freq': 10,
                    'warmup_steps': 1,
                    'scheduler_type': 'constant',
                    'min_lr': 0.0,
                    'grad_clip': 1.0,
                    'weight_decay': 0.0,
                    'pretrained_ckpt': None,
                    'resume_ckpt': None,
                    'use_swanlab': False,
                    'rollout_val_freq_epochs': 50,
                    'rollout_num_episodes': 3,
                },
                'data': {
                    'camera_names': ['front'],
                },
                'agent': {
                    '_target_': 'fake.agent',
                },
                'eval': {
                    'ckpt_path': 'unused.pt',
                    'num_episodes': 99,
                    'max_timesteps': 1,
                    'device': 'cpu',
                    'task_name': 'sim_transfer',
                    'camera_names': ['front'],
                    'use_smoothing': False,
                    'smooth_alpha': 0.3,
                    'verbose_action': False,
                    'headless': False,
                },
            }
        )
        dataloader_calls = []

        def fake_instantiate(config_node, **_kwargs):
            if config_node is cfg.data:
                return _FakeDataset()
            if config_node is cfg.agent:
                return _FakeAgent()
            raise AssertionError(f'unexpected instantiate config: {config_node!r}')

        def fake_dataloader(dataset, *, shuffle, drop_last, **_kwargs):
            dataloader_calls.append({
                'shuffle': shuffle,
                'drop_last': drop_last,
                'dataset_len': len(dataset),
            })
            return _FakeLoader(
                {
                    'observation.front': torch.zeros(1, 3, 2, 2),
                    'observation.state': torch.zeros(1, 4),
                    'action': torch.zeros(1, 2),
                    'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
                },
                length=1,
            )

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
                        mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
                        mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
                        mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
                        mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
                        mock.patch.object(train_vla.torch, 'save', return_value=None):
                    train_vla._run_training(cfg)
            finally:
                os.chdir(previous_cwd)

        train_loader_calls = [call for call in dataloader_calls if call['shuffle']]
        self.assertEqual(len(train_loader_calls), 1)
        self.assertFalse(train_loader_calls[0]['drop_last'])

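Why this guard matters: with `drop_last=True`, a dataset smaller than the batch size produces zero batches, and training would silently do nothing. The behavior being defended against is plain PyTorch `DataLoader` semantics:

import torch
from torch.utils.data import DataLoader, TensorDataset

tiny = TensorDataset(torch.zeros(1, 4))  # one sample against a batch size of 8

# drop_last=True discards the lone incomplete batch: no training steps at all.
assert len(list(DataLoader(tiny, batch_size=8, drop_last=True))) == 0
# drop_last=False keeps the partial batch, which is what the test requires.
assert len(list(DataLoader(tiny, batch_size=8, drop_last=False))) == 1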
    def test_run_training_disables_persistent_workers_for_train_and_val_loaders(self):
        cfg = OmegaConf.create(
            {
                'train': {
                    'device': 'cpu',
                    'batch_size': 2,
                    'num_workers': 2,
                    'val_split': 0.25,
                    'seed': 0,
                    'lr': 1e-3,
                    'max_steps': 1,
                    'log_freq': 1,
                    'save_freq': 10,
                    'warmup_steps': 1,
                    'scheduler_type': 'constant',
                    'min_lr': 0.0,
                    'grad_clip': 1.0,
                    'weight_decay': 0.0,
                    'pretrained_ckpt': None,
                    'resume_ckpt': None,
                    'use_swanlab': False,
                    'rollout_val_freq_epochs': 50,
                    'rollout_num_episodes': 3,
                },
                'data': {
                    'camera_names': ['front'],
                },
                'agent': {
                    '_target_': 'fake.agent',
                },
                'eval': {
                    'ckpt_path': 'unused.pt',
                    'num_episodes': 99,
                    'max_timesteps': 1,
                    'device': 'cpu',
                    'task_name': 'sim_transfer',
                    'camera_names': ['front'],
                    'use_smoothing': False,
                    'smooth_alpha': 0.3,
                    'verbose_action': False,
                    'headless': False,
                },
            }
        )
        dataloader_calls = []

        def fake_instantiate(config_node, **_kwargs):
            if config_node is cfg.data:
                return _FakeDataset()
            if config_node is cfg.agent:
                return _FakeAgent()
            raise AssertionError(f'unexpected instantiate config: {config_node!r}')

        def fake_dataloader(_dataset, *, shuffle, persistent_workers, num_workers, **_kwargs):
            dataloader_calls.append({
                'shuffle': shuffle,
                'num_workers': num_workers,
                'persistent_workers': persistent_workers,
            })
            return _FakeLoader(
                {
                    'observation.front': torch.zeros(1, 3, 2, 2),
                    'observation.state': torch.zeros(1, 4),
                    'action': torch.zeros(1, 2),
                    'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
                },
                length=1,
            )

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
                        mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
                        mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
                        mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
                        mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
                        mock.patch.object(train_vla.torch, 'save', return_value=None):
                    train_vla._run_training(cfg)
            finally:
                os.chdir(previous_cwd)

        self.assertEqual(len(dataloader_calls), 2)
        self.assertEqual([call['shuffle'] for call in dataloader_calls], [True, False])
        self.assertTrue(all(call['num_workers'] == 2 for call in dataloader_calls))
        self.assertTrue(all(call['persistent_workers'] is False for call in dataloader_calls))

    def test_run_training_uses_loss_best_until_first_rollout_then_prefers_rollout_reward(self):
        cfg = OmegaConf.create(
            {
                'train': {
                    'device': 'cpu',
                    'batch_size': 1,
                    'num_workers': 0,
                    'val_split': 0.0,
                    'seed': 0,
                    'lr': 1e-3,
                    'max_steps': 6,
                    'log_freq': 1,
                    'save_freq': 1,
                    'warmup_steps': 1,
                    'scheduler_type': 'constant',
                    'min_lr': 0.0,
                    'grad_clip': 1.0,
                    'weight_decay': 0.0,
                    'pretrained_ckpt': None,
                    'resume_ckpt': None,
                    'use_swanlab': False,
                    'rollout_val_freq_epochs': 2,
                    'rollout_num_episodes': 1,
                },
                'data': {
                    'camera_names': ['front'],
                },
                'agent': {
                    '_target_': 'fake.agent',
                },
                'eval': {
                    'ckpt_path': 'unused.pt',
                    'num_episodes': 99,
                    'max_timesteps': 1,
                    'device': 'cpu',
                    'task_name': 'sim_transfer',
                    'camera_names': ['front'],
                    'use_smoothing': False,
                    'smooth_alpha': 0.3,
                    'verbose_action': False,
                    'headless': False,
                },
            }
        )
        agent = _SequentialLossAgent([10, 9, 8, 7, 6, 5])
        rollout_mock = mock.Mock(return_value={'avg_reward': 1.0})
        saved_checkpoints = []

        def fake_instantiate(config_node, **_kwargs):
            if config_node is cfg.data:
                return _FakeDataset()
            if config_node is cfg.agent:
                return agent
            raise AssertionError(f'unexpected instantiate config: {config_node!r}')

        def fake_dataloader(_dataset, *, shuffle, **_kwargs):
            del _kwargs
            return _FakeLoader(
                {
                    'observation.front': torch.zeros(1, 3, 2, 2),
                    'observation.state': torch.zeros(1, 4),
                    'action': torch.zeros(1, 2),
                    'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
                },
                length=2 if shuffle else 1,
            )

        def fake_torch_save(payload, path):
            saved_checkpoints.append((str(path), deepcopy(payload)))
            return None

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
                        mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
                        mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
                        mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
                        mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
                        mock.patch.object(train_vla.torch, 'save', side_effect=fake_torch_save), \
                        mock.patch.object(eval_vla, '_run_eval', rollout_mock, create=True):
                    train_vla._run_training(cfg)
            finally:
                os.chdir(previous_cwd)

        best_model_saves = [
            (payload['step'], payload['rollout_avg_reward'])
            for path, payload in saved_checkpoints
            if path.endswith('checkpoints/vla_model_best.pt')
        ]
        self.assertEqual(
            best_model_saves,
            [
                (1, None),
                (2, None),
                (3, None),
                (3, 1.0),
            ],
        )
        self.assertEqual(rollout_mock.call_count, 1)

    def test_run_training_keeps_tiny_train_dataset_batch_when_batch_size_is_larger(self):
        cfg = OmegaConf.create(
            {
                'train': {
                    'device': 'cpu',
                    'batch_size': 8,
                    'num_workers': 0,
                    'val_split': 0.0,
                    'seed': 0,
                    'lr': 1e-3,
                    'max_steps': 1,
                    'log_freq': 1,
                    'save_freq': 1000,
                    'warmup_steps': 1,
                    'scheduler_type': 'constant',
                    'min_lr': 0.0,
                    'grad_clip': 1.0,
                    'weight_decay': 0.0,
                    'pretrained_ckpt': None,
                    'resume_ckpt': None,
                    'use_swanlab': False,
                    'rollout_val_freq_epochs': 0,
                },
                'data': {
                    'camera_names': ['front'],
                },
                'agent': {
                    '_target_': 'fake.agent',
                },
            }
        )
        agent = _FakeAgent()
        dataloader_calls = []
        saved_checkpoints = []

        class _TinyDataset:
            def __len__(self):
                return 1

        def fake_instantiate(config_node, **_kwargs):
            if config_node is cfg.data:
                return _TinyDataset()
            if config_node is cfg.agent:
                return agent
            raise AssertionError(f'unexpected instantiate config: {config_node!r}')

        def fake_dataloader(dataset, *, drop_last, shuffle, **_kwargs):
            del _kwargs
            dataloader_calls.append(
                {
                    'shuffle': shuffle,
                    'drop_last': drop_last,
                    'dataset_len': len(dataset),
                }
            )
            loader_length = 0 if drop_last and len(dataset) < cfg.train.batch_size else 1
            return _FakeLoader(
                {
                    'observation.front': torch.zeros(1, 3, 2, 2),
                    'observation.state': torch.zeros(1, 4),
                    'action': torch.zeros(1, 2),
                    'action_is_pad': torch.zeros(1, 1, dtype=torch.bool),
                },
                length=loader_length,
            )

        def fake_torch_save(payload, path):
            saved_checkpoints.append((str(path), deepcopy(payload)))
            return None

        with tempfile.TemporaryDirectory() as tempdir:
            previous_cwd = os.getcwd()
            try:
                os.chdir(tempdir)
                with mock.patch.object(train_vla, 'instantiate', side_effect=fake_instantiate), \
                        mock.patch.object(train_vla, 'DataLoader', side_effect=fake_dataloader), \
                        mock.patch.object(train_vla, 'build_training_optimizer', return_value=_FakeOptimizer(cfg.train.lr)), \
                        mock.patch.object(train_vla, 'get_lr_schedule_with_warmup', return_value=_FakeScheduler()), \
                        mock.patch.object(train_vla, 'tqdm', side_effect=lambda iterable, **kwargs: _FakeProgressBar(iterable)), \
                        mock.patch.object(train_vla.torch, 'save', side_effect=fake_torch_save):
                    train_vla._run_training(cfg)
            finally:
                os.chdir(previous_cwd)

        self.assertEqual(
            dataloader_calls[0],
            {
                'shuffle': True,
                'drop_last': False,
                'dataset_len': 1,
            },
        )
        self.assertEqual(
            [path for path, _payload in saved_checkpoints],
            ['checkpoints/vla_model_final.pt'],
        )


if __name__ == '__main__':
    unittest.main()
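Taken together, the best-checkpoint tests in this file pin one selection rule: training loss decides "best" only until the first rollout metric exists, after which rollout `avg_reward` takes precedence. A sketch of one plausible reading of that rule, reconstructed from the assertions and not copied from `train_vla`:

def is_new_best(loss, best_loss, rollout_reward, best_rollout_reward):
    # Before any rollout has run, fall back to training loss.
    if best_rollout_reward is None and rollout_reward is None:
        return loss < best_loss
    # Once a rollout metric exists, loss improvements alone no longer qualify.
    if rollout_reward is None:
        return False
    return best_rollout_reward is None or rollout_reward > best_rollout_reward

assert is_new_best(0.4, 0.5, None, None) is True    # loss-based best, no rollouts yet
assert is_new_best(0.1, 0.5, None, 1.0) is False    # loss alone cannot displace a rollout best
assert is_new_best(0.9, 0.5, 2.0, 1.0) is True      # higher rollout reward wins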
Some files were not shown because too many files have changed in this diff.