# feat-lewm-imf-fusion Experiment Guide
Applies to worktree `/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion`.
## 0. Start from the current go-to recipe
The most frequently used train/validate recipe on this branch lives at:
`experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/`
Core conventions:
- agent: `lewm_resnet_query_imf_attnres`
- from scratch: `train.pretrained_ckpt=null` and `agent.lewm_pretrained_ckpt=null`
- training: `batch_size=32`, `lr=1e-4`, `max_steps=109350`, `save_freq=10000`
- numeric validation: `train.val_split=0.0` + `train.val_episode_indices=[100]`
- held-out numeric validation: `train.action_mse_val_freq_epochs=1`
- rollout validation: `train.rollout_val_freq_epochs=5`, `train.rollout_num_episodes=10`
- SwanLab: `train.use_swanlab=true`, project=`roboimi-vla`
---
## 1. Branch layout and key files
| Path | Purpose |
| --- | --- |
| `roboimi/demos/vla_scripts/train_vla.py` | Main training entry point: datasets, checkpoints, numeric validation, in-training rollout validation, SwanLab |
| `roboimi/demos/vla_scripts/eval_vla.py` | Single rollout / offline validation entry point; supports headless mode, summaries, trajectory image/video artifacts |
| `roboimi/vla/conf/config.yaml` | Global Hydra config; all training defaults live here |
| `roboimi/vla/conf/eval/eval.yaml` | Eval defaults; `eval.ckpt_path`, `eval.num_episodes`, and the artifact switches live here |
| `roboimi/vla/conf/agent/lewm_resnet_query_imf_attnres.yaml` | The branch's go-to agent: LeWM query fusion + IMF AttnRes head |
| `roboimi/vla/conf/backbone/lewm_resnet_query_fusion.yaml` | LeWM multi-view ResNet query fusion backbone config |
| `roboimi/vla/agent_imf.py` | `IMFVLAAgent` implementation: one-step IMF inference, LeWM loss, loading of LeWM pretrained components |
| `roboimi/vla/data/simpe_robot_dataset.py` | Lazy-loading HDF5 dataset; also handles `episode_indices` filtering |
| `roboimi/vla/scripts/calculate_stats.py` | Recomputes `dataset_stats.pkl` |
| `experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/` | The current go-to suite: manifest, notes, launch log, and local launch script all live here |
Notes:
- Run names on this branch typically look like `lewmimf-q08-ph08-ex08-emb384-l12-fromscratch-epoch50-step109350-5090g0-20260421-153037`
- Suffixes like `q08/ph16/ex08` correspond to `agent.lewm_query_offsets`, `agent.pred_horizon`, and `agent.num_action_steps` respectively
---
## 2. The three machines and their environments
| Machine | GPU | repo / worktree | Python | Usual dataset path |
| --- | --- | --- | --- | --- |
| Local `droid-z790eagleax` | 1× RTX 5090 32GB | `/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion` | `/home/droid/.conda/envs/roboimi/bin/python` | `/home/droid/project/diana_sim/sim_transfer` |
| 5880 node `100.73.14.65` | 2× RTX 5880 Ada 48GB | `/home/droid/roboimi_suite_20260416_lewm_imf_fusion` | `/home/droid/miniforge3/envs/roboimi/bin/python` | `/home/droid/sim_dataset/sim_transfer` |
| L20 node `100.119.99.14` | 8× NVIDIA L20 46GB | `/data/roboimi_suite_20260416_lewm_imf_fusion` | `/home/droid/miniforge3/envs/roboimi/bin/python` | `/data/simtransfer/current` |
Connecting:
- 5880: `ssh droid@100.73.14.65`
- L20: `ssh droid@100.119.99.14`
Rules of thumb:
- Local 5090: good for a single smoke run, a small-scale main run, or local tuning
- 5880: good for 2 parallel main runs
- L20: good for large grids; keep both data and runs under `/data`
---
## 3. How the training flow works
What `train_vla.py` actually does:
1. Loads the Hydra config and prints the full cfg
2. Builds the train/val datasets via `build_train_val_datasets()`
3. Wraps them in train/val `DataLoader`s
4. Reads normalization statistics from `dataset_dir/dataset_stats.pkl`
5. Instantiates `IMFVLAAgent`
6. Optionally loads:
   - `train.pretrained_ckpt`
   - `train.resume_ckpt`
   - `agent.lewm_pretrained_ckpt`
7. Logs train loss / lr every `log_freq` steps in the training loop
8. Saves `checkpoints/vla_model_step_*.pt` every `save_freq` steps
9. At the end of each epoch, depending on config, runs:
   - held-out action MSE
   - rollout validation
10. At the end, writes:
    - `checkpoints/vla_model_best.pt`
    - `checkpoints/vla_model_final.pt`
Current best-model selection logic:
- **Before the first rollout reward arrives**: best is picked by `val_loss` (falling back to train loss)
- **After the first rollout**: best is picked primarily by `rollout_avg_reward`
The output directory is usually pinned via `hydra.run.dir=...`; otherwise Hydra generates one itself.
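The two-phase best-model selection above can be sketched as follows. This is an illustrative stand-in, not the actual `train_vla.py` code; the helper name `should_update_best` and the `state` dict are assumptions.

```python
def should_update_best(state, val_loss=None, rollout_avg_reward=None):
    """Mutates `state`; returns True when the best checkpoint should be rewritten.

    Phase 1 (no rollout reward seen yet): lower val_loss wins.
    Phase 2 (any rollout reward seen): only higher rollout_avg_reward wins.
    """
    if rollout_avg_reward is not None:
        if state.get("best_reward") is None or rollout_avg_reward > state["best_reward"]:
            state["best_reward"] = rollout_avg_reward
            return True
        return False
    if state.get("best_reward") is not None:
        return False  # reward regime is active; loss-only checkpoints no longer win
    if val_loss is not None and (state.get("best_loss") is None or val_loss < state["best_loss"]):
        state["best_loss"] = val_loss
        return True
    return False
```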
---
## 4. How the validation flow works
### 4.1 Held-out numeric validation
The current practice is not to randomly cut a `val_split`, but to set:
- `train.val_split=0.0`
- `train.val_episode_indices=[100]`
- `train.action_mse_val_freq_epochs=1`
With this, `compute_action_mse_validation()` runs on `episode_100.hdf5` at the end of every epoch. The log keys are:
- console / `train_vla.log`: `held-out action MSE`
- SwanLab: `val/action_mse`
### 4.2 Rollout validation
In-training rollout validation is triggered via `train_vla.py -> run_rollout_validation() -> eval_vla._run_eval()`.
The usual in-training rollout constraints on this branch are:
- `train.rollout_val_freq_epochs=5`
- `train.rollout_num_episodes=10`
- `train.rollout_validate_on_checkpoint=false`
- headless forced on
- `verbose_action=false` forced
- `record_video=false` forced
- `save_trajectory_image=true` forced
- `trajectory_image_camera_name=front` forced
- `save_summary_json=true` forced
The rollout device / worker path is now **config-driven**:
- `train.rollout_device`: defaults to following `train.device`
- `train.rollout_num_workers`: defaults to `null`
  - when the rollout device is CPU, it degrades to `1`
  - when the rollout device is CUDA, it is inferred as `min(train.rollout_num_episodes, 8)`
- `train.rollout_cuda_devices`: defaults to `null`, equivalent to the currently visible logical GPU `[0]`
- `train.rollout_response_timeout_s`
- `train.rollout_server_startup_timeout_s`
So as things stand:
- when training runs on `cuda`, **in-training rollout defaults to the GPU**
- when `rollout_num_workers > 1`, parallel rollout kicks in automatically
  - this can be **multiple workers on a single GPU sharing one inference server**
  - or **multiple GPUs, each with its own server, splitting the workers**
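The worker-count defaults above can be sketched like this. It is a hypothetical helper for illustration only, not the actual `train_vla.py` resolution code.

```python
def resolve_rollout_num_workers(rollout_device, rollout_num_workers, rollout_num_episodes):
    """Resolve the effective rollout worker count from the config defaults."""
    if rollout_num_workers is not None:
        return rollout_num_workers            # explicit config always wins
    if rollout_device.startswith("cuda"):
        return min(rollout_num_episodes, 8)   # CUDA: parallelize, capped at 8
    return 1                                  # CPU: single worker
```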
In-training rollout artifacts land in:
`<hydra.run.dir>/rollout_artifacts/<checkpoint_stem>/`
Typical files:
- `rollout_summary.json`
- `rollout_front_ep01_trajectory.png` ... `rollout_front_ep10_trajectory.png`
In the logs, watch for:
- `Epoch X rollout 平均奖励` (average reward)
- `最佳模型已更新` (best model updated)
---
## 5. Dataset loading and the `val_episode_indices` mechanism
### 5.1 Dataset format
`SimpleRobotDataset` reads the `episode_*.hdf5` files under `dataset_dir`; each episode file must contain at least:
- `action`
- `observations/qpos`
- `observations/images/{cam_name}`
Currently used cameras:
- `r_vis`
- `top`
- `front`
### 5.2 Lazy-loading behavior
`roboimi/vla/data/simpe_robot_dataset.py` lazy-loads per frame; it never reads the whole HDF5 set into memory at once.
It:
- scans the directory for HDF5 files
- builds `available_episode_indices` from the episode number in each filename (e.g. `episode_100.hdf5` -> `100`)
- keeps an LRU cache of HDF5 file handles inside each worker
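The filename-to-index scan can be sketched as below, assuming the `episode_<N>.hdf5` layout described above. The helper name is hypothetical; the real dataset class does this internally.

```python
import re
from pathlib import Path

def available_episode_indices(dataset_dir):
    """Collect episode numbers from episode_<N>.hdf5 filenames, sorted ascending."""
    pattern = re.compile(r"episode_(\d+)\.hdf5$")
    indices = []
    for path in Path(dataset_dir).glob("episode_*.hdf5"):
        m = pattern.search(path.name)
        if m:
            indices.append(int(m.group(1)))
    return sorted(indices)
```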
### 5.3 How `val_episode_indices` splits
The logic in `build_train_val_datasets()`:
1. Instantiate the full dataset once
2. Read `dataset.available_episode_indices`
3. Check that every entry in `train.val_episode_indices` exists
4. Instantiate twice more with `episode_indices=`:
   - train dataset = all episodes minus the held-out episodes
   - val dataset = only the held-out episodes
Consequently:
- `train.val_episode_indices=[100]` means "use all of `episode_100.hdf5` as the held-out val set"
- if the episode does not exist, it errors out immediately
- if you put every episode into `val_episode_indices`, it also errors out, because the train set would be empty
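The split and its two failure modes can be sketched as follows. This is a simplified stand-in: the real `build_train_val_datasets()` instantiates datasets, not bare index lists, and its error messages differ.

```python
def split_episode_indices(available, val_episode_indices):
    """Split episode indices into (train, val); raise on missing or all-val."""
    missing = [i for i in val_episode_indices if i not in available]
    if missing:
        raise ValueError(f"val episodes not found in dataset: {missing}")
    train = [i for i in available if i not in val_episode_indices]
    if not train:
        raise ValueError("val_episode_indices covers every episode; train set would be empty")
    return train, list(val_episode_indices)
```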
### 5.4 Image resize and extra LeWM fields
The dataset-side resize shape comes from:
- `data.image_resize_shape`
- `agent.vision_backbone.dataset_image_resize_shape` takes precedence when the backbone overrides it
Each batch contains the usual keys:
- `observation.state`
- `observation.<cam>`
- `action`
and, when LeWM is enabled, additionally:
- `lewm.observation.state`
- `lewm.observation.<cam>`
- `lewm.future.state`
- `lewm.future.<cam>`
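The resize-shape precedence above amounts to a one-line fallback. A minimal sketch, assuming plain dict-shaped configs (the real code reads Hydra/OmegaConf objects):

```python
def resolve_resize_shape(data_cfg, backbone_cfg):
    """Backbone override wins; otherwise fall back to the data-level default."""
    override = backbone_cfg.get("dataset_image_resize_shape")
    return override if override is not None else data_cfg["image_resize_shape"]
```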
### 5.5 Statistics file
Both training and inference rely on `dataset_stats.pkl` by default. Recompute it after the dataset changes:
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/vla/scripts/calculate_stats.py \
  --dataset_dir /home/droid/project/diana_sim/sim_transfer
```
On the remote nodes, just swap `--dataset_dir` for the host's path.
---
## 6. SwanLab behavior
The config default is `train.use_swanlab=false`, but the go-to recipes on this branch almost always enable it explicitly:
- `train.use_swanlab=true`
- `train.swanlab_project=roboimi-vla`
- `train.swanlab_run_name=<run_name>`
SwanLab behavior in `train_vla.py`:
- on init, uploads the `train` / `data` / `agent` config sections
- during training, logs:
  - `train/loss`
  - `train/lr`
  - `train/best_loss`
  - `train/step`
- on checkpoint validation, logs:
  - `val/loss`
- on held-out numeric validation, logs:
  - `val/action_mse`
- on rollout validation, logs:
  - `rollout/avg_reward`
  - `rollout/epoch`
- at the end of training, logs:
  - `final/checkpoint_path`
  - `final/best_checkpoint_path`
Front-view trajectory PNGs from in-training rollouts are uploaded to SwanLab best-effort: a failure only produces a warning and never interrupts training.
---
## 7. Parallel rollout notes
### 7.1 Where this capability comes from
Parallel rollout on this branch is not DataLoader parallelism; it is the **multiprocess rollout path in `eval_vla.py`**.
Reference source:
`/home/droid/project/roboimi/.worktrees/multiprocess-rollout/roboimi/demos/vla_scripts/eval_vla.py`
That path is controlled by:
- `eval.num_workers`
- `eval.cuda_devices`
Semantics:
- `eval.num_workers`: number of environment workers; episodes are split across them
- `eval.cuda_devices`: which logical GPUs the inference servers bind to
### 7.2 The two common modes
1. **Single machine, single GPU, multiple workers sharing it**
   - Typical: the local 5090 has only 1 GPU, but you want 4 rollout workers stepping environments in parallel
   - Form: `eval.device=cuda eval.num_workers=4 'eval.cuda_devices=[0]'`
   - This gives **1 CUDA inference server + 4 env workers**
2. **Single machine, multiple GPUs, multiple servers splitting the workers**
   - Typical: the 5880 node has 2 GPUs; the L20 node has more
   - Form: `eval.device=cuda eval.num_workers=8 'eval.cuda_devices=[0,1]'`
   - Workers are assigned to the servers round-robin
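The round-robin assignment, together with the cap to `eval.num_episodes`, can be sketched like this. A hypothetical helper for illustration; not the actual `eval_vla.py` scheduling code.

```python
def assign_workers(num_workers, num_episodes, cuda_devices):
    """Map worker id -> logical GPU of its inference server, round-robin."""
    num_workers = min(num_workers, num_episodes)  # never more workers than episodes
    return {w: cuda_devices[w % len(cuda_devices)] for w in range(num_workers)}
```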
### 7.3 Operational caveats
- Parallel rollout relies on the **multiprocess eval path**, not `train.num_workers`
- `train.num_workers` is the DataLoader worker count and has nothing to do with rollout parallelism
- `eval.headless=true` is mandatory whenever `eval.num_workers > 1`
- The worker count is automatically capped at `eval.num_episodes`
- Multiprocess rollout now supports **per-episode trajectory image PNGs**; with multiple workers, each worker writes its images under its own artifact subdirectory, and the summary carries the corresponding paths back
- With multiple workers, still do not request these at the same time:
  - `eval.record_video=true`
  - `eval.save_trajectory=true`
  - `eval.save_trajectory_npz=true`
- `eval.save_trajectory_image=true` is now safe to enable; it suits doing parallel reward evaluation and qualitative checks in one pass
### 7.4 Parallel rollout command templates
**5090, single GPU, 4 workers**
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/eval_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
train.device=cuda eval.device=cuda eval.headless=true eval.verbose_action=false \
eval.ckpt_path=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/<run_name>/checkpoints/vla_model_best.pt \
eval.num_episodes=10 eval.num_workers=4 'eval.cuda_devices=[0]' \
eval.save_summary_json=true eval.artifact_dir=/tmp/lewm_parallel_eval_5090
```
**5880, dual GPU, 8 workers**
```bash
/home/droid/miniforge3/envs/roboimi/bin/python roboimi/demos/vla_scripts/eval_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/home/droid/sim_dataset/sim_transfer \
train.device=cuda eval.device=cuda eval.headless=true eval.verbose_action=false \
eval.ckpt_path=/home/droid/roboimi_suite_20260416_lewm_imf_fusion/runs/<run_name>/checkpoints/vla_model_best.pt \
eval.num_episodes=10 eval.num_workers=8 'eval.cuda_devices=[0,1]' \
eval.save_summary_json=true eval.artifact_dir=/tmp/lewm_parallel_eval_5880
```
---
## 8. Current go-to commands / scripts
### 8.1 Local 5090: use the suite script directly
Ready-made script:
`experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/launch_local_5090.sh`
Run it:
```bash
bash experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/launch_local_5090.sh
```
### 8.2 Local 5090: launch the same recipe manually
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
'agent.lewm_query_offsets=[8]' \
agent.pred_horizon=8 \
agent.num_action_steps=8 \
train.device=cuda \
train.batch_size=32 \
train.lr=0.0001 \
train.max_steps=109350 \
train.num_workers=4 \
train.save_freq=10000 \
train.rollout_validate_on_checkpoint=false \
train.rollout_val_freq_epochs=5 \
train.rollout_num_episodes=10 \
train.val_split=0.0 \
'train.val_episode_indices=[100]' \
train.action_mse_val_freq_epochs=1 \
train.use_swanlab=true \
train.swanlab_project=roboimi-vla \
train.swanlab_run_name=lewmimf-q08-ph08-ex08-emb384-l12-fromscratch-epoch50-step109350-5090g0-20260421-153037 \
train.pretrained_ckpt=null \
agent.lewm_pretrained_ckpt=null \
hydra.run.dir=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/lewmimf-q08-ph08-ex08-emb384-l12-fromscratch-epoch50-step109350-5090g0-20260421-153037
```
### 8.3 5880: command template
```bash
ssh droid@100.73.14.65
cd /home/droid/roboimi_suite_20260416_lewm_imf_fusion
/home/droid/miniforge3/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/home/droid/sim_dataset/sim_transfer \
'agent.lewm_query_offsets=[8]' \
agent.pred_horizon=16 \
agent.num_action_steps=8 \
train.device=cuda train.batch_size=32 train.lr=0.0001 train.max_steps=109350 \
train.num_workers=4 train.save_freq=10000 train.rollout_validate_on_checkpoint=false \
train.rollout_val_freq_epochs=5 train.rollout_num_episodes=10 train.val_split=0.0 \
'train.val_episode_indices=[100]' train.action_mse_val_freq_epochs=1 \
train.use_swanlab=true train.swanlab_project=roboimi-vla \
train.swanlab_run_name=lewmimf-q08-ph16-ex08-emb384-l12-fromscratch-epoch50-step109350-5880g0-20260421-153037 \
train.pretrained_ckpt=null agent.lewm_pretrained_ckpt=null \
hydra.run.dir=/home/droid/roboimi_suite_20260416_lewm_imf_fusion/runs/lewmimf-q08-ph16-ex08-emb384-l12-fromscratch-epoch50-step109350-5880g0-20260421-153037
```
### 8.4 L20: command template
```bash
ssh droid@100.119.99.14
cd /data/roboimi_suite_20260416_lewm_imf_fusion
/home/droid/miniforge3/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/data/simtransfer/current \
'agent.lewm_query_offsets=[16]' \
agent.pred_horizon=16 \
agent.num_action_steps=16 \
train.device=cuda train.batch_size=32 train.lr=0.0001 train.max_steps=109350 \
train.num_workers=4 train.save_freq=10000 train.rollout_validate_on_checkpoint=false \
train.rollout_val_freq_epochs=5 train.rollout_num_episodes=10 train.val_split=0.0 \
'train.val_episode_indices=[100]' train.action_mse_val_freq_epochs=1 \
train.use_swanlab=true train.swanlab_project=roboimi-vla \
train.swanlab_run_name=lewmimf-q16-ph16-ex16-emb384-l12-fromscratch-epoch50-step109350-l20g0-20260421-153037 \
train.pretrained_ckpt=null agent.lewm_pretrained_ckpt=null \
hydra.run.dir=/data/roboimi_suite_20260416_lewm_imf_fusion/runs/lewmimf-q16-ph16-ex16-emb384-l12-fromscratch-epoch50-step109350-l20g0-20260421-153037
```
### 8.5 One-off offline validation (this branch supports parallelism)
**Single GPU / 4 workers**
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/eval_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
train.device=cuda eval.device=cuda \
eval.ckpt_path=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/<run_name>/checkpoints/vla_model_best.pt \
eval.num_episodes=10 eval.num_workers=4 'eval.cuda_devices=[0]' \
eval.headless=true eval.verbose_action=false \
eval.save_summary_json=true eval.save_trajectory_image=true \
eval.trajectory_image_camera_name=front \
eval.artifact_dir=/tmp/lewm_eval_front
```
**Enabling parallel GPU rollout during training (writing it out explicitly is recommended)**
```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
agent=lewm_resnet_query_imf_attnres \
data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
'agent.lewm_query_offsets=[8]' \
agent.pred_horizon=8 \
agent.num_action_steps=8 \
train.device=cuda \
train.batch_size=32 \
train.lr=0.0001 \
train.max_steps=109350 \
train.num_workers=4 \
train.save_freq=10000 \
train.rollout_val_freq_epochs=5 \
train.rollout_num_episodes=10 \
train.rollout_device=cuda \
train.rollout_num_workers=4 \
'train.rollout_cuda_devices=[0]' \
train.rollout_validate_on_checkpoint=false \
train.val_split=0.0 \
'train.val_episode_indices=[100]' \
train.action_mse_val_freq_epochs=1 \
train.use_swanlab=true \
train.swanlab_project=roboimi-vla \
train.swanlab_run_name=<run_name> \
hydra.run.dir=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/<run_name>
```
### 8.6 Monitoring logs
```bash
tail -f runs/<run_name>/launch.stdout.log
tail -f runs/<run_name>/train_vla.log
```
On the remote nodes, replace `runs/<run_name>` with the absolute path from the manifest.
---
## 9. Operational advice
- **Treat the suite's `manifest.json` / `notes.md` / `launch_logs/*.launch.log` as the source of truth**; do not hand-write a set of commands that diverges from the historical runs
- For the current standard validation, explicitly add:
  - `train.val_split=0.0`
  - `train.val_episode_indices=[100]`
  - `train.action_mse_val_freq_epochs=1`
  - `train.rollout_val_freq_epochs=5`
  - `train.rollout_num_episodes=10`
- When comparing horizons / action steps on this branch, change only:
  - `agent.lewm_query_offsets`
  - `agent.pred_horizon`
  - `agent.num_action_steps`
- To reproduce the 2026-04-21 from-scratch results, remember to also set:
  - `train.pretrained_ckpt=null`
  - `agent.lewm_pretrained_ckpt=null`