# feat-lewm-imf-fusion Experiment Operations Guide

Applies to worktree: `/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion`

## 0. Start with the current go-to recipe

The training/validation recipe used most often on this branch; refer to it directly:

`experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/`

Core conventions:

- agent: `lewm_resnet_query_imf_attnres`
- from scratch: `train.pretrained_ckpt=null`, `agent.lewm_pretrained_ckpt=null`
- training: `batch_size=32`, `lr=1e-4`, `max_steps=109350`, `save_freq=10000`
- numeric validation: `train.val_split=0.0` + `train.val_episode_indices=[100]`
- held-out numeric validation: `train.action_mse_val_freq_epochs=1`
- rollout validation: `train.rollout_val_freq_epochs=5`, `train.rollout_num_episodes=10`
- SwanLab: `train.use_swanlab=true`, project = `roboimi-vla`

---

## 1. Branch layout and key files

| Path | Purpose |
| --- | --- |
| `roboimi/demos/vla_scripts/train_vla.py` | Main training entry point; handles datasets, checkpoints, numeric validation, in-training rollout validation, SwanLab |
| `roboimi/demos/vla_scripts/eval_vla.py` | One-off rollout / offline validation entry point; supports headless mode, summary, trajectory image/video artifacts |
| `roboimi/vla/conf/config.yaml` | Global Hydra config; all training defaults live here |
| `roboimi/vla/conf/eval/eval.yaml` | Eval defaults; `eval.ckpt_path`, `eval.num_episodes`, and the artifact switches live here |
| `roboimi/vla/conf/agent/lewm_resnet_query_imf_attnres.yaml` | The agent used most often on this branch; LeWM query fusion + IMF AttnRes head |
| `roboimi/vla/conf/backbone/lewm_resnet_query_fusion.yaml` | Config for the LeWM multi-view ResNet query-fusion backbone |
| `roboimi/vla/agent_imf.py` | `IMFVLAAgent` implementation; one-step IMF inference, LeWM loss, loading of LeWM pretrained components |
| `roboimi/vla/data/simpe_robot_dataset.py` | Lazy-loading HDF5 dataset; also handles `episode_indices` filtering |
| `roboimi/vla/scripts/calculate_stats.py` | Recomputes `dataset_stats.pkl` |
| `experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/` | The current go-to suite; manifest, notes, launch log, and local launch script all live here |

Additional notes:

- Run names on this branch typically look like `lewmimf-q08-ph08-ex08-emb384-l12-fromscratch-epoch50-step109350-5090g0-20260421-153037`
- Suffixes like `q08/ph16/ex08` map to `agent.lewm_query_offsets`, `agent.pred_horizon`, and `agent.num_action_steps` respectively

---

## 2. The three machines and their environments

| Machine | GPU | repo / worktree | Python | Usual dataset path |
| --- | --- | --- | --- | --- |
| Local `droid-z790eagleax` | 1× RTX 5090 32GB | `/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion` | `/home/droid/.conda/envs/roboimi/bin/python` | `/home/droid/project/diana_sim/sim_transfer` |
| 5880 node `100.73.14.65` | 2× RTX 5880 Ada 48GB | `/home/droid/roboimi_suite_20260416_lewm_imf_fusion` | `/home/droid/miniforge3/envs/roboimi/bin/python` | `/home/droid/sim_dataset/sim_transfer` |
| L20 node `100.119.99.14` | 8× NVIDIA L20 46GB | `/data/roboimi_suite_20260416_lewm_imf_fusion` | `/home/droid/miniforge3/envs/roboimi/bin/python` | `/data/simtransfer/current` |

Connecting:

- 5880: `ssh droid@100.73.14.65`
- L20: `ssh droid@100.119.99.14`

Rules of thumb:

- Local 5090: good for a single smoke run, small-scale main runs, and local tuning
- 5880: good for 2 main runs in parallel
- L20: good for large grids; keep both data and runs under `/data`

---

## 3. How the training flow works

The actual flow in `train_vla.py`:

1. Read the Hydra config and print the full cfg
2. Build the train/val datasets via `build_train_val_datasets()`
3. Build the train/val loaders with `DataLoader`
4. Load normalization statistics from `dataset_dir/dataset_stats.pkl`
5. Instantiate `IMFVLAAgent`
6. Optionally load:
   - `train.pretrained_ckpt`
   - `train.resume_ckpt`
   - `agent.lewm_pretrained_ckpt`
7. In the training loop, log train loss / lr every `log_freq` steps
8. Save `checkpoints/vla_model_step_*.pt` every `save_freq` steps
9. At the end of each epoch, as configured, run:
   - held-out action MSE
   - rollout validation
10. Finally write:
    - `checkpoints/vla_model_best.pt`
    - `checkpoints/vla_model_final.pt`

Current best-model selection logic (sketched at the end of this section):

- **Before the first rollout reward arrives**: pick the best by `val_loss` (falling back to train loss)
- **After the first rollout**: prefer `rollout_avg_reward` when picking the best

The output directory is usually pinned via `hydra.run.dir=...`; otherwise Hydra generates one itself.
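A minimal sketch of that selection priority, with hypothetical names (the helper and the `metrics` keys are illustrative, not the actual `train_vla.py` variables):

```python
# Hypothetical sketch of the best-model rule described above; not the
# actual train_vla.py implementation.
def is_new_best(metrics: dict, best: dict) -> bool:
    """Prefer rollout reward once one exists; fall back to loss before that."""
    if metrics.get("rollout_avg_reward") is not None:
        # After the first rollout: higher average reward wins.
        return metrics["rollout_avg_reward"] > best.get("rollout_avg_reward", float("-inf"))
    # Before any rollout: lower val_loss wins, with train loss as the fallback.
    loss = metrics.get("val_loss", metrics.get("train_loss"))
    prev = best.get("val_loss", best.get("train_loss", float("inf")))
    return loss is not None and loss < prev
```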
---

## 4. How the validation flow works

### 4.1 Held-out numeric validation

The current practice is not a random `val_split` cut; instead:

- `train.val_split=0.0`
- `train.val_episode_indices=[100]`
- `train.action_mse_val_freq_epochs=1`

With this, `compute_action_mse_validation()` runs on `episode_100.hdf5` at the end of every epoch. The log keys are:

- console / `train_vla.log`: `held-out action MSE`
- SwanLab: `val/action_mse`

### 4.2 Rollout validation

In-training rollout validation is triggered via `train_vla.py -> run_rollout_validation() -> eval_vla._run_eval()`.

The usual in-training rollout constraints on this branch:

- `train.rollout_val_freq_epochs=5`
- `train.rollout_num_episodes=10`
- `train.rollout_validate_on_checkpoint=false`
- headless forced on
- `verbose_action=false` forced
- `record_video=false` forced
- `save_trajectory_image=true` forced
- `trajectory_image_camera_name=front` forced
- `save_summary_json=true` forced

The rollout device / worker path has been fixed up to be **config-driven** (see the sketch at the end of this subsection):

- `train.rollout_device`: defaults to following `train.device`
- `train.rollout_num_workers`: defaults to `null`
  - when the rollout device is CPU, it degrades to `1`
  - when the rollout device is CUDA, it is inferred as `min(train.rollout_num_episodes, 8)`
- `train.rollout_cuda_devices`: defaults to `null`, equivalent to the currently visible logical GPU `[0]`
- `train.rollout_response_timeout_s`
- `train.rollout_server_startup_timeout_s`

So, as things stand:

- when training on `cuda`, **in-training rollouts run on GPU by default**
- when `rollout_num_workers > 1`, parallel rollout kicks in automatically
  - it can be **a single GPU with multiple workers sharing one inference server**
  - or **multiple GPUs with multiple servers splitting the workers**

In-training rollout artifacts land by default under:

`<run_dir>/rollout_artifacts/<...>/`

Typical files:

- `rollout_summary.json`
- `rollout_front_ep01_trajectory.png` ... `rollout_front_ep10_trajectory.png`

Key log lines to watch:

- `Epoch X rollout 平均奖励` (average rollout reward for the epoch)
- `最佳模型已更新` (best model updated)
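A minimal sketch of the device/worker defaulting above, assuming it resolves exactly as listed (the helper name is hypothetical, not actual `train_vla.py` code):

```python
# Hypothetical helper mirroring the documented rollout defaults.
def resolve_rollout_workers(rollout_device, train_device,
                            rollout_num_workers, rollout_num_episodes):
    device = rollout_device or train_device           # rollout_device follows train.device
    if rollout_num_workers is not None:
        return device, rollout_num_workers            # an explicit config value wins
    if str(device).startswith("cuda"):
        return device, min(rollout_num_episodes, 8)   # CUDA: min(episodes, 8)
    return device, 1                                  # CPU: degrade to one worker

# e.g. resolve_rollout_workers(None, "cuda", None, 10) -> ("cuda", 8)
```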
---

## 5. Dataset loading and the `val_episode_indices` mechanism

### 5.1 Dataset format

`SimpleRobotDataset` reads the `episode_*.hdf5` files under `dataset_dir`; each episode file must contain at least:

- `action`
- `observations/qpos`
- `observations/images/{cam_name}`

Cameras in common use:

- `r_vis`
- `top`
- `front`

### 5.2 Lazy-loading behavior

`roboimi/vla/data/simpe_robot_dataset.py` lazy-loads frame by frame; it never pulls the whole HDF5 set into memory at once.

It will:

- scan the directory for HDF5 files
- build `available_episode_indices` from the episode number in each filename (e.g. `episode_100.hdf5` -> `100`)
- keep an LRU cache of HDF5 file handles inside each worker

### 5.3 How `val_episode_indices` splits

The logic in `build_train_val_datasets()` (sketched below):

1. Instantiate the full dataset once
2. Read `dataset.available_episode_indices`
3. Check that every index in `train.val_episode_indices` exists
4. Instantiate once more per split with an explicit `episode_indices=<...>`:
   - train dataset = all episodes minus the held-out episodes
   - val dataset = only the held-out episodes

Therefore:

- `train.val_episode_indices=[100]` means "take the whole `episode_100.hdf5` as held-out val"
- if the episode does not exist, it errors out immediately
- if you put every episode into `val_episode_indices`, it also errors out immediately, because the train set would be empty
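A minimal sketch of that split, assuming the dataset exposes `available_episode_indices` as described; `make_dataset` stands in for the Hydra instantiate call and is hypothetical:

```python
# Hypothetical sketch of the split performed by build_train_val_datasets().
def split_episodes(available: list[int], val_episode_indices: list[int]):
    missing = [i for i in val_episode_indices if i not in available]
    if missing:
        raise ValueError(f"val episodes not found in dataset: {missing}")
    train_eps = [i for i in available if i not in val_episode_indices]
    if not train_eps:
        raise ValueError("all episodes assigned to validation; train set is empty")
    return train_eps, list(val_episode_indices)

# e.g. split_episodes(list(range(101)), [100]) -> ([0, ..., 99], [100])
# train_ds = make_dataset(episode_indices=train_eps)
# val_ds   = make_dataset(episode_indices=val_eps)
```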
### 5.4 Image resize and extra LeWM fields

The dataset-side resize shape comes from:

- `data.image_resize_shape`
- overridden, when the backbone sets it, by `agent.vision_backbone.dataset_image_resize_shape`

Besides the usual batch keys:

- `observation.state`
- `observation.<cam_name>`
- `action`

the batch also contains, when LeWM is enabled:

- `lewm.observation.state`
- `lewm.observation.<cam_name>`
- `lewm.future.state`
- `lewm.future.<cam_name>`

### 5.5 Statistics file

Both training and inference rely on `dataset_stats.pkl` by default. Recompute it after the dataset changes:

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/vla/scripts/calculate_stats.py \
  --dataset_dir /home/droid/project/diana_sim/sim_transfer
```

On the remote nodes, just swap `--dataset_dir` to the matching host path.

---

## 6. SwanLab behavior

The config default is `train.use_swanlab=false`, but the common recipes on this branch all turn it on explicitly:

- `train.use_swanlab=true`
- `train.swanlab_project=roboimi-vla`
- `train.swanlab_run_name=<run_name>`

SwanLab behavior in `train_vla.py`:

- on init, upload the `train` / `data` / `agent` config sections
- during training, log:
  - `train/loss`
  - `train/lr`
  - `train/best_loss`
  - `train/step`
- on checkpoint validation, log:
  - `val/loss`
- on held-out numeric validation, log:
  - `val/action_mse`
- on rollout validation, log:
  - `rollout/avg_reward`
  - `rollout/epoch`
- at the end of training, log:
  - `final/checkpoint_path`
  - `final/best_checkpoint_path`

The front-view trajectory PNGs produced by in-training rollouts are uploaded to SwanLab on a best-effort basis; a failure only produces a warning and never interrupts training.

---

## 7. Parallel rollout notes

### 7.1 Where this capability comes from

The parallel-rollout direction on this branch is not DataLoader parallelism but **the multiprocess rollout path in `eval_vla.py`**.

Reference source:

`/home/droid/project/roboimi/.worktrees/multiprocess-rollout/roboimi/demos/vla_scripts/eval_vla.py`

The controlling parameters on that path are:

- `eval.num_workers`
- `eval.cuda_devices`

Semantics:

- `eval.num_workers`: number of environment workers; episodes are split across them
- `eval.cuda_devices`: which logical GPUs the inference servers bind to

### 7.2 The two common modes

1. **Single machine, single GPU, multiple workers sharing one GPU**
   - typical: the local 5090 has only 1 GPU, but you want 4 rollout workers running environments in parallel
   - form: `eval.device=cuda eval.num_workers=4 'eval.cuda_devices=[0]'`
   - this gives **1 CUDA inference server + 4 env workers**
2. **Single machine, multiple GPUs, multiple servers splitting the workers**
   - typical: the 5880 has 2 GPUs, the L20 has more
   - form: `eval.device=cuda eval.num_workers=8 'eval.cuda_devices=[0,1]'`
   - workers are assigned to the servers round-robin

### 7.3 Operational caveats

- Parallel rollout relies on the **multiprocess eval path**, not on `train.num_workers`
- `train.num_workers` is the DataLoader worker count and has nothing to do with rollout parallelism
- `eval.num_workers > 1` requires `eval.headless=true`
- the worker count is automatically capped to `eval.num_episodes`
- multiprocess rollout now supports **per-episode trajectory image PNGs**; with multiple workers, each worker writes images into its own artifact subdirectory, and the summary carries the corresponding paths back
- with multiple workers, still do not request simultaneously:
  - `eval.record_video=true`
  - `eval.save_trajectory=true`
  - `eval.save_trajectory_npz=true`
- `eval.save_trajectory_image=true` is now safe to enable, and pairs well with a parallel reward run for a qualitative check

The worker-to-server mapping implied by the two modes above is sketched below.
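A minimal sketch of the round-robin assignment and the episode cap from 7.2/7.3, with hypothetical names (the real `eval_vla.py` structures may differ):

```python
# Hypothetical sketch of worker -> server assignment on the multiprocess
# eval path: one inference server per listed logical GPU.
def plan_rollout(num_episodes: int, num_workers: int, cuda_devices: list[int]):
    workers = min(num_workers, num_episodes)       # cap workers to episodes
    servers = {dev: [] for dev in cuda_devices}    # one server per device
    for w in range(workers):
        dev = cuda_devices[w % len(cuda_devices)]  # round-robin assignment
        servers[dev].append(w)
    return servers

# e.g. plan_rollout(10, 8, [0, 1]) -> {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```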
### 7.4 Parallel rollout command templates

**5090, single GPU, 4 workers:**

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/eval_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  train.device=cuda eval.device=cuda eval.headless=true eval.verbose_action=false \
  eval.ckpt_path=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/<run_name>/checkpoints/vla_model_best.pt \
  eval.num_episodes=10 eval.num_workers=4 'eval.cuda_devices=[0]' \
  eval.save_summary_json=true eval.artifact_dir=/tmp/lewm_parallel_eval_5090
```

**5880, dual GPU, 8 workers:**

```bash
/home/droid/miniforge3/envs/roboimi/bin/python roboimi/demos/vla_scripts/eval_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/home/droid/sim_dataset/sim_transfer \
  train.device=cuda eval.device=cuda eval.headless=true eval.verbose_action=false \
  eval.ckpt_path=/home/droid/roboimi_suite_20260416_lewm_imf_fusion/runs/<run_name>/checkpoints/vla_model_best.pt \
  eval.num_episodes=10 eval.num_workers=8 'eval.cuda_devices=[0,1]' \
  eval.save_summary_json=true eval.artifact_dir=/tmp/lewm_parallel_eval_5880
```

---

## 8. Current go-to commands / scripts

### 8.1 Local 5090: use the suite script directly

Ready-made script:

`experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/launch_local_5090.sh`

Run:

```bash
bash experiment_suites/2026-04-21-lewm-fromscratch-old9-epoch50-roll5-val-20260421-153037/launch_local_5090.sh
```

### 8.2 Local 5090: launch the same recipe manually

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  'agent.lewm_query_offsets=[8]' \
  agent.pred_horizon=8 \
  agent.num_action_steps=8 \
  train.device=cuda \
  train.batch_size=32 \
  train.lr=0.0001 \
  train.max_steps=109350 \
  train.num_workers=4 \
  train.save_freq=10000 \
  train.rollout_validate_on_checkpoint=false \
  train.rollout_val_freq_epochs=5 \
  train.rollout_num_episodes=10 \
  train.val_split=0.0 \
  'train.val_episode_indices=[100]' \
  train.action_mse_val_freq_epochs=1 \
  train.use_swanlab=true \
  train.swanlab_project=roboimi-vla \
  train.swanlab_run_name=lewmimf-q08-ph08-ex08-emb384-l12-fromscratch-epoch50-step109350-5090g0-20260421-153037 \
  train.pretrained_ckpt=null \
  agent.lewm_pretrained_ckpt=null \
  hydra.run.dir=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/lewmimf-q08-ph08-ex08-emb384-l12-fromscratch-epoch50-step109350-5090g0-20260421-153037
```

### 8.3 5880: command template

```bash
ssh droid@100.73.14.65
cd /home/droid/roboimi_suite_20260416_lewm_imf_fusion

/home/droid/miniforge3/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/home/droid/sim_dataset/sim_transfer \
  'agent.lewm_query_offsets=[8]' \
  agent.pred_horizon=16 \
  agent.num_action_steps=8 \
  train.device=cuda train.batch_size=32 train.lr=0.0001 train.max_steps=109350 \
  train.num_workers=4 train.save_freq=10000 train.rollout_validate_on_checkpoint=false \
  train.rollout_val_freq_epochs=5 train.rollout_num_episodes=10 train.val_split=0.0 \
  'train.val_episode_indices=[100]' train.action_mse_val_freq_epochs=1 \
  train.use_swanlab=true train.swanlab_project=roboimi-vla \
  train.swanlab_run_name=lewmimf-q08-ph16-ex08-emb384-l12-fromscratch-epoch50-step109350-5880g0-20260421-153037 \
  train.pretrained_ckpt=null agent.lewm_pretrained_ckpt=null \
  hydra.run.dir=/home/droid/roboimi_suite_20260416_lewm_imf_fusion/runs/lewmimf-q08-ph16-ex08-emb384-l12-fromscratch-epoch50-step109350-5880g0-20260421-153037
```

### 8.4 L20: command template

```bash
ssh droid@100.119.99.14
cd /data/roboimi_suite_20260416_lewm_imf_fusion

/home/droid/miniforge3/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/data/simtransfer/current \
  'agent.lewm_query_offsets=[16]' \
  agent.pred_horizon=16 \
  agent.num_action_steps=16 \
  train.device=cuda train.batch_size=32 train.lr=0.0001 train.max_steps=109350 \
  train.num_workers=4 train.save_freq=10000 train.rollout_validate_on_checkpoint=false \
  train.rollout_val_freq_epochs=5 train.rollout_num_episodes=10 train.val_split=0.0 \
  'train.val_episode_indices=[100]' train.action_mse_val_freq_epochs=1 \
  train.use_swanlab=true train.swanlab_project=roboimi-vla \
  train.swanlab_run_name=lewmimf-q16-ph16-ex16-emb384-l12-fromscratch-epoch50-step109350-l20g0-20260421-153037 \
  train.pretrained_ckpt=null agent.lewm_pretrained_ckpt=null \
  hydra.run.dir=/data/roboimi_suite_20260416_lewm_imf_fusion/runs/lewmimf-q16-ph16-ex16-emb384-l12-fromscratch-epoch50-step109350-l20g0-20260421-153037
```

### 8.5 One-off offline validation (this branch already supports parallel)

**Single GPU / 4 workers:**

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/eval_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  train.device=cuda eval.device=cuda \
  eval.ckpt_path=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/<run_name>/checkpoints/vla_model_best.pt \
  eval.num_episodes=10 eval.num_workers=4 'eval.cuda_devices=[0]' \
  eval.headless=true eval.verbose_action=false \
  eval.save_summary_json=true eval.save_trajectory_image=true \
  eval.trajectory_image_camera_name=front \
  eval.artifact_dir=/tmp/lewm_eval_front
```

**Enable parallel GPU rollout inside training (recommended: spell it out explicitly):**

```bash
/home/droid/.conda/envs/roboimi/bin/python roboimi/demos/vla_scripts/train_vla.py \
  agent=lewm_resnet_query_imf_attnres \
  data.dataset_dir=/home/droid/project/diana_sim/sim_transfer \
  'agent.lewm_query_offsets=[8]' \
  agent.pred_horizon=8 \
  agent.num_action_steps=8 \
  train.device=cuda \
  train.batch_size=32 \
  train.lr=0.0001 \
  train.max_steps=109350 \
  train.num_workers=4 \
  train.save_freq=10000 \
  train.rollout_val_freq_epochs=5 \
  train.rollout_num_episodes=10 \
  train.rollout_device=cuda \
  train.rollout_num_workers=4 \
  'train.rollout_cuda_devices=[0]' \
  train.rollout_validate_on_checkpoint=false \
  train.val_split=0.0 \
  'train.val_episode_indices=[100]' \
  train.action_mse_val_freq_epochs=1 \
  train.use_swanlab=true \
  train.swanlab_project=roboimi-vla \
  train.swanlab_run_name=<run_name> \
  hydra.run.dir=/home/droid/project/roboimi/.worktrees/feat-lewm-imf-fusion/runs/<run_name>
```

### 8.6 Monitoring logs

```bash
tail -f runs/<run_name>/launch.stdout.log
tail -f runs/<run_name>/train_vla.log
```

On the remote nodes, replace `runs/<run_name>` with the absolute path from the manifest.

---

## 9. Operational advice

- **Treat the suite's `manifest.json` / `notes.md` / `launch_logs/*.launch.log` as the source of truth**; do not hand-write a command set that diverges from the historical runs
- For the current standard validation, add explicitly:
  - `train.val_split=0.0`
  - `train.val_episode_indices=[100]`
  - `train.action_mse_val_freq_epochs=1`
  - `train.rollout_val_freq_epochs=5`
  - `train.rollout_num_episodes=10`
- When comparing different horizons / action steps on this branch, change only:
  - `agent.lewm_query_offsets`
  - `agent.pred_horizon`
  - `agent.num_action_steps`
- To reproduce the 2026-04-21 from-scratch round, remember to set both:
  - `train.pretrained_ckpt=null`
  - `agent.lewm_pretrained_ckpt=null`
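Finally, a quick pre-flight sketch before launching with the standard recipe, using the local 5090 paths; this is a hypothetical convenience script, not part of the repo:

```python
# Hypothetical pre-flight check: dataset_stats.pkl must exist (training and
# inference both rely on it, see 5.5), and episode_100.hdf5 must be present
# because train.val_episode_indices=[100] errors out otherwise (see 5.3).
from pathlib import Path

dataset_dir = Path("/home/droid/project/diana_sim/sim_transfer")

assert (dataset_dir / "dataset_stats.pkl").exists(), \
    "missing dataset_stats.pkl; rerun roboimi/vla/scripts/calculate_stats.py"
assert (dataset_dir / "episode_100.hdf5").exists(), \
    "missing episode_100.hdf5; train.val_episode_indices=[100] would fail"

episodes = sorted(dataset_dir.glob("episode_*.hdf5"))
print(f"{len(episodes)} episode files found under {dataset_dir}")
```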