feat: add rollout trajectory image artifacts and swanlab logging

2026-04-03 09:39:16 +08:00
parent 48f0eb8dd0
commit 0586a6e6c7
8 changed files with 626 additions and 21 deletions
@@ -0,0 +1,79 @@
+# IMF Rollout Trajectory Images and Short-Horizon Training Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Export a front-view trajectory image for each rollout episode during training and log those images to SwanLab, then start a new local IMF training run with `emb=384`, `layer=12`, `pred_horizon=8`, `num_action_steps=4`, `max_steps=50000`.
+
+**Architecture:** Extend `eval_vla.py` so a rollout can emit one static front-view image per episode with the end-effector (EE) trajectory overlaid in red. Extend `train_vla.py` so rollout validation forces image export on, forces video off, and uploads those per-episode images to SwanLab. Launch the requested new run through explicit command-line overrides rather than by changing branch-default config.
+
+**Tech Stack:** Python, PyTorch, Hydra/OmegaConf, MuJoCo, OpenCV, SwanLab.
+
+---
+
+### Task 1: Add and validate rollout image tests
+
+**Files:**
+- Modify: `tests/test_eval_vla_rollout_artifacts.py`
+- Modify: `tests/test_train_vla_swanlab_logging.py`
+- Modify: `tests/test_train_vla_rollout_validation.py`
+
+- [ ] Add/adjust eval tests so they assert per-episode trajectory image paths are produced without requiring video export.
+- [ ] Add/adjust training tests so they assert training-time rollout validation forces `record_video=false`.
+- [ ] Add/adjust training tests so they assert trajectory image paths flow from eval summary into SwanLab media logging.
+- [ ] Add/adjust training tests so they assert image media is logged, not only scalar reward metrics.
+
+### Task 2: Implement per-episode front trajectory image export in eval
+
+**Files:**
+- Modify: `roboimi/demos/vla_scripts/eval_vla.py`
+- Reuse/Read: `roboimi/utils/raw_action_trajectory_viewer.py`
+- Modify: `roboimi/vla/conf/eval/eval.yaml`
+
+- [ ] Add config plumbing for `save_trajectory_image` and `trajectory_image_camera_name`.
+- [ ] Ensure the default training-time camera-name resolution path is pinned to `front`.
+- [ ] Implement distinct per-episode image naming so 5 rollout episodes create 5 distinct PNGs.
+- [ ] Reuse the existing red trajectory representation logic when composing the PNG.
+- [ ] Ensure headless eval works under EGL even on machines with `DISPLAY` set.
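
The naming and headless items above can be sketched as follows. This is a minimal illustration, not the code in `eval_vla.py`: the helper names and the file-name pattern are hypothetical, and forcing EGL through the `MUJOCO_GL` environment variable is an assumption about how the backend would be selected.

```python
import os
from pathlib import Path


def force_headless_egl() -> None:
    """Select MuJoCo's EGL backend explicitly (hypothetical helper).

    Assigning (rather than using setdefault) means EGL is chosen even when
    DISPLAY is set. This must run before `import mujoco`.
    """
    os.environ["MUJOCO_GL"] = "egl"


def trajectory_image_path(output_dir, episode_index: int, camera_name: str = "front") -> Path:
    """Build a distinct PNG path per episode (the naming scheme is an assumption)."""
    file_name = f"rollout_{camera_name}_ep{episode_index:02d}_trajectory.png"
    return Path(output_dir) / file_name


# Five rollout episodes map to five different file names.
paths = [trajectory_image_path("artifacts", episode) for episode in range(5)]
```

Embedding the episode index in the file name is what keeps a later episode from overwriting an earlier one within the same validation event.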
+
+### Task 3: Implement SwanLab rollout image logging in training
+
+**Files:**
+- Modify: `roboimi/demos/vla_scripts/train_vla.py`
+- Modify: `tests/test_train_vla_swanlab_logging.py`
+- Modify: `tests/test_train_vla_rollout_validation.py`
+
+- [ ] Make `run_rollout_validation()` force `record_video=false`.
+- [ ] Make `run_rollout_validation()` force `save_trajectory_image=true` and `trajectory_image_camera_name=front`.
+- [ ] Ensure rollout validation still uses 5 episodes per validation event for the requested run.
+- [ ] Add a best-effort helper that converts per-episode image paths into SwanLab image media payloads.
+- [ ] Keep image-upload failures non-fatal and warning-only.
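
One way to read the Task 3 items is sketched below, under stated assumptions: the eval config is modeled as a plain dict rather than an OmegaConf object, both function names are hypothetical, and the image constructor and log function are injected so the sketch does not depend on SwanLab being installed.

```python
import warnings
from typing import Any, Callable, Iterable, Mapping


def apply_rollout_overrides(eval_cfg: Mapping[str, Any]) -> dict:
    """Return a copy of the eval config with the training-time settings forced."""
    cfg = dict(eval_cfg)
    cfg["record_video"] = False
    cfg["save_trajectory_image"] = True
    cfg["trajectory_image_camera_name"] = "front"
    return cfg


def log_rollout_images(
    image_paths: Iterable[str],
    step: int,
    make_image: Callable[[str], Any],
    log_fn: Callable[[dict, int], None],
    key_prefix: str = "rollout/trajectory",  # hypothetical metric key
) -> int:
    """Best-effort upload: a failing image produces a warning, never an exception."""
    logged = 0
    for episode, path in enumerate(image_paths):
        try:
            payload = {f"{key_prefix}_ep{episode}": make_image(str(path))}
            log_fn(payload, step)
            logged += 1
        except Exception as exc:  # deliberately broad: logging must not stop training
            warnings.warn(f"Skipping rollout image {path!r}: {exc}")
    return logged
```

With SwanLab available, `make_image` could be `swanlab.Image` and `log_fn` could be `lambda payload, step: swanlab.log(payload, step=step)`. Returning the number of images logged gives tests something to assert on without inspecting warnings.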
+
+### Task 4: Verify action-chunk semantics for the new run
+
+**Files:**
+- Verify: `roboimi/vla/agent.py`
+- Verify: `roboimi/vla/agent_imf.py`
+- Test: `tests/test_imf_vla_agent.py`
+
+- [ ] Confirm the existing queue logic still means “predict 8, execute first 4”.
+- [ ] Do not change branch defaults unless strictly necessary; prefer launch-time overrides.
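
The "predict 8, execute first 4" semantics can be illustrated with a small queue. This is a generic sketch of action chunking, not the logic in `agent.py` or `agent_imf.py`; the class name and the `predict_chunk` callable are hypothetical.

```python
from collections import deque
from typing import Any, Callable, Deque, Sequence


class ActionChunkQueue:
    """Execute only the first `num_action_steps` actions of each predicted chunk."""

    def __init__(self, pred_horizon: int = 8, num_action_steps: int = 4) -> None:
        if not 0 < num_action_steps <= pred_horizon:
            raise ValueError("num_action_steps must be in (0, pred_horizon]")
        self.pred_horizon = pred_horizon
        self.num_action_steps = num_action_steps
        self._queue: Deque[Any] = deque()

    def next_action(self, predict_chunk: Callable[[], Sequence[Any]]) -> Any:
        # Re-plan only when the queue is empty, so one prediction serves
        # `num_action_steps` environment steps.
        if not self._queue:
            chunk = predict_chunk()
            if len(chunk) != self.pred_horizon:
                raise ValueError("predicted chunk length must equal pred_horizon")
            # Keep the first part of the chunk and discard the tail.
            self._queue.extend(chunk[: self.num_action_steps])
        return self._queue.popleft()
```

Under these settings the policy is queried once every 4 environment steps, and the last 4 predicted actions of each chunk are never executed.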
+
+### Task 5: Verify and launch the requested local training run
+
+**Files:**
+- Use: `roboimi/demos/vla_scripts/train_vla.py`
+- Use: `roboimi/demos/vla_scripts/eval_vla.py`
+
+- [ ] Run the targeted verification suite.
+- [ ] Run one real headless smoke eval and confirm a front trajectory PNG is produced while `video_mp4` stays null.
+- [ ] Launch the new local training run with explicit overrides including:
+  - `agent=resnet_imf_attnres`
+  - `agent.head.n_emb=384`
+  - `agent.head.n_layer=12`
+  - `agent.pred_horizon=8`
+  - `agent.num_action_steps=4`
+  - `train.max_steps=50000`
+  - `train.rollout_num_episodes=5`
+  - `train.use_swanlab=true`
+  - current local baseline dataset/camera/CUDA/batch/lr/num_workers/backbone settings
+- [ ] Verify PID, GPU allocation, log tail, and SwanLab run URL.
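
The explicit overrides listed above could be assembled into a launch command roughly as follows. This sketch assumes the training script is invoked directly with Hydra-style `key=value` arguments. The baseline dataset, camera, CUDA, batch, lr, num_workers, and backbone settings are deliberately left out because the plan does not state their values.

```python
import shlex

TRAIN_SCRIPT = "roboimi/demos/vla_scripts/train_vla.py"

# Only the overrides named in the plan. Baseline settings must be appended separately.
OVERRIDES = {
    "agent": "resnet_imf_attnres",
    "agent.head.n_emb": 384,
    "agent.head.n_layer": 12,
    "agent.pred_horizon": 8,
    "agent.num_action_steps": 4,
    "train.max_steps": 50000,
    "train.rollout_num_episodes": 5,
    "train.use_swanlab": "true",
}


def build_command(script: str, overrides: dict) -> list:
    """Turn the override mapping into an argv list with key=value arguments."""
    return ["python", script] + [f"{key}={value}" for key, value in overrides.items()]


command = build_command(TRAIN_SCRIPT, OVERRIDES)
print(shlex.join(command))  # printable form for logs; pass `command` itself to subprocess
```

Keeping the overrides in one mapping makes the launched configuration easy to record next to the PID and the SwanLab run URL.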