fix: stabilize headless rollout and summarize phase1 grid

This commit is contained in:
Logic
2026-04-04 23:47:15 +08:00
parent 0586a6e6c7
commit a78006808a
8 changed files with 446 additions and 1 deletions


@@ -0,0 +1,20 @@
# IMF Horizon Grid Suite Notes
- Created: 2026-04-04 13:19:52
- Phase-1 matrix: (8,8), (16,8), (16,16), (32,8), (32,16), (32,32)
- Fixed baseline: IMF AttnRes, n_emb=384, n_layer=12, batch_size=80, lr=2.5e-4, max_steps=50k, rollout every 5 epochs with 5 episodes.
- Host allocation:
- local RTX 5090: ph32_ex32
- 100.73.14.65 RTX 5880 GPU0: ph8_ex8
- 100.73.14.65 RTX 5880 GPU1: ph16_ex8
- 100.119.99.14 L20 GPU0: ph16_ex16
- 100.119.99.14 L20 GPU1: ph32_ex8
- 100.119.99.14 L20 GPU2: ph32_ex16
- 100.119.99.14 still needs env + dataset + swanlab credential copy before launch.
- 2026-04-04 13:23:43: launched local ph32_ex32 (pid 1437836), remote 100.73 ph8_ex8 (pid 931824), ph16_ex8 (pid 931826); started 100.119 bootstrap (local pid 1437837).
- 2026-04-04 13:25:43: first status sync — local ph32_ex32 step≈500; remote ph8_ex8 step≈400; remote ph16_ex8 step≈400.
- 2026-04-04 13:27:41: second status sync — 100.119 bootstrap finished env copy and entered dataset copy; local ph32_ex32 step≈900; remote ph8_ex8 step≈800; remote ph16_ex8 step≈800.
- 2026-04-04 13:35:31: 100.119 bootstrap finished the data/env copy. The original validation command hit a quoting bug, so I manually revalidated torch+mujoco+swanlab and launched ph16_ex16/ph32_ex8/ph32_ex16 with pids 81129/81130/81131.
- 2026-04-04 13:37:36: all 6 Phase-1 runs are now up. SwanLab links recorded in status.json; latest observed steps: local ≈900, RTX 5880 runs ≈800, L20 runs ≈100.
- 2026-04-04 14:41:08: diagnosed the remote first-rollout crash: eval_vla.py imported mujoco early (via raw_action_trajectory_viewer) before MUJOCO_GL=egl was set. Added regression test tests/test_eval_vla_headless_import.py, changed the import to lazy-load, verified a 20-step headless eval on both the 5880 and the L20, then resumed the 5 failed runs from step 4374. Resumed pids: ph8_ex8=938714, ph16_ex8=938717, ph16_ex16=90169, ph32_ex8=90173, ph32_ex16=90175.
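
The Phase-1 matrix and fixed baseline above can be sketched as a config expansion. This is illustrative, not the actual launcher code: the hyperparameter values and run names (`ph{h}_ex{e}`) come from this log, but the `pred_horizon`/`exec_horizon` key names are my guess at what `ph`/`ex` stand for.

```python
# Illustrative Phase-1 grid expansion; values are from the notes above,
# key names pred_horizon/exec_horizon are assumptions.
BASELINE = {
    "model": "IMF AttnRes",
    "n_emb": 384,
    "n_layer": 12,
    "batch_size": 80,
    "lr": 2.5e-4,
    "max_steps": 50_000,
    "rollout_every_epochs": 5,
    "rollout_episodes": 5,
}

# (ph, ex) pairs from the Phase-1 matrix.
PHASE1_MATRIX = [(8, 8), (16, 8), (16, 16), (32, 8), (32, 16), (32, 32)]

def run_configs():
    """Yield one run config per (ph, ex) cell, named like the log entries."""
    for ph, ex in PHASE1_MATRIX:
        yield {**BASELINE, "pred_horizon": ph, "exec_horizon": ex,
               "name": f"ph{ph}_ex{ex}"}
```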
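
The manual stack revalidation at 13:35:31 can be approximated with a headless-safe probe. `check_headless_stack` is a hypothetical helper (the original command is not recorded here); it uses `importlib.util.find_spec`, which only locates modules without importing them, so mujoco's GL init is never triggered by the check itself.

```python
import importlib.util
import os

def check_headless_stack(packages=("torch", "mujoco", "swanlab")):
    """Return the subset of `packages` that is not importable.

    find_spec only locates each module, so mujoco's GL initialization is
    not triggered; MUJOCO_GL is pinned first for any later real import.
    """
    os.environ.setdefault("MUJOCO_GL", "egl")
    return [p for p in packages if importlib.util.find_spec(p) is None]
```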
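
As I read the 14:41:08 entry, the lazy-load fix amounts to moving the mujoco import out of module scope so nothing touches GL before `MUJOCO_GL=egl` is set. A minimal sketch of that pattern, with illustrative function names (not the actual eval_vla.py code):

```python
import os

# The crash mode: `import mujoco` at module scope in the viewer module runs
# the moment eval_vla.py is imported, before the rollout sets MUJOCO_GL=egl.
# The fix: pin the backend and import mujoco only inside the render path.

def pin_headless_gl():
    """Ensure an off-screen GL backend is selected (idempotent)."""
    os.environ.setdefault("MUJOCO_GL", "egl")
    return os.environ["MUJOCO_GL"]

def render_trajectory(frames):
    pin_headless_gl()
    import mujoco  # deferred: importing this file never touches GL
    ...  # actual rendering elided
```

A regression test in the spirit of tests/test_eval_vla_headless_import.py could then import the viewer module and assert `"mujoco" not in sys.modules`, catching any reintroduced module-scope import.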