IMF Horizon Grid Suite Notes

  • Created: 2026-04-04 13:19:52

  • Phase-1 matrix (ph, ex): (8,8), (16,8), (16,16), (32,8), (32,16), (32,32)

  • Fixed baseline: IMF AttnRes, n_emb=384, n_layer=12, batch_size=80, lr=2.5e-4, max_steps=50k, rollout every 5 epochs with 5 episodes.

  • Host allocation:

    • local RTX 5090: ph32_ex32
    • 100.73.14.65 RTX 5880 GPU0: ph8_ex8
    • 100.73.14.65 RTX 5880 GPU1: ph16_ex8
    • 100.119.99.14 L20 GPU0: ph16_ex16
    • 100.119.99.14 L20 GPU1: ph32_ex8
    • 100.119.99.14 L20 GPU2: ph32_ex16

  • 100.119.99.14 still needs env + dataset + swanlab credential copy before launch.
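
A quick consistency check of the matrix against the host allocation can be sketched as below. This is a minimal sketch, not code from the repo: the `run_name` helper, the `HOST_ALLOCATION` dict layout, and `check_allocation` are hypothetical names introduced here; only the (ph, ex) pairs, hosts, and GPU indices come from the notes above.

```python
# Phase-1 (ph, ex) pairs exactly as listed in the notes.
PHASE1_MATRIX = [(8, 8), (16, 8), (16, 16), (32, 8), (32, 16), (32, 32)]


def run_name(ph: int, ex: int) -> str:
    """Build the run tag used in these notes, e.g. (16, 8) -> 'ph16_ex8'."""
    return f"ph{ph}_ex{ex}"


# Host allocation copied from the notes; this dict layout is hypothetical.
# Values are (host, gpu model, gpu index); the local box has no index listed.
HOST_ALLOCATION = {
    "ph32_ex32": ("local", "RTX 5090", None),
    "ph8_ex8": ("100.73.14.65", "RTX 5880", 0),
    "ph16_ex8": ("100.73.14.65", "RTX 5880", 1),
    "ph16_ex16": ("100.119.99.14", "L20", 0),
    "ph32_ex8": ("100.119.99.14", "L20", 1),
    "ph32_ex16": ("100.119.99.14", "L20", 2),
}


def check_allocation(matrix, allocation):
    """Return (runs with no host, GPU slots assigned more than once)."""
    names = [run_name(ph, ex) for ph, ex in matrix]
    unassigned = [name for name in names if name not in allocation]
    seen, clashes = set(), []
    for host, _model, index in allocation.values():
        slot = (host, index)
        if slot in seen:
            clashes.append(slot)
        seen.add(slot)
    return unassigned, clashes


if __name__ == "__main__":
    # Both lists are empty when every run has exactly one GPU slot.
    print(check_allocation(PHASE1_MATRIX, HOST_ALLOCATION))
```

Keeping the matrix and the allocation as separate structures means a later phase can extend the matrix first and let the check report which runs still need a GPU.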

  • 2026-04-04 13:23:43: launched local ph32_ex32 (pid 1437836), remote 100.73 ph8_ex8 (pid 931824), ph16_ex8 (pid 931826); started 100.119 bootstrap (local pid 1437837).

  • 2026-04-04 13:25:43: first status sync — local ph32_ex32 step≈500; remote ph8_ex8 step≈400; remote ph16_ex8 step≈400.

  • 2026-04-04 13:27:41: second status sync — 100.119 bootstrap finished env copy and entered dataset copy; local ph32_ex32 step≈900; remote ph8_ex8 step≈800; remote ph16_ex8 step≈800.

  • 2026-04-04 13:35:31: 100.119 bootstrap data/env copy finished. The original validation command hit a quoting bug, so I manually revalidated torch+mujoco+swanlab and launched ph16_ex16/ph32_ex8/ph32_ex16 with pids 81129/81130/81131.

  • 2026-04-04 13:37:36: all 6 Phase-1 runs are now up. SwanLab links recorded in status.json; latest observed steps ~ local 900 / 5880 runs 800 / L20 runs 100.

  • 2026-04-04 14:41:08: diagnosed the remote first-rollout crash: eval_vla.py imported mujoco (via raw_action_trajectory_viewer) before MUJOCO_GL=egl was set. Added regression test tests/test_eval_vla_headless_import.py, changed the import to lazy-load, verified a 20-step headless eval on 5880 and L20, then resumed the 5 failed runs from step 4374. Current resumed pids: ph8_ex8=938714, ph16_ex8=938717, ph16_ex16=90169, ph32_ex8=90173, ph32_ex16=90175.
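
The lazy-import pattern behind that fix can be sketched roughly as follows. This is an illustration of the general technique, not the actual change in eval_vla.py: `ensure_headless_gl` and `load_mujoco` are hypothetical names, and the `module_name` parameter exists only so the helper can be exercised on a machine without mujoco installed.

```python
import importlib
import os


def ensure_headless_gl(backend: str = "egl") -> str:
    """Set MUJOCO_GL unless it is already set; return the value in effect."""
    # setdefault keeps an explicit user choice (e.g. MUJOCO_GL=osmesa) intact.
    return os.environ.setdefault("MUJOCO_GL", backend)


def load_mujoco(module_name: str = "mujoco"):
    """Import mujoco on first use, only after the GL backend is configured.

    Call this inside the function that needs mujoco instead of importing
    mujoco at module top level, so importing the enclosing module never
    triggers the mujoco import early.
    """
    ensure_headless_gl()
    # importlib caches the module in sys.modules, so repeat calls are cheap.
    return importlib.import_module(module_name)
```

A viewer or eval helper would then call `load_mujoco()` inside the function body that renders, so a headless host picks up the environment variable before the import runs.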