chore(pusht): add 5090 repro docs and uv setup

2026-03-14 12:25:44 +08:00
parent 5ba07ac666
commit 08c1950c6d
6 changed files with 270 additions and 8 deletions
--- a/PUSHT_REPRO_5090.md
+++ b/PUSHT_REPRO_5090.md
@@ -0,0 +1,108 @@
+# PushT Repro On 5090
+
+## Goal
+Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.
+
+## Current Verified Local Setup
+- Virtualenv: `./.venv` managed with `uv`
+- Python: `3.9.25`
+- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
+- Version strategy used here:
+  - use a newer Torch/CUDA stack (the `cu128` builds above) to support current 5090-class hardware
+  - keep older repo-era packages where the code still requires them
+- Verified key pins in `.venv`:
+  - `numpy 1.26.4`
+  - `gym 0.23.1`
+  - `hydra-core 1.2.0`
+  - `diffusers 0.11.1`
+  - `huggingface_hub 0.10.1`
+  - `wandb 0.13.3`
+  - `zarr 2.12.0`
+  - `numcodecs 0.10.2`
+  - `av 14.0.1`
+  - `robomimic 0.2.0`
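+
+To re-check these pins later, a small script along the following lines may help. It is a minimal sketch, not part of the repo: `EXPECTED_PINS` and `check_pins` are hypothetical names, and it assumes it is run with the `.venv` interpreter so that `importlib.metadata` sees the same packages.
+
+```python
+from importlib import metadata
+
+# Pins copied from the list above (hypothetical helper, not part of the repo).
+EXPECTED_PINS = {
+    "numpy": "1.26.4",
+    "gym": "0.23.1",
+    "hydra-core": "1.2.0",
+    "diffusers": "0.11.1",
+    "huggingface_hub": "0.10.1",
+    "wandb": "0.13.3",
+    "zarr": "2.12.0",
+    "numcodecs": "0.10.2",
+    "av": "14.0.1",
+    "robomimic": "0.2.0",
+}
+
+
+def check_pins(expected):
+    """Return {name: (expected, installed)} for packages that differ or are missing."""
+    mismatches = {}
+    for name, want in expected.items():
+        try:
+            have = metadata.version(name)
+        except metadata.PackageNotFoundError:
+            have = None  # package is not installed in this environment
+        if have != want:
+            mismatches[name] = (want, have)
+    return mismatches
+
+
+if __name__ == "__main__":
+    # Prints one line per mismatch; prints nothing if every pin matches.
+    for name, (want, have) in check_pins(EXPECTED_PINS).items():
+        print(f"{name}: expected {want}, found {have}")
+```
+
+Run it with `.venv/bin/python` so the check targets the virtualenv rather than a system interpreter.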
+
+## Dataset
+- Download URL from the README: `https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip`
+- Local archive currently present: `data/pusht.zip`
+- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`
+
+## Repo-Local Code Adjustments
+- `diffusion_policy/env_runner/pusht_image_runner.py`
+  - switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
+- `diffusion_policy/gym_util/sync_vector_env.py`
+  - added `reset_async`
+  - added seeded `reset_wait`
+  - updated `concatenate(...)` call order for current `gym`
+
+These changes were needed to keep PushT evaluation working without the async shared-memory path.
+
+## Validated GPU Smoke Command
+This command was verified by the run in `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contains `logs.json.txt` plus checkpoints:
+
+```bash
+.venv/bin/python train.py \
+  --config-dir=. \
+  --config-name=image_pusht_diffusion_policy_cnn.yaml \
+  training.seed=42 \
+  training.device=cuda:0 \
+  logging.mode=offline \
+  dataloader.num_workers=0 \
+  val_dataloader.num_workers=0 \
+  task.env_runner.n_envs=1 \
+  training.debug=true \
+  task.env_runner.n_test=2 \
+  task.env_runner.n_test_vis=0 \
+  task.env_runner.n_train=1 \
+  task.env_runner.n_train_vis=0 \
+  task.env_runner.max_steps=20
+```
+
+## Practical Full Training Command Used Here
+This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:
+
+```bash
+.venv/bin/python train.py \
+  --config-dir=. \
+  --config-name=image_pusht_diffusion_policy_cnn.yaml \
+  training.seed=42 \
+  training.device=cuda:0 \
+  logging.mode=offline \
+  dataloader.num_workers=0 \
+  val_dataloader.num_workers=0 \
+  task.env_runner.n_envs=1 \
+  task.env_runner.n_test_vis=0 \
+  task.env_runner.n_train_vis=0 \
+  hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
+```
+
+## Why These Overrides Were Used
+- `logging.mode=offline`
+  - avoids needing a W&B login and still leaves local run metadata in the output dir
+- `dataloader.num_workers=0` and `val_dataloader.num_workers=0`
+  - loads data in the main process, avoiding DataLoader worker subprocesses on this host
+- `task.env_runner.n_envs=1`
+  - keeps PushT eval on the serial `SyncVectorEnv` path
+- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0`
+  - disables rollout video recording, which avoids video-writing issues on this stack
+  - one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`
+
+## Output Locations
+- Smoke run:
+  - `data/outputs/gpu_smoke2___pusht_gpu_smoke`
+- Longer GPU run:
+  - `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
+- Files to inspect inside a run:
+  - `.hydra/overrides.yaml`
+  - `logs.json.txt`
+  - `train.log`
+  - `checkpoints/latest.ckpt`
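+
+For a quick look at how far a run got, the last record in `logs.json.txt` can be read with a helper like the one below. This is a sketch under one assumption: that the file holds one JSON object per line. `last_log_entry` is a hypothetical name, not a repo function.
+
+```python
+import json
+from pathlib import Path
+
+
+def last_log_entry(run_dir):
+    """Return the last parseable JSON record in <run_dir>/logs.json.txt, or None."""
+    log_path = Path(run_dir) / "logs.json.txt"
+    last = None
+    with log_path.open() as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                last = json.loads(line)
+            except json.JSONDecodeError:
+                # Skip partial lines, e.g. left behind by an interrupted run.
+                continue
+    return last
+```
+
+For example, `last_log_entry("data/outputs/gpu_smoke2___pusht_gpu_smoke")` would return the final logged record of the smoke run as a Python object, or `None` if the file has no parseable lines.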
+
+## Known Caveats
+- The defaults in the config still reflect older assumptions and differ from the overrides used here:
+  - `logging.mode=online`
+  - `dataloader.num_workers=8`
+  - `task.env_runner.n_envs=null`
+  - `task.env_runner.n_test_vis=4`
+  - `task.env_runner.n_train_vis=2`
+- In this shell, `torch.cuda.is_available()` currently reports `False`, even though the repo contains artifacts from validated GPU smoke and full runs. Re-check device visibility in the current session before restarting a GPU run.
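+
+One way to re-check device visibility is a short report like the following. It is a minimal sketch: `cuda_report` is a hypothetical helper, and it only reads values that PyTorch and the environment already expose, so it does not diagnose why a device is hidden.
+
+```python
+import os
+
+
+def cuda_report():
+    """Collect basic facts about GPU visibility in the current process."""
+    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
+    try:
+        import torch
+    except ImportError:
+        # torch is not installed in this interpreter; nothing more to check.
+        report["torch"] = None
+        return report
+    report["torch"] = torch.__version__
+    report["built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
+    report["cuda_available"] = torch.cuda.is_available()
+    report["device_count"] = torch.cuda.device_count()
+    return report
+
+
+if __name__ == "__main__":
+    for key, value in cuda_report().items():
+        print(f"{key}: {value}")
+```
+
+If `cuda_available` is `False` while `built_for_cuda` is set, the cause probably lies outside the virtualenv, for example the driver, container GPU passthrough, or a restrictive `CUDA_VISIBLE_DEVICES` value.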