chore(pusht): add 5090 repro docs and uv setup
PUSHT_REPRO_5090.md
@@ -0,0 +1,108 @@
# PushT Repro On 5090

## Goal
Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.

## Current Verified Local Setup
- Virtualenv: `./.venv` managed with `uv`
- Python: `3.9.25`
- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Version strategy used here:
  - newer Torch/CUDA stack for current 5090-class hardware support
  - keep older repo-era packages where they are still required by the code
- Verified key pins in `.venv`:
  - `numpy 1.26.4`
  - `gym 0.23.1`
  - `hydra-core 1.2.0`
  - `diffusers 0.11.1`
  - `huggingface_hub 0.10.1`
  - `wandb 0.13.3`
  - `zarr 2.12.0`
  - `numcodecs 0.10.2`
  - `av 14.0.1`
  - `robomimic 0.2.0`

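The pins listed above can be re-checked from inside `.venv` with a short script. This is a minimal sketch, not part of the repo: `EXPECTED_PINS`, `installed_version`, and `find_mismatches` are hypothetical names, and the Torch packages are left out because their versions carry a local `+cu128` suffix that is easier to compare by eye.

```python
from importlib import metadata

# Pins copied from the list above; all helper names here are hypothetical.
EXPECTED_PINS = {
    "numpy": "1.26.4",
    "gym": "0.23.1",
    "hydra-core": "1.2.0",
    "diffusers": "0.11.1",
    "huggingface_hub": "0.10.1",
    "wandb": "0.13.3",
    "zarr": "2.12.0",
    "numcodecs": "0.10.2",
    "av": "14.0.1",
    "robomimic": "0.2.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches(EXPECTED_PINS).items():
        print(f"{name}: expected {want}, found {found}")
```

Run it with `.venv/bin/python` so that the lookup sees the virtualenv's packages rather than the system ones; no output means every pin matched.
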
## Dataset
- README source: `https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip`
- Local archive currently present: `data/pusht.zip`
- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`

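If the unpacked dataset is missing, it can be restored from the local archive. The sketch below assumes the archive unpacks to `pusht/pusht_cchi_v7_replay.zarr` relative to `data/`, which matches the paths above but is not verified here; `ensure_dataset` is a hypothetical helper.

```python
import zipfile
from pathlib import Path

ARCHIVE = Path("data/pusht.zip")
DATASET = Path("data/pusht/pusht_cchi_v7_replay.zarr")


def ensure_dataset(archive=ARCHIVE, dataset=DATASET):
    """Extract the archive next to itself if the dataset directory is missing."""
    archive, dataset = Path(archive), Path(dataset)
    if dataset.is_dir():
        return dataset  # already unpacked, nothing to do
    if not archive.is_file():
        raise FileNotFoundError(f"missing archive: {archive}")
    with zipfile.ZipFile(archive) as zf:
        # Assumption: member paths inside the zip start with "pusht/".
        zf.extractall(archive.parent)
    if not dataset.is_dir():
        raise FileNotFoundError(f"archive did not produce {dataset}")
    return dataset
```
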
## Repo-Local Code Adjustments
- `diffusion_policy/env_runner/pusht_image_runner.py`
  - switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
- `diffusion_policy/gym_util/sync_vector_env.py`
  - added `reset_async`
  - added seeded `reset_wait`
  - updated `concatenate(...)` call order for current `gym`

These changes were needed to keep PushT evaluation working without the async shared-memory path.

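The `reset_async` / seeded `reset_wait` pair can be pictured with a toy serial vector env. This is an illustration of the pattern only, not the code in `sync_vector_env.py` and not gym's class: `ToyEnv` and `SerialVectorEnv` are hypothetical, and seeding sub-environment `i` with `seed + i` is an assumption.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a single environment."""

    def __init__(self):
        self.rng = random.Random()

    def seed(self, seed):
        self.rng.seed(seed)

    def reset(self):
        # The "observation" is just the first random draw after seeding.
        return self.rng.random()


class SerialVectorEnv:
    """Resets sub-environments one by one in the calling process."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset_async(self, seed=None):
        # Serial execution has no background work to start.
        pass

    def reset_wait(self, seed=None):
        # Expand a single int seed into one seed per sub-environment.
        if seed is None:
            seeds = [None] * len(self.envs)
        elif isinstance(seed, int):
            seeds = [seed + i for i in range(len(self.envs))]
        else:
            seeds = list(seed)
        observations = []
        for env, env_seed in zip(self.envs, seeds):
            if env_seed is not None:
                env.seed(env_seed)
            observations.append(env.reset())
        return observations

    def reset(self, seed=None):
        # Same two-phase call shape as an async vector env.
        self.reset_async(seed=seed)
        return self.reset_wait(seed=seed)
```

Keeping both methods lets calling code that expects the two-phase reset interface run unchanged while everything stays in one process.
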
## Validated GPU Smoke Command
This route is verified by `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contains `logs.json.txt` plus checkpoints:

```bash
.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  training.debug=true \
  task.env_runner.n_test=2 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train=1 \
  task.env_runner.n_train_vis=0 \
  task.env_runner.max_steps=20
```

## Practical Full Training Command Used Here
This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:

```bash
.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train_vis=0 \
  hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
```

## Why These Overrides Were Used
- `logging.mode=offline`
  - avoids needing a W&B login and still leaves local run metadata in the output dir
- `dataloader.num_workers=0` and `val_dataloader.num_workers=0`
  - avoids extra multiprocessing on this host
- `task.env_runner.n_envs=1`
  - keeps PushT eval on the serial `SyncVectorEnv` path
- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0`
  - avoids video-writing issues on this stack
  - one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`

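Both commands in this document share the same base overrides, so one way to avoid drift between them is to generate the argument list from a single mapping. The sketch below is illustrative: `COMMON_OVERRIDES`, `format_value`, and `build_command` are hypothetical names, and only the override keys and values come from the commands above.

```python
# Overrides shared by the smoke and full training commands.
COMMON_OVERRIDES = {
    "training.seed": 42,
    "training.device": "cuda:0",
    "logging.mode": "offline",
    "dataloader.num_workers": 0,
    "val_dataloader.num_workers": 0,
    "task.env_runner.n_envs": 1,
    "task.env_runner.n_test_vis": 0,
    "task.env_runner.n_train_vis": 0,
}


def format_value(value):
    """Render a Python value the way it is written on the Hydra command line."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return "null"
    return str(value)


def build_command(extra=None, python=".venv/bin/python"):
    """Return the train.py argument list with common plus extra overrides."""
    overrides = dict(COMMON_OVERRIDES)
    overrides.update(extra or {})
    cmd = [
        python,
        "train.py",
        "--config-dir=.",
        "--config-name=image_pusht_diffusion_policy_cnn.yaml",
    ]
    cmd += [f"{key}={format_value(val)}" for key, val in overrides.items()]
    return cmd
```

For example, `build_command({"training.debug": True, "task.env_runner.max_steps": 20})` adds two of the smoke-run overrides to the shared set; the resulting list can be passed to `subprocess.run(..., check=True)` from the repo root.
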
## Output Locations
- Smoke run:
  - `data/outputs/gpu_smoke2___pusht_gpu_smoke`
- Longer GPU run:
  - `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
- Files to inspect inside a run:
  - `.hydra/overrides.yaml`
  - `logs.json.txt`
  - `train.log`
  - `checkpoints/latest.ckpt`

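A quick way to sanity-check a run directory is to confirm that the files listed above exist and to read the last logged record. This sketch assumes `logs.json.txt` holds one JSON object per line, which has not been confirmed here; `missing_files` and `last_log_record` are hypothetical helpers.

```python
import json
from pathlib import Path

# Relative paths taken from the list above.
EXPECTED_FILES = [
    ".hydra/overrides.yaml",
    "logs.json.txt",
    "train.log",
    "checkpoints/latest.ckpt",
]


def missing_files(run_dir):
    """Return the expected files that are absent from a run directory."""
    run_dir = Path(run_dir)
    return [rel for rel in EXPECTED_FILES if not (run_dir / rel).is_file()]


def last_log_record(run_dir):
    """Return the last parseable record, assuming one JSON object per line."""
    last = None
    with open(Path(run_dir) / "logs.json.txt") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a partial line, e.g. from an interrupted run
    return last
```
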
## Known Caveats
- The default config is still tuned for older assumptions:
  - `logging.mode=online`
  - `dataloader.num_workers=8`
  - `task.env_runner.n_envs=null`
  - `task.env_runner.n_test_vis=4`
  - `task.env_runner.n_train_vis=2`
- In this shell, `torch.cuda.is_available()` currently reports `False` even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.
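
The device-visibility re-check can be done with a few lines of Python before launching a run. This is a minimal sketch: `cuda_report` is a hypothetical helper, and it only reports what the current process sees without diagnosing why a device might be hidden.

```python
import os


def cuda_report():
    """Collect what the current process can see about CUDA."""
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # wrong interpreter or venv not active
        return report
    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_0"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False`, an empty or unexpected `CUDA_VISIBLE_DEVICES` value is one of the first things worth ruling out.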