PushT Repro On 5090
Goal
Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.
Current Verified Local Setup
- Virtualenv: `./.venv`, managed with `uv`
- Python: 3.9.25
- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Version strategy used here:
  - newer Torch/CUDA stack for current 5090-class hardware support
  - keep older repo-era packages where they are still required by the code
- Verified key pins in `.venv`: `numpy 1.26.4`, `gym 0.23.1`, `hydra-core 1.2.0`, `diffusers 0.11.1`, `huggingface_hub 0.10.1`, `wandb 0.13.3`, `zarr 2.12.0`, `numcodecs 0.10.2`, `av 14.0.1`, `robomimic 0.2.0`
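The pins above can be spot-checked from inside the venv. This is a small sketch, not a repo script; `EXPECTED` is truncated to a few entries copied from the list above, and `check_pins` is a hypothetical helper:

```python
# Spot-check that the active environment matches the pins listed above.
from importlib import metadata

# Subset of the verified pins from this note; extend as needed.
EXPECTED = {
    "numpy": "1.26.4",
    "gym": "0.23.1",
    "hydra-core": "1.2.0",
    "diffusers": "0.11.1",
}

def check_pins(expected):
    """Return {name: (expected, installed)} for missing or mismatched pins."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package not installed at all
        if have != want:
            mismatches[name] = (want, have)
    return mismatches

if __name__ == "__main__":
    for name, (want, have) in check_pins(EXPECTED).items():
        print(f"{name}: expected {want}, found {have}")
```

Run it with `.venv/bin/python`; an empty result means the listed pins resolve as expected.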
Dataset
- README source: https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip
- Local archive currently present: `data/pusht.zip`
- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`
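A hedged pre-flight sketch for this layout; the helper names (`dataset_paths`, `dataset_ready`) are hypothetical, and the paths are the ones listed above:

```python
# Check that the dataset layout the config expects is present under data/.
from pathlib import Path

def dataset_paths(repo_root="."):
    """Return the (archive, zarr store) paths this note's config uses."""
    root = Path(repo_root)
    archive = root / "data" / "pusht.zip"
    store = root / "data" / "pusht" / "pusht_cchi_v7_replay.zarr"
    return archive, store

def dataset_ready(repo_root="."):
    """True only when the unpacked zarr store directory exists."""
    _, store = dataset_paths(repo_root)
    return store.is_dir()
```

If `dataset_ready()` is `False` but the archive exists, unzip `data/pusht.zip` into `data/` first.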
Repo-Local Code Adjustments
- `diffusion_policy/env_runner/pusht_image_runner.py`
  - switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
- `diffusion_policy/gym_util/sync_vector_env.py`
  - added `reset_async`
  - added seeded `reset_wait`
  - updated the `concatenate(...)` call order for the current `gym` version
These changes were needed to keep PushT evaluation working without the async shared-memory path.
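Schematically, the split-reset protocol these changes rely on looks like the sketch below. This is an illustration of the pattern, not the repo's actual patch; `TinyEnv` and `TinySyncVectorEnv` are stand-ins:

```python
# Sketch of the reset_async/reset_wait split on a serial vector env:
# reset_async only records the requested seeds, reset_wait performs the
# resets in-process, with no shared memory or worker processes.
class TinyEnv:
    """Stand-in env whose reset observation is derived from its seed."""
    def reset(self, seed=None):
        return {"obs": 0 if seed is None else seed}

class TinySyncVectorEnv:
    def __init__(self, envs):
        self.envs = envs
        self._pending_seeds = None

    def reset_async(self, seed=None):
        # Accept a single seed for all envs, or a per-env list.
        if seed is None or isinstance(seed, int):
            seed = [seed] * len(self.envs)
        self._pending_seeds = seed

    def reset_wait(self):
        seeds = self._pending_seeds or [None] * len(self.envs)
        self._pending_seeds = None
        # Serial resets, one env at a time, in the main process.
        return [env.reset(seed=s) for env, s in zip(self.envs, seeds)]
```

The async path in `gym` implements the same two-phase interface with worker processes and shared memory; keeping the interface identical is what lets the runner swap `AsyncVectorEnv` for `SyncVectorEnv` without touching the eval loop.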
Validated GPU Smoke Command
This route is verified by `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contains `logs.json.txt` plus checkpoints:
.venv/bin/python train.py \
--config-dir=. \
--config-name=image_pusht_diffusion_policy_cnn.yaml \
training.seed=42 \
training.device=cuda:0 \
logging.mode=offline \
dataloader.num_workers=0 \
val_dataloader.num_workers=0 \
task.env_runner.n_envs=1 \
training.debug=true \
task.env_runner.n_test=2 \
task.env_runner.n_test_vis=0 \
task.env_runner.n_train=1 \
task.env_runner.n_train_vis=0 \
task.env_runner.max_steps=20
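To skim a finished smoke run, `logs.json.txt` can be read as JSON lines. This sketch assumes one metrics dict per line; `last_metric` is a hypothetical helper and the `train_loss` key is illustrative (actual key names depend on the run):

```python
# Pull the last logged value of a metric out of a JSON-lines log.
import json

def last_metric(log_text, key):
    """Return the last logged value for `key`, or None if never logged."""
    value = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if key in record:
            value = record[key]
    return value
```

Feed it the contents of the run's `logs.json.txt` to confirm the smoke run actually progressed.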
Practical Full Training Command Used Here
This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:
.venv/bin/python train.py \
--config-dir=. \
--config-name=image_pusht_diffusion_policy_cnn.yaml \
training.seed=42 \
training.device=cuda:0 \
logging.mode=offline \
dataloader.num_workers=0 \
val_dataloader.num_workers=0 \
task.env_runner.n_envs=1 \
task.env_runner.n_test_vis=0 \
task.env_runner.n_train_vis=0 \
hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
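The pinned `hydra.run.dir` above follows Hydra's default `data/outputs/<date>/<time>_<name>` shape; pinning it makes the run directory stable across restarts instead of getting a fresh timestamp each launch. A small sketch of that convention (the `run_dir` helper is hypothetical):

```python
# Build a Hydra-style run directory path for a fixed datetime, matching
# the data/outputs/<date>/<time>_<name> pattern used in this repo.
from datetime import datetime

def run_dir(name, when):
    """Format a run directory like data/outputs/2026.03.13/15.37.00_<name>."""
    return f"data/outputs/{when:%Y.%m.%d}/{when:%H.%M.%S}_{name}"
```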
Why These Overrides Were Used
- `logging.mode=offline` - avoids needing a W&B login and still leaves local run metadata in the output dir
- `dataloader.num_workers=0` and `val_dataloader.num_workers=0` - avoids extra multiprocessing on this host
- `task.env_runner.n_envs=1` - keeps PushT eval on the serial `SyncVectorEnv` path
- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0` - avoids video-writing issues on this stack; one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`
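A hedged helper for spotting that encoder failure mode again when re-enabling vis; `video_encode_errors` is hypothetical, and the matched substrings are just the markers seen in the earlier failing run, not an exhaustive list:

```python
# Grep a train.log's text for the libav/libx264 error markers described above.
def video_encode_errors(log_text, needles=("libx264", "profile=high")):
    """Return log lines mentioning any of the given encoder error markers."""
    return [line for line in log_text.splitlines()
            if any(n in line for n in needles)]
```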
Output Locations
- Smoke run: `data/outputs/gpu_smoke2___pusht_gpu_smoke`
- Longer GPU run: `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
- Files to inspect inside a run: `.hydra/overrides.yaml`, `logs.json.txt`, `train.log`, `checkpoints/latest.ckpt`
Known Caveats
- The default config is still tuned for older assumptions: `logging.mode=online`, `dataloader.num_workers=8`, `task.env_runner.n_envs=null`, `task.env_runner.n_test_vis=4`, `task.env_runner.n_train_vis=2`
- In this shell, `torch.cuda.is_available()` currently reports `False` even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.
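Before restarting a GPU run, it can help to see what CUDA would expose in the current shell without importing torch at all. This is a minimal sketch assuming the standard `CUDA_VISIBLE_DEVICES` convention (unset means all devices, empty string means none); `visible_cuda_devices` is a hypothetical helper:

```python
# Interpret CUDA_VISIBLE_DEVICES the way the CUDA runtime does.
import os

def visible_cuda_devices(env=None):
    """Return device ids CUDA would expose, or None when the var is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: all devices visible
    return [d for d in raw.split(",") if d.strip()]
```

If this returns `[]`, the shell itself is hiding the GPU, which would explain `torch.cuda.is_available()` reporting `False` despite the working run artifacts.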