diffusion_policy/PUSHT_REPRO_5090.md

PushT Repro On 5090

Goal

Reproduce the canonical single-seed image PushT experiment from this repo in ~/diffusion_policy using image_pusht_diffusion_policy_cnn.yaml.

Current Verified Local Setup

  • Virtualenv: ./.venv managed with uv
  • Python: 3.9.25
  • Torch stack: torch 2.8.0+cu128, torchvision 0.23.0+cu128
  • Version strategy used here:
    • newer Torch/CUDA stack for current 5090-class hardware support
    • keep older repo-era packages where they are still required by the code
  • Verified key pins in .venv:
    • numpy 1.26.4
    • gym 0.23.1
    • hydra-core 1.2.0
    • diffusers 0.11.1
    • huggingface_hub 0.10.1
    • wandb 0.13.3
    • zarr 2.12.0
    • numcodecs 0.10.2
    • av 14.0.1
    • robomimic 0.2.0
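
One way to confirm these pins before a run is a quick check against installed package metadata. A minimal sketch using the standard-library importlib.metadata; the EXPECTED mapping simply restates the pins above, and the check_pins helper name is my own, not part of the repo:

```python
# Sketch: verify the packages installed in .venv match the pins listed above.
# EXPECTED restates the pins from this document; adjust if they drift.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {
    "numpy": "1.26.4",
    "gym": "0.23.1",
    "hydra-core": "1.2.0",
    "diffusers": "0.11.1",
    "huggingface_hub": "0.10.1",
    "wandb": "0.13.3",
    "zarr": "2.12.0",
    "numcodecs": "0.10.2",
    "av": "14.0.1",
    "robomimic": "0.2.0",
}

def check_pins(expected):
    """Return {name: (expected, found)} for every mismatched or missing pin."""
    bad = {}
    for name, want in expected.items():
        try:
            got = version(name)
        except PackageNotFoundError:
            got = None  # package not installed at all
        if got != want:
            bad[name] = (want, got)
    return bad

if __name__ == "__main__":
    for name, (want, got) in check_pins(EXPECTED).items():
        print(f"{name}: expected {want}, found {got}")
```

Run with .venv/bin/python; no output means every pin matches.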

Dataset

  • README source: https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip
  • Local archive currently present: data/pusht.zip
  • Unpacked dataset used by the config: data/pusht/pusht_cchi_v7_replay.zarr
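
If the zarr store is missing, the archive can be unpacked in place. A sketch assuming the zip expands to a pusht/ directory (as the path above suggests) and that the working directory is the repo root; the ensure_pusht_dataset helper name is mine:

```python
# Sketch: unpack data/pusht.zip unless the zarr store the config expects
# already exists. Assumes the archive contains a top-level pusht/ directory.
import zipfile
from pathlib import Path

def ensure_pusht_dataset(root="."):
    """Extract pusht.zip into data/ if the replay zarr store is missing."""
    root = Path(root)
    zarr_store = root / "data" / "pusht" / "pusht_cchi_v7_replay.zarr"
    if zarr_store.exists():
        return zarr_store  # already unpacked; nothing to do
    archive = root / "data" / "pusht.zip"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(root / "data")
    return zarr_store
```

Calling it twice is safe: the second call sees the existing store and returns immediately.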

Repo-Local Code Adjustments

  • diffusion_policy/env_runner/pusht_image_runner.py
    • switched PushT image evaluation from AsyncVectorEnv to SyncVectorEnv
  • diffusion_policy/gym_util/sync_vector_env.py
    • added reset_async
    • added seeded reset_wait
    • updated the argument order of the concatenate(...) call for the current gym version

These changes were needed to keep PushT evaluation working without the async shared-memory path.
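
The shape of that shim, shown independently of gym, is roughly the following. This is an illustrative sketch of the pattern only, not the repo's actual code: the class is a stand-in, and the method names simply mirror the gym vector-env reset_async/reset_wait API. reset_async only records pending seeds; reset_wait does the serial, seeded resets that a SyncVectorEnv can perform without shared memory:

```python
# Illustrative sketch of the reset_async / seeded reset_wait pattern:
# reset_async records seeds, reset_wait resets each env serially.
# Env objects here are stand-ins, not gym environments.
class MiniSyncVectorEnv:
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]
        self._pending_seeds = None

    def reset_async(self, seed=None):
        # Accept one seed for all envs, a list of per-env seeds, or None.
        if seed is None or isinstance(seed, int):
            seed = [seed] * len(self.envs)
        self._pending_seeds = seed

    def reset_wait(self):
        obs = []
        seeds = self._pending_seeds or [None] * len(self.envs)
        for env, s in zip(self.envs, seeds):
            if s is not None:
                env.seed(s)  # seed before reset, per-env
            obs.append(env.reset())
        self._pending_seeds = None
        return obs

    def reset(self, seed=None):
        self.reset_async(seed=seed)
        return self.reset_wait()
```

The split into async/wait keeps the caller-facing API identical to the async path while everything actually runs in-process.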

Validated GPU Smoke Command

This route is verified by the run artifacts in data/outputs/gpu_smoke2___pusht_gpu_smoke, which contain logs.json.txt plus checkpoints:

.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  training.debug=true \
  task.env_runner.n_test=2 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train=1 \
  task.env_runner.n_train_vis=0 \
  task.env_runner.max_steps=20

Practical Full Training Command Used Here

This matches the longer GPU run under data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42:

.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train_vis=0 \
  hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42

Why These Overrides Were Used

  • logging.mode=offline
    • avoids needing a W&B login and still leaves local run metadata in the output dir
  • dataloader.num_workers=0 and val_dataloader.num_workers=0
    • avoids extra multiprocessing on this host
  • task.env_runner.n_envs=1
    • keeps PushT eval on the serial SyncVectorEnv path
  • task.env_runner.n_test_vis=0 and task.env_runner.n_train_vis=0
    • avoids video-writing issues on this stack
    • one earlier GPU run with default vis settings logged libav/libx264 profile=high errors in data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log

Output Locations

  • Smoke run:
    • data/outputs/gpu_smoke2___pusht_gpu_smoke
  • Longer GPU run:
    • data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
  • Files to inspect inside a run:
    • .hydra/overrides.yaml
    • logs.json.txt
    • train.log
    • checkpoints/latest.ckpt
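
For a quick look at training progress, the last record of logs.json.txt can be pulled programmatically. This sketch assumes the file is JSON-lines (one JSON object per line, as the repo's logger appears to write it); the last_log_record helper name is mine:

```python
# Sketch: read the final record from a JSON-lines log such as logs.json.txt.
# Assumes one JSON object per line; blank lines are skipped.
import json
from pathlib import Path

def last_log_record(path):
    """Return the last non-empty line of the log parsed as JSON, or None."""
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None
```

For example, last_log_record("data/outputs/gpu_smoke2___pusht_gpu_smoke/logs.json.txt") returns the metrics dict from the final logged step.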

Known Caveats

  • The default config still assumes the original online, multi-process setup rather than this host's:
    • logging.mode=online
    • dataloader.num_workers=8
    • task.env_runner.n_envs=null
    • task.env_runner.n_test_vis=4
    • task.env_runner.n_train_vis=2
  • In this shell, torch.cuda.is_available() currently reports False even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.
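
A small pre-flight check for that last caveat, to run in the current session before launching. The cuda_status helper name is mine; it degrades gracefully if torch is not importable:

```python
# Sketch: quick CUDA visibility check to run before restarting a GPU run.
# Returns (available, device_count); (False, 0) if torch is absent.
import importlib

def cuda_status():
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return (False, 0)
    return (bool(torch.cuda.is_available()), torch.cuda.device_count())

if __name__ == "__main__":
    avail, count = cuda_status()
    print(f"cuda available: {avail}, devices: {count}")
```

If this prints cuda available: False, fix device visibility (driver, CUDA_VISIBLE_DEVICES, container runtime) before passing training.device=cuda:0.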