diffusion_policy/PUSHT_REPRO_5090.md

PushT Repro On 5090

Goal

Reproduce the canonical single-seed image PushT experiment from this repo in ~/diffusion_policy using image_pusht_diffusion_policy_cnn.yaml.

Current Verified Local Setup

  • Virtualenv: ./.venv managed with uv
  • Python: 3.9.25
  • Torch stack: torch 2.8.0+cu128, torchvision 0.23.0+cu128
  • Version strategy used here:
    • newer Torch/CUDA stack for current 5090-class hardware support
    • keep older repo-era packages where they are still required by the code
  • Verified key pins in .venv:
    • numpy 1.26.4
    • gym 0.23.1
    • hydra-core 1.2.0
    • diffusers 0.11.1
    • huggingface_hub 0.10.1
    • wandb 0.13.3
    • zarr 2.12.0
    • numcodecs 0.10.2
    • av 14.0.1
    • robomimic 0.2.0
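
One way to confirm these pins before a run is a quick check against installed package metadata. A minimal sketch using the standard-library importlib.metadata; the EXPECTED mapping simply restates the pins above, and the check_pins helper name is my own, not part of the repo:

```python
# Sketch: verify the packages installed in .venv match the pins listed above.
# EXPECTED restates the pins from this document; adjust if they drift.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {
    "numpy": "1.26.4",
    "gym": "0.23.1",
    "hydra-core": "1.2.0",
    "diffusers": "0.11.1",
    "huggingface_hub": "0.10.1",
    "wandb": "0.13.3",
    "zarr": "2.12.0",
    "numcodecs": "0.10.2",
    "av": "14.0.1",
    "robomimic": "0.2.0",
}

def check_pins(expected):
    """Return {name: (expected, found)} for every mismatched or missing pin."""
    bad = {}
    for name, want in expected.items():
        try:
            got = version(name)
        except PackageNotFoundError:
            got = None  # package not installed at all
        if got != want:
            bad[name] = (want, got)
    return bad

if __name__ == "__main__":
    for name, (want, got) in check_pins(EXPECTED).items():
        print(f"{name}: expected {want}, found {got}")
```

Run with .venv/bin/python; no output means every pin matches.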

Dataset

  • README source: https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip
  • Local archive currently present: data/pusht.zip
  • Unpacked dataset used by the config: data/pusht/pusht_cchi_v7_replay.zarr
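
If the zarr store is missing, the archive can be unpacked in place. A sketch assuming the zip expands to a pusht/ directory (as the path above suggests) and that the working directory is the repo root; the ensure_pusht_dataset helper name is mine:

```python
# Sketch: unpack data/pusht.zip unless the zarr store the config expects
# already exists. Assumes the archive contains a top-level pusht/ directory.
import zipfile
from pathlib import Path

def ensure_pusht_dataset(root="."):
    """Extract pusht.zip into data/ if the replay zarr store is missing."""
    root = Path(root)
    zarr_store = root / "data" / "pusht" / "pusht_cchi_v7_replay.zarr"
    if zarr_store.exists():
        return zarr_store  # already unpacked; nothing to do
    archive = root / "data" / "pusht.zip"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(root / "data")
    return zarr_store
```

Calling it twice is safe: the second call sees the existing store and returns immediately.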

Repo-Local Code Adjustments

  • diffusion_policy/env_runner/pusht_image_runner.py
    • switched PushT image evaluation from AsyncVectorEnv to SyncVectorEnv
  • diffusion_policy/gym_util/sync_vector_env.py
    • added reset_async
    • added seeded reset_wait
    • updated the argument order of the concatenate(...) call for the current gym version

These changes were needed to keep PushT evaluation working without the async shared-memory path.
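
The shape of that shim, shown independently of gym, is roughly the following. This is an illustrative sketch of the pattern only, not the repo's actual code: the class is a stand-in, and the method names simply mirror the gym vector-env reset_async/reset_wait API. reset_async only records pending seeds; reset_wait does the serial, seeded resets that a SyncVectorEnv can perform without shared memory:

```python
# Illustrative sketch of the reset_async / seeded reset_wait pattern:
# reset_async records seeds, reset_wait resets each env serially.
# Env objects here are stand-ins, not gym environments.
class MiniSyncVectorEnv:
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]
        self._pending_seeds = None

    def reset_async(self, seed=None):
        # Accept one seed for all envs, a list of per-env seeds, or None.
        if seed is None or isinstance(seed, int):
            seed = [seed] * len(self.envs)
        self._pending_seeds = seed

    def reset_wait(self):
        obs = []
        seeds = self._pending_seeds or [None] * len(self.envs)
        for env, s in zip(self.envs, seeds):
            if s is not None:
                env.seed(s)  # seed before reset, per-env
            obs.append(env.reset())
        self._pending_seeds = None
        return obs

    def reset(self, seed=None):
        self.reset_async(seed=seed)
        return self.reset_wait()
```

The split into async/wait keeps the caller-facing API identical to the async path while everything actually runs in-process.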

Validated GPU Smoke Command

This route is verified by the run artifacts in data/outputs/gpu_smoke2___pusht_gpu_smoke, which contain logs.json.txt plus checkpoints:

.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  training.debug=true \
  task.env_runner.n_test=2 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train=1 \
  task.env_runner.n_train_vis=0 \
  task.env_runner.max_steps=20

Practical Full Training Command Used Here

This matches the longer GPU run under data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42:

.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train_vis=0 \
  hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42

Why These Overrides Were Used

  • logging.mode=offline
    • avoids needing a W&B login and still leaves local run metadata in the output dir
  • dataloader.num_workers=0 and val_dataloader.num_workers=0
    • avoids extra multiprocessing on this host
  • task.env_runner.n_envs=1
    • keeps PushT eval on the serial SyncVectorEnv path
  • task.env_runner.n_test_vis=0 and task.env_runner.n_train_vis=0
    • avoids video-writing issues on this stack
    • one earlier GPU run with default vis settings logged libav/libx264 profile=high errors in data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log

Output Locations

  • Smoke run:
    • data/outputs/gpu_smoke2___pusht_gpu_smoke
  • Longer GPU run:
    • data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
  • Files to inspect inside a run:
    • .hydra/overrides.yaml
    • logs.json.txt
    • train.log
    • checkpoints/latest.ckpt
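
For a quick look at training progress, the last record of logs.json.txt can be pulled programmatically. This sketch assumes the file is JSON-lines (one JSON object per line, as the repo's logger appears to write it); the last_log_record helper name is mine:

```python
# Sketch: read the final record from a JSON-lines log such as logs.json.txt.
# Assumes one JSON object per line; blank lines are skipped.
import json
from pathlib import Path

def last_log_record(path):
    """Return the last non-empty line of the log parsed as JSON, or None."""
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None
```

For example, last_log_record("data/outputs/gpu_smoke2___pusht_gpu_smoke/logs.json.txt") returns the metrics dict from the final logged step.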

Known Caveats

  • The default config still assumes the original online, multi-process setup rather than this host's:
    • logging.mode=online
    • dataloader.num_workers=8
    • task.env_runner.n_envs=null
    • task.env_runner.n_test_vis=4
    • task.env_runner.n_train_vis=2
  • In this shell, torch.cuda.is_available() currently reports False even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.
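
A small pre-flight check for that last caveat, to run in the current session before launching. The cuda_status helper name is mine; it degrades gracefully if torch is not importable:

```python
# Sketch: quick CUDA visibility check to run before restarting a GPU run.
# Returns (available, device_count); (False, 0) if torch is absent.
import importlib

def cuda_status():
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return (False, 0)
    return (bool(torch.cuda.is_available()), torch.cuda.device_count())

if __name__ == "__main__":
    avail, count = cuda_status()
    print(f"cuda available: {avail}, devices: {count}")
```

If this prints cuda available: False, fix device visibility (driver, CUDA_VISIBLE_DEVICES, container runtime) before passing training.device=cuda:0.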