# PushT Repro On 5090

## Goal

Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.

## Current Verified Local Setup

- Virtualenv: `./.venv` managed with `uv`
- Python: `3.9.25`
- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Version strategy used here:
  - newer Torch/CUDA stack for current 5090-class hardware support
  - keep older repo-era packages where they are still required by the code
- Verified key pins in `.venv`:
  - `numpy 1.26.4`
  - `gym 0.23.1`
  - `hydra-core 1.2.0`
  - `diffusers 0.11.1`
  - `huggingface_hub 0.10.1`
  - `wandb 0.13.3`
  - `zarr 2.12.0`
  - `numcodecs 0.10.2`
  - `av 14.0.1`
  - `robomimic 0.2.0`

## Dataset

- README source: `https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip`
- Local archive currently present: `data/pusht.zip`
- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`

## Repo-Local Code Adjustments

- `diffusion_policy/env_runner/pusht_image_runner.py`
  - switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
- `diffusion_policy/gym_util/sync_vector_env.py`
  - added `reset_async`
  - added a seeded `reset_wait`
  - updated the `concatenate(...)` call order for the current `gym`

These changes were needed to keep PushT evaluation working without the async shared-memory path.

## Validated GPU Smoke Command

This route is verified by `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contains `logs.json.txt` plus checkpoints:

```bash
.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  training.debug=true \
  task.env_runner.n_test=2 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train=1 \
  task.env_runner.n_train_vis=0 \
  task.env_runner.max_steps=20
```

## Practical Full Training Command Used Here

This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:

```bash
.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train_vis=0 \
  hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
```

## Why These Overrides Were Used

- `logging.mode=offline`
  - avoids needing a W&B login and still leaves local run metadata in the output dir
- `dataloader.num_workers=0` and `val_dataloader.num_workers=0`
  - avoid extra multiprocessing on this host
- `task.env_runner.n_envs=1`
  - keeps PushT eval on the serial `SyncVectorEnv` path
- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0`
  - avoid video-writing issues on this stack
  - one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`

## Output Locations

- Smoke run:
  - `data/outputs/gpu_smoke2___pusht_gpu_smoke`
- Longer GPU run:
  - `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
- Files to inspect inside a run:
  - `.hydra/overrides.yaml`
  - `logs.json.txt`
  - `train.log`
  - `checkpoints/latest.ckpt`

## Known Caveats

- The default config is still tuned for older assumptions:
  - `logging.mode=online`
  - `dataloader.num_workers=8`
  - `task.env_runner.n_envs=null`
  - `task.env_runner.n_test_vis=4`
  - `task.env_runner.n_train_vis=2`
- In this shell, `torch.cuda.is_available()` currently reports `False` even though the repo contains validated GPU smoke/full-run artifacts. Re-check device visibility in the current session before restarting a GPU run.
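The device-visibility caveat above can be re-checked with a small sketch like the following, run via `.venv/bin/python` so it sees the same environment as training. The `CUDA_VISIBLE_DEVICES` interpretation is standard CUDA behavior; the helper name here is illustrative, not part of this repo:

```python
import os
import importlib.util

def describe_gpu_visibility():
    """Report the shell-level setting that most often hides GPUs from torch."""
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cvd == "":
        # An empty (but set) value hides every GPU from CUDA runtimes.
        return "CUDA_VISIBLE_DEVICES is set but empty: all GPUs hidden"
    if cvd is not None:
        return f"CUDA_VISIBLE_DEVICES={cvd}"
    return "CUDA_VISIBLE_DEVICES unset: torch will see all driver-visible GPUs"

# Only query torch if it is importable in the current interpreter,
# so the snippet also runs outside the repo venv without error.
if importlib.util.find_spec("torch") is not None:
    import torch
    print("cuda available:", torch.cuda.is_available())
print(describe_gpu_visibility())
```

If this prints `cuda available: False` in a fresh shell, fix visibility (driver session, `CUDA_VISIBLE_DEVICES`) before re-running either training command.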
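Of the files listed under Output Locations, `logs.json.txt` can be inspected without W&B. A minimal reader, assuming a JSON-Lines layout (one JSON object per line, as this repo's JSON logger appears to emit); the `train_loss` key is an assumption, so check a real log for the exact key names:

```python
import json
from pathlib import Path

def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON-Lines log file."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)

def last_value(path, key="train_loss"):
    """Return the most recent logged value for `key`, or None if absent."""
    value = None
    for record in read_json_lines(path):
        if key in record:
            value = record[key]
    return value
```

For example, `last_value("data/outputs/gpu_smoke2___pusht_gpu_smoke/logs.json.txt")` gives a quick sanity check that the smoke run's loss was decreasing without opening the file by hand.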