`diffusion_policy/PUSHT_REPRO_5090.md`

# PushT Repro On 5090
## Goal
Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.
## Current Verified Local Setup
- Virtualenv: `./.venv` managed with `uv`
- Python: `3.9.25`
- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Version strategy used here:
  - a newer Torch/CUDA stack for current 5090-class hardware support
  - older repo-era packages kept wherever the code still requires them
- Verified key pins in `.venv`:
  - `numpy 1.26.4`
  - `gym 0.23.1`
  - `hydra-core 1.2.0`
  - `diffusers 0.11.1`
  - `huggingface_hub 0.10.1`
  - `wandb 0.13.3`
  - `zarr 2.12.0`
  - `numcodecs 0.10.2`
  - `av 14.0.1`
  - `robomimic 0.2.0`
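The pins above can be spot-checked against the active virtualenv with a short stdlib-only script. This is a sketch; `check_pin` is a hypothetical helper, not part of the repo:

```python
# Spot-check installed package versions against the pins documented above.
# Stdlib-only; reports None for packages that are not installed at all.
from importlib.metadata import version, PackageNotFoundError

PINS = {
    "numpy": "1.26.4",
    "gym": "0.23.1",
    "hydra-core": "1.2.0",
    "diffusers": "0.11.1",
    "huggingface_hub": "0.10.1",
    "wandb": "0.13.3",
    "zarr": "2.12.0",
    "numcodecs": "0.10.2",
    "av": "14.0.1",
    "robomimic": "0.2.0",
}

def check_pin(name: str, expected: str):
    """Return (installed_version, matches_pin); (None, False) if absent."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return None, False
    return installed, installed == expected

for name, expected in PINS.items():
    installed, ok = check_pin(name, expected)
    status = "OK" if ok else f"MISMATCH (found {installed})"
    print(f"{name:<18} expected {expected:<8} -> {status}")
```

Run it with `.venv/bin/python` so it inspects the repro environment rather than the system Python.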
## Dataset
- README source: `https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip`
- Local archive currently present: `data/pusht.zip`
- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`
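Before launching a run it is worth confirming the unzipped store is where the config expects it. A zarr v2 directory store carries a top-level `.zgroup` metadata file, so a stdlib-only check is possible (the internal group layout is not assumed here; only the store root is verified):

```python
# Quick sanity check that the unpacked PushT dataset looks like a zarr store.
# Stdlib-only sketch: verifies the directory exists and has zarr group
# metadata, without importing zarr itself.
from pathlib import Path

def looks_like_zarr_store(path: str) -> bool:
    """True if `path` is a directory containing top-level zarr group metadata."""
    root = Path(path)
    return root.is_dir() and (root / ".zgroup").is_file()

print(looks_like_zarr_store("data/pusht/pusht_cchi_v7_replay.zarr"))
```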
## Repo-Local Code Adjustments
- `diffusion_policy/env_runner/pusht_image_runner.py`
  - switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
- `diffusion_policy/gym_util/sync_vector_env.py`
  - added `reset_async`
  - added seeded `reset_wait`
  - updated `concatenate(...)` call order for current `gym`
These changes were needed to keep PushT evaluation working without the async shared-memory path.
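The shape of the `reset_async` / seeded `reset_wait` addition can be sketched without `gym` at all. The toy class below is purely illustrative (the real change subclasses `gym`'s `SyncVectorEnv`); it only shows the control flow: `reset_async` stashes per-env seeds, and `reset_wait` applies them serially:

```python
# Illustrative stand-in for the SyncVectorEnv shim: the async-style API is
# kept, but resets run serially with the stored per-env seeds.
class ToySyncVectorEnv:
    def __init__(self, n_envs: int):
        self.n_envs = n_envs
        self._pending_seeds = None

    def reset_async(self, seed=None):
        # Mirror the AsyncVectorEnv API: just stash seeds for reset_wait.
        if seed is None or isinstance(seed, int):
            seed = [seed] * self.n_envs
        self._pending_seeds = seed

    def reset_wait(self):
        # Reset each sub-env serially with its stored seed.
        # (Here each "reset" just returns a tag plus the seed it received.)
        seeds = self._pending_seeds or [None] * self.n_envs
        self._pending_seeds = None
        return [("obs", s) for s in seeds]

env = ToySyncVectorEnv(2)
env.reset_async(seed=42)
print(env.reset_wait())  # both sub-envs reset with seed 42
```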
## Validated GPU Smoke Command
This route is verified by `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contains `logs.json.txt` plus checkpoints:
```bash
.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  training.debug=true \
  task.env_runner.n_test=2 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train=1 \
  task.env_runner.n_train_vis=0 \
  task.env_runner.max_steps=20
```
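To confirm a smoke run actually progressed, read the tail of its `logs.json.txt`. This sketch assumes the file holds one JSON object per line (as this repo's JSON logger appears to produce); `last_log_entry` is a hypothetical helper:

```python
# Return the final parseable JSON line of a logs.json.txt file, or None.
# Partial or truncated trailing lines (common if a run was killed) are skipped.
import json

def last_log_entry(path: str):
    last = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip garbage / half-written lines
    return last

# e.g.:
# last_log_entry("data/outputs/gpu_smoke2___pusht_gpu_smoke/logs.json.txt")
```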
## Practical Full Training Command Used Here
This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:
```bash
.venv/bin/python train.py \
  --config-dir=. \
  --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 \
  training.device=cuda:0 \
  logging.mode=offline \
  dataloader.num_workers=0 \
  val_dataloader.num_workers=0 \
  task.env_runner.n_envs=1 \
  task.env_runner.n_test_vis=0 \
  task.env_runner.n_train_vis=0 \
  hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
```
## Why These Overrides Were Used
- `logging.mode=offline`
  - avoids needing a W&B login while still leaving local run metadata in the output dir
- `dataloader.num_workers=0` and `val_dataloader.num_workers=0`
  - avoid extra multiprocessing on this host
- `task.env_runner.n_envs=1`
  - keeps PushT eval on the serial `SyncVectorEnv` path
- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0`
  - avoid video-writing issues on this stack
  - one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`
## Output Locations
- Smoke run:
  - `data/outputs/gpu_smoke2___pusht_gpu_smoke`
- Longer GPU run:
  - `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
- Files to inspect inside a run:
  - `.hydra/overrides.yaml`
  - `logs.json.txt`
  - `train.log`
  - `checkpoints/latest.ckpt`
## Known Caveats
- The default config still reflects the original repo's assumptions, so these defaults must be overridden on this host:
  - `logging.mode=online`
  - `dataloader.num_workers=8`
  - `task.env_runner.n_envs=null`
  - `task.env_runner.n_test_vis=4`
  - `task.env_runner.n_train_vis=2`
- In this shell, `torch.cuda.is_available()` currently reports `False` even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.
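When `torch.cuda.is_available()` flips to `False` in a fresh shell, one cheap thing to rule out before suspecting drivers is an empty or restrictive `CUDA_VISIBLE_DEVICES`. The stdlib helper below only inspects the environment variable; the definitive check still requires importing `torch` inside `.venv`:

```python
# Interpret CUDA_VISIBLE_DEVICES the way CUDA does: unset means all devices
# are visible, an empty string masks every device.
import os

def visible_cuda_devices(env=os.environ):
    """Return the device ids CUDA would see, or None if the var is unset."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: all devices visible
    return [d for d in (s.strip() for s in raw.split(",")) if d]

print(visible_cuda_devices({"CUDA_VISIBLE_DEVICES": ""}))   # [] -> GPU masked
print(visible_cuda_devices({"CUDA_VISIBLE_DEVICES": "0"}))  # ['0']
```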