chore(pusht): add 5090 repro docs and uv setup

Logic
2026-03-14 12:25:44 +08:00
parent 5ba07ac666
commit 08c1950c6d
6 changed files with 270 additions and 8 deletions

AGENTS.md Normal file

@@ -0,0 +1,68 @@
# Agent Notes
## Purpose
`~/diffusion_policy` is the Diffusion Policy training repo. The main workflow here is Hydra-driven training via `train.py`, with the canonical PushT image experiment configured by `image_pusht_diffusion_policy_cnn.yaml`.
## Top Level
- `diffusion_policy/`: core code, configs, datasets, env runners, workspaces.
- `data/`: local datasets, outputs, checkpoints, run logs.
- `train.py`: main training entrypoint.
- `eval.py`: checkpoint evaluation entrypoint.
- `image_pusht_diffusion_policy_cnn.yaml`: canonical single-seed PushT image config from the README path.
- `.venv/`: local `uv`-managed virtualenv.
- `.uv-cache/`, `.uv-python/`: local `uv` cache and Python install state.
- `README.md`: upstream instructions and canonical commands.
## Canonical PushT Image Path
- Entrypoint: `python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn.yaml`
- Dataset path in config: `data/pusht/pusht_cchi_v7_replay.zarr`
- README canonical device override: `training.device=cuda:0`
## Data
- PushT archive currently present at `data/pusht.zip`
- Unpacked dataset used by training: `data/pusht/pusht_cchi_v7_replay.zarr`
## Local Compatibility Adjustments
- `diffusion_policy/env_runner/pusht_image_runner.py` now uses `SyncVectorEnv` instead of `AsyncVectorEnv`.
  - Reason: avoids shared-memory and semaphore failures on this host/session.
- `diffusion_policy/gym_util/sync_vector_env.py` has local compatibility changes:
- added `reset_async`
- seeded `reset_wait`
- updated `concatenate(...)` call order for the current `gym` API
## Environment Expectations
- Use the local `uv` env at `.venv`
- Verified local Python: `3.9.25`
- Verified local Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Other key installed versions verified in `.venv`:
- `gym 0.23.1`
- `hydra-core 1.2.0`
- `diffusers 0.11.1`
- `huggingface_hub 0.10.1`
- `wandb 0.13.3`
- `zarr 2.12.0`
- `numcodecs 0.10.2`
- `av 14.0.1`
- Important note: this shell currently reports `torch.cuda.is_available() == False`, so always verify CUDA access in the current session before assuming GPU is usable.
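A minimal pre-flight check along these lines can confirm device visibility in the current session before launching a run. The `pick_device` helper is illustrative, not part of this repo:

```python
# Illustrative session check; pick_device is a hypothetical helper,
# not a repo utility.
def pick_device(cuda_available: bool) -> str:
    # Mirror the README override: use cuda:0 only when the session sees a GPU.
    return "cuda:0" if cuda_available else "cpu"

try:
    import torch
    available = torch.cuda.is_available()
except ImportError:
    # torch missing usually means we are not inside the project .venv.
    available = False

print(f"suggested training.device override: {pick_device(available)}")
```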
## Logging And Outputs
- Hydra run outputs: `data/outputs/...`
- Per-run files to check first:
- `.hydra/overrides.yaml`
- `logs.json.txt`
- `train.log`
- `checkpoints/latest.ckpt`
- Extra launcher logs may live under `data/run_logs/`
## Practical Guidance
- Inspect with `rg`, `sed`, and existing Hydra output folders before changing code.
- Prefer config overrides before code edits.
- On this host, start from these safety overrides unless revalidated:
- `logging.mode=offline`
- `dataloader.num_workers=0`
- `val_dataloader.num_workers=0`
- `task.env_runner.n_envs=1`
- `task.env_runner.n_test_vis=0`
- `task.env_runner.n_train_vis=0`
- If a run fails, inspect `.hydra/overrides.yaml`, then `logs.json.txt`, then `train.log`.
- Avoid driver or system changes unless the repo-local path is clearly blocked.
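As a sketch of that debugging order, assuming `logs.json.txt` is JSON-lines (one JSON record per line, as this repo's JSON logger writes it), a small helper can pull the most recent record from a run directory:

```python
import json
from pathlib import Path

def last_log_entry(run_dir):
    """Return the last record of <run_dir>/logs.json.txt, or None if absent.

    Assumes the JSON-lines layout this repo's logger writes; a sketch,
    not a repo utility.
    """
    log_path = Path(run_dir) / "logs.json.txt"
    if not log_path.exists():
        return None
    lines = [line for line in log_path.read_text().splitlines() if line.strip()]
    return json.loads(lines[-1]) if lines else None
```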

PUSHT_REPRO_5090.md Normal file

@@ -0,0 +1,108 @@
# PushT Repro On 5090
## Goal
Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.
## Current Verified Local Setup
- Virtualenv: `./.venv` managed with `uv`
- Python: `3.9.25`
- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Version strategy used here:
- newer Torch/CUDA stack for current 5090-class hardware support
- keep older repo-era packages where they are still required by the code
- Verified key pins in `.venv`:
- `numpy 1.26.4`
- `gym 0.23.1`
- `hydra-core 1.2.0`
- `diffusers 0.11.1`
- `huggingface_hub 0.10.1`
- `wandb 0.13.3`
- `zarr 2.12.0`
- `numcodecs 0.10.2`
- `av 14.0.1`
- `robomimic 0.2.0`
## Dataset
- README source: `https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip`
- Local archive currently present: `data/pusht.zip`
- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`
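A quick presence check for the unpacked store can be sketched as follows. It relies on the fact that a zarr directory store marks its root group with a `.zgroup` file; the helper name is hypothetical:

```python
from pathlib import Path

def pusht_dataset_ready(repo_root="."):
    """Heuristic check that the unpacked PushT zarr store exists.

    A zarr DirectoryStore is a plain directory whose root group is marked
    by a .zgroup file; this is a sketch, not a full dataset validation.
    """
    store = Path(repo_root) / "data" / "pusht" / "pusht_cchi_v7_replay.zarr"
    return store.is_dir() and (store / ".zgroup").is_file()
```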
## Repo-Local Code Adjustments
- `diffusion_policy/env_runner/pusht_image_runner.py`
- switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
- `diffusion_policy/gym_util/sync_vector_env.py`
- added `reset_async`
- added seeded `reset_wait`
- updated `concatenate(...)` call order for current `gym`
These changes were needed to keep PushT evaluation working without the async shared-memory path.
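The seeding semantics the patched `reset_async` implements can be sketched standalone: an `int` seed fans out to consecutive per-env seeds, a sequence is used as-is, and `None` leaves every env unseeded. The `expand_seeds` name is hypothetical, mirroring the patch logic:

```python
def expand_seeds(seed, num_envs):
    """Mirror the seed handling added to SyncVectorEnv.reset_async:
    int -> consecutive per-env seeds, sequence -> used as given,
    None -> all envs left unseeded."""
    if seed is None:
        seeds = [None] * num_envs
    elif isinstance(seed, int):
        seeds = [seed + i for i in range(num_envs)]
    else:
        seeds = list(seed)
    assert len(seeds) == num_envs
    return seeds
```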
## Validated GPU Smoke Command
This command is validated by the artifacts in `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contain `logs.json.txt` plus checkpoints:
```bash
.venv/bin/python train.py \
--config-dir=. \
--config-name=image_pusht_diffusion_policy_cnn.yaml \
training.seed=42 \
training.device=cuda:0 \
logging.mode=offline \
dataloader.num_workers=0 \
val_dataloader.num_workers=0 \
task.env_runner.n_envs=1 \
training.debug=true \
task.env_runner.n_test=2 \
task.env_runner.n_test_vis=0 \
task.env_runner.n_train=1 \
task.env_runner.n_train_vis=0 \
task.env_runner.max_steps=20
```
## Practical Full Training Command Used Here
This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:
```bash
.venv/bin/python train.py \
--config-dir=. \
--config-name=image_pusht_diffusion_policy_cnn.yaml \
training.seed=42 \
training.device=cuda:0 \
logging.mode=offline \
dataloader.num_workers=0 \
val_dataloader.num_workers=0 \
task.env_runner.n_envs=1 \
task.env_runner.n_test_vis=0 \
task.env_runner.n_train_vis=0 \
hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
```
## Why These Overrides Were Used
- `logging.mode=offline`
- avoids needing a W&B login and still leaves local run metadata in the output dir
- `dataloader.num_workers=0` and `val_dataloader.num_workers=0`
- avoids extra multiprocessing on this host
- `task.env_runner.n_envs=1`
- keeps PushT eval on the serial `SyncVectorEnv` path
- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0`
- avoids video-writing issues on this stack
- one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`
## Output Locations
- Smoke run:
- `data/outputs/gpu_smoke2___pusht_gpu_smoke`
- Longer GPU run:
- `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
- Files to inspect inside a run:
- `.hydra/overrides.yaml`
- `logs.json.txt`
- `train.log`
- `checkpoints/latest.ckpt`
## Known Caveats
- The default config is still tuned for older assumptions:
- `logging.mode=online`
- `dataloader.num_workers=8`
- `task.env_runner.n_envs=null`
- `task.env_runner.n_test_vis=4`
- `task.env_runner.n_train_vis=2`
- In this shell, `torch.cuda.is_available()` currently reports `False` even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.

diffusion_policy/env_runner/pusht_image_runner.py

@@ -8,8 +8,7 @@
 import dill
 import math
 import wandb.sdk.data_types.video as wv
 from diffusion_policy.env.pusht.pusht_image_env import PushTImageEnv
-from diffusion_policy.gym_util.async_vector_env import AsyncVectorEnv
-# from diffusion_policy.gym_util.sync_vector_env import SyncVectorEnv
+from diffusion_policy.gym_util.sync_vector_env import SyncVectorEnv
 from diffusion_policy.gym_util.multistep_wrapper import MultiStepWrapper
 from diffusion_policy.gym_util.video_recording_wrapper import VideoRecordingWrapper, VideoRecorder
@@ -121,7 +120,9 @@ class PushTImageRunner(BaseImageRunner):
             env_prefixs.append('test/')
             env_init_fn_dills.append(dill.dumps(init_fn))
-        env = AsyncVectorEnv(env_fns)
+        # This environment can run without multiprocessing, which avoids
+        # shared-memory and semaphore restrictions on some machines.
+        env = SyncVectorEnv(env_fns)
         # test env
         # env.reset(seed=env_seeds)

diffusion_policy/gym_util/sync_vector_env.py

@@ -60,17 +60,44 @@ class SyncVectorEnv(VectorEnv):
         for env, seed in zip(self.envs, seeds):
             env.seed(seed)

-    def reset_wait(self):
+    def reset_async(self, seed=None, return_info=False, options=None):
+        if seed is None:
+            seeds = [None for _ in range(self.num_envs)]
+        elif isinstance(seed, int):
+            seeds = [seed + i for i in range(self.num_envs)]
+        else:
+            seeds = list(seed)
+        assert len(seeds) == self.num_envs
+        self._reset_seeds = seeds
+        self._reset_return_info = return_info
+        self._reset_options = options
+
+    def reset_wait(self, seed=None, return_info=False, options=None):
+        seeds = getattr(self, '_reset_seeds', None)
+        if seeds is None:
+            if seed is None:
+                seeds = [None for _ in range(self.num_envs)]
+            elif isinstance(seed, int):
+                seeds = [seed + i for i in range(self.num_envs)]
+            else:
+                seeds = list(seed)
         self._dones[:] = False
         observations = []
-        for env in self.envs:
+        infos = []
+        for env, seed_i in zip(self.envs, seeds):
+            if seed_i is not None:
+                env.seed(seed_i)
             observation = env.reset()
             observations.append(observation)
+            infos.append({})
         self.observations = concatenate(
-            observations, self.observations, self.single_observation_space
+            self.single_observation_space, observations, self.observations
         )
-        return deepcopy(self.observations) if self.copy else self.observations
+        obs = deepcopy(self.observations) if self.copy else self.observations
+        if return_info:
+            return obs, infos
+        return obs

     def step_async(self, actions):
         self._actions = actions
@@ -84,7 +111,7 @@ class SyncVectorEnv(VectorEnv):
             observations.append(observation)
             infos.append(info)
         self.observations = concatenate(
-            observations, self.observations, self.single_observation_space
+            self.single_observation_space, observations, self.observations
         )
         return (

requirements-pusht-5090.txt Normal file

@@ -0,0 +1,38 @@
# Direct package pins for the canonical PushT image workflow on host 5090.
# Torch/TorchVision/Torchaudio are installed separately from the cu128 index in setup_uv_pusht_5090.sh.
numpy==1.26.4
scipy==1.11.4
numba==0.59.1
llvmlite==0.42.0
cffi==1.15.1
cython==0.29.32
h5py==3.8.0
pandas==2.2.3
zarr==2.12.0
numcodecs==0.10.2
hydra-core==1.2.0
einops==0.4.1
tqdm==4.64.1
dill==0.3.5.1
scikit-video==1.1.11
scikit-image==0.19.3
gym==0.23.1
pymunk==6.2.1
wandb==0.13.3
threadpoolctl==3.1.0
shapely==1.8.5.post1
imageio==2.22.0
imageio-ffmpeg==0.4.7
termcolor==2.0.1
tensorboard==2.10.1
tensorboardx==2.5.1
psutil==7.2.2
click==8.1.8
boto3==1.24.96
diffusers==0.11.1
huggingface-hub==0.10.1
av==14.0.1
pygame==2.5.2
robomimic==0.2.0
opencv-python-headless==4.10.0.84

setup_uv_pusht_5090.sh Executable file

@@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$ROOT_DIR"
export UV_CACHE_DIR="${UV_CACHE_DIR:-$ROOT_DIR/.uv-cache}"
export UV_PYTHON_INSTALL_DIR="${UV_PYTHON_INSTALL_DIR:-$ROOT_DIR/.uv-python}"
uv venv --python 3.9 .venv
source .venv/bin/activate
uv pip install --upgrade pip wheel setuptools==80.9.0
uv pip install --python .venv/bin/python \
--index-url https://download.pytorch.org/whl/cu128 \
torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128
uv pip install --python .venv/bin/python -r requirements-pusht-5090.txt
uv pip install --python .venv/bin/python -e .
echo "uv environment ready at $ROOT_DIR/.venv"