chore(pusht): add 5090 repro docs and uv setup

AGENTS.md (new file, 68 lines)
@@ -0,0 +1,68 @@

# Agent Notes

## Purpose

`~/diffusion_policy` is the Diffusion Policy training repo. The main workflow here is Hydra-driven training via `train.py`, with the canonical PushT image experiment configured by `image_pusht_diffusion_policy_cnn.yaml`.

## Top Level

- `diffusion_policy/`: core code, configs, datasets, env runners, workspaces.
- `data/`: local datasets, outputs, checkpoints, run logs.
- `train.py`: main training entrypoint.
- `eval.py`: checkpoint evaluation entrypoint.
- `image_pusht_diffusion_policy_cnn.yaml`: canonical single-seed PushT image config from the README path.
- `.venv/`: local `uv`-managed virtualenv.
- `.uv-cache/`, `.uv-python/`: local `uv` cache and Python install state.
- `README.md`: upstream instructions and canonical commands.

## Canonical PushT Image Path

- Entrypoint: `python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn.yaml`
- Dataset path in config: `data/pusht/pusht_cchi_v7_replay.zarr`
- README canonical device override: `training.device=cuda:0`

## Data

- PushT archive currently present at `data/pusht.zip`
- Unpacked dataset used by training: `data/pusht/pusht_cchi_v7_replay.zarr`
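
If the zarr store is missing, the only setup step is unpacking the archive. A minimal sketch; the helper name `ensure_pusht_dataset` and the assumption that the zip's members live under `pusht/` are ours, not the repo's:

```python
import pathlib
import zipfile

def ensure_pusht_dataset(zip_path="data/pusht.zip",
                         out_dir="data",
                         marker="data/pusht/pusht_cchi_v7_replay.zarr"):
    """Unpack the PushT archive unless the zarr store already exists."""
    if pathlib.Path(marker).exists():
        return marker
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)  # assumes archive members live under pusht/
    return marker
```

Calling it twice is safe: the second call sees the marker and skips extraction.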

## Local Compatibility Adjustments

- `diffusion_policy/env_runner/pusht_image_runner.py` now uses `SyncVectorEnv` instead of `AsyncVectorEnv`.
  - Reason: avoids shared-memory and semaphore failures on this host/session.
- `diffusion_policy/gym_util/sync_vector_env.py` has local compatibility changes:
  - added `reset_async`
  - added a seeded `reset_wait`
  - updated the `concatenate(...)` argument order for the current `gym` API

## Environment Expectations

- Use the local `uv` env at `.venv`
- Verified local Python: `3.9.25`
- Verified local Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Other key installed versions verified in `.venv`:
  - `gym 0.23.1`
  - `hydra-core 1.2.0`
  - `diffusers 0.11.1`
  - `huggingface_hub 0.10.1`
  - `wandb 0.13.3`
  - `zarr 2.12.0`
  - `numcodecs 0.10.2`
  - `av 14.0.1`
- Important note: this shell currently reports `torch.cuda.is_available() == False`, so always verify CUDA access in the current session before assuming the GPU is usable.
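
A quick session check that degrades gracefully when `torch` is not importable; the helper name `gpu_available` is made up for this note:

```python
import importlib

def gpu_available():
    """Return True only if torch imports and reports a usable CUDA device."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return False
    return bool(torch.cuda.is_available())
```

Run it (or the equivalent `python -c` one-liner) before launching anything with `training.device=cuda:0`.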

## Logging And Outputs

- Hydra run outputs: `data/outputs/...`
- Per-run files to check first:
  - `.hydra/overrides.yaml`
  - `logs.json.txt`
  - `train.log`
  - `checkpoints/latest.ckpt`
- Extra launcher logs may live under `data/run_logs/`
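
`logs.json.txt` is written as one JSON object per line, so the latest metrics can be pulled with the stdlib alone. A sketch, assuming only that each non-empty line parses as JSON:

```python
import json
from pathlib import Path

def last_log_entry(run_dir):
    """Return the final parsed line of logs.json.txt, or None if absent/empty."""
    path = Path(run_dir) / "logs.json.txt"
    if not path.is_file():
        return None
    entry = None
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                entry = json.loads(line)
    return entry
```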

## Practical Guidance

- Inspect with `rg`, `sed`, and existing Hydra output folders before changing code.
- Prefer config overrides over code edits.
- On this host, start from these safety overrides unless revalidated:
  - `logging.mode=offline`
  - `dataloader.num_workers=0`
  - `val_dataloader.num_workers=0`
  - `task.env_runner.n_envs=1`
  - `task.env_runner.n_test_vis=0`
  - `task.env_runner.n_train_vis=0`
- If a run fails, inspect `.hydra/overrides.yaml`, then `logs.json.txt`, then `train.log`.
- Avoid driver or system changes unless the repo-local path is clearly blocked.
PUSHT_REPRO_5090.md (new file, 108 lines)
@@ -0,0 +1,108 @@

# PushT Repro On 5090

## Goal

Reproduce the canonical single-seed image PushT experiment from this repo in `~/diffusion_policy` using `image_pusht_diffusion_policy_cnn.yaml`.

## Current Verified Local Setup

- Virtualenv: `./.venv` managed with `uv`
- Python: `3.9.25`
- Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Version strategy used here:
  - newer Torch/CUDA stack for current 5090-class hardware support
  - keep older repo-era packages where they are still required by the code
- Verified key pins in `.venv`:
  - `numpy 1.26.4`
  - `gym 0.23.1`
  - `hydra-core 1.2.0`
  - `diffusers 0.11.1`
  - `huggingface_hub 0.10.1`
  - `wandb 0.13.3`
  - `zarr 2.12.0`
  - `numcodecs 0.10.2`
  - `av 14.0.1`
  - `robomimic 0.2.0`

## Dataset

- README source: `https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip`
- Local archive currently present: `data/pusht.zip`
- Unpacked dataset used by the config: `data/pusht/pusht_cchi_v7_replay.zarr`

## Repo-Local Code Adjustments

- `diffusion_policy/env_runner/pusht_image_runner.py`
  - switched PushT image evaluation from `AsyncVectorEnv` to `SyncVectorEnv`
- `diffusion_policy/gym_util/sync_vector_env.py`
  - added `reset_async`
  - added a seeded `reset_wait`
  - updated the `concatenate(...)` argument order for current `gym`

These changes were needed to keep PushT evaluation working without the async shared-memory path.
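
The switch matters because a synchronous vector env is just an in-process loop over its member envs: nothing crosses a process boundary, so no shared memory or semaphores are involved. A toy illustration; the class below is a hypothetical miniature, not the repo's implementation:

```python
class TinySyncVectorEnv:
    """Serial stand-in for a vectorized env: plain loops, no multiprocessing."""

    def __init__(self, env_fns):
        # Construct every member env in this process.
        self.envs = [fn() for fn in env_fns]

    def reset(self, seeds=None):
        seeds = seeds or [None] * len(self.envs)
        for env, seed in zip(self.envs, seeds):
            if seed is not None:
                env.seed(seed)
        return [env.reset() for env in self.envs]

    def step(self, actions):
        # One sequential step per env; no workers, no IPC.
        return [env.step(a) for env, a in zip(self.envs, actions)]
```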

## Validated GPU Smoke Command

This route is verified by `data/outputs/gpu_smoke2___pusht_gpu_smoke`, which contains `logs.json.txt` plus checkpoints:

```bash
.venv/bin/python train.py \
    --config-dir=. \
    --config-name=image_pusht_diffusion_policy_cnn.yaml \
    training.seed=42 \
    training.device=cuda:0 \
    logging.mode=offline \
    dataloader.num_workers=0 \
    val_dataloader.num_workers=0 \
    task.env_runner.n_envs=1 \
    training.debug=true \
    task.env_runner.n_test=2 \
    task.env_runner.n_test_vis=0 \
    task.env_runner.n_train=1 \
    task.env_runner.n_train_vis=0 \
    task.env_runner.max_steps=20
```

## Practical Full Training Command Used Here

This matches the longer GPU run under `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`:

```bash
.venv/bin/python train.py \
    --config-dir=. \
    --config-name=image_pusht_diffusion_policy_cnn.yaml \
    training.seed=42 \
    training.device=cuda:0 \
    logging.mode=offline \
    dataloader.num_workers=0 \
    val_dataloader.num_workers=0 \
    task.env_runner.n_envs=1 \
    task.env_runner.n_test_vis=0 \
    task.env_runner.n_train_vis=0 \
    hydra.run.dir=data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42
```

## Why These Overrides Were Used

- `logging.mode=offline`
  - avoids needing a W&B login while still leaving local run metadata in the output dir
- `dataloader.num_workers=0` and `val_dataloader.num_workers=0`
  - avoid extra multiprocessing on this host
- `task.env_runner.n_envs=1`
  - keeps PushT eval on the serial `SyncVectorEnv` path
- `task.env_runner.n_test_vis=0` and `task.env_runner.n_train_vis=0`
  - avoid video-writing issues on this stack
  - one earlier GPU run with default vis settings logged libav/libx264 `profile=high` errors in `data/outputs/_train_diffusion_unet_hybrid_pusht_image_gpu_seed42/train.log`

## Output Locations

- Smoke run: `data/outputs/gpu_smoke2___pusht_gpu_smoke`
- Longer GPU run: `data/outputs/2026.03.13/15.37.00_train_diffusion_unet_hybrid_pusht_image_gpu_seed42`
- Files to inspect inside a run:
  - `.hydra/overrides.yaml`
  - `logs.json.txt`
  - `train.log`
  - `checkpoints/latest.ckpt`
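
A small helper to check a run directory for these files before digging in; the function name is ours, and the file list is exactly the one above:

```python
from pathlib import Path

EXPECTED_RUN_FILES = [
    ".hydra/overrides.yaml",
    "logs.json.txt",
    "train.log",
    "checkpoints/latest.ckpt",
]

def run_file_status(run_dir):
    """Map each expected per-run file to whether it exists under run_dir."""
    root = Path(run_dir)
    return {rel: (root / rel).is_file() for rel in EXPECTED_RUN_FILES}
```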

## Known Caveats

- The default config is still tuned for older assumptions:
  - `logging.mode=online`
  - `dataloader.num_workers=8`
  - `task.env_runner.n_envs=null`
  - `task.env_runner.n_test_vis=4`
  - `task.env_runner.n_train_vis=2`
- In this shell, `torch.cuda.is_available()` currently reports `False` even though the repo contains validated GPU smoke/full run artifacts. Re-check device visibility in the current session before restarting a GPU run.

diffusion_policy/env_runner/pusht_image_runner.py

@@ -8,8 +8,7 @@ import dill
 import math
 import wandb.sdk.data_types.video as wv
 from diffusion_policy.env.pusht.pusht_image_env import PushTImageEnv
-from diffusion_policy.gym_util.async_vector_env import AsyncVectorEnv
-# from diffusion_policy.gym_util.sync_vector_env import SyncVectorEnv
+from diffusion_policy.gym_util.sync_vector_env import SyncVectorEnv
 from diffusion_policy.gym_util.multistep_wrapper import MultiStepWrapper
 from diffusion_policy.gym_util.video_recording_wrapper import VideoRecordingWrapper, VideoRecorder
@@ -121,7 +120,9 @@ class PushTImageRunner(BaseImageRunner):
         env_prefixs.append('test/')
         env_init_fn_dills.append(dill.dumps(init_fn))

-        env = AsyncVectorEnv(env_fns)
+        # This environment can run without multiprocessing, which avoids
+        # shared-memory and semaphore restrictions on some machines.
+        env = SyncVectorEnv(env_fns)

         # test env
         # env.reset(seed=env_seeds)

diffusion_policy/gym_util/sync_vector_env.py

@@ -60,17 +60,44 @@ class SyncVectorEnv(VectorEnv):
         for env, seed in zip(self.envs, seeds):
             env.seed(seed)

-    def reset_wait(self):
+    def reset_async(self, seed=None, return_info=False, options=None):
+        if seed is None:
+            seeds = [None for _ in range(self.num_envs)]
+        elif isinstance(seed, int):
+            seeds = [seed + i for i in range(self.num_envs)]
+        else:
+            seeds = list(seed)
+        assert len(seeds) == self.num_envs
+        self._reset_seeds = seeds
+        self._reset_return_info = return_info
+        self._reset_options = options
+
+    def reset_wait(self, seed=None, return_info=False, options=None):
+        seeds = getattr(self, '_reset_seeds', None)
+        if seeds is None:
+            if seed is None:
+                seeds = [None for _ in range(self.num_envs)]
+            elif isinstance(seed, int):
+                seeds = [seed + i for i in range(self.num_envs)]
+            else:
+                seeds = list(seed)
         self._dones[:] = False
         observations = []
-        for env in self.envs:
+        infos = []
+        for env, seed_i in zip(self.envs, seeds):
+            if seed_i is not None:
+                env.seed(seed_i)
             observation = env.reset()
             observations.append(observation)
+            infos.append({})
         self.observations = concatenate(
-            observations, self.observations, self.single_observation_space
+            self.single_observation_space, observations, self.observations
         )
-        return deepcopy(self.observations) if self.copy else self.observations
+        obs = deepcopy(self.observations) if self.copy else self.observations
+        if return_info:
+            return obs, infos
+        return obs

     def step_async(self, actions):
         self._actions = actions
@@ -84,7 +111,7 @@ class SyncVectorEnv(VectorEnv):
             observations.append(observation)
             infos.append(info)
         self.observations = concatenate(
-            observations, self.observations, self.single_observation_space
+            self.single_observation_space, observations, self.observations
         )

         return (
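
The seed handling added to `reset_async` and `reset_wait` follows the gym convention of fanning a single int out to per-env seeds. Extracted as a standalone function for clarity; the name `expand_seeds` is ours, not the repo's:

```python
def expand_seeds(seed, num_envs):
    """Expand None / int / sequence into a per-env seed list, gym-style."""
    if seed is None:
        return [None] * num_envs
    if isinstance(seed, int):
        # A single int seeds env i with seed + i, keeping envs decorrelated
        # while the whole rollout stays reproducible from one number.
        return [seed + i for i in range(num_envs)]
    seeds = list(seed)
    assert len(seeds) == num_envs, "need one seed per env"
    return seeds
```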

requirements-pusht-5090.txt (new file, 38 lines)
@@ -0,0 +1,38 @@

# Direct package pins for the canonical PushT image workflow on host 5090.
# Torch/TorchVision/Torchaudio are installed separately from the cu128 index in setup_uv_pusht_5090.sh.

numpy==1.26.4
scipy==1.11.4
numba==0.59.1
llvmlite==0.42.0
cffi==1.15.1
cython==0.29.32
h5py==3.8.0
pandas==2.2.3
zarr==2.12.0
numcodecs==0.10.2
hydra-core==1.2.0
einops==0.4.1
tqdm==4.64.1
dill==0.3.5.1
scikit-video==1.1.11
scikit-image==0.19.3
gym==0.23.1
pymunk==6.2.1
wandb==0.13.3
threadpoolctl==3.1.0
shapely==1.8.5.post1
imageio==2.22.0
imageio-ffmpeg==0.4.7
termcolor==2.0.1
tensorboard==2.10.1
tensorboardx==2.5.1
psutil==7.2.2
click==8.1.8
boto3==1.24.96
diffusers==0.11.1
huggingface-hub==0.10.1
av==14.0.1
pygame==2.5.2
robomimic==0.2.0
opencv-python-headless==4.10.0.84
setup_uv_pusht_5090.sh (new executable file, 20 lines)
@@ -0,0 +1,20 @@

#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$ROOT_DIR"

export UV_CACHE_DIR="${UV_CACHE_DIR:-$ROOT_DIR/.uv-cache}"
export UV_PYTHON_INSTALL_DIR="${UV_PYTHON_INSTALL_DIR:-$ROOT_DIR/.uv-python}"

uv venv --python 3.9 .venv
source .venv/bin/activate

uv pip install --upgrade pip wheel setuptools==80.9.0
uv pip install --python .venv/bin/python \
    --index-url https://download.pytorch.org/whl/cu128 \
    torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128
uv pip install --python .venv/bin/python -r requirements-pusht-5090.txt
uv pip install --python .venv/bin/python -e .

echo "uv environment ready at $ROOT_DIR/.venv"