chore(pusht): add 5090 repro docs and uv setup
# Agent Notes

## Purpose
`~/diffusion_policy` is the Diffusion Policy training repo. The main workflow here is Hydra-driven training via `train.py`, with the canonical PushT image experiment configured by `image_pusht_diffusion_policy_cnn.yaml`.
## Top Level

- `diffusion_policy/`: core code, configs, datasets, env runners, workspaces.
- `data/`: local datasets, outputs, checkpoints, run logs.
- `train.py`: main training entrypoint.
- `eval.py`: checkpoint evaluation entrypoint.
- `image_pusht_diffusion_policy_cnn.yaml`: canonical single-seed PushT image config from the README path.
- `.venv/`: local `uv`-managed virtualenv.
- `.uv-cache/`, `.uv-python/`: local `uv` cache and Python install state.
- `README.md`: upstream instructions and canonical commands.
## Canonical PushT Image Path

- Entrypoint: `python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn.yaml`
- Dataset path in config: `data/pusht/pusht_cchi_v7_replay.zarr`
- README canonical device override: `training.device=cuda:0`
## Data

- PushT archive currently present at `data/pusht.zip`
- Unpacked dataset used by training: `data/pusht/pusht_cchi_v7_replay.zarr`
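A small pre-flight sketch tying these two paths together: check for the unpacked zarr and extract the archive when it is missing. The helper name and the assumption that the zip unpacks to `data/pusht/` are ours, not repo code.

```python
import zipfile
from pathlib import Path

def ensure_pusht_dataset(repo_root: str = ".") -> Path:
    """Return the dataset path, extracting data/pusht.zip first if needed."""
    root = Path(repo_root)
    dataset = root / "data" / "pusht" / "pusht_cchi_v7_replay.zarr"
    archive = root / "data" / "pusht.zip"
    if not dataset.exists() and archive.exists():
        # Assumption: the archive's top-level folder is pusht/.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root / "data")
    return dataset
```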
## Local Compatibility Adjustments

- `diffusion_policy/env_runner/pusht_image_runner.py` now uses `SyncVectorEnv` instead of `AsyncVectorEnv`, to avoid shared-memory and semaphore failures on this host/session.
- `diffusion_policy/gym_util/sync_vector_env.py` has local compatibility changes:
  - added `reset_async`
  - seeded `reset_wait`
  - updated the `concatenate(...)` argument order for the current `gym` API
## Environment Expectations

- Use the local `uv` env at `.venv`
- Verified local Python: `3.9.25`
- Verified local Torch stack: `torch 2.8.0+cu128`, `torchvision 0.23.0+cu128`
- Other key installed versions verified in `.venv`:
  - `gym 0.23.1`
  - `hydra-core 1.2.0`
  - `diffusers 0.11.1`
  - `huggingface_hub 0.10.1`
  - `wandb 0.13.3`
  - `zarr 2.12.0`
  - `numcodecs 0.10.2`
  - `av 14.0.1`
- Important note: this shell currently reports `torch.cuda.is_available() == False`, so always verify CUDA access in the current session before assuming the GPU is usable.
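Given that note, it is worth re-checking CUDA in every new session before a GPU launch. A minimal sketch (`cuda_ok` is our illustrative name) that safely reports `False` when `torch` is not even importable:

```python
import importlib.util

def cuda_ok() -> bool:
    """True only if torch imports and CUDA is usable in this session."""
    if importlib.util.find_spec("torch") is None:  # torch not installed
        return False
    import torch
    return torch.cuda.is_available()

print(cuda_ok())
```

If this prints `False`, stay on `training.device=cpu` or fix the session rather than assuming `cuda:0` will work.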
## Logging And Outputs

- Hydra run outputs: `data/outputs/...`
- Per-run files to check first:
  - `.hydra/overrides.yaml`
  - `logs.json.txt`
  - `train.log`
  - `checkpoints/latest.ckpt`
- Extra launcher logs may live under `data/run_logs/`
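When triaging a run, the JSON log is the easiest file to inspect programmatically. A minimal sketch, assuming `logs.json.txt` holds one JSON object per line (the helper name is ours):

```python
import json
from pathlib import Path

def tail_log(path: str, n: int = 5) -> list:
    """Return the last n parsed records from a one-JSON-object-per-line log."""
    records = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records[-n:]
```

Point it at a run's `logs.json.txt` under `data/outputs/` to see the most recent metrics before opening `train.log`.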
## Practical Guidance

- Inspect with `rg`, `sed`, and existing Hydra output folders before changing code.
- Prefer config overrides over code edits.
- On this host, start from these safety overrides unless revalidated:
  - `logging.mode=offline`
  - `dataloader.num_workers=0`
  - `val_dataloader.num_workers=0`
  - `task.env_runner.n_envs=1`
  - `task.env_runner.n_test_vis=0`
  - `task.env_runner.n_train_vis=0`
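The canonical entrypoint plus these overrides is long enough to mistype, so assembling the argv mechanically can help. A sketch only: `SAFETY_OVERRIDES` and `train_command` are our illustrative names, not repo code.

```python
# Safety overrides from this doc, as a reusable list.
SAFETY_OVERRIDES = [
    "logging.mode=offline",
    "dataloader.num_workers=0",
    "val_dataloader.num_workers=0",
    "task.env_runner.n_envs=1",
    "task.env_runner.n_test_vis=0",
    "task.env_runner.n_train_vis=0",
]

def train_command(device: str = "cuda:0", extra: tuple = ()) -> list:
    """Build the canonical PushT launch argv with the safety overrides."""
    return [
        "python", "train.py",
        "--config-dir=.",
        "--config-name=image_pusht_diffusion_policy_cnn.yaml",
        f"training.device={device}",
        *SAFETY_OVERRIDES,
        *extra,
    ]

print(" ".join(train_command()))
```

Pass the result to `subprocess.run(train_command())` from the repo root with the `.venv` active, or copy the printed line into the shell.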
- If a run fails, inspect `.hydra/overrides.yaml`, then `logs.json.txt`, then `train.log`.
- Avoid driver or system changes unless the repo-local path is clearly blocked.