feat(train): get the training script running end-to-end
# ResNet VLA Training Guide

This guide explains how to train the VLA agent with a ResNet backbone and `action_dim=16`, `obs_dim=16`.

## Configuration Overview

### 1. Backbone Configuration
**File**: `roboimi/vla/conf/backbone/resnet.yaml`
- Model: microsoft/resnet-18
- Output dim: 1024 (512 channels × 2 from SpatialSoftmax)
- Frozen by default for faster training

### 2. Agent Configuration
**File**: `roboimi/vla/conf/agent/resnet_diffusion.yaml`
- Vision backbone: ResNet-18 with SpatialSoftmax
- Action dimension: 16
- Observation dimension: 16
- Prediction horizon: 16 steps
- Observation horizon: 2 steps
- Diffusion steps: 100
- Number of cameras: 2

### 3. Dataset Configuration
**File**: `roboimi/vla/conf/data/resnet_dataset.yaml`
- Dataset class: RobotDiffusionDataset
- Prediction horizon: 16
- Observation horizon: 2
- Camera names: [r_vis, top]
- Normalization: gaussian (mean/std)
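
Gaussian normalization standardizes each of the 16 dimensions using the mean/std stored in `data_stats.pkl`. A minimal sketch (function names are illustrative, not from the codebase):

```python
import numpy as np

def normalize(x, stats):
    """Standardize each dimension using dataset mean/std."""
    return (x - stats["mean"]) / (stats["std"] + 1e-8)

def denormalize(x, stats):
    """Map normalized values back to the original scale."""
    return x * (stats["std"] + 1e-8) + stats["mean"]
```

The small epsilon guards against zero-variance dimensions (e.g. a joint that never moves).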

### 4. Training Configuration
**File**: `roboimi/vla/conf/config.yaml`
- Batch size: 8
- Learning rate: 1e-4
- Max steps: 10000
- Log frequency: 100 steps
- Save frequency: 1000 steps
- Device: cuda
- Num workers: 4

## Prerequisites

### 1. Prepare Dataset
Your dataset should be organized as:
```
/path/to/your/dataset/
├── episode_0.hdf5
├── episode_1.hdf5
├── ...
└── data_stats.pkl
```

Each HDF5 file should contain:
```
episode_N.hdf5
├── action                  # (T, 16) float32
└── observations/
    ├── qpos                # (T, 16) float32
    └── images/
        ├── r_vis           # (T, H, W, 3) uint8
        └── top             # (T, H, W, 3) uint8
```
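
As a sketch of that layout, the snippet below writes one synthetic episode with h5py; shapes and names follow the tree above, the data itself is random placeholder content:

```python
import h5py
import numpy as np

def write_dummy_episode(path, T=10, H=96, W=96):
    """Write one synthetic episode in the expected HDF5 layout."""
    with h5py.File(path, "w") as f:
        f.create_dataset("action",
                         data=np.random.randn(T, 16).astype(np.float32))
        obs = f.create_group("observations")
        obs.create_dataset("qpos",
                           data=np.random.randn(T, 16).astype(np.float32))
        images = obs.create_group("images")
        for cam in ("r_vis", "top"):
            frames = np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8)
            images.create_dataset(cam, data=frames)
```

A file written this way can be used to smoke-test the training pipeline before the real dataset is ready.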

### 2. Generate Dataset Statistics
Create `data_stats.pkl` (the zeros/ones below are identity-normalization placeholders; use real dataset statistics for training):
```python
import pickle
import numpy as np

stats = {
    'action': {
        'mean': np.zeros(16),
        'std': np.ones(16)
    },
    'qpos': {
        'mean': np.zeros(16),
        'std': np.ones(16)
    }
}

with open('/path/to/your/dataset/data_stats.pkl', 'wb') as f:
    pickle.dump(stats, f)
```

Or use the provided script:
```bash
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/your/dataset
```
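
If you prefer computing the statistics inline rather than via the script, a sketch that accumulates per-dimension mean/std over all episodes (assuming the HDF5 layout above; not the project's `calculate_stats` implementation):

```python
from pathlib import Path

import h5py
import numpy as np

def compute_stats(dataset_dir):
    """Concatenate all episodes and take per-dimension mean/std."""
    actions, qposes = [], []
    for path in sorted(Path(dataset_dir).glob("episode_*.hdf5")):
        with h5py.File(path, "r") as f:
            actions.append(f["action"][:])
            qposes.append(f["observations/qpos"][:])
    actions = np.concatenate(actions, axis=0)
    qposes = np.concatenate(qposes, axis=0)
    return {
        "action": {"mean": actions.mean(axis=0), "std": actions.std(axis=0)},
        "qpos": {"mean": qposes.mean(axis=0), "std": qposes.std(axis=0)},
    }
```

The result can be pickled to `data_stats.pkl` exactly as in the snippet above.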

## Usage

### 1. Update Dataset Path
Edit `roboimi/vla/conf/data/resnet_dataset.yaml`:
```yaml
dataset_dir: "/path/to/your/dataset"  # CHANGE THIS
camera_names:
  - r_vis  # CHANGE TO YOUR CAMERA NAMES
  - top
```

### 2. Run Training
```bash
# Basic training
python roboimi/demos/vla_scripts/train_vla.py

# Override configurations
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=16
python roboimi/demos/vla_scripts/train_vla.py train.device=cpu
python roboimi/demos/vla_scripts/train_vla.py train.max_steps=20000
python roboimi/demos/vla_scripts/train_vla.py data.dataset_dir=/custom/path

# Debug mode (CPU, small batch, few steps)
python roboimi/demos/vla_scripts/train_vla.py \
    train.device=cpu \
    train.batch_size=2 \
    train.max_steps=10 \
    train.num_workers=0
```

### 3. Monitor Training
Checkpoints are saved to:
- `checkpoints/vla_model_step_1000.pt` - Periodic checkpoints
- `checkpoints/vla_model_best.pt` - Best model (lowest loss)
- `checkpoints/vla_model_final.pt` - Final model

## Architecture Details

### Data Flow
1. **Input**: Images from multiple cameras + proprioception (qpos)
2. **Vision Encoder**: ResNet-18 → SpatialSoftmax → (B, T, 1024) per camera
3. **Feature Concatenation**: All cameras + qpos → Global conditioning
4. **Diffusion Policy**: 1D U-Net predicts noise on action sequences
5. **Output**: Clean action sequence (B, 16, 16)
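
The 512 → 1024 expansion in step 2 comes from SpatialSoftmax returning an expected (x, y) image coordinate per feature channel. A minimal, self-contained sketch (not the project's implementation):

```python
import torch
import torch.nn.functional as F

def spatial_softmax(features):
    """Soft-argmax per channel: (B, C, H, W) -> (B, 2*C) keypoints.

    Each channel's activation map is turned into a probability map, and
    the expected (x, y) coordinate is taken under it. With C=512 ResNet-18
    channels this yields the 1024-dim feature the config describes.
    """
    B, C, H, W = features.shape
    # Softmax over all spatial locations, independently per channel
    probs = F.softmax(features.reshape(B, C, H * W), dim=-1).reshape(B, C, H, W)
    # Normalized coordinate grids in [-1, 1]
    ys = torch.linspace(-1.0, 1.0, H).view(1, 1, H, 1)
    xs = torch.linspace(-1.0, 1.0, W).view(1, 1, 1, W)
    ex = (probs * xs).sum(dim=(2, 3))  # expected x per channel, (B, C)
    ey = (probs * ys).sum(dim=(2, 3))  # expected y per channel, (B, C)
    return torch.cat([ex, ey], dim=-1)  # (B, 2*C)
```

The soft-argmax keeps the operation differentiable while collapsing each H×W map to two numbers, which is why it is much cheaper than flattening the full feature map.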

### Training Process
1. Sample a random timestep t from [0, 100)
2. Add noise to the ground-truth actions according to the noise schedule
3. Predict the noise using vision + proprioception conditioning
4. Compute MSE loss between predicted and actual noise
5. Backpropagate and update weights
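
The steps above can be sketched as one training iteration; note the noise schedule here is a deliberately simplified toy (the real training uses a proper DDPM scheduler), and `noise_pred_net` stands in for the 1D U-Net:

```python
import torch

def diffusion_training_step(actions, cond, noise_pred_net, num_steps=100):
    """One simplified diffusion-policy training step.

    actions: (B, pred_horizon, action_dim) ground-truth action sequence
    cond:    global conditioning (vision features + qpos), passed through
    """
    B = actions.shape[0]
    t = torch.randint(0, num_steps, (B,))            # 1. random timestep
    noise = torch.randn_like(actions)                # 2. sample noise
    alpha_bar = 1.0 - (t.float() + 1) / num_steps    # toy noise schedule
    a = alpha_bar.view(B, 1, 1)
    noisy = a.sqrt() * actions + (1 - a).sqrt() * noise  # noised actions
    pred = noise_pred_net(noisy, t, cond)            # 3. predict the noise
    loss = torch.nn.functional.mse_loss(pred, noise) # 4. MSE on the noise
    return loss                                      # 5. loss.backward() next
```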

### Inference Process
1. Extract visual features from the current observation
2. Start with a random-noise action sequence
3. Iteratively denoise over 10 steps (DDPM scheduler)
4. Return the clean action sequence
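
The denoising loop has this shape; the update rule below is a deliberately crude stand-in for the scheduler's step function, kept only to show the control flow:

```python
import torch

@torch.no_grad()
def sample_actions(noise_pred_net, cond, shape, num_inference_steps=10):
    """Iteratively refine a random action sequence (toy update rule)."""
    x = torch.randn(shape)                      # 2. start from pure noise
    for i in reversed(range(num_inference_steps)):
        t = torch.full((shape[0],), i)          # current timestep
        eps = noise_pred_net(x, t, cond)        # predicted noise
        x = x - eps / num_inference_steps       # crude denoising update
    return x                                    # 4. clean action sequence
```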

## Common Issues

### Issue: Out of Memory
**Solution**: Reduce the batch size, or fall back to CPU
```bash
python train_vla.py train.batch_size=4
# or
python train_vla.py train.device=cpu
```

### Issue: Dataset not found
**Solution**: Check the `dataset_dir` path in the config
```bash
python train_vla.py data.dataset_dir=/absolute/path/to/dataset
```

### Issue: Camera names mismatch
**Solution**: Update `camera_names` in the data config
```yaml
# roboimi/vla/conf/data/resnet_dataset.yaml
camera_names:
  - your_camera_1
  - your_camera_2
```
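
To see which camera names an episode actually contains (so the config can match them), a quick h5py check:

```python
import h5py

def camera_names(episode_path):
    """Return the camera keys stored under observations/images."""
    with h5py.File(episode_path, "r") as f:
        return sorted(f["observations/images"].keys())
```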

### Issue: data_stats.pkl missing
**Solution**: Generate the statistics file
```bash
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/dataset
```

## Model Files Created

```
roboimi/
├── vla/
│   ├── conf/
│   │   ├── config.yaml (UPDATED)
│   │   ├── backbone/
│   │   │   └── resnet.yaml (NEW)
│   │   ├── agent/
│   │   │   └── resnet_diffusion.yaml (NEW)
│   │   └── data/
│   │       └── resnet_dataset.yaml (NEW)
│   └── models/
│       └── backbones/
│           ├── __init__.py (UPDATED - added resnet export)
│           └── resnet.py (EXISTING)
└── demos/
    └── vla_scripts/
        └── train_vla.py (REWRITTEN)
```

## Next Steps

1. **Prepare your dataset** in the required HDF5 format
2. **Update dataset_dir** in `roboimi/vla/conf/data/resnet_dataset.yaml`
3. **Run training** with `python roboimi/demos/vla_scripts/train_vla.py`
4. **Monitor checkpoints** in the `checkpoints/` directory
5. **Evaluate** the trained model using the best checkpoint

## Advanced Configuration

### Use a Different ResNet Variant
Edit `roboimi/vla/conf/agent/resnet_diffusion.yaml`:
```yaml
vision_backbone:
  model_name: "microsoft/resnet-50"  # or resnet-34, resnet-101
```

### Adjust Diffusion Steps
```yaml
# More steps = better quality, slower training
diffusion_steps: 200  # default: 100
```

### Change Horizons
```yaml
pred_horizon: 32  # Predict more future steps
obs_horizon: 4    # Use more history
```

### Multi-GPU Training
```bash
# Use CUDA device 1
python train_vla.py train.device=cuda:1

# For multi-GPU, use torch.distributed (requires code modification)
```

## References

- ResNet paper: https://arxiv.org/abs/1512.03385
- Diffusion Policy: https://diffusion-policy.cs.columbia.edu/
- VLA framework documentation: see CLAUDE.md in the project root