feat(train): get the training script running end-to-end

gouhanke
2026-02-05 14:08:43 +08:00
parent dd2749cb12
commit b0a944f7aa
17 changed files with 1002 additions and 464 deletions


@@ -0,0 +1,238 @@
# ResNet VLA Training Guide
This guide explains how to train the VLA agent with a ResNet vision backbone, `action_dim=16`, and `obs_dim=16`.
## Configuration Overview
### 1. Backbone Configuration
**File**: `roboimi/vla/conf/backbone/resnet.yaml`
- Model: microsoft/resnet-18
- Output dim: 1024 (512 channels × 2 keypoint coordinates from SpatialSoftmax; see the sketch below)
- Frozen by default for faster training
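Why the output dim is 1024: SpatialSoftmax collapses each of ResNet-18's 512 feature channels into an expected (x, y) keypoint. A minimal sketch of the idea (the actual implementation lives in `roboimi/vla/models/backbones/resnet.py` and may differ in details):
```python
import torch
import torch.nn.functional as F

def spatial_softmax(features: torch.Tensor) -> torch.Tensor:
    """Collapse (B, C, H, W) feature maps into (B, C*2) expected keypoints."""
    b, c, h, w = features.shape
    # Per-channel softmax over all H*W spatial locations.
    attn = F.softmax(features.reshape(b, c, h * w), dim=-1)
    # Normalized coordinate grids in [-1, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h), torch.linspace(-1.0, 1.0, w), indexing="ij"
    )
    xs = xs.reshape(1, 1, h * w).to(features)
    ys = ys.reshape(1, 1, h * w).to(features)
    # Expected (x, y) per channel: (B, C, 2) -> (B, C*2).
    keypoints = torch.stack([(attn * xs).sum(-1), (attn * ys).sum(-1)], dim=-1)
    return keypoints.reshape(b, c * 2)

# ResNet-18's final feature map has 512 channels -> 512 * 2 = 1024 output dims.
```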
### 2. Agent Configuration
**File**: `roboimi/vla/conf/agent/resnet_diffusion.yaml`
- Vision backbone: ResNet-18 with SpatialSoftmax
- Action dimension: 16
- Observation dimension: 16
- Prediction horizon: 16 steps
- Observation horizon: 2 steps
- Diffusion steps: 100
- Number of cameras: 2
### 3. Dataset Configuration
**File**: `roboimi/vla/conf/data/resnet_dataset.yaml`
- Dataset class: RobotDiffusionDataset
- Prediction horizon: 16
- Observation horizon: 2
- Camera names: [r_vis, top]
- Normalization: gaussian (mean/std)
### 4. Training Configuration
**File**: `roboimi/vla/conf/config.yaml` (composed by Hydra; see the sketch after this list)
- Batch size: 8
- Learning rate: 1e-4
- Max steps: 10000
- Log frequency: 100 steps
- Save frequency: 1000 steps
- Device: cuda
- Num workers: 4
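These values are composed by Hydra from the `conf/` tree. A hypothetical sketch of how `train_vla.py` consumes them (the exact `config_path` and field names are assumptions; check the real script):
```python
# Hypothetical Hydra entry point; field names mirror the list above.
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="../../vla/conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(cfg.train.batch_size)  # 8
    print(cfg.train.max_steps)   # 10000
    print(cfg.train.device)      # cuda

if __name__ == "__main__":
    main()
```
Any field printed here can be overridden from the command line, e.g. `train.batch_size=16` (see Usage below).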
## Prerequisites
### 1. Prepare Dataset
Your dataset should be organized as:
```
/path/to/your/dataset/
├── episode_0.hdf5
├── episode_1.hdf5
├── ...
└── data_stats.pkl
```
Each HDF5 file should contain:
```
episode_N.hdf5
├── action # (T, 16) float32
└── observations/
├── qpos # (T, 16) float32
└── images/
├── r_vis/ # (T, H, W, 3) uint8
└── top/ # (T, H, W, 3) uint8
```
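To verify an episode matches this layout before training, a quick inspection sketch with `h5py` (camera names and dims are the ones assumed above):
```python
import h5py

with h5py.File("/path/to/your/dataset/episode_0.hdf5", "r") as f:
    action = f["action"]                       # (T, 16) float32
    qpos = f["observations/qpos"]              # (T, 16) float32
    assert action.shape[1] == 16 and qpos.shape[1] == 16
    for cam in ["r_vis", "top"]:
        img = f[f"observations/images/{cam}"]  # (T, H, W, 3) uint8
        assert img.dtype == "uint8"
        assert img.shape[0] == action.shape[0]
    print("episode_0 OK:", action.shape[0], "timesteps")
```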
### 2. Generate Dataset Statistics
For a quick start, create a placeholder `data_stats.pkl` (zero mean, unit std, i.e. identity normalization):
```python
import pickle
import numpy as np
stats = {
'action': {
'mean': np.zeros(16),
'std': np.ones(16)
},
'qpos': {
'mean': np.zeros(16),
'std': np.ones(16)
}
}
with open('/path/to/your/dataset/data_stats.pkl', 'wb') as f:
pickle.dump(stats, f)
```
Or use the provided script:
```bash
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/your/dataset
```
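If the script is unavailable in your checkout, a sketch of computing real per-dimension statistics across all episodes (assumes the HDF5 layout above):
```python
import glob
import pickle

import h5py
import numpy as np

# Gather action/qpos across every episode, then compute per-dimension stats.
actions, qposes = [], []
for path in sorted(glob.glob("/path/to/your/dataset/episode_*.hdf5")):
    with h5py.File(path, "r") as f:
        actions.append(f["action"][:])
        qposes.append(f["observations/qpos"][:])
actions = np.concatenate(actions, axis=0)
qposes = np.concatenate(qposes, axis=0)

# Small epsilon keeps gaussian normalization safe for near-constant dims.
stats = {
    "action": {"mean": actions.mean(0), "std": actions.std(0) + 1e-6},
    "qpos": {"mean": qposes.mean(0), "std": qposes.std(0) + 1e-6},
}
with open("/path/to/your/dataset/data_stats.pkl", "wb") as f:
    pickle.dump(stats, f)
```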
## Usage
### 1. Update Dataset Path
Edit `roboimi/vla/conf/data/resnet_dataset.yaml`:
```yaml
dataset_dir: "/path/to/your/dataset" # CHANGE THIS
camera_names:
- r_vis # CHANGE TO YOUR CAMERA NAMES
- top
```
### 2. Run Training
```bash
# Basic training
python roboimi/demos/vla_scripts/train_vla.py
# Override configurations
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=16
python roboimi/demos/vla_scripts/train_vla.py train.device=cpu
python roboimi/demos/vla_scripts/train_vla.py train.max_steps=20000
python roboimi/demos/vla_scripts/train_vla.py data.dataset_dir=/custom/path
# Debug mode (CPU, small batch, few steps)
python roboimi/demos/vla_scripts/train_vla.py \
train.device=cpu \
train.batch_size=2 \
train.max_steps=10 \
train.num_workers=0
```
### 3. Monitor Training
Checkpoints are saved to:
- `checkpoints/vla_model_step_1000.pt` - Periodic checkpoints
- `checkpoints/vla_model_best.pt` - Best model (lowest loss)
- `checkpoints/vla_model_final.pt` - Final model (a loading sketch follows)
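The exact checkpoint contents depend on `train_vla.py`; assuming it saves either a plain `state_dict` or a small wrapper dict, loading looks like:
```python
import torch

# Assumption: the checkpoint is a plain state_dict or {"model": state_dict, ...};
# adjust the key if train_vla.py saves a different structure.
state = torch.load("checkpoints/vla_model_best.pt", map_location="cpu")
state_dict = state.get("model", state) if isinstance(state, dict) else state
# model.load_state_dict(state_dict)  # model rebuilt from the same Hydra config
```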
## Architecture Details
### Data Flow
1. **Input**: Images from multiple cameras + proprioception (qpos)
2. **Vision Encoder**: ResNet-18 → SpatialSoftmax → (B, T, 1024) per camera
3. **Feature Concatenation**: All cameras + qpos → Global conditioning
4. **Diffusion Policy**: 1D U-Net predicts noise on action sequences
5. **Output**: Clean action sequence of shape (B, pred_horizon=16, action_dim=16); the conditioning path is sketched below
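A shape-level sketch of step 3, with random tensors standing in for real features (the exact flattening over the observation horizon is an assumption):
```python
import torch

B, T_obs, n_cams = 8, 2, 2           # batch, obs_horizon, cameras
feat_per_cam, obs_dim = 1024, 16     # SpatialSoftmax features, qpos dim

cam_feats = [torch.randn(B, T_obs, feat_per_cam) for _ in range(n_cams)]
qpos = torch.randn(B, T_obs, obs_dim)

# Global conditioning: all cameras + qpos, flattened over the obs horizon.
cond = torch.cat(cam_feats + [qpos], dim=-1).reshape(B, -1)
print(cond.shape)  # torch.Size([8, 4128]) = 2 * (2 * 1024 + 16) per sample
```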
### Training Process
1. Sample a random timestep t from {0, ..., 99} (one of the 100 diffusion steps)
2. Add noise to ground truth actions
3. Predict noise using vision + proprioception conditioning
4. Compute MSE loss between predicted and actual noise
5. Backpropagate and update weights (see the sketch below)
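A minimal sketch of one such step using Hugging Face `diffusers` (assuming a `DDPMScheduler`; `noise_pred_net` is a stand-in for the conditional 1D U-Net):
```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=100)

def training_step(noise_pred_net, actions, cond):
    """actions: (B, 16, 16) ground-truth sequence; cond: global conditioning."""
    noise = torch.randn_like(actions)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (actions.shape[0],), device=actions.device)
    noisy = scheduler.add_noise(actions, noise, t)            # steps 1-2
    noise_pred = noise_pred_net(noisy, t, global_cond=cond)   # step 3
    return F.mse_loss(noise_pred, noise)                      # step 4: MSE on noise
```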
### Inference Process
1. Extract visual features from current observation
2. Start with random noise action sequence
3. Iteratively denoise over 10 steps (DDPM scheduler)
4. Return the clean action sequence (see the sketch below)
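The corresponding denoising loop, under the same assumptions as the training sketch:
```python
import torch
from diffusers import DDPMScheduler

@torch.no_grad()
def sample_actions(noise_pred_net, cond, pred_horizon=16, action_dim=16):
    scheduler = DDPMScheduler(num_train_timesteps=100)
    scheduler.set_timesteps(10)                        # 10 inference steps
    actions = torch.randn(cond.shape[0], pred_horizon, action_dim)
    for t in scheduler.timesteps:                      # iterative denoising
        noise_pred = noise_pred_net(actions, t, global_cond=cond)
        actions = scheduler.step(noise_pred, t, actions).prev_sample
    return actions
```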
## Common Issues
### Issue: Out of Memory
**Solution**: Reduce batch size or use CPU
```bash
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=4 train.device=cpu
```
### Issue: Dataset not found
**Solution**: Check dataset_dir path in config
```bash
python roboimi/demos/vla_scripts/train_vla.py data.dataset_dir=/absolute/path/to/dataset
```
### Issue: Camera names mismatch
**Solution**: Update camera_names in data config
```yaml
# roboimi/vla/conf/data/resnet_dataset.yaml
camera_names:
- your_camera_1
- your_camera_2
```
### Issue: data_stats.pkl missing
**Solution**: Generate statistics file
```bash
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/dataset
```
## Model Files Created
```
roboimi/vla/
├── conf/
│ ├── config.yaml (UPDATED)
│ ├── backbone/
│ │ └── resnet.yaml (NEW)
│ ├── agent/
│ │ └── resnet_diffusion.yaml (NEW)
│ └── data/
│ └── resnet_dataset.yaml (NEW)
├── models/
│ └── backbones/
│ ├── __init__.py (UPDATED - added resnet export)
│ └── resnet.py (EXISTING)
└── demos/vla_scripts/
└── train_vla.py (REWRITTEN)
```
## Next Steps
1. **Prepare your dataset** in the required HDF5 format
2. **Update dataset_dir** in `roboimi/vla/conf/data/resnet_dataset.yaml`
3. **Run training** with `python roboimi/demos/vla_scripts/train_vla.py`
4. **Monitor checkpoints** in `checkpoints/` directory
5. **Evaluate** the trained model using the best checkpoint
## Advanced Configuration
### Use Different ResNet Variant
Edit `roboimi/vla/conf/agent/resnet_diffusion.yaml`:
```yaml
vision_backbone:
model_name: "microsoft/resnet-50" # or resnet-34, resnet-101
```
### Adjust Diffusion Steps
```yaml
# More steps = better quality, slower training
diffusion_steps: 200 # default: 100
```
### Change Horizons
```yaml
pred_horizon: 32 # Predict more future steps
obs_horizon: 4 # Use more history
```
### Multi-GPU Training
```bash
# Use CUDA device 1
python roboimi/demos/vla_scripts/train_vla.py train.device=cuda:1
# For multi-GPU, use torch.distributed (requires code modification)
```
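For reference, the standard `torch.distributed` pattern that the modification would involve (a sketch, not wired into `train_vla.py`):
```python
# Sketch of standard DDP setup; launch with:
#   torchrun --nproc_per_node=2 roboimi/demos/vla_scripts/train_vla.py ...
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# model = build_model(cfg).to(local_rank)       # hypothetical builder
# model = DDP(model, device_ids=[local_rank])
# ...training loop; pair the DataLoader with a DistributedSampler...
```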
## References
- ResNet Paper: https://arxiv.org/abs/1512.03385
- Diffusion Policy: https://diffusion-policy.cs.columbia.edu/
- VLA Framework Documentation: See CLAUDE.md in project root