feat(train): get the training script running end-to-end
# ResNet VLA Training Guide

This guide explains how to train the VLA agent with a ResNet backbone and `action_dim=16`, `obs_dim=16`.

## Configuration Overview

### 1. Backbone Configuration
**File**: `roboimi/vla/conf/backbone/resnet.yaml`
- Model: microsoft/resnet-18
- Output dim: 1024 (512 channels × 2 from SpatialSoftmax)
- Frozen by default for faster training

### 2. Agent Configuration
**File**: `roboimi/vla/conf/agent/resnet_diffusion.yaml`
- Vision backbone: ResNet-18 with SpatialSoftmax
- Action dimension: 16
- Observation dimension: 16
- Prediction horizon: 16 steps
- Observation horizon: 2 steps
- Diffusion steps: 100
- Number of cameras: 2

### 3. Dataset Configuration
**File**: `roboimi/vla/conf/data/resnet_dataset.yaml`
- Dataset class: RobotDiffusionDataset
- Prediction horizon: 16
- Observation horizon: 2
- Camera names: [r_vis, top]
- Normalization: gaussian (mean/std)
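
Gaussian normalization standardizes each of the 16 dimensions using the mean/std stored in `data_stats.pkl`. A minimal sketch (function names are illustrative, not from the codebase):

```python
import numpy as np

def normalize(x, stats):
    """Standardize each dimension using dataset mean/std."""
    return (x - stats["mean"]) / (stats["std"] + 1e-8)

def denormalize(x, stats):
    """Map normalized values back to the original scale."""
    return x * (stats["std"] + 1e-8) + stats["mean"]
```

The small epsilon guards against zero-variance dimensions (e.g. a joint that never moves).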

### 4. Training Configuration
**File**: `roboimi/vla/conf/config.yaml`
- Batch size: 8
- Learning rate: 1e-4
- Max steps: 10000
- Log frequency: 100 steps
- Save frequency: 1000 steps
- Device: cuda
- Num workers: 4

## Prerequisites

### 1. Prepare Dataset
Your dataset should be organized as:
```
/path/to/your/dataset/
├── episode_0.hdf5
├── episode_1.hdf5
├── ...
└── data_stats.pkl
```

Each HDF5 file should contain:
```
episode_N.hdf5
├── action                  # (T, 16) float32
└── observations/
    ├── qpos                # (T, 16) float32
    └── images/
        ├── r_vis           # (T, H, W, 3) uint8
        └── top             # (T, H, W, 3) uint8
```
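
As a sketch of that layout, the snippet below writes one synthetic episode with h5py; shapes and names follow the tree above, the data itself is random placeholder content:

```python
import h5py
import numpy as np

def write_dummy_episode(path, T=10, H=96, W=96):
    """Write one synthetic episode in the expected HDF5 layout."""
    with h5py.File(path, "w") as f:
        f.create_dataset("action",
                         data=np.random.randn(T, 16).astype(np.float32))
        obs = f.create_group("observations")
        obs.create_dataset("qpos",
                           data=np.random.randn(T, 16).astype(np.float32))
        images = obs.create_group("images")
        for cam in ("r_vis", "top"):
            frames = np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8)
            images.create_dataset(cam, data=frames)
```

A file written this way can be used to smoke-test the training pipeline before the real dataset is ready.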

### 2. Generate Dataset Statistics
Create `data_stats.pkl` (the zeros/ones below are identity-normalization placeholders; use real dataset statistics for training):
```python
import pickle
import numpy as np

stats = {
    'action': {
        'mean': np.zeros(16),
        'std': np.ones(16)
    },
    'qpos': {
        'mean': np.zeros(16),
        'std': np.ones(16)
    }
}

with open('/path/to/your/dataset/data_stats.pkl', 'wb') as f:
    pickle.dump(stats, f)
```

Or use the provided script:
```bash
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/your/dataset
```
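
If you prefer computing the statistics inline rather than via the script, a sketch that accumulates per-dimension mean/std over all episodes (assuming the HDF5 layout above; not the project's `calculate_stats` implementation):

```python
from pathlib import Path

import h5py
import numpy as np

def compute_stats(dataset_dir):
    """Concatenate all episodes and take per-dimension mean/std."""
    actions, qposes = [], []
    for path in sorted(Path(dataset_dir).glob("episode_*.hdf5")):
        with h5py.File(path, "r") as f:
            actions.append(f["action"][:])
            qposes.append(f["observations/qpos"][:])
    actions = np.concatenate(actions, axis=0)
    qposes = np.concatenate(qposes, axis=0)
    return {
        "action": {"mean": actions.mean(axis=0), "std": actions.std(axis=0)},
        "qpos": {"mean": qposes.mean(axis=0), "std": qposes.std(axis=0)},
    }
```

The result can be pickled to `data_stats.pkl` exactly as in the snippet above.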

## Usage

### 1. Update Dataset Path
Edit `roboimi/vla/conf/data/resnet_dataset.yaml`:
```yaml
dataset_dir: "/path/to/your/dataset"  # CHANGE THIS
camera_names:
  - r_vis  # CHANGE TO YOUR CAMERA NAMES
  - top
```

### 2. Run Training
```bash
# Basic training
python roboimi/demos/vla_scripts/train_vla.py

# Override configurations
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=16
python roboimi/demos/vla_scripts/train_vla.py train.device=cpu
python roboimi/demos/vla_scripts/train_vla.py train.max_steps=20000
python roboimi/demos/vla_scripts/train_vla.py data.dataset_dir=/custom/path

# Debug mode (CPU, small batch, few steps)
python roboimi/demos/vla_scripts/train_vla.py \
    train.device=cpu \
    train.batch_size=2 \
    train.max_steps=10 \
    train.num_workers=0
```

### 3. Monitor Training
Checkpoints are saved to:
- `checkpoints/vla_model_step_1000.pt` - Periodic checkpoints
- `checkpoints/vla_model_best.pt` - Best model (lowest loss)
- `checkpoints/vla_model_final.pt` - Final model

## Architecture Details

### Data Flow
1. **Input**: Images from multiple cameras + proprioception (qpos)
2. **Vision Encoder**: ResNet-18 → SpatialSoftmax → (B, T, 1024) per camera
3. **Feature Concatenation**: All cameras + qpos → Global conditioning
4. **Diffusion Policy**: 1D U-Net predicts noise on action sequences
5. **Output**: Clean action sequence (B, 16, 16)
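
The 512 → 1024 expansion in step 2 comes from SpatialSoftmax returning an expected (x, y) image coordinate per feature channel. A minimal, self-contained sketch (not the project's implementation):

```python
import torch
import torch.nn.functional as F

def spatial_softmax(features):
    """Soft-argmax per channel: (B, C, H, W) -> (B, 2*C) keypoints.

    Each channel's activation map is turned into a probability map, and
    the expected (x, y) coordinate is taken under it. With C=512 ResNet-18
    channels this yields the 1024-dim feature the config describes.
    """
    B, C, H, W = features.shape
    # Softmax over all spatial locations, independently per channel
    probs = F.softmax(features.reshape(B, C, H * W), dim=-1).reshape(B, C, H, W)
    # Normalized coordinate grids in [-1, 1]
    ys = torch.linspace(-1.0, 1.0, H).view(1, 1, H, 1)
    xs = torch.linspace(-1.0, 1.0, W).view(1, 1, 1, W)
    ex = (probs * xs).sum(dim=(2, 3))  # expected x per channel, (B, C)
    ey = (probs * ys).sum(dim=(2, 3))  # expected y per channel, (B, C)
    return torch.cat([ex, ey], dim=-1)  # (B, 2*C)
```

The soft-argmax keeps the operation differentiable while collapsing each H×W map to two numbers, which is why it is much cheaper than flattening the full feature map.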

### Training Process
1. Sample a random timestep t from [0, 100)
2. Add noise to the ground-truth actions according to the noise schedule
3. Predict the noise using vision + proprioception conditioning
4. Compute MSE loss between predicted and actual noise
5. Backpropagate and update weights
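
The steps above can be sketched as one training iteration; note the noise schedule here is a deliberately simplified toy (the real training uses a proper DDPM scheduler), and `noise_pred_net` stands in for the 1D U-Net:

```python
import torch

def diffusion_training_step(actions, cond, noise_pred_net, num_steps=100):
    """One simplified diffusion-policy training step.

    actions: (B, pred_horizon, action_dim) ground-truth action sequence
    cond:    global conditioning (vision features + qpos), passed through
    """
    B = actions.shape[0]
    t = torch.randint(0, num_steps, (B,))            # 1. random timestep
    noise = torch.randn_like(actions)                # 2. sample noise
    alpha_bar = 1.0 - (t.float() + 1) / num_steps    # toy noise schedule
    a = alpha_bar.view(B, 1, 1)
    noisy = a.sqrt() * actions + (1 - a).sqrt() * noise  # noised actions
    pred = noise_pred_net(noisy, t, cond)            # 3. predict the noise
    loss = torch.nn.functional.mse_loss(pred, noise) # 4. MSE on the noise
    return loss                                      # 5. loss.backward() next
```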

### Inference Process
1. Extract visual features from the current observation
2. Start with a random-noise action sequence
3. Iteratively denoise over 10 steps (DDPM scheduler)
4. Return the clean action sequence
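
The denoising loop has this shape; the update rule below is a deliberately crude stand-in for the scheduler's step function, kept only to show the control flow:

```python
import torch

@torch.no_grad()
def sample_actions(noise_pred_net, cond, shape, num_inference_steps=10):
    """Iteratively refine a random action sequence (toy update rule)."""
    x = torch.randn(shape)                      # 2. start from pure noise
    for i in reversed(range(num_inference_steps)):
        t = torch.full((shape[0],), i)          # current timestep
        eps = noise_pred_net(x, t, cond)        # predicted noise
        x = x - eps / num_inference_steps       # crude denoising update
    return x                                    # 4. clean action sequence
```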

## Common Issues

### Issue: Out of Memory
**Solution**: Reduce the batch size, or fall back to CPU
```bash
python train_vla.py train.batch_size=4
# or
python train_vla.py train.device=cpu
```

### Issue: Dataset not found
**Solution**: Check the `dataset_dir` path in the config
```bash
python train_vla.py data.dataset_dir=/absolute/path/to/dataset
```

### Issue: Camera names mismatch
**Solution**: Update `camera_names` in the data config
```yaml
# roboimi/vla/conf/data/resnet_dataset.yaml
camera_names:
  - your_camera_1
  - your_camera_2
```
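
To see which camera names an episode actually contains (so the config can match them), a quick h5py check:

```python
import h5py

def camera_names(episode_path):
    """Return the camera keys stored under observations/images."""
    with h5py.File(episode_path, "r") as f:
        return sorted(f["observations/images"].keys())
```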

### Issue: data_stats.pkl missing
**Solution**: Generate the statistics file
```bash
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/dataset
```

## Model Files Created

```
roboimi/
├── vla/
│   ├── conf/
│   │   ├── config.yaml (UPDATED)
│   │   ├── backbone/
│   │   │   └── resnet.yaml (NEW)
│   │   ├── agent/
│   │   │   └── resnet_diffusion.yaml (NEW)
│   │   └── data/
│   │       └── resnet_dataset.yaml (NEW)
│   └── models/
│       └── backbones/
│           ├── __init__.py (UPDATED - added resnet export)
│           └── resnet.py (EXISTING)
└── demos/
    └── vla_scripts/
        └── train_vla.py (REWRITTEN)
```

## Next Steps

1. **Prepare your dataset** in the required HDF5 format
2. **Update dataset_dir** in `roboimi/vla/conf/data/resnet_dataset.yaml`
3. **Run training** with `python roboimi/demos/vla_scripts/train_vla.py`
4. **Monitor checkpoints** in the `checkpoints/` directory
5. **Evaluate** the trained model using the best checkpoint

## Advanced Configuration

### Use a Different ResNet Variant
Edit `roboimi/vla/conf/agent/resnet_diffusion.yaml`:
```yaml
vision_backbone:
  model_name: "microsoft/resnet-50"  # or resnet-34, resnet-101
```

### Adjust Diffusion Steps
```yaml
# More steps = better quality, slower training
diffusion_steps: 200  # default: 100
```

### Change Horizons
```yaml
pred_horizon: 32  # Predict more future steps
obs_horizon: 4    # Use more history
```

### Multi-GPU Training
```bash
# Use CUDA device 1
python train_vla.py train.device=cuda:1

# For multi-GPU, use torch.distributed (requires code modification)
```

## References

- ResNet paper: https://arxiv.org/abs/1512.03385
- Diffusion Policy: https://diffusion-policy.cs.columbia.edu/
- VLA framework documentation: see CLAUDE.md in the project root