ResNet VLA Training Guide
This guide explains how to train the VLA agent with a ResNet backbone, action_dim=16, and obs_dim=16.
Configuration Overview
1. Backbone Configuration
File: roboimi/vla/conf/backbone/resnet.yaml
- Model: microsoft/resnet-18
- Output dim: 1024 (512 channels × 2 from SpatialSoftmax)
- Frozen by default for faster training
2. Agent Configuration
File: roboimi/vla/conf/agent/resnet_diffusion.yaml
- Vision backbone: ResNet-18 with SpatialSoftmax
- Action dimension: 16
- Observation dimension: 16
- Prediction horizon: 16 steps
- Observation horizon: 2 steps
- Diffusion steps: 100
- Number of cameras: 2
3. Dataset Configuration
File: roboimi/vla/conf/data/resnet_dataset.yaml
- Dataset class: RobotDiffusionDataset
- Prediction horizon: 16
- Observation horizon: 2
- Camera names: [r_vis, top]
- Normalization: gaussian, i.e. per-dimension (x - mean) / std (see the sketch below)
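A minimal sketch of what gaussian normalization does, assuming the mean/std layout of data_stats.pkl described below (the helper names and the epsilon guard are illustrative, not taken from the repo):
import numpy as np

def normalize(x, stats):
    # Standardize per dimension: (x - mean) / std.
    # The 1e-8 guard against zero std is an assumption.
    return (x - stats['mean']) / (stats['std'] + 1e-8)

def denormalize(x, stats):
    # Invert the transform to recover raw values at inference time.
    return x * (stats['std'] + 1e-8) + stats['mean']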
4. Training Configuration
File: roboimi/vla/conf/config.yaml
- Batch size: 8
- Learning rate: 1e-4
- Max steps: 10000
- Log frequency: 100 steps
- Save frequency: 1000 steps
- Device: cuda
- Num workers: 4
Prerequisites
1. Prepare Dataset
Your dataset should be organized as:
/path/to/your/dataset/
├── episode_0.hdf5
├── episode_1.hdf5
├── ...
└── data_stats.pkl
Each HDF5 file should contain:
episode_N.hdf5
├── action # (T, 16) float32
└── observations/
├── qpos # (T, 16) float32
└── images/
├── r_vis/ # (T, H, W, 3) uint8
└── top/ # (T, H, W, 3) uint8
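For reference, a minimal sketch that writes one dummy episode in this layout with h5py (episode length and image size are placeholders):
import h5py
import numpy as np

T, H, W = 100, 240, 320  # placeholder episode length and image size
with h5py.File('episode_0.hdf5', 'w') as f:
    f.create_dataset('action', data=np.zeros((T, 16), dtype=np.float32))
    obs = f.create_group('observations')
    obs.create_dataset('qpos', data=np.zeros((T, 16), dtype=np.float32))
    images = obs.create_group('images')
    for cam in ['r_vis', 'top']:  # must match camera_names in the data config
        images.create_dataset(cam, data=np.zeros((T, H, W, 3), dtype=np.uint8))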
2. Generate Dataset Statistics
Create data_stats.pkl with:
import pickle
import numpy as np
# Placeholder statistics (zero mean, unit std => identity normalization);
# compute real per-dimension values from your data for actual training.
stats = {
    'action': {
        'mean': np.zeros(16),
        'std': np.ones(16),
    },
    'qpos': {
        'mean': np.zeros(16),
        'std': np.ones(16),
    },
}
with open('/path/to/your/dataset/data_stats.pkl', 'wb') as f:
pickle.dump(stats, f)
Or use the provided script:
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/your/dataset
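If the script is unavailable, equivalent statistics can be computed by hand. A hedged sketch assuming the HDF5 layout above (the actual calculate_stats script may differ in its details):
import glob
import os
import pickle

import h5py
import numpy as np

dataset_dir = '/path/to/your/dataset'
actions, qpos = [], []
for path in sorted(glob.glob(os.path.join(dataset_dir, 'episode_*.hdf5'))):
    with h5py.File(path, 'r') as f:
        actions.append(f['action'][:])
        qpos.append(f['observations/qpos'][:])
actions = np.concatenate(actions)  # (total_T, 16)
qpos = np.concatenate(qpos)        # (total_T, 16)

stats = {
    'action': {'mean': actions.mean(0), 'std': actions.std(0)},
    'qpos': {'mean': qpos.mean(0), 'std': qpos.std(0)},
}
with open(os.path.join(dataset_dir, 'data_stats.pkl'), 'wb') as f:
    pickle.dump(stats, f)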
Usage
1. Update Dataset Path
Edit roboimi/vla/conf/data/resnet_dataset.yaml:
dataset_dir: "/path/to/your/dataset" # CHANGE THIS
camera_names:
- r_vis # CHANGE TO YOUR CAMERA NAMES
- top
2. Run Training
# Basic training
python roboimi/demos/vla_scripts/train_vla.py
# Override configurations
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=16
python roboimi/demos/vla_scripts/train_vla.py train.device=cpu
python roboimi/demos/vla_scripts/train_vla.py train.max_steps=20000
python roboimi/demos/vla_scripts/train_vla.py data.dataset_dir=/custom/path
# Debug mode (CPU, small batch, few steps)
python roboimi/demos/vla_scripts/train_vla.py \
train.device=cpu \
train.batch_size=2 \
train.max_steps=10 \
train.num_workers=0
3. Monitor Training
Checkpoints are saved to:
- checkpoints/vla_model_step_1000.pt - Periodic checkpoint (saved every 1000 steps)
- checkpoints/vla_model_best.pt - Best model (lowest loss)
- checkpoints/vla_model_final.pt - Final model
Architecture Details
Data Flow
- Input: Images from multiple cameras + proprioception (qpos)
- Vision Encoder: ResNet-18 → SpatialSoftmax → (B, T, 1024) per camera
- Feature Concatenation: All cameras + qpos → Global conditioning
- Diffusion Policy: 1D U-Net predicts noise on action sequences
- Output: Clean action sequence of shape (B, 16, 16), i.e. (batch, pred_horizon, action_dim); see the shape check below
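A back-of-the-envelope check of these shapes, following the standard Diffusion Policy recipe of flattening observation features over the horizon (the repo's exact conditioning scheme may differ):
# Per camera and per observation step: ResNet-18 -> SpatialSoftmax -> 1024-dim vector.
vision_dim = 1024 * 2 * 2   # 1024 features x 2 cameras x obs_horizon 2 = 4096
qpos_dim = 16 * 2           # 16-dim qpos x obs_horizon 2 = 32
global_cond_dim = vision_dim + qpos_dim  # 4128-dim conditioning vector
# The 1D U-Net denoises an action sequence of shape (B, pred_horizon=16, action_dim=16).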
Training Process
- Sample a random diffusion timestep t uniformly from [0, 100)
- Add noise to ground truth actions
- Predict noise using vision + proprioception conditioning
- Compute MSE loss between predicted and actual noise
- Backpropagate and update weights (see the training-step sketch below)
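A minimal sketch of one such step, assuming a diffusers-style DDPMScheduler; noise_pred_net, actions, and global_cond are stand-ins for the repo's actual names:
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=100)  # matches diffusion_steps: 100

def training_step(noise_pred_net, actions, global_cond):
    # actions: (B, 16, 16) normalized ground-truth action sequences
    noise = torch.randn_like(actions)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (actions.shape[0],), device=actions.device)
    noisy_actions = scheduler.add_noise(actions, noise, t)  # forward diffusion
    pred = noise_pred_net(noisy_actions, t, global_cond=global_cond)
    return F.mse_loss(pred, noise)  # MSE between predicted and actual noise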
Inference Process
- Extract visual features from current observation
- Start with random noise action sequence
- Iteratively denoise over 10 steps (DDPM scheduler)
- Return the clean action sequence (see the denoising sketch below)
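Continuing the sketch above (same scheduler and stand-in names):
@torch.no_grad()
def sample_actions(noise_pred_net, global_cond, device='cuda'):
    scheduler.set_timesteps(10)                      # 10 denoising steps at inference
    actions = torch.randn(1, 16, 16, device=device)  # start from pure noise
    for t in scheduler.timesteps:
        pred = noise_pred_net(actions, t, global_cond=global_cond)
        actions = scheduler.step(pred, t, actions).prev_sample
    return actions  # still normalized; denormalize before sending to the robot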
Common Issues
Issue: Out of Memory
Solution: Reduce batch size or use CPU
python train_vla.py train.batch_size=4 train.device=cpu
Issue: Dataset not found
Solution: Check dataset_dir path in config
python train_vla.py data.dataset_dir=/absolute/path/to/dataset
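A quick sanity check of the layout this guide expects (paths are placeholders):
import glob
import os

dataset_dir = '/absolute/path/to/dataset'
episodes = glob.glob(os.path.join(dataset_dir, 'episode_*.hdf5'))
assert episodes, f'no episode_*.hdf5 files found under {dataset_dir}'
assert os.path.exists(os.path.join(dataset_dir, 'data_stats.pkl')), 'data_stats.pkl missing'
print(f'found {len(episodes)} episodes')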
Issue: Camera names mismatch
Solution: Update camera_names in data config
# roboimi/vla/conf/data/resnet_dataset.yaml
camera_names:
- your_camera_1
- your_camera_2
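To see which camera names a file actually contains (assumes the HDF5 layout above):
import h5py

with h5py.File('/path/to/dataset/episode_0.hdf5', 'r') as f:
    print(list(f['observations/images'].keys()))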
Issue: data_stats.pkl missing
Solution: Generate statistics file
python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/dataset
Model Files Created
roboimi/vla/
├── conf/
│ ├── config.yaml (UPDATED)
│ ├── backbone/
│ │ └── resnet.yaml (NEW)
│ ├── agent/
│ │ └── resnet_diffusion.yaml (NEW)
│ └── data/
│ └── resnet_dataset.yaml (NEW)
├── models/
│ └── backbones/
│ ├── __init__.py (UPDATED - added resnet export)
│ └── resnet.py (EXISTING)
└── demos/vla_scripts/
└── train_vla.py (REWRITTEN)
Next Steps
- Prepare your dataset in the required HDF5 format
- Update dataset_dir in roboimi/vla/conf/data/resnet_dataset.yaml
- Run training with python roboimi/demos/vla_scripts/train_vla.py
- Monitor checkpoints in the checkpoints/ directory
- Evaluate the trained model using the best checkpoint (loading sketch below)
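A hedged loading sketch for evaluation, assuming checkpoints are written with torch.save (inspect train_vla.py for the actual format):
import torch

ckpt = torch.load('checkpoints/vla_model_best.pt', map_location='cpu')
# The file may be a bare state_dict or a dict that also carries
# optimizer state and the training step; inspect it before use.
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)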
Advanced Configuration
Use Different ResNet Variant
Edit roboimi/vla/conf/agent/resnet_diffusion.yaml:
vision_backbone:
  model_name: "microsoft/resnet-50" # or resnet-34, resnet-101
Note: resnet-34 also ends in 512 channels, but resnet-50/101 end in 2048, so the SpatialSoftmax output dim grows from 512 × 2 = 1024 to 2048 × 2 = 4096 and the backbone output dim must be updated to match.
Adjust Diffusion Steps
# More steps = better quality, slower training
diffusion_steps: 200 # default: 100
Change Horizons
pred_horizon: 32 # Predict more future steps
obs_horizon: 4 # Use more history
Keep these values in sync between the agent config and the data config (both default to pred_horizon 16, obs_horizon 2), since the dataset must serve sequences of the length the policy expects.
Multi-GPU Training
# Use CUDA device 1
python train_vla.py train.device=cuda:1
# For multi-GPU, use torch.distributed (requires code modification)
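For reference, a generic torch.distributed pattern such a modification would follow (standard PyTorch DDP, not code from this repo; launch with torchrun --nproc_per_node=N):
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group('nccl')            # torchrun sets the rank/world-size env vars
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
model = build_model().cuda(local_rank)     # build_model is a placeholder for the agent
model = DDP(model, device_ids=[local_rank])
# Each rank also needs a DistributedSampler so it sees a unique data shard.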
References
- ResNet Paper: https://arxiv.org/abs/1512.03385
- Diffusion Policy: https://diffusion-policy.cs.columbia.edu/
- VLA Framework Documentation: See CLAUDE.md in project root