
ResNet VLA Training Guide

This guide explains how to train the VLA agent with a ResNet backbone, action_dim=16, and obs_dim=16.

Configuration Overview

1. Backbone Configuration

File: roboimi/vla/conf/backbone/resnet.yaml

  • Model: microsoft/resnet-18
  • Output dim: 1024 (512 channels × 2 from SpatialSoftmax)
  • Frozen by default for faster training
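The 1024-dim output comes from SpatialSoftmax: it turns each of ResNet-18's 512 feature channels into an expected (x, y) keypoint, giving 512 × 2 values. A minimal NumPy sketch of the idea (the actual implementation lives in roboimi/vla/models/backbones/resnet.py and may differ in details):

```python
import numpy as np

def spatial_softmax(feat):
    """Expected 2-D keypoint per channel via softmax over spatial locations.

    feat: (C, H, W) feature map -> (C * 2,) flattened (x, y) coordinates.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    # Softmax over the spatial dimension of each channel.
    probs = np.exp(flat - flat.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Normalized pixel coordinate grids in [-1, 1].
    xs, ys = np.meshgrid(np.linspace(-1, 1, w), np.linspace(-1, 1, h))
    ex = probs @ xs.reshape(-1)   # expected x per channel, shape (C,)
    ey = probs @ ys.reshape(-1)   # expected y per channel, shape (C,)
    return np.stack([ex, ey], axis=1).reshape(-1)  # (C * 2,)

# ResNet-18's final feature map has 512 channels, so the descriptor is 1024-D.
feat = np.random.randn(512, 7, 7)
print(spatial_softmax(feat).shape)  # (1024,)
```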

2. Agent Configuration

File: roboimi/vla/conf/agent/resnet_diffusion.yaml

  • Vision backbone: ResNet-18 with SpatialSoftmax
  • Action dimension: 16
  • Observation dimension: 16
  • Prediction horizon: 16 steps
  • Observation horizon: 2 steps
  • Diffusion steps: 100
  • Number of cameras: 2

3. Dataset Configuration

File: roboimi/vla/conf/data/resnet_dataset.yaml

  • Dataset class: RobotDiffusionDataset
  • Prediction horizon: 16
  • Observation horizon: 2
  • Camera names: [r_vis, top]
  • Normalization: gaussian (mean/std)
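Gaussian normalization is per-dimension z-scoring with the mean/std stored in data_stats.pkl. A sketch (the stats values here are hypothetical placeholders):

```python
import numpy as np

# Hypothetical per-dimension statistics, as stored in data_stats.pkl.
stats = {'qpos': {'mean': np.full(16, 0.5), 'std': np.full(16, 2.0)}}

def normalize(x, key, stats):
    """Gaussian (z-score) normalization applied by the dataset."""
    return (x - stats[key]['mean']) / stats[key]['std']

qpos = np.ones(16)
print(normalize(qpos, 'qpos', stats))  # each entry is (1.0 - 0.5) / 2.0 = 0.25
```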

4. Training Configuration

File: roboimi/vla/conf/config.yaml

  • Batch size: 8
  • Learning rate: 1e-4
  • Max steps: 10000
  • Log frequency: 100 steps
  • Save frequency: 1000 steps
  • Device: cuda
  • Num workers: 4

Prerequisites

1. Prepare Dataset

Your dataset should be organized as:

/path/to/your/dataset/
├── episode_0.hdf5
├── episode_1.hdf5
├── ...
└── data_stats.pkl

Each HDF5 file should contain:

episode_N.hdf5
├── action              # (T, 16) float32
└── observations/
    ├── qpos           # (T, 16) float32
    └── images/
        ├── r_vis/     # (T, H, W, 3) uint8
        └── top/       # (T, H, W, 3) uint8
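For reference, a dummy episode file with this exact layout can be written with h5py. The episode length and image size below are placeholders; only the key names and dtypes come from the structure above:

```python
import h5py
import numpy as np

T, H, W = 50, 240, 320  # episode length and image size are placeholders

with h5py.File('episode_0.hdf5', 'w') as f:
    f.create_dataset('action', data=np.zeros((T, 16), dtype=np.float32))
    obs = f.create_group('observations')
    obs.create_dataset('qpos', data=np.zeros((T, 16), dtype=np.float32))
    imgs = obs.create_group('images')
    for cam in ('r_vis', 'top'):
        imgs.create_dataset(cam, data=np.zeros((T, H, W, 3), dtype=np.uint8))
```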

2. Generate Dataset Statistics

Create data_stats.pkl with:

import pickle
import numpy as np

stats = {
    # Placeholder identity statistics (normalization becomes a no-op).
    # Replace with real per-dimension mean/std computed from your data.
    'action': {
        'mean': np.zeros(16),
        'std': np.ones(16)
    },
    'qpos': {
        'mean': np.zeros(16),
        'std': np.ones(16)
    }
}

with open('/path/to/your/dataset/data_stats.pkl', 'wb') as f:
    pickle.dump(stats, f)

Or use the provided script:

python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/your/dataset
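What the statistics amount to is a per-dimension mean/std over all timesteps of all episodes. A sketch under that assumption (the provided script may differ in details such as the epsilon guard):

```python
import pickle
import numpy as np

def compute_stats(actions, qposes):
    """Per-dimension mean/std over all timesteps of all episodes.

    actions, qposes: lists of (T_i, 16) arrays, one per episode.
    """
    all_a = np.concatenate(actions, axis=0)
    all_q = np.concatenate(qposes, axis=0)
    eps = 1e-6  # guard against zero std in constant dimensions
    return {
        'action': {'mean': all_a.mean(0), 'std': all_a.std(0) + eps},
        'qpos':   {'mean': all_q.mean(0), 'std': all_q.std(0) + eps},
    }

stats = compute_stats([np.random.randn(30, 16)], [np.random.randn(30, 16)])
with open('data_stats.pkl', 'wb') as f:
    pickle.dump(stats, f)
```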

Usage

1. Update Dataset Path

Edit roboimi/vla/conf/data/resnet_dataset.yaml:

dataset_dir: "/path/to/your/dataset"  # CHANGE THIS
camera_names:
  - r_vis  # CHANGE TO YOUR CAMERA NAMES
  - top

2. Run Training

# Basic training
python roboimi/demos/vla_scripts/train_vla.py

# Override configurations
python roboimi/demos/vla_scripts/train_vla.py train.batch_size=16
python roboimi/demos/vla_scripts/train_vla.py train.device=cpu
python roboimi/demos/vla_scripts/train_vla.py train.max_steps=20000
python roboimi/demos/vla_scripts/train_vla.py data.dataset_dir=/custom/path

# Debug mode (CPU, small batch, few steps)
python roboimi/demos/vla_scripts/train_vla.py \
    train.device=cpu \
    train.batch_size=2 \
    train.max_steps=10 \
    train.num_workers=0

3. Monitor Training

Checkpoints are saved to:

  • checkpoints/vla_model_step_1000.pt - Periodic checkpoints
  • checkpoints/vla_model_best.pt - Best model (lowest loss)
  • checkpoints/vla_model_final.pt - Final model

Architecture Details

Data Flow

  1. Input: Images from multiple cameras + proprioception (qpos)
  2. Vision Encoder: ResNet-18 → SpatialSoftmax → (B, T, 1024) per camera
  3. Feature Concatenation: All cameras + qpos → Global conditioning
  4. Diffusion Policy: 1D U-Net predicts noise on action sequences
  5. Output: Clean action sequence of shape (B, pred_horizon, action_dim) = (B, 16, 16)
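Assuming conditioning is built by concatenating all camera features with qpos per step and stacking over the observation horizon (the exact layout is defined by the agent implementation), the dimensions from this configuration work out as:

```python
n_cameras, vision_dim = 2, 1024   # two cameras, ResNet-18 + SpatialSoftmax
obs_dim, obs_horizon = 16, 2      # qpos size and observation history

# Assumed layout: per-step features are concatenated across cameras and qpos,
# then stacked over the observation horizon.
per_step = n_cameras * vision_dim + obs_dim
global_cond_dim = obs_horizon * per_step
print(per_step, global_cond_dim)  # 2064 4128
```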

Training Process

  1. Sample a random timestep t uniformly from [0, 100)
  2. Add noise to ground truth actions
  3. Predict noise using vision + proprioception conditioning
  4. Compute MSE loss between predicted and actual noise
  5. Backpropagate and update weights
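The steps above can be sketched in NumPy with a stand-in noise predictor. The beta schedule here is the standard linear DDPM one, chosen for illustration; the actual scheduler parameters are set by the training code:

```python
import numpy as np

rng = np.random.default_rng(0)
T_diff, pred_horizon, action_dim = 100, 16, 16

# Linear beta schedule and cumulative alpha-bar, as in vanilla DDPM.
betas = np.linspace(1e-4, 0.02, T_diff)
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(actions, predict_noise):
    """One diffusion training step: noise the actions, regress the noise."""
    t = rng.integers(0, T_diff)                      # 1. random timestep
    eps = rng.standard_normal(actions.shape)         # 2. gaussian noise
    noisy = np.sqrt(alpha_bar[t]) * actions + np.sqrt(1 - alpha_bar[t]) * eps
    pred = predict_noise(noisy, t)                   # 3. network prediction
    return np.mean((pred - eps) ** 2)                # 4. MSE loss

# A stand-in "network" that returns zeros; the real model is the 1D U-Net.
actions = rng.standard_normal((pred_horizon, action_dim))
loss = training_loss(actions, lambda x, t: np.zeros_like(x))
print(loss >= 0.0)  # True
```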

Inference Process

  1. Extract visual features from current observation
  2. Start with random noise action sequence
  3. Iteratively denoise over 10 steps (DDPM scheduler)
  4. Return clean action sequence
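A minimal sketch of the denoising loop, again with a stand-in predictor. Running only 10 of the 100 trained timesteps is modeled here by striding over the schedule; the real scheduler's update rule may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
T_diff, pred_horizon, action_dim = 100, 16, 16
betas = np.linspace(1e-4, 0.02, T_diff)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def sample(predict_noise, n_steps=10):
    """DDPM-style ancestral sampling over a strided subset of timesteps."""
    x = rng.standard_normal((pred_horizon, action_dim))  # start from noise
    for t in np.linspace(T_diff - 1, 0, n_steps).astype(int):
        eps = predict_noise(x, t)
        # Posterior mean update (noise term omitted at t == 0).
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

actions = sample(lambda x, t: np.zeros_like(x))
print(actions.shape)  # (16, 16)
```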

Common Issues

Issue: Out of Memory

Solution: Reduce batch size or use CPU

python train_vla.py train.batch_size=4 train.device=cpu

Issue: Dataset not found

Solution: Check dataset_dir path in config

python train_vla.py data.dataset_dir=/absolute/path/to/dataset

Issue: Camera names mismatch

Solution: Update camera_names in data config

# roboimi/vla/conf/data/resnet_dataset.yaml
camera_names:
  - your_camera_1
  - your_camera_2

Issue: data_stats.pkl missing

Solution: Generate statistics file

python -m roboimi.vla.scripts.calculate_stats --dataset_dir /path/to/dataset

Model Files Created

roboimi/vla/
├── conf/
│   ├── config.yaml (UPDATED)
│   ├── backbone/
│   │   └── resnet.yaml (NEW)
│   ├── agent/
│   │   └── resnet_diffusion.yaml (NEW)
│   └── data/
│       └── resnet_dataset.yaml (NEW)
├── models/
│   └── backbones/
│       ├── __init__.py (UPDATED - added resnet export)
│       └── resnet.py (EXISTING)
└── demos/vla_scripts/
    └── train_vla.py (REWRITTEN)

Next Steps

  1. Prepare your dataset in the required HDF5 format
  2. Update dataset_dir in roboimi/vla/conf/data/resnet_dataset.yaml
  3. Run training with python roboimi/demos/vla_scripts/train_vla.py
  4. Monitor checkpoints in checkpoints/ directory
  5. Evaluate the trained model using the best checkpoint

Advanced Configuration

Use Different ResNet Variant

Edit roboimi/vla/conf/agent/resnet_diffusion.yaml:

vision_backbone:
  model_name: "microsoft/resnet-50"  # or resnet-34, resnet-101

Adjust Diffusion Steps

# More steps = better quality, slower training
diffusion_steps: 200  # default: 100

Change Horizons

pred_horizon: 32  # Predict more future steps
obs_horizon: 4    # Use more history

Multi-GPU Training

# Use CUDA device 1
python train_vla.py train.device=cuda:1

# For multi-GPU, use torch.distributed (requires code modification)

References