ResNet Multitoken IMF Design
Status: user-specified architecture, treated as approved on 2026-04-06.
Goal
Keep the standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from a single concatenated multiview token per obs step to three camera-specific condition tokens per obs step.
Approved architecture
- Vision trunk: standard `resnet18` residual network
- Cameras: `front`, `top`, `r_vis`
- Each camera uses its own ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
- Each camera produces one visual token
- For each obs step and each camera:
  - take that camera's visual token
  - concatenate the robot state
  - project to one condition token
- IMF input should receive 3 condition tokens per obs step, not one concatenated token
- With `obs_horizon=2`, the IMF cond sequence length becomes `2 * 3 = 6`
- IMF head remains on the existing IMF/AttnRes implementation path
- Vision trunk remains standard ResNet; no AttnRes vision replacement
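The per-step conditioning above can be sketched at the shape level with plain NumPy. This is a minimal sketch, not the roboimi implementation: the random features stand in for ResNet-18 outputs, and the single weight matrix stands in for the condition projector.

```python
import numpy as np

B, T, num_cams = 4, 2, 3                 # batch, obs_horizon=2, cameras (front, top, r_vis)
vis_dim, state_dim, n_emb = 512, 9, 256  # illustrative widths; state_dim is an assumption

rng = np.random.default_rng(0)
# One visual token per camera per obs step: (B, T, num_cams, vis_dim)
vis_tokens = rng.normal(size=(B, T, num_cams, vis_dim))
robot_state = rng.normal(size=(B, T, state_dim))

# Broadcast the robot state to every camera token, then concatenate per token
state_per_cam = np.broadcast_to(robot_state[:, :, None, :], (B, T, num_cams, state_dim))
cond_in = np.concatenate([vis_tokens, state_per_cam], axis=-1)  # (B, T, num_cams, vis_dim + state_dim)

# One shared condition projector applied to each token (stubbed as a linear map)
W = rng.normal(size=(vis_dim + state_dim, n_emb))
cond_tokens = cond_in @ W                                       # (B, T, num_cams, n_emb)

# Flatten (T, num_cams) into the IMF cond sequence: 2 * 3 = 6 tokens
imf_cond = cond_tokens.reshape(B, T * num_cams, n_emb)
print(imf_cond.shape)  # (4, 6, 256)
```

The key difference from the old path is that concatenation across cameras happens along the sequence axis, not the feature axis.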
Design choices
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
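The opt-in backbone mode can be sketched as follows. This is an illustrative stub, not the real `ResNetDiffusionBackbone`: the flag name `return_per_camera_tokens` is hypothetical, and the zero features stand in for per-camera ResNet-18 outputs.

```python
import numpy as np

class ResNetBackboneSketch:
    """Shape-level stand-in for the proposed opt-in mode (names are assumptions)."""

    def __init__(self, num_cams: int, feat_dim: int, return_per_camera_tokens: bool = False):
        self.num_cams = num_cams
        self.feat_dim = feat_dim
        self.return_per_camera_tokens = return_per_camera_tokens

    def forward(self, images: np.ndarray) -> np.ndarray:
        # images: (B, T, num_cams, C, H, W); real code would run one ResNet-18 per camera
        B, T = images.shape[:2]
        feats = np.zeros((B, T, self.num_cams, self.feat_dim))
        if self.return_per_camera_tokens:
            return feats                                            # new: (B, T, num_cams, D)
        return feats.reshape(B, T, self.num_cams * self.feat_dim)   # legacy: (B, T, num_cams * D)

legacy = ResNetBackboneSketch(num_cams=3, feat_dim=512)
multi = ResNetBackboneSketch(num_cams=3, feat_dim=512, return_per_camera_tokens=True)
imgs = np.zeros((4, 2, 3, 3, 96, 96))
print(legacy.forward(imgs).shape)  # (4, 2, 1536)
print(multi.forward(imgs).shape)   # (4, 2, 3, 512)
```

Keeping the flag opt-in means existing configs that expect the concatenated `(B, T, num_cams * D)` layout are unaffected.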
Files expected to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/agent.py`
- new Hydra agent config for the multitoken ResNet IMF variant
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`