ResNet Multitoken IMF Design
Status: user-specified architecture, treated as approved on 2026-04-06.
Goal
Keep the standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from a single concatenated multiview token per obs step to three camera-specific condition tokens per obs step.
Approved architecture
- Vision trunk: standard `resnet18` residual network
- Cameras: `front`, `top`, `r_vis`
- Each camera uses its own ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
- Each camera produces one visual token
- For each obs step and each camera:
  - take that camera's visual token
  - concatenate the robot state
  - project to one condition token
- IMF input should receive 3 condition tokens per obs step, not one concatenated token
- With `obs_horizon=2`, the IMF cond sequence length becomes `2 * 3 = 6`
- IMF head remains on the existing IMF/AttnRes implementation path
- Vision trunk remains standard ResNet; no AttnRes vision replacement
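The per-step conditioning above can be sketched at the shape level with plain NumPy. This is a minimal sketch, not the roboimi implementation: the random features stand in for ResNet-18 outputs, and the single weight matrix stands in for the condition projector.

```python
import numpy as np

B, T, num_cams = 4, 2, 3                 # batch, obs_horizon=2, cameras (front, top, r_vis)
vis_dim, state_dim, n_emb = 512, 9, 256  # illustrative widths; state_dim is an assumption

rng = np.random.default_rng(0)
# One visual token per camera per obs step: (B, T, num_cams, vis_dim)
vis_tokens = rng.normal(size=(B, T, num_cams, vis_dim))
robot_state = rng.normal(size=(B, T, state_dim))

# Broadcast the robot state to every camera token, then concatenate per token
state_per_cam = np.broadcast_to(robot_state[:, :, None, :], (B, T, num_cams, state_dim))
cond_in = np.concatenate([vis_tokens, state_per_cam], axis=-1)  # (B, T, num_cams, vis_dim + state_dim)

# One shared condition projector applied to each token (stubbed as a linear map)
W = rng.normal(size=(vis_dim + state_dim, n_emb))
cond_tokens = cond_in @ W                                       # (B, T, num_cams, n_emb)

# Flatten (T, num_cams) into the IMF cond sequence: 2 * 3 = 6 tokens
imf_cond = cond_tokens.reshape(B, T * num_cams, n_emb)
print(imf_cond.shape)  # (4, 6, 256)
```

The key difference from the old path is that concatenation across cameras happens along the sequence axis, not the feature axis.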
Design choices
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
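The opt-in backbone mode can be sketched as follows. This is an illustrative stub, not the real `ResNetDiffusionBackbone`: the flag name `return_per_camera_tokens` is hypothetical, and the zero features stand in for per-camera ResNet-18 outputs.

```python
import numpy as np

class ResNetBackboneSketch:
    """Shape-level stand-in for the proposed opt-in mode (names are assumptions)."""

    def __init__(self, num_cams: int, feat_dim: int, return_per_camera_tokens: bool = False):
        self.num_cams = num_cams
        self.feat_dim = feat_dim
        self.return_per_camera_tokens = return_per_camera_tokens

    def forward(self, images: np.ndarray) -> np.ndarray:
        # images: (B, T, num_cams, C, H, W); real code would run one ResNet-18 per camera
        B, T = images.shape[:2]
        feats = np.zeros((B, T, self.num_cams, self.feat_dim))
        if self.return_per_camera_tokens:
            return feats                                            # new: (B, T, num_cams, D)
        return feats.reshape(B, T, self.num_cams * self.feat_dim)   # legacy: (B, T, num_cams * D)

legacy = ResNetBackboneSketch(num_cams=3, feat_dim=512)
multi = ResNetBackboneSketch(num_cams=3, feat_dim=512, return_per_camera_tokens=True)
imgs = np.zeros((4, 2, 3, 3, 96, 96))
print(legacy.forward(imgs).shape)  # (4, 2, 1536)
print(multi.forward(imgs).shape)   # (4, 2, 3, 512)
```

Keeping the flag opt-in means existing configs that expect the concatenated `(B, T, num_cams * D)` layout are unaffected.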
Files expected to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/agent.py`
- new Hydra agent config for the multitoken ResNet IMF variant
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`