feat: add vision transfer backbones and IMF variants
This commit is contained in:
@@ -0,0 +1,32 @@
|
||||
# ResNet Multitoken IMF Design
|
||||
|
||||
**Status:** user-specified architecture, treated as approved on 2026-04-06.
|
||||
|
||||
## Goal
|
||||
Keep a standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from one concatenated multiview token per obs step into three camera-specific condition tokens per obs step.
|
||||
|
||||
## Approved architecture
|
||||
- Vision trunk: standard `resnet18` residual network
|
||||
- Cameras: `front`, `top`, `r_vis`
|
||||
- Each camera uses its **own** ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
|
||||
- Each camera produces one visual token
|
||||
- For each obs step and each camera:
|
||||
1. take that camera visual token
|
||||
2. concatenate robot state
|
||||
3. project to one condition token
|
||||
- IMF input should receive **3 condition tokens per obs step**, not one concatenated token
|
||||
- With `obs_horizon=2`, IMF cond sequence length becomes `2 * 3 = 6`
|
||||
- IMF head remains on the existing IMF/AttnRes implementation path
|
||||
- Vision trunk remains standard ResNet; **no AttnRes vision replacement**
|
||||
|
||||
## Design choices
|
||||
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
|
||||
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
|
||||
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
|
||||
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
|
||||
|
||||
## Files expected to change
|
||||
- `roboimi/vla/models/backbones/resnet_diffusion.py`
|
||||
- `roboimi/vla/agent.py`
|
||||
- new Hydra agent config for the multitoken ResNet IMF variant
|
||||
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`
|
||||
Reference in New Issue
Block a user