feat: add vision transfer backbones and IMF variants

Logic
2026-04-09 14:02:24 +08:00
parent d51b3ecafa
commit ff7c9c1f2a
58 changed files with 2788 additions and 26 deletions

@@ -0,0 +1,92 @@
# LEWM ViT Backbone Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace the current ResNet visual encoder in roboimi VLA training with a frozen LEWM ViT visual backbone (encoder + projector) that consumes the three camera views jointly and outputs one 192-d CLS embedding per timestep, then launch two 50k runs on the 5880 machine.
**Architecture:** Add a new joint-multiview LEWM backbone that fuses `front/top/r_vis` into one LEWM-style image, reproduces LEWM preprocessing, loads frozen weights from the trained checkpoint, and exposes a `joint_output_dim=192`. Add a minimal `VLAAgent` compatibility branch so conditions can be sized from joint visual dim instead of `output_dim * num_cams`, while leaving the rest of the diffusion pipeline unchanged.
**Tech Stack:** PyTorch, transformers `ViTModel`, Hydra configs, existing roboimi VLA training/eval scripts, remote SSH/rsync to 100.73.14.65.
---
### Task 1: Add failing tests for LEWM joint-vision backbone contract
**Files:**
- Create: `tests/test_lewm_vit_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write the failing backbone shape/load test**
- [ ] **Step 2: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify it fails**
- [ ] **Step 3: Extend `tests/test_imf_vla_agent.py` with a failing joint-output backbone case**
- [ ] **Step 4: Run `pytest tests/test_imf_vla_agent.py -q` and verify it fails**
### Task 2: Implement LEWM joint-multiview frozen backbone
**Files:**
- Create: `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- Modify: `roboimi/vla/models/backbones/__init__.py` only if exports are needed
- [ ] **Step 1: Create `LEWMViTBackbone` with public attrs `camera_names`, `num_cameras`, `joint_output_dim=192`**
- [ ] **Step 2: Reproduce LEWM preprocessing and joint multiview fusion**
- [ ] **Step 3: Load checkpoint weights from `model.encoder.*` and `model.projector.*`**
- [ ] **Step 4: Freeze encoder/projector and keep them in eval mode via `train()` override**
- [ ] **Step 5: Run `pytest tests/test_lewm_vit_backbone.py -q` and verify green**
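A minimal sketch of Step 4 (freeze and stay in eval mode through a `train()` override). `FrozenBackboneSketch` is a hypothetical stand-in, not the real `LEWMViTBackbone`: the encoder is a placeholder `nn.Linear`, and the projector layer order is an assumption based on the `MLP(192 -> 2048 -> 192)` with `BatchNorm1d + GELU` description in the design doc.

```python
import torch
from torch import nn


class FrozenBackboneSketch(nn.Module):
    """Hypothetical stand-in for a frozen encoder + projector pair."""

    def __init__(self, hidden_dim: int = 192, proj_hidden: int = 2048):
        super().__init__()
        # Placeholder encoder; the real module would wrap the LEWM ViT.
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        # Assumed layer order: Linear -> BatchNorm1d -> GELU -> Linear.
        self.projector = nn.Sequential(
            nn.Linear(hidden_dim, proj_hidden),
            nn.BatchNorm1d(proj_hidden),
            nn.GELU(),
            nn.Linear(proj_hidden, hidden_dim),
        )
        for param in self.parameters():
            param.requires_grad_(False)
        self.eval()

    def train(self, mode: bool = True):
        # Ignore the requested mode so BatchNorm running stats never update,
        # even when a parent module calls .train() on the whole agent.
        return super().train(False)

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projector(self.encoder(x))


backbone = FrozenBackboneSketch()
backbone.train()  # a parent agent would trigger this during training
features = backbone(torch.randn(4, 192))
```

Overriding `train()` rather than calling `eval()` once matters because the training loop typically calls `agent.train()` each epoch, which would otherwise flip the BatchNorm layer back into train mode.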
### Task 3: Add minimal agent support for joint visual dim
**Files:**
- Modify: `roboimi/vla/agent.py`
- Test: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Add a `joint_output_dim` branch in `VLAAgent.__init__` for `per_step_cond_dim` / `global_cond_dim`**
- [ ] **Step 2: Keep `_build_cond()` semantics unchanged except for matching the new dim contract**
- [ ] **Step 3: Run `pytest tests/test_imf_vla_agent.py -q` and verify green**
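One way the `joint_output_dim` branch from Step 1 could look, as a sketch only. The helper names `resolve_visual_dim` and `per_step_cond_dim` are hypothetical and are not taken from `roboimi/vla/agent.py`; `obs_dim=16` follows the 16-d state mentioned in the design doc.

```python
from types import SimpleNamespace


def resolve_visual_dim(backbone) -> int:
    """Prefer a joint visual dim when the backbone exposes one."""
    joint = getattr(backbone, "joint_output_dim", None)
    if joint is not None:
        return int(joint)
    # Legacy contract: per-camera output_dim times number of cameras.
    return int(backbone.output_dim) * int(backbone.num_cameras)


def per_step_cond_dim(backbone, obs_dim: int) -> int:
    """Visual features are concatenated with robot state per timestep."""
    return resolve_visual_dim(backbone) + obs_dim


# Stand-in backbones carrying only the attributes the helper reads.
lewm_like = SimpleNamespace(joint_output_dim=192, num_cameras=3)
resnet_like = SimpleNamespace(output_dim=64, num_cameras=3)

lewm_dim = per_step_cond_dim(lewm_like, obs_dim=16)
resnet_dim = per_step_cond_dim(resnet_like, obs_dim=16)
```

Both paths land on 208, which is the `head.cond_dim` used in the Task 4 config.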
### Task 4: Add Hydra configs for LEWM backbone training
**Files:**
- Create: `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- [ ] **Step 1: Add backbone config pointing to the new LEWM backbone**
- [ ] **Step 2: Add `agent=lewm_imf_attnres` config with 3 cameras and `head.cond_dim=208`**
- [ ] **Step 3: Verify Hydra instantiation with a one-shot compose smoke**
### Task 5: Verify focused local tests
**Files:**
- Reuse the above
- [ ] **Step 1: Run `pytest tests/test_lewm_vit_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless_import.py -q`**
- [ ] **Step 2: If needed, run one tiny local import/forward smoke**
### Task 6: Sync to 5880 and remote smoke with real checkpoint
**Files:**
- Remote target: `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Rsync modified source/config files to `100.73.14.65:/home/droid/roboimi_suite_20260404`**
- [ ] **Step 2: Run a 2-step smoke on GPU0 with `agent.head.n_emb=384`, `train.rollout_num_episodes=10`, real LEWM checkpoint**
- [ ] **Step 3: Run a 2-step smoke on GPU1 with `agent.head.n_emb=256`, same checkpoint**
### Task 7: Launch two real 50k runs on the 5880 machine
**Files:**
- Remote logs under `/home/droid/roboimi_suite_20260404/experiment_suite_launch_logs/`
- [ ] **Step 1: Launch embed384/layer12 on GPU0**
- [ ] **Step 2: Launch embed256/layer12 on GPU1**
- [ ] **Step 3: Ensure both use `data.camera_names=[r_vis,top,front]`, `pred_horizon=16`, `num_action_steps=8`, `train.rollout_num_episodes=10`, `max_steps=50000`**
- [ ] **Step 4: Record run names, pids, log paths, SwanLab URLs**
### Task 8: Update experiment tracking docs and commit
**Files:**
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/manifest.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/status.json`
- Create: `experiment_suites/2026-04-05-lewm-vit-transfer/notes.md`
- [ ] **Step 1: Record checkpoint path, frozen LEWM design, rollout=10, and both run configs**
- [ ] **Step 2: Record running status after launch**
- [ ] **Step 3: Commit implementation + docs with a focused message**

@@ -0,0 +1,81 @@
# ResNet Multitoken IMF Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Implement a standard-ResNet-18 multiview IMF variant that emits three condition tokens per obs step and launch four L20 experiments for `n_emb in {256,384}` and `n_layer in {12,16}`.
**Architecture:** The ResNet backbone will optionally return one token per camera instead of concatenating all cameras into one token. `VLAAgent` will pair each camera token with the current state, project each pair into a condition token, flatten the per-step camera tokens into one cond sequence, and feed that sequence into the existing IMF/AttnRes head.
**Tech Stack:** PyTorch, torchvision ResNet-18, Hydra, pytest, SwanLab, SSH/Tailscale.
---
### Task 1: Add failing tests for multi-token conditioning
**Files:**
- Modify: `tests/test_imf_vla_agent.py`
- Modify: `tests/test_resnet_transformer_agent_wiring.py`
- [ ] **Step 1: Add a direct agent test**
- Stub a vision backbone returning `(B,T,3,D)` and assert `_build_cond()` yields `(B, T*3, D_cond)`.
- Assert state is paired with each camera token, not concatenated across cameras first.
- [ ] **Step 2: Add Hydra wiring test**
- Instantiate a new `agent=resnet_imf_attnres_multitoken` config with small dims.
- Assert `condition_tokens_per_step == 3`, `condition_sequence_length == obs_horizon * 3`, and head `n_obs_steps` receives that sequence length.
- [ ] **Step 3: Run focused tests and verify RED**
- `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
### Task 2: Implement multi-token ResNet conditioning path
**Files:**
- Modify: `roboimi/vla/models/backbones/resnet_diffusion.py`
- Modify: `roboimi/vla/agent.py`
- Create: `roboimi/vla/conf/agent/resnet_imf_attnres_multitoken.yaml`
- [ ] **Step 1: Extend ResNet backbone**
- Add an opt-in flag to return `(B,T,num_cams,D)` camera tokens instead of one concatenated `(B,T,num_cams*D)` token.
- Keep standard ResNet-18 vision mode; do not switch to AttnRes vision.
- [ ] **Step 2: Extend VLAAgent condition building**
- Support visual features with rank 4 `(B,T,K,D)`.
- Broadcast state to `(B,T,K,D_state)`, concatenate per camera, apply projector per token, then flatten to `(B,T*K,D_cond)`.
- Track `condition_tokens_per_step` and `condition_sequence_length`.
- [ ] **Step 3: Update transformer-head instantiation**
- Pass `n_obs_steps=condition_sequence_length` when building transformer heads.
- [ ] **Step 4: Add Hydra config**
- New agent config uses:
- separate ResNet-18 per camera
- standard residual vision trunk (`vision_backbone_mode=resnet`)
- condition projector output dim tied to `${agent.head.n_emb}`
- rollout episodes `10`, `pred_horizon=16`, `num_action_steps=8`
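The rank-4 conditioning path in Step 2 can be sketched as below. `build_multitoken_cond` is a hypothetical free function standing in for the corresponding branch of `_build_cond()`, and the sizes (`D=64`, `D_state=16`, `D_cond=256`) are illustrative assumptions.

```python
import torch
from torch import nn


def build_multitoken_cond(visual, state, projector):
    """visual: (B, T, K, D); state: (B, T, D_state) -> (B, T*K, D_cond)."""
    B, T, K, _ = visual.shape
    # Broadcast the same state vector to every camera token of a step.
    state_per_cam = state.unsqueeze(2).expand(B, T, K, state.shape[-1])
    # Pair each camera token with state before projecting.
    paired = torch.cat([visual, state_per_cam], dim=-1)
    tokens = projector(paired)  # acts on the last dim only
    # Flatten (T, K) so token index is t * K + k.
    return tokens.reshape(B, T * K, tokens.shape[-1])


B, T, K, D, D_state, D_cond = 2, 2, 3, 64, 16, 256
visual = torch.randn(B, T, K, D)
state = torch.randn(B, T, D_state)
cond = build_multitoken_cond(visual, state, nn.Linear(D + D_state, D_cond))
# With an identity projector the pairing can be inspected directly.
paired_only = build_multitoken_cond(visual, state, nn.Identity())
```

Here `cond.shape[1]` is what the plan calls `condition_sequence_length`, the value passed to the head as `n_obs_steps`.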
### Task 3: Verify locally
**Files:**
- Modify only if verification reveals issues
- [ ] **Step 1: Run focused tests and make them pass**
- `python -m pytest tests/test_imf_vla_agent.py tests/test_resnet_transformer_agent_wiring.py -q`
- [ ] **Step 2: Run regression subset**
- `python -m pytest tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 3: Run local smoke instantiation**
- instantiate the new Hydra config and verify cond shape / sequence length
### Task 4: Launch 4 L20 experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Sync code to `100.119.99.14`**
- [ ] **Step 2: Smoke the new config on remote**
- [ ] **Step 3: Launch runs**
- `(n_emb=256, n_layer=12)`
- `(n_emb=256, n_layer=16)`
- `(n_emb=384, n_layer=12)`
- `(n_emb=384, n_layer=16)`
- [ ] **Step 4: Keep fixed across runs**
- rollout episodes `10`
- `pred_horizon=16`
- `num_action_steps=8`
- standard ResNet-18 vision trunk
- three separate camera weights
- [ ] **Step 5: Record PIDs, GPUs, log paths, SwanLab URLs**
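A small sketch for enumerating the four runs in Step 3 as Hydra-style override lists. `agent.head.n_emb` and `train.rollout_num_episodes` appear in the LEWM plan; `agent.head.n_layer` is an assumed key name, and the remaining fixed settings are left out because their override paths are not given here.

```python
from itertools import product


def sweep_overrides(n_embs=(256, 384), n_layers=(12, 16)):
    """Return one list of override strings per (n_emb, n_layer) pair."""
    fixed = [
        "agent=resnet_imf_attnres_multitoken",
        "train.rollout_num_episodes=10",
    ]
    runs = []
    for n_emb, n_layer in product(n_embs, n_layers):
        runs.append(
            fixed
            + [
                f"agent.head.n_emb={n_emb}",
                f"agent.head.n_layer={n_layer}",  # assumed key name
            ]
        )
    return runs


runs = sweep_overrides()
```

Generating the grid in one place keeps the fixed settings identical across runs, which is the point of Step 4.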

@@ -0,0 +1,78 @@
# SigLIP2 Multiview VLA Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Integrate a frozen shared SigLIP2 multiview encoder into the IMF/AttnRes policy, preserve raw-256 image handling, and launch two 50k-step experiments on the 5880 host with per-view projection dims 96 and 192.
**Architecture:** A new backbone will independently encode each camera view with SigLIP2 and project each 768-d pooled feature to a configurable per-view dimension. `VLAAgent` will concatenate visual features with robot state, then optionally project the combined per-step condition to the head's required 384-d interface before diffusion training/inference.
**Tech Stack:** PyTorch, transformers SigLIP2, Hydra, pytest, SSH/Tailscale, SwanLab.
---
### Task 1: Add failing tests for SigLIP2 backbone and projected conditioning
**Files:**
- Create: `tests/test_siglip2_diffusion_backbone.py`
- Modify: `tests/test_imf_vla_agent.py`
- [ ] **Step 1: Write failing backbone tests**
- Instantiate the new backbone with a stub SigLIP2 vision model.
- Assert raw dataset resize is `None`, eval resize is `(256, 256)`, output shape is `(B, T, 3 * per_view_output_dim)`.
- Assert three views are encoded independently and projected.
- [ ] **Step 2: Run focused tests and verify RED**
- Run `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py -q`
- Expect failure because the backbone/config/projector do not exist yet.
- [ ] **Step 3: Extend agent wiring tests**
- Add a Hydra/instantiate test for a new SigLIP2 IMF config.
- Assert raw condition dim `3 * per_view_output_dim + obs_dim`, projected cond dim `384`, and head `cond_dim == 384`.
### Task 2: Implement SigLIP2 backbone and optional condition projector
**Files:**
- Create: `roboimi/vla/models/backbones/siglip2_diffusion_backbone.py`
- Create: `roboimi/vla/conf/backbone/siglip2_diffusion.yaml`
- Create: `roboimi/vla/conf/agent/siglip2_imf_attnres.yaml`
- Create: `roboimi/vla/conf/modules/linear_condition_projector.yaml`
- Modify: `roboimi/vla/models/backbones/__init__.py`
- Modify: `roboimi/vla/agent.py`
- [ ] **Step 1: Implement backbone**
- Load `SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-256")`.
- Normalize `[0,1]` pixels with mean/std `0.5` and encode each view independently.
- Project each 768-d pooled feature to configurable per-view dim and concatenate across cameras.
- [ ] **Step 2: Implement optional condition projector**
- Allow `VLAAgent` to accept `cond_projector`.
- Track `raw_per_step_cond_dim` and projected `per_step_cond_dim` / `global_cond_dim`.
- Apply the projector in `_build_cond()` after visual+state concatenation.
- [ ] **Step 3: Add Hydra configs**
- New agent config should default to `n_emb=384`, `n_layer=12`, `pred_horizon=16`, `num_action_steps=8`, `head.cond_dim=384`.
- Backbone config should set `dataset_image_resize_shape: null` and `eval_image_resize_shape: [256, 256]`.
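A sketch of the backbone flow in Step 1 with a stub in place of the real SigLIP2 model, so nothing is downloaded. Assumptions: `PerViewProjectionSketch` and `StubEncoder` are hypothetical names, the camera names are borrowed from the other plans in this commit, and one `nn.Linear` per camera is only one reading of "per-view projectors" (a single shared projector would also fit the text).

```python
import torch
from torch import nn


class StubEncoder(nn.Module):
    """Stand-in for the frozen vision model; returns a (N, 768) feature."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Spatial mean gives (N, 3); tile it to 768 features.
        return x.mean(dim=(2, 3)).repeat(1, 256)


class PerViewProjectionSketch(nn.Module):
    def __init__(self, encoder, camera_names=("r_vis", "top", "front"),
                 feat_dim=768, per_view_output_dim=96):
        super().__init__()
        self.encoder = encoder  # one encoder shared by all views
        self.camera_names = tuple(camera_names)
        # Assumption: a separate trainable projector per camera.
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(feat_dim, per_view_output_dim)
             for name in self.camera_names}
        )

    def forward(self, images):
        feats = []
        for name in self.camera_names:
            x = images[name]  # (B, T, C, H, W), values in [0, 1]
            B, T = x.shape[:2]
            x = (x.flatten(0, 1) - 0.5) / 0.5  # mean/std 0.5 -> [-1, 1]
            with torch.no_grad():  # the encoder stays frozen
                pooled = self.encoder(x)
            feats.append(self.projectors[name](pooled).reshape(B, T, -1))
        return torch.cat(feats, dim=-1)  # (B, T, 3 * per_view_output_dim)


model = PerViewProjectionSketch(StubEncoder())
images = {name: torch.rand(2, 2, 3, 8, 8) for name in model.camera_names}
out = model(images)
```

The tiny `8x8` inputs are only for a fast shape check; the real path would feed `256x256` images.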
### Task 3: Verify locally and prepare remote execution
**Files:**
- Modify as needed only if tests/smoke reveal issues
- [ ] **Step 1: Run focused tests and make them pass**
- `pytest tests/test_siglip2_diffusion_backbone.py tests/test_imf_vla_agent.py tests/test_eval_vla_headless.py tests/test_train_vla_rollout_validation.py tests/test_simple_robot_dataset_image_loading.py -q`
- [ ] **Step 2: Run a local smoke instantiation**
- Instantiate the new Hydra config with stubbed optional modules or offline-safe monkeypatching.
- [ ] **Step 3: Review diffs for unintended LEWM/raw256 regressions**
### Task 4: Sync to 5880 and launch experiments
**Files:**
- Remote repo copy under `/home/droid/roboimi_suite_20260404`
- [ ] **Step 1: Stop superseded remote jobs**
- [ ] **Step 2: Sync updated code to remote**
- Prefer `rsync` or `git push/pull` without overwriting unrelated files.
- [ ] **Step 3: Remote smoke test**
- Confirm SigLIP2 model download/import works in `/home/droid/miniforge3/envs/roboimi/bin/python`.
- Confirm headless rollout path still uses `256x256` eval resize.
- [ ] **Step 4: Launch experiment A**
- `per_view_output_dim=96`, `embed=384`, `layer=12`, `pred=16`, `exec=8`, `steps=50000`.
- [ ] **Step 5: Launch experiment B**
- `per_view_output_dim=192`, same other hyperparameters.
- [ ] **Step 6: Record PIDs, GPUs, log paths, and SwanLab run URLs.**

@@ -0,0 +1,138 @@
# LEWM ViT Backbone Replacement Design
## Goal
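Replace the ResNet visual encoder in the current roboimi VLA policy with the frozen ViT visual encoder (encoder + projector) from the LEWM checkpoint, using only the 192-dimensional embedding of the final CLS token as the visual feature.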
## User constraints
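- Use the trained checkpoint confirmed in `/home/droid/下载/lewm_sim_transfer_checkpoint_usage.md`
- Use only the visual encoding part: `encoder + projector`
- Freeze the weights
- Keep the overall flow of concatenating visual features with state, then feeding the diffusion transformer
- Use three camera views as input: `[r_vis, top, front]`
- Launch two training runs on the 5880 machine: `embed=384/layer=12` and `embed=256/layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- `50k` steps per run
- Rollout validation uses `10` episodes each time, not the previous `5`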
## Trusted existing facts
1. LEWM checkpoint path:
- `/home/droid/le-wm/lewm-sim-transfer/pa1w85md8jop6bvol8oxp/checkpoints/epoch=99-step=47800.ckpt`
2. state_dict prefixes to load:
- `model.encoder.*`
- `model.projector.*`
3. LEWM ViT configuration:
- encoder scale: `tiny`
- hidden size: `192`
- layers: `12`
- attention heads: `3`
- patch size: `14`
- projector: `MLP(192 -> 2048 -> 192)` with `BatchNorm1d + GELU`
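4. During LEWM training the three views are first stitched into a single image and passed through a single ViT encoder; the overall visual embedding is **192-dimensional**.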
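A sketch of how the two prefixes in item 2 could be separated before loading. `split_by_prefix` is a hypothetical helper, and the key names in the example dict are made up for illustration; the real checkpoint keys are not listed in this document.

```python
def split_by_prefix(state_dict, prefixes=("model.encoder.", "model.projector.")):
    """Group entries by prefix and strip the prefix from each key."""
    groups = {prefix: {} for prefix in prefixes}
    for key, value in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                groups[prefix][key[len(prefix):]] = value
                break  # a key belongs to at most one group
    return groups


# Made-up keys; values would be tensors in a real checkpoint.
fake_state_dict = {
    "model.encoder.layer.0.weight": 1,
    "model.projector.net.0.bias": 2,
    "model.predictor.head.weight": 3,  # not part of the visual path
}
groups = split_by_prefix(fake_state_dict)
encoder_weights = groups["model.encoder."]
projector_weights = groups["model.projector."]
```

Each stripped sub-dict can then be handed to the matching module's `load_state_dict`, so any missing or unexpected key surfaces as an error instead of being silently skipped.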
## Key design decision
### Chosen design: fuse 3 cameras into one LEWM-style image, output one 192-d visual vector per timestep
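Rather than treating the LEWM ViT as "one 192-d encoder per camera", follow the original LEWM training setup:
- Take the three-view image dict `{r_vis, top, front}` as input
- Stitch the views into one fused image in a fixed order
- Run a single frozen ViT + projector
- Obtain one **192-dimensional total visual feature**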
### Why this is the right replacement
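The **total visual feature dimension** that the current ResNet backbone hands to the policy head is:
- `64` per camera
- `192` in total across three cameras
The CLS/projector embedding produced by the LEWM checkpoint is also:
- `192` in total
So the most natural drop-in replacement for the current ResNet visual encoder is to:
- Have the LEWM backbone emit one 192-d total visual vector directly
- Concatenate it with the `16-d` state to get the same `208-d` condition vector as before
- Leave the overall interface and semantics of the diffusion head unchanged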
## Interface compatibility plan
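The existing `VLAAgent` assumes the backbone exposes:
- `camera_names`
- `num_cameras`
- `output_dim` (semantically the per-camera feature dimension)
- `forward(images_dict) -> (B, T, total_visual_dim)`
To stay compatible with the existing agent with minimal changes:
- The new LEWM backbone's `forward()` returns `(B, T, 192)`
- `camera_names = ('r_vis', 'top', 'front')`
- `num_cameras = 3`
- `output_dim = 64`
This way `VLAAgent` still computes internally:
- `per_step_cond_dim = output_dim * num_cams + obs_dim = 64*3 + 16 = 208`
which matches the `192 + 16 = 208` that `forward()` actually produces.
> In other words, `output_dim` in this backbone is kept as a per-camera placeholder dimension equivalent to the old ResNet total feature size, not the real projector output dimension. It is a compatibility shim that avoids changing the agent's main logic.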
## Image preprocessing design
The current roboimi dataset already loads each camera image as:
- `(C, 224, 224)`
- value range `[0, 1]`
The new LEWM backbone will:
1. Take `r_vis`, `top`, `front` in that order
2. Concatenate them along the width axis to get the fused image
- `(C, 224, 672)`
3. Apply the same ImageNet normalization as LEWM
- mean `[0.485, 0.456, 0.406]`
- std `[0.229, 0.224, 0.225]`
4. Call `ViTModel(..., interpolate_pos_encoding=True)`
5. Take `last_hidden_state[:, 0]`
6. Feed it to the frozen projector to get `(B*T, 192)`
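Steps 1 to 3 above can be sketched as follows. `fuse_and_normalize` is a hypothetical helper name; the ViT and projector calls (steps 4 to 6) are left out so the block runs without a checkpoint.

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def fuse_and_normalize(images, order=("r_vis", "top", "front")):
    """images[name]: (B, T, C, H, W) in [0, 1] -> (B*T, C, H, 3*W)."""
    views = [images[name] for name in order]
    fused = torch.cat(views, dim=-1)  # concatenate along width
    fused = fused.flatten(0, 1)       # merge batch and time for the ViT
    return (fused - IMAGENET_MEAN) / IMAGENET_STD


images = {name: torch.rand(2, 2, 3, 224, 224)
          for name in ("r_vis", "top", "front")}
pixel_values = fuse_and_normalize(images)
```

Concatenating on the last axis yields `224x672` (height x width); using `dim=-2` instead would produce the `672x224` orientation flagged under Risks.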
## Files to create / modify
### New files
- `roboimi/vla/models/backbones/lewm_vit_backbone.py`
- `roboimi/vla/conf/backbone/lewm_vit_diffusion.yaml`
- `roboimi/vla/conf/agent/lewm_imf_attnres.yaml`
- `tests/test_lewm_vit_backbone.py`
### Modified files
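- `roboimi/vla/models/backbones/__init__.py` (only if exports are needed)
- `tests/test_imf_vla_agent.py` (add an integration case for the new backbone)
- `roboimi/demos/vla_scripts/train_vla.py` (only to adjust rollout defaults/logging if needed; if command-line overrides are enough, avoid touching the main logic)
- Training/experiment suite docs (add a record for this LEWM ViT training)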
## Testing plan
1. **Unit test: load + forward**
- Use a synthetic checkpoint to verify that the new backbone correctly loads `model.encoder.*` and `model.projector.*`
- Input: 3 cameras, each `(B,T,C,224,224)`
- Output: `(B,T,192)`
2. **Agent integration test**
- backbone.output_dim=64, num_cameras=3
- agent `_build_cond()` output has last dimension `208`
3. **Remote smoke test on 5880**
- Use the real checkpoint
- `max_steps=2`
- Smoke each of the two experiments once
4. **Full run**
- GPU0: `embed=384, layer=12`
- GPU1: `embed=256, layer=12`
- `rollout_num_episodes=10`
## Training launch contract
- host: `100.73.14.65`
- code dir: `/home/droid/roboimi_suite_20260404`
- python: `/home/droid/miniforge3/envs/roboimi/bin/python`
- dataset: `/home/droid/sim_dataset/sim_transfer`
- cameras: `[r_vis, top, front]`
- agent: new `lewm_imf_attnres`
- max_steps: `50000`
- rollout every `5` epochs
- rollout episodes: `10`
## Risks
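1. If the fused-image preprocessing is implemented with the wrong orientation relative to LEWM training (224x672 vs 672x224), it will cause distribution shift.
2. The current roboimi env must have `transformers` installed; `environment.yml` suggests the dependency is already present locally, but the remote training environment still needs to be confirmed with a smoke run.
3. Because this is a frozen ViT + projector, the BatchNorm statistics will drift if the projector's BN stays in train mode, so the whole module must be put in `eval()` and frozen.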
## Recommended first implementation path
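- First implement a standalone `LEWMViTBackbone` class without changing the main logic of the existing `ResNetDiffusionBackbone`.
- Then wire it in through new Hydra backbone/agent configs.
- Prioritize "least intrusive + smoke runs + trainable on the remote host".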

@@ -0,0 +1,32 @@
# ResNet Multitoken IMF Design
**Status:** user-specified architecture, treated as approved on 2026-04-06.
## Goal
Keep a standard ResNet-18 visual trunk (no AttnRes in vision), but change IMF conditioning from one concatenated multiview token per obs step into three camera-specific condition tokens per obs step.
## Approved architecture
- Vision trunk: standard `resnet18` residual network
- Cameras: `front`, `top`, `r_vis`
- Each camera uses its **own** ResNet-18 weights (`use_separate_rgb_encoder_per_camera=true`)
- Each camera produces one visual token
- For each obs step and each camera:
1. take that camera visual token
2. concatenate robot state
3. project to one condition token
- IMF input should receive **3 condition tokens per obs step**, not one concatenated token
- With `obs_horizon=2`, IMF cond sequence length becomes `2 * 3 = 6`
- IMF head remains on the existing IMF/AttnRes implementation path
- Vision trunk remains standard ResNet; **no AttnRes vision replacement**
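The sequence-length arithmetic above, as a short sketch. Both helper names are hypothetical, and the camera-minor token order (all cameras for step 0, then step 1) is an assumption; this document does not fix the order.

```python
def condition_sequence_length(obs_horizon: int, camera_names) -> int:
    """One condition token per camera per obs step."""
    return obs_horizon * len(camera_names)


def token_index(step: int, cam: int, num_cams: int) -> int:
    """Flat position of (step, cam), assuming camera-minor ordering."""
    return step * num_cams + cam


cameras = ("front", "top", "r_vis")
seq_len = condition_sequence_length(obs_horizon=2, camera_names=cameras)
first_token_of_step_1 = token_index(step=1, cam=0, num_cams=len(cameras))
```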
## Design choices
- Extend `ResNetDiffusionBackbone` with an opt-in mode that returns per-camera tokens shaped `(B, T, num_cams, D)` instead of concatenating camera features into `(B, T, num_cams * D)`.
- Teach `VLAAgent` to detect multi-token visual features, broadcast state per camera token, apply the existing condition projector on each token, then flatten `(T, num_cams)` into one cond sequence for the IMF head.
- Keep `per_step_cond_dim` as the width of a single condition token, and add explicit token-count metadata so transformer heads get the correct cond-sequence length.
- For the new experiments, set the condition-token width equal to `n_emb` via `cond_projector.output_dim=${agent.head.n_emb}`.
## Files expected to change
- `roboimi/vla/models/backbones/resnet_diffusion.py`
- `roboimi/vla/agent.py`
- new Hydra agent config for the multitoken ResNet IMF variant
- focused tests in `tests/test_imf_vla_agent.py` and/or `tests/test_resnet_transformer_agent_wiring.py`

@@ -0,0 +1,41 @@
# SigLIP2 Multiview VLA Design
**Status:** user-specified architecture, treated as approved on 2026-04-06.
## Goal
Replace the current vision encoder for the IMF/AttnRes diffusion policy with a frozen SigLIP2 image encoder while preserving the downstream action-diffusion stack and rollout behavior.
## Approved architecture
- Backbone model: `google/siglip2-base-patch16-256`
- Camera inputs: three views, encoded **independently** with a **shared** SigLIP2 vision encoder
- Input size:
- dataset images stay at native `256x256` (no dataset-side resize)
- eval/rollout images resize to `256x256` before SigLIP2 because env renders are larger
- Per-view feature: use the global pooled image feature (`pooler_output`, 768-d)
- Per-view projection experiments:
1. `768 -> 96`
2. `768 -> 192`
- Conditioning pipeline:
1. concatenate 3 projected camera vectors
2. concatenate robot state
3. project concatenated condition to `384`
4. feed that `384`-d per-step condition into the existing IMF/AttnRes diffusion head
- Training/run defaults for requested experiments:
- `n_emb=384`
- `n_layer=12`
- `pred_horizon=16`
- `num_action_steps=8`
- rollout count for validation: keep current requested behavior on this branch unless explicitly overridden later
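A worked check of the conditioning widths for the two projection experiments. `siglip2_cond_dims` is a hypothetical helper, and `obs_dim=16` is an assumption borrowed from the 16-d state in the LEWM design; this document does not state the state width.

```python
def siglip2_cond_dims(per_view_output_dim: int, num_views: int = 3,
                      obs_dim: int = 16, projected_dim: int = 384):
    """Return raw and projected per-step condition widths."""
    raw = num_views * per_view_output_dim + obs_dim  # visual + state concat
    return {"raw_per_step_cond_dim": raw, "per_step_cond_dim": projected_dim}


dims_96 = siglip2_cond_dims(96)
dims_192 = siglip2_cond_dims(192)
```

Under these assumptions the condition projector maps 304 -> 384 in the first experiment and 592 -> 384 in the second, so the head interface stays at 384 in both.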
## Design decisions
- The condition projector lives in `VLAAgent._build_cond()` so the backbone owns only visual features, while the agent owns the final conditioning contract expected by the diffusion head.
- The SigLIP2 backbone is frozen by default; only the per-view projectors and downstream policy layers train.
- The backbone exposes `dataset_image_resize_shape=None` and `eval_image_resize_shape=(256, 256)` so existing train/eval plumbing can reuse the raw-256 path already added in this branch.
- One shared vision encoder is used across cameras to keep memory and download size reasonable and to match the user's request for per-view independent encoding rather than a fused multiview image.
## Files expected to change
- `roboimi/vla/models/backbones/` for the new SigLIP2 backbone
- `roboimi/vla/agent.py` for optional post-concat condition projection
- Hydra configs under `roboimi/vla/conf/{agent,backbone,modules}`
- tests for backbone wiring and agent conditioning dims
- remote launch commands/scripts only as needed for training