# Camera Ablation Summary (`pred_horizon=16`, `num_action_steps=8`, ResNet IMF)
- Generated: 2026-04-05
- Common setup: original ResNet vision backbone, `n_emb=384`, `n_layer=12`, `batch_size=80`, `lr=2.5e-4`, `max_steps=50k`, rollout every 5 epochs with 5 episodes, headless eval.
- Metric for comparison: `checkpoints/vla_model_best.pt -> rollout_avg_reward`.
## Leaderboard
| Rank | Cameras | Best avg_reward | Best step | Final loss | Run name |
|---:|---|---:|---:|---:|---|
| 1 | `top + front` | **274.8** | 48124 | 0.0056 | `imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023` |
| 2 | `top` | **271.2** | 43749 | 0.0052 | `imf-resnet-top-1cam-ph16-ex08-emb384-l12-ms50k-l20g4-20260405-125844` |
| 3 | `r_vis + front` | **244.0** | 21874 | 0.0043 | `imf-resnet-frontrvis-2cam-ph16-ex08-emb384-l12-ms50k-l20g1-20260405-102029` |
| 4 | `r_vis` | **6.4** | 17499 | 0.0047 | `imf-resnet-rvis-1cam-ph16-ex08-emb384-l12-ms50k-l20g3-20260405-125844` |
| 5 | `r_vis + top` | **1.2** | 4374 | 0.0047 | `imf-resnet-rvistop-2cam-ph16-ex08-emb384-l12-ms50k-l20g2-20260405-125844` |
| 6 | `front` | **0.0** | 4374 | 0.0074 | `imf-resnet-front-1cam-ph16-ex08-emb384-l12-ms50k-l20g0-20260405-095607` |
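
The run names in the table appear to encode the configuration. The following is a minimal sketch of a parser for them, assuming a naming convention inferred only from the six names above: the field meanings (`ph` = `pred_horizon`, `ex` = `num_action_steps`, `emb` = `n_emb`, `l` = `n_layer`, `ms` = `max_steps`, followed by a host tag and a timestamp) are guesses, and `parse_run_name` is a hypothetical helper, not part of the codebase.

```python
import re

# Assumed naming convention, inferred from the run names in the leaderboard.
# The meaning attached to each field is a guess, not confirmed by the source.
RUN_NAME_RE = re.compile(
    r"^imf-resnet-(?P<cameras>[a-z]+)-(?P<num_cams>\d+)cam"
    r"-ph(?P<pred_horizon>\d+)-ex(?P<num_action_steps>\d+)"
    r"-emb(?P<n_emb>\d+)-l(?P<n_layer>\d+)-ms(?P<max_steps_k>\d+)k"
    r"-(?P<host>[a-z0-9]+)-(?P<date>\d{8})-(?P<time>\d{6})$"
)


def parse_run_name(name: str) -> dict:
    """Split a run name into its (assumed) configuration fields."""
    m = RUN_NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized run name: {name!r}")
    fields = m.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("num_cams", "pred_horizon", "num_action_steps", "n_emb", "n_layer"):
        fields[key] = int(fields[key])
    # "ms50k" is read as 50 * 1000 training steps.
    fields["max_steps"] = int(fields.pop("max_steps_k")) * 1000
    return fields


if __name__ == "__main__":
    print(parse_run_name(
        "imf-resnet-topfront-2cam-ph16-ex08-emb384-l12-ms50k-5090-20260405-085023"
    ))
```

Parsing the names this way makes it easy to check that all six runs share the same `pred_horizon`, `n_emb`, `n_layer`, and `max_steps`, so that only the camera set (and the host) differs between them.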
## Main takeaways
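1. **`top` is the most important single-camera view**: `top only = 271.2`, nearly level with `top + front = 274.8`.
2. **`front` alone is almost useless**: `front only = 0.0`.
3. **`r_vis` alone is also essentially ineffective**: `r_vis only = 6.4`.
4. **`r_vis + front` clearly outperforms `front` or `r_vis` alone**, which suggests the two views are somewhat complementary, but it is still clearly weaker than every configuration that includes `top` and trains normally.
5. **`r_vis + top` performs abnormally poorly**: only `1.2`, far below `top only = 271.2`. Simply adding `r_vis` therefore does not guarantee a gain and may even disrupt learning under the current setup.
6. **Training loss and rollout reward clearly disagree**: for example, `r_vis + top` and `r_vis only` both reach a low final loss (0.0047) yet have very poor reward, so model selection in this set of experiments must be based on rollout reward rather than loss.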
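
To make the loss/reward mismatch concrete, here is a small sketch that ranks the six runs once by best reward and once by final loss, using only the numbers from the leaderboard. The variable names are illustrative. Note that the two columns come from different points in training (best checkpoint vs. final step), so this is a rough comparison rather than a strict one.

```python
# (cameras, best avg_reward, final loss), copied from the leaderboard.
RUNS = [
    ("top + front",   274.8, 0.0056),
    ("top",           271.2, 0.0052),
    ("r_vis + front", 244.0, 0.0043),
    ("r_vis",           6.4, 0.0047),
    ("r_vis + top",     1.2, 0.0047),
    ("front",           0.0, 0.0074),
]

# Higher reward is better, lower loss is better.
by_reward = [name for name, _, _ in sorted(RUNS, key=lambda r: -r[1])]
by_loss = [name for name, _, _ in sorted(RUNS, key=lambda r: r[2])]

if __name__ == "__main__":
    print("ranked by reward:", by_reward)
    print("ranked by loss:  ", by_loss)
```

Ranking by loss puts `r_vis` and `r_vis + top` in the top three even though both have near-zero reward, and it places the best-reward run (`top + front`) second to last.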
## Horizontal comparison views
### Single-camera comparison
- `top`: **271.2**
- `r_vis`: **6.4**
- `front`: **0.0**
Conclusion: **`top >>> r_vis > front`**.
### Two-camera comparison
- `top + front`: **274.8**
- `r_vis + front`: **244.0**
- `r_vis + top`: **1.2**
Conclusions:
- **The safest two-camera combination is `top + front`**.
- `r_vis + front` works, but not as well as `top + front`.
- `r_vis + top` almost completely fails under the current setup.
### Incremental effect of adding a second view
- Adding `front` to `top`: `271.2 -> 274.8`, **a very small gain**.
- Adding `r_vis` to `front`: `0.0 -> 244.0`, **a very large gain**.
- Adding `r_vis` to `top`: `271.2 -> 1.2`, **a severe regression**.
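
The three deltas above can be extended to all six single-to-double-camera steps. The sketch below recomputes them from the leaderboard values; `added_view_deltas` is an illustrative helper, not part of the experiment code.

```python
# Best rollout_avg_reward per camera set, copied from the leaderboard.
REWARD = {
    frozenset({"top", "front"}): 274.8,
    frozenset({"top"}): 271.2,
    frozenset({"r_vis", "front"}): 244.0,
    frozenset({"r_vis"}): 6.4,
    frozenset({"r_vis", "top"}): 1.2,
    frozenset({"front"}): 0.0,
}


def added_view_deltas(reward):
    """Return {(base_view, added_view): reward change} for each 1-cam -> 2-cam step."""
    deltas = {}
    for cams, value in reward.items():
        if len(cams) != 2:
            continue
        for base in cams:
            (added,) = cams - {base}  # the one view that is not the base
            deltas[(base, added)] = round(value - reward[frozenset({base})], 1)
    return deltas


if __name__ == "__main__":
    for (base, added), delta in sorted(added_view_deltas(REWARD).items()):
        print(f"{base} + {added}: {delta:+.1f}")
```

Read this way, adding `top` helps `front` but not `r_vis`, and adding `r_vis` helps `front` but hurts `top`. The effect of a second view therefore depends strongly on which view it is paired with.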
## Practical recommendation
If choosing only among these 6 experiments:
- **First choice**: `top + front`
- **Second choice**: `top only`
- If `top` cannot be used: `r_vis + front` is clearly better than `front only` / `r_vis only`
- **Not recommended**: `r_vis + top`
## Note relative to previous 3-camera baseline
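The previous 3-camera run with `[r_vis, top, front]` reached a best reward of **610.8**.
The best result among these 6 camera ablations (`top + front = 274.8`) therefore indicates:
- In the current training batch, **removing any one view lands far below the previous 3-camera best**.
- When a view has to be dropped, however, **`top` remains the most important one to keep**.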
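
As a rough illustration of the size of that gap, the sketch below expresses each ablation result as a fraction of the 3-camera best. It uses only the numbers reported in this summary; `fraction_of_baseline` is an illustrative helper, and the comparison assumes the earlier baseline run is otherwise comparable to these runs.

```python
BASELINE_3CAM = 610.8  # previous best reward with [r_vis, top, front]

# Best rollout_avg_reward per camera set, copied from the leaderboard.
ABLATION = {
    "top + front": 274.8,
    "top": 271.2,
    "r_vis + front": 244.0,
    "r_vis": 6.4,
    "r_vis + top": 1.2,
    "front": 0.0,
}


def fraction_of_baseline(reward: float, baseline: float = BASELINE_3CAM) -> float:
    """Return the reward as a fraction of the 3-camera baseline."""
    return reward / baseline


if __name__ == "__main__":
    for name, reward in ABLATION.items():
        print(f"{name:14s} {reward:6.1f}  {fraction_of_baseline(reward):6.1%}")
```

Even the best ablation (`top + front`) retains only about 45% of the 3-camera reward.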