JiajunLI/roboimi

roboimi/docs/superpowers/plans/2026-04-05-phase2-full-attnres-vision-plan.md

Logic 2033169840 feat: add full attnres vision backbone

2026-04-05 00:07:59 +08:00

Phase-2 Full-AttnRes Vision Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace all ResNet residual units in the vision backbone with AttnRes-based image blocks while preserving the current IMF agent interfaces, then launch a Phase-2 experiment anchored on the best Phase-1 horizon setting.

Architecture: Keep the current multi-camera encoder shell and per-camera output contract, but introduce a new ResNet-like 2D AttnRes backbone that preserves stage-wise downsampling and final SpatialSoftmax conditioning. Wire it into the existing ResNetDiffusionBackbone via an opt-in mode and keep the agent/head/data interfaces unchanged.
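Preserving stage-wise downsampling matters for attention cost, because the number of tokens per stage is the feature-map area. The following is a minimal sketch of that arithmetic, assuming the standard ResNet stem (7x7 stride-2 convolution followed by a 3x3 stride-2 max-pool) and stage strides of 1, 2, 2, 2. The helper names and the 224x224 input size are illustrative assumptions, not values taken from this repository.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a conv or pool layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_token_counts(height: int, width: int, stage_strides=(1, 2, 2, 2)):
    """Return [(h, w, h * w), ...] for each stage of a ResNet-style layout.

    Assumption: standard ResNet stem (7x7 stride-2 conv with padding 3, then
    3x3 stride-2 max-pool with padding 1), and each stage downsamples with a
    3x3 convolution (padding 1) of the given stride.
    """
    h = conv_out(conv_out(height, 7, 2, 3), 3, 2, 1)
    w = conv_out(conv_out(width, 7, 2, 3), 3, 2, 1)
    counts = []
    for stride in stage_strides:
        h = conv_out(h, 3, stride, 1)
        w = conv_out(w, 3, stride, 1)
        counts.append((h, w, h * w))
    return counts


if __name__ == "__main__":
    # 224x224 is an assumed input size, used only for illustration.
    for h, w, tokens in stage_token_counts(224, 224):
        print(f"{h}x{w} feature map -> {tokens} tokens")
```

Under these assumptions the early stages carry thousands of tokens each, so full self-attention is most expensive before the later downsampling steps.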

Tech Stack: PyTorch, Hydra/OmegaConf, existing IMF AttnRes transformer components, pytest.

Task 1: Add failing tests for the new full-AttnRes visual backbone

Files:

- Create: tests/test_attnres_resnet2d_backbone.py
- Update: tests/test_imf_vla_agent.py

- [ ] Step 1: Write a failing backbone shape test
- [ ] Step 2: Run it to confirm the new backbone/config does not exist yet
- [ ] Step 3: Add a failing IMF agent wiring test asserting that cond_dim stays at 208
- [ ] Step 4: Run the targeted tests and capture the failure
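The shape test in Step 1 could take roughly the following form. This is only a sketch: `StandInEncoder`, `check_per_camera_contract`, the 96x96 image size, and `feature_dim=64` are hypothetical placeholders, since the real backbone class and its dimensions are not defined in this plan. In the actual test file, importing the not-yet-existing backbone is what would make the test fail first.

```python
import torch
from torch import nn


class StandInEncoder(nn.Module):
    """Hypothetical placeholder that satisfies the per-camera output contract."""

    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, feature_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)


def check_per_camera_contract(encoder: nn.Module, feature_dim: int) -> None:
    """Assert that (B, 3, H, W) images map to finite (B, feature_dim) features."""
    images = torch.zeros(2, 3, 96, 96)  # assumed image size, illustration only
    with torch.no_grad():
        features = encoder(images)
    assert features.shape == (2, feature_dim), features.shape
    assert torch.isfinite(features).all()


def test_backbone_output_shape():
    # Replace StandInEncoder with the new backbone class once it exists.
    check_per_camera_contract(StandInEncoder(feature_dim=64), feature_dim=64)
```

Keeping the contract check in a helper lets the same assertion run against both the original ResNet trunk and the new trunk.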

Task 2: Implement a ResNet-like 2D AttnRes backbone

Files:

- Create: roboimi/vla/models/backbones/attnres_resnet2d.py
- Modify: roboimi/vla/models/backbones/resnet_diffusion.py

- [ ] Step 1: Add minimal 2D tokenization helpers and positional encoding / bias handling
- [ ] Step 2: Implement AttnResImageBlock2D for feature maps
- [ ] Step 3: Implement AttnResResNetLikeBackbone2D with stage-wise downsampling
- [ ] Step 4: Wire _SingleRgbEncoder to choose between original ResNet trunk and the new full-AttnRes trunk
- [ ] Step 5: Run the new backbone tests
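Steps 1 to 3 can be pictured with the sketch below. It is not the repository's AttnRes block: it uses a generic pre-norm self-attention block from stock PyTorch as a stand-in, omits positional encoding and bias handling, and every class and function name (`to_tokens`, `to_feature_map`, `SketchAttnBlock2D`, `SketchStage`) is hypothetical. It only shows how a feature map can be tokenized, processed by attention, restored to 2D, and downsampled per stage.

```python
import torch
from torch import nn


def to_tokens(feature_map: torch.Tensor) -> torch.Tensor:
    """Flatten (B, C, H, W) into a token sequence (B, H*W, C)."""
    return feature_map.flatten(2).transpose(1, 2)


def to_feature_map(tokens: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Restore (B, H*W, C) tokens to a (B, C, H, W) feature map."""
    batch, _, channels = tokens.shape
    return tokens.transpose(1, 2).reshape(batch, channels, height, width)


class SketchAttnBlock2D(nn.Module):
    """Stand-in block: pre-norm self-attention plus MLP, each with a skip path."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 2 * channels),
            nn.GELU(),
            nn.Linear(2 * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, height, width = x.shape
        tokens = to_tokens(x)
        normed = self.norm1(tokens)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return to_feature_map(tokens, height, width)


class SketchStage(nn.Module):
    """One stage: strided 3x3 conv for downsampling, then `depth` blocks."""

    def __init__(self, in_ch: int, out_ch: int, stride: int, depth: int = 1,
                 num_heads: int = 4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.blocks = nn.Sequential(
            *[SketchAttnBlock2D(out_ch, num_heads) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(self.down(x))
```

Because each stage returns a `(B, C, H, W)` tensor, the final SpatialSoftmax conditioning can consume the output in the same way as the output of a convolutional trunk.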

Task 3: Expose config switches and agent wiring

Files:

- Modify: roboimi/vla/conf/backbone/resnet_diffusion.yaml
- Modify: roboimi/vla/conf/agent/resnet_imf_attnres.yaml

- [ ] Step 1: Add a backbone mode/config flag for the full-AttnRes vision trunk
- [ ] Step 2: Add defaults for attnres image depth/heads/etc. if needed
- [ ] Step 3: Add a Phase-2 launch override path that enables the new visual trunk
- [ ] Step 4: Run agent wiring tests again

Task 4: Smoke-verify training path

Files:

- Reuse existing training scripts and configs

- [ ] Step 1: Run a short CPU or tiny-step smoke instantiation / compute_loss test
- [ ] Step 2: If needed, run a very short training smoke launch
- [ ] Step 3: Verify no cond-dim or rollout-loading regressions
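A CPU smoke check in the spirit of Step 1 could be sketched as follows. The helper `smoke_step`, the tiny model, and the batch keys are hypothetical stand-ins for the agent and its `compute_loss` path; the input width of 208 only mirrors the `cond_dim` mentioned in Task 1 and is not meant to reproduce the real data layout.

```python
import torch
from torch import nn


def smoke_step(model: nn.Module, batch: dict, loss_fn) -> float:
    """Run one forward/backward pass and check loss and gradients are finite."""
    model.train()
    model.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch["obs"]), batch["target"])
    assert torch.isfinite(loss), "loss is not finite"
    loss.backward()
    for name, param in model.named_parameters():
        assert param.grad is not None, f"no gradient for {name}"
        assert torch.isfinite(param.grad).all(), f"non-finite gradient for {name}"
    return loss.item()


torch.manual_seed(0)
# Stand-in model and batch; replace with the real agent and a real sample.
tiny_model = nn.Sequential(nn.Linear(208, 32), nn.ReLU(), nn.Linear(32, 8))
tiny_batch = {"obs": torch.randn(4, 208), "target": torch.randn(4, 8)}
loss_value = smoke_step(tiny_model, tiny_batch, nn.functional.mse_loss)
```

A single step like this catches shape mismatches and missing gradients before any longer training launch is attempted.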

Task 5: Launch the Phase-2 experiment

Files:

- Update experiment tracking under experiment_suites/

- [ ] Step 1: Use the best Phase-1 setting (pred_horizon=16, num_action_steps=8)
- [ ] Step 2: Launch a baseline reference run or reuse the existing result
- [ ] Step 3: Launch the full-AttnRes vision experiment
- [ ] Step 4: Track rollout metrics and compare max avg_reward
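The comparison in Step 4 can be sketched as below, assuming each run is logged as a list of records with a `step` and an `avg_reward` field. That record layout, the function names, and the run names are assumptions for illustration, and the numbers are placeholders rather than measurements.

```python
def max_avg_reward(rollouts: list) -> tuple:
    """Return (step, avg_reward) of the best rollout evaluation in one run."""
    if not rollouts:
        raise ValueError("run has no rollout records")
    best = max(rollouts, key=lambda record: record["avg_reward"])
    return best["step"], best["avg_reward"]


def compare_runs(runs: dict) -> list:
    """Return [(run_name, step, avg_reward), ...] sorted best first."""
    summary = [(name, *max_avg_reward(records)) for name, records in runs.items()]
    return sorted(summary, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Placeholder numbers, not experiment results.
    example_runs = {
        "baseline": [{"step": 1000, "avg_reward": 0.0}, {"step": 2000, "avg_reward": 1.0}],
        "full_attnres": [{"step": 1000, "avg_reward": 0.5}],
    }
    for name, step, reward in compare_runs(example_runs):
        print(f"{name}: max avg_reward={reward} at step {step}")
```

Reporting the step alongside the maximum makes it easier to see whether the two runs peak at comparable points in training.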
