ICRA 2026

Multi-Modal Affordance Planner with Temporal-Context Action Policy for Long-Horizon Bimanual Robot Manipulation

Ji-Heon Oh, Danbi Jung, Ismael Espinoza, Yong-Hyeok Choi, YoungOuk Kim, Dongin Shin, JongSul Moon, Wonha Kim, Tae-Seong Kim

AI summary

MAP-TCA achieves an average success rate of 86.75% on long-horizon bimanual tasks without relying on extensive human demonstrations, comparable to a baseline TCA policy trained on direct human demonstrations and manually designed rewards.

Long-horizon manipulation, Bimanual robotics, Retrieval-augmented generation, Temporal context policy, Multimodal planning, Robot learning

Problem

Long-horizon bimanual manipulation requires robust planning and generalization, but existing methods suffer from heavy reliance on costly human demonstrations, LLM hallucinations, and poor temporal context understanding.

Approach

The framework uses a retrieval-augmented LLM to generate grounded plans, demonstrations, and rewards from multimodal affordances, which guide a transformer-based low-level policy trained via behavior cloning and online fine-tuning.
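The retrieval step of such a planner can be illustrated with a minimal sketch. This is not the authors' Bi-RAG implementation: the `PrimitiveAction` record, the three-dimensional toy embeddings, the action names, and the prompt template are all hypothetical, and plain cosine similarity stands in for whatever retrieval the paper actually uses. The sketch only shows the general idea of restricting the LLM's planning prompt to retrieved primitive actions, which is one common way to ground a plan.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PrimitiveAction:
    """Hypothetical library entry: a primitive action and a toy embedding."""
    name: str
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: List[float], library: List[PrimitiveAction], k: int = 2) -> List[PrimitiveAction]:
    """Return the k library entries most similar to the query embedding."""
    ranked = sorted(library, key=lambda pa: cosine(query, pa.embedding), reverse=True)
    return ranked[:k]


def build_prompt(task: str, retrieved: List[PrimitiveAction]) -> str:
    """Assemble a planning prompt that lists only the retrieved primitive actions."""
    names = ", ".join(pa.name for pa in retrieved)
    return f"Task: {task}\nAllowed primitive actions: {names}\nPlan:"


# Toy library and query; the values are made up for illustration only.
LIBRARY = [
    PrimitiveAction("reach", [1.0, 0.0, 0.0]),
    PrimitiveAction("grasp", [0.0, 1.0, 0.0]),
    PrimitiveAction("handover", [0.0, 0.0, 1.0]),
]
query = [0.9, 0.4, 0.0]

top = retrieve(query, LIBRARY, k=2)
prompt = build_prompt("pick up the cup", top)
print(prompt)
```

In a full system the prompt would be passed to an LLM, and the resulting plan, demonstrations, and rewards would be used to train the low-level policy; those stages are omitted here.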

Key results

86.75% average success rate across long-horizon bimanual tasks
Performs comparably to a TCA baseline trained on extensive human demonstrations and manually designed rewards, without requiring either
Mitigates planner hallucinations through Bi-RAG-enhanced multimodal grounding
Demonstrates successful sim-to-real transfer and generalization to unseen objects

Why it matters

Enables scalable, data-efficient bimanual manipulation for practical humanoid applications by drastically reducing dependency on human supervision.

Abstract

Bimanual robot manipulation for long-horizon (LH) tasks is crucial for the practical use of humanoids, but it struggles with robust planning and generalization. Approaches based on Task and Motion Planning (TAMP), transformers, and Large Language Models (LLMs) suffer from critical limitations, including costly human demonstrations, task planner hallucination, and unsatisfactory generalization performance. To address these challenges, this paper introduces the Multi-Modal Affordance Planner with Temporal-Context Action Policy (MAP-TCA), a novel hierarchical framework that learns and performs diverse bimanual LH tasks by generating action plans from MAP. MAP-TCA consists of a planner based on an LLM enhanced with Bimanual Robot Manipulation Retrieval-Augmented Generation (Bi-RAG) and a low-level Temporal Context Action Policy (TCA). With multimodal inputs including vision, language, and affordance for primitive action demonstration, Bi-RAG generates a Primitive Action (PA)-specific embedded space. Then, MAP generates LH plans, LH demonstrations, and reward functions within the PA-specific embedded space, thereby mitigating hallucinations and reducing training cost. The generated plans, demonstrations, and rewards then guide TCA, which learns the LH tasks via behavior cloning (BC) and online fine-tuning. We demonstrate that the proposed MAP-TCA achieves an average success rate of 86.75%, comparable to a baseline model, TCA, which is trained extensively on direct human demonstrations and manually designed rewards. Our work presents a scalable and generalizable solution for complex bimanual LH manipulation, significantly reducing the dependency on human supervision.

Index terms

Dual Arm Manipulation, Reinforcement Learning, Dexterous Manipulation