ICRA 2026

TARAD: Task-Aware Robot Affordance-Centric Diffusion Policy Learned from LLM-Generated Demonstrations

Site Hu, Takayuki Nagai, Takato Horii

AI summary

TARAD trains robots to perform diverse manipulation tasks from scratch using only natural language instructions and foundation models, matching expert-demonstration performance without predefined motion primitives.

Robot manipulation, Diffusion policy, Affordance learning, LLMs, Zero-shot generalization, Imitation learning

Problem

Traditional robot manipulation learning depends heavily on extensive expert demonstrations and predefined motion primitives, limiting generalization to unseen tasks. Existing foundation-model approaches often lack the fine-grained environmental understanding required for precise low-level control.

Approach

The framework uses LLMs and vision-language models to decompose language instructions into plans and extract spatial affordances from robot observations. A heuristic planner generates trajectories that are automatically filtered and used to train an affordance-conditioned 3D diffusion policy.
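
The data-generation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Demo`, `plan_linear`, and `generate_dataset` are hypothetical, the LLM and vision-language model are replaced by plain callables, and straight-line interpolation stands in for the heuristic planner, whose actual design is not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Demo:
    instruction: str         # language label for one sub-task
    affordance: Point        # 3D target point (stand-in for a VLM output)
    trajectory: List[Point]  # end-effector waypoints


def plan_linear(start: Point, goal: Point, steps: int = 5) -> List[Point]:
    """Hypothetical heuristic planner: straight-line interpolation."""
    return [
        tuple(s + (g - s) * i / steps for s, g in zip(start, goal))
        for i in range(steps + 1)
    ]


def generate_dataset(
    instruction: str,
    start: Point,
    decompose: Callable[[str], Sequence[str]],
    locate: Callable[[str], Point],
    succeeded: Callable[[List[Point], Point], bool],
) -> List[Demo]:
    """Decompose, locate affordances, plan, and keep only successful steps."""
    demos: List[Demo] = []
    pose = start
    for step in decompose(instruction):    # stand-in for LLM planning
        target = locate(step)              # stand-in for VLM affordance extraction
        trajectory = plan_linear(pose, target)
        if succeeded(trajectory, target):  # automatic filtering
            demos.append(Demo(step, target, trajectory))
            pose = trajectory[-1]
    return demos


# Toy stand-ins for the foundation models (assumptions for illustration only).
toy_objects = {
    "reach the cup": (0.4, 0.0, 0.2),
    "move to the shelf": (0.1, 0.3, 0.5),
}


def toy_decompose(text: str) -> List[str]:
    return [part.strip() for part in text.split(" then ")]


def toy_locate(step: str) -> Point:
    return toy_objects[step]


def reached(trajectory: List[Point], target: Point, tol: float = 1e-6) -> bool:
    return all(abs(a - b) <= tol for a, b in zip(trajectory[-1], target))


dataset = generate_dataset(
    "reach the cup then move to the shelf",
    (0.0, 0.0, 0.0),
    toy_decompose,
    toy_locate,
    reached,
)
```

Each retained `Demo` carries both a language label and an affordance, which is the kind of annotated record the summary describes as training data for the policy.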

Key results

Automatically generates language- and affordance-annotated datasets from natural language instructions alone
Trains an affordance-centric 3D diffusion policy that matches state-of-the-art imitation learning performance without expert data
Achieves strong zero-shot generalization to unseen objects, scenes, and camera views in simulation and real-world tests
Eliminates reliance on predefined motion primitives while maintaining precise low-level control

Why it matters

Provides a scalable, demonstration-free pipeline for training general-purpose robot manipulation policies, bridging the gap between high-level foundation model reasoning and low-level precise control.

Abstract

In open-ended task settings, the ability of a robot to execute diverse tasks accurately by following language instructions is critical. Methods based on traditional imitation learning typically depend on extensive expert demonstrations and often struggle to generalize in the case of unseen scenarios or tasks. Recently, approaches leveraging large foundational models have demonstrated improved generalization by enhancing task comprehension in novel scenarios based on the intrinsic world knowledge embedded in these models. However, these methods rely on predefined motion primitives and lack a detailed understanding of the environment, which is essential for successful execution. Herein we introduce Task-Aware Robot Affordance-Centric Diffusion Policy (TARAD), a novel framework for robot manipulation. TARAD leverages large language models and vision-language models to perform high-level planning from natural language instructions and extract affordance information from the robot’s observations. A heuristic motion planner is employed for low-level motion planning, enabling zero-shot trajectory synthesis and the fully automatic generation of a dataset with language labels and affordances. By incorporating affordances into the observation space, our approach integrates the intrinsic commonsense and reasoning capabilities of foundation models into imitation learning, enabling the training of an affordance-centric, multi-task three-dimensional (3D) diffusion policy. Empirical evaluations in both the RLBench simulated environments and real-world experiments with UR5e demonstrate that TARAD effectively combines the precise control of imitation learning with the strong generalization capabilities of foundation models, all without relying on expert demonstrations or predefined motion primitives.
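
One way to picture "incorporating affordances into the observation space" is to attach a per-point affordance score to a 3D point cloud before it is passed to the policy. The sketch below is an assumption made for illustration: the abstract does not specify the encoding, and the function name `add_affordance_channel`, the Gaussian weighting, and the `sigma` value are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
PointWithScore = Tuple[float, float, float, float]


def add_affordance_channel(
    points: Sequence[Point],
    affordance: Point,
    sigma: float = 0.05,
) -> List[PointWithScore]:
    """Append a score in (0, 1] that decays with distance to the affordance.

    Assumed encoding: a Gaussian of the Euclidean distance, with `sigma`
    in the same units as the point coordinates (for example, metres).
    """
    augmented = []
    for x, y, z in points:
        dist_sq = sum((a - b) ** 2 for a, b in zip((x, y, z), affordance))
        score = math.exp(-dist_sq / (2.0 * sigma ** 2))
        augmented.append((x, y, z, score))
    return augmented


# Toy cloud: one point on the affordance and one far from it.
cloud = [(0.4, 0.0, 0.2), (0.0, 0.0, 0.0)]
observation = add_affordance_channel(cloud, affordance=(0.4, 0.0, 0.2))
```

Under this assumed encoding, points near the affordance receive scores close to 1 and distant points receive scores close to 0, so a policy conditioned on the augmented cloud can tell which region is relevant to the task.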

Index terms

AI-Enabled Robotics; Learning from Demonstration; Manipulation Planning