TARAD: Task-Aware Robot Affordance-Centric Diffusion Policy Learned from LLM-Generated Demonstrations
Site Hu, Takayuki Nagai, Takato Horii
AI summary
Problem
Traditional robot manipulation learning depends heavily on extensive expert demonstrations and predefined motion primitives, limiting generalization to unseen tasks. Existing foundation-model approaches often lack the fine-grained environmental understanding required for precise low-level control.
Approach
The framework uses LLMs and vision-language models to decompose language instructions into plans and extract spatial affordances from robot observations. A heuristic planner generates trajectories that are automatically filtered and used to train an affordance-conditioned 3D diffusion policy.
Key results
- Automatically generates language- and affordance-annotated datasets from natural language instructions alone
- Trains an affordance-centric 3D diffusion policy that matches state-of-the-art imitation learning performance without expert data
- Achieves strong zero-shot generalization to unseen objects, scenes, and camera views in simulation and real-world tests
- Eliminates reliance on predefined motion primitives while maintaining precise low-level control
Why it matters
Provides a scalable, demonstration-free pipeline for training general-purpose robot manipulation policies, bridging the gap between high-level foundation model reasoning and low-level precise control.
Abstract
In open-ended task settings, the ability of a robot to execute diverse tasks accurately by following language instructions is critical. Methods based on traditional imitation learning typically depend on extensive expert demonstrations and often struggle to generalize in the case of unseen scenarios or tasks. Recently, approaches leveraging large foundational models have demonstrated improved generalization by enhancing task comprehension in novel scenarios based on the intrinsic world knowledge embedded in these models. However, these methods rely on predefined motion primitives and lack a detailed understanding of the environment, which is essential for successful execution. Herein, we introduce Task-Aware Robot Affordance-Centric Diffusion Policy (TARAD), a novel framework for robot manipulation. TARAD leverages large language models and vision-language models to perform high-level planning from natural language instructions and extract affordance information from the robot’s observations. A heuristic motion planner is employed for low-level motion planning, enabling zero-shot trajectory synthesis and the fully automatic generation of a dataset with language labels and affordances. By incorporating affordances into the observation space, our approach integrates the intrinsic commonsense and reasoning capabilities of foundation models into imitation learning, enabling the training of an affordance-centric, multi-task three-dimensional (3D) diffusion policy. Empirical evaluations in both the RLBench simulated environments and real-world experiments with UR5e demonstrate that TARAD effectively combines the precise control of imitation learning with the strong generalization capabilities of foundation models, all without relying on expert demonstrations or predefined motion primitives.
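The dataset-construction step described above (keep only usable planner trajectories, then fold affordances into the observation) can be illustrated with a minimal sketch. This is not the authors' implementation. The `Episode` record, the per-point affordance score in [0, 1], the boolean `success` flag used as the filter criterion, and all function names are assumptions made for illustration; the abstract does not specify how affordances are represented or how trajectories are filtered.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Episode:
    """One planner rollout (hypothetical record layout)."""
    instruction: str          # natural language instruction for the task
    points: List[Point]       # scene point cloud, one (x, y, z) per point
    affordance: List[float]   # assumed per-point affordance score in [0, 1]
    actions: List[Point]      # end-effector waypoints from the motion planner
    success: bool             # assumed filter criterion: did the rollout succeed?


def filter_episodes(episodes: List[Episode]) -> List[Episode]:
    """Keep only rollouts flagged as successful."""
    return [ep for ep in episodes if ep.success]


def affordance_observation(ep: Episode) -> List[Tuple[float, float, float, float]]:
    """Append the affordance score to each point as a fourth channel."""
    if len(ep.points) != len(ep.affordance):
        raise ValueError("points and affordance must have the same length")
    return [(x, y, z, a) for (x, y, z), a in zip(ep.points, ep.affordance)]


def build_dataset(episodes: List[Episode]) -> List[Dict[str, object]]:
    """Turn filtered rollouts into language- and affordance-annotated samples."""
    return [
        {
            "instruction": ep.instruction,
            "observation": affordance_observation(ep),
            "actions": ep.actions,
        }
        for ep in filter_episodes(episodes)
    ]


if __name__ == "__main__":
    rollouts = [
        Episode("open the drawer", [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2)],
                [0.9, 0.1], [(0.0, 0.0, 0.1)], success=True),
        Episode("open the drawer", [(0.0, 0.0, 0.0)],
                [0.5], [(0.0, 0.0, 0.1)], success=False),
    ]
    dataset = build_dataset(rollouts)
    print(len(dataset), dataset[0]["observation"][0])
```

In this sketch each training sample pairs an instruction with an affordance-augmented point cloud and the planner's waypoints, which is the kind of input an affordance-conditioned policy could be trained on; the actual observation encoding in TARAD may differ.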