ICRA 2026

Cross-Modal Instructions for Robot Motion Generation

William Baron, Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi

AI summary

CrossInstruct enables robots to generate executable 3D motion trajectories from simple human sketches and text without requiring physical demonstrations or extensive fine-tuning.

cross-modal instructions, robot motion generation, vision-language models, sketch-based control, reinforcement learning, imitation learning

Problem

Teaching robots novel behaviors traditionally relies on cumbersome physical demonstrations like teleoperation or kinesthetic teaching, which are difficult to scale and generalize across changing environments.

Approach

The CrossInstruct framework integrates human sketches and text as in-context examples for a large vision-language model, which collaborates with a fine-tuned pointing model to localize keypoints and fuse multi-view 2D trajectories into coherent 3D motion via raycasting.
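
The fusion step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes calibrated pinhole cameras with known intrinsics `K` and extrinsics `R`, `t` (convention `x_cam = R @ x_world + t`), casts one ray per view through each 2D waypoint, and takes the least-squares point closest to all rays. This yields a single point estimate per waypoint, whereas the paper describes a distribution over 3D trajectories. All function names and camera values here are hypothetical.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates (pinhole model)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def pixel_to_ray(K, R, t, uv):
    """Cast a world-space ray (origin, unit direction) through pixel uv."""
    origin = -R.T @ t  # camera centre in world coordinates
    d_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    d_world = R.T @ d_cam
    return origin, d_world / np.linalg.norm(d_world)

def closest_point_to_rays(origins, directions):
    """Point minimising the summed squared distance to all ray lines."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

def fuse_views(cameras, tracks_2d):
    """Fuse per-view 2D waypoint tracks into one 3D trajectory.

    cameras:   list of (K, R, t) tuples, one per view
    tracks_2d: list of (T, 2) arrays with matching waypoint order
    """
    fused = []
    for i in range(len(tracks_2d[0])):
        rays = [pixel_to_ray(K, R, t, track[i])
                for (K, R, t), track in zip(cameras, tracks_2d)]
        origins, directions = zip(*rays)
        fused.append(closest_point_to_rays(origins, directions))
    return np.array(fused)

# Hypothetical two-camera setup: a front view and a side view.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
cam_front = (K, np.eye(3), np.zeros(3))
R_side = np.array([[0.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [-1.0, 0.0, 0.0]])  # optical axis along world -x
cam_side = (K, R_side, -R_side @ np.array([2.0, 0.0, 2.0]))
cameras = [cam_front, cam_side]

# Synthetic ground-truth motion, projected into each view to mimic 2D tracks.
true_traj = np.array([[0.0, 0.0, 2.0],
                      [0.1, 0.05, 2.1],
                      [0.2, 0.1, 2.2]])
tracks_2d = [np.array([project(*cam, X) for X in true_traj]) for cam in cameras]

recovered = fuse_views(cameras, tracks_2d)
print(np.allclose(recovered, true_traj))
```

With noisy 2D tracks the rays no longer intersect exactly, and the least-squares point then acts as a compromise between views; the spread of the rays around it is one possible way to express uncertainty.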

Key results

Introduces a Learning from Cross-Modal Instructions paradigm that replaces physical demonstrations with free-form sketches and text
Develops CrossInstruct, a hierarchical VLM framework that fuses multi-view 2D sketches into executable 3D robot trajectories
Demonstrates robust out-of-the-box success on RLBench simulation and real-world hardware without additional fine-tuning
Shows CrossInstruct outputs effectively warm-start reinforcement learning, accelerating policy convergence and improving task precision

Why it matters

It offers a scalable, low-effort alternative to traditional imitation learning for rapidly teaching and refining complex robot manipulation skills across diverse environments.

Abstract

Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision–language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are then fused into a coherent distribution over 3D motion trajectories in the robot’s workspace. By combining the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environment of the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.

∗ Email: Weiming.Zhi@sydney.edu.au
1 College of Connected Computing, Vanderbilt University, TN, USA
2 Robotics Institute, Carnegie Mellon University, PA, USA
3 School of Computer Science, The University of Sydney, Australia
4 Australian Centre for Robotics, The University of Sydney, Australia

Index terms

Learning from Demonstration; Big Data in Robotics and Automation