ICRA 2026

MotIF: Motion Instruction Fine-Tuning

Minyoung Hwang, Donald Hejna, Dorsa Sadigh, Yonatan Bisk

AI summary

Fine-tuning vision-language models with abstract trajectory visualizations enables accurate evaluation of nuanced robotic motions, significantly outperforming state-of-the-art models.

Motion Understanding, Vision-Language Models, Success Detection, Robot Fine-tuning, Trajectory Visualization, MotIF-1K

Problem

Existing robotic success detectors rely solely on initial and final states, ignoring the critical 'how' of task execution, while current vision-language models fail to evaluate full trajectories due to single-frame limitations and a lack of robot-specific training data.

Approach

The authors propose MotIF, which fine-tunes vision-language models by overlaying abstract keypoint trajectories onto final frames, enabling the model to evaluate whether a robot's full motion aligns with task and motion instructions.
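The overlay step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `overlay_trajectory`, the `radius` parameter, and the blue-to-red color coding are assumptions made for illustration, and the keypoints are assumed to be already-tracked 2D pixel coordinates.

```python
import numpy as np

def overlay_trajectory(final_frame, keypoints, radius=1):
    """Draw a 2D keypoint path on a copy of the final frame (illustrative only).

    final_frame: (H, W, 3) uint8 RGB image.
    keypoints:   (T, 2) sequence of (x, y) pixel coordinates, ordered in time.
    Color runs from blue (start) to red (end) so temporal order stays visible.
    """
    frame = final_frame.copy()
    height, width = frame.shape[:2]
    pts = np.asarray(keypoints, dtype=float)
    last = len(pts) - 1
    for i in range(len(pts)):
        start = pts[i]
        end = pts[min(i + 1, last)]  # the last point is drawn as a single dot
        # One sample per pixel along the longer axis of the segment.
        steps = int(max(abs(end - start).max(), 1)) + 1
        t = i / max(last, 1)  # normalized time in [0, 1]
        color = np.array([255 * t, 0, 255 * (1 - t)]).astype(np.uint8)
        for s in np.linspace(0.0, 1.0, steps):
            x, y = np.rint(start + s * (end - start)).astype(int)
            if not (0 <= x < width and 0 <= y < height):
                continue  # skip samples that fall outside the image
            x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
            y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
            frame[y0:y1, x0:x1] = color
    return frame

# Hypothetical usage: a blank frame and an L-shaped path.
blank = np.zeros((20, 20, 3), dtype=np.uint8)
annotated = overlay_trajectory(blank, [(2, 2), (10, 2), (10, 10)])
```

The annotated image, together with the task and motion instructions, would then be the VLM input. Encoding time as color is one plausible way to keep direction of motion readable in a single frame.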

Key results

Introduction of the MotIF-1K dataset with 653 human and 369 robot demonstrations (1,022 total) across 13 task categories
Achieves at least twice the F1 score of state-of-the-art API-based single-frame VLMs and video LMs, with high precision and recall
Demonstrates strong generalization to unseen motions, tasks, and environments
Successfully ranks real robot trajectories by alignment with task descriptions, outperforming baselines by over 20% win rate
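For reference, the metrics named above follow the standard binary-classification definitions. The sketch below assumes a success detector that outputs one boolean per trajectory; the function name and the toy predictions are illustrative and are not results from the paper.

```python
def precision_recall_f1(predictions, labels):
    """Standard precision, recall, and F1 for binary success detection."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)      # predicted success, truly success
    fp = sum(1 for p, y in pairs if p and not y)  # predicted success, truly failure
    fn = sum(1 for p, y in pairs if not p and y)  # predicted failure, truly success
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example (made-up values): 2 true positives, 1 false positive, 1 false negative.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Because F1 is the harmonic mean of precision and recall, a detector cannot reach a high F1 by accepting or rejecting every trajectory.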

Why it matters

It enables robots to be evaluated on nuanced, context-aware motions rather than just endpoints, which is critical for safe human-robot interaction and complex task execution.

Abstract

While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state – e.g., if an apple is picked up – many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs often use single frames, and thus cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an input of multiple frames, they still fail to correctly detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information such as the path the robot takes by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot’s behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories with motion descriptions. MotIF assesses the success of robot motion given task and motion instructions. Our model significantly outperforms state-of-the-art API-based single-frame VLMs and video LMs by at least twice in F1 score with high precision and recall, generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in ranking trajectories on how they align with task and motion descriptions. Dataset, code, and checkpoints are in https://motif-1k.github.io/

Index terms

Intention Recognition; Data Sets for Robot Learning; Semantic Scene Understanding