ICRA 2026

RoboSQ: Semantic Queries for Task-Aligned Robot Training Data

Kaiyuan Chen, Shuangyu Xie, Kush Hari, Andrew Goldberg, Kavish Kondap, Ken Goldberg

AI summary

Filtering robot demonstration data with VLM-based semantic queries significantly improves policy training success rates compared to using raw, uncurated datasets.
Robot data management; Semantic querying; Vision-Language Models; Visual Question Answering; Policy training; Dataset filtering

Problem

Manually inspecting and filtering large-scale, noisy robot demonstration datasets is prohibitively expensive, and existing data management systems lack the ability to perform semantic queries on raw visual and sensor data.

Approach

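RoboSQ samples temporally distributed frames from each robot trajectory, overlays projected sensor information, and builds structured Visual Question Answering prompts for a Vision-Language Model. Data loading, frame extraction, and VLM inference are pipelined, enabling natural language queries over heterogeneous sensor streams.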
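To make the frame sampling and prompt construction concrete, here is a minimal Python sketch. It assumes evenly spaced sampling and a plain-text prompt template; the names sample_frame_indices and VQAPrompt, and the prompt wording, are hypothetical illustrations and are not taken from the RoboSQ code.

```python
from dataclasses import dataclass
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return frame indices spread evenly over a trajectory.

    Assumption: "temporally distributed" is approximated by uniform spacing.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short trajectory: use every frame.
        return list(range(num_frames))
    if num_samples == 1:
        return [num_frames // 2]
    # Spread samples from the first frame to the last frame.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


@dataclass
class VQAPrompt:
    """Hypothetical container for one structured VQA query."""
    question: str
    choices: List[str]
    frame_indices: List[int]

    def render(self) -> str:
        # Describe the sampled frames, then constrain the answer format.
        frames = ", ".join(f"t={i}" for i in self.frame_indices)
        options = " / ".join(self.choices)
        return (
            f"The image shows {len(self.frame_indices)} frames from one robot "
            f"trajectory in temporal order ({frames}).\n"
            f"Question: {self.question}\n"
            f"Answer with exactly one of: {options}"
        )


indices = sample_frame_indices(num_frames=100, num_samples=4)
prompt = VQAPrompt(
    question="Does the robot perform a successful grasp?",
    choices=["yes", "no"],
    frame_indices=indices,
)
print(prompt.render())
```

Restricting the answer to a fixed set of choices keeps the model output easy to parse into a filter decision.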

Key results

  • Filters failed trajectories with 78% accuracy and 86% F1 score
  • Detects incorrect extrinsic camera-to-end-effector calibration at 86% accuracy and 88% F1 score
  • Achieves 13 out of 15 pick-and-place successes using filtered data versus 1 out of 15 on raw mixed data
  • Introduces a parallelized pipeline for efficient VLM inference on large-scale robot datasets
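As a reading aid for the accuracy and F1 figures above, the following Python sketch computes both metrics from confusion-matrix counts. The counts in the example are hypothetical and chosen only to show that an F1 score can exceed accuracy when the positive class dominates; they are not the paper's confusion matrix.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def success_rate(successes: int, trials: int) -> float:
    """Fraction of successful policy rollouts."""
    return successes / trials if trials else 0.0


# Hypothetical counts (NOT from the paper): positives are the majority class.
tp, fp, fn, tn = 70, 15, 7, 8
print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")
print(f"F1       = {f1_score(tp, fp, fn):.2f}")

# Policy results reported above: 13/15 with filtered data, 1/15 with raw data.
print(f"filtered = {success_rate(13, 15):.1%}")
print(f"raw      = {success_rate(1, 15):.1%}")
```

F1 ignores true negatives, so the two metrics can diverge on imbalanced data, which is why both are worth reporting.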

Why it matters

Enables researchers to efficiently curate high-quality, task-specific robot training data at scale without manual annotation, accelerating robust policy learning.

Abstract

Training robot policies often requires extracting appropriate subsets of data from large and noisy datasets. For example, one might want to extract only robot demonstrations with accurate captions or only those related to cooking. We present RoboSQ, a robot data management system that allows semantic queries. RoboSQ samples temporally distributed frames, overlays projected sensor information from robot trajectories, and constructs structured Visual Question Answering (VQA) prompts for Vision-Language Models (VLMs). RoboSQ efficiently handles queries by pipelining data loading, frame extraction, and VLM inference. We evaluate RoboSQ on the DROID dataset with three semantic queries: 1) failure detection, 2) calibration error detection, and 3) visual complexity scoring. It filters out the failure trajectories with 78% accuracy and 86% F1 score, and identifies the trajectories with incorrect extrinsic calibration between the camera frame and the end effector frame at 86% accuracy and 88% F1 score. We evaluate RoboSQ by training a pick-and-place Action Chunking Transformer policy with a UR5 robot arm on mixed-quality demonstration data. Data extracted by RoboSQ is closely aligned with the expert-curated data. A policy trained on RoboSQ-selected data achieves 13 successes out of 15 trials, compared to only 1 out of 15 when trained on the full mixed dataset. Code, video, and supplementary information can be found on the project website: https://berkeleyautomation.github.io/robosq/

I. INTRODUCTION

Vision-Language-Action models [1–3] require large-scale human-teleoperated demonstration trajectories, such as Open X-Embodiment [4] and the Distributed Robot Interaction Dataset (DROID) [5], curated from research institutions worldwide. At terabyte or even petabyte scale, which is sometimes characterized as Big Data [6], manually inspecting and filtering robot demonstrations becomes prohibitively expensive.
Emerging robot data management systems, such as RoboDM [7] and LeRobot [8], efficiently store the data and load them for policy training. However, these systems rely primarily on predefined metadata and do not support semantic querying, such as “does the robot perform a successful grasp?”, “is the extrinsic calibration correct?”, or “does the task occur in low-light conditions?”. This limits their ability to handle semi-structured, heterogeneous robot datasets where critical information is often embedded in raw visual or sensor data rather than explicit labels. To address this challenge, we present RoboSQ, a robot data management system that supports semantic queries: finding subsets of data that satisfy semantic conditions described in natural language.

1 Department of Electrical Engineering and Computer Science; 2 Department of Industrial Engineering and Operations Research; 1,2 University of California, Berkeley, CA, USA. ∗Equal Contribution. †For correspondence and questions: kych@berkeley.edu

Fig. 1: RoboSQ: A semantic query system with efficient frame selection and VLM to extract high-quality, task-specific data for effective robot learning.

Recent Vision-Language Models (VLMs) [9–15] are increasingly capable of performing spatial and semantic analysis of images, and semantic querying of image datasets. VLMs are trained on Internet-scale data, which enables generalizable understanding of complex and diverse contexts in robotics data, as tested and evaluated in recent efforts [16, 17], and can be further improved with more robotics data [18]. Visual Question Answering (VQA) is an interface to VLMs that structures language and images as input and allows users to pose complex visual and semantic queries. In robotics, this can be applied to data management tasks such as identifying failure cases, summarizing behaviors, or checking spatial relations without requiring manual annotations.
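A yes/no semantic query of this kind can be sketched as a filter over trajectories. The Python example below is a minimal illustration under stated assumptions: the trajectory dictionaries, the grasp_ok field, and stub_ask are hypothetical, and stub_ask stands in for a real VLM call so the example runs without a model.

```python
from typing import Callable, Dict, Iterable, List

Trajectory = Dict[str, object]
# An "ask" function maps (trajectory, question) to a free-text answer.
AskFn = Callable[[Trajectory, str], str]


def semantic_filter(
    trajectories: Iterable[Trajectory],
    question: str,
    ask: AskFn,
    keep_answer: str = "yes",
) -> List[Trajectory]:
    """Keep trajectories whose answer to the question starts with keep_answer."""
    kept = []
    for traj in trajectories:
        # Normalize the answer so "Yes." and "yes" are treated alike.
        answer = ask(traj, question).strip().lower()
        if answer.startswith(keep_answer.lower()):
            kept.append(traj)
    return kept


def stub_ask(traj: Trajectory, question: str) -> str:
    """Stand-in for a VLM: reads a hypothetical precomputed label."""
    return "Yes." if traj["grasp_ok"] else "No."


dataset = [
    {"id": "traj_a", "grasp_ok": True},
    {"id": "traj_b", "grasp_ok": False},
    {"id": "traj_c", "grasp_ok": True},
]
good = semantic_filter(dataset, "Does the robot perform a successful grasp?", stub_ask)
print([t["id"] for t in good])
```

In a real system the ask function would send sampled frames and the question to a VLM; keeping it as a parameter lets the filter logic be tested independently of any model.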
In this paper, we present RoboSQ, a semantic query system for robot data based on VLMs and VQA. RoboSQ converts multi-modal robot trajectory data into structured VQAs through sensor projection, image concatenation, and streaming. To handle the heterogeneous semantic query pipeline, RoboSQ organizes these modules and VQAs into a combination of data transformation primitives, such as sort and filter, whose output forms the semantic query response. The RoboSQ pipeline efficiently parallelizes frame extraction and VLM inference. We evaluate the semantic query capability of RoboSQ on DROID [5] with three case studies: (1) Given only one camera stream, RoboSQ filters out the trajectories that fail to complete the task. RoboSQ agrees with the metadata annotated by the original human teleoperators with an F1 score of 0.86; (2) RoboSQ identifies the many demonstrations with incorrect extrinsic calibration between the camera frame and the end effector frame; (3) RoboSQ ranks trajectories by semantic quality using a metric based on visual and task complexity. We integrate RoboSQ into a robot learning task where a UR5 arm is to pick up a stuffed animal and place

2026 IEEE International Conference on Robotics and Automation (ICRA 2026), June 1-5, 2026, Vienna, Austria. 979-8-3315-8160-2/26/$31.00 ©2026 IEEE
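The introduction describes the pipeline as overlapping frame extraction with VLM inference. One common way to achieve this is a producer-consumer pattern with a bounded queue, sketched below in Python. This is an assumption about how such a pipeline could be structured, not the RoboSQ implementation; run_pipeline, fake_extract, and fake_infer are hypothetical names, and the two fake stages stand in for real video decoding and model inference.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple

_DONE = object()  # Sentinel that marks the end of the stream.


def run_pipeline(
    trajectory_ids: Iterable[int],
    extract: Callable[[int], List[int]],
    infer: Callable[[List[int]], int],
    buffer_size: int = 4,
) -> List[Tuple[int, int]]:
    """Extract frames in a background thread while the caller runs inference."""
    frames_q: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for tid in trajectory_ids:
                # Blocks when the buffer is full, so extraction cannot run
                # arbitrarily far ahead of inference.
                frames_q.put((tid, extract(tid)))
        finally:
            # Always signal completion so the consumer never waits forever.
            frames_q.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = frames_q.get()
        if item is _DONE:
            break
        tid, frames = item
        results.append((tid, infer(frames)))
    worker.join()
    return results


def fake_extract(tid: int) -> List[int]:
    """Stand-in for frame extraction: returns three fake frame values."""
    return [tid * 10 + i for i in range(3)]


def fake_infer(frames: List[int]) -> int:
    """Stand-in for VLM inference: reduces the frames to one score."""
    return sum(frames)


print(run_pipeline([0, 1, 2], fake_extract, fake_infer))
```

With one producer and one consumer, results come back in input order. In Python, threads overlap usefully when the stages spend their time in I/O or in libraries that release the interpreter lock, which is typical for video decoding and GPU inference.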

Index terms

Big Data in Robotics and Automation; Data Sets for Robot Learning; Deep Learning in Grasping and Manipulation
