ICRA 2026

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task

Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel J. Szafir

AI summary

Observation Space Shift severely degrades visuomotor policy performance in long-horizon tasks, and common mitigation strategies fail to resolve it.

Observation Space Shift, Long-Horizon Tasks, Skill Chaining, Imitation Learning, Robot Learning Benchmarks, Policy Robustness

Problem

Hierarchical skill chaining for long-horizon robotic tasks frequently fails because executing earlier skills changes the visual observations that later, individually trained skill policies receive. This problem, termed Observation Space Shift (OSS), has lacked a dedicated evaluation framework for quantifying its impact or testing solutions.
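
As a rough illustration of the idea, the sketch below represents a scene as a set of boolean predicates and lists the ones that differ between the state a skill was trained in and the state it meets after a preceding skill has run. The predicate names, the skills, and the helper `shifted_predicates` are hypothetical and are not taken from BOSS; the benchmark itself works on visual observations rather than symbolic states.

```python
# Toy illustration of Observation Space Shift (OSS).
# All predicate names and the helper below are hypothetical, not part of BOSS.

def shifted_predicates(train_state: dict, chain_state: dict) -> dict:
    """Return predicates whose value at chaining time differs from training time.

    Each entry maps a predicate name to (value_at_training, value_at_chaining).
    """
    keys = train_state.keys() | chain_state.keys()
    return {
        k: (train_state.get(k), chain_state.get(k))
        for k in sorted(keys)
        if train_state.get(k) != chain_state.get(k)
    }

# Scene in which a hypothetical "pick up the bowl" skill was trained.
train_state = {"drawer_open": False, "stove_on": False, "bowl_on_table": True}

# Same scene after a hypothetical preceding "open the drawer" skill has run.
chain_state = dict(train_state, drawer_open=True)

shift = shifted_predicates(train_state, chain_state)
print(shift)  # the only changed predicate is drawer_open
```

The second skill's own objects are untouched, yet its camera view now contains an open drawer it never saw during training, which is the kind of task-irrelevant change OSS refers to.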

Approach

The authors introduce BOSS, a simulator-based benchmark with three progressive challenges to quantify OSS, and evaluate four imitation learning algorithms alongside three intuitive mitigation strategies.

Key results

Formulation and benchmarking of Observation Space Shift across three progressive challenges
Average performance drops of 34% to 67% across the four imitation learning algorithms under OSS, even on the simplest challenge
Demonstration that frozen vision encoders, 3D inputs, and data augmentation fail to mitigate OSS
Release of the Rule-based Automatic Modification Generator for scalable task variation

Why it matters

The work highlights a critical, overlooked failure mode in hierarchical robot learning that can undermine long-horizon task execution, and it urges the robotics community to develop more robust transitions between skills.

Abstract

Robotics has long sought to develop robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To understand OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: “Single Predicate Shift”, “Accumulated Predicate Shift”, and “Skill Chaining”, each designed to assess a different aspect of OSS’s negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate three potential solutions, including using frozen robotics-specific vision encoders, switching to 3D pointcloud-based inputs, and applying data augmentation to expand visual diversity. Our results show that none of these approaches are sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/
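
One way to read the "average performance drop" figures is as a relative drop in success rate, averaged over skills. The sketch below works under that assumption. The abstract does not state the exact formula, so the paper may define the metric differently (for example, as an absolute difference). The function names and the success rates are hypothetical placeholders, not results from BOSS.

```python
# Minimal sketch of an average relative performance drop.
# Assumption: "drop" means (baseline - shifted) / baseline per skill;
# the paper's exact definition is not given in the abstract.

def relative_drop(baseline: float, shifted: float) -> float:
    """Relative drop in success rate when moving from no-OSS to OSS."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (baseline - shifted) / baseline

def average_relative_drop(pairs: list[tuple[float, float]]) -> float:
    """Mean relative drop over (baseline, shifted) success-rate pairs."""
    if not pairs:
        raise ValueError("need at least one (baseline, shifted) pair")
    drops = [relative_drop(b, s) for b, s in pairs]
    return sum(drops) / len(drops)

# Hypothetical placeholder success rates for two skills (not paper data):
# (success rate without OSS, success rate with OSS)
pairs = [(0.8, 0.4), (1.0, 0.5)]

avg = average_relative_drop(pairs)
print(f"average relative drop: {avg:.0%}")
```

Under this reading, a skill that succeeds 80% of the time without OSS and 40% of the time with it has a 50% drop, even though the absolute difference is 40 percentage points.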

Index terms

Imitation Learning, AI-Based Methods, Data Sets for Robot Learning