BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel J. Szafir
AI summary
Problem
Hierarchical skill chaining for long-horizon robotic tasks frequently fails because preceding skills alter the visual observations that subsequent skill policies receive, degrading their performance. This problem, termed Observation Space Shift (OSS), has lacked a dedicated benchmark for quantifying its impact or testing solutions.
Approach
The authors introduce BOSS, a simulator-based benchmark with three progressive challenges to quantify OSS, and evaluate four imitation learning algorithms alongside three intuitive mitigation strategies.
Key results
- Formulation and benchmarking of Observation Space Shift across three progressive challenges
- Average performance drops of 34% to 67% across four imitation learning algorithms under OSS
- Demonstration that frozen robotics-specific vision encoders, 3D point-cloud inputs, and data augmentation are each insufficient to resolve OSS
- Release of the Rule-based Automatic Modification Generator for scalable task variation
Why it matters
BOSS highlights a critical, overlooked failure mode in hierarchical robot learning that undermines long-horizon task execution, and it urges the robotics community to develop skill policies that remain robust across skill transitions.
Abstract
Robotics has long sought to develop robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To understand OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: “Single Predicate Shift”, “Accumulated Predicate Shift”, and “Skill Chaining”, each designed to assess a different aspect of OSS’s negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate three potential solutions, including using frozen robotics-specific vision encoders, switching to 3D pointcloud-based inputs, and applying data augmentation to expand visual diversity. Our results show that none of these approaches are sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/
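The abstract reports average performance drops but does not state how a drop is computed. One plausible reading, assumed here and not confirmed by the abstract, is a relative drop in success rate for each skill, averaged over skills. The sketch below illustrates that reading only; the function names and the success rates are hypothetical and are not taken from BOSS.

```python
from typing import Iterable, Tuple


def relative_drop(success_without_oss: float, success_with_oss: float) -> float:
    """Relative drop in success rate (assumed definition, not from the paper)."""
    if success_without_oss <= 0:
        raise ValueError("baseline success rate must be positive")
    return (success_without_oss - success_with_oss) / success_without_oss


def average_relative_drop(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean relative drop over (without OSS, with OSS) success-rate pairs."""
    drops = [relative_drop(baseline, shifted) for baseline, shifted in pairs]
    if not drops:
        raise ValueError("at least one skill is required")
    return sum(drops) / len(drops)


# Hypothetical success rates for three skills, for illustration only.
example_pairs = [(0.8, 0.4), (0.5, 0.25), (1.0, 0.5)]
example_average = average_relative_drop(example_pairs)
```

Under this definition, a skill that succeeds 80% of the time without OSS and 40% of the time with it has a relative drop of one half. If the reported figures are instead absolute differences in success rate, the division by the baseline would be omitted.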