Reward-Free Continual Adaptation for Resilient Space Robots
Andrej Orsula, Miguel A. Olivares-Mendez, Carol Martinez
AI summary
Problem
Hardware degradation in space environments catastrophically breaks pre-trained control policies, but continual reinforcement learning cannot be deployed because precise reward computation is often impossible without external tracking or privileged simulation states.
Approach
The framework freezes the observation encoder and reward predictor of a pre-trained world model while updating only its transition dynamics through unsupervised environmental rollouts, allowing the agent to adapt its policy using purely synthetic trajectories.
Key results
- Rapid initial policy recovery across planetary traversal, orbital navigation, and precision assembly tasks
- Successful adaptation to severe morphological failures without external reward signals
- Late-stage performance decay caused by representation drift in continuously updated dynamics
- Validation that pre-trained latent reward landscapes generalize sufficiently for short-term autonomous recovery
Why it matters
Enables long-duration space missions to maintain operational capability after hardware failures without relying on impractical onboard reward computation or extensive retraining.
Abstract
Space robots operate in extreme environments where hardware degradation can critically compromise traditional control strategies. While continual reinforcement learning offers a promising mechanism for online adaptation, it inherently requires access to a reward signal during deployment. However, precise reward computation in space is often infeasible due to the lack of external tracking systems and the overall complexity of the environment. To address the challenge of unobservable rewards, we introduce a reward-free continual learning framework that leverages latent-state world models. By pre-training a model-based agent across diverse simulations, the world model learns a robust predictor of the reward structure within its latent space. Upon deployment to an environment with severe hardware degradation, we freeze the observation encoder and reward predictor to update only the transition dynamics of the world model through unsupervised rollouts. By training the policy entirely on imagined trajectories generated by this updated world model, the agent adapts to altered dynamics without receiving new rewards. We demonstrate our approach across simulated planetary traversal, orbital navigation, and precision assembly tasks subjected to severe morphological failures. The source code is available at github.com/AndrejOrsula/space_robotics_bench.
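
The adaptation loop described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation and does not use the space_robotics_bench API. Everything in it is an assumption made for illustration: the linear world model, the names `encoder`, `reward_w`, `A`, `B` and `K`, the simulated actuator failure, and the random-search policy update, which stands in for whatever policy optimizer the paper actually uses. It shows the three ingredients only: a frozen encoder and reward head, a dynamics model refit on reward-free transitions, and a policy improved purely on imagined rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, ACT = 4, 2

# Hypothetical "real" system: one actuator fails after deployment.
F = 0.9 * np.eye(DIM)
G_nominal = rng.normal(scale=0.5, size=(DIM, ACT))
G_degraded = G_nominal.copy()
G_degraded[:, 1] = 0.0  # assumed failure: second actuator has no effect

# Hypothetical pre-trained world model (linear for brevity).
encoder = np.eye(DIM) + 0.1 * rng.normal(size=(DIM, DIM))  # frozen
reward_w = rng.normal(size=DIM)                            # frozen reward head
A = encoder @ F @ np.linalg.inv(encoder)  # trainable latent transition
B = encoder @ G_nominal                   # trainable action effect

def collect_transitions(n):
    """Reward-free rollouts on the degraded system: (obs, act, next_obs) only."""
    obs = rng.normal(size=(n, DIM))
    act = rng.uniform(-1.0, 1.0, size=(n, ACT))
    next_obs = obs @ F.T + act @ G_degraded.T
    return obs, act, next_obs

def dynamics_loss_and_grads(obs, act, next_obs):
    """Latent one-step prediction error; the encoder is used but never updated."""
    z, z_next = obs @ encoder.T, next_obs @ encoder.T
    err = z @ A.T + act @ B.T - z_next
    loss = float(np.mean(np.sum(err ** 2, axis=1)))
    n = len(obs)
    return loss, 2.0 * err.T @ z / n, 2.0 * err.T @ act / n

# Step 1: update only the transition dynamics (A, B) by gradient descent.
obs, act, next_obs = collect_transitions(512)
loss_before, _, _ = dynamics_loss_and_grads(obs, act, next_obs)
for _ in range(300):
    _, grad_A, grad_B = dynamics_loss_and_grads(obs, act, next_obs)
    A -= 0.1 * grad_A
    B -= 0.1 * grad_B
loss_after, _, _ = dynamics_loss_and_grads(obs, act, next_obs)

def imagined_return(K, z0, horizon=10):
    """Roll the policy a = tanh(K z) through the adapted model, scoring each
    imagined latent state with the frozen reward head."""
    z, total = z0, np.zeros(len(z0))
    for _ in range(horizon):
        a = np.tanh(z @ K.T)
        z = z @ A.T + a @ B.T
        total += z @ reward_w
    return float(np.mean(total))

# Step 2: improve the policy on imagined trajectories only (random search).
z0 = obs[:64] @ encoder.T
K = np.zeros((ACT, DIM))
initial_return = best_return = imagined_return(K, z0)
for _ in range(200):
    candidate = K + 0.1 * rng.normal(size=K.shape)
    r = imagined_return(candidate, z0)
    if r > best_return:
        K, best_return = candidate, r

print(f"dynamics loss: {loss_before:.4f} -> {loss_after:.2e}")
print(f"imagined return: {initial_return:.3f} -> {best_return:.3f}")
```

No reward is ever computed from the degraded system in this sketch: the only reward values come from the frozen `reward_w` applied to latent states predicted by the refit dynamics.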