ICRA 2026

TADPO: Reinforcement Learning Goes Off-Road

Zhouchonghao Wu, Raymond Song, Vedant Mundheda, Luis E. Navarro-Serment, Christof Schoenborn, Jeff Schneider

AI summary

TADPO enables robust, high-speed off-road driving by combining teacher demonstrations with on-policy RL, achieving zero-shot sim-to-real transfer on a full-scale vehicle.

Reinforcement Learning, Off-road Driving, Sim-to-Real Transfer, Policy Gradient, Teacher-Student Distillation, Autonomous Navigation

Problem

Standard reinforcement learning struggles with off-road autonomous driving due to long-horizon planning, low-signal rewards, complex terrain dynamics, and inefficient exploration.

Approach

TADPO extends Proximal Policy Optimization (PPO) to concurrently learn from fixed expert demonstrations and on-policy student rollouts, using a clipped teacher-student distillation loss to guide exploration while maintaining independent value estimation.
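The summary above does not give TADPO's loss, so the following is only a minimal sketch of the general idea, not the authors' formulation. It pairs the standard PPO clipped surrogate on student rollouts with a hypothetical clipped term on teacher actions. The names `teacher_term`, `combined_loss`, and `teacher_weight`, and the fixed pseudo-advantage of +1 for teacher actions, are assumptions made for illustration. Advantages are taken as given, so value estimation is not shown.

```python
import math

def clipped_ratio_term(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def teacher_term(logp_new, logp_old, eps=0.2):
    """Hypothetical distillation term for one teacher action.

    Assumption: the teacher action is treated as if it had a fixed
    advantage of +1, so the student is pushed toward it only until the
    probability ratio reaches 1 + eps. This is NOT taken from the paper.
    """
    return clipped_ratio_term(logp_new, logp_old, 1.0, eps)

def combined_loss(student_batch, teacher_batch, teacher_weight=1.0, eps=0.2):
    """Loss to minimize over two batches.

    student_batch: list of (logp_new, logp_old, advantage) from on-policy rollouts.
    teacher_batch: list of (logp_new, logp_old) evaluated at teacher actions.
    teacher_weight: hypothetical mixing coefficient between the two terms.
    """
    student = sum(
        clipped_ratio_term(n, o, a, eps) for n, o, a in student_batch
    ) / len(student_batch)
    teacher = sum(
        teacher_term(n, o, eps) for n, o in teacher_batch
    ) / len(teacher_batch)
    return -(student + teacher_weight * teacher)
```

In this sketch the clipping plays the same role for both terms: once the updated policy has moved far enough toward a sample, that sample stops contributing gradient, which limits how hard a single teacher action can pull the student in one update.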

Key results

Novel TADPO algorithm extending PPO with teacher action distillation
Vision-based end-to-end RL system navigating extreme slopes and obstacle-rich terrain in simulation
First zero-shot sim-to-real deployment of RL policies on a full-scale off-road vehicle
High-speed, long-horizon autonomous navigation without fine-tuning or dense mapping

Why it matters

Enables reliable autonomous navigation in unstructured, unmapped environments where traditional mapping and modeling fail, advancing practical off-road robotics and exploration.

Abstract

Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform. Source code is available at this link and video at this link.

Index terms

Reinforcement Learning, Field Robots, Machine Learning for Robot Control