ICRA 2026

A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving

Timur Anvar, Jeffrey Chen, Yuyan Wang, Rohan Chandra

AI summary

Small local LLMs can improve RL safety through reward shaping but introduce a systematic conservative bias and degrade driving efficiency, making them unsuitable as direct controllers but useful as training guides.

Autonomous Driving, Reinforcement Learning, Large Language Models, Reward Shaping, Hybrid Control

Problem

RL for autonomous driving struggles with crafting reward functions that capture complex semantic and social driving norms, while direct LLM control is unstable, inconsistent, and impractical for real-time safety-critical tasks.

Approach

The authors conduct a case study comparing RL-only, LLM-only, and hybrid approaches where small local LLMs shape RL rewards during training by scoring state-action transitions, while standard RL policies handle real-time control.
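As a rough illustration of the hybrid idea, the sketch below adds a weighted score for each state-action transition to the environment reward during training. It is a minimal sketch under stated assumptions, not the authors' implementation: the additive combination, the weight, the [0, 1] score range, the state layout, and the names `shaped_reward`, `stub_llm_scorer`, and `shape_transition` are all hypothetical, and the LLM call is replaced by a simple rule so the example runs without a model.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (state, action, next_state) -> score in [0, 1].
# In a real setup this would wrap a prompt to a locally hosted LLM.
Scorer = Callable[[Sequence[float], int, Sequence[float]], float]


def shaped_reward(env_reward: float, score: float, weight: float = 0.5) -> float:
    """Add a weighted transition score to the environment reward.

    The score is clipped to [0, 1] so a malformed scorer output cannot
    dominate the environment reward (an assumption of this sketch).
    """
    clipped = min(1.0, max(0.0, score))
    return env_reward + weight * clipped


def stub_llm_scorer(state: Sequence[float], action: int,
                    next_state: Sequence[float]) -> float:
    """Stand-in for an LLM judgment of a transition.

    Assumed state layout: index 0 is headway to the lead vehicle in meters.
    Returns 1.0 if the resulting headway is at least 10 m, else 0.0.
    """
    headway = next_state[0]
    return 1.0 if headway >= 10.0 else 0.0


def shape_transition(state: Sequence[float], action: int,
                     next_state: Sequence[float], env_reward: float,
                     scorer: Scorer, weight: float = 0.5) -> float:
    """Score one transition and return the reward the RL learner would see."""
    return shaped_reward(env_reward, scorer(state, action, next_state), weight)


if __name__ == "__main__":
    safe = shape_transition([20.0, 25.0], 1, [15.0, 25.0], 1.0, stub_llm_scorer)
    risky = shape_transition([20.0, 25.0], 1, [5.0, 25.0], 1.0, stub_llm_scorer)
    print(safe, risky)
```

In this arrangement the scorer is consulted only while training. At test time the learned RL policy acts on its own, which matches the split described above.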

Key results

RL-only agents achieve 73–89% success rates with higher speeds
LLM-only agents reach up to 94% success but exhibit severely degraded speed and conservative bias
Hybrid reward shaping balances safety and efficiency but inherits systematic conservative bias
Gemma3-12B yields safer yet slower hybrid policies than Qwen3-14B

Why it matters

The study highlights the practical limits of resource-constrained LLMs for safety-critical control and offers researchers and engineers guidance on integrating semantic reasoning into autonomous driving systems without compromising real-time performance.

Abstract

Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. In the past decade, many planning and control approaches have used reinforcement learning (RL) with notable success. However, a key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (≤14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. These models are attractive for practical deployment as they can run on a single GPU and avoid external API dependencies. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73–89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.

Index terms

Intelligent Transportation Systems