VisuaLLMPlanner - a Maneuver Planner for Automated Vehicles Using Large Language Models
Daniel Neurath, Bernd Schäufele, Ilja Radusch
AI summary
Problem
Conventional motion planners fail to handle rare, unpredictable long-tail driving scenarios due to a lack of high-level contextual reasoning. Existing LLM-based approaches often lack spatial precision or cannot isolate the model's actual decision-making contribution.
Approach
The system triggers a multimodal LLM only when a standard planner encounters an unresolved obstacle, feeding it a bird’s-eye view image and structured scene description. The model then selects from a discrete set of pre-computed, validated trajectory options rather than generating plans from scratch.
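The gating and selection logic described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Maneuver` structure, the `ask_model` callback standing in for the multimodal LLM, the maneuver names, and the fallback to the first option are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Maneuver:
    """A pre-computed trajectory option (hypothetical structure)."""
    name: str
    is_valid: bool  # outcome of an assumed upstream validation step


def choose_maneuver(
    obstacle_unresolved: bool,
    options: Sequence[Maneuver],
    ask_model: Callable[[Sequence[Maneuver]], int],
    fallback_index: int = 0,
) -> Optional[Maneuver]:
    """Return the selected maneuver, or None if the base planner keeps control."""
    if not obstacle_unresolved:
        return None  # the model is only queried for unresolved obstacles

    # Only validated options are ever shown to the model.
    valid = [m for m in options if m.is_valid]
    if not valid:
        return None

    index = ask_model(valid)
    # The model may only pick an existing option; anything else falls back
    # to an assumed default (here: the first valid option).
    if not 0 <= index < len(valid):
        index = fallback_index
    return valid[index]
```

Because the model returns an index into a list of validated options rather than a free-form trajectory, an out-of-range or malformed answer can be caught and replaced with a default before it reaches the vehicle.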
Key results
- Outperforms prior LLM-based planners and fixed heuristics on the interPlan long-tail benchmark
- Achieves high success rates in safety-critical categories like Jaywalker and Construction scenarios
- Demonstrates that querying foundation models to choose from validated options yields more robust and explainable decisions
- Successfully isolates and quantifies the LLM's contribution by restricting base planner autonomy during decision phases
Why it matters
Offers a practical, interpretable blueprint for safely integrating foundation models into automated driving stacks while clarifying their real-world reasoning limits.
Abstract
Achieving safe and reliable automated driving in real-world conditions requires the ability to handle rare and unpredictable situations, commonly known as long-tail scenarios. These cases are often underrepresented in training data and remain a major challenge for conventional motion planning systems. In this work, we present VisuaLLMPlanner, a maneuver planning framework that integrates a multimodal large language model (MLLM) into the high-level decision-making loop of an automated driving pipeline. The system is triggered when the ego vehicle encounters a situation with an obstacle that cannot be resolved by a standard lane-following planner. At this point, a structured input comprising a bird’s-eye view image and a textual scene description is generated and passed to the MLLM. Rather than generating plans directly, the model selects from a discrete set of pre-generated and validated maneuver options, allowing for interpretable and structured decision-making. We evaluate our approach on the interPlan benchmark, which focuses explicitly on long-tail scenarios, and demonstrate that VisuaLLMPlanner achieves strong performance in comparison to prior LLM-based planners. The results highlight both the potential and current limitations of foundation models for high-level reasoning in automated vehicle planning.
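As a rough illustration of the textual half of the structured input, the following Python sketch assembles a scene description and an enumerated option list into a prompt and parses the model's reply back into an option index. The prompt wording, the function names, and the reply format are assumptions and are not taken from the paper; the bird's-eye view image that accompanies the text is omitted here.

```python
import re
from typing import Optional, Sequence


def build_prompt(scene_description: str, option_labels: Sequence[str]) -> str:
    """Combine the scene text and the numbered maneuver options (assumed format)."""
    lines = ["Scene:", scene_description, "", "Options:"]
    lines += [f"{i}: {label}" for i, label in enumerate(option_labels)]
    lines.append("Answer with the number of exactly one option.")
    return "\n".join(lines)


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    """Extract the first integer in the reply; None if it is missing or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    index = int(match.group())
    return index if index < n_options else None
```

Parsing returns `None` instead of guessing when the reply contains no usable index, so the caller can decide how to handle an unusable answer.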