ICRA 2026

Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Amir Mallak, Alaa Maalouf

AI summary

Vision Transformers with frozen foundation-model features achieve state-of-the-art out-of-distribution robustness in autonomous driving, but aggregate metrics mask critical, factor-specific failure modes that only a decomposed evaluation reveals.

OOD robustness; Vision Transformers; Autonomous driving; Factorized evaluation; Foundation models; Closed-loop control

Problem

Out-of-distribution robustness in vision-based autonomous driving is typically reduced to a single aggregate score, obscuring which environmental factors break a policy and by how much. This hides actionable insights needed for safe real-world deployment.

Approach

The study decomposes driving environments into five semantic axes (scene, season, weather, time, agents) and benchmarks FC, CNN, and ViT policies under controlled k-factor perturbations in the VISTA simulator, systematically varying training data scale, diversity, and foundation-model features.
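
The k-factor perturbation idea can be sketched in a few lines of Python. This is only an illustration, not the study's code: the value sets for season, weather, and agents, the `AXES` dictionary, and the `k_factor_perturbations` helper are all assumptions introduced here (the source names only the rural/urban and day/night values explicitly).

```python
from itertools import combinations, product

# Hypothetical value sets per axis. Only the scene and time values are
# named in the source; the rest are placeholders for illustration.
AXES = {
    "scene": ["rural", "urban"],
    "season": ["spring", "summer", "autumn", "winter"],
    "weather": ["clear", "rain", "snow"],
    "time": ["day", "night"],
    "agents": ["cars", "mixed"],
}


def k_factor_perturbations(base, k, axes=AXES):
    """Yield every configuration that differs from `base` on exactly k axes."""
    for changed in combinations(axes, k):
        # For each changed axis, consider every value except the base one.
        alternatives = [[v for v in axes[a] if v != base[a]] for a in changed]
        for values in product(*alternatives):
            cfg = dict(base)
            cfg.update(zip(changed, values))
            yield cfg


# An assumed in-distribution configuration (rural, summer, clear, day).
base = {
    "scene": "rural",
    "season": "summer",
    "weather": "clear",
    "time": "day",
    "agents": "cars",
}

for k in range(4):
    n = sum(1 for _ in k_factor_perturbations(base, k))
    print(f"k={k}: {n} configurations")
```

With these assumed value sets, the loop reports 1, 8, 24, and 34 configurations for k = 0, 1, 2, 3. The counts depend entirely on how many values each axis is given, so they say nothing about the size of the actual benchmark.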

Key results

ViT policies are more OOD-robust than comparably sized CNN/FC policies; with frozen foundation-model features, success stays above 85% under three simultaneous shifts
Rural-to-urban and day-to-night shifts cause the largest single-factor drops (~31% each), and factor interactions are non-additive
Naive multi-frame inputs do not beat the best single-frame baseline; training on winter/snow is most robust to single-factor shifts
Training on multiple ID environments broadens OOD coverage (urban OOD 60.6% → 70.1%) at the cost of a small in-distribution drop

Why it matters

These findings provide actionable, factor-specific design rules for data curation, model selection, and simulation curriculum planning to improve safety-critical autonomous driving deployment.

Abstract

Out-of-distribution (OOD) robustness in vision-based autonomous driving is often reduced to a single number, hiding what breaks a policy and by how much. We adopt a factorized view, decomposing environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix, and measure performance under controlled k-factor perturbations (k ∈ {0, 1, 2, 3}). Using closed-loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary in-distribution (ID) support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and adding FM features yields state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single-factor drops are rural → urban and day → night (~31% each); actor swaps ~10% and moderate rain ~7%; several season shifts are drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above 85% under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below 50% by three changes. (5) Interactions are non-additive: some pairings (e.g., urban-night) partially offset, whereas season–time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views of the same configuration improves robustness (about +11.8 points from 5 to 14 traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD 60.6% → 70.1%) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
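
One simple way to make the non-additivity in finding (5) concrete is to compare a joint drop with the sum of the two single-factor drops. The sketch below assumes that additive baseline, which is only one of several reasonable definitions and is not stated in the source. The `interaction_excess` helper and the joint-drop value in the usage line are hypothetical.

```python
def interaction_excess(drop_a, drop_b, drop_ab):
    """Joint drop minus the sum of the single-factor drops, in percentage points.

    Negative: the two shifts partially offset each other.
    Positive: they compound beyond the additive expectation.
    Zero: the drops are exactly additive.
    """
    return drop_ab - (drop_a + drop_b)


# The single-factor drops (~31 each) come from the abstract; the joint
# drop of 45 is a made-up placeholder, NOT a reported result.
excess = interaction_excess(31.0, 31.0, 45.0)
print(excess)  # -17.0, i.e. a partially offsetting pairing under this definition
```

Under this definition, a pairing such as urban-night that "partially offsets" would show a negative excess, while the especially harmful season–time combinations would show a positive one.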

Index terms

Control Architectures and Programming; Agent-Based Systems; AI-Based Methods