VLADriver-RAG

VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

Rui Zhao¹,², Haofeng Hu¹,², Zhenhai Gao¹,², Jiaqiao Liu³, and Gao Fei¹,²

¹College of Automotive Engineering, Jilin University, ²The National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University, ³ReeFocus AI Technology

📄 arXiv

🔬Overview

Overview of the proposed VLADriver-RAG framework.

Abstract. Vision-Language-Action models have emerged as a promising paradigm for end-to-end autonomous driving. However, their reliance on implicit parametric knowledge limits generalization in long-tail driving scenarios. VLADriver-RAG addresses this problem by introducing a retrieval-augmented autonomous driving framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, sensory inputs are abstracted into spatiotemporal semantic graphs through a Visual-to-Scenario mechanism, and a Scenario-Aligned Embedding Model retrieves topologically consistent historical driving priors. These retrieved priors are then fused into a query-based VLA backbone to generate accurate and disentangled path and speed planning outputs.

🏗️Architecture

VLADriver-RAG is built upon a retrieval-augmented Vision-Language-Action architecture that integrates real-time visual perception with structure-aware historical driving priors.

The framework first converts raw visual observations into semantic scenario representations through a Visual-to-Scenario mechanism. These structured representations are used to query a scenario primitive database and retrieve topologically aligned historical cases.
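The retrieval step described above can be illustrated with a minimal sketch. It assumes each scenario has already been encoded as a fixed-length embedding vector and that similarity is measured by cosine similarity; both are assumptions for illustration, as are the names `cosine_similarity` and `retrieve_top_k` and the `(case_id, embedding)` database layout. It is not the repository's implementation of the Scenario-Aligned Embedding Model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k historical cases most similar to the query embedding.

    `database` is a hypothetical list of (case_id, embedding) pairs standing in
    for the scenario primitive database.
    """
    scored = [(case_id, cosine_similarity(query, emb)) for case_id, emb in database]
    # Highest similarity first; Python's sort is stable, so ties keep database order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the database size, which is one reason retrieval latency grows as more historical cases are stored.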

The retrieved context is then projected into the VLA backbone together with visual tokens, target points, and ego-state information. Finally, path queries and speed queries are decoded into disentangled trajectory and velocity planning outputs.
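To make the idea of disentangled outputs concrete, the sketch below shows one way a geometric path and a separate speed profile could be recombined into time-stamped positions. This is an illustrative assumption about downstream use, not the decoding procedure of VLADriver-RAG: the function names, the 2D waypoint format, and the fixed time step `dt` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _interpolate(path: Sequence[Point], cum: Sequence[float], s: float) -> Point:
    """Return the point at arc length s along the polyline `path`."""
    for i in range(1, len(cum)):
        if s <= cum[i]:
            seg = cum[i] - cum[i - 1]
            t = 0.0 if seg == 0.0 else (s - cum[i - 1]) / seg
            (x0, y0), (x1, y1) = path[i - 1], path[i]
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return path[-1]


def combine_path_and_speed(
    path: Sequence[Point], speeds: Sequence[float], dt: float
) -> List[Point]:
    """Advance along `path` by speeds[i] * dt at each step.

    The path fixes *where* to drive and the speed profile fixes *how fast*;
    the result is one position per time step, clamped at the end of the path.
    """
    if not path:
        raise ValueError("path must contain at least one waypoint")
    # Cumulative arc length at each waypoint.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    positions: List[Point] = []
    travelled = 0.0
    for v in speeds:
        # Negative speeds are treated as standing still in this sketch.
        travelled = min(travelled + max(v, 0.0) * dt, cum[-1])
        positions.append(_interpolate(path, cum, travelled))
    return positions
```

Keeping the two outputs separate means the same path can be replayed with a different speed profile (for example, a slower one near an obstacle) without re-predicting the geometry.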

Architecture of VLADriver-RAG, including Visual-to-Scenario abstraction, Scenario-Aligned Embedding, retrieval augmentation, and query-based VLA planning.

📊Quantitative Results

We evaluate VLADriver-RAG on the Bench2Drive benchmark and analyze the effects of retrieval design, graph priors, training strategy, and database scale. Alongside the quantitative comparisons, we provide qualitative visualizations that illustrate the practical advantages of retrieval-augmented planning in challenging driving scenarios.

Comparison of closed-loop driving performance on the Bench2Drive benchmark.

Ablation Study on Retrieval and Embedding Design

Ablation results on retrieval strategy and embedding design.

Ablation Study on Graph Prior and Training Strategy

Ablation results on graph prior modeling and training strategy.

Qualitative visualization of corner-case driving scenarios

In challenging corner-case driving scenarios, the baseline often produces unstable or unsafe plans, whereas VLADriver-RAG (b) generates a safer and more reliable trajectory. These qualitative results suggest that retrieved historical knowledge improves planning robustness and decision stability in uncertain environments.

Impact of database scale on driving performance and retrieval latency

Expanding the retrieval database generally improves driving performance by providing a richer pool of historical priors, but it also increases retrieval latency. The results therefore reveal a clear trade-off between planning quality and retrieval efficiency: in practical deployment, the database size should be chosen to balance driving score against inference speed.
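One simple way to act on this trade-off is to measure driving score and retrieval latency at several database sizes, then pick the best-scoring size that still fits a latency budget. The sketch below assumes such measurements are available; the `Measurement` record, its field names, and `select_database_size` are hypothetical and are not part of the released code. No measured values are included here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Measurement:
    """One benchmark run at a given database size (hypothetical record)."""
    database_size: int      # number of stored historical cases
    driving_score: float    # closed-loop score, higher is better
    latency_ms: float       # retrieval latency per query, in milliseconds


def select_database_size(
    measurements: Iterable[Measurement], latency_budget_ms: float
) -> Optional[Measurement]:
    """Return the highest-scoring run whose latency fits the budget.

    Ties on score are broken in favor of lower latency. Returns None when no
    run satisfies the budget.
    """
    feasible = [m for m in measurements if m.latency_ms <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: (m.driving_score, -m.latency_ms))
```

The latency budget itself depends on the planning frequency the vehicle stack requires, so it has to be set per deployment.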

📚Citation

@misc{zhao2026vladriverragretrievalaugmentedvisionlanguageactionmodels,
      title={VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving}, 
      author={Rui Zhao and Haofeng Hu and Zhenhai Gao and Jiaqiao Liu and Gao Fei},
      year={2026},
      eprint={2605.08133},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.08133}, 
}