The main architecture consists of a Transformer-based perception module and a Hybrid Diffusion and Supervision Decoder. Blue arrows indicate data flow used exclusively for the CARLA benchmark, while black arrows indicate data flow shared by the CARLA and NAVSIM benchmarks.