CVPR 2026

EventDrive

Event Cameras for Vision-Language Driving Intelligence

¹NUS   ²HKUST(GZ)   ³CNRS@CREATE   ⁴Horizon Robotics   ⁵A*STAR, I2R
⁶IPAL, CNRS IRL 2955, Singapore   ⁷University Toulouse, CNRS, CerCo   ⁸ETIS, CY Cergy Paris University, ENSEA, CNRS

Figure: EventDrive benchmark overview across perception, understanding, prediction, and planning.

EventDrive brings asynchronous events, RGB frames, and language supervision together across the full driving loop.

471,543 event-frame-language samples
17 driving subtasks
4 reasoning stages
42,869 hard-split samples

Overview

From asynchronous sensing to driving decisions

Event cameras record brightness changes with microsecond latency and high dynamic range. They preserve motion structure in low light, glare, and rapid motion, where conventional frames can become unreliable.

01

Full-stack benchmark

Language-grounded tasks connect perception, object understanding, motion prediction, and ego planning.

02

Real-world multimodal data

Synchronized event streams and RGB frames are paired with structured supervision from three driving datasets.

03

Event-frame reasoning

EventDrive-VLM combines high-frequency motion cues with frame semantics inside one multimodal model.

Benchmark

Four stages. One driving loop.

Each stage probes a distinct role for temporal cues, from robust environmental perception to motion-aware planning.

01

Perception

Scene-level context under challenging illumination and motion.

  • Scene type
  • Visibility
  • Traffic flow
  • Weather
  • Traffic light
  • Road condition

02

Understanding

Object-centric semantics, relationships, and grounding.

  • Object presence
  • Appearance
  • Motion state
  • Ego relation
  • Environment relation
  • Grounding

03

Prediction

Short-horizon behavioral forecasting for surrounding agents.

  • Speed change
  • Direction change

04

Planning

Ego-centric intent estimation and future waypoint generation.

  • Speed intent
  • Direction intent
  • Waypoint planning

Dataset

Language-grounded data generation

EventDrive builds structured supervision from DSEC, M3ED, and PKU-DAVIS-SOD, combining events with frames, boxes, LiDAR, and ego poses.

Figure: EventDrive annotation pipelines for the four reasoning stages.

01

Scene-level captions

Frames are captioned and converted into QA pairs covering visibility, traffic flow, weather, road conditions, scene type, and traffic lights.
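
As a rough illustration of the caption-to-QA step, the sketch below turns a set of already-parsed scene attributes into question-answer pairs. The field names, the question wording, and the `caption_to_qa` helper are hypothetical; the page does not specify the actual templates or the captioning model.

```python
# Hypothetical question templates, one per scene-level subtask named above.
QUESTION_TEMPLATES = {
    "scene_type": "What type of scene is the vehicle driving in?",
    "visibility": "How is the visibility in this scene?",
    "traffic_flow": "How heavy is the traffic flow?",
    "weather": "What is the weather condition?",
    "traffic_light": "What is the state of the traffic light?",
    "road_condition": "What is the condition of the road surface?",
}


def caption_to_qa(attributes):
    """Turn parsed caption attributes into QA pairs, skipping missing fields."""
    return [
        {"task": key, "question": question, "answer": attributes[key]}
        for key, question in QUESTION_TEMPLATES.items()
        if attributes.get(key) is not None
    ]


# Example: attributes assumed to have been extracted from one frame caption.
qa_pairs = caption_to_qa({"weather": "rainy", "visibility": "low"})
```

Fields that the caption does not mention are simply skipped, so a frame with no visible traffic light yields no traffic-light question.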

02

Object-centric semantics

Ground-truth boxes guide attribute captions, structured QA, and spatial grounding for object appearance, motion state, and relationships.

03

Agent motion labels

Tracked trajectories are transformed into the ego frame and discretized into speed and direction intents for surrounding agents.
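
The two operations described here, an ego-frame transform followed by discretization, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the label vocabulary, and the dead-band thresholds (`tol`) are all hypothetical, and the ego frame is assumed to have x pointing forward and y pointing left.

```python
import math


def world_to_ego(point_xy, ego_xy, ego_yaw):
    """Express a world-frame 2D point in the ego frame (x forward, y left)."""
    dx = point_xy[0] - ego_xy[0]
    dy = point_xy[1] - ego_xy[1]
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    # Rotate by -yaw, the inverse of the ego-to-world rotation.
    return (c * dx + s * dy, -s * dx + c * dy)


def speed_label(v_start, v_end, tol=0.5):
    """Discretize a speed change in m/s; `tol` is an assumed dead band."""
    delta = v_end - v_start
    if delta > tol:
        return "accelerate"
    if delta < -tol:
        return "decelerate"
    return "keep speed"


def direction_label(heading_start, heading_end, tol=math.radians(10)):
    """Discretize a heading change; positive (counter-clockwise) means left."""
    delta = heading_end - heading_start
    delta = math.atan2(math.sin(delta), math.cos(delta))  # wrap to [-pi, pi]
    if delta > tol:
        return "turn left"
    if delta < -tol:
        return "turn right"
    return "go straight"
```

Wrapping the heading difference before thresholding keeps an agent that crosses the ±180° boundary from being mislabeled as making a near-full turn.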

04

Ego planning supervision

Ego poses define speed intent, direction intent, and future waypoints, forming decision-oriented planning queries.
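
One way to derive waypoint targets from ego poses is to express the next few ego positions in the coordinate frame of the current pose. The sketch below assumes 2D poses `(x, y, yaw)` in a shared world frame, an ego frame with x forward and y left, and a hypothetical question string; the actual horizon, sampling rate, and query wording are not given on this page.

```python
import math


def future_waypoints(poses, index, horizon):
    """Return the next `horizon` ego positions in the frame of poses[index].

    poses: list of (x, y, yaw) tuples in a world frame, yaw in radians.
    """
    x0, y0, yaw0 = poses[index]
    c, s = math.cos(yaw0), math.sin(yaw0)
    waypoints = []
    for x, y, _ in poses[index + 1 : index + 1 + horizon]:
        dx, dy = x - x0, y - y0
        # Rotate the world-frame offset by -yaw0 into the current ego frame.
        waypoints.append((c * dx + s * dy, -s * dx + c * dy))
    return waypoints


def planning_sample(poses, index, horizon):
    """Package the waypoints as a planning query (question text is hypothetical)."""
    waypoints = future_waypoints(poses, index, horizon)
    return {
        "question": "Plan the ego vehicle's next waypoints.",
        "answer": [(round(x, 2), round(y, 2)) for x, y in waypoints],
    }
```

For a vehicle driving straight along the world y-axis with yaw π/2, the waypoints come out along the ego x-axis, that is, straight ahead.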

Hard Split

Designed for the moments when frames struggle

The hard split contains 42,869 samples from low-light and motion-blur sequences, enabling targeted evaluation of event sensing under adverse conditions.

Model

EventDrive-VLM

A dynamic event pathway captures temporal structure at multiple horizons, then aligns motion-aware event tokens with frame and language representations.

Figure: EventDrive-VLM architecture.

01

Multi-horizon voxelization

Short-, medium-, and long-horizon tensors preserve motion patterns across temporal scales.

02

Dynamic aggregation

A mixture-of-experts gate emphasizes the most informative temporal horizon for each scene.
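
The gating idea can be illustrated with a soft gate: a linear layer scores each horizon from the concatenated horizon features, a softmax turns the scores into weights, and the fused feature is the weighted sum. This is only a sketch under assumptions. The page does not say whether the gate is soft or top-k, what it takes as input, or how the experts are structured; `gate_horizons` and `gate_weights` are hypothetical names.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def gate_horizons(horizon_feats, gate_weights):
    """Fuse per-horizon features with a soft gate.

    horizon_feats: (H, D) array, one pooled feature vector per horizon.
    gate_weights:  (H * D, H) array standing in for a learned linear gate.
    """
    logits = horizon_feats.reshape(-1) @ gate_weights  # (H,)
    weights = softmax(logits)                          # sums to 1
    fused = weights @ horizon_feats                    # (D,)
    return fused, weights
```

With an all-zero gate the weights are uniform and the output is the mean over horizons; a trained gate would shift that weight toward whichever horizon is most useful for the scene.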

03

Event Q-Former alignment

Learnable queries extract compact motion-aware tokens for coherent multimodal reasoning.
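
The core read-out in a Q-Former-style module is cross-attention from a small set of learnable queries to a larger set of input features, so that many event features are compressed into a fixed number of tokens. The NumPy sketch below shows a single-head version of that step only. The sizes are assumed, random values stand in for learned parameters, and any self-attention or feed-forward layers the actual module may use are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_cross_attention(queries, event_feats, w_q, w_k, w_v):
    """Single-head cross-attention from queries (Q, D) to features (N, D)."""
    q = queries @ w_q                                # (Q, D)
    k = event_feats @ w_k                            # (N, D)
    v = event_feats @ w_v                            # (N, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (Q, N), rows sum to 1
    return attn @ v, attn                            # tokens: (Q, D)


rng = np.random.default_rng(0)
num_queries, num_event_feats, dim = 4, 50, 8         # assumed sizes
queries = rng.normal(size=(num_queries, dim))        # stands in for learned queries
event_feats = rng.normal(size=(num_event_feats, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

tokens, attn = query_cross_attention(queries, event_feats, w_q, w_k, w_v)
```

The output has one token per query regardless of how many event features go in, which is what keeps the token count passed to the language model compact.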

Results

Temporal cues complement frame semantics

Event-frame fusion improves reasoning across the driving chain, particularly for grounding, motion prediction, and planning.

Figure: Comparison with event-only, frame-only, and event-frame models across the four driving-reasoning tasks in the EventDrive benchmark.

01

Event-frame fusion is strongest overall. It leads most metrics across perception, understanding, prediction, and planning.

02

Events add motion awareness. High-frequency temporal cues improve speed reasoning and stabilize waypoint prediction.

03

Frames and events are complementary. RGB contributes semantics, while events remain informative under blur and difficult illumination.

Qualitative Comparison

Reliable reasoning under low light and rapid motion

Frames preserve appearance semantics, while events reveal temporal gradients and motion structure. EventDrive-VLM fuses both signals for more robust scene understanding and ego decisions.

Figure: Qualitative comparison for perception and understanding under strong sunlight.
Figure: Qualitative comparison for prediction and planning at night.

Citation

Use EventDrive in your research

The dataset and toolkit are available on Hugging Face. Please cite EventDrive when using the benchmark in your research.

@InProceedings{Lu_2026_CVPR,
  author    = {Lu, Dongyue and Li, Rong and Liang, Ao and Kong, Lingdong and Yin, Wei and Ng, Lai Xing and Cottereau, Benoit R. and Chane, Camille Simon and Ooi, Wei Tsang},
  title     = {EventDrive: Event Cameras for Vision-Language Driving Intelligence},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}