CVPR 2026

EventDrive

Event Cameras for Vision-Language Driving Intelligence

Dongyue Lu¹,⁶ Rong Li² Ao Liang¹ Lingdong Kong¹,³ Wei Yin⁴ Lai Xing Ng⁵ Benoit R. Cottereau⁶,⁷ Camille Simon Chane⁸ Wei Tsang Ooi¹,⁶

¹NUS ²HKUST(GZ) ³CNRS@CREATE ⁴Horizon Robotics ⁵A*STAR, I²R
⁶IPAL, CNRS IRL 2955, Singapore ⁷University Toulouse, CNRS, CerCo ⁸ETIS, CY Cergy Paris University, ENSEA, CNRS


EventDrive benchmark overview across perception, understanding, prediction, and planning

EventDrive brings asynchronous events, RGB frames, and language supervision together across the full driving loop.

471,543 event-frame-language samples

17 driving subtasks

4 reasoning stages

42,869 hard-split samples

Overview

From asynchronous sensing to driving decisions

Event cameras record brightness changes with microsecond latency and high dynamic range. They preserve motion structure in low light, glare, and rapid motion, where conventional frames can become unreliable.

Full-stack benchmark

Language-grounded tasks connect perception, object understanding, motion prediction, and ego planning.

Real-world multimodal data

Synchronized event streams and RGB frames are paired with structured supervision from three driving datasets.

Event-frame reasoning

EventDrive-VLM combines high-frequency motion cues with frame semantics inside one multimodal model.

Benchmark

Four stages. One driving loop.

Each stage probes a distinct role for temporal cues, from robust environmental perception to motion-aware planning.

Perception

Scene-level context under challenging illumination and motion.

Scene type
Visibility
Traffic flow
Weather
Traffic light
Road condition

Understanding

Object-centric semantics, relationships, and grounding.

Object presence
Appearance
Motion state
Ego relation
Environment relation
Grounding

Prediction

Short-horizon behavioral forecasting for surrounding agents.

Speed change
Direction change

Planning

Ego-centric intent estimation and future waypoint generation.

Speed intent
Direction intent
Waypoint planning

Dataset

Language-grounded data generation

EventDrive builds structured supervision from DSEC, M3ED, and PKU-DAVIS-SOD, combining events with frames, boxes, LiDAR, and ego poses.

EventDrive annotation pipelines for four reasoning stages

Scene-level captions

Frames are captioned and converted into QA pairs covering visibility, traffic flow, weather, road conditions, scene type, and traffic lights.

Object-centric semantics

Ground-truth boxes guide attribute captions, structured QA, and spatial grounding for object appearance, status, and relationships.

Agent motion labels

Tracked trajectories are transformed into the ego frame and discretized into speed and direction intents for surrounding agents.
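
The step from tracked trajectories to discrete labels can be illustrated with a minimal sketch. It assumes 2D positions sampled at a uniform interval, a single reference ego pose, and an ego frame with x forward and y to the left. The function names, label strings, and thresholds (`accel_thresh`, `yaw_thresh`) are hypothetical and are not taken from the EventDrive toolkit.

```python
import math

def world_to_ego(point_xy, ego_xy, ego_yaw):
    """Express a world-frame 2D point in the ego frame (x forward, y left)."""
    dx = point_xy[0] - ego_xy[0]
    dy = point_xy[1] - ego_xy[1]
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    # Rotate the offset by -ego_yaw.
    return (c * dx + s * dy, -s * dx + c * dy)

def _steps(track):
    """Displacement vectors between consecutive track points."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]

def speed_change_label(track, dt, accel_thresh=0.5):
    """Discretize mean acceleration (m/s^2) into a speed-change label.

    accel_thresh is an assumed value, not one reported by the authors.
    """
    if len(track) < 3:
        raise ValueError("need at least three points to estimate acceleration")
    speeds = [math.hypot(dx, dy) / dt for dx, dy in _steps(track)]
    accel = (speeds[-1] - speeds[0]) / (dt * (len(speeds) - 1))
    if accel > accel_thresh:
        return "accelerate"
    if accel < -accel_thresh:
        return "decelerate"
    return "keep speed"

def direction_change_label(track, yaw_thresh=math.radians(15)):
    """Discretize the heading change between the first and last step."""
    if len(track) < 3:
        raise ValueError("need at least three points to estimate heading change")
    steps = _steps(track)
    h0 = math.atan2(steps[0][1], steps[0][0])
    h1 = math.atan2(steps[-1][1], steps[-1][0])
    # Wrap the difference to [-pi, pi].
    delta = math.atan2(math.sin(h1 - h0), math.cos(h1 - h0))
    if delta > yaw_thresh:
        return "turn left"
    if delta < -yaw_thresh:
        return "turn right"
    return "go straight"

# Example: an agent tracked in world coordinates, ego at the origin facing +y.
world_track = [(0.0, 1.0), (0.0, 2.0), (0.0, 4.0), (0.0, 7.0)]
ego_track = [world_to_ego(p, (0.0, 0.0), math.pi / 2) for p in world_track]
labels = (speed_change_label(ego_track, dt=1.0), direction_change_label(ego_track))
```

Using one reference ego pose keeps the labels tied to the viewpoint of the query frame. A per-timestep ego pose would instead describe motion relative to the moving vehicle.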

Ego planning supervision

Ego poses define speed intent, direction intent, and future waypoints, forming decision-oriented planning queries.
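
As a rough illustration of how ego poses can yield planning targets, the sketch below expresses future ego positions in the current ego frame and derives a direction intent from the yaw change. It assumes poses are given as `(x, y, yaw)` in a world frame, with x forward and y to the left in the ego frame. The function names, label strings, and `yaw_thresh` are hypothetical; a speed intent could be obtained by thresholding the speed change in the same way.

```python
import math

def future_waypoints(poses, current_idx, horizon):
    """Return up to `horizon` future (x, y) positions in the current ego frame."""
    x0, y0, yaw0 = poses[current_idx]
    c, s = math.cos(yaw0), math.sin(yaw0)
    waypoints = []
    for x, y, _ in poses[current_idx + 1 : current_idx + 1 + horizon]:
        dx, dy = x - x0, y - y0
        # Rotate the world-frame offset by -yaw0.
        waypoints.append((c * dx + s * dy, -s * dx + c * dy))
    return waypoints

def direction_intent(poses, current_idx, horizon, yaw_thresh=math.radians(10)):
    """Discretize the yaw change over the horizon (threshold is an assumption)."""
    end = min(current_idx + horizon, len(poses) - 1)
    delta = poses[end][2] - poses[current_idx][2]
    delta = math.atan2(math.sin(delta), math.cos(delta))  # wrap to [-pi, pi]
    if delta > yaw_thresh:
        return "turn left"
    if delta < -yaw_thresh:
        return "turn right"
    return "go straight"

# Example: the ego drives along +y, then bends toward -x (a left turn).
poses = [
    (0.0, 0.0, math.pi / 2),
    (0.0, 1.0, math.pi / 2),
    (0.0, 2.0, math.pi / 2),
    (-1.0, 3.0, math.pi),
]
waypoints = future_waypoints(poses, current_idx=0, horizon=3)
intent = direction_intent(poses, current_idx=0, horizon=3)
```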

Hard Split

Designed for the moments when frames struggle

The hard split contains low-light and motion-blur sequences, enabling targeted evaluation of event sensing under adverse conditions.


Model

EventDrive-VLM

A dynamic event pathway captures temporal structure at multiple horizons, then aligns motion-aware event tokens with frame and language representations.

Multi-horizon voxelization

Short-, medium-, and long-horizon tensors preserve motion patterns across temporal scales.
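
One plausible way to build such tensors is sketched below with NumPy. It assumes events arrive as rows of `(x, y, t, polarity)` with polarity in {+1, -1}, and it sums polarity into the nearest temporal bin. The horizon lengths, bin count, and function names are illustrative assumptions; the page does not specify the representation EventDrive-VLM actually uses.

```python
import numpy as np

def voxelize(events, t_end, horizon, num_bins, height, width):
    """Accumulate polarity into a (num_bins, height, width) grid.

    Only events with t in (t_end - horizon, t_end] are used.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t_start = t_end - horizon
    t = events[:, 2]
    ev = events[(t > t_start) & (t <= t_end)]
    if ev.shape[0] == 0:
        return grid
    # Normalized time in (0, 1], mapped to a bin index in [0, num_bins - 1].
    rel = (ev[:, 2] - t_start) / horizon
    bins = np.minimum((rel * num_bins).astype(np.int64), num_bins - 1)
    xs = ev[:, 0].astype(np.int64)
    ys = ev[:, 1].astype(np.int64)
    # np.add.at handles repeated indices, so events at the same cell accumulate.
    np.add.at(grid, (bins, ys, xs), ev[:, 3].astype(np.float32))
    return grid

def multi_horizon_voxels(events, t_end, horizons, num_bins, height, width):
    """Build one voxel grid per named horizon, all ending at t_end."""
    return {
        name: voxelize(events, t_end, length, num_bins, height, width)
        for name, length in horizons.items()
    }

# Hypothetical horizon lengths in seconds.
HORIZONS = {"short": 0.01, "medium": 0.05, "long": 0.2}
```

All three grids share the same end time, so the short-horizon tensor resolves the most recent motion finely while the long-horizon tensor covers slower scene dynamics.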

Dynamic aggregation

A mixture-of-experts gate emphasizes the most informative temporal horizon for each scene.
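
The gating idea can be sketched as a softmax over per-horizon scores followed by a weighted sum. This is a simplified soft gate, assuming one pooled feature vector per horizon and a single linear gating layer. The shapes, parameter names, and the choice of soft rather than top-k routing are assumptions, not details confirmed by the page.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gated_aggregate(horizon_feats, gate_w, gate_b):
    """Fuse per-horizon features with scene-dependent weights.

    horizon_feats: (K, D) array, one pooled feature vector per horizon.
    gate_w: (K * D, K) gating weights; gate_b: (K,) gating bias.
    Returns the fused (D,) feature and the (K,) gate weights.
    """
    logits = horizon_feats.reshape(-1) @ gate_w + gate_b
    weights = softmax(logits)
    fused = weights @ horizon_feats
    return fused, weights

# Example with random features for K = 3 horizons and D = 8 channels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
gate_w = rng.normal(scale=0.1, size=(3 * 8, 3))
fused, weights = gated_aggregate(feats, gate_w, np.zeros(3))
```

Because the gate sees all horizons at once, the weights can shift toward short windows in fast scenes and toward long windows in slow ones.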

Event Q-Former alignment

Learnable queries extract compact motion-aware tokens for coherent multimodal reasoning.
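
The core of this alignment step is cross-attention from a small set of learnable queries to a longer sequence of event tokens. The sketch below shows only that single-head cross-attention in NumPy; a full Q-Former block also includes self-attention and feed-forward layers. All shapes, parameter names, and the random initialization are illustrative assumptions.

```python
import numpy as np

def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention from queries to event tokens.

    queries: (Q, D) learnable query embeddings.
    tokens:  (N, D) event tokens.
    Returns (Q, D_out) compact tokens and the (Q, N) attention map.
    """
    q = queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Row-wise stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

# Example: 4 queries compress 50 event tokens of width 16.
rng = np.random.default_rng(0)
dim = 16
queries = rng.normal(size=(4, dim))
event_tokens = rng.normal(size=(50, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
compact, attn = cross_attention(queries, event_tokens, w_q, w_k, w_v)
```

The output length depends only on the number of queries, so the language model receives a fixed, small number of event tokens regardless of how many events a scene produces.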

Results

Temporal cues complement frame semantics

Event-frame fusion improves reasoning across the driving chain, particularly for grounding, motion prediction, and planning.

Comparison with event-only, frame-only, and event-frame models across the four driving-reasoning stages of the EventDrive benchmark.

Event-frame fusion is strongest overall. It leads most metrics across perception, understanding, prediction, and planning.

Events add motion awareness. High-frequency temporal cues improve speed reasoning and stabilize waypoint prediction.

Frames and events are complementary. RGB contributes semantics, while events remain informative under blur and difficult illumination.

Qualitative Comparison

Reliable reasoning under low light and rapid motion

Frames preserve appearance semantics, while events reveal temporal gradients and motion structure. EventDrive-VLM fuses both signals for more robust scene understanding and ego decisions.

Qualitative comparison for perception and understanding under strong sunlight

Qualitative comparison for prediction and planning at night

Citation

Use EventDrive in your research

The dataset and toolkit are available on Hugging Face. Please cite EventDrive when using the benchmark in your research.


@InProceedings{Lu_2026_CVPR,
  author    = {Lu, Dongyue and Li, Rong and Liang, Ao and Kong, Lingdong and Yin, Wei and Ng, Lai Xing and Cottereau, Benoit R. and Chane, Camille Simon and Ooi, Wei Tsang},
  title     = {EventDrive: Event Cameras for Vision-Language Driving Intelligence},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}