|
Chonghao Sima (司马崇昊)
Chonghao Sima is a Ph.D. student in Computer Science at The University of Hong Kong (HKU),
working at MMLab@HKU, advised by
Prof. Ping Luo and Prof. Hongyang Li.
He works closely with Kashyap Chitta and Prof. Andreas Geiger.
His research focuses on autonomous driving and embodied AI, spanning 3D perception, end-to-end planning, vision-language models for driving, and robot manipulation.
He received his B.S. from Huazhong University of Science and Technology (2015–2019),
then moved to the US to pursue his Ph.D. at Purdue University (2019–2023)
with Prof. Yexiang Xue,
before transferring to HKU in 2023.
He has interned at NVIDIA Autonomous Vehicle Applied Research with Dr. Jose M. Alvarez and Dr. Zhiding Yu,
and at OpenDriveLab, Shanghai AI Lab.
He was a core contributor to UniAD, which received the CVPR 2023 Best Paper Award (1/9155).
He was also a Best Paper Finalist (29/4306) at IROS 2025 and an Outstanding Reviewer (232/7000) at CVPR 2023.
He won 1st place in the Waymo Open Challenge 2022 (3D Camera-only Detection Track)
and hosted challenges at CVPR 2024 Autonomous Grand Challenge (DriveLM) and CVPR 2023 Autonomous Driving Challenge (3D Occupancy Prediction).
He holds 3 US patents, organizes workshops at CVPR and ICLR, and reviews for CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, ICRA, IROS, T-PAMI, and IJCV.
Email /
CV /
Google Scholar /
GitHub /
Twitter /
Bluesky /
LinkedIn
|
|
Publications
I believe in building benchmarks to accelerate the community's research cycle, and that a good module should scale to real-world scenarios.
My research addresses the curse of reality for physical agents — the challenge that the real world will always present conditions a model has never seen.
I organize my work around four research questions spanning the perceive-reason-act loop:
- Architecture: End-to-end systems with self-correction
- Act: Robust policy deployment under distributional shift
- Reason: LLM-grounded real-time planning
- Perceive: Universal 3D representations
Some papers are highlighted.
|
|
|
|
Intelligent Robot Manipulation Requires Self-Directed Learning
Li Chen, Chonghao Sima, Kashyap Chitta, Antonio Loquercio, Ping Luo, Yi Ma, Hongyang Li
Authorea Preprints, 2025
paper
/
bibtex
@article{chen2025robot,
title={Intelligent Robot Manipulation Requires Self-Directed Learning},
author={Chen, Li and Sima, Chonghao and Chitta, Kashyap and Loquercio, Antonio and Luo, Ping and Ma, Yi and Li, Hongyang},
journal={Authorea Preprints},
year={2025}
}
This Perspective argues that achieving human-level dexterity requires transcending imitation learning in favor of self-directed learning, which enables autonomous improvement in environments lacking resets or explicit rewards. We propose a framework structured around goal identification, skill acquisition, and performance evaluation, leveraging cross-modal synergies to navigate these algorithmic challenges.
|
|
Planning-oriented Autonomous Driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li,
Chonghao Sima,
Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li
CVPR, 2023 (Best Paper Award)
paper
/
code
/
bibtex
@inproceedings{hu2023uniad,
title={Planning-oriented Autonomous Driving},
author={Hu, Yihan and Yang, Jiazhi and Chen, Li and Li, Keyu and Sima, Chonghao and Zhu, Xizhou and Chai, Siqi and Du, Senyao and Lin, Tianwei and Wang, Wenhai and Lu, Lewei and Jia, Xiaosong and Liu, Qiang and Dai, Jifeng and Qiao, Yu and Li, Hongyang},
booktitle={CVPR},
year={2023}
}
A unified end-to-end framework that hierarchically integrates full-stack driving tasks—perception, prediction, and planning—under a planning-oriented philosophy, achieving state-of-the-art across all tasks on nuScenes.
|
|
|
|
χ0: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
Chonghao Sima, Checheng Yu, Modi Shi, Lirui Zhao, et al.
arXiv preprint arXiv:2602.09021, 2026
project page
/
paper
/
code
/
bibtex
@article{sima2026kai0,
title={$\chi_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies},
author={Yu, Checheng and Sima, Chonghao and Jiang, Gangcheng and Zhang, Hai and Mai, Haoguang and Li, Hongyang and Wang, Huijie and Chen, Jin and Wu, Kaiyang and Chen, Li and Zhao, Lirui and Shi, Modi and Luo, Ping and Bu, Qingwen and Peng, Shijia and Li, Tianyu and Yuan, Yibo},
journal={arXiv preprint arXiv:2602.09021},
year={2026}
}
We present χ0, which tames distributional inconsistencies across robot learning through Model Arithmetic, Stage Advantage, and Train-Deploy Alignment.
|
|
AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems
AgiBot-World-Contributors, Chonghao Sima, et al.
IROS, 2025 (Best Paper Finalist)
project page
/
paper
/
code
/
challenge
/
bibtex
@inproceedings{bu2025agibot_iros,
title={AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems},
author={Bu, Qingwen and Cai, Jisong and Chen, Li and Cui, Xiuqi and Ding, Yan and Feng, Siyuan and He, Xindong and Huang, Xu and others},
booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year={2025},
organization={IEEE}
}
AgiBot World aspires to transform large-scale robot learning and advance scalable robotic systems for production. This open-source platform invites researchers and practitioners to collaboratively shape the future of Embodied AI.
|
|
Centaur: Robust End-to-End Autonomous Driving with Test-Time Training
Chonghao Sima,
Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, Jose M. Alvarez
arXiv preprint arXiv:2503.11650, 2025
paper
/
bibtex
@article{sima2024centaur,
title={Centaur: Robust End-to-End Autonomous Driving with Test-Time Training},
author={Sima, Chonghao and Chitta, Kashyap and Yu, Zhiding and Lan, Shiyi and Luo, Ping and Geiger, Andreas and Li, Hongyang and Alvarez, Jose M},
journal={arXiv preprint arXiv:2503.11650},
year={2025}
}
A test-time training framework for end-to-end driving that minimizes a novel Cluster Entropy uncertainty measure, improving planner robustness without hand-engineered rules or cost functions. Ranked 1st on the NAVSIM leaderboard at the time of submission.
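For intuition, here is a minimal sketch of the idea in PyTorch, assuming a planner that scores K pre-clustered trajectory anchors; the helper names and the simplified entropy below are illustrative, not the paper's exact Cluster Entropy.
import torch
import torch.nn.functional as F

def cluster_entropy(traj_logits):
    # traj_logits: (B, K) planner scores over K trajectory clusters.
    # High entropy means the planner is unsure which cluster to commit to.
    p = F.softmax(traj_logits, dim=-1)
    return -(p * (p + 1e-8).log()).sum(dim=-1).mean()

def test_time_step(planner, obs, lr=1e-4):
    # One gradient step at deployment: reduce uncertainty, then re-plan.
    opt = torch.optim.SGD(planner.parameters(), lr=lr)
    loss = cluster_entropy(planner(obs))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        return planner(obs)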
|
|
|
|
ETA: Efficiency Through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, Fatma Güney
ICCV, 2025
paper
/
code
/
bibtex
@inproceedings{hamdan2025eta,
title={ETA: Efficiency Through Thinking Ahead, A Dual Approach to Self-Driving with Large Models},
author={Hamdan, Shadi and Sima, Chonghao and Yang, Zetong and Li, Hongyang and G{\"u}ney, Fatma},
booktitle={ICCV},
year={2025}
}
An asynchronous dual-system architecture that pairs a large model for anticipatory reasoning with a small model for real-time responsiveness, achieving state-of-the-art driving performance at near-real-time speed.
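A minimal sketch of the asynchronous dual-system loop, with hypothetical interfaces (the paper's actual hand-off of anticipatory features is more involved):
def drive(frames, large_model, small_model, period=4):
    # Slow path: the large model "thinks ahead" every `period` frames.
    # Fast path: the small model acts every frame, conditioned on the
    # most recent (possibly stale) anticipatory features.
    anticipated = None
    actions = []
    for t, frame in enumerate(frames):
        if t % period == 0:
            anticipated = large_model(frame)
        actions.append(small_model(frame, anticipated))
    return actions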
|
|
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan
ICCV, 2025
project page
/
paper
/
code
/
leaderboard
/
bibtex
@inproceedings{xie2025vlms,
title={Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
author={Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
booktitle={ICCV},
year={2025}
}
Follow-up work to DriveLM. A comprehensive benchmark evaluating the reliability of 12 VLMs for autonomous driving across 17 settings, revealing that current models often rely on textual cues rather than true visual grounding.
|
|
DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima,
Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li
ECCV, 2024 (Oral, 2.3%)
project page
/
paper
/
code
/
challenge
/
bibtex
@inproceedings{sima2024drivelm,
title={DriveLM: Driving with Graph Visual Question Answering},
author={Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and Xie, Chengen and Bei{\ss}wenger, Jens and Luo, Ping and Geiger, Andreas and Li, Hongyang},
booktitle={ECCV},
year={2024}
}
A graph-structured VQA framework and dataset for integrating vision-language models into end-to-end driving, enabling multi-step reasoning through perception, prediction, and planning.
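To illustrate the graph structure (a hypothetical schema, not the dataset's actual annotation format), QA nodes chain perception into prediction into planning:
from dataclasses import dataclass, field

@dataclass
class QANode:
    stage: str        # "perception" | "prediction" | "planning"
    question: str
    answer: str
    children: list = field(default_factory=list)

root = QANode("perception", "What objects are ahead?",
              "A pedestrian near the crosswalk.")
root.children.append(QANode(
    "prediction", "Will the pedestrian cross?", "Likely, within a few seconds.",
    [QANode("planning", "What should the ego vehicle do?", "Slow down and yield.")]))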
|
|
Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving
Kairui Ding, Boyuan Chen, Yuchen Su, Huan-ang Gao, Bu Jin,
Chonghao Sima, Wuqiang Zhang, Xiaohui Li, Paul Barsch, Hongyang Li, Hao Zhao
CoRL, 2024
project page
/
paper
/
code
/
bibtex
@inproceedings{ding2024hint,
title={Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving},
author={Ding, Kairui and Chen, Boyuan and Su, Yuchen and Gao, Huan-ang and Jin, Bu and Sima, Chonghao and Zhang, Wuqiang and Li, Xiaohui and Barsch, Paul and Li, Hongyang and Zhao, Hao},
booktitle={CoRL},
year={2024}
}
An integrated AD-language system that grounds natural language explanations in the model's intermediate perception, prediction, and planning outputs, with the human-labeled Nu-X dataset for driving explanation research.
|
|
|
|
OpenScene: The Largest Up-to-Date 3D Occupancy Prediction Benchmark in Autonomous Driving
Chonghao Sima and OpenScene Contributors
github.com/OpenDriveLab/OpenScene, 2023
/
bibtex
@misc{sima2023openscene,
title={OpenScene: The Largest Up-to-Date 3D Occupancy Prediction Benchmark in Autonomous Driving},
author={Sima, Chonghao and {OpenScene Contributors}},
howpublished={\url{https://github.com/OpenDriveLab/OpenScene}},
year={2023}
}
A compact redistribution of the large-scale nuPlan dataset, retaining only relevant annotations and sensor data at 2 Hz to reduce dataset size by over 10x. OpenScene spans 120+ hours of driving across Boston, Pittsburgh, Las Vegas, and Singapore with occupancy labels, and serves as the official dataset for the End-to-End Driving and Predictive World Model tracks at the CVPR 2024 and CVPR 2025 Autonomous Grand Challenges.
|
|
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu Qiao, Hongyang Li
NeurIPS, 2023
paper
/
code
/
bibtex
@inproceedings{huang2023vcd,
title={Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection},
author={Huang, Linyan and Li, Zhiqi and Sima, Chonghao and Wang, Wenhai and Wang, Jingdong and Qiao, Yu and Li, Hongyang},
booktitle={NeurIPS},
year={2023}
}
A vision-centric knowledge distillation framework (VCD) that bridges the LiDAR-camera domain gap with an apprentice-friendly multi-modal expert and trajectory-based temporal alignment, achieving 63.1% NDS on nuScenes with camera-only input.
|
|
OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping
Huijie Wang, Tianyu Li, Yang Li, Li Chen,
Chonghao Sima,
Zhenbo Liu, Bangjun Wang, Peijin Jia, Yuting Wang, Shengyin Jiang, Feng Wen, Hang Xu, Ping Luo, Junchi Yan, Wei Zhang, Hongyang Li
NeurIPS, Datasets and Benchmarks Track, 2023
paper
/
code
/
challenge 2024
/
challenge 2023
/
bibtex
@inproceedings{wang2023openlanev2,
title={OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping},
author={Wang, Huijie and Li, Tianyu and Li, Yang and Chen, Li and Sima, Chonghao and Liu, Zhenbo and Wang, Bangjun and Jia, Peijin and Wang, Yuting and Jiang, Shengyin and Wen, Feng and Xu, Hang and Luo, Ping and Yan, Junchi and Zhang, Wei and Li, Hongyang},
booktitle={NeurIPS, Datasets and Benchmarks Track},
year={2023}
}
The first benchmark for topology reasoning in driving scenes, requiring joint perception of 3D lanes and traffic elements together with their structural relationships to build a unified scene representation.
|
|
Scene as Occupancy
Chonghao Sima,
Wenwen Tong, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, Hongyang Li
ICCV, 2023
paper
/
code
/
challenge 2024
/
challenge 2023
/
bibtex
@inproceedings{sima2023occnet,
title={Scene as Occupancy},
author={Sima, Chonghao and Tong, Wenwen and Wang, Tai and Chen, Li and Wu, Silei and Deng, Hanming and Gu, Yi and Lu, Lewei and Luo, Ping and Lin, Dahua and Li, Hongyang},
booktitle={ICCV},
year={2023}
}
A vision-centric 3D occupancy prediction framework that captures fine-grained scene geometry beyond bounding boxes, serving as a general representation for detection, segmentation, and planning on par with LiDAR-based methods.
|
|
Sparse Dense Fusion for 3D Object Detection
Yulu Gao, Chonghao Sima, Shaoshuai Shi, Shangzhe Di, Si Liu, Hongyang Li
IROS, 2023
paper
/
bibtex
@inproceedings{gao2023sdf,
title={Sparse Dense Fusion for 3D Object Detection},
author={Gao, Yulu and Sima, Chonghao and Shi, Shaoshuai and Di, Shangzhe and Liu, Si and Li, Hongyang},
booktitle={IROS},
year={2023}
}
We propose Sparse Dense Fusion (SDF), a complementary framework that incorporates both sparse-fusion and dense-fusion modules via the Transformer architecture for camera-LiDAR 3D object detection, compensating for the information loss of either approach. With this strategy, SDF outperforms the baseline by 4.3% mAP and 2.5% NDS, ranking 1st on the nuScenes benchmark.
|
|
Delving Into the Devils of Bird's-Eye-View Perception: A Review, Evaluation and Recipe
Hongyang Li, Chonghao Sima, et al.
T-PAMI, 2023
paper
/
code
/
bibtex
@article{li2023bevdevils,
title={Delving Into the Devils of Bird's-Eye-View Perception: A Review, Evaluation and Recipe},
author={Li, Hongyang and Sima, Chonghao and Dai, Jifeng and Wang, Wenhai and Lu, Lewei and Wang, Huijie and Zeng, Jia and Li, Zhiqi and Yang, Jiazhi and Deng, Hanming and others},
journal={T-PAMI},
year={2023}
}
A comprehensive survey and practical toolbox for bird's-eye-view perception across camera, LiDAR, and fusion modalities, with a bag of tricks that achieved 1st place on the Waymo Open Challenge 2022 camera-based detection track.
|
|
PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
Li Chen, Chonghao Sima,
Yang Li, Zehan Zheng, Jiajie Xu, Xiangwei Geng, Hongyang Li, Conghui He, Jianping Shi, Yu Qiao, Junchi Yan
ECCV, 2022 (Oral, 2.3%)
paper
/
code
/
data
/
third-party blog
/
bibtex
@inproceedings{chen2022persformer,
title={PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark},
author={Chen, Li and Sima, Chonghao and Li, Yang and Zheng, Zehan and Xu, Jiajie and Geng, Xiangwei and Li, Hongyang and He, Conghui and Shi, Jianping and Qiao, Yu and Yan, Junchi},
booktitle={ECCV},
year={2022}
}
We present PersFormer, an end-to-end monocular 3D lane detector with a novel Transformer-based spatial feature transformation module that generates BEV features by attending to related front-view local regions. PersFormer adopts a unified 2D/3D anchor design to detect 2D/3D lanes simultaneously. We also release OpenLane, one of the first large-scale real-world 3D lane datasets with 200K frames, 880K+ instance-level lanes, 14 categories, and diverse scenario annotations.
|
|
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie,
Chonghao Sima,
Tong Lu, Yu Qiao, Jifeng Dai
ECCV, 2022
paper
/
code
/
slides
/
bibtex
@inproceedings{li2022bevformer,
title={BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers},
author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
booktitle={ECCV},
year={2022}
}
We present BEVFormer, which learns unified BEV representations with spatiotemporal transformers for multi-camera autonomous driving perception. BEVFormer exploits spatial cross-attention across camera views and temporal self-attention over history BEV features through predefined grid-shaped BEV queries. It achieves 56.9% NDS on nuScenes test, 9.0 points above prior art and on par with LiDAR-based methods.
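A simplified sketch of grid-shaped BEV queries with cross-attention over flattened multi-camera features (the actual model uses deformable attention restricted to each query's projected image regions; class and parameter names here are illustrative):
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    def __init__(self, bev_h=200, bev_w=200, dim=256, heads=8):
        super().__init__()
        # One learnable query per BEV grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (B, N_cams * H * W, dim) flattened camera features.
        q = self.bev_queries.unsqueeze(0).expand(cam_feats.size(0), -1, -1)
        bev, _ = self.attn(q, cam_feats, cam_feats)
        return bev  # (B, bev_h * bev_w, dim) unified BEV features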
|
|
|
|
LSH-SMILE: Locality Sensitive Hashing Accelerated Simulation and Learning
Chonghao Sima, Yexiang Xue
NeurIPS, 2021
paper
/
bibtex
@inproceedings{sima2021lshsmile,
title={LSH-SMILE: Locality Sensitive Hashing Accelerated Simulation and Learning},
author={Sima, Chonghao and Xue, Yexiang},
booktitle={NeurIPS},
year={2021}
}
We propose LSH-SMILE, a unified framework that leverages locality sensitive hashing to scale up both forward simulation and backward learning of PDE-based physics systems. By hashing elements with similar dynamics into shared buckets and processing them collectively, LSH-SMILE reduces complexity to the number of non-empty hash buckets. We provide theoretical error bounds and demonstrate comparable simulation quality with drastically less time and space, enabling gradient propagation over longer durations for improved learning.
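A minimal sketch of the bucketing idea using signed random projections (the paper's actual hash family and per-bucket update rule differ; step_fn is a hypothetical per-element dynamics update):
import numpy as np

def lsh_buckets(states, n_bits=8, seed=0):
    # Hash each element's state vector with signed random projections;
    # elements with similar dynamics collide in the same bucket.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((states.shape[1], n_bits))
    codes = (states @ planes > 0).astype(int)
    keys = codes @ (1 << np.arange(n_bits))   # integer bucket ids
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets

def bucketed_step(states, step_fn):
    # Advance the simulation once per non-empty bucket instead of once
    # per element: cost scales with #buckets, not #elements.
    new_states = states.copy()
    for idx in lsh_buckets(states).values():
        rep = states[idx].mean(axis=0)     # bucket representative
        new_states[idx] = step_fn(rep)     # one update shared by the bucket
    return new_states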
|
|
H. Li, Z. Li, W. Wang, C. Sima, L. Chen, Y. Li, Y. Qiao, J. Dai.
Image Processing Method, Apparatus and Device, and Computer-Readable Storage Medium.
US Patent 12,469,277, 2025.
|
|
H. Li, L. Chen, S. Gao, J. Yang, Y. Qiu, C. Sima, T. Li, J. Zeng, Y. Li, H. Wang, J. Yan, P. Luo, Y. Qiao.
Method for Training Autonomous Driving Model, Electronic Device, and Storage Medium.
US Patent App. 18/936,908, 2025.
|
|
H. Li, L. Chen, S. Gao, J. Yang, Y. Qiu, C. Sima, T. Li, J. Zeng, Y. Li, H. Wang, J. Yan, P. Luo, Y. Qiao.
Method for Training Autonomous Driving Model, Method for Predicting Autonomous Driving Video, Electronic Device, and Storage Medium.
US Patent App. 18/888,671, 2025.
|
|
Best Paper Award (1/9155), CVPR 2023, for the UniAD paper "Planning-oriented Autonomous Driving".
|
|
Best Paper Finalist (29/4306), IROS 2025, for the "AgiBot World Colosseo" paper.
|
|
Outstanding Reviewer (232/7000), CVPR 2023.
|
|
BEVFormer ranked 1st (1/300) on Waymo Open Challenge 2022, 3D Camera-only Detection Track.
|
|
BEVFormer ranked 1st (1/81) on nuScenes detection leaderboard with camera-only modality, at the time of submission.
|
NVIDIA, Santa Clara, USA — Deep Learning Intern, Autonomous Vehicle Applied Research
Apr. 2024 – Dec. 2024
Mentored by Dr. José M. Álvarez and Dr. Zhiding Yu.
Research and development of a test-time training pipeline for an end-to-end autonomous driving planner, improving performance in safety-critical scenarios.
|
Shanghai AI Lab, Shanghai, China — Research Intern, OpenDriveLab
Jun. 2019 – Mar. 2024
Mentored by Prof. Hongyang Li.
Research and development of 3D perception and end-to-end autonomous driving with foundation models, including bird's-eye-view representation, multi-modality fusion, 3D occupancy prediction, and vision-language model integration for driving.
|
Academic Activities
Reviewing & Service
Workshop Organization
Recorded Talks
|
Personal
Outside of research, I enjoy hiking, J-pop, landscape photography, and anime. I also play CS:GO, Genshin Impact, Honkai: Star Rail, Zenless Zone Zero, Wuthering Waves, and Arknights: Endfield. I have been a fan of Arsenal since 2009.
|
Website template from Jon Barron. Last updated: Feb 2026.
|
|