šŸ’¬ About Me

I am currently a senior researcher at Huawei Noah's Ark Lab in Hong Kong, working on LLMs, MLLMs, and generative models. I obtained my Ph.D. from the National University of Singapore in 2018, receiving the National Semiconductor Gold Medal. Prior to that, I received my bachelor's degree from Shanghai Jiao Tong University in 2014.

My group focuses on building generalizable AI systems from a data-centric perspective. Our mission is to understand the power and limitations of existing models, explore their corner cases, and propose efficient next-generation models and algorithms. Representative projects are highlighted in the Selected Projects section below.

šŸ”„ News

  • 2025.01: One paper accepted by NAACL 2025!
  • 2025.01: Two papers accepted by ICLR 2025!
  • 2024.10: Two papers accepted by WACV 2025!
  • 2024.09: šŸŽ‰šŸŽ‰ We have released EMOVA, the very first end-to-end omni-modal model with SoTA vision-language and speech capabilities, further supporting emotional dialogue. Stay tuned for more details!
  • 2024.09: We hosted the ECCV Workshop "Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving: Towards Next-Generation Solutions" (W-CODA) in Milan, Italy!
  • 2024.09: Two papers accepted by NeurIPS 2024!
  • 2024.09: One paper accepted by EMNLP 2024!
  • 2024.08: One paper accepted by COLM 2024!
  • 2024.07: Two papers accepted by ECCV 2024!
  • 2024.06: The First Autonomous Driving Corner Case Understanding and Video Generation Challenge is now open with generous prizes! We welcome your participation! [see details]
  • 2024.06: šŸŽ‰šŸŽ‰ MagicDrive, as a core video generation feature of PanGu Large Model 5.0, was unveiled at Huawei Developer Conference 2024 (HDC 2024)! [see details]
  • 2024.02: Two papers accepted by CVPR 2024!
  • 2024.01: Three papers accepted by ICLR 2024! See you in Vienna!

āœØ Selected Projects

Omni-modal Large Language Model (2024)

  • Proposed EMOVA, an end-to-end omni-modal LLM that can see, hear, and speak. It couples a continuous vision encoder with a semantic-acoustic disentangled speech tokenizer for seamless omni-modal alignment and diverse speech style controllability (see the sketch below).
  • Introduced an efficient text-centric omni-modal alignment that further improves vision-language and speech capabilities, even compared with the corresponding bi-modally aligned counterparts (i.e., image-text-only and speech-text-only alignment).
  • For the first time, EMOVA achieves SoTA-comparable performance on both vision-language and speech benchmarks simultaneously, while supporting flexible spoken dialogues with vivid emotions; featured by Synced.
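To make the architecture concrete, here is a minimal sketch of how such an omni-modal forward pass can be wired together: continuous vision features and discrete speech tokens are both mapped into the LLM token space and processed jointly with text. Every module name, dimension, and the tiny transformer stand-in below are illustrative assumptions for exposition, not EMOVA's actual implementation:

```python
# Minimal sketch of an omni-modal LLM forward pass in the spirit of EMOVA:
# continuous vision features and discrete (semantic) speech tokens are both
# projected into the LLM embedding space and decoded jointly with text.
# All names and sizes are hypothetical placeholders.
import torch
import torch.nn as nn

class OmniModalLM(nn.Module):
    def __init__(self, llm_dim=512, vision_dim=768, speech_vocab=1024, text_vocab=32000):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, llm_dim)        # continuous vision features -> LLM space
        self.speech_embed = nn.Embedding(speech_vocab, llm_dim)  # discrete speech tokens -> LLM space
        self.text_embed = nn.Embedding(text_vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # tiny stand-in for the LLM backbone
        self.lm_head = nn.Linear(llm_dim, text_vocab)

    def forward(self, vision_feats, speech_tokens, text_tokens):
        # Concatenate all modalities into one token sequence; the text-centric
        # alignment idea is that both non-text modalities meet in text space.
        seq = torch.cat([
            self.vision_proj(vision_feats),      # (B, Nv, D)
            self.speech_embed(speech_tokens),    # (B, Ns, D)
            self.text_embed(text_tokens),        # (B, Nt, D)
        ], dim=1)
        return self.lm_head(self.backbone(seq))  # next-token logits over the text vocabulary

model = OmniModalLM()
logits = model(torch.randn(1, 16, 768),           # 16 vision patch features
               torch.randint(0, 1024, (1, 8)),    # 8 speech tokens
               torch.randint(0, 32000, (1, 12)))  # 12 text tokens
print(logits.shape)  # torch.Size([1, 36, 32000])
```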

Alignment of Large Language Model (2023-2024)

  • Proposed the self-alignment frameworks Mistake Analysis (ICLR) for LLMs and ECSO (ECCV) for MLLMs, enhancing safety pass rates by over 20% while maintaining general performance (see the sketch below).
  • Established CoSafe (EMNLP), a benchmark for evaluating LLM safety in multi-turn dialogues, systematically assessing safety performance across multiple dialogue rounds.
  • Featured by QbitAI; supported the PanGu Large Model's compliance with the National Cyberspace Administration's AIGC Large Model Regulatory Filing.
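As listed under Publications ("Eyes Closed, Safety On"), ECSO transforms an image into a query-aware caption so that a potentially unsafe multimodal query can be answered in text-only mode. Below is a minimal sketch of that control flow; the three backend callables and the exact prompting are assumptions for exposition, not the paper's precise protocol:

```python
# Minimal sketch of an ECSO-style "eyes closed" pipeline: if the model's own
# draft answer to a multimodal query looks unsafe, transform the image into a
# query-aware caption and answer again in text-only mode. The three backend
# callables are hypothetical stand-ins for any MLLM/LLM API.
from typing import Callable, Optional

def eyes_closed_answer(
    query: str,
    image: bytes,
    mllm_answer: Callable[[str, Optional[bytes]], str],  # (prompt, image?) -> answer
    mllm_caption: Callable[[str, bytes], str],           # (query, image) -> query-aware caption
    judge_unsafe: Callable[[str], bool],                 # self-check on a draft answer
) -> str:
    draft = mllm_answer(query, image)        # step 1: ordinary multimodal answer
    if not judge_unsafe(draft):
        return draft                         # safe drafts pass through unchanged
    caption = mllm_caption(query, image)     # step 2: "close the eyes" (image -> text)
    prompt = f"Image description: {caption}\nQuestion: {query}"
    return mllm_answer(prompt, None)         # step 3: answer from text only

# Toy usage with trivial stand-in backends:
answer = eyes_closed_answer(
    "What is shown here?", b"...",
    mllm_answer=lambda p, img: "a red stop sign",
    mllm_caption=lambda q, img: "a street scene with a stop sign",
    judge_unsafe=lambda a: False,
)
print(answer)
```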

Corner Case Understanding and Video Generation for Self-Driving (2021-2023)

  • Established a controllable video generation framework for corner cases in autonomous driving by integrating physical laws, featuring works such as GeoDiffusion (ICLR), MagicDrive (ICLR), and DetDiffusion (CVPR), and addressing the challenges of cross-view and cross-frame spatiotemporal consistency in video generation (see the sketch below).
  • Developed CODA (ECCV) and CODA-LM, autonomous driving corner case datasets covering over 5,000 rare scenes, on which the perception and understanding performance of existing models (including GPT-4V) drops significantly, effectively evaluating and pinpointing model weaknesses in autonomous driving corner cases.
  • Featured by QbitAI and other public channels; deployed in Huawei vehicles and highlighted as a core feature of PanGu Large Model 5.0 at Huawei HDC 2024 [see details].
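The core trick behind text-prompted geometric control in GeoDiffusion is to discretize continuous box coordinates into location tokens that can be spliced into an ordinary text prompt, so a text-to-image diffusion model can be conditioned on detection layouts. A minimal sketch follows; the bin count and the "<lNNN>" token format are illustrative assumptions, and the paper's exact tokenization may differ:

```python
# Minimal sketch of text-prompted geometric control in the spirit of
# GeoDiffusion: discretize normalized box corners into location-token bins
# so a layout can be expressed as plain text for a diffusion model.
# The bin count and "<lNNN>" token format are illustrative assumptions.

def box_to_location_tokens(box, num_bins=256):
    """box = (x0, y0, x1, y1), all normalized to [0, 1]."""
    return "".join(f"<l{min(int(v * num_bins), num_bins - 1)}>" for v in box)

def layout_to_prompt(objects, prefix="an autonomous driving scene"):
    """objects: list of (category, box) pairs -> a single conditioning prompt."""
    parts = [f"{cat} {box_to_location_tokens(box)}" for cat, box in objects]
    return f"{prefix}, with " + "; ".join(parts)

prompt = layout_to_prompt([
    ("car",        (0.10, 0.55, 0.35, 0.80)),
    ("pedestrian", (0.60, 0.50, 0.68, 0.75)),
])
print(prompt)
# an autonomous driving scene, with car <l25><l140><l89><l204>; pedestrian <l153><l128><l174><l192>
```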

Generalization of Deep Learning Models (2019-2021)

  • Proposed a multi-dimensional out-of-distribution (OOD) generalization benchmark addressing OOD generalization challenges across three dimensions: training data, learning paradigms, and model architectures. Published works include OoD-Bench (CVPR Oral), NAS-OOD (ICCV), and DecAug (AAAI), among others (see the sketch after this list).
  • Explored improving model generalization through self-supervised learning (SSL), with methods such as MultiSiam (ICCV) and MixedAE (CVPR) for complex multi-instance scenarios, MoCE (ICLR Spotlight) for task-customized SSL, Continual SSL (ICLR), and SADE (NeurIPS) for SSL multi-expert integration.
  • Widely cited by prominent researchers, including Kaiming He and Percy Liang; featured by Synced (OoD-Bench, SADE) and AI Era (DecAug). Algorithms were applied in Huawei Music to decouple irrelevant user features, effectively reducing the "Matthew effect" in recommendation systems.
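For readers unfamiliar with how such benchmarks measure OOD generalization, here is a minimal sketch of the standard leave-one-domain-out protocol commonly used in this line of work: train on all domains but one, test on the held-out domain, and average over held-out choices. The train/evaluate callables are hypothetical stand-ins for any training pipeline:

```python
# Minimal sketch of the leave-one-domain-out protocol typical of OOD
# generalization benchmarks: fit on the in-distribution domains, score on
# the held-out one, and average across all held-out choices.
from statistics import mean

def leave_one_domain_out(domains, train, evaluate):
    """domains: dict of name -> dataset; train/evaluate are user-supplied."""
    scores = {}
    for held_out in domains:
        train_sets = [d for name, d in domains.items() if name != held_out]
        model = train(train_sets)                              # fit on in-distribution domains
        scores[held_out] = evaluate(model, domains[held_out])  # OOD accuracy on the held-out domain
    return scores, mean(scores.values())

# Toy usage with stand-in callables:
scores, avg = leave_one_domain_out(
    {"sunny": [...], "rainy": [...], "night": [...]},
    train=lambda sets: "model",
    evaluate=lambda m, d: 0.5,
)
print(scores, avg)
```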

šŸ“ Recent Publications

The full publication list can be found on Google Scholar.

Preprints:

• EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen*, Yunhao Gou*, Runhui Huang*, Zhili Liu*, Daxin Tan*, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong†, Lu Hou†, Hang Xu†

Paper Project

• MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong†, Zhenguo Li, Qiang Xu†

Paper Project

• Automated Evaluation of Large Vision-Language Models on Self-Driving Corner Cases

Kai Chen*, Yanze Li*, Wenhua Zhang*, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong†, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, Xu Jia†

Paper

• Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang

Paper

2024

• CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Erxin Yu, Jing Li, Ming Liao, Siqi Wang, Zuchen Gao, Fei Mi, Lanqing Hong

Empirical Methods in Natural Language Processing (EMNLP), 2024

Paper

• CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li

Conference on Language Modeling (COLM), 2024

Paper

• Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang

European Conference on Computer Vision (ECCV), 2024

Paper Project

• Implicit Concept Removal of Diffusion Models

Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok

European Conference on Computer Vision (ECCV), 2024

Paper

• CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs

Yingji Zhong, Lanqing Hong, Zhenguo Li, Dan Xu

Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Paper

• DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

Yibo Wang*, Ruiyuan Gao*, Kai Chen*, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, Kai Zhang

Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Paper

• Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

Kai Chen*, Chunwei Wang*, Kuo Yang, Jianhua Han, Lanqing Hong†, Fei Mi†, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-Yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu

International Conference on Learning Representations (ICLR), 2024

Paper

• MagicDrive: Street View Generation with Diverse 3D Geometry Control

Ruiyuan Gao*, Kai Chen*, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu

International Conference on Learning Representations (ICLR), 2024

Paper Project

• GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

Kai Chen*, Enze Xie*, Zhe Chen, Yibo Wang, Lanqing Hong†, Zhenguo Li, Dit-Yan Yeung

International Conference on Learning Representations (ICLR), 2024

Paper Project

šŸŽ– Professional Services

  • Area Chair of IJCAI 2025
  • Industrial Chair of 3DV 2025
  • Senior Program Committee Member of IJCAI 2023 and 2024
  • Organizer of ECCV Workshop W-CODA
  • Reviewer of TPAMI, ICLR, NeurIPS, CVPR, ECCV, ICCV, etc.

šŸŒ Internship Opportunities

We are now recruiting self-motivated interns and full-time researchers. If you are interested, please send your CV directly to my email.

Current Interns

  • Kai CHEN (Hong Kong University of Science and Technology)
  • Ruiyuan GAO (The Chinese University of Hong Kong)
  • Yunhao GOU (Hong Kong University of Science and Technology)
  • Runhui HUANG (The University of Hong Kong)
  • Kaican LI (Hong Kong University of Science and Technology)
  • Zhili LIU (Hong Kong University of Science and Technology)
  • Yingji ZHONG (Hong Kong University of Science and Technology)

Former Interns