About Me
I am currently a senior researcher at Huawei Noah's Ark Lab in Hong Kong, working on LLMs, MLLMs, and generative models. I obtained my Ph.D. from the National University of Singapore in 2018, receiving the National Semiconductor Gold Medal. Prior to that, I received my bachelor's degree from Shanghai Jiao Tong University in 2014.
My group focuses on building generalizable AI systems from a data-centric perspective. Our mission is to understand the power and limitations of existing models, explore their corner cases, and propose efficient next-generation models and algorithms. Representative projects include:
- Omni-modal Large Language Model: EMOVA
- Alignment of Large Language Models: Mistake Analysis, CoSafe, ECSO
- Corner Case Understanding and Video Generation for Self-Driving: MagicDrive, GeoDiffusion, CODA
- Generalization of Deep Learning Models: OOD-Bench, Continual Self-Supervised Learning, MixedAE
News
- 2025.01: One paper accepted by NAACL 2025!
- 2025.01: Two papers accepted by ICLR 2025!
- 2024.10: Two papers accepted by WACV 2025!
- 2024.09: We have released EMOVA, the very first end-to-end omni-modal model with SoTA vision-language and speech capabilities, further supporting emotional dialogue. Stay tuned for more details!
- 2024.09: We hosted the ECCV Workshop "Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving: Towards Next-Generation Solutions" (W-CODA) in Milan, Italy!
- 2024.09: Two papers accepted by NeurIPS 2024!
- 2024.09: One paper accepted by EMNLP 2024!
- 2024.08: One paper accepted by COLM 2024!
- 2024.07: Two papers accepted by ECCV 2024!
- 2024.06: The First Autonomous Driving Corner Case Understanding and Video Generation Challenge is now open with generous prizes! We welcome your participation! [see details]
- 2024.06: MagicDrive, as a core video generation feature of PanGu Large Model 5.0, was unveiled at the Huawei Developer Conference 2024 (HDC 2024)! [see details]
- 2024.02: Two papers accepted by CVPR 2024!
- 2024.01: Three papers accepted by ICLR 2024! See you in Vienna!
Selected Projects
Omni-modal Large Language Model (2024)
- Proposed EMOVA, an end-to-end omni-modal LLM that can see, hear, and speak. We use a continuous vision encoder and a semantic-acoustic disentangled speech tokenizer for seamless omni-modal alignment and diverse speech style controllability.
- Introduced an efficient text-centric omni-modal alignment approach that further improves vision-language and speech capabilities, even compared with the corresponding bi-modal aligned counterparts (i.e., image-text-only and speech-text-only alignment).
- For the first time, EMOVA achieves SoTA-comparable performance on both vision-language and speech benchmarks simultaneously, while supporting flexible spoken dialogue with vivid emotions; featured by Synced.
Alignment of Large Language Models (2023-2024)
- Proposed the LLM and MLLM self-alignment frameworks Mistake Analysis (ICLR) and ECSO (ECCV), enhancing LLMs' safety pass rates by over 20% while maintaining general performance.
- Established CoSafe (EMNLP), a benchmark for evaluating LLM safety in multi-turn dialogues, systematically assessing LLMs' safety performance across multiple dialogue rounds.
- Featured by QbitAI; supported the PanGu Large Model's compliance with the National Cyberspace Administration's AIGC Large Model Regulatory Filing.
Corner Case Understanding and Video Generation for Self-Driving (2021-2023)
- Established a controllable video generation framework for corner cases in autonomous driving by integrating physical laws, featuring works such as GeoDiffusion (ICLR), MagicDrive (ICLR), and DetDiffusion (CVPR), addressing the challenges of cross-view and cross-frame spatiotemporal consistency in video generation.
- Developed CODA (ECCV) and CODA-LM, autonomous driving corner case datasets covering over 5,000 rare scenes, on which the perception and understanding performance of existing models (including GPT-4V) drops significantly, effectively evaluating and pinpointing model weaknesses in autonomous driving corner cases.
- Featured by QbitAI and other public channels; implemented in Huawei vehicles and highlighted as a core feature of the PanGu Large Model 5.0 at the Huawei HDC 2024 [see details].
Generalization of Deep Learning Models (2019-2021)
- Proposed a multi-dimensional out-of-distribution (OOD) generalization benchmark, addressing OOD generalization challenges across three dimensions: training data, paradigms, and model architectures. Published works include OOD-Bench (CVPR Oral), NAS-OOD (ICCV), and DecAug (AAAI), among others.
- Explored improving model generalization through self-supervised learning (SSL), with methods such as MultiSiam (ICCV) and MixedAE (CVPR) for complex multi-instance scenarios, MoCE (ICLR Spotlight) for task-customized SSL, Continual SSL (ICLR), and SADE (NeurIPS) for SSL multi-expert integration.
- Widely cited by prominent researchers, including Kaiming He and Percy Liang; featured by Synced (OOD-Bench, SADE) and AI Era (DecAug). Algorithms applied in Huawei Music to decouple irrelevant user features, effectively reducing the "Matthew Effect" in recommendation systems.
Recent Publications
The full publication list can be found on Google Scholar.
Preprint:
• EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen*, Yunhao Gou*, Runhui Huang*, Zhili Liu*, Daxin Tan*, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong†, Lu Hou†, Hang Xu†
• MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong†, Zhenguo Li, Qiang Xu†
• Automated Evaluation of Large Vision-Language Models on Self-Driving Corner Cases
Kai Chen*, Yanze Li*, Wenhua Zhang*, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong†, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, Xu Jia†
• Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
2024
• CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
Erxin Yu, Jing Li, Ming Liao, Siqi Wang, Zuchen Gao, Fei Mi, Lanqing Hong
Empirical Methods in Natural Language Processing (EMNLP), 2024
• CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration
Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li
Conference on Language Modeling (COLM), 2024
• Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
European Conference on Computer Vision (ECCV), 2024
• Implicit Concept Removal of Diffusion Models
Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok
European Conference on Computer Vision (ECCV), 2024
• CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs
Yingji Zhong, Lanqing Hong, Zhenguo Li, Dan Xu
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
• DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception
Yibo Wang*, Ruiyuan Gao*, Kai Chen*, Kaiqiang Zhou, Yingjie Cai, Lanqing Hong, Zhenguo Li, Lihui Jiang, Dit-Yan Yeung, Qiang Xu, Kai Zhang
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
• Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis
Kai Chen*, Chunwei Wang*, Kuo Yang, Jianhua Han, Lanqing Hong†, Fei Mi†, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-Yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu
International Conference on Learning Representations (ICLR), 2024
• MagicDrive: Street View Generation with Diverse 3D Geometry Control
Ruiyuan Gao*, Kai Chen*, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu
International Conference on Learning Representations (ICLR), 2024
• GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation
Kai Chen*, Enze Xie*, Zhe Chen, Yibo Wang, Lanqing Hong†, Zhenguo Li, Dit-Yan Yeung
International Conference on Learning Representations (ICLR), 2024
Professional Services
- Area Chair of IJCAI 2025
- Industrial Chair of 3DV 2025
- Senior Program Committee Member of IJCAI 2023 and 2024
- Organizer of ECCV Workshop W-CODA
- Reviewer of TPAMI, ICLR, NeurIPS, CVPR, ECCV, ICCV, etc.
Internship Opportunities
We are now recruiting self-motivated interns and full-time researchers. If you are interested, please send your CV directly to my email.
Current Interns
- Kai CHEN (Hong Kong University of Science and Technology)
- Ruiyuan GAO (The Chinese University of Hong Kong)
- Yunhao GOU (Hong Kong University of Science and Technology)
- Runhui HUANG (The University of Hong Kong)
- Kaican LI (Hong Kong University of Science and Technology)
- Zhili LIU (Hong Kong University of Science and Technology)
- Yingji ZHONG (Hong Kong University of Science and Technology)
Former Interns
- Haoyue BAI (University of Wisconsin-Madison)
- Shoukang HU (Sony AI)
- Haonan WANG (National University of Singapore)
- Shipeng YAN (ByteDance)
- Longhui YU (Peking University)
- Xinyun ZHANG (The Chinese University of Hong Kong)
- Kaichen ZHOU (University of Oxford)