|
Zuyan Liu (刘祖炎)
I am a third-year Ph.D. student in the Intelligent Vision Group (IVG), Department of Automation, Tsinghua University, advised by Prof. Jiwen Lu. Prior to that, I received my Bachelor's degree from the Department of Automation, Tsinghua University in 2023 (ranked 1/170).
I am broadly interested in large language models and computer vision. My current research focuses on multi-modal large language models and large vision models.
Email / Google Scholar / GitHub
|
|
|
Publications
* indicates equal contribution
|
|
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu*,
Yuhao Dong*,
Ziwei Liu,
Winston Hu,
Jiwen Lu,
Yongming Rao
International Conference on Learning Representations (ICLR), 2025
[arXiv]
[Code]
[Project Page]
[Chinese Explainer]
Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.
|
|
Ola: Pushing the Frontiers of Omni-Modal Language Model
Zuyan Liu*,
Yuhao Dong*,
Jiahui Wang,
Ziwei Liu,
Winston Hu,
Jiwen Lu,
Yongming Rao
arXiv, 2025
[arXiv]
[Code]
[Project Page]
[Chinese Explainer]
[Ranked 1st on OpenCompass Leaderboard (<15B)]
Ola is an omni-modal language model that achieves performance competitive with specialized models across image, video, and audio understanding, pushing the frontiers of omni-modal language models.
|
|
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Jiahui Wang*,
Zuyan Liu*,
Yongming Rao,
Jiwen Lu
IEEE International Conference on Computer Vision (ICCV), 2025
[arXiv]
[Code]
[Project Page]
[Chinese Explainer]
SparseMM observes the sparsity of attention heads in vision-language multi-modal models, termed Visual Heads, and applies asymmetric operations to achieve model pruning and reasoning acceleration.
|
|
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Yuhao Dong*,
Zuyan Liu*,
Hailong Sun,
Jingkang Yang,
Winston Hu,
Yongming Rao,
Ziwei Liu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Highlight
[arXiv]
[Code]
[Chinese Explainer]
Insight-V is a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results.
|
|
Efficient High-Order Spatial Interactions for Visual Perception
Zuyan Liu,
Yongming Rao,
Wenliang Zhao,
Jie Zhou,
Jiwen Lu
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF:18.6), 2025
[Paper]
[Code]
[Project Page]
We propose the general HorNet family, including HorNet, Hor3D, and HorCLIP, as a comprehensive foundational visual architecture with a better performance-efficiency trade-off.
|
|
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Zuyan Liu,
Benlin Liu,
Jiahui Wang,
Yuhao Dong,
Guangyi Chen,
Ranjay Krishna,
Yongming Rao,
Jiwen Lu
European Conference on Computer Vision (ECCV), 2024
[arXiv]
[Code]
[Project Page]
Elastic Cache is a novel approach to KV cache acceleration in multi-modal large language models that applies distinct acceleration methods to the instruction encoding and output generation stages.
|
|
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
Zuyan Liu*,
Yuhao Dong*,
Yongming Rao,
Jie Zhou,
Jiwen Lu
arXiv, 2024
[arXiv]
[Code]
[Project Page]
Chain-of-Spot (CoS) enhances feature extraction by focusing on the key regions of interest (ROIs) in the image that correspond to the posed question or instruction.
|
|
Unleashing Text-to-Image Diffusion Models for Visual Perception
Wenliang Zhao*,
Yongming Rao*,
Zuyan Liu*,
Benlin Liu,
Jie Zhou,
Jiwen Lu
IEEE International Conference on Computer Vision (ICCV), 2023
[arXiv]
[Code]
[Project Page]
[Ranked 1st on NYUv2 Depth Estimation]
VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.
|
|
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
Yongming Rao*,
Zuyan Liu*,
Wenliang Zhao*,
Jie Zhou,
Jiwen Lu
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI, IF:24.31), 2023
[arXiv]
[Code]
[Project Page]
The dynamic spatial sparsification framework can be applied to general visual architectures (e.g., Transformers, ConvNeXt, Swin Transformers) and visual tasks (e.g., classification, object detection, semantic segmentation) for efficient inference.
|
|
DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion
Wenliang Zhao*,
Yongming Rao*,
Weikang Shi,
Zuyan Liu,
Jie Zhou,
Jiwen Lu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[arXiv]
[Code]
DiffSwap is a diffusion-model-based framework for high-fidelity and controllable face swapping.
|
|
PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers
Xumin Yu*,
Yongming Rao*,
Ziyi Wang,
Zuyan Liu,
Jiwen Lu,
Jie Zhou
IEEE International Conference on Computer Vision (ICCV), 2021
Oral Presentation
[arXiv]
[Code]
[Chinese Explainer]
PoinTr is a transformer-based framework that reformulates point cloud completion as a set-to-set translation problem.
|
|
Tencent Hunyuan
Multi-Modal Model Group, Research Intern
Topic: Multi-Modal
|
|
ByteDance Seed
Seed Vision Group, Research Intern
Topic: Video Generation
|
|
Beijing Academy of Artificial Intelligence
Vision Model Research Center, Research Intern
Topic: Multi-Modal
|
|
ByteDance
Intelligent Creation Group, Research Intern
Topic: Human AIGC
|
|
Honors and Awards
2025 China National Scholarship (PhD Student) / 国家奖学金(博士生)
2025 Hunyuan Scholarship, Tencent / 混元学者(中国电子学会-腾讯博士生科研激励计划)
2023 Outstanding Undergraduate, Tsinghua University / 清华大学优秀毕业生
2023 Outstanding Undergraduate, Beijing / 北京市优秀毕业生
2022 China National Scholarship (Undergraduate) / 国家奖学金(本科生)
2021 Jiang Nanxiang Scholarship, Tsinghua University / 蒋南翔奖学金
2020 December 9th Scholarship, Tsinghua University / 一二·九奖学金
2022, 2021, 2020 Comprehensive Excellence Scholarship, Tsinghua University / 清华大学综合优秀奖学金
|
|
Academic Services
Conference Reviewer: ICLR 2026, WACV 2026, NeurIPS 2025, ICLR 2025, ICCV 2025, CVPR 2025, ECCV 2024, CVPR 2024, ICCV 2023
Journal Reviewer: IEEE Transactions on Multimedia, Pattern Recognition
|
© Zuyan Liu | Last updated: May 25, 2025
|