I am a third-year Ph.D. student at the Intelligent Vision Group (IVG), Department of Automation, Tsinghua University, advised by Prof. Jiwen Lu. Prior to that, I received my Bachelor's degree from the Department of Automation, Tsinghua University in 2023 (ranked 1/170).
I am broadly interested in large language models and computer vision. My current research focuses on multi-modal large language models and large vision models.
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents Xumin Yu*,
Zuyan Liu*,
Ziyi Wang*,
He Zhang*,
Yongming Rao,
Fangfu Liu,
Yani Zhang,
Ruowen Zhao,
Oran Wang,
Yves Liang,
Haitao Lin,
Minghui Wang,
Yubo Dong,
Kevin Cheng,
Bolin Ni,
Rui Huang,
Han Hu,
Zhengyou Zhang,
Shunyu Yao
Technical Report, 2026
[arXiv][Code]
HY-Embodied-0.5 is a family of embodied foundation models built for real-world agents, achieving state-of-the-art performance across 22 benchmarks in visual perception, spatial reasoning, and embodied understanding, with effective downstream robot control.
Ola is an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized models, pushing the frontier of omni-modal language models.
Insight-V is a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results.
We propose the general HorNet family, including HorNet, Hor3D, and HorCLIP, forming a comprehensive set of visual foundation architectures with a better performance-efficiency trade-off.
Elastic Cache is a novel approach to KV cache acceleration in multi-modal large language models that benefits from applying distinct acceleration methods to the instruction encoding and output generation stages.
SparseMM identifies the sparsity of attention heads in vision-language multi-modal models, termed Visual Heads, and applies asymmetric operations to them for model pruning and inference acceleration.
The Chain-of-Spot (CoS) method enhances feature extraction by focusing on key regions of interest (ROIs) within the image that correspond to the posed questions or instructions.
VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.
The dynamic spatial sparsification framework can be applied to general visual architectures (e.g., Transformers, ConvNeXt, Swin Transformers) and visual tasks (e.g., classification, object detection, semantic segmentation) for efficient inference.
Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models Yuhao Dong,
Zuyan Liu,
Shulin Tian,
Yongming Rao,
Ziwei Liu arXiv, 2026
[arXiv]
Insight-V++ is a unified multi-agent visual reasoning framework that evolves from Insight-V into a generalized spatial-temporal architecture, introducing ST-GRPO and J-GRPO algorithms with a self-improving training loop to enhance long-chain reasoning across image and video domains.
GeoVista is an agentic model that integrates image-zoom-in and web-search tools within the reasoning loop for geolocalization, achieving performance comparable to closed-source models on the curated GeoBench benchmark.
ProtoComp++: Diverse Point Cloud Completion with Controllable Prototype Xumin Yu*,
Zuyan Liu*,
Yanbo Wang,
Jie Zhou,
Jiwen Lu IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2026
[Paper]
ProtoComp++ is a prototype-based approach for point cloud completion that generates and refines prototypes with geometric details, demonstrating strong generalization to unseen categories and real-world scenarios beyond synthetic training data.
RealUnify is a comprehensive benchmark designed to evaluate bidirectional capability synergy in unified multimodal models, assessing whether architectural unification truly enables understanding and generation to enhance each other.
PerceptionComp is a manually annotated video benchmark for complex, long-horizon, perception-centric reasoning where no single moment suffices, requiring temporally distributed evidence and compositional constraints across 1,114 questions on 279 high-complexity videos.
Vision Generalist Model: A Survey Ziyi Wang,
Yongming Rao,
Shuofeng Sun,
Xinrun Liu,
Yi Wei,
Xumin Yu,
Zuyan Liu,
Yanbo Wang,
Hongmin Liu,
Jie Zhou,
Jiwen Lu International Journal of Computer Vision (IJCV), 2025
[arXiv]
A comprehensive survey on vision generalist models, reviewing their frameworks, techniques, and applications for handling diverse computer vision tasks within a unified model.