I am a third-year Ph.D. student at the Intelligent Vision Group (IVG), Department of Automation, Tsinghua University, advised by Prof. Jiwen Lu. Prior to that, I received my Bachelor's degree from the Department of Automation, Tsinghua University in 2023 (ranked 1/170).
I am broadly interested in large language models and computer vision. My current research focuses on multi-modal large language models and large vision models.
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents Xumin Yu*,
Zuyan Liu*,
Ziyi Wang*,
He Zhang*,
Yongming Rao,
Fangfu Liu,
Yani Zhang,
Ruowen Zhao,
Oran Wang,
Yves Liang,
Haitao Lin,
Minghui Wang,
Yubo Dong,
Kevin Cheng,
Bolin Ni,
Rui Huang,
Han Hu,
Zhengyou Zhang,
Shunyu Yao
Technical Report, 2026
[arXiv][Code]
HY-Embodied-0.5 is a family of embodied foundation models built for real-world agents, achieving state-of-the-art performance across 22 benchmarks in visual perception, spatial reasoning, and embodied understanding, with effective downstream robot control.
Ola is an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized models, pushing the frontier of omni-modal language models.
Insight-V is a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results.
We propose the general HorNet family, including HorNet, Hor3D, and HorCLIP, forming a comprehensive set of visual foundation architectures with a better performance-efficiency trade-off.
Elastic Cache is a novel approach to KV cache acceleration in multi-modal large language models that benefits from applying distinct acceleration methods to the instruction encoding and output generation stages.
SparseMM identifies the sparsity of attention heads in vision-language multi-modal models, termed Visual Heads, and applies asymmetric operations to them for model pruning and inference acceleration.
The Chain-of-Spot (CoS) method enhances feature extraction by focusing on key regions of interest (ROIs) within the image that correspond to the posed questions or instructions.
VPD (Visual Perception with Pre-trained Diffusion Models) is a framework that leverages the high-level and low-level knowledge of a pre-trained text-to-image diffusion model for downstream visual perception tasks.
The dynamic spatial sparsification framework can be applied to general visual architectures (e.g., Transformers, ConvNeXt, Swin Transformers) and visual tasks (e.g., classification, object detection, semantic segmentation) for efficient inference.
Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models Yuhao Dong,
Zuyan Liu,
Shulin Tian,
Yongming Rao,
Ziwei Liu arXiv, 2026
[arXiv]
Insight-V++ is a unified multi-agent visual reasoning framework that evolves from Insight-V into a generalized spatial-temporal architecture, introducing ST-GRPO and J-GRPO algorithms with a self-improving training loop to enhance long-chain reasoning across image and video domains.
GeoVista is an agentic model that integrates image-zoom-in and web-search tools within the reasoning loop for geolocalization, achieving performance comparable to closed-source models on the curated GeoBench benchmark.
ProtoComp++: Diverse Point Cloud Completion with Controllable Prototype Xumin Yu*,
Zuyan Liu*,
Yanbo Wang,
Jie Zhou,
Jiwen Lu IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2026
[Paper]
ProtoComp++ is a prototype-based approach for point cloud completion that generates and refines prototypes with geometric details, demonstrating strong generalization to unseen categories and real-world scenarios beyond synthetic training data.
RealUnify is a comprehensive benchmark designed to evaluate bidirectional capability synergy in unified multimodal models, assessing whether architectural unification truly enables understanding and generation to enhance each other.
PerceptionComp is a manually annotated video benchmark for complex, long-horizon, perception-centric reasoning where no single moment suffices, requiring temporally distributed evidence and compositional constraints across 1,114 questions on 279 high-complexity videos.
Vision Generalist Model: A Survey Ziyi Wang,
Yongming Rao,
Shuofeng Sun,
Xinrun Liu,
Yi Wei,
Xumin Yu,
Zuyan Liu,
Yanbo Wang,
Hongmin Liu,
Jie Zhou,
Jiwen Lu International Journal of Computer Vision (IJCV), 2025
[arXiv]
A comprehensive survey on vision generalist models, reviewing their frameworks, techniques, and applications for handling diverse computer vision tasks within a unified model.