CMPF: Harmonizing Cross-Model Prior Fusion for Open-Vocabulary Segmentation

Sicheng Zhao; Xi Chen; Hongxun Yao; Haosen Yang; Yanhao Zhang; Sheng Jin; Xiatian Zhu; Haonan Lu; Kui Jiang; Guiguang Ding

doi:10.1007/s11263-026-02886-0

Journal article

Peer reviewed

International journal of computer vision, Vol.134(6), p.296

01/06/2026

DOI: https://doi.org/10.1007/s11263-026-02886-0

Subjects: Article; Artificial Intelligence; Computer Imaging; Computer Science; Image Processing and Computer Vision; Pattern Recognition; Pattern Recognition and Graphics; Vision

Abstract

Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still struggle to generate and recognize precise mask proposals for unseen categories and scenarios, which ultimately degrades segmentation performance. To address this challenge, we introduce Cross-Model Prior Fusion (CMPF), a novel framework that fuses visual knowledge from a localization foundation model (e.g., SAM) and text knowledge from a ViL model (e.g., CLIP), leveraging their complementary knowledge priors to overcome inherent limitations in mask proposal generation. Taking the ViL model’s visual encoder as the feature backbone, we propose a Query Injector and a Feature Injector that inject the visual localization features into the learnable queries and the CLIP features, respectively, within a transformer decoder. In addition, an OpenSeg Ensemble strategy is designed to further improve mask quality by incorporating SAM’s universal segmentation masks during inference. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models and focus optimization solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments demonstrate that CMPF, trained exclusively on COCO panoptic data and tested in a zero-shot manner, advances state-of-the-art results across various segmentation benchmarks. Code is available at https://github.com/chenxi52/CMPF.

Details

Title: CMPF: Harmonizing Cross-Model Prior Fusion for Open-Vocabulary Segmentation
Creators: Sicheng Zhao - Tsinghua University
Xi Chen - Harbin Institute of Technology
Hongxun Yao - Harbin Institute of Technology
Haosen Yang - University of Surrey
Yanhao Zhang - OPPO AI Center
Sheng Jin - Nanyang Technological University
Xiatian Zhu - Surrey Institute for People-Centred Artificial Intelligence, University of Surrey
Haonan Lu - OPPO AI Center
Kui Jiang - Harbin Institute of Technology
Guiguang Ding - Tsinghua University
Publication Details: International journal of computer vision, Vol.134(6), p.296
Publisher: Springer US; DORDRECHT
Number of pages: 19
Publication Date: 01/06/2026
Grant note: 62571294; 62476069 / National Natural Science Foundation of China (http://dx.doi.org/10.13039/501100001809); L252009 / Beijing Natural Science Foundation; CCF-DiDi GAIA Collaborative Research Funds
Identifiers: 991129739102346; WOS:001779020700002
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Journal article
