Abstract
Crowd counting aims to estimate the number of individuals in images, and multimodal data has been shown to significantly improve counting accuracy. However, multimodal approaches are highly sensitive to the loss or corruption of any single modality, which can cause severe performance degradation. To address this limitation, a new problem setting, Modality-Reconfigurable Crowd Counting, is introduced, in which a model must maintain robust performance even when one input modality (e.g., RGB or thermal) is perturbed or entirely unavailable. Modality reconfigurability is achieved through effective cross-modal information transfer, enabled by a Feature Patches Generator that applies a Margin Ranking Loss across multiple network layers to align and transfer discriminative features between modalities. A Negative Knowledge Transfer Prevention module further suppresses misleading or detrimental cross-modal signals. Experiments on RGB-T crowd counting benchmarks demonstrate state-of-the-art performance, with accuracy maintained consistently under both complete and degraded modality conditions.
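The cross-modal alignment idea can be illustrated with a minimal sketch of a margin ranking objective over paired features. This is not the paper's implementation: the feature vectors, the cosine-similarity scoring, and the margin value below are all illustrative assumptions; only the standard Margin Ranking Loss form (with target y = +1) is taken as given.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def margin_ranking_loss(pos_score, neg_score, margin=0.5):
    # Standard margin ranking loss with target y = +1:
    # zero once pos_score exceeds neg_score by at least `margin`.
    return max(0.0, -(pos_score - neg_score) + margin)

# Hypothetical per-layer features: an RGB patch, its matching thermal
# patch, and a mismatched thermal patch from a different image region.
rgb_feat    = [0.9, 0.1, 0.4]
thermal_pos = [0.8, 0.2, 0.5]   # same scene region (should align)
thermal_neg = [0.1, 0.9, 0.2]   # different region (should rank lower)

pos = cosine_similarity(rgb_feat, thermal_pos)
neg = cosine_similarity(rgb_feat, thermal_neg)
loss = margin_ranking_loss(pos, neg, margin=0.5)
```

In a setup like this, summing such a term over several network layers encourages matched RGB and thermal features to score higher than mismatched ones by a fixed margin, which is the ranking-based alignment the abstract attributes to the Feature Patches Generator.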