Abstract
Despite significant progress, the shortage of labeled data and expert knowledge remains a challenge for Fine-grained Visual Classification (FGVC). Some multi-source approaches that incorporate additional modalities, such as sound or bounding boxes, show promise for data enrichment but complicate data collection. In this paper, we ask: can multi-source capabilities be achieved solely with existing images? A pilot study confirms that the answer is affirmative. By analyzing the probability distributions of model outputs for images of different resolutions, we find that images of different resolutions carry complementary information beneficial to FGVC. Although classification accuracy on low-resolution images is lower than on high-resolution images, low-resolution images can provide additional information for high-resolution inputs. We design a naive baseline that trains on a mixture of multi-resolution images. From the baseline's experimental results, we find that i) not all low-resolution images are beneficial, and ii) low-resolution images need to be selected adaptively. We therefore propose a meta-learning-based adaptive "resolution" pooling layer. The pooling operation derives low-resolution features from high-resolution images, and a gating mechanism selects the most appropriate complementary features for the high-resolution features, which enables the model to exploit the complementary information fully and autonomously. Experimental results on three FGVC datasets validate the effectiveness of our proposed method. Our code is available at https://github.com/PRIS-CV/Adaptive-Multi-Resolution-Feature-Fusion.
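
To make the pooling-plus-gating idea concrete, the following is a minimal illustrative sketch in plain Python, not the implementation from the repository above. It assumes a single-channel feature map stored as a list of lists, average pooling as the "resolution" reduction, a global-max descriptor, and softmax weights as a soft form of selection. The function names (`avg_pool2d`, `fuse_multi_resolution`) and the hand-set `gate_logits` are hypothetical; in the proposed method the gating is learned via meta-learning rather than supplied by hand.

```python
import math


def avg_pool2d(fmap, k):
    """Average-pool a 2D map (list of lists) with a k x k window and stride k."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w - k + 1, k)
        ]
        for i in range(0, h - k + 1, k)
    ]


def global_max(fmap):
    """Collapse a 2D map to one scalar descriptor (assumed choice of descriptor)."""
    return max(v for row in fmap for v in row)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_multi_resolution(high_res, pool_sizes, gate_logits):
    """Add gated low-resolution descriptors to the high-resolution descriptor.

    pool_sizes:  one pooling window per simulated lower resolution.
    gate_logits: one score per pool size; hypothetical stand-in for learned gates.
    """
    if len(pool_sizes) != len(gate_logits):
        raise ValueError("need exactly one gate logit per pool size")
    high_desc = global_max(high_res)
    # Low-resolution features are derived from the high-resolution map by pooling.
    low_descs = [global_max(avg_pool2d(high_res, k)) for k in pool_sizes]
    # Soft selection: higher logits give that resolution more weight.
    gates = softmax(gate_logits)
    complement = sum(g * d for g, d in zip(gates, low_descs))
    return high_desc + complement, gates


feature_map = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
fused, gates = fuse_multi_resolution(feature_map, pool_sizes=[2, 4], gate_logits=[0.0, 0.0])
print(fused, gates)
```

With equal logits both pooled resolutions contribute equally; raising one logit shifts the weight toward that resolution, which is the behavior an adaptive gate would be trained to produce per input.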