Abstract
We propose a fast method for classifying images as either containing scene text or not. The typical application is processing large image streams, such as those encountered in social networks, for detection and recognition of scene text. The proposed classifier efficiently removes non-text images from consideration, so that the potentially computationally heavy scene text detection and OCR need to be applied to only a fraction of the images.
The proposed method, called Fast-Text-Classifier (FTC), uses a MobileNetV2 architecture as a feature extractor for fast inference. The text vs. non-text prediction is based on a block-level approach. FTC achieves an F-measure of 94.2%, an area under the ROC curve of 0.97, and inference times of 74.8 ms on CPU and 8.6 ms on GPU. A dataset of 1M images, automatically annotated with masks indicating text presence, is introduced and made public at http://cmp.felk.cvut.cz/data/twitter1M.
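As an illustration only, the following sketch shows one way a block-level prediction could be turned into an image-level text vs. non-text decision and used to filter a stream before detection and OCR. The abstract does not specify the aggregation rule, so the "any block above threshold" rule, the threshold of 0.5, and the names `image_has_text` and `filter_stream` are assumptions, not the FTC implementation. The block probabilities are taken as given; in practice they would come from a classifier head on top of the feature extractor.

```python
from typing import List, Sequence

# Hypothetical type: a grid of per-block text probabilities for one image.
BlockGrid = Sequence[Sequence[float]]


def image_has_text(block_probs: BlockGrid, threshold: float = 0.5) -> bool:
    """Assumed aggregation rule: the image is 'text' if any block's
    text probability reaches the threshold."""
    return any(p >= threshold for row in block_probs for p in row)


def filter_stream(stream: Sequence[BlockGrid], threshold: float = 0.5) -> List[int]:
    """Return the indices of images that should be forwarded to the
    (expensive) scene text detection and OCR stage."""
    return [i for i, grid in enumerate(stream) if image_has_text(grid, threshold)]


# Toy stream of three images, each described by a 2x2 grid of block scores.
stream = [
    [[0.05, 0.10], [0.02, 0.08]],  # no block looks like text
    [[0.10, 0.92], [0.03, 0.20]],  # one block strongly suggests text
    [[0.40, 0.45], [0.49, 0.30]],  # all blocks below the threshold
]
kept = filter_stream(stream)
```

With this rule, a single confident block is enough to keep an image, which favours recall: small text regions are not averaged away by the many text-free blocks around them.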