Abstract
Existing RGB-D tracking algorithms advance performance by constructing appearance models adapted from RGB-only tracking frameworks, with little attempt to exploit the complementary visual information available in the multi-modal input. This paper addresses this deficit and presents a novel algorithm that boosts RGB-D tracking performance by taking advantage of collaborative cues. To guarantee input consistency, depth images are encoded into the three-channel HHA representation, giving them a structure similar to RGB images so that deep CNN features can be extracted from both modalities. To highlight the discriminative information in the multi-modal features, a feature enhancement module based on a cross-attention strategy is proposed. With the attention map produced by the cross-attention method, the target area of the features is enhanced and the negative influence of the background is suppressed. In addition, we address potential tracking failures by introducing a long-term tracking mechanism. Experimental results obtained on the well-known benchmark datasets PTB, STC, and CDTB demonstrate the superiority of the proposed RGB-D tracker. On PTB, the proposed method achieves the highest AUC scores among the compared trackers in scenarios with five distinct challenging attributes. On STC and CDTB, our FECD tracker obtains an overall AUC of 0.630 and an F-score of 0.630, respectively.
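The HHA representation mentioned above encodes a single-channel depth map into three channels: horizontal disparity, height above ground, and the angle between each pixel's surface normal and the inferred gravity direction. The following is a minimal, simplified sketch of such an encoding, assuming a metric depth map and known camera intrinsics; the per-channel approximations and the name `encode_hha_like` are illustrative only (a faithful HHA encoding additionally estimates surface normals and the gravity direction) and do not reproduce the paper's exact pipeline.

```python
# Simplified HHA-style encoding of a depth map into a three-channel image.
# Assumptions: metric depth in meters, pinhole camera with focal length fy
# and principal point cy. Channels are rough proxies for illustration only.
import numpy as np

def encode_hha_like(depth: np.ndarray, fy: float, cy: float) -> np.ndarray:
    """depth: (H, W) metric depth; returns an (H, W, 3) uint8 HHA-like image."""
    depth = depth.astype(np.float32)
    h, w = depth.shape
    valid = depth > 0

    # Channel 1: horizontal disparity (inverse depth), zero where depth is missing.
    disparity = np.zeros_like(depth)
    disparity[valid] = 1.0 / depth[valid]

    # Channel 2: height proxy from the pinhole model -- back-project the vertical
    # image coordinate, then measure elevation above the lowest visible point.
    rows = np.arange(h, dtype=np.float32)[:, None]
    y_cam = (rows - cy) / fy * depth          # vertical camera coordinate (down is +)
    height = y_cam.max() - y_cam              # larger value = higher above "ground"

    # Channel 3: angle proxy from depth gradients (stands in for the angle
    # between the local surface normal and the gravity direction).
    gy, gx = np.gradient(depth)
    angle = np.arctan2(np.hypot(gx, gy), 1.0)

    def norm255(x: np.ndarray) -> np.ndarray:
        # Linearly rescale each channel to the 0..255 range.
        x = x - x.min()
        rng = x.max() if x.max() > 0 else 1.0
        return (255.0 * x / rng).astype(np.uint8)

    return np.stack([norm255(disparity), norm255(height), norm255(angle)], axis=-1)
```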
Highlights
• The single-channel depth maps were encoded into three-channel HHA images.
• A feature enhancement method with a cross-attention module to highlight target features.
• A long-term tracking mechanism to detect failures and recapture lost targets.
• Experiments were conducted on several standard tracking benchmark datasets.
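To illustrate the cross-attention feature enhancement summarized above, here is a minimal PyTorch sketch in which RGB features attend over the HHA features and the attended response is fused back as a residual, strengthening target regions while down-weighting background; the module name, projection widths, and residual fusion are assumptions rather than the authors' exact design.

```python
# Minimal cross-attention enhancement block, assuming both modalities produce
# CNN feature maps of shape (B, C, H, W). Names and fusion form are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionEnhance(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 projections: queries from one modality, keys/values from the other.
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_rgb: torch.Tensor, feat_hha: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_rgb.shape
        # RGB queries attend over HHA keys, so each spatial location gathers
        # complementary cues from the depth modality.
        q = self.query(feat_rgb).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(feat_hha).flatten(2)                     # (B, C//8, HW)
        v = self.value(feat_hha).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        enhanced = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Residual fusion: attended response added back to the RGB stream.
        return feat_rgb + enhanced

# Example usage with hypothetical feature maps:
# block = CrossAttentionEnhance(256)
# fused = block(torch.randn(2, 256, 20, 20), torch.randn(2, 256, 20, 20))
```

A symmetric block with queries drawn from the HHA stream would enhance the depth features in the same way, after which the two enhanced maps could be fused for target localization.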