Abstract
Images and videos are powerful media for storytelling. Yet the continued improvement in the quality and availability of image and video editing tools has made it increasingly difficult to trust the authenticity of visual media. This presents a growing societal threat through the amplification of misinformation and the spread of fake news. This thesis explores two problems: robust re-attribution of visual content provenance, and summarisation of changes in both visual and textual form.
First, we present a novel scalable image provenance framework to match a query image back to a trusted database of originals and identify possible manipulations on the query. Our approach consists of three stages: a scalable search stage; a re-ranking and near-duplicate detection stage; and a manipulation detection and visualisation stage for localising regions within the query that may have been manipulated. We show that our method is robust to benign image transformations that commonly occur during online redistribution, such as artefacts due to noise and recompression degradation, as well as out-of-place transformations due to image padding, warping, and changes in size and shape. Robustness towards out-of-place transformations is achieved via the end-to-end training of a differentiable warping module within the comparator architecture. We demonstrate effective retrieval and manipulation detection over a dataset of 100 million images.
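The three stages can be illustrated with a minimal sketch. It is not the framework itself: the descriptors are toy vectors, the search is brute-force cosine similarity rather than a scalable index, and the change map is a simple thresholded difference standing in for the learned comparator with its warping module. All function names, thresholds, and data below are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_desc, database, top_k=3):
    """Stage 1 (stand-in): rank all database entries by descriptor similarity."""
    scored = [(cosine(query_desc, desc), image_id)
              for image_id, desc in database.items()]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


def rerank(query_desc, candidates, database, threshold=0.9):
    """Stage 2 (stand-in): keep only candidates similar enough to be near-duplicates."""
    scored = sorted(((cosine(query_desc, database[c]), c) for c in candidates),
                    reverse=True)
    return [c for score, c in scored if score >= threshold]


def change_map(query_cells, original_cells, tol=0.1):
    """Stage 3 (stand-in): flag grid cells that differ by more than tol."""
    return [[abs(q - o) > tol for q, o in zip(q_row, o_row)]
            for q_row, o_row in zip(query_cells, original_cells)]


# Toy data: three "trusted originals" and one query descriptor (all assumed).
database = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.9, 0.1, 0.0]}
query = [1.0, 0.05, 0.0]

candidates = search(query, database, top_k=2)
matches = rerank(query, candidates, database)
heatmap = change_map([[0.0, 0.5], [0.2, 0.2]], [[0.0, 0.0], [0.2, 0.25]])
```

In this toy setting the re-ranking stage reuses the search descriptor; in practice a separate, finer comparison would be expected at that stage.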
Second, we present VADER, a spatio-temporal matching, alignment, and change summarisation method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust audio-visual descriptor and scalable search using an inverted index. A transformer-based alignment module then refines the temporal localisation of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, and is invariant to changes caused by residual temporal misalignment or by artefacts arising from non-editorial changes to the content. Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on content encountered.
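The coarse matching step can be sketched as follows, under the assumption that each video chunk has already been quantised to an integer codeword; the descriptor and quantiser are not shown, and the voting scheme is an illustrative choice rather than VADER's exact procedure. An inverted index maps codewords to their positions in the database, and a query fragment votes for the (video, temporal offset) pair that explains most of its codewords. The function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(videos):
    """Map each codeword to the (video_id, time_step) positions where it occurs."""
    index = defaultdict(list)
    for video_id, codewords in videos.items():
        for t, word in enumerate(codewords):
            index[word].append((video_id, t))
    return index


def match_fragment(fragment, index):
    """Vote for (video_id, offset); the top vote gives a coarse alignment."""
    votes = Counter()
    for t_query, word in enumerate(fragment):
        for video_id, t_db in index.get(word, []):
            votes[(video_id, t_db - t_query)] += 1
    if not votes:
        return None
    (video_id, offset), count = votes.most_common(1)[0]
    return video_id, offset, count


# Toy database of codeword sequences (assumed, one codeword per chunk).
videos = {"news": [3, 7, 7, 1, 4, 9, 2], "sport": [5, 5, 8, 1, 6]}
index = build_index(videos)

# A fragment taken from "news" starting at chunk 3, with one chunk corrupted.
fragment = [1, 4, 0, 2]
result = match_fragment(fragment, index)
```

Here `result` identifies the matched video, the estimated start offset of the fragment, and the number of supporting votes. A learned alignment module, as described above, would then refine this coarse offset.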
Third, we introduce VIXEN, a technique that succinctly summarises in text the differences between two images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the low volume of training data and the lack of manipulation variety in existing Image Difference Captioning (IDC) datasets by training on synthetically manipulated images from the InstructPix2Pix dataset, generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces succinct, comprehensible difference captions for diverse image contents and edit types.
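As a rough illustration of the soft-prompt idea, the sketch below concatenates the feature vectors of the two images, applies a single linear map, and splits the result into fixed-size prompt tokens. Concatenation, the dimensions, and the identity weights are assumptions made for the example; the actual pairwise mapping, image encoder, and language model are not reproduced here.

```python
def linear(x, weights, bias):
    """Apply y = W x + b using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def soft_prompt(feat_original, feat_edited, weights, bias, token_dim):
    """Map a pair of image feature vectors to a list of prompt token embeddings."""
    pair = list(feat_original) + list(feat_edited)  # assumed: simple concatenation
    flat = linear(pair, weights, bias)
    return [flat[i:i + token_dim] for i in range(0, len(flat), token_dim)]


# Toy features and parameters (hypothetical; real weights would be learned).
feat_original = [1.0, 0.0]
feat_edited = [0.0, 2.0]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.5, 0.5, 0.5, 0.5]

prompt = soft_prompt(feat_original, feat_edited, weights, bias, token_dim=2)
```

The resulting token embeddings would be prepended to the language model's input embeddings, with only the linear map trained if the encoder and language model are kept frozen.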
Finally, we explore the problem of difference captioning in the presence of multiple edits. We present FVTC, a technique for image difference captioning that is able to benefit from additional visual and/or textual inputs. FVTC succinctly summarises multiple manipulations that were applied to an image in sequence. Optionally, it can take several intermediate thumbnails of the image editing sequence as input, as well as coarse machine-generated annotations of the individual manipulations. We demonstrate that the presence of intermediate images and/or auxiliary textual information improves the model's captioning performance. To train FVTC, we introduce METS, a new dataset of image editing sequences, with machine annotations of each editorial step and human edit summarisation captions after the 5th, 10th and 15th manipulation.