Abstract
In this paper we present a near-duplicate and plagiarism detection service that combines Crowd and Cloud computing to search for and evaluate matching documents. We believe that our approach could be used across collaborating or competing Enterprises, or against the web, without any Enterprise needing to reveal the contents of its confidential corporate documents. The Cloud service uses a novel document fingerprinting approach that derives grammatical patterns but requires no grammatical knowledge and does not rely on hash-based approaches. The approach generates a lossy, highly compressed document signature from which fixed-length patterns can be generated as fingerprints or shingles. Fingerprint sizes are set by estimating the likely random hit rate given the size of the pattern and of the target search. Our Cloud service is geared towards detecting Clowns: those who may attempt to leak, or have leaked, confidential or sensitive information, or who have otherwise plagiarized. Detection does not require a copy of the original information to be provided. Crowds are to be used to validate results emerging from systematic evaluation of the service, ensuring that service modifications continue to act effectively and enabling continuous scaling-up. We discuss the formulation of the service and assess the efficacy of the fingerprinting approach by reference to an international benchmarking competition, in which we believe our system achieves top-5 performance (Precision = 0.96, Recall = 0.39).
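As an illustration of how a fingerprint size might be chosen from an estimated random hit rate, the following minimal sketch assumes that signature symbols are independent and uniformly distributed over a fixed alphabet. That simplification is made here for illustration and is not stated in the abstract. The function names, the default binary alphabet, and the bound on expected chance hits are all hypothetical.

```python
def expected_random_hits(pattern_len: int, target_positions: int,
                         alphabet_size: int = 2) -> float:
    """Expected number of chance matches of one pattern against a target.

    Assumption: every symbol is independent and uniformly distributed, so a
    pattern of length n matches a given position with probability k**-n.
    Real document signatures will only approximate this.
    """
    return target_positions / (alphabet_size ** pattern_len)


def min_pattern_len(target_positions: int, max_expected_hits: float,
                    alphabet_size: int = 2) -> int:
    """Smallest pattern length whose expected chance hits stay within the bound."""
    if target_positions <= 0 or max_expected_hits <= 0 or alphabet_size < 2:
        raise ValueError("positions and bound must be positive; alphabet needs >= 2 symbols")
    n = 1
    # Lengthen the pattern until chance matches become acceptably rare.
    while expected_random_hits(n, target_positions, alphabet_size) > max_expected_hits:
        n += 1
    return n
```

Under these assumptions, a larger target search requires a longer pattern to keep chance matches rare: `min_pattern_len(1_000_000, 0.001)` asks for the shortest binary pattern that yields at most 0.001 expected chance hits across one million candidate positions.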