They have developed a massive corpus of English-language artefacts containing documents of various sizes, with various amounts and types of copying, including passages automatically translated from Spanish and German into English. They give the following statistics about their corpus:
- Corpus size: 20 611 suspicious documents, 20 612 source documents.
- Document lengths: small (up to paper size), medium, large (up to book size).
- Plagiarism contamination per document: 0%-100% (higher fractions with lower probabilities).
- Plagiarized passage length: short (few sentences), medium, long (many pages).
- Plagiarism types: monolingual (obfuscation degrees none, low, and high), and multilingual (automatic translation).
They calculate precision, recall, and granularity for each of the contestants on a character-sequence level. Precision is the fraction of the detections that were correct. Recall is how much of the plagiarism that was actually present was identified. Granularity measures how often a particular copy is flagged - this should be close to one, that is, any given copy should be found exactly once.
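To make the three measures concrete, here is a simplified sketch of character-level scoring. It treats detections and plagiarized passages as (start, end) character spans; this illustrates the idea only and is not the organizers' exact scoring formulas.

```python
# Simplified sketch of character-level precision, recall, and granularity.
# Spans are (start, end) character offsets; an illustration of the idea,
# not the competition's exact scoring formulas.

def to_chars(spans):
    """Expand a list of (start, end) spans into a set of character positions."""
    return {i for start, end in spans for i in range(start, end)}

def precision(detections, plagiarisms):
    """Fraction of the detected characters that really are plagiarized."""
    det, plag = to_chars(detections), to_chars(plagiarisms)
    return len(det & plag) / len(det) if det else 0.0

def recall(detections, plagiarisms):
    """Fraction of the plagiarized characters that were detected."""
    det, plag = to_chars(detections), to_chars(plagiarisms)
    return len(det & plag) / len(plag) if plag else 0.0

def granularity(detections, plagiarisms):
    """Average number of detections covering each detected plagiarism case.
    Ideally 1.0: every case is reported exactly once."""
    counts = []
    for case in plagiarisms:
        case_chars = to_chars([case])
        overlapping = sum(1 for d in detections if to_chars([d]) & case_chars)
        if overlapping:
            counts.append(overlapping)
    return sum(counts) / len(counts) if counts else 1.0

# One plagiarized passage (chars 10-50), reported as two separate detections:
plag = [(10, 50)]
det = [(10, 30), (35, 55)]
print(precision(det, plag))    # 0.875 - 35 of the 40 detected chars are plagiarized
print(recall(det, plag))       # 0.875 - 35 of the 40 plagiarized chars were found
print(granularity(det, plag))  # 2.0 - the single case was flagged twice
```

The granularity of 2.0 in this example shows why the measure matters: a system that chops one copied passage into many small reports looks no better than one that reports it cleanly once.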
They split the competition into external plagiarism identification (against a given, finite corpus, not the open Internet), in which matches with a given set of source documents are to be found, and intrinsic plagiarism identification, in which stylistic analysis alone, without recourse to any external documents, is to identify the plagiarized passages.
The results are, as I expected, wildly different between external and intrinsic. I find the recall values most important - how many of the possible copies were found - although precision also matters, so that not too many false positives are registered.
The recall for the 10 systems doing the external identification ranges wildly, between 1% and 69% of possible copies found. This corresponds to my results from 2008 with a small corpus of hand-made plagiarisms and hand-detection, in which we found a recall of between 20% and 75% (the ones finding nothing were disqualified in our test). The median recall of the competition is 49%, the average 45%, which validates my informal assertion that flipping a coin to decide if a paper is plagiarized is about as effective as running software over a digital version of the paper (of course, flipping a coin gives no indication as to what part is indeed plagiarized). Precision had a median of 63% and an average of 60%.
The intrinsic identification was quite different. Although the recall was good (median 51% and average 56%, with one of the four systems reaching 94%), the precision had a median of only 15% and an average of 16%. The best system had only 23% correct answers - that means that more than 3 out of 4 passages flagged by stylistic analysis were, in fact, incorrectly flagged as plagiarism. This has interesting ramifications for stylistic analysis.
The overall score (I am not sure exactly what this is) over all of the systems has a median of 32% and an average of 29% for recall, and for precision a median of only 39% and an average of 28%.
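If this overall score is the combined measure the organizers describe for the competition, it would be the F-measure of precision and recall discounted by granularity, so that systems flagging the same passage many times are penalized. That is an assumption on my part; a sketch under that assumption:

```python
# Hedged sketch: assumes the overall score is the F1 measure of precision
# and recall divided by log2(1 + granularity), as I understand the
# organizers' combined measure. Not confirmed from the article itself.
import math

def overall_score(precision, recall, granularity):
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

# With perfect granularity (1.0) the score is just the F1 measure:
print(overall_score(0.63, 0.49, 1.0))
# A granularity of 3 (each passage flagged three times on average)
# halves the score, since log2(1 + 3) = 2:
print(overall_score(0.63, 0.49, 3.0))
```

Under this formula a system cannot buy recall cheaply by reporting the same copied passage over and over, which fits the stated goal that granularity stay close to one.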
I can identify only two of the authors as having written software that I have tested. The group from Zhytomyr State University, Ukraine, are the authors of Plagiarism Detector; this system was removed from our ranking for installing a trojan on systems using it, although their results gave them second place in my test (overall fourth place in this test). I also tested WCopyFind, but that is a system for detecting collusion. Its recall was overall about 32%, but with less than 1% precision it generates a *lot* of false positives!
I applaud the competition organizers