They have now focused the system on comparing a text with the current version of the Wikipedia in 10 different languages: English, German, Spanish, Catalan, Basque, French, Italian, Dutch, Portuguese, and Swedish. That makes it a useful tool for teachers looking for Wikipedia copies in student papers, as students tend to take the current version of an article and not a historic one.
I decided to test the system using the 2013 test cases that were based on the Wikipedia, as well as a translation from a Wikipedia, an original work, and a plagiarized text with no Wikipedia sources. Both .pdf and .doc files were used. The results:
- 21-Tibet (3 sources including WP:DE): WP source found, and another Wikipedia article with an identical phrase
- 33-Eyjafjallajoekull (1 source, WP:EN): WP source found
- 36-Champagne (translation from WP:FR): nothing found
- 43-Brüder-Grimm (2 sources, one WP:DE): WP source found
- 45-Strelitzia (highly disguised plagiarism from WP:EN): WP source found
- 46-Thermoskanne (25% automatically disguised plagiarism from WP:EN): WP source found
- 47-Tessellation (4 sources, none WP): no sources found
- 50-Union-Jack (original paper with correct references to WP:DE): It notes a similarity to the Union Jack lemma, but does not flag any reused text.
- 51-London-Blitz (4 sources, one from WP:EN): WP source found
- 52-Boxer-Rebellion (disguised plagiarism from WP:DE): WP source found
- 57-Fallingwater (1 source, WP:DE): 2 possible WP sources named, one is correct
- 58-Phillip-K-Dick (disguised plagiarism from WP:DE): WP source found
- 60-Rolltreppe (3 sources, 1 WP:EN properly referenced): It notes a similarity to the Escalator lemma, but does not flag any reused text.
- 63-Hebrew-Plag (1 source, WP:HE): nothing found; Hebrew is not among the languages the system offers
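For a quick overview, the outcomes listed above can be tallied in a few lines of Python. This is only a sketch: the category labels are an illustrative grouping chosen for this summary, not terminology used by the tool, and 57-Fallingwater is counted as found even though a second, incorrect candidate was also named.

```python
from collections import Counter

# Outcomes transcribed from the result list above.
# The labels are hypothetical categories for this summary only.
results = {
    "21-Tibet": "wp_found",
    "33-Eyjafjallajoekull": "wp_found",
    "36-Champagne": "missed_translation",
    "43-Brüder-Grimm": "wp_found",
    "45-Strelitzia": "wp_found",
    "46-Thermoskanne": "wp_found",
    "47-Tessellation": "no_wp_source_none_reported",
    "50-Union-Jack": "cited_wp_not_flagged",
    "51-London-Blitz": "wp_found",
    "52-Boxer-Rebellion": "wp_found",
    "57-Fallingwater": "wp_found",  # two candidates named, one correct
    "58-Phillip-K-Dick": "wp_found",
    "60-Rolltreppe": "cited_wp_not_flagged",
    "63-Hebrew-Plag": "missed_unsupported_language",
}

# Count how many test cases fall into each outcome category.
tally = Counter(results.values())
for outcome, count in tally.most_common():
    print(f"{outcome}: {count}")
```

Read this way, every untranslated copy from a supported Wikipedia was located, and the only misses were the translation and the Hebrew case.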
The text is first uploaded to their server (in Germany) and deleted after examination, according to their privacy policy. However, they do keep search results and thus may have portions of the uploaded text stored in some form. I repeated the test a day later and saw no trace of the results from the previous day influencing the repeat tests. They do record IP addresses and use Google Analytics, but offer the service free of charge (even for commercial use) as long as it is not abused and as long as the user does not pretend that they developed the system themselves.
So the system does appear to be quite useful for a small subset of plagiarism detection problems, namely identifying text that has been taken from a current Wikipedia. If it is necessary to look for text in older versions of Wikipedia articles, the tool WikiBlame can be quite useful for identifying and dating text taken from the Wikipedia many years prior.
http://www.uni-weimar.de/en/media/chairs/webis/research/projects/plagiarism-detection