Sunday, May 4, 2014

Test of the Picapedia System

Stefan Weber noted recently that the plagiarism research group in Weimar has tweaked their experimental system picapica. We tested the system back in 2007 while it was in a beta stage. We have looked at it occasionally in recent years, but were not able to test it for various reasons. Back in 2007 the system took quite a long time and found very few sources.

They have now focused the system on comparing a text with the current version of the Wikipedia in 10 different languages: English, German, Spanish, Catalan, Basque, French, Italian, Dutch, Portuguese, and Swedish. That would make it a useful tool for teachers looking for Wikipedia copies in student papers, as students will tend to take the current version of an article and not a historic one.

I decided to test the system using the 2013 test cases that were based on the Wikipedia, as well as a translation from a Wikipedia, an original work, and a plagiarism with no Wikipedia sources. Both .pdf and .doc files were used. The results:
  • 21-Tibet (3 sources including WP:DE): WP source found, and another Wikipedia article with an identical phrase
  • 33-Eyjafjallajoekull (1 source, WP:EN): WP source found
  • 36-Champagne (translation from WP:FR): nothing found
  • 43-Brüder-Grimm (2 sources, one WP:DE): WP source found
  • 45-Strelitzia (highly disguised plagiarism from WP:EN): WP source found
  • 46-Thermoskanne (25% automatically disguised plagiarism from WP:EN): WP source found
  • 47-Tessellation (4 sources, none WP): no sources found
  • 50-Union-Jack (original paper with correct references to WP:DE): It notes a similarity to the Union Jack lemma, but does not flag any reused text.
  • 51-London-Blitz (4 sources, one from WP:EN): WP source found
  • 52-Boxer-Rebellion (disguised plagiarism from WP:DE): WP source found
  • 57-Fallingwater (1 source, WP:DE): 2 possible WP sources named, one is correct
  • 58-Phillip-K-Dick (disguised plagiarism from WP:DE): WP source found
  • 60-Rolltreppe (3 sources, 1 WP:EN properly referenced): It notes a similarity to the Escalator lemma, but does not flag any reused text.
  • 63-Hebrew-Plag (1 source, WP:HE): nothing found, this language is not given as a possible one
The system did not always find the full extent of the plagiarism, but it pointed to the correct source in all cases except the impossible ones: #36 is a translation, #47 has no Wikipedia sources, and #63 is in a language the system does not cover. It also did not report plagiarism for correctly quoted Wikipedia, something many systems do not get right.
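
The tally behind that summary can be checked with a short Python sketch. The labels (`wp_source`, `out_of_scope`, `properly_cited`, `found`, `nothing`, `similarity_only`) are my own shorthand for the results listed above; they are not output of the system.

```python
from collections import Counter

# (test case, what the case contains, what the system reported)
# The labels are shorthand for the results listed above, not system output.
RESULTS = [
    ("21-Tibet", "wp_source", "found"),
    ("33-Eyjafjallajoekull", "wp_source", "found"),
    ("36-Champagne", "out_of_scope", "nothing"),       # translation from WP:FR
    ("43-Brüder-Grimm", "wp_source", "found"),
    ("45-Strelitzia", "wp_source", "found"),
    ("46-Thermoskanne", "wp_source", "found"),
    ("47-Tessellation", "out_of_scope", "nothing"),    # no Wikipedia sources
    ("50-Union-Jack", "properly_cited", "similarity_only"),
    ("51-London-Blitz", "wp_source", "found"),
    ("52-Boxer-Rebellion", "wp_source", "found"),
    ("57-Fallingwater", "wp_source", "found"),         # one of two named sources is correct
    ("58-Phillip-K-Dick", "wp_source", "found"),
    ("60-Rolltreppe", "properly_cited", "similarity_only"),
    ("63-Hebrew-Plag", "out_of_scope", "nothing"),     # Hebrew is not offered
]

# Count how often each (case kind, reported outcome) pair occurs.
tally = Counter((kind, outcome) for _, kind, outcome in RESULTS)

for (kind, outcome), count in sorted(tally.items()):
    print(f"{kind:15} {outcome:16} {count}")
```

Counted this way, all nine cases with a Wikipedia source in a supported language were found, the three out-of-scope cases returned nothing, and neither of the two properly referenced papers was flagged.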

The text is first uploaded to their server (in Germany) and deleted after examination, according to their privacy policy. However, they do keep search results and thus may have portions of the uploaded text stored in some form. I repeated the test a day later and saw no trace of the results from the previous day influencing the repeat tests. They do record IP addresses and use Google Analytics, but offer the service free of charge (even for commercial use) as long as it is not abused and as long as the user does not pretend that they developed the system themselves.

So the system does appear to be quite useful for a small subset of plagiarism detection problems, namely identifying text that has been taken from a current Wikipedia. If it is necessary to look for text in older versions of Wikipedia articles, the tool WikiBlame can be quite useful for identifying and dating text taken from the Wikipedia many years prior.
