Sunday, May 4, 2014

Test of the Picapedia System

Stefan Weber noted recently that the plagiarism research group in Weimar has tweaked their experimental system picapica. We tested the system back in 2007 while it was in a beta stage. We have looked at it occasionally in recent years, but were not able to test it for various reasons. Back in 2007 the system took quite a long time and returned very few sources.

They have now focused the system on comparing a text with the current version of the Wikipedia in 10 different languages: English, German, Spanish, Catalan, Basque, French, Italian, Dutch, Portuguese, and Swedish. That would make it a useful tool for teachers looking for Wikipedia copies in student papers, as students will tend to take the current version of an article and not a historic one.

I decided to test the system using the 2013 test cases that were based on the Wikipedia, as well as a translation from a Wikipedia, an original work, and a plagiarism with no Wikipedia sources. Both .pdf and .doc files were used. The results:
  • 21-Tibet (3 sources including WP:DE): WP source found, and another Wikipedia article with an identical phrase
  • 33-Eyjafjallajoekull (1 source, WP:EN): WP source found
  • 36-Champagne (translation from WP:FR): nothing found
  • 43-Brüder-Grimm (2 sources, one WP:DE): WP source found
  • 45-Strelitzia (highly disguised plagiarism from WP:EN): WP source found
  • 46-Thermoskanne (25% automatically disguised plagiarism from WP:EN): WP source found
  • 47-Tessellation (4 sources, none WP): no sources found
  • 50-Union-Jack (original paper with correct references to WP:DE): It notes a similarity to the Union Jack lemma, but does not flag any reused text.
  • 51-London-Blitz (4 sources, one from WP:EN): WP source found
  • 52-Boxer-Rebellion (disguised plagiarism from WP:DE): WP source found
  • 57-Fallingwater (1 source, WP:DE): 2 possible WP sources named, one is correct
  • 58-Phillip-K-Dick (disguised plagiarism from WP:DE): WP source found
  • 60-Rolltreppe (3 sources, 1 WP:EN properly referenced): It notes a similarity to the Escalator lemma, but does not flag any reused text.
  • 63-Hebrew-Plag (1 source, WP:HE): nothing found, this language is not given as a possible one
The system did not always find the total amount of plagiarism, but it pointed to the correct source in all cases except the impossible ones (#36, #47, #63). It also did not report plagiarism for correctly quoted Wikipedia, something many systems do not get right.
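
For readers who want to check that summary against the list, here is a small tally in Python. The outcomes are transcribed from the list above; the three labels (source_found, out_of_scope, cited_not_flagged) are shorthand I made up for this sketch and are not anything the system itself reports.

```python
from collections import Counter

# Outcome per test case, transcribed from the list above.
# The three labels are shorthand invented for this sketch.
outcomes = {
    21: "source_found",
    33: "source_found",
    36: "out_of_scope",       # translation from WP:FR
    43: "source_found",
    45: "source_found",
    46: "source_found",
    47: "out_of_scope",       # no Wikipedia sources
    50: "cited_not_flagged",  # original paper, correct references
    51: "source_found",
    52: "source_found",
    57: "source_found",       # two candidates named, one correct
    58: "source_found",
    60: "cited_not_flagged",  # Wikipedia properly referenced
    63: "out_of_scope",       # Hebrew is not a supported language
}

counts = Counter(outcomes.values())
for label, number in counts.items():
    print(f"{label}: {number} of {len(outcomes)} cases")
```

This only counts whether the right source was named. It says nothing about how much of the copied text was found in each case.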

The text is first uploaded to their server (in Germany) and deleted after examination, according to their privacy policy. However, they do keep search results and thus may have portions of the uploaded text stored in some form. I repeated the test a day later and saw no trace of the results from the previous day influencing the repeat tests. They do record IP addresses and use Google Analytics, but offer the service free of charge (even for commercial use) as long as it is not abused and as long as the user does not pretend that they developed the system themselves.

So the system does appear to be quite useful for a small subset of plagiarism detection problems, namely identifying text that has been taken from a current Wikipedia. If it is necessary to look for text in older versions of Wikipedia articles, the tool WikiBlame can be quite useful for identifying and dating text taken from the Wikipedia many years prior.

Saturday, May 3, 2014

Stormy Waters

It all began with a Facebook posting on April 22, 2014: Arne Janning posted a longish article to his friends asking for help. He had found a recent book by two prominent historians (Karsten, A. & Rader, O. B. (2013) Grosse Seeschlachten -- Wendepunkte der Weltgeschichte von Salamis bis Skagerrak. München: C.H. Beck) to contain plagiarism from the Wikipedia. He exaggerated by saying that "every page contained plagiarism", and wondered what he should do.

The first thing Janning should have done was perhaps to check his privacy settings, as his post was public and the case quickly caught fire and was widely reported on. Maritime puns seem to be the norm for the titles of the articles, and I have chosen one as well: Spiegel Online ["Abschreiben bei Wikipedia: Zwei Historiker geraten in Plagiatssturm"], Neue Zürcher Zeitung ["Seeschlacht mit unzulässigen Beibooten"], Süddeutsche ["Wendepunkte der Weltgeschichte aus Wikipedia kopiert"], FAZ ["Unter der Flagge Wikipedias"]. The authors and the publisher promptly threatened Janning with legal action. According to Spiegel Online, one of the authors, Rader, noted that he did not actually steal intellectual property, as he only used "technical details" from the Wikipedia. "In earlier days we used the Brockhaus [encyclopedia], today we use the Wikipedia," he is quoted as stating [translation dww].

The blog Erbloggtes noted that there were at least two pictures used from the Wikipedia as well as some text, and the pictures were printed without attribution. One of the pictures was indeed in the public domain, but the other was not, so using it without attribution is a definite copyright infringement. Many other blogs joined the discussion: Archivalia, Schmalenstroer, hellojed, plagiatsgutachter. The Wikipedia-Kurier discussion was, as so often, extensive.

The publisher soon decided to withdraw the book, as reported by BuchMarkt, Meedia, and others. Beck ran the book through plagiarism detection software (iThenticate) and declared the parts written by Arne Karsten to be "free from unmarked quotations", despite the fact that it is impossible to prove the absence of plagiarism. One can only demonstrate the presence of plagiarism by a synoptic documentation showing the plagiarism and the source together. The other author, however, had plagiarized not only from the Wikipedia, but also from an article published online in 2003. The publisher notes in a pseudo-scientific manner the "exact" word counts and percentages found, although I have repeatedly shown in my work (for example, my 2013 test) that such numbers are meaningless. Additionally, a reader cannot tell which parts of this book were written by which author, so they both are responsible for the entire book, in my opinion.
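
To give an idea of what goes into a synoptic documentation, here is a minimal Python sketch that lists the word sequences two passages have in common, so that they can be marked up side by side. It uses SequenceMatcher from Python's standard difflib module. The function name shared_runs, the min_words threshold, and both example sentences are my own inventions for illustration; the sentences are not taken from the book, and this is not how Beck or iThenticate work.

```python
from difflib import SequenceMatcher


def shared_runs(suspect: str, source: str, min_words: int = 4) -> list[str]:
    """Return word runs of at least min_words that occur in both texts.

    Hypothetical helper for illustration: it compares lower-cased,
    whitespace-separated words and ignores nothing else.
    """
    a = suspect.lower().split()
    b = source.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    runs = []
    for block in matcher.get_matching_blocks():
        # The final block always has size 0 and is skipped by this check.
        if block.size >= min_words:
            runs.append(" ".join(a[block.a:block.a + block.size]))
    return runs


# Both sentences are invented for illustration; they are not from the book.
suspect = "The fleet sailed at dawn and met the enemy near the coast"
source = "At dawn the fleet sailed out and met the enemy near the coast of the island"

for run in shared_runs(suspect, source):
    print(run)
```

The matching blocks follow one in-order alignment of the two texts, so passages that have been moved around can be missed. The output is a starting point for a documentation that a person still has to read and judge, and it deliberately reports no percentages.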

Beck also couldn't resist bashing Janning, still threatening legal action, perhaps to deflect criticism from itself for not having properly edited the book. A good comment by Jörg Hopfgarten in the Boersenblatt notes the publisher would be better off understanding that this was just an angry customer blowing off steam, ranting. Customers have a right to do just that without consulting a lawyer, especially when it can easily be seen that they are at least partially right. Amazon is full of similar reactions. This was just the media picking up on the keyword "plagiarism" and running with it, without having independently verified the accusations. Indeed, none of the Seeschlachten books in the Berlin libraries were out on loan when I obtained a copy, although perhaps they all purchased the Kindle version.

Beck closes their press notice with a condescending offer to "participate in a discussion about the use of the Wikipedia in academics." Jan Englemann notes on the Wikimedia blog that the discussion is for all practical purposes already over, as there are numerous court rulings on the legality of the Creative Commons license that the Wikipedia articles are under, CC-BY-SA. It is, perhaps, time for publishers to understand how a legal use of Wikipedia texts works: Link to the license and authors, and put the material that uses a Wikipedia text under at least this license. Open licenses do not mean that the material is free to be misappropriated.

Many of the blogs discussing the topic have started documenting the plagiarized portions, in particular using the German system Picapica (also called Picapedia), which compares text to the current version of the Wikipedia. I will be bringing a short test of this system in my next blog entry.