Saturday, November 2, 2013

Automatic plagiarism accusation?

In October 2013, Germany saw another dissertation plagiarism case involving a politician. Frank-Walter Steinmeier, former Foreign Minister and the leader of the opposition in the previous parliament, submitted a thesis in 1991 to the law faculty of the University of Gießen. Entitled Bürger ohne Obdach (Citizens without Shelter), the 395-page thesis deals with the legal aspects of homelessness. On Sept. 29, 2013, the weekly magazine Focus reported (in an article not linked here because Focus participates in the consortium claiming intellectual property on links and snippets) that a marketing professor, who has been trying for many years to sell his software system for detecting plagiarism, had found extensive plagiarism in the dissertation, and that Steinmeier had called the accusation absurd.

The daily newspaper taz soon dug out the news that Focus had paid the marketing professor to investigate the theses of politicians, supposedly without targeting a specific individual. Focus had also asked a law professor to evaluate the accusation. He had stated that most of the passages were unproblematic, for example a seven-word name of a book listed in a footnote, but that there were three portions that were more extensive. The marketing professor's report was published online: 279 rather incomprehensible pages of tiny print, colored text, and so-called "total plagiarism probability scores". The press immediately whipped itself into a frenzy, presumably in the belief that a computer program would be much more accurate at finding plagiarism than people. After all, the software reported numbers on every page! But since no one quite understood what the numbers actually meant, extensive discussion ensued.

A number of blog commentators and online media picked the report apart (among them Spiegel Online, Causaschavan, Archivalia, Erbloggtes, HajoFunke). The marketing professor was interviewed many times, giving conflicting statements: In Deutschlandfunk he stated that he had examined the report before publishing it; in an interview with Main-Netz he is reported as having said
 [...] bei der Überprüfung durch unsere Software gab das System bei Steinmeiers Arbeit einfach »Rot« an und verschickte einen automatisch generierten Prüfbericht an die Universität.
([...] when our software checked it, the system simply registered "red" for Steinmeier's thesis and sent an automatically generated report to the university.)
If the statement that the report was sent automatically is true, it vividly demonstrates how dangerous it is to use plagiarism detection software without understanding its report. While false negatives are bothersome – a plagiarized text is not flagged – a false positive can be devastating for the author of the text, as the accusation of plagiarism will still hang in the air even if it later turns out that it cannot be substantiated.

Perhaps this case demonstrates the problem with automatic accusations made solely on the basis of some number generated by a software system. The algorithm by which the number was derived is usually not published and is thus unverifiable. In general, the numbers do not mean anything until they have been checked – every single one – by an experienced teacher or researcher to determine if the result is at all meaningful. As I have said repeatedly: A software system cannot determine plagiarism, only a human can.

A software system can, however, find indications that there might be plagiarism, although it would be helpful if not so many irrelevant "hits" were reported. VroniPlag Wiki began documenting the three more extensive text parallels in Steinmeier's thesis that were reported by the system and soon found more plagiarism from these sources, as well as from additional sources that were not listed in the automatic report. When the extent of the text copying was determined to be severe enough, the case was published online as VroniPlag Wiki case #57. Many of the fragments documented are so-called "pawn sacrifices": the source is given, but there is no indication that a copy or near copy of the source text was used.

Does this vindicate the computer-generated report? (The automatic report produced by the marketing professor appears to be constantly updated as new sources are found by VroniPlag Wiki; the original version is unfortunately no longer available online.) Hardly. If there is plagiarism in a thesis, software that reports plagiarism on every page will, of course, be right in a way, even if most of the flagged pages are false positives.
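To make the point about false positives concrete, here is a minimal sketch in Python. The page count of 395 is that of the thesis; the number of pages that actually contain plagiarism (40) is a purely hypothetical figure chosen for illustration, not a finding from the case, and the helper name precision_recall is likewise made up for this example.

```python
def precision_recall(flagged, plagiarized):
    """Return (precision, recall) for two sets of page numbers."""
    true_positives = len(flagged & plagiarized)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(plagiarized) if plagiarized else 0.0
    return precision, recall


TOTAL_PAGES = 395  # length of the thesis
all_pages = set(range(1, TOTAL_PAGES + 1))

# HYPOTHETICAL: assume 40 pages really contain plagiarism.
# This number is invented for illustration only.
plagiarized_pages = set(range(100, 140))

# A "detector" that simply flags every single page.
flagged_pages = all_pages

precision, recall = precision_recall(flagged_pages, plagiarized_pages)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under these assumptions the flag-everything detector misses nothing (recall of 1.00), yet only about one flagged page in ten is a true hit. Being "right in a way" says nothing about whether any individual flag can be trusted.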

Thus, if plagiarism detection software is used by an institution, no accusation should be made until the report has been checked in detail by a person who understands what the results actually mean. Schools that define "plagiarism" to mean any score reported by plagiarism detection software that is above a specified threshold should rethink their policy. No sort of automatic plagiarism accusation should be tolerated.

Update: For guest commentary on the mathematics of the plagiarism probabilities, see the next article.


  1. Are there schools that do this? A few years back I looked at the cheating/plagiarism policies of Australian universities, and I have never seen any that actually have an automatic plagiarism accusation step. All have a manual inspection process that follows once the plagiarism detection system kicks in.

  2. Apparently there are. I have had emails from a few people who have been accused of plagiarism on the basis of a score alone. One school refused to even show the student the report, and just wrote to him/her that the final thesis was being awarded a failing grade on account of plagiarism. I suggested that they first request the report in writing, and then take the university in question (in Germany) to court in order to determine if this is legal, which I assume it is not. Just as plagiarism is not acceptable, an accusation of plagiarism is also not acceptable if it is not substantiated. Far too many people believe in the "power of software" and some number that sounds important, even if they do not know how the number was determined.

