The daily newspaper taz soon dug up the news that Focus had paid the marketing professor to investigate the theses of politicians, supposedly without targeting a specific individual. Focus had also asked a law professor to evaluate the accusation. He had stated that most of the passages were unproblematic, for example a seven-word book title listed in a footnote, but that there were three portions that were more extensive. The report was published online: 279 rather incomprehensible pages' worth of tiny print, colored text, and some so-called "total plagiarism probability scores". The press immediately whipped itself into a frenzy, presumably because they believed that a computer program would be much more accurate at finding plagiarism than people. After all, the software reported numbers on every page! But since no one quite understood what the numbers actually meant, extensive discussion ensued.
A number of blog commentators and online media picked the report apart (among them Spiegel Online, Causaschavan, Archivalia, Erbloggtes, HajoFunke). The marketing professor was interviewed many times, giving conflicting statements: In Deutschlandfunk he stated that he had examined the report before publishing it; in an interview with Main-Netz he is reported as having said:
[...] bei der Überprüfung durch unsere Software gab das System bei Steinmeiers Arbeit einfach »Rot« an und verschickte einen automatisch generierten Prüfbericht an die Universität.
([...] when our software checked Steinmeier's thesis, the system simply registered "red" and sent an automatically generated report to the university)
If the statement that the report was automatically sent is true, it demonstrates quite vividly the grave danger posed by using plagiarism detection software without understanding its report. While false negatives are bothersome – a plagiarized text is not flagged – a false positive can potentially be devastating for the author of the text, as an accusation of plagiarism will still hang in the air, even if it later turns out that the accusation cannot be substantiated.
Perhaps this case demonstrates the problem with automatic accusations made solely on the basis of some number generated by a software system. The algorithms by which the number was derived are usually not published and thus unverifiable. The numbers in general do not mean anything until they have been checked – every single one – by an experienced teacher or researcher to determine if the result is at all meaningful. As I have said repeatedly: A software system cannot determine plagiarism, only a human can.
A software system can, however, find indications that there might be plagiarism, although it would be helpful if not so many irrelevant "hits" were to be reported. VroniPlag Wiki began documenting the three more extensive text parallels in Steinmeier's thesis that were reported by the system and soon found more plagiarism from these sources, as well as from additional sources that were not listed in the automatic report. When the extent of the text copying was determined to be severe enough, the case was published online as VroniPlag Wiki case #57. Many of the fragments documented are so-called "pawn sacrifices": the source is given, but no indication is made that a copy or near copy of the source text was used.
Does this vindicate the computer-generated report? Hardly. If there is plagiarism in a thesis, software that reports plagiarism on every page will, of course, be right in a way, even if most of the flagged pages are false positives. (The automatic report produced by the marketing professor appears to be constantly updated as new sources are found by VroniPlag Wiki. The original version is unfortunately no longer available online.)
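To make the point about false positives concrete, here is a small sketch of the arithmetic. All numbers are invented for illustration and have nothing to do with the actual report or thesis; the function name `precision_and_recall` and the page counts are my own assumptions. It compares a "detector" that flags every page against the pages that really contain copied text.

```python
def precision_and_recall(flagged, plagiarized):
    """Return (precision, recall) for a set of flagged pages,
    given the set of pages that truly contain plagiarism."""
    true_positives = len(flagged & plagiarized)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(plagiarized) if plagiarized else 0.0
    return precision, recall


# Hypothetical thesis: 200 pages, of which 20 contain copied text.
# These figures are assumptions for illustration only.
all_pages = set(range(1, 201))
plagiarized_pages = set(range(41, 61))

# A "detector" that simply flags every single page.
flag_everything = all_pages

precision, recall = precision_and_recall(flag_everything, plagiarized_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.10 recall=1.00"
```

In this made-up case the detector "finds" all of the plagiarism, yet nine out of ten flagged pages are false alarms, so a human would still have to check every page to learn anything from the report.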
Thus, if plagiarism detection software is used by an institution, no accusation should be made until the report has been checked in detail by a person who understands what the results actually mean. Schools that define "plagiarism" to mean any report by plagiarism detection software that is above a specified threshold should re-think their policy. No sort of automatic plagiarism accusation should be tolerated.
Update: For guest commentary on the mathematics of the plagiarism probabilities, see the next article.