Sunday, November 3, 2013

Guest commentary: Plagiarism Probabilities

[I have offered guest commentary privileges to anyone interested in posting longer pieces than the comments section will accept. This is the first such comment. -- dww ]
Plagiarism Probabilities
by Gerhard Hindemith
In response to the Copy, Shake, & Paste article "Automatic plagiarism accusation?" on the documentation of plagiarism in the dissertation of Frank-Walter Steinmeier, I would like to make the following comments:
There is an intermediate version of the computer-generated report available online: The latest update, including most Vroniplag Wiki finds, can be found here:
Surely the author of the report would claim that most commentators criticizing it did not understand it correctly. The report assigns each fragment a plagiarism probability (Einzelplagiatswahrscheinlichkeit). If that is low, say 1%, the report does not claim that the documented fragment is in fact plagiarism, but rather that only one in 100 comparable cases will turn out to be plagiarism. With that in mind, one should disregard those "low probability fragments" for most purposes (and it is not quite clear why they are included in a report aimed at the public in the first place). The system then goes on to calculate an overall plagiarism probability (Gesamtplagiatswahrscheinlichkeit), which is supposedly the probability that the entire thesis should be considered plagiarism (unfortunately, none of this is explained in the report). With this understanding, the report makes some sense conceptually, but that doesn't really help much. I think the report should never have been published, even if these clarifications are understood. Here are my reasons:
  1. If the outcome of this report really is only a plagiarism probability below 100% (it was around 60% in the beginning), publishing it seems quite unethical, because at that point there was a 2/5 chance that the dissertation was perfectly OK (if we believe the system; see further down on whether that is a good idea).
  2. The professor made confusing claims about how these probabilities should be interpreted. Here for instance: he claims that in Steinmeier's thesis 400 passages are not OK, that in fact Steinmeier has forgotten the quotation marks in 400 instances. ("Und bei Herrn Steinmeier kam eben heraus, dass 400 Textstellen nicht in Ordnung waren. Konkret heißt das: Er hat bei 400 Stellen die Anführungszeichen vergessen."). Now, this is much more than the claim that the system has found certain (low) probabilities for plagiarism. It seems that the professor is swinging back and forth between quite bold claims that attract attention (400 instances of forgotten quotation marks) and, when challenged with concrete examples, cautious remarks about the correct interpretation of the findings (the system only finds "plagiarism indicators", often with a low probability attached, not certain plagiarism).
  3. As I said, conceptually I can imagine that the probability set-up could make sense, but in practice the details matter a lot, not only for the technically minded insider, but also for the reader trying to make sense of such a report and of the probabilities placed so prominently in it. The following questions would need to be answered before one can interpret the findings of the report:

    3a) What is the definition of "plagiarism" for a single fragment? Is every small citation mistake considered plagiarism, or only more severe cases of verbatim copied text of a certain length? Is self-plagiarism included? One needs to know this to understand what a plagiarism probability of say 20% actually means.

    3b) Equally, one would need to know what the definition of "plagiarism" for the entire dissertation would be. This could be as severe as "at least one citation error in the entire thesis" or as forgiving as "the plagiarism is so severe that even a medical dissertation in Germany would be rescinded". Without this definition, the overall probability figure is meaningless.

    3c) Also important for interpreting the probabilities would be an explanation of how they depend on the choice of potentially plagiarized sources, whether they are independent of text length, and what confidence intervals apply to their estimates.
  4. I have very strong doubts that the probability calculator has been developed on a methodologically sound basis. But maybe I am wrong. In order to check this, the following questions would have to be answered:

    4a) How has the fragment-level probability calculator been built? Given that an automatic system surely cannot be taught to detect plagiarism directly, it would pick up certain factors that point towards plagiarism (like identical text and its length, lack of quotation marks, lack of a reference, etc.) and estimate the plagiarism probability on the basis of those factors. The question is then: on what pool of text fragments with known plagiarism status has the tool been calibrated? Surely this pool would have to be fairly diverse, covering different plagiarism types, citation styles, and subject areas, and hence would have to be quite large to give the calibration results any statistical validity. How has it been ensured that this pool includes a representative proportion of fragments that are not considered plagiarism? What is the discriminatory power of the resulting probability calculator?

    4b) By what logic have the fragment-level plagiarism probabilities been combined to form the overall plagiarism probability? How has the overall probability been calibrated, and on what pool of dissertations with known plagiarism status? How does the calculation account for the fragment-to-fragment correlation that a consistent writing style throughout the thesis surely induces? (For example, if quotations are marked throughout the thesis by italics, or by a different text size or color, the system might not pick this up, and the fragment-level plagiarism probability would be consistently overestimated.) What is the discriminatory power of the overall probability calculator?

    4c) In the case of the Schavan thesis, a similar report has been generated, apparently following the same methodology: see here: This report gives an overall plagiarism probability of 100%. How can an automatic system reach absolute certainty, particularly for a dissertation where the passages of verbatim copied text are very limited? This 100% value makes me suspect that the system has in fact never been calibrated, and that the probabilities given have no empirical basis but are rather heuristic constructs that merely point in roughly the right direction. If this were the case, one would have to ask why such probabilities have been calculated in the first place, and with apparent precision down to a single percentage point.
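To make the concern in 4b and 4c concrete, here is a minimal sketch of one way an overall figure near 100% can arise from many low-probability fragments. The report does not explain how it combines fragment probabilities, so the combination rule below (treating fragments as independent and asking for the probability that at least one is plagiarism) is purely an assumption for illustration, as are the function names and the input of 400 fragments at 1% each.

```python
from math import prod

def overall_if_independent(fragment_probs):
    """P(at least one fragment is plagiarism), ASSUMING fragments are independent.

    This rule is an assumption for illustration; the report does not
    document how it actually combines fragment probabilities.
    """
    return 1.0 - prod(1.0 - p for p in fragment_probs)

def overall_if_fully_correlated(fragment_probs):
    """Opposite extreme: all fragments stand or fall together (e.g. one
    consistent citation style), so the overall probability is no larger
    than the largest single-fragment probability."""
    return max(fragment_probs)

# Hypothetical input: 400 fragments, each with only a 1% plagiarism probability.
fragments = [0.01] * 400

independent = overall_if_independent(fragments)
correlated = overall_if_fully_correlated(fragments)

print(f"independent:      {independent:.3f}")  # about 0.982
print(f"fully correlated: {correlated:.3f}")   # 0.010
```

Under the independence assumption, 400 fragments that are each 99% likely to be harmless still produce an overall figure of about 98%, while the fully correlated extreme stays at 1%. The gap between the two is why the treatment of fragment-to-fragment correlation matters so much for interpreting the overall probability.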
I leave it at that. If there are reasonable answers to all these questions, I would be amazed, and the probability calculator would constitute a very interesting diagnostic tool and certainly a scientific advance (the author of the report should then definitely publish his research). But given the complexities of plagiarism detection (as opposed to the detection of text parallels), I suspect the discriminatory power of an empirically built tool would be disappointingly low and its calibration extremely involved and costly. I don't believe the report on Steinmeier's thesis was generated by such a tool: the documentation provided (close to none) gives me no reason to believe it was, and several indications that it most likely was not.
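For readers unfamiliar with the term, "discriminatory power" can be measured in several ways. One common choice is the area under the ROC curve (AUC): the share of (plagiarized, clean) fragment pairs in which the plagiarized fragment receives the higher score. The sketch below computes it on a tiny made-up pool of fragments with known status. The scores and the function name are hypothetical and have nothing to do with the actual report; a real evaluation would need the large, diverse pool described in 4a.

```python
def auc(scores_plagiarized, scores_clean):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (plagiarized, clean) pairs in which the
    plagiarized fragment scores higher; ties count as half a win.
    0.5 means no better than chance, 1.0 means perfect separation.
    """
    wins = 0.0
    for s_plag in scores_plagiarized:
        for s_clean in scores_clean:
            if s_plag > s_clean:
                wins += 1.0
            elif s_plag == s_clean:
                wins += 0.5
    return wins / (len(scores_plagiarized) * len(scores_clean))

# Made-up plagiarism probabilities for fragments whose status is known.
plagiarized = [0.9, 0.6, 0.4]
clean = [0.5, 0.3, 0.4, 0.1]

print(auc(plagiarized, clean))  # 0.875
```

A value close to 0.5 would mean the published probabilities carry almost no information about which fragments are actually plagiarism, however precise they look.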
