Monday, June 9, 2014

Dissertation mining

The past few weeks have certainly been stressful for the medical school of the University of Münster in Germany. VroniPlag Wiki has so far reported plagiarism in 21 dissertations that were submitted to the school between 2004 and 2011. The findings even include a chain of three plagiarized dissertations: Gt (2010) is a plagiarism of Ckr (2009) on 100% of the pages. The Ckr thesis contains plagiarism on 94% of the pages, including from Gb (2008), which in turn is a plagiarism of a thesis submitted in 2007. All four theses were prepared with the same doctoral advisor. Another cluster of five theses that repeat material from each other, with another advisor, has been documented (Tmm/40 pages/47%; Aeh/15 pages/86%; Clm/27 pages/62%; Clg/21 pages/80%; Amh/21 pages/52%). In addition to the plagiarism, evidence of data falsification has also been found in some of the theses.

How were these theses identified? And why were so many found in such a short time?

It was a rather simple application of data mining techniques to dissertations that university libraries make available as open access digital publications. Medical dissertations were chosen because a large number of them are available, many from the past 10 years, and because they often deal with similar topics. The theses are also often painfully short, sometimes even consisting of just one publication by a research group that one of the authors submitted as their dissertation.

Volker Rieble, a German law professor, discussed open access repositories in his 2010 book Das Wissenschaftsplagiat: Vom Versagen eines Systems (p. 52ff). The book has unfortunately been taken off the market, as one of the persons named as a plagiarist won a lawsuit filed against Rieble. In the book, Rieble argues that open access repositories, especially ones operated by universities, should take measures to make sure that their authors are not being plagiarized if their texts are openly offered. He feels that this publicly available material is a simple invitation to plagiarize. Of course, he does recognize that open access could help discover plagiarism, but he points out that no one was taking any action against copyists.

Well, now someone has. The work of VroniPlag Wiki in the past three years has shown that there is extensive plagiarism in dissertations and other academic texts throughout Germany, in all fields, and done in many different ways. The cases in Münster were discovered using a collusion identification method applied to open access dissertations.

While reading an early version of my book, one of the VroniPlag Wiki researchers stumbled over the section on collusion. What exactly was that? Collusion is when two or more students cooperate in producing materials in situations in which they were expected to work alone. For example, two students write a program together and each turns it in as their own work. Or five students in a very large course cooperate to write a paper together and then each turns in his or her own slightly modified version. Students hope that the teachers will not be reading carefully (or at all?) and thus will not identify the "work-saving" efforts. It is not necessary for the participants to knowingly participate in the collusion. If author A re-uses text from author B without B being aware of the situation, this would also be considered collusion.

The researcher noted that he could imagine students doing such a thing, but no doctoral candidate would be so careless as to do something like that, would they, especially when they plan on publishing online? Would people collaborate on a dissertation, each submitting their own copy, or each writing half of the dissertation, or would someone copy another dissertation from the same school or even the same professor? Unimaginable. But there was a precedent.

There was a case of collusion discovered at the medical school in Münster in 2011 that was identified by a Wikipedia author who stumbled upon two practically identical dissertations that were submitted three years apart – to the same examiners ([1] submitted in 2009 and since withdrawn, is a copy of [2] from 2006). This was found just after Germany was rocked by the Minister of Defense, Karl-Theodor zu Guttenberg, stepping down after his dissertation was found to be extensively plagiarized.  The dean of the medical school in Münster emphasized then in a press release that plagiarism in a dissertation was an absolute singularity. He also noted that they would be looking into punishing the advisor, perhaps by barring him from taking on doctoral students in the future.

Would it be possible to check whether it is indeed true that such a plagiarism is a singularity? After all, many theses are, indeed, available online. All a dishonest author would need to do would be to download one or more theses, touch them up, and submit them. Since they are apparently not read closely (or why is such a thesis acceptable in Münster? The formatting from PDF page 14 is so erratic as to make the text unreadable) this might seem a good strategy for someone who is trying to get that "Dr." with as little effort as possible.

Intra-University Clusters
The first step in identifying collusion within a university department is to obtain a good number of theses from a university and then check each one against all the others from the same school. A list of the dissertation-granting medical schools in Germany was quickly found online. An attempt was made to download medical theses for a selection of these schools, including Münster.

As is usual for data mining applications, the most time-consuming part of the exercise is getting the data ready for work. The university libraries' offerings of digital publications are of quite varying quality. Some offer wonderfully clean metadata with URLs to the entire thesis; others have chaotic catalogs, upload the same dissertation more than once under different names, or for some reason split a thesis into chapters. The names of the files are quite amusing, as they appear to be named by the candidates themselves: "copyshop-fassung.pdf" [copyshop version], "dissertation_finish.pdf", or just "doktor.pdf". Most are called "dissertation.pdf" or "doktorarbeit.pdf".
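
One of the cleanup steps mentioned above, spotting the same dissertation uploaded more than once under different names, can be sketched as follows. This is only a minimal illustration and not the tooling that was actually used: it assumes the text has already been extracted from the PDFs, and the file names and contents are made up.

```python
import hashlib
from collections import defaultdict


def text_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_uploads(texts: dict[str, str]) -> list[list[str]]:
    """Group file names whose extracted text is identical after normalization."""
    groups = defaultdict(list)
    for name, text in texts.items():
        groups[text_fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical file names and contents, for illustration only.
sample = {
    "dissertation.pdf": "Einleitung  und Methoden",
    "doktorarbeit.pdf": "einleitung und methoden",
    "doktor.pdf": "Ein ganz anderer Text",
}
print(find_duplicate_uploads(sample))
```

A hash only catches uploads whose extracted text is identical apart from case and whitespace; a thesis that was split into chapters or re-exported with different formatting would still need the pairwise comparison.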

Since the main piece of software compares each thesis with all the others, the number of comparisons grows quadratically with the number of texts examined. Comparing only a few dissertations with each other only takes minutes, but as the number of dissertations examined increases, the time quickly grows to days or even months.
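
To make the quadratic growth concrete: comparing each of n theses with every other one takes n(n-1)/2 comparisons. The short sketch below just evaluates that formula; the collection sizes are illustrative and are not figures from the investigation.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n texts: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Illustrative collection sizes only.
for n in (10, 100, 1000):
    print(f"{n} theses -> {pair_count(n)} comparisons")

# Sanity check against explicit enumeration of all pairs.
assert pair_count(10) == len(list(combinations(range(10), 2)))
```

Ten times as many theses means roughly a hundred times as many comparisons, which is why a run that takes minutes for a handful of texts can take days for a whole repository.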

The results of the comparisons are not an automatic plagiarism determination: only identical text sequences are identified. Each and every suspicious pair of theses needs to be investigated manually. Often, both authors identified their thesis as joint work or the text is a direct quote, so this is not plagiarism. Or both had a copy of the same questionnaire in the appendix and a very similar literature list, and these are responsible for the text similarity. Or two copies of the thesis were uploaded to the library database under different names. But occasionally, there is no such explanation for the numerous and at times extensive swaths of identical text. And so, researchers with VroniPlag Wiki began to document the theses – manually.
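
What "identifying identical text sequences" can look like in practice is sketched below with shared word n-grams. This is an assumption for illustration, not the algorithm of the software that was used; the function names, the sample sentences, and the n-gram length are all made up.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive words, lowercased, punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_fraction(a: str, b: str, n: int = 5) -> float:
    """Fraction of the n-grams in text a that also occur in text b."""
    grams_a = word_ngrams(a, n)
    if not grams_a:
        return 0.0
    return len(grams_a & word_ngrams(b, n)) / len(grams_a)


# Made-up sentences, for illustration only.
a = "The patients were examined twice a year by the same physician"
b = "All the patients were examined twice a year in the clinic"
print(shared_fraction(a, b, n=4))
```

A high value only says that the two texts share word sequences. Whether that is a quotation, declared joint work, a common questionnaire, or plagiarism still has to be decided by a person.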

Manual documentation of plagiarism involves locating the text overlap positions, recording the overlap, and having a second researcher sign off on the documentation. Once a potential source for a thesis has been located, the positions of the text overlap can be identified with SIM_TEXT, a text comparison tool that researchers at VroniPlag Wiki implemented so that it can run locally in the browser. The overlaps are documented as fragments, recording the page and line numbers and documenting the portion of similar text in both the source and the potential plagiarism.

The result lists from the comparisons can be sorted by amount of text overlap, so that one can work down from the most extensive ones. In the investigation of Münster, quite a number of theses turned up that could be documented rapidly, as the theses were quite short and the text copying was often page-wise.
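
Working down such a result list from the largest overlap can be sketched like this. The thesis labels, the overlap fractions, and the cut-off are hypothetical and only serve as an example.

```python
def rank_pairs(results, min_overlap=0.0):
    """Return (thesis_a, thesis_b, overlap) tuples, largest overlap first."""
    kept = [r for r in results if r[2] >= min_overlap]
    return sorted(kept, key=lambda r: r[2], reverse=True)


# Hypothetical labels and overlap fractions, for illustration only.
results = [
    ("thesis_01", "thesis_02", 0.08),
    ("thesis_03", "thesis_04", 0.94),
    ("thesis_01", "thesis_05", 0.47),
]
for a, b, overlap in rank_pairs(results, min_overlap=0.10):
    print(f"{overlap:.0%}  {a} vs {b}")
```

The cut-off only decides which pairs get looked at first; it does not replace the manual check of each pair.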

The University of Münster has set up an investigative committee that includes external experts for what the press spokesperson has termed a "conflagration" (Flächenbrand). The committee is to convene in July. The dean is quoted in the press as being extremely irritated by the number of cases documented. The head of the medical association of Westfalen-Lippe is quoted in the same article as stating that since it is expensive to train doctors and they are urgently needed, it would be a "waste of labor" to demand that medical students spend two to three years working on a dissertation. I respectfully request, then, that medical students just quit producing sham dissertations. They should be awarded an "M.D." upon finishing their studies and let those interested in furthering science and academics invest their labors in producing dissertations that are original work.
Training a physician is expensive. Physicians are urgently needed. It would therefore be a "waste of labor" to require a student to spend two or three years working on a doctoral thesis, as is customary in other fields.

(Source: Münstersche Zeitung, Münster)

There is also an interesting collection of statistics on dissertations in Münster, put together by a VroniPlag Wiki researcher in an attempt to understand what may have caused this extreme cluster of plagiarism. What one sees here, though, is that the number of dissertations submitted has declined, as has the number of online publications.

Münster is not the only university that has been shown to have accepted massive plagiarisms. A thesis from the Charité in Berlin was recently posted (Ali) that has more than 75% plagiarism on all (100%) of the pages. It is also evident that data was falsified in this thesis: the numbers of patients interviewed differ from those in the older thesis, but the percentages given are the same as in the older thesis and do not match the numbers published in the thesis itself. When Spiegel-Online questioned the doctoral advisor about the thesis, he could only vaguely remember it. The Charité is currently investigating.

Inter-University Clusters
After investigating theses from just one university, clusters from two or more different universities can be combined in order to see whether there has been any "borrowing" of text between the universities. This is an extremely time-consuming process, but it turns up fascinating results. Two theses that are around three-quarters identical were handed in within a few weeks of each other to two different universities under different advisors. If this was joint work, it is not mentioned in either thesis. There are quite a number of theses that are patchwork quilts of text from different universities. There is a 30-page thesis submitted to Mainz (Tz), of which over half of the pages are from a thesis submitted seven years prior to Gießen.
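
Once each university has been checked internally, combining two collections only requires the cross pairs, one thesis from each side. The sketch below shows that pairing under this assumption; the collection names and sizes are made up.

```python
from itertools import product


def cross_pairs(collection_a, collection_b):
    """Every pairing of one thesis from each of two collections."""
    return list(product(collection_a, collection_b))


def cross_count(size_a: int, size_b: int) -> int:
    """Number of cross pairs: size_a * size_b."""
    return size_a * size_b


# Hypothetical thesis labels, for illustration only.
univ_a = ["a1.pdf", "a2.pdf"]
univ_b = ["b1.pdf", "b2.pdf", "b3.pdf"]
for pair in cross_pairs(univ_a, univ_b):
    print(pair)

print(cross_count(300, 500))
```

The number of cross pairs is the product of the two collection sizes, so every additional university adds a full round of comparisons against each collection already examined.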

So many text identities have been found that it would take an enormous effort to document them all. But it has been shown that it is possible, using a rather simple (if time-consuming) method, to detect collusion plagiarism. Universities that publish epubs should at least make sure that they are not re-publishing material before they put a text out in public. After checking against their own text collection, perhaps a test against a selection of other university libraries is worth the investment of time. And at the risk of sounding like a broken record: the examiners should actually read the theses and perhaps keep better track of their students and the topics they pose.

The next blog article will be quite technical and explain the methodology used to find these collusion plagiarisms.

P.S. While I was finishing up this blog post today, medical dissertation #22 from Münster was posted, Aaf. The text overlap, found on 48% of the 31 pages, appears to be taken from a thesis submitted one year previously. The text has been disguised by substituting synonyms and re-wording sentences. This makes it difficult for software to identify the thesis as a possible plagiarism, although there are some longish portions that are taken verbatim. Page 13 shows a problem that appears when plagiarized text is rephrased: the original author writes that S. aureus appears to increase the mortality rate. The word "appears" was left out of the reworded text in Aaf, making it appear to be a known fact.


  1. Hi Debora,

    this is all extremely interesting and a real step forward. CONGRATULATIONS! How many medical dissertations of Münster were in the sample, and with which method (random numbers...?) did you select them? Or did you and the VroniPlag researchers test ALL medical dissertations of Münster available online and submitted between 2004 and 2011?
    I did not know so far that VroniPlag intentionally chose ONE university (Münster) and ONE discipline (medical science) for this project. This fact makes the result even more striking. I wonder why none of the official institutions - like Hochschulrektorenkonferenz - has said anything so far concerning the Münster experiment.
    And, a last question: Were DFG projects - at least indirectly - involved in any plagiarized dissertations?

    Stefan Weber

  2. There were around 1500 theses that were a) downloadable and b) for which the text could be extracted and c) for which there was more than just the cover page. So percentages cannot be calculated; it is just a very rough figure. The figures for the text overlap are also just ballpark figures, as there can be many reasons (some given above) for the numbers to go either way. For example, one can see older VPW cases in the listings with much smaller amounts of text overlap than have been determined, as there were multiple (outside) sources for these theses. There were a number of universities chosen, and not just medicine: dental medicine, veterinary medicine, as well as some biology and chemistry were also included in some of the clusters. This is ALL very preliminary, but since Münster rather stuck out and it was easy to document the large-scale text overlapping, these theses have been quickly and closely examined. I have no data on whether the DFG financed any of the research; that would be an interesting question.