Thursday, June 23, 2011

Koch-Mehrin EU Commissioner for Research

Sylvana Koch-Mehrin had her doctorate rescinded by the University of Heidelberg in May 2011 because her dissertation contained over 30% plagiarism. Now, in June, she has been named EU commissioner for research, Spiegel Online reports.

This means that she sits on the committee that determines research policy for the EU. She had been an alternate on the committee; her fellow FDP politician Jorgo Chatzimarkakis (whose dissertation currently stands at 71% plagiarism, although he is contesting the University of Bonn's plans to rescind his doctorate as well) held the main seat. They have now changed places.

What does this say about research in Germany? What message does this send to the general populace about the importance of research? Plagiarists determining research policy? If today were April 1, I would have taken this for an April Fool's joke, but it is unfortunately true.

Poor Germany, all of your good researchers do not deserve this.

Update: A petition has been started requesting that she step down immediately.  It is online in English, German, and French. If you feel so inclined, please sign. There are already almost 2000 signatures - on day 1 of the petition.

Wednesday, June 22, 2011

GuttenPlag wins Online Award

The GuttenPlag Wiki was awarded the Grimme Online Award Spezial 2011, a German online journalism prize, for its work in determining the extent of the plagiarism in zu Guttenberg's thesis.

Here is a translation of the jury's decision:
At a time when great confusion reigned about how to assess the dissertation of Karl-Theodor zu Guttenberg, the "GuttenPlag Wiki" provided clarity. In an unprecedented form of collaboration, thousands of web users examined the thesis in detail and found discrepancies. Plagiarized passages that were found were precisely documented side by side with the original sources. An overview of the plagiarism found was kept current on the front page of the wiki.

This method of working produced excellent results in a very short time. The results are verifiable for anyone, making this web site a central contact point for the discussion about plagiarism in the dissertation that was conducted in the media and in the general populace for weeks on end.

The fair and unbiased mode of operation used by the administrators of the wiki was outstanding, and channeled the onslaught of prospective users into constructive paths that delivered a sober overview of the findings. The public statements of Minister Guttenberg about his thesis were thus set in contrast to the facts that could be evaluated by anyone.

Not only was the initiators' project idea noteworthy, but so was the work of the hundreds of web users who found more and more passages, online and offline, that had been used in the thesis without proper attribution. The project makes it clear that text comparisons can be well organized in a collaborative manner and shows the possibilities the web in general offers for group work.
Congratulations!

Saturday, June 11, 2011

The Strange Tale of the Paus Family and Borstel

Silvia Bulfone-Paus is an immunologist. She worked at the Forschungszentrum Borstel (FZB) outside of Hamburg in Germany, and is a professor at the Medical University in Lübeck.

Laborjournal.de reports that she published many papers together with the Russian couple Elena Bulanova and Vadim Budagian. Retraction Watch reported in March 2011 that 12 papers by the three authors had been retracted. The three have published 22 papers together, so there may be more.

In October 2009 the biologist Karin Wiebauer realized that the Western blots in some of the papers were very similar - sometimes just the labels were changed, in others a dose of Photoshop had been used to mirror, move, or distort the bands. This is the same method that Marion Brach used in the Herrmann/Brach scandal at the end of the 1990s [strangely enough, there is nothing in either the German or the English Wikipedia about them or the scandal].

In November 2009 Wiebauer informed the first author, Bulfone-Paus, of her discovery. Nothing happened. Finally, in April 2010, an investigation committee was convened. It determined that the publications had merely been sloppy, but that the results were okay. A culprit was found - the Russian couple. They were accused of deceit and the 12 papers were retracted, although the Russians did not agree to the retractions.

An anonymous Internet-based campaign ensued. Colleagues then published an open letter supporting Bulfone-Paus, saying that the poor woman - a brilliant researcher who has published much, including work together with her husband - had been deceived by her postdocs. The Borstel Board of Directors - sans Bulfone-Paus - published a good response to the open letter soon after forcing her off the board:
Severe failure in one area (as supervisor and responsible senior, corresponding and first author) can hardly be compensated by merits in other areas. [...] For all scientists, one of the greatest goods in science is personal credibility and integrity, and that the most precious currency scientists have is the truthfulness of their data. The scientific community expects rigorous adherence to the rules of scientific research from principal investigators and, in particular, from heads of research divisions or departments. [...] The scientific misconduct in Silvia Bulfone-Paus's lab and her procrastination to go public despite being ultimately responsible has highly damaged the reputation of the Research Center. This is what cannot be tolerated.
But now the plot thickens: an additional paper by Bulfone-Paus (not including the Russian couple) in Blood is currently under investigation. A co-author on this one is her husband, Ralf Paus, a dermatologist at the University of Lübeck. And the university has confirmed to Spiegel, a German news weekly, that it is currently investigating six papers by Paus.

And now it appears that Bulfone-Paus and Paus both have professorships in Manchester, in England, where they spend 20% of their time, according to the Times Higher Education. The couple also have three children, as reported by Spiegel in January.

In other news about Borstel, another director, Peter Zabel, stepped down earlier this month amidst plagiarism charges. It seems he published one paper twice (once in German and once in English), and in 2009 he published a paper that included large portions of text and diagrams from a 2008 paper published in the US. The duplicate publication is deemed less severe, although it is not clear that the later version acknowledges that it is in fact a duplicate - the abstract has been rewritten, but is still similar. Zabel has now also resigned from the editorial board of Der Internist.

The duplicate publication was found by someone calling themselves Clare Francis, who informed Retraction Watch, the Abnormal Science Blog, and me. It was found using Déjà vu, a tool for finding duplicate content in Medline.

Joerg Zwirner, in a recent post on the Abnormal Science Blog, calls for setting up an Office for Research Integrity in Germany, such as exists in the US. I heartily agree - this is far too complicated for non-medical researchers to understand, but it seems that there are deficiencies in the German medical research complex that have existed for decades. And the Herrmann/Brach scandal did not result in these being adequately addressed. Germany needs action, and it needs it now.

Plagiarism Found by Chance

The Münsterländische Volkszeitung reports on a strange case of plagiarism. An author for the German version of the Wikipedia was cleaning up the article on prostate cancer and looking for a serious source for one statement. Doctoral theses are great places to find these, as proper dissertations include an overview of the current literature on the topic.

As the author describes in the Wikipedia Kurier from May 28, 2011, while looking through the dissertation from 2006 they came across a term that was quite unfamiliar, even though they had been researching the topic for a while. A source from 1996 was given, but not feeling like going to the library, the Wikipedia author just asked the "omniscient garbage can" - a search engine - about the term, and found a few hits.

One was a dissertation from 2009 - from the same department of the same university. The author downloaded it and started reading, and it was déjà vu all over again, in the immortal words of Yogi Berra. The author thought that perhaps the links had been mixed up. But no: different authors, different years, but the very same advisor.

But the 2009 dissertation had been slightly changed. For example, its author had noticed that one source in the 2006 thesis was included in the bibliography twice by mistake - the duplicate was removed and the references renumbered. Even large portions of the acknowledgments are identical. The CVs are, however, different.

The University of Münster was informed in March; the committees are at work, and there is no word yet on their decision. These kinds of cases tend to be very hush-hush, just in case the charge of plagiarism is being lodged only to throw dirt at someone. If I hear of a result, I will report it here.

Thursday, June 2, 2011

Another test with Guttenberg's thesis

While we were working on our test using zu Guttenberg's thesis, we were alerted to work being done by researchers at the University of Magdeburg and the University of California, Berkeley:

B. Gipp, N. Meuschke, and J. Beel, "Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag," in Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries (JCDL '11), Ottawa, Canada, 2011.
 
The group will be presenting the paper in September in Ottawa and has put a preprint online.

Plagiarism Detection Software and zu Guttenberg's Thesis

We* couldn't resist. Here we had a thesis, a very large one (475 pages), for which a group of collaborators, the GuttenPlag Wiki people, had already determined which bits were plagiarized from which sources. We decided to see how plagiarism detection software would fare on the same material.

We wrote to the five systems that we rated "partially useful" (a top score) in our 2010 test: PlagAware, turnitin, Ephorus, PlagScan, and Urkund. All of the companies were glad to provide us with a test account. Turnitin did, however, suggest that we use iThenticate, one of their products that uses the same engine and backend as turnitin but has an easier-to-use interface for the purpose at hand: testing individual files instead of modeling classes.

We obtained a copy of the thesis in PDF format and started in.

The first problem was the size of the file. At 7.3 MB and about 190,000 words, it was a heavyweight and not easily digestible for the systems. PlagAware gave up after just 159 pages; iThenticate chopped it into 13 pieces of 15,000 words each; Ephorus first tested it themselves, then reluctantly let us upload our copy because we wanted all systems to use the same copy; PlagScan chugged away overnight on the results; Urkund ran into trouble with the number of hits - we appear to have taken down their entire system over the weekend.
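For readers who want to try something similar, here is a minimal sketch of the kind of chunking iThenticate applied, assuming the thesis text has already been extracted to a plain-text file (the file name and chunk size are only illustrative):

    # Sketch: split a long plain-text document into chunks of about
    # 15,000 words each, roughly what iThenticate did with the thesis.
    # "thesis.txt" and the chunk size are illustrative assumptions.
    def chunk_words(path, words_per_chunk=15000):
        with open(path, encoding="utf-8") as f:
            words = f.read().split()
        return [" ".join(words[i:i + words_per_chunk])
                for i in range(0, len(words), words_per_chunk)]

    chunks = chunk_words("thesis.txt")
    print(len(chunks))  # about 190,000 words / 15,000 words gives 13 chunks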

The next problem was the PDF of the thesis itself. It was formatted nicely, with ligatures and different-sized spaces, for a very professional look. Many systems struggled with this, treating the ligatures as stray characters and simply discarding them instead of replacing them with the corresponding letters ("fi", "fl") or a blank. This caused some plagiarism to be missed.
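A minimal sketch of how such extracted text could be cleaned up before checking, assuming the extraction preserved the ligature glyphs rather than dropping them; Unicode NFKC normalization expands them into plain letters and maps typographic spaces to an ordinary blank:

    import unicodedata

    # Sketch: repair ligatures and typographic spaces in text extracted
    # from a professionally typeset PDF before handing it to a checker.
    # NFKC normalization expands ligature glyphs such as "ﬁ" and "ﬂ"
    # into the plain letters "fi" and "fl".
    def clean_extracted_text(text):
        return unicodedata.normalize("NFKC", text)

    print(clean_extracted_text("Veriﬁkation der Auﬂagen"))
    # -> Verifikation der Auflagen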

Then we had the result reports. Since there are many copies of the same text available online, many systems inflated their results by counting the same source numerous times. We made every effort to disregard anything published on the GuttenPlag site; iThenticate had a nice function for excluding this site. On the other hand, over 40% of the links that iThenticate returned went to 404s - pages no longer available at that URL - because the sites have reorganized their material and URLs. We researched a number of them; the sources are, indeed, still online. This is unconvincing for a dissertation board - they need to be able to verify both the source and the copy.
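Checking which of the reported links still resolve can be scripted; here is a small sketch, with a hypothetical list of URLs standing in for the exported report:

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError

    # Sketch: check which of the URLs reported as sources still resolve.
    # The list of links is hypothetical; a real check would read the
    # URLs out of the exported report.
    links = [
        "http://example.org/some-moved-article.html",
        "http://example.org/still-there.pdf",
    ]

    for url in links:
        try:
            status = urlopen(Request(url, method="HEAD"), timeout=10).status
        except HTTPError as e:
            status = e.code   # e.g. 404 for pages that have moved
        except URLError:
            status = None     # host not reachable at all
        print(status, url)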

The GuttenPlag people have determined that 94% of the pages contain plagiarism and that 63% of the lines are affected. The following are the results for the individual systems:
PlagAware: Initially 28% on the first 159 pages, although this included a lot of garbage such as pastebin material. After we removed this and the GuttenPlag links, the amount went to 68% before the report disappeared completely. We have not been able to resubmit; the check breaks off with an error.
iThenticate: 40%
Ephorus: 5%! Only 10 possible sources found; of these, 3 were GuttenPlag and one was a duplicate.
PlagScan: 15.9%
Urkund: 21%

One general problem that we had with all of the reports was that we could not click on a plagiarized passage and discover the page number in the PDF source. This is something we would need when preparing a case for a dissertation board, as the side-by-side documentation needs to be marked in copies of the original text and not online. It would take a lot of manual work to find the pages when preparing such documentation.
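Here is a rough sketch of how one might map a reported text fragment back to a page number, assuming the pypdf library is available; the file name and the snippet searched for are only illustrative:

    from pypdf import PdfReader  # assumption: the pypdf library is installed

    # Sketch: find the pages of the PDF on which a reported text fragment
    # occurs, so the side-by-side documentation can cite page numbers.
    # The file name and the snippet are illustrative.
    def find_pages(pdf_path, snippet):
        reader = PdfReader(pdf_path)
        hits = []
        for number, page in enumerate(reader.pages, start=1):
            text = " ".join((page.extract_text() or "").split())
            if snippet in text:
                hits.append(number)
        return hits

    print(find_pages("dissertation.pdf", "Verfassung und Verfassungsvertrag"))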

Another problem was the handling of the footnotes. These were generally not recognized by any of the systems, and often the footnote number and text were just inserted in the middle of a paragraph. This often got in the way of marking a larger block of 1:1 plagiarism. The Wiki has found an interesting type of plagiarism we now call "Cut & Slide": from a large block of plagiarized material, a portion is cut out and demoted to a footnote.

Reports

The PlagAware report disappeared from the database after a software update; we were rather upset that reports that had been produced (and would have been paid for, had we been real customers) were suddenly gone. The side-by-side view that we have always liked turns out to have an extreme problem: the system often marked just 3-4 words, then noted an elision ("[...]") that sometimes went on for a page or so, and then another 3-4 words. Since this happened right at the very beginning, you had to scroll a good bit to get to the first plagiarism, the Zehnpfennig one (from the FAZ). A law professor might have considered the system broken, seen only minimal plagiarism reported, and just broken off the check.

iThenticate drove us batty with the 404s and the inability to copy and paste from the online report: many of the sources were large online PDFs containing numerous papers, so we had to retype some words and phrases in order to search for the plagiarized passage in the linked source. It also reports proper quotes as plagiarism - for example, the appendix, which consists of material completely available online, is reported as only a 60% plagiarism.

Ephorus has a nice side-by-side view, but it included a COMPLETE COPY of the entire dissertation for every source found. We assume that this is why it broke off after just 10 sources - the report had already reached 54 MB! The reports don't need to repeat material that is not plagiarized, just the plagiarized passages, but preferably with the page numbers, please.

PlagScan also irritated us with its little dropdown list of sources. We had to open every source and then look to see where the plagiarism was, which took an enormous amount of time.

Urkund was extremely slow - we understand that this had to do with the number of hits found. The navigation, as we had found in the 2010 test, was difficult to use, and the numbers given had no real meaning. This report now cannot be loaded at all: it counts up to 90, stays at 89 of 90, and after some minutes gives up and tells us to come back another day.

38 out of the currently known 131 sources were found by at least one system. Overall, PlagAware found 7 (5%), iThenticate 30 (23%), Ephorus 6 (5%), PlagScan 19 (15%) and Urkund 16 (12%).
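For transparency, the percentages above can be recomputed directly from the counts (a trivial sketch):

    # Sketch: recompute the percentages above from the per-system counts
    # of sources found among the 131 currently known sources.
    found = {"PlagAware": 7, "iThenticate": 30, "Ephorus": 6,
             "PlagScan": 19, "Urkund": 16}
    known_sources = 131

    for system, n in found.items():
        print(f"{system}: {n}/{known_sources} = {100 * n / known_sources:.0f}%")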

Of the top 20 sources, however, only 8 were "findable", i.e. online. Many were books that are available on Google Books (and are reported by Google in a normal Google search), and 7 of the top 20 were papers prepared by the research services of the German Bundestag and thus not available to the public.

If we take the top 43 sources identified by GuttenPlag, they include the 11 sources most often found by the systems (3 found by all five systems, 3 by four of the five, and 5 by three of the five) as well as the top 20 sources that were available online. For those top 20 online sources the results are: PlagAware and Ephorus each found 6 (30%), iThenticate 16 (80%), PlagScan 12 (60%), and Urkund 13 (65%).

Since iThenticate found so many sources, we wanted to go back and look at where in the results they appeared. iThenticate had reported 1156 sources, of which only just over 400 were longer than 20 words. We looked only at the 117 reported sources with a match of 100 words or longer. Of the 13 top sources reported by iThenticate (one for each portion of the split-up thesis), 6 were 404s ("File not found"), one was a correctly quoted passage in the thesis, one was an overlap between the bibliography and the bibliography of another source, 3 were from the same source (Volkmann-Schluck), which is indeed the top source for the thesis, and 2 were from the second most used source that was available online.
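The winnowing of the 1156 reported sources down to the substantial matches was simple bookkeeping; here is a sketch, with made-up match records standing in for the exported report data:

    # Sketch: filter a long list of reported matches down to the
    # substantial ones, as we did with the 1156 iThenticate sources.
    # The match records are made-up stand-ins for the exported report.
    matches = [
        {"url": "http://example.org/a", "matched_words": 412},
        {"url": "http://example.org/b", "matched_words": 57},
        {"url": "http://example.org/c", "matched_words": 18},
    ]

    long_matches = [m for m in matches if m["matched_words"] >= 100]
    print(len(long_matches), "of", len(matches), "matches are 100 words or longer")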

No easy answer

There is no easy answer to the question of whether the professors at the University of Bayreuth would have been able to discover the plagiarism in this thesis with the help of software. The usability problems are very serious, and people who are not computer scientists have a hard time interpreting the results; iThenticate's links to 404s will lead people to disregard other sources that were found. But it would have been possible for the university to at least suspect a problem, if not see the abysmal magnitude of the plagiarism.

Our suggestion stands: If a teacher or examiner has doubts about a thesis, they should be able to use software systems, preferably two or three, to examine the material. However, they need to be trained in interpreting the results, or have a trained person such as a librarian go over the report with them. We do not find it generally useful to have all papers put through such a system with a simple threshold set for alerting the teacher. Allowing students to "test" their own papers will just encourage them to use one of the many online synonymization tools available until their paper "passes". Writing, especially scientific writing, is about intensive work on the text itself, not just superficial attention to word order and word choice.

The German report on this research can be found in the June 2011 edition of iX. (Update: Now available online at http://www.heise.de/ix/artikel/Kopienjaeger-1245288.html)

[*] Katrin Köhler is my co-author for this research and report.