Showing posts with label wikipedia. Show all posts

Monday, February 27, 2017

Catching up on VroniPlag Wiki

I haven't written about the work at VroniPlag Wiki for a while, so here are some of the more interesting things that happened in the past year or so:
  • There were three important verdicts handed down for VroniPlag Wiki cases, all of them affirming the university decisions to rescind the doctorates in question:
  • In another legal case (Ssk: Verwaltungsgericht Düsseldorf, 15 K 1920/15) as reported by the Legal Tribune Online, although the judge made it clear that the university would win its case, it still settled the case without judgement on rather strange terms: The thesis can be submitted to another university, but not Düsseldorf again.
  • In March 2016 the Medical University of Hanover determined that the current German Minister of Defense had plagiarized, but not enough to warrant rescinding the doctorate (discussed on this blog previously). A number of attempts have been undertaken in order to obtain information on which documented fragments were considered plagiarism and which ones were not, but I keep hitting a brick wall here, although it would be useful for the scientific community to know why specific fragments were considered to not be plagiarisms. The university was informed of another five dissertations (Acb, Bca, Lcg, Wfe in medicine, Cak in dentistry) and a habilitation (Mjm) that also include extensive text parallels that could be construed as plagiarism, but there has been no public progress made to date.
  • The University of Münster, with 23 cases in medicine alone, announced in February 2017 that they have completed their investigation, which took 3 years and 12 meetings of the committee. The Westfälische Nachrichten report that eight doctoral degrees have been withdrawn and 14 persons reprimanded, although the university won't say which degrees have been withdrawn. One author has died, and thus that investigation was discontinued. One doctoral advisor of two of the withdrawn degrees, according to the paper, has been stripped of additional funding and personnel and is prohibited from taking on doctoral students. The story was picked up by dpa and published in a number of online publications, for example Spiegel Online.
  • The often-heard argument that natural scientists don't plagiarize can be considered refuted with this doctoral thesis in chemistry that contains text overlap on over 90 % of the pages: Ry
  • A law dissertation from the University of Bremen, Mra, that was published in 2016, was documented with extensive plagiarism from, among other sources, the Wikipedia.
  • The documentations published for two habilitations, Chg (law, 2005) and Ank (dentistry, 1999), bring the total number of documented habilitations to eleven cases.
  • One dissertation, Gma, about TV game shows such as Who wants to be a Millionaire?, copied extensively from at least 13 Wikipedia lemmata.
  • One of the cases published in February 2017, Pak, includes not only text from five Wikipedia articles, but preserves the links from the articles as underlines in the text.
http://edoc.ub.uni-muenchen.de/13061/, Med. Diss. LMU München, p. 18-19


I will be speaking with a colleague in March at a conference about the Dr. Wikipedia phenomenon.

Friday, January 30, 2015

A Patchwork Thesis

In 2008, a graduate of a German Fachhochschule (University of Applied Sciences, abbreviated FH) submitted a dissertation to the Tomáš-Baťa-Universität in Zlín, in the Czech Republic, a good 800 km from his place of residence. At that time he was in his mid-40s and had been working as a public official for the past 18 years.

One may wonder why a mid-career public official would go so far afield to obtain a dissertation, when there are excellent universities near his hometown. In Germany at that time, a Diplom from a Fachhochschule was not sufficient to be admitted for doctoral work. Often extra coursework would be required, or even a Diplom or Master's at a university had to be completed before beginning work on a doctorate.

Today, the Diplom is no longer offered, but instead Master's degrees from both universities and FHs are acceptable for beginning work on a doctorate at a university. Then as now, however, doctorates earned in other EU countries can be used back home, so there is quite some interest in obtaining degrees outside of Germany.

Zlín offers a four-year doctoral program in Management and Economics that charges 1,600 €/year in tuition and requires submission of a written dissertation, which is generally published online in the Zlín Digital Library. One of these dissertations, the one submitted by the German public servant, is a 100-page dissertation that is now documented as case #140 on the VroniPlag Wiki site.

The "barcode" representation for the manual VroniPlag Wiki documentation for the case looks quite a bit like a patchwork quilt. This barcode is often misunderstood as being the result of a software-based plagiarism investigation. Nothing could be further from the truth: All discovery and documentation is done manually with the help of small software tools for various tasks, and all documentation is reviewed by a second researcher.

The bar code uses five different colors:
  • white is for pages that have not yet been investigated, or for which nothing has yet been found;
  • bright red is for pages that contain text parallels on over 75 % of the lines of text on a page. The line counting is not automatic, but must be done and reviewed by two researchers;
  • dark red is for pages that have text parallels on between 50 and 75 % of the lines;
  • black is used for pages that have text parallels, but that make up 50 % or less of the page;
  • blue is used for pages that are excluded from consideration. These are normally the title pages, the table of contents, the literature list and any appendices.
For this dissertation, there are some additional blue bands: Towards the end of the thesis there is a list of abbreviations and one of figures and tables that are sandwiched in between pages of content.
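To make the color rules above concrete, here is a minimal sketch in Python. It is only an illustration and not the tooling VroniPlag Wiki actually uses. The function name `barcode_color` and its parameters are hypothetical, and the percentage passed in is assumed to come from the manual, double-checked line count. I have also assumed that a page with exactly 75 % counts as dark red and one with exactly 50 % as black, following the wording of the list.

```python
from typing import Optional


def barcode_color(percent_parallel: Optional[float], excluded: bool = False) -> str:
    """Map one page to a barcode color (hypothetical helper, illustration only).

    percent_parallel: share of lines on the page with documented text
        parallels (0-100), taken from the manual line count; None means
        the page has not been investigated yet.
    excluded: True for title pages, table of contents, literature list,
        appendices and similar pages that are left out of consideration.
    """
    if excluded:
        return "blue"
    if percent_parallel is None or percent_parallel == 0:
        # Not yet investigated, or nothing found so far.
        return "white"
    if not 0 <= percent_parallel <= 100:
        raise ValueError("percent_parallel must be between 0 and 100")
    if percent_parallel > 75:
        return "bright red"
    if percent_parallel > 50:
        # Assumed boundary handling: over 50 and up to 75 inclusive.
        return "dark red"
    # Parallels exist, but on 50 % or less of the lines.
    return "black"


if __name__ == "__main__":
    for value in (None, 0, 20, 50, 60, 75, 90):
        print(value, barcode_color(value))
    print("excluded", barcode_color(80, excluded=True))
```

Note that the exclusion check comes first: a page that is verbatim from a referenced source but excluded from consideration stays blue, no matter how much of it is copied.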

http://de.vroniplag.wikia.com/wiki/Msc
[Msc 2008]
The two blue bands after the first few pages are quite interesting. The first one, extending from page 13 to page 21, is taken verbatim from the Catechism of the Catholic Church. The second one, running from page 26 to page 33, is verbatim from a European commission document. Discounting these pages, as each one has a brief reference given when the copy begins, there is only a total of 65 pages of content in this dissertation.

http://de.vroniplag.wikia.com/wiki/Msc/Fragment_024_04
The patchwork continues when looking at the individual pages, as there are also problems on many of those pages. For example, extensive swatches of text are taken verbatim from seven Wikipedia articles without reference. Pages 23-25, which deal with some topics in German history, are lifted entirely from the Wikipedia with only minor adjustments.








http://de.vroniplag.wikia.com/wiki/Msc/Fragment_058_01
Ten fragments are taken from a journal article that appeared in the Academy of Management Review. There are occasional references to the article given in the text, but it is not made clear that pages 56-63 are almost entirely from this article, and taken verbatim. Page 58 includes an interesting copy & paste error: the printed version of the article has a footnote from the previous page continued at the bottom of the left-hand column. In the dissertation, this text can be found sandwiched in between the text from the left-hand column and the text from the right-hand one. The sentences thus make no sense whatsoever.



On page 72, the Daimler-Benz sustainability report is copied with the "we" pronouns changed to "they" or "their" or "Daimler".

http://de.vroniplag.wikia.com/wiki/Msc/072

Pages 77-79 are taken verbatim from a discussion paper for the European Sustainable Development Network Conference 2008. A copy & paste error on page 79 caused quotation marks from the original to be reproduced as | or —.

http://de.vroniplag.wikia.com/wiki/Msc/Fragment_079_01

Whenever the writing shifts from proper English sentences to word-for-word literal translations from German, the thesis becomes quite unreadable. I quote from page 50:
Germany applies in 2001 above a surface of 357,020 sq. kms, the population around catches 82,330,000 million people (2000: 82,260,000). Of it 40,326,000 persons (49.1%) were gainfully employed in 2000. In 2001 there were 2.4% of the employed persons in the agriculture, forestry and fishery, 22.0 % in the producing trade without the building trade, 6.7 % in the building trade, 25.4 % in trade, guest's trade and traffic, 15.2% in the area of financing, renting and services for companies and 28.3% in the sector of public and private service providers (cf. "Germany in figures" in 2002). While trying to explain how many of these persons have been employed in small and middle companies or to define the boundary between them and large companies in Germany one pushes fast to his borders, because there are not enough actual statistical facts offered from the Statistical Federal Office16. Merely on data delivered by the Institute Of Middle Class Research17
[Note: The institution referred to in the last sentence is the Institut für Mittelstandsforschung in the original. It actually researches small and medium-sized enterprises, not the middle class.]

The University of Zlín publishes the reviews by the thesis examiners online, a commendable gesture. Two excerpts are documented in the Findings section on the VroniPlag Wiki:
  • Review 1 (17.11.2008): "Author considers that CSR [Corporate Social Responsibility] is suitable way how to change current managerial thinking which describe from different point of views, e. g. historical progress in religious aspect."
  • Review 2 (17.11.2008): "The dissertation is written very cultivate, digest at the high academic level."
I beg to differ. I don't find this patchwork of other people's words to be either at a high academic level or acceptable scholarship. Above all, it is a mystery to me that people are not aware that works published in a digital library are available world-wide for discussion. Does no one read the theses critically before publication?

The University of Zlín has been informed of the situation and has been sent this report containing all the documentation produced manually by VroniPlag Wiki about the dissertation. The university promptly acknowledged the receipt of the documentation by email. Many other universities, sadly, don't manage to do even that.

Friday, August 29, 2014

Wikipedia by any other name

Back in May I reported on the uproar surrounding the assertion that a book published by C. H. Beck in Germany, Grosse Seeschlachten -- Wendepunkte der Weltgeschichte von Salamis bis Skagerrak, contained plagiarism from the Wikipedia. The publisher withdrew the book, although "only" 5% of the book was affected, they stated. Well, there is actually quite a bit, and although the Wikipedia texts have been patchwritten (words inserted or deleted, words swapped with synonyms, phrases reordered) so they are not completely identical, it is clear that the text closely follows the Wikipedia. Some of the fragments have been documented by a VroniPlag Wiki researcher; however, they have not yet been double-checked [volunteers are welcome!]:
A representative of the publisher has agreed to participate in a discussion about the use of the Wikipedia by researchers on October 3, 2014 at the WikiCon in Cologne.

The next German publication with heavy borrowing from the Wikipedia was published by Springer Vieweg, Geschichte der Rechenautomaten, the history of computing in three volumes by a retired German computer science professor. Anyone who has given a lecture on the history of computing recognizes that many of the pictures are taken from the Wikipedia and other Internet pages, and many are not in the public domain. But it turns out that a good bit of the text is also from the Wikipedia.

I don't normally link to the FAZ, but they published an excellent article on the problem by Eleonor Benítez. She quotes the author as stating that these volumes are not scientific writing, but reference books. He defines a reference book as 80% data, while scientific writing contains didactical editing and thus contains more intellectual property. Data, he continues, are facts and not copyrightable. And anyway, there are only so many ways to state something in German.

Again, a VroniPlag Wiki researcher has documented just a few pages that have not yet been double-checked, but there are some very long passages that are identical:

Springer has withdrawn the books from their home page, but the books are still easily obtainable through other booksellers. I asked the executive editor if they were going to put out a press release about the issue; he said no. It seems it is hoped that this will quietly die down.

And now a third German book using Wikipedia without attribution has been identified. The Wagenbach Verlag recently published Aldo Manuzio. Vom Drucken und Verbreiten schöner Bücher. A scathing review in artmagazine pointing out the copying was published in July 2014.

A few questions arise:
  • Why do academic authors use the Wikipedia in their work without respecting the CC-BY-SA license? Okay, they probably find it embarrassing to have Wikipedia references all over the place. But isn't it worse to be found out after the book is in print?
  • Why don't the publishers have editors read the books critically before they are published? The prices are high enough, and that is supposed to be the justification for the price, that the publishers are somehow adding value to the process by ensuring a high-quality product. If the publishers are trying to save money by cutting out the editors, then perhaps we don't need publishers any more. 
  • Do the universities where the book authors work get rewarded financially by their ministries of education for these "publications"? Some are still listed on the publication lists of the authors, even though they have been withdrawn. This is also often the case for retracted papers: they remain on the lists of publications, for which one assumes the university and perhaps the researcher obtained a reward, even after retraction.
  • I've asked the German Wikimedia e.V. if they cannot sue in the name of the collective authors for the Wikipedia articles. However, only the authors themselves would be able to sue over copyright misuse. I still think, though, that since the license is not being respected by the publishers (especially if pictures are being used), that a suit or two should be in order.
  • Above all: if researchers are publishing Wikipedia material under their own names, how can I explain to my students that it is not acceptable for them to do the same?
I'm sure there will be more to come. 

Tuesday, July 22, 2014

Belgrade Mayor plagiarizes doctorate

A new plagiarism scandal has erupted in Serbian politics. The scandal around the dissertation of the Minister of the Interior, Nebojša Stefanović, is still in full swing. Now the dissertation of the Mayor of Belgrade, Sinisa Mali, entitled “Creating Value Through the Process of Restructuring and Privatization – Theoretical Concepts and Experiences of Serbia” and submitted in 2013 to the University of Belgrade’s Faculty of Organizational Sciences has been documented to be heavily plagiarized.

A professor of finance at the European Business School in Wiesbaden, Germany, documented the plagiarism in English on the Serbian site Peščanik in early July.


An interactive graphical representation of the thesis has also been put together, with every page of Mali's thesis linked to the iThenticate report on the plagiarism found on that page. Even considering all the caveats about the use of plagiarism detection software, quite a number of sources, including the Wikipedia, have been identified.
If the protection of ideas is no longer important in our society, then we will gamble our future away.



Sunday, May 4, 2014

Test of the Picapedia System

Stefan Weber noted recently that the plagiarism research group in Weimar has tweaked their experimental system picapica. We tested the system back in 2007 while it was in a beta stage. We have looked at it occasionally in recent years, but were not able to test it for various reasons. The system back in 2007 took quite a long time to return very little in the way of finding sources.

They have now focused the system on comparing a text with the current version of the Wikipedia in 10 different languages: English, German, Spanish, Catalan, Basque, French, Italian, Dutch, Portuguese, and Swedish. That would make it a useful tool for teachers looking for Wikipedia copies in student papers, as students will tend to take the current version of an article and not a historic one.

I decided to test the system using the 2013 test cases that were based on the Wikipedia, as well as a translation from a Wikipedia, an original work, and a plagiarism with no Wikipedia sources. Both .pdf and .doc files were used. The results:
  • 21-Tibet (3 sources including WP:DE): WP source found, and another Wikipedia article with an identical phrase
  • 33-Eyjafjallajoekull (1 source, WP:EN): WP source found
  • 36-Champagne (translation from WP:FR): nothing found
  • 43-Brüder-Grimm (2 sources, one WP:DE): WP source found
  • 45-Strelitzia (highly disguised plagiarism from WP:EN): WP source found
  • 46-Thermoskanne (25% automatically disguised plagiarism from WP:EN): WP source found
  • 47-Tessellation (4 sources, none WP): no sources found
  • 50-Union-Jack (original paper with correct references to WP:DE): It notes a similarity to the Union Jack lemma, but does not flag any reused text.
  • 51-London-Blitz (4 sources, one from WP:EN): WP source found
  • 52-Boxer-Rebellion (disguised plagiarism from WP:DE): WP source found
  • 57-Fallingwater (1 source, WP:DE): 2 possible WP sources named, one is correct
  • 58-Phillip-K-Dick (disguised plagiarism from WP:DE): WP source found
  • 60-Rolltreppe (3 sources, 1 WP:EN properly referenced): It notes a similarity to the Escalator lemma, but does not flag any reused text.
  • 63-Hebrew-Plag (1 source, WP:HE): nothing found, this language is not given as a possible one
The system did not always find the total amount of plagiarism, but it pointed to the correct source in all cases except the impossible ones (#36, #47, #63). It also did not report plagiarism for correctly quoted Wikipedia, something many systems do not get right.

The text is first uploaded to their server (in Germany) and deleted after examination, according to their privacy policy. However, they do keep search results and thus may have portions of the uploaded text stored in some form. I repeated the test a day later and saw no trace of the results from the previous day influencing the repeat tests. They do record IP addresses and use Google Analytics, but offer the service free of charge (even for commercial use) as long as it is not abused and as long as the user does not pretend that they developed the system themselves.

So the system does appear to be quite useful for a small subset of plagiarism detection problems, namely identifying text that has been taken from a current Wikipedia. If it is necessary to look for text in older versions of Wikipedia articles, the tool WikiBlame can be quite useful for identifying and dating text taken from the Wikipedia many years prior.

Saturday, May 3, 2014

Stormy Waters

It all began with a Facebook posting on April 22, 2014: Arne Janning posted a longish article to his friends asking for help. He had found a recent book by two prominent historians (Karsten, A. & Rader, O. B. (2013) Grosse Seeschlachten -- Wendepunkte der Weltgeschichte von Salamis bis Skagerrak. München: C.H. Beck) to contain plagiarism from the Wikipedia. He exaggerated by saying that "every page contained plagiarism", and wondered what he should do.

The first thing Janning should have done was perhaps to check his privacy settings, as his post was public and the case quickly caught fire and was widely reported on. Maritime puns seem to be the norm for the titles of the articles, as I have also chosen: Spiegel Online ["Abschreiben bei Wikipedia: Zwei Historiker geraten in Plagiatssturm"], Neue Zürcher Zeitung ["Seeschlacht mit unzulässigen Beibooten"], Süddeutsche ["Wendepunkte der Weltgeschichte aus Wikipedia kopiert"], FAZ ["Unter der Flagge Wikipedias"]. The authors and the publisher promptly threatened Janning with legal action. According to Spiegel Online, one of the authors, Rader, noted that he did not actually steal intellectual property, as he only used "technical details" from the Wikipedia. "In earlier days we used the Brockhaus [encyclopedia], today we use the Wikipedia," he is quoted as stating [translation dww].

The blog Erbloggtes noted that there were at least two pictures used from the Wikipedia as well as some text, and the pictures were printed without attribution. That is a definite copyright infringement: although one of the pictures was indeed in the public domain, the other was not. Many other blogs joined the discussion: Archivalia, Schmalenstroer, hellojed, plagiatsgutachter. The Wikipedia-Kurier discussion was, as so often, extensive.

The publisher soon decided to withdraw the book, as reported by BuchMarkt, Meedia, and others. Beck ran the book through plagiarism detection software (iThenticate) and declared the parts written by Arne Karsten to be "free from unmarked quotations", despite the fact that it is impossible to prove the absence of plagiarism. One can only demonstrate the presence of plagiarism by a synoptic documentation showing the plagiarism and the source together. The other author, however, had not only plagiarized from the Wikipedia, but from an article published online in 2003. The publisher notes in a pseudo-scientific manner the "exact" word counts and percentages found, although I have repeatedly shown in my work (for example, my 2013 test) that such numbers are meaningless. Additionally, a reader cannot tell which parts of this book were written by which author, so they both are responsible for the entire book, in my opinion.

Beck also couldn't resist bashing Janning, still threatening legal action, perhaps to deflect criticism from itself for not having properly edited the book. A good comment by Jörg Hopfgarten in the Boersenblatt notes the publisher would be better off understanding that this was just an angry customer blowing off steam, ranting. Customers have a right to do just that without consulting a lawyer, especially when it can easily be seen that they are at least partially right. Amazon is full of similar reactions; this was just the media picking up on the keyword "plagiarism" and running with it, without having independently verified the accusations. Indeed, none of the Seeschlachten books in the Berlin libraries were out on loan when I obtained a copy, although perhaps they all purchased the Kindle version.

Beck closes their press notice with a condescending offer to "participate in a discussion about the use of the Wikipedia in academics." Jan Engelmann notes on the Wikimedia blog that the discussion is for all practical purposes already over, as there are numerous court rulings on the legality of the Creative Commons license that the Wikipedia articles are under, CC-BY-SA. It is, perhaps, time for publishers to understand how a legal use of Wikipedia texts works: Link to the license and authors, and put the material that uses a Wikipedia text under at least this license. Open licenses do not mean that the material is free to be misappropriated.

Many of the blogs discussing the topic have started documenting the plagiarized portions, in particular using a German system Picapica (also called Picapedia), that compares text to the current version of the Wikipedia. I will be bringing a short test of this system in my next blog entry.

Tuesday, April 8, 2014

Short links

Thanks to a correspondent for combing Google News for these links:
  • The International News (30 March 2014) reports that a professor in Pakistan is being forced to retire on account of plagiarism in research articles he published:
    The Punjab University (PU) syndicate on Saturday, confirming plagiarism charges against the varsity’s Institute of Chemistry Prof Dr Zaid Mahmood, penalised him with forced retirement under the PEEDA Act. [...] Dr Zaid had been claiming that his research papers were published before 2007 and therefore they could not be made a subject of the inquiry as per the HEC’s plagiarism policy.
    As if just waiting a number of years somehow changes a plagiarism into a non-plagiarism. The article does not state if the papers are being retracted.

  • The Guardian has a piece (21 March 2014) on how easy it is to plagiarize using the Internet, but also how easy it is to find people out:
    The act of uncovering and investigating acts of plagiarism is becoming easier by the day. Search engines, online plagiarism checkers (of varying quality) and the viral publicity opportunities afforded by social media all play their part. Plagiarism searches can be compelling, like addictive puzzles where positive results elicit mental fist-pumps of delight.
  • The Times Higher Education notes (3 April 2014) that a senior sociologist, caught by a young PhD in plagiarizing from the Wikipedia, of all places, finds that rules about referencing don't apply to his scholarship:
    An eminent sociologist has claimed that high-quality scholarship does not depend on “obedience” to “technical” rules on referencing after a PhD student accused him of plagiarising from websites, including Wikipedia, in his latest book.

    Zygmunt Bauman, emeritus professor of sociology at the University of Leeds, was responding to claims that he fails to clearly indicate that several passages in his 2013 book Does the Richness of the Few Benefit Us All? are exact or near-exact quotations from the online encyclopedia and other web sources.
    Bauman tries to put the PhD student down by snorting that ideas aren't owned by anyone. But really, shouldn't every academic be able to clearly state what is from others? It does then rather save face when it turns out that something one is using from other people without attribution is just plain wrong...

Friday, July 15, 2011

Medical doctorate rescinded

The German University of Münster announced on July 14, 2011 that they have rescinded a doctorate from the medical faculty.

A Wikipedia editor reports on a strange occurrence while researching an article about growth factors in prostate cancer in the Wikipedia Kurier from May 28, 2011. The editor was using a dissertation from 2006 as the basis for the Wikipedia article. Dissertations - the non-plagiarized ones - are very useful for this task, as they offer a succinct overview of the literature on the topic.

The editor stumbled over the term Xxxxxxxxx xxxxxxxxxxxx xxxxxx, something s/he had not heard before. It was referenced from a book from 1996. Instead of running to the library to get the book, the editor first asked the "all-knowing garbage heap" if there was anything on this term around.

The editor was amazed to find a dissertation from 2009 on the same topic from the same university. The university puts all of its dissertations online, so only a click was necessary to download it. A short read was, as Yogi Berra would have put it, "déjà vu all over again". The editor thought this might just be a typo on the downloads page and that s/he now had two copies of the same dissertation. But no, each was by a different author.

Since zu Guttenberg had just recently resigned and VroniPlag was under full steam investigating other dissertations, the editor began a side-by-side comparison. Except for minor (and sometimes error-inducing) changes, the general introduction to the topic was identical, down to the line breaks. Then it got worse: there were even identical results, discussions, and the dedication - except the names were changed. The CVs were, however, different.

The editor was not sure what to do and consulted with some scientists. The unanimous opinion was: this must be reported to the authorities. So it happened, and the doctorate has been rightly rescinded from a practicing medical doctor in Westphalia.

I have often stated, as here in the Deutsche Ärzteblatt, that we need two kinds of doctorates for medicine: M.D. for the practicing doctors and Dr. med. for the researchers.

I'm glad the University of Münster was so quick in reacting.

Friday, June 29, 2007

Stolen from the Wikipedia

Many pupils and students think that the Wikipedia is there to let them hand in term papers very easily. The German Wikipedian Avatar has started a collection of sad stories about people who got caught plagiarizing.

They include a student trying to cover his tracks by changing the Wikipedia article and some links to discovered plagiarisms by journalists and publishers of articles in the Wikipedia. There are some from both the English and the German WP, more are welcome.