
Tuesday, October 13, 2020

Plagiarism Detection Software: Publication, Mergers, News

Finally found some time for a post!

First off: The TeSToP working group (of which I am a participant) at the European Network for Academic Integrity has finally published its test of support tools for plagiarism detection. It looks at the results from several angles, such as effectiveness on various European languages, single-source versus multi-source plagiarism, and the amount of rewriting done.

Foltýnek, T., Dlabolová, D., Anohina-Naumeca, A. et al. Testing of support tools for plagiarism detection. Int J Educ Technol High Educ 17, 46 (2020). https://doi.org/10.1186/s41239-020-00192-4

Abstract:
There is a general belief that software must be able to easily do things that humans find difficult. Since finding sources for plagiarism in a text is not an easy task, there is a wide-spread expectation that it must be simple for software to determine if a text is plagiarized or not. Software cannot determine plagiarism, but it can work as a support tool for identifying some text similarity that may constitute plagiarism. But how well do the various systems work? This paper reports on a collaborative test of 15 web-based text-matching systems that can be used when plagiarism is suspected. It was conducted by researchers from seven countries using test material in eight different languages, evaluating the effectiveness of the systems on single-source and multi-source documents. A usability examination was also performed. The sobering results show that although some systems can indeed help identify some plagiarized content, they clearly do not find all plagiarism and at times also identify non-plagiarized material as problematic.

So just a few months later, these two press releases showed up:

  • Turnitin announced in June 2020 that it had purchased the company Unicheck. Both systems participated in the TeSToP test.
  • Urkund and PlagScan, two more systems that were in the TeSToP test, announced a merger in September 2020: They will now be known as Ouriginal, combining the plagiarism detection results of Urkund with the author metrics of PlagScan.

These four systems just happened to be the best ones in combined coverage and usability, although none of the systems is perfect: the field averaged 2.5 ± 0.3 on a scale of 0 to 5. We plan on retesting in three years, so it will be very interesting to see how these combined systems fare then.

In other news, the proceedings of the "Plagiarism Across Europe and Beyond 2020" conference (PAEB2020), which ended up being held online instead of in Dubai, are now ready and available for download. PAEB2021 will be held in Vienna, September 22-24, 2021, COVID-19 permitting.

And in very sad news, academic integrity researcher Tracey Bretag from Australia passed away in October 2020. Jonathan Bailey has written an excellent obituary on his blog Plagiarism Today. I am glad that I was able to meet her many times and experience her great ideas and energy. It was a pleasure to contribute to her Handbook of Academic Integrity. She will be sorely missed.

Sunday, July 5, 2015

Dutch University Rescinds Doctorate

According to NRC.NL [apparently the online edition of a Dutch daily evening paper], the Erasmus University of Rotterdam has recently rescinded a doctorate. The case involves a doctorate in psychology granted by the university to a woman in 2013.

There was a discussion about the case in 2014 in an Erasmus University of Rotterdam publication. That publication states that after the plagiarism was discovered, she was reprimanded and given until October 1, 2014 to "repair" the plagiarism in her thesis. The academic integrity council of the university had recommended immediate retraction of the thesis, but the Executive Board of the university decided that the supervisor was partially at fault and that it was "only" a question of sloppy citations. The doctoral student, according to the article in the EUR publication, felt that she had not intentionally plagiarized; she had had her thesis checked by Turnitin, and it had not uncovered any plagiarism. Additionally, the thesis committee had passed her, so she felt that she should not be penalized if they didn't have any problems with the thesis.

The NRC.NL article notes that the external committee investigating the case determined that she did not rewrite the plagiarized passages, but only deleted them in the re-submitted version. The quality of the rest was debatable, appearing to be based only on secondary sources. Thus, she has been asked to return her doctoral certificate. NRC.NL says that this is a first for the Netherlands; I am not sure that this is true. She refuses, however, to hand back her certificate and is now initiating legal action against the university, NRC.NL reports.

The argument she makes, that she used software to check the thesis and it found nothing, points to a very big problem in the use of so-called plagiarism detection software. Just because the software does not find any sources, that does not mean that the thesis is original. It just means that no sources were found. There could be a source that is not available on the open Internet, or one from a book, or one that is for some reason not in the database used by the system. It is also possible that the text was rewritten to disguise the text taken, which will foil many such software systems. Software can only be used as a tool, not as a litmus test for determining plagiarism.

Thanks to Google Translate for filling in the bits of Dutch I couldn't decode!

Friday, February 13, 2015

Austrian term papers clog plagiarism detection system

The Austrian online newspaper derStandard.at reports on a bit of a problem with Austria's new central submission system for high school term papers for the school leaving certificate, the Matura. Pupils in Austria are now expected to submit a term paper of 40,000 to 60,000 characters (the vorwissenschaftliche Arbeit) by the middle of their last year of school. The paper is graded by teachers, and the students must give a presentation on their work.

Of course, since Austria is well aware that there is a plagiarism problem, at least as far as pupils and students are concerned [not so much for doctoral dissertations, but that is another blog post], the term papers must be checked for plagiarism by a so-called plagiarism detection system.

The due date for 2015 is Friday, February 13. Surprise: many students have waited until the last minute, and the system is throwing errors that appear to point to it being swamped. Apparently, the operators also did not reckon with files as large as the ones being uploaded. The server operator noted that they had expected files of around 1 MB; instead, they were getting files as large as 60 MB.

Not to fear - there is a Plan B in action: the pupils can submit a printed version at their schools in order to keep the deadline. Or, as one teacher noted in a comment, submit at 5 a.m. - the server runs well at that time of night.

In 2013, almost 44,000 pupils were granted their diplomas in Austria. Teachers will now, in addition to grading these papers, have to wade through the results of the plagiarism detection software, which produces false positives as well as false negatives; it does not determine plagiarism, but only gives some indication of where plagiarism might perhaps be found. Even assuming that a teacher spends an average of only 10 minutes per paper interpreting the results (and this is generous, as the reports are not easy to read and the numbers reported can be quite misleading), this means a minimum of 7,000-8,000 extra hours of work nationwide, and probably tenfold that.
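
For the curious, here is that back-of-the-envelope arithmetic as a small Python sketch; the per-paper time is just the assumption stated above, not a measured value.

    # Rough estimate of the nationwide grading overhead (figures from this post).
    papers = 44_000            # pupils granted diplomas in 2013
    minutes_per_report = 10    # generous lower bound for interpreting one report

    extra_hours = papers * minutes_per_report / 60
    print(f"at least {extra_hours:,.0f} extra hours")  # at least 7,333 extra hours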

If the pupils are anything like the ones I see in the first semester, they love to take pictures they find on the Internet to spice up their texts - they are much more visually oriented than the older generations. The software will certainly not be able to identify pictures that are not used according to their license, so the teachers will also need to use Google's image search or a system such as TinEye to look for potential sources, increasing the amount of time needed for grading.

Maybe the idea of a term paper submitted centrally needs to be rethought? Of course, pupils have to learn how to do research and to write about a topic. But we need to be thinking about how to develop methods of assessment that are plagiarism-proof, instead of adding more broken software to a broken system.

Monday, October 7, 2013

Plagiarism Detection Software Test 2013

Today I released the results of the Plagiarism Detection Software Test 2013. The report is available online, as are the individual results. Spiegel Online reported on the test, including a picture of the home pages of the systems and the companies' responses to the results, if a company cared to answer. We also offer the companies the opportunity to send us their comments on the test; we are glad to publish them.

The results can be summed up rather simply:

So-called plagiarism detection software does not detect plagiarism. In general, it can only demonstrate text parallels. The decision as to whether a text is plagiarism or not must solely rest with the educator using the software: It is only a tool, not an absolute test.
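
To make concrete what "text parallels" means, here is a minimal sketch using Python's standard difflib module; the two sentences are made up for illustration.

    import difflib

    # Text matching finds parallel runs of words - nothing more, nothing less.
    original = "Software cannot determine plagiarism, it can only flag text similarity."
    submission = "As is well known, software cannot determine plagiarism, it can only flag overlaps."

    a, b = original.split(), submission.split()
    for m in difflib.SequenceMatcher(None, a, b).get_matching_blocks():
        if m.size > 2:  # report only runs of more than two shared words
            print(" ".join(a[m.a:m.a + m.size]))
    # prints: cannot determine plagiarism, it can only flag

Whether such a parallel is a proper quote, a common phrase, or plagiarism is exactly the judgment that the software cannot make.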

If a university decides to use plagiarism detection software, it needs to have a clear policy on why it is using the software and how it will react to the results. It would be good to set up a competence team that offers educators help in testing suspicious texts, and perhaps to have two systems on offer, as different systems find different sources for different parts of the same document.

Thursday, June 2, 2011

Plagiarism Detection Software and zu Guttenberg's Thesis

We* couldn't resist. Here we had a thesis, a very large one (475 pages), for which a group of collaborators, the GuttenPlag Wiki people, had already determined which bits were plagiarized from which source. We decided to see how plagiarism detection software would fare on the same material.

We wrote to the five systems that we rated "partially useful" in our 2010 test (a top score): PlagAware, turnitin, Ephorus, PlagScan, and Urkund. All of the companies were glad to provide us with a test account. Turnitin did, however, suggest that we use iThenticate, one of their products that uses the same engine and backend as turnitin but has an easier-to-use interface for the purpose at hand: testing individual files instead of modeling classes.

We obtained a copy of the thesis in PDF format and started in.

The first problem was the size of the file. At 7.3 MB and about 190,000 words, it was a heavyweight and not easily digestible for the systems. PlagAware gave up after just 159 pages; iThenticate chopped it into 13 pieces of 15,000 words each; Ephorus first tested it themselves, then reluctantly let us upload our copy, because we wanted the same copy to be used by all systems; PlagScan chugged away overnight on the results; Urkund ran into trouble with the number of hits - we appear to have taken down their entire system over the weekend.
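
For readers wondering what that chopping amounts to, here is a minimal sketch of splitting a text into 15,000-word pieces. This is our guess at the mechanics, not iThenticate's actual code.

    def chunk_words(text, size=15_000):
        # Split a document into pieces of at most `size` words.
        words = text.split()
        for start in range(0, len(words), size):
            yield " ".join(words[start:start + size])

    # Our roughly 190,000-word thesis yields 13 such chunks.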

The next problem was the PDF of the thesis itself. It was formatted nicely with ligatures and differently sized spaces for a very professional look. Many systems struggled with this, assuming these were stray characters and just discarding them instead of replacing them with the respective letters (fi, fl, or a blank). This caused some plagiarism to be missed.
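
For what it is worth, Unicode compatibility normalization handles exactly this case; a minimal Python sketch (the sample words are made up):

    import unicodedata

    text = "ﬁnden und ﬂiegen"  # contains the ligatures U+FB01 (fi) and U+FB02 (fl)

    # NFKC normalization expands ligatures into their constituent letters
    # instead of leaving them as stray characters to be discarded.
    print(unicodedata.normalize("NFKC", text))  # finden und fliegen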

Then we had the result reports. Since there are many copies of the same text available online, many systems inflated their results by counting the same source numerous times. We made every effort to disregard anything published on the GuttenPlag site; iThenticate had a nice functionality for disregarding this site. On the other hand, over 40% of the links that iThenticate returned went to 404s, pages no longer available at that URL! The sites have reorganized their material and URLs. We researched a number of them; the sources are, indeed, still online. This is unconvincing for a dissertation board - they need to be able to verify both the source and the copy.

The GuttenPlag people have determined that there is plagiarism on 94% of the pages and in 63% of the lines. The following are the results for the individual systems:
PlagAware: Initially 28% on the first 159 pages, although this included a lot of garbage such as pastebin material. After we removed this and the GuttenPlag links, the amount went to 68%, before the report disappeared completely. We have not been able to resubmit; it breaks off with an error.
iThenticate: 40%
Ephorus: 5%! Only 10 possible sources were found; of these, 3 were GuttenPlag and one was a duplicate.
PlagScan: 15.9%
Urkund: 21%

One general problem that we had with all of the reports was that we could not click on a plagiarized passage and discover its page number in the PDF source. This is something we would need when preparing a case for a dissertation board, as the side-by-side documentation needs to be marked in copies of the original text and not online. It would be a lot of manual work to find the pages for preparing such documentation.

Another problem was the presentation of the footnotes. These were generally not recognized by any of the systems, and often the footnote number and text were just inserted in the middle of a paragraph. This often got in the way of marking a larger block of 1:1 plagiarism. The Wiki has found an interesting type of plagiarism we now call “Cut & Slide”: from a large block of plagiarized material, a portion is cut out and demoted to a footnote.

Reports

The PlagAware report disappeared from the database after a software update; we were rather upset that reports that had been produced (and would have been paid for, if we had been real customers) were suddenly gone. The side-by-side view that we have always liked turns out to have an extreme problem: the system often marked just 3-4 words, then noted an elision ("[...]") that sometimes went on for a page or so, and then another 3-4 words. Since this was at the very beginning, you had to scroll a good bit to get to the first plagiarism, the Zehnpfennig one (from the FAZ). A law professor might have considered the system broken, reporting minimal plagiarism, and just broken off the check.

iThenticate drove us batty with the 404s and the inability to copy and paste from the online report: many of the sources were large online PDFs containing numerous papers, and we had to retype some words and phrases in order to be able to search for the plagiarism source in the link given. It also reports proper quotes as plagiarism, while the appendix, which consists of material available completely online, is reported to be only a 60% plagiarism.

Ephorus has a nice side-by-side view, but included a COMPLETE COPY of the entire dissertation for every source found. We assume that this is why it broke off after just 10 sources: the report had already reached 54 MB! The reports don’t need to repeat things that are not plagiarized, just the plagiarized material - but preferably with the page numbers, please.

PlagScan also irritated us with its little dropdown list of sources. We had to open every source and then look to see where the plagiarism was; this took an enormous amount of time.

Urkund was extremely slow - we understand that this had to do with the number of hits found. The navigation, as we had found in the 2010 test, was difficult to use, and the numbers given had no real meaning. This report now cannot be loaded: it shows a count up to 90, stays at 89 of 90, and after some minutes gives up and tells us to come back another day.

38 out of the currently known 131 sources were found by at least one system. Overall, PlagAware found 7 (5%), iThenticate 30 (23%), Ephorus 6 (5%), PlagScan 19 (15%) and Urkund 16 (12%).

Out of the top 20 sources, however, only 8 were “findable”, i.e. online. Many were books that are available on Google Books (and are reported by Google through a normal Google search), and 7 of the top 20 were papers prepared by the writing services of the German Bundestag and thus not available to the public.

When we take the top 43 sources found by GuttenPlag, the top 11 sources found by the systems are all included (3 found by all five systems, 3 by four out of five, and 5 by three out of five), as are the top 20 sources that were available online. For those top 20 online sources, we have the following results: PlagAware and Ephorus each found 6 (30%), iThenticate 16 (80%), PlagScan 12 (60%), and Urkund 13 (65%).

Since iThenticate found so many sources, we wanted to go back and look at the positions at which these appeared in the results. iThenticate had reported 1156 sources, of which only just over 400 were longer than 20 words. We looked only at the 117 reported sources that had a match of 100 words or longer. Of the 13 top sources reported by iThenticate (one for each portion of the split-up thesis), 6 were 404s (“File not found”), one was a correctly quoted portion of the thesis, one was a correspondence between the bibliography and the bibliography of another source, 3 were from the same source (Volkmann-Schluck) that is indeed the top source for the thesis, and 2 were from the second most used source that was available online.

No easy answer

There is no easy answer to the question of whether the professors at the University of Bayreuth would have been able to discover the plagiarism in this thesis with the help of software: the usability problems are very serious. People who are not computer scientists have a hard time interpreting the results, and iThenticate’s links to 404s will lead people to disregard other found sources. But it would have been possible for the university to at least suspect a problem, if not see the abysmal magnitude of the plagiarism.

Our suggestion stands: if a teacher or examiner has doubts about a thesis, they should be able to use software systems, preferably two or three, to examine the material. However, they need to be trained in interpreting the results, or have a trained person such as a librarian go over the report with them. We do not find it generally useful to have all papers put through such a system with a simple threshold set for alerting the teacher. Allowing students to “test” their own papers will just encourage them to use one of the many online synonymization tools available until their paper “passes”. Writing, especially scientific writing, is about intensive work on the text itself, not just superficial attention to word order and choice.

The German report on this research can be found in the June 2011 edition of iX. (Update: Now available online at http://www.heise.de/ix/artikel/Kopienjaeger-1245288.html)

[*] Katrin Köhler is my co-author for this research and report.