Tuesday, September 2, 2014

Homebrew Collusion Detection

tl;dr -- One can use free tools to identify collusion, a special sort of plagiarism, but there is still much manual work involved.

[Note: I promised this post 3 months ago - then life and a lot of dissertations got in the way. Sorry for the delay. --dww]

In a previous blog post I described the situation the University of Münster is currently facing with at least 23 dissertations in medicine documented as containing massive text overlap from dissertations submitted to that same university or other universities in previous years. The renowned Charité Medical School in Berlin is currently at 20 dissertations in medicine with massive text overlap, the number there is steadily rising.

This re-use of text (and images or even data) from the same department can be considered to be collusion, a special form of plagiarism. When looking at questions of collusion, there is a closed number of documents that are to be compared with each other, for example all of the dissertations from one department. Text overlap is much easier to find in a closed set of documents than finding a potential source somewhere on the internet.

How were these theses with such extensive text overlap identified? It has been postulated that VroniPlag Wiki has some sort of "deep search" tool, but actually, it is a time-intensive manual process, aided by small software tools. About 50,000 dissertations in medicine, dental medicine, veterinary medicine and biology have been downloaded and compared with each other, with some of the major plagiarisms thus discovered documented at the VroniPlag Wiki. Medicine was chosen for this investigation, as these theses tend to be quite short in Germany and many are available online.

Data collection

The first step was obtaining the dissertations from the various university libraries. One would think that this would be a trivial step, as most university libraries offer e-publication services to their members. It would seem that all one would need to do would be to download the files. But each university seems to have its own, intricate database and retrieval structure. An API would be wonderful that could be queried and would return a JSON map with relevant metadata such as name, title, field, year, and URL to the thesis. Indeed, there are a few libraries that offer such a service. Most just have some sort of web page for each dissertation that includes the metadata, but without markup indicating the semantic meaning of the text. Some libraries seem to make it intentionally difficult to automatically download all of the theses. With a little bit of work, the data needed can be automatically scraped from such pages, but the scraper needs to be adjusted for each library.

The most important data item for this task is the file name for the PDF. One library goes to the trouble of splitting every thesis into chapters, so there is not just one PDF but a directory containing all of the files. These have to be merged before continuing. Another library does not publish the file names, but only a key value used to generate the file name. However, if one downloads a few theses by hand, it is easy to see how to construct the thesis PDF name, if one has the correct key value, so that these theses, too, can be automatically downloaded.

The names of the files are at times quite amusing, as they appear to be named by the candidates themselves: "copyshop-fassung" [copyshop version], "dissertation_finish", or just "doktor". Most are called "dissertation" or "doktorarbeit", my favorite is "Microsoft_Word_-_DoktoarbeitAmAktuellsten" (misspelled "most recent doctoral thesis done with Word"). Apparently most of the libraries don't have a procedure for giving the files meaningful names. Sometimes the same thesis is offered under different names for unknown reasons. There are also universities that co-publish dissertations in their online libraries, so the same thesis will be available from two different universities under two different names.

Pre-processing

As is usual for data mining applications, one of the most time-consuming parts of the exercise is getting the data ready for work. A directory was set up for each of the 44 departments from various medical schools and life science departments chosen throughout Germany and Austria. The downloaded files were renamed to include the name of the university and the year published (if available).

The PDF files now needed to be converted into plain text in order to be compared. The free program pdf2txt, which can be run as a batch job, can be set up to automate this process. Around 10% of the dissertations in the collections downloaded could not be extracted with this tool. Some of the theses were locked, others had the text stored as images, some just produced garbage, so they had to be disregarded.

The result was about 9 GB of plaintext files.

Text crunching

Now with directories of appropriately named text files, the text crunching can begin. The pairwise comparison of the text files can be done with the sim_text algorithm, a powerful open-source text comparison tool developed by Dick Grune & Matty Huntjens1, originally as a tool for finding program code replication in large collections of program files. With the following command, all of the files in a directory can be compared with each other.
# -o Output to out.log
# -d use diff format for output
# -p use percentage format for output
# -t cutoff percentage is 1

# -r minimum run size is 7
# use all files ending in .txt


sim_text -o out.log -d -p -t 1 -r 7 *.txt
Or you can compare all the files in two directories d1 and d2 with themselves and each other:
sim_text -o out.log -d -p -t 1 -r 7 d1\*.txt d2\*.txt
The main point of using sim_text as given above is the use of the -p option. This suppresses the standard output from the algorithm, which consists of long lists of overlapping portions of text and their positions within the text. Instead, only an approximate percentage of the text overlap is printed out, as shown here:

Diss_371.txt consists for 31 % of Diss_45.txt material

The results are sorted by amount of overlap, so that largest overlapping pairs are shown first. However, one must understand that this is only an indication of a possible plagiarism. The two files now must be closely examined, manually. They could be joint work and note that fact in the theses themselves; it could be just the title pages that are on deposit, so of course there will be a lot of similarity between the files; the theses could be quite short, but there is a large overlap in the references used; or they could be copies of the same thesis, just with different file names or from different library servers. These are false positives, and there are many of them. Filtering out the false positives is time-intensive and can only be done manually.

Even when two theses are found with large amounts of text overlap, there is still the question of which author copied and which one was copied from. If the theses are a number of years apart, it could be relatively clear, although it is a problem in Germany in medicine, as the doctoral thesis is often written in parallel with the studies, but can only be handed in when all the coursework has been completed. If they are in the same year, or both defended on the same day, then one cannot really say which one was copied.

Comparison

Once a pair of theses has been identified as candidates for further investigation, it is necessary to directly compare them.  sim_text can also be used for this step, as it also works as an "anti-diff-tool", one that quickly highlights the identical portions of two files so that the differences are very easy to see. VroniPlag Wiki has implemented the algorithm in JavaScript so that it can be run locally (and offline!) in any browser with JavaScript enabled.

One text file is copied into the left hand box, one into the right hand box. The drop-down list is the minimum number of identical words in a run list to be colored, the default is 4. When the button "Texte vergleichen!" (compare texts) is pressed, text which is the same on both sides is colored with the same color on each side, changing colors only when the exact text match terminates. The algorithm has been adapted to ignore punctuation, super- and subscripts, and special characters when matching. 

Output of JavaScript implementation of sim_text with text coloring
If a thesis turns up with many colored parts, it then needs to be fragmented and double-checked – manually, by the researchers at VroniPlag Wiki who have developed a good way of documenting and categorizing text overlap that constitutes plagiarism.

The big picture

Examining thousands of dissertations in this manner can quickly get confusing. In order to be able to see the big picture, for example, to determine if there are groups of theses with common text in one department or other patterns, a graphical representation of the sim_text output can be created. This makes it relatively simple to plot the similarities between dissertations as a collection of graphs.

A simple Python script can be run on the output produced by sim_text in order to create input for Graphviz. Graphviz is a free and open graph-drawing program that takes a standardized input form as a text file and produces graphical output.  Now clusters of overlapping theses (sadly, often with the same supervisor) just pop out visually. The similarities are sorted by degree of overlap, but they still need to be closely examined as above, as many false positives are generated.

Text overlap in dissertations from one faculty at one university

It is not possible to create a graph for all 50,000 dissertations, especially as there are many theses with small amounts of overlap that would just clutter up the graph. Even just for one department there can be very many small commonalities. For example, in one department with 534 dissertations, there were 340 overlaps reported by sim_text with run length of 7, but only 60 of these consisted of more than 5% of the document.

Small amounts of overlap from different theses can, however, add up to quite a substantial amount. The complete output of sim_text for two or more institutions can be loaded into a spreadsheet and then sorted by various criteria. For example, since the data preparation step included a name for the department, cross-cluster text overlap can be isolated and identified. The data can also be sorted by name of the file. If there are a number of text parallels in one thesis from a number of different other ones, the file should be more closely examined. During one investigation in which the theses from the University of Vienna were compared with all of the theses from the other universities, one thesis was identified, Ves, that used portions from quite a number of other dissertations. It turned out that on at least 57% of the pages there was text overlap from dissertations from other universities. 

Time constraints

Since the algorithm compares each thesis pairwise with all of the others, the number of comparisons for such a data set grows quadratically with the number of texts examined. In a first investigation on a simple dual-core laptop with 4 GB main memory, comparing 1,000 theses needed only a few minutes. When about 24,000 were tried on the same machine, the system eventually crashed after 3 days of computation unfortunately without outputting any useful results.

Using a faster computer with 8 GB of main memory, 3,000 theses were compared in 18 minutes, 7,000 needed almost 2 hours. An attempt to compare 29,000 files ran for a bit more than 2 hours before crashing without results.

Since it was possible to compare each departmental cluster with each of the others, it was decided to set up such a sequence of pairwise cluster comparisons. With 44 departments, this meant 946 cluster comparisons (44*43/2) needed to be run, each taking between 10 minutes and 3 hours, depending on the size of the clusters. In all, at least 1.25 billion individual comparisons of pairs of dissertations needed to be made.

It was determined that using quad-core computers it was possible to run sim_text in parallel on each core without interference. So four machines were set up in order to have 16 processes running at the same time. It took about 20 minutes to load the 9 GB of text data locally onto each machine (accessing the network drive would have slowed down the investigation tremendously). A batch file was generated with all the 946 cluster comparisons and then simply split into 16 files. One was loaded onto each core, and the processes started chugging away. 40 hours later, much earlier than expected, the processes had all finished without crashing!

Of course, the logfiles now included an excessive number of duplicates, as the intra-cluster comparisons had been repeated 43 times, bringing the number of individual comparisons of one dissertation to another up to around 50 billion. So the duplicates had to be eliminated before looking at the data. 

Additional investigations
 
Attempts to run sim_text on a Mac computer or under Linux turned up an interesting anomaly. Calculations that run for about 5 minutes on a PC will take just under 2 hours on a Mac, and may never terminate on a Linux. It is not clear why this is so.

The program sim_text also has an option for only comparing new files with "old" ones, that is, those that have already been checked against each other.
sim_text.exe -o output.log -d -p -t 1 -r 7 newdir/*.txt / olddir1/*.txt olddir2/*txt
This command should only compare files in newdir with those in olddir1 and olddir2, not the files in olddir1 with olddir2 and each of those with themselves. However, tests with this option were not conclusive, as somewhat different results, including overlap reported where there actually was none at all, were obtained using this option as opposed to a full comparison. Theoretically, this is what would be needed in order to set up a system for comparing a newly submitted thesis with all of the older ones from the same department. This needs looking into to see why it does not work.

Future work will be seeing if adding more main memory can speed up the process, and trying to work out an alternative algorithm for a Hadoop-based supercomputer.

What have we seen?

At the beginning of the investigation, we suspected that there would be a few theses that used material from other universities. It seems ludicrous that people would actually take some text without attribution, or even entire dissertations from other people from the same university or even the same supervisor and submit it as their own.

We were wrong.

The detailed analysis of Münster and the Charité has to date uncovered three theses that are completely (100% of the pages) taken from other dissertations. Scores of others have used text without reference from others in their research group. There are chains and nets of text overlap that violate the principles of good scientific practice. And these are the only two clusters that have been looked at in detail up until now.

There are theses that no one can have read, or they would have seen the Wikipedia links underlined and embedded in the PDF or the disastrous formatting and layout problems. Or found the large amount of text overlap with the supervisor's own habilitation.

So there is both plagiarism within the faculty and plagiarism from other universities, plagiarism from Internet sources in general and from the Wikipedia in particular.

Why do they do that? I've had one person whose dissertation has been documented on VroniPlag Wiki call and tell me that his supervisor told him to write it like that. There were a few laminated pages attached to the machine he was using for his research, they were told to put that verbatim in their thesis. Do they not realize that they are publishing a scientific document with their names attached that is readable by everyone in the world? Anyone can compare this thesis with other published ones and ask: Why can't they refer to the source?

In conclusion, this investigation was not the result of applying any sort of magic software that ruminated and spat out the offending theses. There was no research money needed, just some free and open software, many dissertations published in Open Access, some university computers otherwise idle over weekends, and some researchers with a good bit of time.

For the universities in question, the conclusion must be: Start reading your dissertations carefully, especially before they are published online! Don't expect to solve the problem quickly by purchasing expensive software, that won't help. Software can only be a tool, and it does not catch everything automatically, and there are some systems out there that are little more than snake oil. Do note that all of these plagiarism cases are not singularities, individual persons who have cheated. The amount of plagiarism found to date points to a systemic problem within the universities which must be solved, the quicker the better. 
 ---------------------------
 1 Dick Grune writes in January 2015: "[Y]ou write that sim_text was developed by Matty Huntjens and me, but that is not correct. I had the idea and wrote the code for comparing C program files. I then extended the code to handle Pascal programs (this was 1986 or so). Matty Huntjens, who was in charge of the C and Pascal programming workshops, wrote a bunch of (Unix) shell scripts to mass-compare workshop hand-ins from several years back, with overwhelming results. Matty and me then (1989) wrote a short paper on these shell scripts and their use."

No comments:

Post a Comment