Wednesday, December 25, 2013

Guest post: Citation-based Plagiarism Detection

I hope to be able to post some of my backlog over the holiday season. This post is by Bela Gipp. I was an external examiner for the doctoral dissertation on Citation-based Plagiarism Detection that he defended in September at the University of Magdeburg. A student project at my university helped Bela develop an interface for his system.
– Debora Weber-Wulff

Can citation patterns help detect heavily disguised plagiarism in academic documents?


by Bela Gipp

A while back, Retraction Watch, a blog on scientific integrity, reported on five plagiarism cases discovered in Neuroscience Letters. Three of the articles were English translations of Chinese originals, and another was an English translation of a French text. None of them acknowledged being a translation.
Translated plagiarism remains one of the most difficult forms of academic misconduct to detect. Since few researchers actively follow the literature in multiple languages, peer review is unlikely to recognize translated plagiarism. Software is largely useless in helping to identify translated plagiarism, because today’s plagiarism detection systems rely on a minimum amount of text similarity to spark suspicion, yet translations typically contain very low or no textual overlap. When documents use different alphabets, e.g. Chinese, Korean, or Russian characters compared to Latin characters, available detection systems stand no chance.
A new approach for plagiarism detection, termed Citation-based Plagiarism Detection (CbPD), goes beyond literal text similarity to detect potential plagiarism. The citation-based approach examines the in‑text placement of academic citations to form a language- and text-independent “fingerprint” of semantic similarity. The practicability of this citation-based approach was initially demonstrated in an analysis of the translated plagiarism in the prominent plagiarism case of K.-T. zu Guttenberg. Recently, a group of researchers in cooperation with students from the HTW-Berlin developed the first citation-based plagiarism detection prototype, “CitePlag”.
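To make the idea of a citation "fingerprint" concrete, here is a minimal sketch in Python. It is not CitePlag's code and may differ from the algorithms the prototype implements. It assumes each document has already been reduced to the ordered list of references it cites, and that the same reference has been resolved to the same identifier in both documents regardless of language. The function names and the normalization by the shorter list are illustrative choices of this sketch.

```python
def longest_common_citation_sequence(doc_a, doc_b):
    """Return one longest run of citations that occur in the same order
    in both documents (other citations may lie in between)."""
    rows, cols = len(doc_a), len(doc_b)
    # table[i][j] holds the length of the longest common subsequence
    # of doc_a[i:] and doc_b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if doc_a[i] == doc_b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one such sequence.
    shared = []
    i = j = 0
    while i < rows and j < cols:
        if doc_a[i] == doc_b[j]:
            shared.append(doc_a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return shared


def citation_order_similarity(doc_a, doc_b):
    """Share of the shorter citation list that reappears, in order,
    in the other document (0.0 to 1.0). Assumed scoring, for illustration."""
    if not doc_a or not doc_b:
        return 0.0
    shared = longest_common_citation_sequence(doc_a, doc_b)
    return len(shared) / min(len(doc_a), len(doc_b))
```

Because only the reference identifiers are compared, the score is unaffected by the language or wording of the surrounding text. A high score is a reason for a human to look more closely, not proof of plagiarism, since two honest papers on the same topic may also cite similar sources in a similar order.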
In the image below, CitePlag visualizes one of the five articles that were retracted from Neuroscience Letters. No textual similarity remains between the two publications, since the plagiarism (left) is a translation into English of the Chinese original (right). The citation-based approach, however, identifies and connects matching citations in a central scrollable column for human inspection. Examine this example for yourself in CitePlag. For more information on the prototype and the algorithms it implements, refer to this publication.
A medical article in the Indian Journal of Urology was recently retracted after the CbPD approach identified a notably high citation pattern overlap with an article published in another journal two years earlier. The citation-based similarities, as well as the text that the retracted article shared with its source, can be examined using the prototype here.


  1. Thanks for sharing this! This sort of detection seems like it could be promising in detecting secondary source abuse (using references from a non-cited source) as well as translation. This seems to happen often in student research essays. I'd love to see this tool made available online or as part of an existing suite like Turnitin.

    1. "secondary source abuse" is not generally accepted as plagiarism at all, for example: (already posted in the comment in

    2. Oh, but this is not about secondary source abuse or citation plagiarism, but the use of citation patterns to detect types of plagiarism (translation plagiarism or structural plagiarism) that would otherwise go unnoticed. If an author checks all of the sources obtained from another source, I do not see a problem. The problem is when the statement is taken without checking. If it turns out to be false or non-existent, then one can see that shortcuts have been taken. Otherwise, I don't see a problem.

    3. I agree that using citations taken over without checking is not the correct way to work, but I don't regard it as plagiarism or cheating in general. The authors of the first and third texts linked above seem to agree that there is no plagiarism in this case. In my opinion, whether it counts as cheating depends on the discipline and the task. If the task of a thesis lies in the analysis and elaboration of other texts, it is of course cheating to take references out of secondary sources without checking them, because the required work has not been done (the recent case of a German politician showed a different point of view in the evaluation of the charges made). If the task of a thesis lies in experimental research, I'd regard the use of unread references as a technical flaw.