Then I realized that Thomas Lancaster had submitted his dissertation "Effective and Efficient Plagiarism Detection" in 2003 to London South Bank University, London, UK. It contains an excellent, detailed classification of the plagiarism detection systems available at that time, and a good overview of many of the technical papers published on the topic. The glossary alone is a joy to read, and I have asked for and received permission to repeat portions here. The appendix also includes a number of papers that Lancaster has published or prepared on the topic. In the thesis, Lancaster focuses on a four-stage process for detecting plagiarism:
- Collection stage - The first stage of the four-stage plagiarism detection process. This is where students submit their work to an electronic system so it can later be analysed for similarity.
- Analysis stage - The second stage of the four-stage plagiarism detection process. Here all submissions are compared with each other (for intra-corpal plagiarism detection) or the external sources such as the Web (for extra-corpal plagiarism detection) to find submissions that are similar to each other or the Web sources.
- Confirmation stage - The third stage of the four-stage plagiarism detection process. Here a tutor checks the pairs of student submissions that have been judged to be similar to see if they represent plagiarism or they represent legitimate shared citations or false hits. The tutor decides which pairs will go on to be investigated further.
- Investigation stage - The fourth and final stage of the four-stage plagiarism detection process. This is where pairs of similar submissions have been found and they have been confirmed by human inspection to be similar and possible cases of plagiarism. In this case further evidence is collected, such as student interviews and marked up copies of the submissions and penalties are given.

The glossary also defines the following terms:

- Academic plagiarism - Plagiarism carried out by academics, for instance copying journal articles and submitting them as their own work for possible career development.
- Attribute counting metrics - A count of some property of a single document which might involve tokenisation. This has been redefined to remove the inconsistencies from the literature but is not considered a sensible classification.
- Authorship attribution - The branch of linguistics that aims to calculate the author of a work based on knowledge of works by other known authors. This is not appropriate for plagiarism detection since there is no corpus of known work by a given student.
- Characters Metric - A simple metric that measures the number of sequences of characters of a chosen length two documents have in common.
- Cheating - Unauthorised behaviour that is going against student etiquette when trying for an academic award or to gain an advantage over other students. Examples include plagiarism, use of cribs in exams and paying someone to complete an assignment specification on your behalf.
- Closeness Calculation - A computational part of automated plagiarism detection where a single number is generated from a number of different metrics to decide how similar two submissions are.
- Contractive plagiarism - Plagiarism where the source is larger than the copy and hence the source has been reduced in some way to create the student submission.
- Corpal Metrics - A multi-dimensional metric that is a measure of a property of an entire corpus, for instance the proportion of submissions using a given keyword.
- Collusion - Where two students discuss and work on an assignment specification together and complete elements of their final submissions together. This might be judged to be intra-cor[p]al plagiarism.
- Direct copy - Two student submissions that are identical to one another with no attempt at disguise. One is a direct copy of the other.
- Disguise - Where a student has attempted to change a source and hand it in as their own submission so that the use of the original source won't be noticed.
- Expansive plagiarism - Plagiarism where the source has been extended, either by adding new thoughts or adding filler words and phrases to make a student submission.
- Extra-corpal plagiarism - Plagiarism where the plagiarism source is outside the corpus of student submissions, for instance a Web site or material from a book.
- False hits - Pairs of submissions that are ranked high enough for a tutor to investigate them but are judged to be dissimilar, thus being a waste of tutor time.
- Free text plagiarism - Plagiarism that has been done in natural language, for instance, altering the words of another writer and presenting it as your own work.
- Hybrid metric systems - A system that uses a combination of both attribute counts and structure metrics to find similar submissions. This has been defined to remove the inconsistencies from the literature but is not considered a sensible method of
- Intra-corpal plagiarism - Plagiarism entirely within a corpus, primarily meaning two students who have copied from one another.
- Missed pairs - A pair of submissions that contains plagiarism but is not automatically ranked in the upper portion of an ordered list of similar pairs and hence not investigated further by a tutor.
- Mosaic plagiarism - Plagiarism where chunks from different sources are used and rearranged in a way that could be considered like a mosaic is created from combining and arranging different pictures.
- Multiply sourced - A student submission or external source that has been used in multiple student submissions.
- Ostrich plagiarism policy - Where an academic institution states that plagiarism does not exist in their institution and has no formal way of dealing with it.
- Paraphrasing - Using the ideas of another but rewriting them in your own words without suitable and continual acknowledgement.
- Plagiarism - Taking the words or ideas of another and presenting them as your own without suitable acknowledgement.
- Proactive plagiarism policy - A policy of an academic institution where plagiarism is actively sought out on a regular basis, perhaps by using automated detection methods and cases are followed up when they are found.
- Professional plagiarism - Plagiarism in a professional setting, for instance copying an internal report or company Web page from another source or using a service that writes standard CVs or job applications.
- Reactive plagiarism policy - The academic policy where plagiarism is not actively sought out but is taken seriously and followed up when it is identified during the course of marking.
- Similarity - Where two submissions have words or ideas in common they are said to be similar. When they have been looked at by a tutor they may also be judged to be
- Singularly sourced - A plagiarism source that has been copied from once only.
- Source code plagiarism - Plagiarism of source code submissions, where two students have handed in programs where one has been derived from the other in some way. Detecting this is a well understood area since the constrained language reduces the number of possibilities that must be checked.
- Structural Metrics - A metric that measures a property of one or more submissions where knowledge of the structure of the documents is needed.
- Synthetic corpus - A corpus of documents that have been generated using synthetic means by taking sequences of words or characters in a known and defined order.
- Thesaurising - A technique for plagiarism where words in a source are replaced by synonyms or changed in such a way that the submission makes the same points but the intention is that the plagiarism will not be discovered.
- Visual metrics - A metric which is based on some property of the similarity visualisation that would be generated for a given pair of student submissions.
- Words Pair Metric - A simple metric that measures the number of sequences of word pairs in common between two documents. Identified as the most effective simple
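
The Characters Metric and Words Pair Metric defined in the glossary can be sketched in a few lines of Python. This is only one illustrative reading of those definitions, not Lancaster's implementation: the function names, the default sequence length of 5, the lower-casing, the whitespace tokenisation, and the choice to count each distinct shared sequence once are all assumptions made for the example.

```python
from __future__ import annotations


def char_sequences(text: str, n: int) -> set[str]:
    """Return every distinct run of n consecutive characters in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def characters_metric(doc_a: str, doc_b: str, n: int = 5) -> int:
    """Count distinct length-n character sequences the two documents share.

    Assumption: each shared sequence is counted once, however often it occurs.
    """
    return len(char_sequences(doc_a, n) & char_sequences(doc_b, n))


def word_pairs(text: str) -> set[tuple[str, str]]:
    """Return every distinct pair of adjacent words in text.

    Assumption: words are lower-cased and split on whitespace only.
    """
    words = text.lower().split()
    return set(zip(words, words[1:]))


def word_pairs_metric(doc_a: str, doc_b: str) -> int:
    """Count distinct adjacent word pairs the two documents share."""
    return len(word_pairs(doc_a) & word_pairs(doc_b))


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat lay on the mat"
    print(characters_metric(a, b))
    print(word_pairs_metric(a, b))
```

In a real system the raw counts would probably be normalised by document length before being fed into something like the closeness calculation, so that long submissions do not look similar merely because they contain more text; that step is left out here.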