For more than 20 years, we’ve relied on plagiarism detectors to safeguard academic integrity at our educational institutions. The plagiarism detection tools on the market today are great evidence-based tools for detecting plagiarism (verbatim copy/paste) from colluding students, the internet, and literature as a whole. However, in this modern age of AI writing powered by tools such as ChatGPT, we’re seeing the age-old problem of ghostwriting grow to new levels. The academic integrity market has been hard at work building “AI Detectors” for the past year to tackle this new form of ghostwriting and academic dishonesty. This article examines the current ghostwriting landscape, surveys empirical tests of these AI detector systems, and provides a rationale for why AI detection is a flawed strategy.
Ghostwriting is not a new phenomenon. Contract cheating – the act of paying money for access to a ghostwriter – has been a problem for decades, but has increased from a historical average of 3.5% of students to 15.7%, as reported in 2014. For more background on the size of this problem, have a look at Nor.Education CEO Roger Larsen’s Spotlight talk at the Online Educa Berlin 2023 conference, elaborating on why academic integrity is at risk: https://youtu.be/IiypjIJeTo4
What the provider market seems to be forgetting is that AI writing is ghostwriting. Read more in our blog post: The problem is Ghostwriting, not generative AI. Generative AI is simply a new ghostwriter that is available at everyone’s fingertips for no cost. AI writing is simply a catalyst for the already existing problem of ghostwriting. AI detectors do not tackle the issue of ghostwriting as a whole – they simply look for a very specific ghostwriter.
For an AI detector, accuracy measures how many times the detection tool is correct in its prediction. If the detector falsely claims an essay has been written by AI, this is counted as an equivalent error to not detecting AI in an essay that was written by AI. If an AI detector is 90% accurate, we do not know if the 10% of wrong predictions are due to honest students being accused of cheating, or if some cheaters were not detected. There could be instances where a 90% accurate AI detector tends to accuse an innocent student every time it makes an error (1 in 10 times), rather than letting a cheater pass.
Accusing an honest student (a false positive in the case of AI detection) is very serious – you could even argue that this is much more serious than letting a cheater pass.
Academic integrity solutions must be incredibly precise in their results (low chance of accusing an honest student of cheating), even at the expense of recall (letting some cheaters pass). The precision of an AI detection tool is the share of essays the tool flags as AI-written that actually are AI-written. If an AI detection tool has 90% precision, 1 out of every 10 essays that the system flags as AI-written was not written by AI. This is different from accuracy, because accuracy does not distinguish between falsely accusing an honest student and not detecting a cheater.
If you are unsure about the meaning of recall and precision, look at this: Precision is the ratio of true positive predictions to the total number of predicted positive instances, reflecting the model's ability to avoid false positives. On the other hand, recall, also known as sensitivity or true positive rate, is the ratio of true positive predictions to the total number of actual positive instances, indicating a model's ability to capture and identify all relevant instances.
When evaluating AI detectors, we need to be aware of not only the accuracy but also the false positives and how these relate to the true positive predictions of the tools.
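To make the difference between accuracy, precision, and recall concrete, here is a minimal Python sketch. The class name, the two detectors, and all the counts are hypothetical and chosen only for illustration; they do not describe any real detection tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a detector; "positive" means "flagged as AI-written"."""
    tp: int  # AI-written essays correctly flagged
    fp: int  # honest essays wrongly flagged (false accusations)
    tn: int  # honest essays correctly passed
    fn: int  # AI-written essays that slipped through

    def accuracy(self) -> float:
        # Share of all predictions that were correct.
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def precision(self) -> float:
        # Share of flagged essays that really were AI-written.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of AI-written essays that were flagged.
        return self.tp / (self.tp + self.fn)


# Hypothetical test set: 100 essays, 50 honest and 50 AI-written.
accuser = ConfusionCounts(tp=50, fp=10, tn=40, fn=0)  # every error is a false accusation
lenient = ConfusionCounts(tp=40, fp=0, tn=50, fn=10)  # every error is a missed cheater

for name, d in [("accuser", accuser), ("lenient", lenient)]:
    print(f"{name}: accuracy={d.accuracy():.2f} "
          f"precision={d.precision():.2f} recall={d.recall():.2f}")
```

Both hypothetical detectors are 90% accurate, yet the first accuses 10 honest students while the second accuses none. Accuracy alone cannot tell them apart; precision and recall can.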
Several studies, such as "Testing of Detection Tools for AI-Generated Text", have concluded that “the available detection tools are neither accurate nor reliable”. This study was conducted by 8 universities, and tested “12 publicly available tools and two commercial systems (including TurnItIn) that are widely used in the academic setting”. TurnItIn also admits that there are issues related to reliability, as reported in the Washington Post: “TurnItIn has acknowledged a reliability problem with AI cheat-detection software used on 38 million student papers”.
The performance of these tools takes a nosedive when faced with deliberately modified texts, such as those generated through paraphrasing or rewriting, rendering them ineffective in real-world scenarios.
The study "Can AI-Generated Text be Reliably Detected?" sheds light on the vulnerability of detection tools against paraphrasing attacks. Even the best detectors struggle to perform better than a random guess when faced with sufficiently advanced language models engaging in paraphrasing. This vulnerability poses a reputational risk to developers and highlights the need for robust defenses against spoofing attacks.
The study finds that “even the best-possible detector may only perform marginally better than a random classifier” when faced with paraphrased content. Further, the study shows how language models “protected by watermarking schemes can be vulnerable against spoofing attacks”. The study author concludes that a “light paraphrase” can “break a whole range of detectors.”
As evidenced in "GPT detectors are biased against non-native English writers", the detectors consistently misclassify non-native English writing samples as AI-generated, raising ethical concerns. However, the study suggests that simple prompting strategies can mitigate this bias, emphasizing the importance of addressing unintentional penalties faced by writers with constrained linguistic expressions.
Likewise, the paper “Evade ChatGPT Detectors via A Single Space” introduces a technique where single characters of a text are modified, reducing detection accuracy from 81.3% to 9.4% in the worst cases. With an accuracy score lower than 50%, the model becomes less reliable than a coin flip.
In the academic realm, the study "Modern threats in academia: evaluating plagiarism and artificial intelligence detection scores of ChatGPT" reveals that even the highly sophisticated GPT-4 can successfully evade AI detectors without significantly altering their readability scores. This raises questions about the robustness of existing detection strategies in keeping up with evolving AI technologies.
OpenAI, a prominent player in the AI landscape and the creator of ChatGPT, introduced an AI detection tool of its own. Unfortunately, as of July 20, 2023, the AI classifier is no longer available due to its low rate of accuracy. When the creators of the generative AI tools (housing the leading AI researchers in the world) are unsure of their ability to detect AI writing, we need to question how long detection tools can keep any advantage in the educational landscape. It is unclear whether these tools provide a sustainable solution in the long term.
Figure 1: 6 out of the Top 10 Custom Writing GPTs in OpenAI’s Marketplace are designed to avoid AI detection and ghostwrite your essay for you.
At the same time (as of March 8, 2024), we’re seeing that some of the most used plugins for OpenAI’s ChatGPT are custom models for avoiding AI detection and ghostwriting essays. Under the “Writing” section of OpenAI’s Custom GPT Marketplace, 6 out of the top 10 plugins (no. 2, 3, 5, 6, 8, 9) are specifically designed to evade AI detection or simply write your essay for you.
The top tool (Humanizer Pro) even claims to avoid detection with a 100% success rate. Several of these tools receive more than 4 out of 5 stars in reviews, suggesting that they work and that their users are happy with the results. This is clear evidence that entering an arms race between AI ghostwriting and AI detection is a flawed strategy. It is also a good indication of the size of the problem, as ghostwriting essays is one of the top use cases of ChatGPT.
Focusing on detecting AI writing is simply the wrong approach because it does not solve the real issue: ghostwriting. The CEO of Instructure (owner of the LMS Canvas) echoed this sentiment live on CNBC, stating that AI detectors have “a lot of false positives” and that detection is “probably the wrong approach”. We can’t build a specific detector for every possible ghostwriter, yet that is essentially what we are doing with AI detectors. It’s not relevant whether or not an AI wrote the text; it’s only relevant whether the student wrote the text.
If we go down this path of solely relying on AI detectors, the detector providers need to beat the wider AI community as a whole in an arms race to build newer and better detectors that tackle the newer and better AI models.
In the past few weeks alone, Google has released Gemini Pro, Mistral has released its GPT-4 competitor, Mistral Large, Meta has continued to improve its Llama series of models, and dozens of new open-source models on par with Google’s and OpenAI’s offerings have appeared on the Open LLM Leaderboard.
TurnItIn even cites itself as being the “most accurate out of 16 detectors” on its website, but acknowledges that it only examined documents written by GPT-3.5 and GPT-4 (gathered Feb 29, 2024).
While we are arguing against the practicalities of AI detection, the linguistic theory behind them is sound. Detection is mostly based on perplexity and burstiness.
As language models simply predict a likely next word sequentially, we can measure how well a model is doing just that. A good language model is very good at predicting the next likely word, which means there is low entropy, or chaos, in the predictions. Perplexity is a measure of this: the lower the perplexity, the more predictable the text is. It turns out that text written by language models is quite predictable, as the models are designed to make it! But it is fairly easy to increase the perplexity of a text (as the numerous methods described above show) without altering its quality and readability.
Burstiness is another important measure, and it is somewhat related to perplexity, as it measures changes in the underlying probability distributions. A human’s writing does not mimic predicting words from a single probability distribution, while that is exactly what a language model does. But again, the examples referred to earlier in the post show how easily the burstiness of a text can be altered.
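As a rough illustration of these two measures, the Python sketch below computes perplexity from per-token probabilities and a simple burstiness proxy. The probabilities are made up, and the burstiness definition used here (the coefficient of variation of per-sentence perplexity) is only one possible choice and an assumption on our part; actual detectors may define and combine these signals differently.

```python
import math
from statistics import mean, pstdev


def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability per token."""
    return math.exp(-mean(math.log(p) for p in token_probs))


def burstiness(sentence_perplexities):
    """Assumed proxy: spread of per-sentence perplexity relative to its mean."""
    return pstdev(sentence_perplexities) / mean(sentence_perplexities)


# Hypothetical probabilities a language model assigned to each token,
# grouped by sentence. These numbers are invented for illustration.
predictable = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]  # evenly predictable text
varied = [[0.5, 0.5, 0.5], [0.1, 0.1, 0.1]]       # one surprising sentence

for name, sentences in [("predictable", predictable), ("varied", varied)]:
    per_sentence = [perplexity(s) for s in sentences]
    print(name,
          [round(x, 2) for x in per_sentence],
          round(burstiness(per_sentence), 2))
```

In this toy setup the evenly predictable text has a burstiness of zero, while the text with one surprising sentence scores well above zero. It also hints at why evasion is easy: rewriting a single sentence is enough to shift both numbers.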
In conclusion, the growing body of evidence indicates that AI detection tools, despite their widespread adoption, are grappling with accuracy issues, ethical concerns, and susceptibility to sophisticated evasion techniques. As educators and institutions navigate the complex landscape of AI in education, a critical examination of these tools' limitations and the exploration of alternative strategies become imperative for fostering a fair and effective learning environment. This is why Nor.Education is building the Authorship Authentication system – designed to address the issue of Ghostwriting (and thereby also AI writing), without relying on a black-box detector. You can read more about the Authorship Authentication method in this blog post: A reliable way to verify student writing and combat ChatGPT/AI writing: Authorship Authentication
This article was authored in cooperation with generative AI.