I tested a ChatGPT detection tool: I wasted my time
According to the author’s test, 41% of the texts produced in whole or in part by GPT-3 were classified as probably having been written by a human being. It is therefore futile to fight AI with AI.
Detecting bullshit is an occupational reflex for me. I have been a journalist for 35 years and a teacher for 15. I sat on my faculty’s academic misconduct committee. I have seen it all. ChatGPT disgusts me as much as it amazes me.
We are told that more in-person assessments are needed. Very good. But tell that to the universities (and academics) that have taken a liking to distance learning. Everyone is looking for magic solutions to make sure the texts we evaluate were not created by ChatGPT or some other automatic writing system.
Testing the machine
I tested a tool that claims to do just that. GPTZero “estimates the probabilities that a document was written by a large language model”. Its creator, Edward Tian, is a student from Toronto. He completed a major in computer science with a minor in journalism at Princeton, and he has worked for Bellingcat, a terrific investigative and data journalism site. It is an inspiring journey that I can identify with.
I tested his tool with a corpus of 900 texts: 300 in French, 300 in English and, as explained below, 300 English translations of the French texts.
In each language, the texts fall into three subgroups:
- 100 articles written by journalists and published in the last five years. The articles in French were harvested from the web by students in my data journalism course. For my part, I harvested the articles in English from the website of The Globe and Mail.
- 100 articles generated in part by GPT-3. I took the first part of other English and French articles and asked GPT-3 to complete them with a prompt along the lines of: “Here is the beginning of an article, whose title is X. Complete it with 1,500 to 2,500 characters, for publication in a Canadian newspaper.”
- 100 articles generated entirely by GPT-3, with a prompt along the lines of: “Write, for publication in a Canadian newspaper, an article of 4,500 to 5,000 characters with the title X.”
In the case of articles generated in whole or in part by GPT-3, the value of “X” was the headline of an actual article published in an English- or French-language newspaper.
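For readers who want to reproduce this step, here is a minimal sketch of how the two generated subgroups might be scripted. It assumes the pre-1.0 `openai` Python package and the `text-davinci-003` model (a GPT-3-family model available at the time); the author’s actual scripts, on his GitHub account, may differ.

```python
# Hypothetical sketch: generating the two GPT-3 subgroups.
# Assumes the openai package (pre-1.0 API) and an API key in the
# OPENAI_API_KEY environment variable; model choice is an assumption.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def complete_article(title: str, opening: str) -> str:
    """Ask GPT-3 to finish a real article from its opening paragraphs."""
    prompt = (
        f"Here is the beginning of an article, whose title is {title}. "
        "Complete it with 1,500 to 2,500 characters, for publication in "
        f"a Canadian newspaper.\n\n{opening}"
    )
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=1024
    )
    return response.choices[0].text.strip()

def generate_article(title: str) -> str:
    """Ask GPT-3 to write a full article from a real headline."""
    prompt = (
        "Write, for publication in a Canadian newspaper, an article of "
        f"4,500 to 5,000 characters with the title {title}."
    )
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=2048
    )
    return response.choices[0].text.strip()
```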
I finally submitted each of these 900 texts to GPTZero for analysis.
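Submitting 900 texts by hand would be tedious, so a batch run is the natural approach. The sketch below assumes GPTZero’s REST endpoint, its `x-api-key` header and the `completely_generated_prob` response field as publicly documented at the time of the test; all three names are assumptions, not guarantees about the current API.

```python
# Hypothetical sketch: scoring one text with GPTZero's API.
# Endpoint URL, header name and response field are assumptions based
# on GPTZero's public API documentation at the time of the test.
import requests

GPTZERO_URL = "https://api.gptzero.me/v2/predict/text"  # assumed endpoint

def score_text(text: str, api_key: str) -> float:
    """Return GPTZero's probability that `text` was produced by an AI."""
    response = requests.post(
        GPTZERO_URL,
        headers={"x-api-key": api_key},
        json={"document": text},
        timeout=60,
    )
    response.raise_for_status()
    # Document-level probability that the text is AI-generated,
    # expressed as a fraction between 0 and 1 (assumed).
    return response.json()["documents"][0]["completely_generated_prob"]
```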
Mixed results
First, in French, the results are pitiful. The creator of GPTZero says that his tool was developed mainly with English texts. That is why I translated my entire French corpus into English.
For each text it analyzes, GPTZero provides, among other things, a probability score that the text was produced by an artificial intelligence system. Based on this score, I classified my translated texts into five categories:
- AI++: it is very likely that the text was produced by an AI system (score above 95.0%)
- AI+: it is likely that the text was produced by an AI system (score between 75.0% and 95.0%)
- ?: unclassifiable (score between 1.0% and 75.0%)
- Hum+: it is likely that the text was produced by a human being (score between 0.00001% [yes, one hundred-thousandth of a percent] and 1.0%)
- Hum++: it is very likely that the text was produced by a human being (score below 0.00001%)
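Expressed in code, the five categories above amount to a few threshold comparisons. A minimal sketch, assuming scores arrive as fractions in [0, 1]; how ties at the exact cutoffs are broken is my assumption:

```python
def classify(prob: float) -> str:
    """Map GPTZero's AI probability (a fraction in [0, 1]) to the five categories."""
    if prob > 0.95:
        return "AI++"   # very likely AI (above 95.0%)
    if prob >= 0.75:
        return "AI+"    # likely AI (75.0% to 95.0%)
    if prob >= 0.01:
        return "?"      # unclassifiable (1.0% to 75.0%)
    if prob >= 1e-7:    # 0.00001% expressed as a fraction
        return "Hum+"   # likely human (0.00001% to 1.0%)
    return "Hum++"      # very likely human (below 0.00001%)
```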
The table below shows how the tool classified the translated texts according to the way they were written.
| GPTZero classification | Journalist | Half journalist, half GPT-3 | GPT-3 | Total |
|---|---|---|---|---|
| AI++ | 1 | 9 | 49 | 59 |
| AI+ | 2 | 11 | 18 | 31 |
| ? | 6 | 14 | 17 | 37 |
| Hum+ | 13 | 24 | 8 | 45 |
| Hum++ | 78 | 42 | 8 | 128 |
| Total | 100 | 100 | 100 | 300 |
The tool does a not-so-bad job. Its creator says he’d rather err on the side of classifying AI-generated text as probably being written by a human than the other way around. My results show that this is indeed what GPTZero did.
But the fact remains that in my sample, 41% of the texts produced in whole or in part by GPT-3 were classified as having probably been written by a human being.
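That figure comes straight from the table: the Hum+ and Hum++ rows account for 66 of the 100 half-and-half texts and 16 of the 100 fully generated ones, or 82 of 200.

```python
# The 41% figure, computed from the table above
half_human = 24 + 42   # half-journalist, half-GPT-3 texts rated Hum+ or Hum++
full_human = 8 + 8     # fully GPT-3 texts rated Hum+ or Hum++
print(f"{(half_human + full_human) / 200:.0%}")  # -> 41%
```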
So I wasted my time, for two reasons. First, the quality of the text generated by today’s large language models makes GPTZero inconsistent: it detects them most of the time, but not always.
Second, the technology is evolving at breakneck speed. I completed my test on the weekend of March 11 and 12; two days later, a more powerful version of GPT, GPT-4, was made public. I have tried it, and for the moment I find it an even more eloquent producer of bullshit than the previous version, which was based on GPT-3.5.
What’s the point of trying to fight AI with AI? The more I try, the more I realize that it is an arms race that leads nowhere. Like every technology before it, automatic writing will take its place in our daily lives. It will be up to us, human beings, to rack our brains to integrate it, as best we can, into our teaching practices, and to legislate if necessary to mitigate its deleterious effects.
In the spirit of open science, the code and data for this experiment are available on the author’s GitHub account.