The AI Detection Landscape: A Study

Detecting AI-generated text presents a challenge for organizations in various fields, including academia and news editing. Because Large Language Models (LLMs) can generate highly convincing text from just a few instructions, we are increasingly exposed to information that may not have been conceived or written by humans. While LLMs are invaluable for refining text, we must also acknowledge the issues this evolution raises for the concept of intellectual production. As humans, we rely on cues to assess the trustworthiness of text. With LLM-powered text generation, however, it becomes increasingly difficult to determine whether a text originated from a human and whether it presents accurate or biased ideas and statements.

AI's progress in generating text has made it increasingly difficult to distinguish between human-written and machine-generated content. This poses a significant challenge for companies relying on accurate data annotation and labeling for machine learning training and natural language processing tasks. Various AI detectors have come to market, including OpenAI's now-retracted AI detector, released in early 2023. However, that detector fell short, catching only 26% of AI-generated text, and was decommissioned after just six months. Recent studies also reveal bias in AI detectors against non-native speakers. These complexities underscore how difficult AI detection is and set it apart from other detection tasks.

Solutions currently available on the market use text-based approaches: after being trained on both synthetic and human-written text, they analyze lexical, semantic, or syntactic clues to detect AI-generated text. As described by Appen data scientists Arjun Patel and Phoebe Liu, these solutions have shortcomings in detecting LLM-generated text, which is often very similar to human-written content. Additionally, current detection methods are prone to false alarms and misses. As a result, the risk of undetected AI-generated text being labeled as genuine and reliable further aggravates concerns about the accuracy and credibility of data.

Researchers face significant challenges in detecting AI-generated text due to a multitude of factors:

The constant race between the improving performance of LLMs and the training of AI detectors on new examples, which necessitates frequent retraining of detectors.
The growing accessibility of LLMs to the public, ranging from commercial products to open-source models.
The scarcity of ground-truth datasets that capture humans using text generation tools and the limited understanding of the prevalence of AI-generated text in annotation submissions.
The absence of standardized metrics to evaluate such models.
The lack of transparency in the methodology of third-party models, which is kept undisclosed to prevent adversarial exploits.

Metrics Matter!

When it comes to determining the effectiveness of anything, the main challenge is identifying the right metric. Depending on the metric chosen, it is possible to consider something a success even if it doesn't meet the expected usage requirements. Understanding the different metrics and carefully selecting the one(s) that truly reflect your objective is critical to accurately evaluating success.

Although model accuracy is often regarded as the key metric for assessing performance, it can be misleading in determining whether a model is efficient or not. This is particularly true when dealing with imbalanced data sets or when cost sensitivity is important. For example, misclassifying a text as being generated by an AI when it is actually written by a human can have significant and detrimental consequences for the human author. Model accuracy is typically expressed as the percentage of correct predictions out of the total number of predictions. When working with imbalanced data sets, it is possible to achieve a high accuracy rate while also having a high false positive rate. This is precisely why AI detectors are considered unreliable.
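
The arithmetic behind this can be sketched in a few lines of Python. The counts below are hypothetical, chosen only for illustration (they are not from Appen's experiment), and they assume a data set in which AI-generated texts heavily outnumber human-written ones.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp, tn):
    """Share of human-written texts wrongly flagged as AI-generated."""
    return fp / (fp + tn)


# Hypothetical, imbalanced data set: 900 AI-generated texts, 100 human-written texts.
tp, fn = 880, 20  # AI-generated texts: 880 caught, 20 missed
fp, tn = 60, 40   # human-written texts: 60 wrongly flagged, 40 correctly cleared

acc = accuracy(tp, fp, tn, fn)
fpr = false_positive_rate(fp, tn)
print(f"accuracy = {acc:.2f}, false positive rate = {fpr:.2f}")
# prints: accuracy = 0.92, false positive rate = 0.60
```

In this illustrative case the detector looks strong on accuracy alone, yet it wrongly flags 60 of the 100 human authors.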

Our expectation is that our crowd is generally honest and tends to follow instructions when asked not to use external LLMs for content generation. This means our crowd is mostly composed of well-intentioned individuals with a few bad actors. Therefore, using a model that has high accuracy but also a high false positive rate would be detrimental as it could undermine the trust our contributors have in Appen.

In addition to accuracy, a wide range of indicators can be used, such as area under the curve, false positive rate, true positive rate, and more. Determining the most meaningful indicator depends heavily on the specific use case and context of the AI detector. This is why defining metrics typically requires collaboration between product and data science teams, as it is crucial to align with business needs.

At Appen, we take a conservative approach and prioritize a metric that considers an AI detector effective only if it does not negatively impact humans, particularly the authors of the analyzed text. We aim to evaluate how often AI detectors incorrectly identify texts as AI-generated when they are actually written by humans. This is significant in our human-centric approach because authors whose work is mistakenly labeled as AI-generated have limited or no means to challenge this prediction. Therefore, we closely examine the false positive rate, which is the proportion of human-written texts wrongly identified as AI-generated.
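
Written out under the usual confusion-matrix convention, and assuming "positive" means "predicted AI-generated", the false positive rate and its counterpart, the true positive rate, are as follows. The symbols FP, TN, TP, and FN are standard notation rather than terms used in the article.

```latex
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}
```

Here FP counts human-written texts flagged as AI-generated, TN counts human-written texts correctly cleared, TP counts AI-generated texts correctly flagged, and FN counts AI-generated texts that were missed.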

Appen’s AI Detection Benchmarking Experiment

Recently, Appen Data Scientists Phoebe Liu and Arjun Patel partnered with Appen Senior Product Manager Alice Desthuilliers to conduct an experiment assessing the effectiveness of different market solutions. Their goal was to improve the interpretability of advertised performance figures. Appen's expertise and dedication to curating a purposeful crowd and collecting high-quality human data through well-designed tasks made this experiment possible. Leveraging our own crowd, Appen was able to evaluate the performance of various AI detectors against different benchmarks. The experiment aimed to determine how often an AI detector incorrectly classified human-written text as AI-generated.

[The researchers evaluated four popular market solutions: OpenAI's retracted AI detector as a control, a commercial solution, an open-source solution, and a machine learning-based model developed in-house. Each of these models was tested on Appen's high-quality human data. The results were then benchmarked against a pre-defined baseline of 95% accuracy, representing the expected performance for an efficient AI detector. The experiment concluded that none of the current market solutions met this benchmark, with all models having a false positive rate higher than 10%. This means that these AI detectors are falsely labeling human-written text as AI-generated.]

Crowd Criteria

To conduct our AI detection experiment, our team at Appen assembled a group of 24 contributors with native or near-native fluency in US English. These contributors are based in the US or the Philippines. With this group, we were able to create our control data set.

The Jobs

Appen's team arranged a mix of jobs under two different conditions:

Human: Users were instructed to respond to prompts without any external assistance.
AI: Users were guided to respond to prompts using generative AI like ChatGPT.

Before each task, a training job was conducted to ensure that contributors understood the guidelines and felt comfortable with the task. All prompts were carefully selected from open-source Dolly datasets.

The guidelines were designed to efficiently capture the required data and were kept straightforward. Contributors were asked to draft their work in the Appen Data Annotation Platform (ADAP), provide a text of at least 150 words (as most AI detectors perform better with this minimum length), pay attention to grammar and spelling, avoid harmful or toxic content, and provide correct responses to the prompts. In general, contributors were encouraged to imagine themselves as helpful assistants and to avoid overly personal statements or justifications in their responses.

For jobs that involved the use of AI assistance, contributors had the freedom to choose their preferred language model and were provided with free online web apps as examples.

The Results

Patel, Liu and Desthuilliers collected a total of 636 prompt-response pairs through a combination of seven jobs. Among these, 334 pairs were created using Gen AI tools, while 302 were written by human contributors.

To assess the performance, Appen's Data Science and Product teams selected several widely used APIs known for their advertised efficacy, including:

Sapling AI
GPTZero Sentence Level and Document Level
OpenAI GPT2 Detector, an earlier model from OpenAI that served as a baseline

Each model was evaluated using 5-fold stratified cross-validation. The results were aggregated across all folds, considering metrics such as accuracy, F1 score, false positive rate, and true positive rate.
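
The article does not say how the folds were built or aggregated. One plausible reading, sketched below in plain Python, is to split the data into five folds that each preserve the AI/human ratio, compute the metrics per fold, and average them. The function names and the toy `predictions` list are hypothetical; only the class counts (334 AI, 302 human) come from the article.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds that each keep roughly the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a share of each class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def fold_metrics(labels, predictions, fold):
    """Accuracy, FPR and TPR on one fold (label 1 = AI-generated, 0 = human)."""
    tp = sum(1 for i in fold if labels[i] == 1 and predictions[i] == 1)
    fp = sum(1 for i in fold if labels[i] == 0 and predictions[i] == 1)
    fn = sum(1 for i in fold if labels[i] == 1 and predictions[i] == 0)
    tn = sum(1 for i in fold if labels[i] == 0 and predictions[i] == 0)
    return {
        "accuracy": (tp + tn) / len(fold),
        "fpr": fp / (fp + tn),
        "tpr": tp / (tp + fn),
    }


def cross_validated_metrics(labels, predictions, k=5, seed=0):
    """Average each metric over the k stratified folds."""
    per_fold = [fold_metrics(labels, predictions, fold)
                for fold in stratified_folds(labels, k, seed)]
    return {name: sum(m[name] for m in per_fold) / len(per_fold)
            for name in per_fold[0]}


# Class counts from the article; the predictions are a made-up detector
# that misses every fourth AI text and flags every third human text.
labels = [1] * 334 + [0] * 302
predictions = ([0 if i % 4 == 0 else 1 for i in range(334)]
               + [1 if i % 3 == 0 else 0 for i in range(302)])

folds = stratified_folds(labels)
metrics = cross_validated_metrics(labels, predictions)
print(metrics)
```

Stratifying matters here because the false positive rate is computed only over human-written texts, so every fold needs enough of them for the per-fold estimate to be meaningful.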

Model             Accuracy  F1    FPR   TPR
Sapling           0.62      0.71  0.67  0.90
GPTZero           0.70      0.70  0.26  0.66
GPTZero document  0.61      0.71  0.73  0.91
OpenAI GPT2       0.51      0.31  0.16  0.21

Table 1: Performance metrics for third-party APIs
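
As a rough consistency check on Table 1, the accuracy and F1 columns can be recomputed from the FPR and TPR columns together with the class counts reported earlier (334 AI-generated and 302 human-written responses). The sketch below assumes "positive" means AI-generated and treats the data as one pooled set, so small differences from cross-validation averaging and rounding are expected. The function name `implied_scores` is hypothetical.

```python
N_AI, N_HUMAN = 334, 302  # class counts reported in the article


def implied_scores(tpr, fpr, n_pos=N_AI, n_neg=N_HUMAN):
    """Accuracy and F1 implied by a TPR/FPR pair and the class counts."""
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    tn = n_neg - fp
    accuracy = (tp + tn) / (n_pos + n_neg)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1


# Rows copied from Table 1: model -> (accuracy, F1, FPR, TPR).
table_1 = {
    "Sapling": (0.62, 0.71, 0.67, 0.90),
    "GPTZero": (0.70, 0.70, 0.26, 0.66),
    "GPTZero document": (0.61, 0.71, 0.73, 0.91),
    "OpenAI GPT2": (0.51, 0.31, 0.16, 0.21),
}

for model, (acc, f1, fpr, tpr) in table_1.items():
    implied_acc, implied_f1 = implied_scores(tpr, fpr)
    print(f"{model:17s} accuracy {acc:.2f} vs {implied_acc:.3f}, "
          f"F1 {f1:.2f} vs {implied_f1:.3f}")
```

Under these assumptions, the implied values agree with the reported accuracy and F1 to within roughly 0.01 for every row.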

The results revealed that while some models performed better than others on certain metrics, none of the evaluated AI detection tools met the expected benchmark of 95% accuracy. In fact, the false positive rates in Table 1 ranged from 16% to 73%, highlighting the need for further improvements in AI detection technology.

Comparison with OpenAI Retracted AI Detection Model

Keep in mind that OpenAI released a classifier for ChatGPT, which was reported to have a 26% true positive rate (TPR) at a 9% false positive rate (FPR). Although this model was later retracted and we couldn't evaluate it using our control data set, it's worth noting as one of the few models that claimed to work on real-world data, coming from a top LLM firm. If anyone understands what AI-generated content should be, it's OpenAI!

To gauge how these paid third-party APIs compare with OpenAI's free, retracted model, the Appen Data Science team varied each API's decision threshold and recorded the best TPR achievable on our data while keeping the FPR below 9%.
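
The threshold search described above can be sketched as follows. This is a minimal illustration, not Appen's actual procedure: it assumes each detector returns a score where higher means "more likely AI-generated", and the function names and toy scores are hypothetical.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (FPR, TPR) when texts scoring >= threshold are flagged as AI."""
    flagged = [score >= threshold for score in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


def best_tpr_under_fpr_cap(scores, labels, max_fpr=0.09):
    """Try every observed score as a threshold; keep the one with the
    highest TPR whose FPR stays below max_fpr. Returns None if no
    observed score satisfies the cap."""
    best = None
    for threshold in sorted(set(scores)):
        fpr, tpr = rates_at_threshold(scores, labels, threshold)
        if fpr < max_fpr and (best is None or tpr > best[2]):
            best = (threshold, fpr, tpr)
    return best


# Made-up detector scores: 10 AI-generated texts (label 1), 20 human texts (label 0).
ai_scores = [0.99, 0.97, 0.93, 0.88, 0.75, 0.64, 0.52, 0.41, 0.33, 0.21]
human_scores = [0.95, 0.90, 0.58, 0.45, 0.30, 0.28, 0.25, 0.22, 0.20, 0.18,
                0.15, 0.12, 0.10, 0.09, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02]
scores = ai_scores + human_scores
labels = [1] * len(ai_scores) + [0] * len(human_scores)

default_fpr, default_tpr = rates_at_threshold(scores, labels, 0.5)
best = best_tpr_under_fpr_cap(scores, labels)
print(f"threshold 0.50: FPR {default_fpr:.2f}, TPR {default_tpr:.2f}")
print(f"threshold {best[0]:.2f}: FPR {best[1]:.2f}, TPR {best[2]:.2f}")
```

On this toy data, a threshold of 0.5 flags 70% of the AI texts but also 15% of the human ones. Keeping the FPR under 9% requires raising the threshold to 0.93, where only 30% of the AI texts are caught.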

Model             FPR   TPR
Sapling           0.07  0.05
GPTZero document  0.07  0.15
OpenAI GPT2       0.08  0.15

Table 2: Maximized TPR for FPR below 0.09

Our initial investigation found that all third-party models fell short of OpenAI's retracted classifier, which was reported to have a true positive rate of 26% at a false positive rate of 9%. Among the models, GPTZero showed the most promising performance, with a true positive rate of 13% and a false positive rate of 8%. While certain models achieved impressively high true positive rates in Table 1, reaching 91% in some cases, their false positive rates were alarmingly high, especially for the GPTZero document-level model, which had the highest true positive rate but a false positive rate of 73%. Elevated false positive rates like these pose a significant risk to contributors, the same kind of risk that led OpenAI to retract its latest classifier.

According to Patel, "minimizing false positives is crucial to maintaining trust in the system and ensuring fairness. While true positives are important for catching actual instances of cheating, prioritizing false positive reduction helps strike the delicate balance between accuracy and minimizing harm to innocent individuals."

Interestingly, the sentence-level GPTZero model does not appear in Table 2 because it failed to achieve such a low false positive rate on our dataset. The Sapling model faced a similar issue: to meet the false positive rate requirement, it had to classify nearly all instances as human, which left it with a true positive rate of only 5%. Only the GPTZero document-level classifier performed well, reducing the false positive rate by 3 percentage points compared to the decommissioned OpenAI detector. However, the model identified fewer true positives than OpenAI's solution.

This may indicate a trade-off between minimizing false positives and maximizing true positives in AI detection technology.

Striving for a Safer and Ethical Digital Landscape

Our study highlights the challenges involved in detecting AI-generated content using current techniques. While third-party APIs have shown promising results, they still fall short of expectations and do not yet identify AI-generated text with high accuracy. Further improvements are needed to ensure these systems can accurately and efficiently identify AI-generated content and protect against harmful or deceptive information.

As AI technology continues to advance, detection methods will require constant re-evaluation and updates to keep up with the evolving landscape of AI-generated text. It’s critical that we keep an open mind and embrace new technologies while also being cautious and vigilant to ensure their responsible use. The journey towards effectively detecting and regulating AI-generated content may be a challenging one, but it is an important step towards creating a more responsible and ethical use of AI in our world today.
