ChatGPT has made it easy to produce fluent text on a wide range of topics. But how good is that text, really? Language models are prone to factual errors and hallucinations, so readers want to know whether such tools were used to write news articles or other informational text before deciding whether to trust a source. The advancement of these models has also raised concerns about the authenticity and originality of text. Many educational institutions have restricted the use of ChatGPT because of how easily it produces content.
LLMs like ChatGPT generate responses based on patterns in the large amounts of text they were trained on. They do not reproduce responses verbatim but generate new content by predicting the most appropriate next token for a given input. However, their responses can build on and synthesize information from the training data, leading to similarities with existing content. While LLMs are designed to produce original and accurate output, they are not foolproof. Users should exercise discretion and not rely solely on AI-generated content for critical decisions or in situations requiring expert advice.
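To make that prediction step concrete, here is a minimal sketch of next-token prediction using the open GPT-2 model from Hugging Face's transformers library as a stand-in (ChatGPT's weights are not public, so GPT-2 here is an assumption chosen purely for illustration):

```python
# Minimal sketch of next-token prediction with an open model (GPT-2),
# standing in for ChatGPT, whose weights are not public.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Language models are prone to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item()):>15s}  p={prob:.3f}")
```

The model assigns a probability to every token in its vocabulary at each step; generation is simply repeated sampling from these distributions.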
Many detection frameworks, such as DetectGPT and GPTZero, exist to determine whether an LLM generated a piece of text. However, these frameworks falter on datasets they were not originally evaluated on. Researchers from the University of California, Berkeley, present Ghostbuster, a detection method based on structured search and linear classification.
Ghostbuster uses a three-step training process: probability computation, feature selection, and classifier training. First, it converts each document into a series of vectors by computing token-wise probabilities under a series of language models. Second, it performs a structured search over a space of vector and scalar features: it defines a set of operations that combine the probability vectors and selects the most useful of the resulting features. Finally, it trains a simple classifier on the best probability-based features together with a handful of manually selected additional features.
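The sketch below is a simplified illustration of the first two steps, not the authors' code: it computes per-token probabilities under GPT-2 (an assumption standing in for the series of language models the paper actually uses) and derives a few scalar features of the kind the structured search combines:

```python
# Simplified sketch of Ghostbuster's first two steps (not the authors' code):
# (1) per-token probabilities under a language model, (2) example scalar
# features of the kind the structured search explores and combines.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_probs(text: str) -> torch.Tensor:
    """Probability the model assigns to each observed token in the text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Probability of token t+1 given tokens up to t.
    probs = torch.softmax(logits[0, :-1], dim=-1)
    return probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

doc = "Ghostbuster detects AI-generated text with simple features."
p = token_probs(doc)  # one probability vector per (document, model) pair

# Illustrative operations from the kind of space the search explores:
features = {
    "mean_logprob": torch.log(p).mean().item(),  # average log-probability
    "min_prob":     p.min().item(),              # most "surprising" token
    "var_prob":     p.var().item(),              # spread of probabilities
}
print(features)
```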
Ghostbuster’s classifier is trained on combinations of probability-based features chosen via structured search, plus seven additional features based on word length and the highest token probabilities. These additional features are intended to capture qualitative heuristics observed in AI-generated text.
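Once features are extracted, the final step reduces to fitting a simple linear classifier. Here is a minimal sketch with scikit-learn, using synthetic feature vectors in place of the real search-selected and handcrafted features:

```python
# Minimal sketch of the final classification step, assuming feature
# vectors (search-selected + handcrafted) have already been computed.
# The data here is synthetic for illustration; Ghostbuster trains on
# real human- and AI-written documents.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_human = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
X_ai    = rng.normal(loc=0.5, scale=1.0, size=(100, 10))
X = np.vstack([X_human, X_ai])
y = np.array([0] * 100 + [1] * 100)  # 0 = human, 1 = AI-generated

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("P(AI-generated):", clf.predict_proba(X[:2])[:, 1])
```

Because the heavy lifting happens in feature construction, the classifier itself can stay simple and interpretable.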
Ghostbuster’s performance gains over previous models hold up even when the training and testing datasets differ. Ghostbuster averaged 97.0 F1 across all conditions, outperforming DetectGPT by 39.6 F1 and GPTZero by 7.5 F1. It also beat the RoBERTa baseline in every domain except out-of-domain creative writing, and RoBERTa’s out-of-domain performance degraded far more sharply. The F1 score is a commonly used metric for evaluating classification models: it combines precision and recall into a single value and is particularly useful for unbalanced datasets.
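For reference, F1 is the harmonic mean of precision and recall; a quick sketch with scikit-learn on toy labels (the paper reports F1 scaled to 0-100):

```python
# F1 is the harmonic mean of precision and recall:
#   F1 = 2 * (precision * recall) / (precision + recall)
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # toy classifier predictions

p = precision_score(y_true, y_pred)  # 0.75
r = recall_score(y_true, y_pred)     # 0.75
print(2 * p * r / (p + r))           # manual F1: 0.75
print(f1_score(y_true, y_pred))      # matches sklearn's F1
```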
Check out the Paper and Blog post. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook community, Discord channel, and email newsletter, where we share the latest AI research news, interesting AI projects, and much more.
If you like our work, you will love our newsletter.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn lead to technological advancements. He is passionate about understanding nature at a fundamental level using tools such as mathematical models, ML models, and AI.