Unsafe at any seed

Why ChatGPT and Bing Chat are so good at making things up

A look inside the hallucinating artificial minds of the famous text prediction bots.

Benj Edwards – Apr 6, 2023 3:58 PM

There's something about the way this applicant writes that I can't put my finger on... Credit: Aurich Lawson | Getty Images

Over the past few months, AI chatbots like ChatGPT have captured the world's attention due to their ability to converse in a human-like way on just about any subject. But they come with a serious drawback: They can present convincing false information easily, making them unreliable sources of factual information and potential sources of defamation.

Why do AI chatbots make things up, and will we ever be able to fully trust their output? We asked several experts and dug into how these AI models work to find the answers.

“Hallucinations”—a loaded term in AI

AI chatbots such as OpenAI's ChatGPT rely on a type of AI called a "large language model" (LLM) to generate their responses. An LLM is a computer program trained on millions of text sources that can read and generate "natural language" text—language as humans would naturally write or talk. Unfortunately, they can also make mistakes.

In academic literature, AI researchers often call these mistakes "hallucinations." But that label has grown controversial as the topic becomes mainstream because some people feel it anthropomorphizes AI models (suggesting they have human-like features) or gives them agency (suggesting they can make their own choices) in situations where that should not be implied. The creators of commercial LLMs may also use hallucinations as an excuse to blame the AI model for faulty outputs instead of taking responsibility for the outputs themselves.

Still, generative AI is so new that we need metaphors borrowed from existing ideas to explain these highly technical concepts to the broader public. In this vein, we feel the term "confabulation," although similarly imperfect, is a better metaphor than "hallucination." In human psychology, a "confabulation" occurs when someone's memory has a gap and the brain convincingly fills in the rest without intending to deceive others. ChatGPT does not work like the human brain, but the term "confabulation" arguably serves as a better metaphor because there's a creative gap-filling principle at work, as we'll explore below.

The confabulation problem

It's a big problem when an AI bot generates false information that can potentially mislead, misinform, or defame. Recently, The Washington Post reported on a law professor who discovered that ChatGPT had placed him on a list of legal scholars who had sexually harassed someone. But it never happened—ChatGPT made it up. The same day, Ars reported on an Australian mayor who allegedly found that ChatGPT claimed he had been convicted of bribery and sentenced to prison, a complete fabrication.

Shortly after ChatGPT's launch, people began proclaiming the end of the search engine. At the same time, though, many examples of ChatGPT's confabulations began to circulate on social media. The AI bot has invented books and studies that don't exist, publications that professors didn't write, fake academic papers, false legal citations, non-existent Linux system features, unreal retail mascots, and technical details that don't make sense.

And yet, counterintuitively, for all of ChatGPT's casual fibbing, its relative resistance to confabulation is why we're even talking about it today. Some experts note that ChatGPT was technically an improvement over vanilla GPT-3 (its predecessor model) because it could refuse to answer some questions or let you know when its answers might not be accurate.

"A major factor in Chat's success is that it manages to suppress confabulation enough to make it unnoticeable for many common questions," said Riley Goodside, an expert in large language models who serves as staff prompt engineer at Scale AI. "Compared to its predecessors, ChatGPT is notably less prone to making things up."

If used as a brainstorming tool, ChatGPT's logical leaps and confabulations might lead to creative breakthroughs. But when used as a factual reference, ChatGPT could cause real harm, and OpenAI knows it.

Not long after the model's launch, OpenAI CEO Sam Altman tweeted, "ChatGPT is incredibly limited, but good enough at some things to create a misleading impression of greatness. It's a mistake to be relying on it for anything important right now. It’s a preview of progress; we have lots of work to do on robustness and truthfulness." In a later tweet, he wrote, "It does know a lot, but the danger is that it is confident and wrong a significant fraction of the time."

What's going on here?

How ChatGPT works

An AI-generated image of a chatbot hovering in the library, as one does. Credit: Benj Edwards / Stable Diffusion

To understand how a GPT model like ChatGPT or Bing Chat confabulates, we have to know how GPT models work. While OpenAI hasn't released the technical details of ChatGPT, Bing Chat, or even GPT-4, we do have access to the research paper that introduced their precursor, GPT-3, in 2020.

Researchers build (train) large language models like GPT-3 and GPT-4 by using a process called "unsupervised learning," which means the data they use to train the model isn't specially annotated or labeled. During this process, the model is fed a large body of text (millions of books, websites, articles, poems, transcripts, and other sources) and repeatedly tries to predict the next word in every sequence of words. If the model's prediction is close to the actual next word, the neural network updates its parameters to reinforce the patterns that led to that prediction.

Conversely, if the prediction is incorrect, the model adjusts its parameters to improve its performance and tries again. This process of trial and error, through a technique called "backpropagation," allows the model to learn from its mistakes and gradually improve its predictions during the training process.
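As a rough illustration of that predict-and-adjust loop, here is a minimal Python sketch. It is not how GPT models are actually built: it assumes a tiny made-up corpus and a single table of scores (one row per previous word) instead of a deep neural network, so each adjustment is a direct gradient step, with no layers to backpropagate through. The names train_next_word_model and predict_next are hypothetical.

```python
import math


def train_next_word_model(text, epochs=300, learning_rate=0.1):
    """Learn scores for 'which word follows which' by repeated prediction."""
    words = text.split()
    vocab = sorted(set(words))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # scores[prev][nxt] is the model's raw score for nxt following prev.
    scores = [[0.0] * size for _ in range(size)]
    pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]

    for _ in range(epochs):
        for prev, actual_next in pairs:
            row = scores[prev]
            # Turn raw scores into a probability for every candidate word.
            top = max(row)
            exps = [math.exp(s - top) for s in row]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Nudge scores: raise the word that really came next,
            # lower the others in proportion to how strongly they were predicted.
            for j in range(size):
                target = 1.0 if j == actual_next else 0.0
                row[j] -= learning_rate * (probs[j] - target)
    return vocab, index, scores


def predict_next(word, vocab, index, scores):
    """Return the highest-scoring next word for a given previous word."""
    row = scores[index[word]]
    best = max(range(len(row)), key=row.__getitem__)
    return vocab[best]


# Toy corpus (an assumption for illustration only).
corpus = "mary had a little lamb . mary had a little lamb . mary had a baby"
vocab, index, scores = train_next_word_model(corpus)
# "little" follows "a" more often than "baby" does in this corpus,
# so the trained scores should favor it.
print(predict_next("a", vocab, index, scores))
```

The sketch only remembers which word tends to follow which. Nothing in it marks a continuation as true or false, only as more or less likely.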

As a result, GPT learns statistical associations between words and related concepts in the data set. Some people, like OpenAI Chief Scientist Ilya Sutskever, think that GPT models go even further than that, building a sort of internal reality model so they can predict the next best token more accurately, but the idea is controversial. The exact details of how GPT models come up with the next token within their neural nets are still uncertain.

"what does it mean to predict the next token well enough? ... it means that you understand the underlying reality that led to the creation of that token"

excellent explanation by @ilyasut, and thoughts on the crucial question: how far can these systems extrapolate beyond human? pic.twitter.com/v8zFQWvxWY
— Scott Swingle (@bio_bootloader) March 28, 2023

In the current wave of GPT models, this core training (now often called "pre-training") happens only once. After that, people can use the trained neural network in "inference mode," which lets users feed an input into the trained network and get a result. During inference, the input sequence for the GPT model is always provided by a human, and it's called a "prompt." The prompt determines the model's output, and altering the prompt even slightly can dramatically change what the model produces.

For example, if you prompt GPT-3 with "Mary had a," it usually completes the sentence with "little lamb." That's because there are probably thousands of examples of "Mary had a little lamb" in GPT-3's training data set, making it a sensible completion. But if you add more context in the prompt, such as "In the hospital, Mary had a," the result will change and return words like "baby" or "series of tests."
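To make the effect of added context concrete, here is a small Python sketch that counts which word follows each run of preceding words in a made-up corpus, then completes a prompt using the longest context it has seen before. This is a simple counting model standing in for GPT-3, not GPT-3 itself. The corpus and the names build_table and complete are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_table(sentences, max_context=6):
    """Count the next word for every context of up to max_context words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break
                table[tuple(words[i - n:i])][words[i]] += 1
    return table


def complete(table, prompt, max_context=6):
    """Pick the most common next word after the longest known context."""
    words = prompt.lower().replace(",", "").split()
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        if context in table:
            return table[context].most_common(1)[0][0]
    return None  # The prompt shares no context with the corpus.


# Made-up training sentences (hypothetical, for illustration only).
sentences = [
    "mary had a little lamb",
    "mary had a little lamb",
    "mary had a little lamb",
    "in the hospital mary had a baby",
    "in the hospital mary had a series of tests",
]
table = build_table(sentences)
print(complete(table, "Mary had a"))
print(complete(table, "In the hospital, Mary had a"))
```

With the short prompt, the nursery rhyme dominates the counts. With the longer prompt, only the hospital sentences match the full context, so the completion changes.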

Here's where things get a little funny with ChatGPT, since it's framed as a conversation with an agent rather than just a straight text-completion job. In the case of ChatGPT, the input prompt is the entire conversation you've been having with ChatGPT, starting with your first question or statement and including any specific instructions provided to ChatGPT before the simulated conversation even began. Along the way, ChatGPT keeps a running short-term memory (called the "context window") of everything it and you have written, and when it "talks" to you, it is attempting to complete the transcript of a conversation as a text-completion task.

A diagram showing how GPT conversational language model prompting works. Credit: Benj Edwards / Ars Technica
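The transcript-completion framing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: OpenAI has not published ChatGPT's actual prompt format, so the "User:" and "Assistant:" labels, the word-based budget (real systems count tokens), and the function name build_prompt are all hypothetical.

```python
def build_prompt(instructions, turns, max_words=50):
    """Assemble one text-completion prompt from a conversation.

    instructions: text placed before the conversation begins.
    turns: list of (speaker, text) tuples, oldest first.
    max_words: stand-in for the context window size.
    """
    header = instructions.strip()
    # Reserve room for the instructions and the trailing "Assistant:" cue.
    budget = max_words - len(header.split()) - 1
    kept = []
    # Walk backward so the most recent turns are kept first.
    for speaker, text in reversed(turns):
        line = f"{speaker}: {text}"
        cost = len(line.split())
        if cost > budget:
            break  # Older turns fall out of the window.
        kept.append(line)
        budget -= cost
    kept.reverse()
    # Ending on "Assistant:" asks the model to write the next reply.
    return "\n".join([header, *kept, "Assistant:"])


conversation = [
    ("User", "Tell me about a nursery rhyme."),
    ("Assistant", "Mary had a little lamb is a well-known one."),
    ("User", "Who is Mary?"),
]
print(build_prompt("You are a helpful assistant.", conversation))
```

Once the conversation outgrows the budget, the oldest turns are dropped, so the model no longer sees them when it writes its next reply.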

Additionally, ChatGPT is different from vanilla GPT-3 because it has also been trained on transcripts of conversations written by humans. "We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant," wrote OpenAI in its initial ChatGPT release page. "We gave the trainers access to model-written suggestions to help them compose their responses."

ChatGPT has also been tuned more heavily than GPT-3 using a technique called "reinforcement learning from human feedback," or RLHF, where human raters ranked ChatGPT's responses in order of preference, then fed that information back into the model. Through RLHF, OpenAI was able to instill in the model the goal of refraining from answering many questions it cannot answer reliably. This has allowed ChatGPT to produce coherent responses with fewer confabulations than the base model. But inaccuracies still slip through.

Why ChatGPT confabulates

Natively, there is nothing in a GPT model's raw data set that separates fact from fiction. That guidance comes from a) the prevalence of accurate content in the data set, b) recognition of factual information in the results by humans, or c) reinforcement learning guidance from humans that emphasizes certain factual responses.

The behavior of LLMs is still an active area of research. Even the researchers who created these GPT models are still discovering surprising properties of the technology that no one predicted when they were first developed. GPT's abilities to do many of the interesting things we are now seeing, such as language translation, programming, and playing chess, were a surprise to researchers at one point (for an early taste of that, check out 2019's GPT-2 research paper and search for the term "surprising").

So when we ask why ChatGPT confabulates, it's difficult to pinpoint an exact technical answer. And because there is a "black box" element of the neural network weights, it's very difficult (if not impossible) to predict their exact output given a complex prompt. Still, we know some basic things about why confabulation happens.

Key to understanding ChatGPT's confabulation ability is understanding its role as a prediction machine. When ChatGPT confabulates, it is reaching for information or analysis that is not present in its data set and filling in the blanks with plausible-sounding words. ChatGPT is especially good at making things up because of the superhuman amount of data it has to work with, and its ability to glean word context so well helps it place erroneous information seamlessly into the surrounding text.

"I think the best way to think about confabulation is to think about the very nature of large language models: The only thing they know how to do is to pick the next best word based on statistical probability against their training set," said Simon Willison, a software developer who often writes on the topic.

In a 2021 paper, a trio of researchers from the University of Oxford and OpenAI identified two major types of falsehoods that LLMs like ChatGPT might produce. The first comes from inaccurate source material in its training data set, such as common misconceptions (e.g., "eating turkey makes you drowsy"). The second arises from making inferences about specific situations that are absent from its training material (data set); this falls under the aforementioned "hallucination" label.

How readily a GPT model makes a wild guess depends partly on a property that AI researchers call "temperature," which is often characterized as a "creativity" setting. If the temperature is set high, the model picks less likely words more often and guesses more wildly; if it's set low, it sticks to the most probable words and its output becomes nearly deterministic.
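A short Python sketch shows how a temperature setting reshapes the odds of each candidate word. Dividing scores by a temperature before converting them to probabilities is a standard sampling technique, but the candidate words and scores below are invented for illustration, and the function names are hypothetical.

```python
import math
import random


def probabilities(scores, temperature):
    """Convert raw scores to probabilities, scaled by temperature (> 0)."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_word(words, scores, temperature, rng):
    """Draw one word at random according to the temperature-scaled odds."""
    weights = probabilities(scores, temperature)
    return rng.choices(words, weights=weights, k=1)[0]


# Invented candidates and scores for the prompt "Mary had a".
words = ["little", "baby", "spaceship"]
scores = [2.0, 1.0, 0.1]

for temperature in (0.1, 1.0, 10.0):
    odds = probabilities(scores, temperature)
    print(temperature, [round(p, 3) for p in odds])

rng = random.Random(0)  # Fixed seed so repeated runs draw the same samples.
print([sample_word(words, scores, 10.0, rng) for _ in range(5)])
```

At a low temperature, nearly all of the probability lands on the top-scoring word. At a high temperature, the odds flatten out and unlikely words get picked far more often.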

Recently, Microsoft employee Mikhail Parakhin, who works on Bing Chat, tweeted about Bing Chat's tendency to hallucinate and what causes it. "This is what I tried to explain previously: hallucinations = creativity," he wrote. "It tries to produce the highest probability continuation of the string using all the data at its disposal. Very often it is correct. Sometimes people have never produced continuations like this."

Parakhin said that those wild creative leaps are what make LLMs interesting. "You can clamp down on hallucinations, and it is super-boring," he wrote. "[It] answers 'I don't know' all the time or only reads what is there in the Search results (also sometimes incorrect). What is missing is the tone of voice: it shouldn't sound so confident in those situations."

Balancing creativity and accuracy is a challenge when it comes to fine-tuning language models like ChatGPT. On the one hand, the ability to come up with creative responses is what makes ChatGPT such a powerful tool for generating new ideas or unblocking writer's block. It also makes the models sound more human. On the other hand, accuracy to the source material is crucial when it comes to producing reliable information and avoiding confabulation. Finding the right balance between the two is an ongoing challenge for the development of language models, but it's one that is essential to produce a tool that is both useful and trustworthy.

There's also the issue of compression. During the training process, GPT-3 considered petabytes of information, but the resulting neural network is only a fraction of that in size. In a widely read New Yorker piece, author Ted Chiang called this a "blurry JPEG of the web." That means a large portion of the factual training data is lost, but GPT-3 makes up for it by learning relationships between concepts that it can later use to reformulate new permutations of these facts. Like a human with a flawed memory working from a hunch of how something works, it sometimes gets things wrong. And, of course, if it doesn't know the answer, it will give its best guess.

We cannot forget the role of the prompt in confabulations. In some ways, ChatGPT is a mirror: It gives you back what you feed it. If you feed it falsehoods, it will tend to agree with you and "think" along those lines. That's why it's important to start fresh with a new prompt when changing subjects or experiencing unwanted responses. And ChatGPT is probabilistic, which means it's partially random in nature. Even with the same prompt, what it outputs can change between sessions.

All this leads to one conclusion, one that OpenAI agrees with: ChatGPT, as it is currently designed, is not a reliable source of factual information and cannot be trusted as such. "ChatGPT is great for some things, such as unblocking writer's block or coming up with creative ideas," said Dr. Margaret Mitchell, researcher and chief ethics scientist at AI company Hugging Face. "It was not built to be factual and thus will not be factual. It's as simple as that."

Can the fibbing be fixed?

Trusting an AI chatbot's generations blindly is a mistake, but that may change as the underlying technology improves. Since its release in November, ChatGPT has already been upgraded several times, and some upgrades included improvements in accuracy—and also an improved ability to refuse to answer questions it doesn't know the answers to.

So how does OpenAI plan to make ChatGPT more accurate? We reached out to OpenAI multiple times on this subject over the past few months and received no response. But we can pull out clues from documents OpenAI has released and news reports about the company's attempts to guide ChatGPT's alignment with human workers.

As previously mentioned, one of the reasons ChatGPT has been so successful is its extensive training using RLHF. As OpenAI explains, "To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior and rank several outputs from our models. We then use this data to fine-tune GPT-3."

OpenAI's Sutskever believes that additional training through RLHF can fix the hallucination problem. "I'm quite hopeful that by simply improving this subsequent reinforcement learning from human feedback step, we can teach it to not hallucinate," Sutskever said in an interview with Forbes earlier this month.

He continued:

The way we do things today is that we hire people to teach our neural network to behave, to teach ChatGPT to behave. You just interact with it, and it sees from your reaction, it infers, oh, that's not what you wanted. You are not happy with its output. Therefore, the output was not good, and it should do something differently next time. I think there is a quite high chance that this approach will be able to address hallucinations completely.

Others disagree. Yann LeCun, chief AI scientist at Meta, believes hallucination issues will not be solved by the current generation of LLMs that use the GPT architecture. But there is a quickly emerging approach that may bring a great deal more accuracy to LLMs with the current architecture.

"One of the most actively researched approaches for increasing factuality in LLMs is retrieval augmentation—providing external documents to the model to use as sources and supporting context," said Goodside. With that technique, he explained, researchers hope to teach models to use external search engines like Google, "citing reliable sources in their answers as a human researcher might, and rely less on the unreliable factual knowledge learned during model training."

Bing Chat and Google Bard do this already by roping in searches from the web, and soon, a browser-enabled version of ChatGPT will as well. Additionally, ChatGPT plugins aim to supplement GPT-4's training data with information it retrieves from external sources, such as the web and purpose-built databases. This augmentation is similar to how a human with access to an encyclopedia will be more factually accurate than a human without one.
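The basic shape of retrieval augmentation can be sketched in Python. This is a simplified illustration, not how Bing Chat, Bard, or ChatGPT plugins work internally: it assumes a tiny in-memory list of made-up documents and ranks them by shared words, where a real system would call a search engine or database. The names tokenize, retrieve, and build_augmented_prompt are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of plain words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question."""
    query_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_augmented_prompt(question, documents, k=2):
    """Place retrieved sources ahead of the question in the prompt."""
    sources = retrieve(question, documents, k)
    lines = [
        f"[{number}] {doc['title']}: {doc['text']}"
        for number, doc in enumerate(sources, start=1)
    ]
    return "\n".join([
        "Answer using only the sources below and cite them by number.",
        *lines,
        f"Question: {question}",
        "Answer:",
    ])


# Made-up documents (hypothetical, for illustration only).
documents = [
    {"title": "Library hours (made up)",
     "text": "The Example City library opens at 9 am on weekdays."},
    {"title": "Bakery menu (made up)",
     "text": "The Example City bakery sells rye bread and bagels."},
]
print(build_augmented_prompt("When does the library open?", documents, k=1))
```

The model still completes text as before, but the text it completes now contains the relevant source, so it has less need to fill gaps from memory.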

Also, it may be possible to train a model like GPT-4 to be aware of when it is making things up and adjust accordingly. "There are deeper things one can do so that ChatGPT and similar are more factual from the start," said Mitchell, "including more sophisticated data curation and the linking of the training data with 'trust' scores, using a method not unlike PageRank... It would also be possible to fine-tune the model to hedge when it is less confident in the response."

So while ChatGPT is currently in hot water over its confabulations, there may be a way out ahead. For the sake of a world that is beginning to rely on these tools as essential assistants (for better or worse), an improvement in factual reliability cannot come soon enough.

Benj Edwards Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
