Why do ChatGPT, Gemini, and Claude forget your chats?

Why do AI models struggle to remember information? The answer lies in something called the context window, and it can be more limiting than you might think.

Every LLM comes with a hard-coded context limit, and reaching this limit means losing out on information. (Image: FreePik)

ChatGPT, Gemini, and Claude are among the most widely used chatbots in the world today. If you have been a regular user of one of these AI chatbots, you may have noticed that they tend to forget things you said earlier in a chat, or start giving confusing, repetitive answers after a long conversation. If you have ever wondered why this happens, it all comes down to one simple concept: the context window.

Author and YouTuber Matt Pocock, in his latest video, talks extensively about how these odd behaviours of AI chatbots are linked to the context window. According to Pocock, the context window is one of the most important (and most misunderstood) limits in how large language models (LLMs) work. To understand the concept, let's break it down into simpler terms.

What is a context window?

Imagine the context window as the short-term memory of an AI model. Each time you write a prompt or question, and every time the model responds, it is processing the text in units known as tokens. A token essentially represents a few characters or part of a word. For example, the word ‘Anonymous’ can be split into ‘anony’ and ‘mous’.
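For readers who want to see tokenisation in practice, here is a minimal sketch using OpenAI's open-source tiktoken library. The exact splits vary by model and tokenizer, so treat the output as illustrative rather than definitive.

```python
# Minimal sketch of tokenisation with the open-source tiktoken library.
# Token boundaries differ between models; this is only an illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several GPT-family models

text = "Anonymous users ask chatbots thousands of questions every minute."
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens")
# Show how the text was split into individual token strings
print([enc.decode([t]) for t in token_ids])
```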

When it comes to LLMs, your messages are the input tokens and the replies are the output tokens, and together they make up the context the model can see at one time. In simple terms, the total number of tokens an AI model can handle in a conversation is its context window size.

For instance, if an AI model has a 200,000-token context window, it can only process and keep track of that many tokens at once. If a conversation goes beyond this limit, older information gets dropped or truncated, and the model effectively forgets parts of the conversation.
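Chat applications typically handle this by dropping the oldest turns once the budget is exceeded. The sketch below illustrates that idea only; the limit, message format, and the rough characters-per-token estimate are assumptions, and real systems use per-model tokenizers and usually keep the system prompt pinned.

```python
# Simplified sketch: keep the newest messages that fit within a token budget,
# dropping the oldest ones first. Numbers and message format are illustrative.
def trim_to_window(messages, max_tokens=200_000,
                   count=lambda m: len(m["content"]) // 4):  # rough ~4 chars/token estimate
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = count(msg)
        if used + cost > max_tokens:
            break                       # everything older is effectively "forgotten"
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [{"role": "user", "content": "..."}]   # a long chat history
visible = trim_to_window(history)                # what the model actually sees
```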

Can’t AI models have infinite memory?

While it would be great if an AI model could remember everything you have ever told it, in practical terms that would be both technically difficult and inefficient. Memory comes at a cost: each additional token the model reads and processes requires computing power and memory, so the bigger the context window, the higher the cost of every request. Interestingly, when AI models are given too much information, they tend to become worse at using it. According to Pocock, finding the right detail inside a huge document or a long chat is like finding a needle in a haystack.

Moreover, there are design limits. Each AI model is built with a fixed architectural limit that dictates how much context it can handle. This is also why you see differences such as Claude 4.5 having a 200K-token limit, Gemini 2.5 Pro flaunting two million, and smaller models like LLaMA or Mistral shipping with limits of just a few thousand tokens.


At a time when companies are engaged in a rat race to expand their models' context windows, Pocock explains that bigger does not automatically translate to better performance.

‘Lost in the middle’ problem

In his short video, Pocock explains that one of the key limitations of context windows is what researchers call the "lost in the middle" problem. To understand this, picture your chat as a long scroll: the AI pays the most attention to the beginning (the instructions or setup) and the end (the most recent messages), while information in the middle of the chat gets less weight. Pocock explains that this happens because of how LLMs use 'attention', the mechanism that decides which words or tokens matter most when predicting the next ones. This, according to Pocock, is similar to the primacy and recency biases in humans, which give more weight to the first and the most recent pieces of information.

For developers, such behaviour can be frustrating during long coding sessions. For example, if a user asks their coding assistant to fix a bug they mentioned 200 lines ago, the assistant may not recall the exact detail, as that information is buried deep in the middle of the context window.

How does this issue affect coding agents?

When developers work with AI coding tools such as Claude Code or GitHub Copilot, they are working within a context window every time. If the chat or codebase fed into the AI gets too large, it can hit the context window limit and the tool may stop responding or miss crucial context. This is also why expert users often reset or compact (summarise) their sessions.


Pocock's point is that the less context a model has to sift through, the fewer 'lost in the middle' problems it runs into: models do better with less, more focused information. This also means that regularly cleaning your coding agent's chats refreshes the agent's memory, and clearing its context window leads to much better performance.

Pocock said that clearing gives users a completely blank slate, while compacting retains the context in a shortened form. Compacting, however, costs tokens and takes time, since the AI must generate the summary. The YouTuber went on to explain why a smaller, cleaner context usually works better.
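As a rough illustration of what compacting means in code, the sketch below summarises the older part of a chat into one short note and keeps only the recent turns. The model name and prompt here are assumptions for the example; tools like Claude Code do this internally with their own prompts.

```python
# Rough sketch of "compacting": summarise older messages and keep only recent turns.
# The model name and summarisation prompt are assumptions, not any tool's actual setup.
from openai import OpenAI

client = OpenAI()

def compact(messages, keep_recent=10):
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable model would do
        messages=[{"role": "user",
                   "content": "Summarise this conversation in a few bullet points:\n"
                              + "\n".join(m["content"] for m in old)}],
    ).choices[0].message.content
    # Many old messages collapse into one short note, freeing up room in the
    # context window at the cost of some detail (and the tokens spent summarising).
    return [{"role": "system", "content": f"Summary of earlier chat: {summary}"}] + recent
```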

Most AI users, and even developers, often assume that a larger context window means better results. This is not always true. Pocock highlights that smaller, well-managed contexts allow models to think more clearly, while too much context can cause the model to mix up instructions, prioritise the wrong details, or generate vague answers.

Today, many advanced AI coding tools let users see their current context usage, in particular how much of the window is consumed by system instructions, user messages, or outputs. This is useful because it prevents a coder from running out of space mid-task. However, Pocock cautions against third-party extensions or tools such as Model Context Protocol (MCP) servers, as they can fill the context window faster: these tools often inject extra data or prompts that consume tokens, leaving less room for the user's own input.
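A developer can also estimate this budget themselves by counting tokens per section of the prompt. The sketch below does so with tiktoken; the section names and placeholder contents are made up purely for illustration, and the 200,000-token limit is an assumed figure.

```python
# Small sketch: estimate how much of a context window each part of a prompt uses.
# Section names, contents, and the window size are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
WINDOW = 200_000  # assumed context limit

sections = {
    "system instructions": "You are a careful coding assistant...",
    "tool definitions": "{...MCP tool schemas injected by extensions...}",
    "chat history": "user: fix the bug in parser.py\nassistant: ...",
}

used = 0
for name, text in sections.items():
    n = len(enc.encode(text))
    used += n
    print(f"{name}: {n} tokens")

print(f"total: {used}/{WINDOW} ({used / WINDOW:.0%} of the window)")
```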

So the real answer to why AI chatbots forget or generate vague responses is that every LLM comes with a hard-coded context limit; anything beyond it gets dropped. And within that limit, AI models tend to prioritise what sits at the beginning and the end of the conversation over what lies in the middle.


One should look at the context window as the AI model's workspace. Experts therefore advise keeping it tidy to get smarter, faster, and more reliable results, no matter which model you use.


Bijin Jose, an Assistant Editor at Indian Express Online in New Delhi, is a technology journalist with a portfolio spanning various prestigious publications. Starting as a citizen journalist with The Times of India in 2013, he transitioned through roles at India Today Digital and The Economic Times, before finding his niche at The Indian Express. With a BA in English from Maharaja Sayajirao University, Vadodara, and an MA in English Literature, Bijin's expertise extends from crime reporting to cultural features. With a keen interest in closely covering developments in artificial intelligence, Bijin provides nuanced perspectives on its implications for society and beyond.
