Earlier this week, at New York Tech Week, I was at the Founders Salon that Anthropic hosted. Chris Gorgolewski, who works on Memory and Reinforcement Learning at Anthropic, told the room something that a lot of teams want to hear: you do not really need to worry about context window lengths anymore.
I had my hand up for a question but the session ran out of time before I could ask. So I am going to ask it here instead, because it is the kind of claim that quietly reshapes how an entire cohort of founders builds their products.
The short version of my question is this:
"Do not worry about context windows" is true in the narrow sense that you will not run out of context length. But the things that actually decide whether a product works in production are recall quality, cost per turn, latency, and long-term memory, all of which get worse as you use more of the context window.
If so, is the advice not quietly setting teams up to build something slow, expensive, and forgetful? What am I getting wrong?
Let me elaborate.
Bigger windows do not fix the lost-in-the-middle problem
The first thing worth saying is that context window size and context window quality are two different things and only one of them has actually improved.
Researchers at Stanford documented the lost-in-the-middle effect back in 2023 and it has held up remarkably well since. Language models attend strongly to the beginning and the end of their input and poorly to everything in the middle, which means accuracy can drop by more than 30% when the relevant fact sits in the middle of a long context instead of at the edges. This U-shaped bias is structural, not a bug that a bigger window patches over.
More recent work on context rot from Chroma tested 18 frontier models, including GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro, and found every single one degrades as input length grows, often well before the window is anywhere near full. To be fair about it, the same study found Claude models decay the slowest of the group, so Anthropic is genuinely ahead here. But "slowest to degrade" is not "does not degrade." The situation is not "older models had this problem and the new million-token models fixed it." The situation is "the problem scales with the amount of context you actually use." A larger window gives you more room to make this worse, not a guarantee that the model reads the room.
So when someone says do not worry about the window, the honest follow-up is: do not worry about running out of room, sure. But you should worry a lot about what happens to recall once that room is full.
The economics still point the other way
The second thing worth thinking through is cost, and here I want to be careful, because this is easy to read as a motives argument and that is not what I mean.
The way you use a context window is by sending tokens, and on a usage-based model that means your bill scales with how much context you carry on every turn. "Just paste everything in" is the most expensive habit you can build, because you re-pay for the same history on every single call, and that cost grows with conversation length and user count at the same time.
The pricing reflects this. Gemini 2.5 Pro still charges roughly double per input token once a prompt crosses the 200K mark, and when Anthropic first shipped the 1M-token context beta on the Sonnet 4 line, prompts over 200K tokens were billed at 2x the input rate and 1.5x the output rate ($6 and $22.50 per million tokens, against the standard $3 and $15). The newest Claude models have since moved to a flat rate across the full window, which is a real and welcome change and one I would point to as the labs moving in the right direction. But a flat rate does not mean free. You still pay per token, every turn, so the size of the window does not change the economics of how much you choose to put in it.
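To make the tiered billing concrete, here is a minimal sketch that uses only the Sonnet 4 beta figures quoted above. The function name is hypothetical, and it assumes the whole request is billed at the premium rate once the input exceeds 200K tokens, which is my reading and not something this piece verifies.

```python
# Rates in USD per million tokens, taken from the figures quoted in the text.
STANDARD_INPUT, STANDARD_OUTPUT = 3.00, 15.00
LONG_INPUT, LONG_OUTPUT = 6.00, 22.50
LONG_CONTEXT_THRESHOLD = 200_000  # input tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request under the tiered rates above.

    Assumption: once the input exceeds the threshold, every token in the
    request (input and output) is billed at the long-context rate.
    """
    is_long = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = LONG_INPUT if is_long else STANDARD_INPUT
    output_rate = LONG_OUTPUT if is_long else STANDARD_OUTPUT
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    for tokens in (100_000, 300_000):
        print(tokens, round(request_cost_usd(tokens, 1_000), 4))
```

Under these assumptions, tripling the prompt from 100K to 300K tokens raises the cost of the call by nearly six times, because the request also crosses into the premium tier.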
I wrote about this cost stack in more detail in an earlier piece on why your AI agent is a cash guzzler. The summary is that context accumulation is the layer that sneaks up on teams, because it grows with conversation length and user count at the same time, and it does not show up as a problem until you are already overcommitted.
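A rough sketch of why accumulation sneaks up: if the full history is resent on every call, the total input tokens billed over a conversation grow roughly with the square of the number of turns. The function name and the fixed tokens-per-turn figure are hypothetical simplifications, and the sketch ignores prompt caching or summarization, either of which would change the numbers.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed across a conversation when the entire
    history is resent as input on every call.

    tokens_per_turn is an assumed average of new tokens (user message plus
    the previous reply) appended to the history each turn.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # history grows by one turn
        total += history            # the whole history is sent again
    return total


if __name__ == "__main__":
    print(cumulative_input_tokens(10, 500))   # 27500
    print(cumulative_input_tokens(100, 500))  # 2525000
```

With 500 tokens added per turn, ten times as many turns costs about 92 times as many input tokens, and that total is then multiplied by the number of active users.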
Long windows are short-term memory, not long-term memory
A context window, however large, is short-term memory. It holds what is in front of the model right now, for this conversation, and it does so at a cost that grows with how much you put in it. Long-term memory is a different problem. It is the ability to recall the right fact from three weeks ago without re-reading three weeks of transcripts, to resolve that "the new PM" and "Steffi" are the same person, to know which facts are stale and which still hold, and to forget what no longer matters.
You cannot get that by making the window bigger. A bigger window solves short-term memory with unviable tradeoffs, and it does nothing for long-term memory at all. The two are not points on the same line. They are different systems, and treating one as a substitute for the other is how teams end up with agents that are technically working and practically forgetful.
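To illustrate how different that system is from a window, here is a deliberately tiny sketch of the capabilities listed above: alias resolution, keeping the latest value of a fact, and treating old facts as stale. Every name in it (Fact, MemoryStore, link, remember, recall, max_age) is hypothetical. It is not a description of how Maximem or any real memory product works, and a production system would need far more than exact-match aliases and a fixed expiry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass
class Fact:
    entity: str
    attribute: str
    value: str
    recorded_at: datetime


class MemoryStore:
    """Toy long-term memory: facts keyed by (entity, attribute)."""

    def __init__(self, max_age: timedelta):
        self.max_age = max_age  # facts older than this count as stale
        self.aliases: Dict[str, str] = {}
        self.facts: Dict[Tuple[str, str], Fact] = {}

    def link(self, alias: str, canonical: str) -> None:
        """Record that two names refer to the same entity."""
        self.aliases[alias.lower()] = canonical

    def resolve(self, name: str) -> str:
        return self.aliases.get(name.lower(), name)

    def remember(self, name: str, attribute: str, value: str, when: datetime) -> None:
        """Store a fact, keeping only the most recent value per key."""
        key = (self.resolve(name), attribute)
        current = self.facts.get(key)
        if current is None or when >= current.recorded_at:
            self.facts[key] = Fact(key[0], attribute, value, when)

    def recall(self, name: str, attribute: str, now: datetime) -> Optional[str]:
        """Return the stored value, or None if unknown or stale."""
        fact = self.facts.get((self.resolve(name), attribute))
        if fact is None or now - fact.recorded_at > self.max_age:
            return None
        return fact.value


if __name__ == "__main__":
    store = MemoryStore(max_age=timedelta(days=30))
    store.link("the new PM", "Steffi")
    start = datetime(2025, 1, 1)
    store.remember("Steffi", "team", "payments", start)
    print(store.recall("the new PM", "team", start + timedelta(days=21)))
```

The lookup touches one record no matter how many weeks of transcripts produced it, which is the property a larger window cannot give you.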
AI memory is a data pipeline and fast-retrieval problem, not a storage problem alone. Getting it right takes deep, iterative work. This is the problem we take care of at Maximem, and for what it is worth it is why we score 92% on LongMemEval at a P50 latency of 15 ms.
Latency is the part conversational AI cannot wait out
Which brings me to the constraint that "do not worry about context" completely ignores: time.
Processing more tokens takes longer. In a coding workflow you can absorb that, because a developer will happily wait a few seconds for a good answer. In voice AI you cannot. A voice agent has a budget of a few hundred milliseconds before the conversation feels broken, and you cannot spend that budget re-reading a 200K-token history on every turn. Latency is not a tuning detail here. It is the difference between a product and a demo. Even in text conversations, delays can frustrate customers. But the real blow comes with production-grade agentic systems. If you are spinning up swarms of deep-reasoning agents, latency compounds at every step of recall.
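A back-of-the-envelope sketch of that compounding, under loudly stated assumptions: each sequential agent step re-reads the same context, the time to read it is linear in token count, and nothing is cached. The function name, the throughput figure, and the overhead figure are all hypothetical placeholders, not measurements of any model.

```python
def turn_latency_ms(
    context_tokens: int,
    sequential_steps: int,
    prefill_tokens_per_second: float,
    fixed_overhead_ms: float,
) -> float:
    """Estimate time spent reading context across one user-facing turn.

    Assumptions (all hypothetical): every step re-processes the full
    context, processing time scales linearly with tokens, and steps run
    one after another rather than in parallel.
    """
    per_step_ms = fixed_overhead_ms + 1000.0 * context_tokens / prefill_tokens_per_second
    return sequential_steps * per_step_ms


if __name__ == "__main__":
    # Placeholder numbers: compare a small retrieved context with a large pasted one.
    for tokens in (2_000, 200_000):
        print(tokens, turn_latency_ms(tokens, sequential_steps=5,
                                      prefill_tokens_per_second=10_000,
                                      fixed_overhead_ms=50))
```

Whatever the real throughput is, the shape holds: the per-step cost of a large context is multiplied by the number of steps, so a budget of a few hundred milliseconds is spent long before a multi-step agent finishes.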
And most teams are not using the long window in production anyway
The last thing I would point out is that the place these long windows shine, big frontier models doing heavy reasoning, is mostly a coding and development setting. Most teams do not run their most expensive, longest-context model in production for every user request, because the unit economics do not survive it. They use it for coding tasks and route production traffic to cheaper, faster models where the giant window is not even on the table. So the advice optimizes for the one workflow where cost and latency do not bite, and waves away the workflows where they do.
The question, plainly
So, Chris, if you are reading this, here is what I would have asked.
If context rot is real across every frontier model, if the labs price long context at a premium past a threshold, if a window is short-term memory by definition, and if latency makes large contexts a non-starter for voice and most production traffic, then "do not worry about context windows" cannot be the whole story. What part of that do you see differently?
I would actually love to hear the answer. Reframing problems is how this field moves, and I have been wrong before. But until then, I would tell any founder in that room the opposite of what they heard: the window is the thing you can stop expanding. Memory is the thing you still have to build.



