The Memory Problem: Why LLMs Need More Than Just Bigger Context Windows
Just as computers don't rely exclusively on RAM, LLMs shouldn't rely exclusively on context windows
One of the most useful abilities of LLMs is working with very large documents. Being able to drop a large amount of material into the context window and get a quick summary, analysis, or comparison across the entire prompt is very powerful. There are basically two things at play here:
Superhuman reading speed - LLMs can ingest large amounts of text far faster than humans, and more cheaply!
Superhuman working memory - LLMs can keep track of info across their entire context window
The key reason this became possible is the expansion of models' context sizes without sacrificing recall of specific information (as measured by needle-in-a-haystack evals). This was an incredible feat! But now that it has been achieved, new limitations come into focus:
Long context is still limited (really large collections of documents won't fit)
Long context is very expensive (it takes a lot of time to process and a lot of compute to serve)
Performance can degrade with very long prompts, even when simple recall remains good
While modern models can handle impressive context lengths (some reaching a million tokens, enough to fit all the Harry Potter books), context is still finite: you can't upload every AI paper from 2024, for example. Attention costs also grow quadratically with context length, making very long contexts increasingly impractical. And even when models show perfect recall on simple tests, their ability to meaningfully use all that information seems to degrade.
This is very reminiscent of RAM in computers and short-term/working memory in humans. It's a high-cost, high-performance solution. But it would be silly to try to improve LLMs' ability to process long documents by expanding the context window alone. It's as if we expected a person with amnesia to become smarter by teaching them to hold more and more information at a time, or tried to build a computer with RAM only.
Understanding Memory Systems
Human Memory
Human memory operates through a dual-system architecture. We have working memory - our immediate mental workspace - which is highly limited but offers rapid access to information we're actively thinking about. Then we have long-term memory, which is vast but operates with different rules - we can't always retrieve information on demand, and storage isn't permanent or perfectly accurate. This limitation of working memory likely exists because maintaining active neural connections is energetically expensive for the brain.
Computer Memory
Computers solve similar challenges through a hierarchical memory system. At the fastest level, there are registers for immediate processing. Then comes cache memory, providing quick access to frequently used data. RAM offers larger capacity with moderate speed, while storage devices like SSDs give us vast capacity at slower speeds. The fundamental tradeoff is always between speed and size - faster memory is more expensive and therefore more limited.
LLM Memory Today
Currently, LLMs primarily rely on their context window - similar to working memory in humans or RAM in computers. They can keep track of information across their entire context window with impressive recall, achieving what seems like "superhuman reading speed" - they can process and understand vast amounts of text much faster than humans.
However, the limitations are becoming increasingly clear:
Context Size Limits
Modern models can handle impressive context lengths (some reaching a million tokens, enough to fit all the Harry Potter books), but context is still finite. You can't upload every AI paper from 2024, for example. There will always be use cases that exceed even the largest context windows.
Computational Costs
The problem isn't just about technical capabilities - it's about practicality:
Processing time increases significantly with longer contexts
Attention compute grows quadratically with context length (see the rough sketch after this list)
As model inference becomes more commoditized, these costs become a crucial bottleneck
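To see why this matters, here is a rough back-of-the-envelope sketch of how the attention term alone scales with prompt length. The hidden size and layer count below are illustrative assumptions, not the specifications of any particular model.

```python
# Rough sketch of how dense self-attention cost scales with context length.
# d_model and n_layers are illustrative assumptions, not real model specs.

def attention_flops(context_len: int, d_model: int = 4096, n_layers: int = 32) -> int:
    """Approximate FLOPs for the attention matrices alone: each layer builds an
    (L x L) score matrix (~2 * L^2 * d_model FLOPs) and then applies it to the
    values (another ~2 * L^2 * d_model), ignoring everything else in the model."""
    per_layer = 4 * context_len**2 * d_model
    return per_layer * n_layers

for length in (8_000, 32_000, 128_000, 1_000_000):
    print(f"{length:>9,} tokens -> {attention_flops(length):.2e} attention FLOPs")

# Doubling the context roughly quadruples this term, which is why very long
# prompts become disproportionately slow and expensive to process.
```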
Performance Degradation
While models can achieve perfect scores on simple needle-in-a-haystack tests even with 200k-token contexts, this might be overselling their actual capabilities (a sketch of such a test follows the list below):
The ability to meaningfully use data seems to degrade with very long prompts
Recall of specific facts isn't the same as sophisticated reasoning across a large context
Even well-trained models may have hidden performance reductions when working at maximum length
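To make "simple recall" concrete, here is a minimal sketch of how a needle-in-a-haystack probe is typically built: one planted fact inside a long stretch of filler, followed by a question about that fact. The filler text, needle, and question below are invented for illustration.

```python
# Minimal sketch of a needle-in-a-haystack probe. The filler, needle, and
# question are made up for illustration; only prompt construction is shown.

FILLER = "The quick brown fox jumps over the lazy dog. " * 50   # repetitive noise
NEEDLE = "The secret launch code mentioned in the quarterly report is 7-alpha-9."

def build_probe(depth: float, total_chunks: int = 200) -> str:
    """Plant the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    chunks = [FILLER] * total_chunks
    chunks.insert(int(depth * total_chunks), NEEDLE)
    haystack = "\n".join(chunks)
    return f"{haystack}\n\nQuestion: What is the secret launch code?"

prompt = build_probe(depth=0.5)
print(f"Prompt length: ~{len(prompt.split())} words")

# Passing this prompt to a model and checking the answer tests raw recall of a
# single fact - it says little about reasoning that must combine many distant facts.
```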
Why Current Solutions Aren't Enough
If you have a piece of text you want an LLM to know about, you currently have three options:
Put it in the context window
Include it in training data
Put it in a RAG (retrieval-augmented generation) system - a toy sketch of this option follows the list
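As a toy illustration of the third option, a bare-bones retrieval loop stores text chunks outside the model, pulls in only the few that look relevant to a query, and pastes them into the prompt. The keyword-overlap scoring and the example documents below are stand-ins for illustration; real systems use vector embeddings and a proper document store.

```python
# Toy sketch of option 3 (RAG): keep text outside the model, retrieve a few
# relevant chunks per query, and place only those into the context window.
# Scoring here is naive word overlap; real systems use vector embeddings.

def score(query: str, chunk: str) -> int:
    """Crude relevance score: number of words shared between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

documents = [
    "The 2024 fine-tuning run used a learning rate of 2e-5.",
    "Office plants should be watered twice a week.",
    "Evaluation was done on a held-out set of 5,000 prompts.",
]

query = "What learning rate was used for fine-tuning?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)
```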
This is very reminiscent of computer memory - RAM for immediate access, hard drive for storage, etc. But there's a crucial difference: modern computers don't try to solve all problems by just adding more RAM. They use sophisticated caching systems and memory hierarchies.
Yet with LLMs, we've been focusing heavily on expanding context windows - essentially trying to build a computer with only RAM, or expecting someone with amnesia to become smarter by holding more information at once.
The Missing Pieces
While RAG systems attempt to provide a form of "external memory," we're missing crucial middle layers that both humans and computers use effectively:
No equivalent to CPU cache systems
No built-in mechanism for efficient long-term storage
No automatic management of what should be in immediate context vs. external storage - a toy sketch of such a middle tier follows the list
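To make the missing "cache" layer concrete, a middle tier might look something like the toy sketch below: a small pool of recently used chunks sitting between the prompt and a large external store, with least-recently-used eviction. The class name, capacity, and eviction rule are made-up illustrations, not an existing system.

```python
# Toy sketch of a cache tier between the context window and external storage.
# Recently used chunks stay cheap to re-insert into the prompt; cold ones fall
# back to the slower external store. Names and sizes are illustrative only.
from collections import OrderedDict

class ChunkCache:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self._cache = OrderedDict()              # chunk_id -> text, in usage order

    def get(self, chunk_id: str, external_store: dict[str, str]) -> str:
        if chunk_id in self._cache:              # cache hit: cheap, already "warm"
            self._cache.move_to_end(chunk_id)
            return self._cache[chunk_id]
        text = external_store[chunk_id]          # cache miss: slow external lookup
        self._cache[chunk_id] = text
        if len(self._cache) > self.capacity:     # evict the least recently used chunk
            self._cache.popitem(last=False)
        return text

store = {"intro": "Background...", "method": "We train with...", "results": "Accuracy was..."}
cache = ChunkCache(capacity=2)
for chunk_id in ["method", "results", "method", "intro"]:
    cache.get(chunk_id, store)
print(list(cache._cache))   # the two most recently used chunks remain "hot"
```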
Moving Forward
Consider how humans access information:
Working memory (like context window) - immediate but limited
Long-term memory (like training data) - shapes fundamental understanding and patterns
Reference materials/Internet (like RAG) - vast knowledge accessed on demand
Each system has its tradeoffs. Working memory is fast but limited. Long-term memory shapes how we think but can't be quickly updated. External references provide vast knowledge but take time to access and process.
LLMs face similar tradeoffs:
Context window - immediate but computationally expensive
Training data - shapes fundamental capabilities but expensive to update
RAG systems - flexible but require retrieval overhead
Just as we don't expect humans to work purely from short-term memory or computers to solve everything with RAM alone, the future of LLMs likely lies in more sophisticated memory architectures rather than ever-larger context windows. This means:
Developing hierarchical attention mechanisms that could serve as "cache" layers
Creating sophisticated memory management systems that can selectively retain or discard information (sketched below)
Better integrating all three types of memory: training data (fundamental knowledge), RAG systems (external reference), and context windows (immediate processing)
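As a rough sketch of the second point, a memory manager could keep recent turns verbatim, replace older ones with short summaries, and push the full text out to an external archive for later retrieval. The summarizer below is a trivial stub standing in for whatever model or heuristic would do the real compression.

```python
# Toy sketch of selective retention: recent turns stay verbatim in context,
# older turns shrink to one-line summaries, and the full text is evicted to an
# external archive. The summarizer is a stub; a real system might call a model.

def summarize(text: str, max_words: int = 8) -> str:
    return " ".join(text.split()[:max_words]) + " ..."

class MemoryManager:
    def __init__(self, keep_verbatim: int = 2):
        self.keep_verbatim = keep_verbatim
        self.recent: list[str] = []      # verbatim turns - the working memory
        self.summaries: list[str] = []   # compressed traces of older turns
        self.archive: list[str] = []     # full text evicted to external storage

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.keep_verbatim:
            old = self.recent.pop(0)
            self.archive.append(old)               # still retrievable later, RAG-style
            self.summaries.append(summarize(old))  # a cheap trace stays in context

    def context(self) -> str:
        return "\n".join(self.summaries + self.recent)

manager = MemoryManager(keep_verbatim=2)
for turn in ["User asked about 2024 papers on retrieval.",
             "Assistant listed three survey papers.",
             "User asked to compare their evaluation setups."]:
    manager.add(turn)
print(manager.context())
```

The point of the sketch is the division of labor: verbatim text where it is cheap enough to keep, summaries acting as a cache-like middle layer, and full fidelity pushed out to storage that can be searched later.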