Sequence Modeling — The Memory Problem

This is something I'd been meaning to sit down and work through properly. Sequence modeling is the backbone of Large Language Models (LLMs), so I wanted to review the topic from a deeper and relatively intuitive perspective. While reviewing different architectures and their capabilities, from Recurrent Neural Networks to Transformers, I realized that the key concept that has been redefined over time is memory. It makes sense: a sequential model needs to remember the past and use it to make a decision at the current time step. I think this is a very interesting topic and I wanted to share my thoughts on it with you. I am also planning to write more blog posts on machine learning and recommender systems, mainly from my own perspective and understanding, with an intuitive touch. Feedback welcome.

What is Sequence Modeling?

To me, sequence modeling is basically the problem of understanding a sequence of events (images, tokens, price changes and so on) observed at times T₁, T₂, …, T_N−1 in order to make a decision at time T_N. That decision could be what to recommend to a user given what they have watched in the past (while caring about the order of the history), or what to write after the word "next" if the words before it were "I want to visit Tehran". The full sentence would be "I want to visit Tehran next...", and the word after "next" would most likely define some time (e.g., next week, next month, next year). So at each time step we need some recollection of what came before. In other words, we need some type of memory.

We humans can naturally remember things both short-term and long-term (well, some of us 🙂), but the question is: how can a machine remember things in the right order and effectively use that memory to make a decision at the current time step? Note that by memory here I mean a mechanism that can be learned and used in a machine learning model to remember the past. It is not a storage mechanism where you retrieve the exact value (the exact word, for example) by some key. That is not feasible for our sequence modeling problem, since it would mean storing the effectively unbounded number of sentences we could create by combining words from our vocabulary. Now that we understand what memory means in machine learning and why it matters in sequence modeling, let's take a look at the story of memory in sequence modeling.

Recurrent Neural Networks ~ 1986–1990s Memory as Compression

The first RNNs were introduced around 1986–1990^[1]. They treat memory as a kind of notepad that keeps track of what we have seen so far in the sequence. At each step, we read the next input, update the notepad by incorporating the new input into it to create an updated representation of what we have seen so far, and pass it forward. The catch is that the notepad has a fixed size, so no matter how long the sequence is, everything you've seen needs to get squeezed into it (a fixed-dimensional vector). Memory, in an RNN, is a compression of everything we have seen so far. That compression is also sequential, meaning earlier information is less likely to be represented as strongly as later information in the sequence. The notepad update (our method of compressing memory, denoted by h) is as follows:

h_t = f(W_h · h_t−1 + W_x · x_t + b)

The hidden state h_t is a function of the previous state (h_t−1) and the current input (x_t): a continuously rewritten summary of the sequence so far. W_h and W_x are learned weight matrices, b is the bias vector, and f is a nonlinearity (typically tanh).

So technically, the memory available to an RNN at step t is h_t−1 (it enters the update as W_h · h_t−1): a representation of the previous state, which is itself a representation of the step before it, and so on back to the very first word or whatever our first piece of information in the sequence was (first image, first number, etc.).

There's something worth sitting with in that equation. h_t is a completely new vector computed from scratch at every step in the sequence. The previous state h_t−1 is an input to the computation, but the output fully replaces it. There is no mechanism to say "keep this part of the old state intact or only update this other part." The whole thing gets overwritten, every single step in the sequence.
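To make the overwrite concrete, here is a minimal NumPy sketch of the update. It assumes f is tanh and uses small random weights; the sizes and the names rnn_step and run_rnn are my own choices for illustration, not something from a specific library.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN update: h_t = tanh(W_h @ h_prev + W_x @ x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def run_rnn(xs, W_h, W_x, b):
    """Fold a sequence of input vectors into a single fixed-size hidden state."""
    h = np.zeros(W_h.shape[0])               # empty notepad
    for x_t in xs:                           # strictly left to right
        h = rnn_step(h, x_t, W_h, W_x, b)    # the old h is fully replaced
    return h

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4               # illustrative sizes (assumption)
W_h = rng.normal(scale=0.3, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.3, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

short_seq = rng.normal(size=(5, input_size))
long_seq = rng.normal(size=(500, input_size))

h_short = run_rnn(short_seq, W_h, W_x, b)
h_long = run_rnn(long_seq, W_h, W_x, b)
print(h_short.shape, h_long.shape)           # both are the same fixed size
```

Whether the sequence has 5 steps or 500, the result is one vector of the same size. That is the notepad.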

It was quite a novel idea for its time. A single vector that carries the entire past, distilled. After processing "I want to visit Tehran next," that vector should encode not just the last word, but a representation of the full sentence up to the current word. The network learns to do that compression through training.

But compression has costs. When the sequence gets long, early information has to survive a chain of matrix multiplications and nonlinearities (remember, the calculation of h_t is recursive). Each one slightly transforms the signal. In practice, what happened at step 1 can become nearly undetectable by step 50. The problem is not just that training is hard because of vanishing gradients^[2], but that the model also forgets information in the forward pass.
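Here is a small experiment that shows the forward-pass forgetting. Two sequences differ only in their first input, and we track how far apart the hidden states are at each step. To make the effect easy to see, I assume W_h is rescaled to have spectral norm 0.5 (and I drop the bias), which makes every step a contraction. A trained RNN is not required to behave like this, so treat it as an illustration, not a measurement.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, steps = 16, 4, 50

# Assumption for illustration: rescale W_h so its largest singular value is 0.5.
# Since tanh has slope at most 1, each step then shrinks differences by >= 2x.
W_h = rng.normal(size=(hidden_size, hidden_size))
W_h *= 0.5 / np.linalg.norm(W_h, 2)
W_x = rng.normal(size=(hidden_size, input_size))

xs_a = rng.normal(size=(steps, input_size))
xs_b = xs_a.copy()
xs_b[0] = rng.normal(size=input_size)   # the sequences differ only at step 1

def hidden_states(xs):
    """Return the hidden state after every step of the sequence."""
    h = np.zeros(hidden_size)
    out = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        out.append(h)
    return np.array(out)

# Distance between the two hidden-state trajectories at every step.
gap = np.linalg.norm(hidden_states(xs_a) - hidden_states(xs_b), axis=1)
print(gap[0], gap[9], gap[49])
```

The gap starts out clearly nonzero and then collapses toward zero: by the last step, the hidden state carries essentially no trace of what the first input was.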

The Core Tension

The RNN wants to remember, but its memory is fully overwritten at every step. Compression always loses information, and the longer the sequence, the more gets lost. Memory exists. It's just fragile.

RNNs' view of sequential memory had two limitations:

It compressed the entire past into one fixed-size vector, so there was no way to prioritize certain parts of past memory at any given time.
This vector was overwritten at each step, so longer sequences gradually lost early information. In other words, there was no solid way to maintain long-term memory while absorbing new input.

The first limitation — compressing everything into one vector with no way to prioritize — remained unsolved until attention came along years later. The second one, however, was addressed much sooner, and that's what LSTM solved.

LSTMs & GRUs ~ 1997 / 2014 Memory as Controlled Storage

The LSTM's answer^[3] to the second limitation of RNNs was to make long-term memory a separate object. The idea: what if we had a separate mechanism that maintains long-term memory without being overwritten at every step, so that it can be selectively preserved? The result is what is now usually called the cell state: a channel that runs alongside the hidden state, mostly untouched, dedicated to long-term storage. The hidden state still does computation similar to an RNN. The cell state is more like a conveyor belt: information rides it undisturbed unless something deliberately modifies it (through what are called gates).

Textbooks spend a lot of time on the gates (input, forget, output). They are the read/write/delete controls of this memory system: they decide how much of the past is kept and how much is replaced by new information (the very thing RNNs struggled with). But the real insight isn't the gates. It's the additive update at each step. One could have added gates to the hidden state of a plain RNN too, but as long as the hidden state update involved repeated multiplication by the same weight matrix, the gradient would still vanish. The LSTM writes to the cell state by adding to it, not replacing it. That sounds like a small implementation detail. It's not.

The Gradient Highway

Additive updates create a nearly unobstructed path for gradients to flow backward. In a standard RNN, the gradient passes through another multiplication by the weight matrix and another nonlinearity at every step, and typically shrinks each time. In an LSTM, the gradient can travel along the cell state (the conveyor belt) directly, scaled only by the forget gate, without passing through a weight matrix or a squashing nonlinearity. The cell state is a highway. That's why LSTMs can remember things from hundreds of steps back when vanilla RNNs cannot.
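A minimal NumPy sketch of one LSTM step makes the additive update visible. I am assuming the common formulation with a forget gate and a single stacked weight matrix; the layout of W and the name lstm_step are my own choices for illustration. With the gates held fixed, the derivative of c_t with respect to c_prev is just the forget gate f, which is the highway.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. W has shape (4*H, H+D) and b has shape (4*H,).
    Row blocks (an assumed layout): forget, input, candidate, output."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: how much of c_prev to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new content to write
    g = np.tanh(z[2 * H:3 * H])    # candidate content
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g       # additive update: old content is kept, not replaced
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Demo with hand-set parameters (not trained): forget gate ~ 1, input gate ~ 0.
H, D = 4, 3
W = np.zeros((4 * H, H + D))
b = np.zeros(4 * H)
b[0:H] = 20.0                      # keep everything
b[H:2 * H] = -20.0                 # write nothing

rng = np.random.default_rng(0)
c0 = rng.normal(size=H)
h, c = np.zeros(H), c0.copy()
for _ in range(200):
    h, c = lstm_step(h, c, rng.normal(size=D), W, b)
print(np.abs(c - c0).max())        # tiny: the cell state survived 200 steps
```

In a trained LSTM the gates are learned and depend on the input, so the model decides when to keep, write, or erase. The hand-set biases here only show what the mechanism makes possible.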

The GRU (2014)^[4] later simplified the LSTM: it uses fewer parameters, trains faster, and often achieves performance comparable to LSTMs.
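As a rough back-of-the-envelope check on "fewer parameters": in the standard formulations, an LSTM layer has four weight blocks (forget, input, candidate, output) and a GRU has three (update, reset, candidate). The sketch below assumes one weight matrix over [h_prev, x_t] and one bias vector per block; library implementations may count biases differently, and the sizes are arbitrary.

```python
def gated_rnn_params(hidden_size, input_size, num_blocks):
    """Parameter count for a gated recurrent layer where each block has a
    weight matrix over [h_prev, x_t] and one bias vector (an assumption)."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

hidden_size, input_size = 512, 256     # illustrative sizes
lstm_params = gated_rnn_params(hidden_size, input_size, num_blocks=4)
gru_params = gated_rnn_params(hidden_size, input_size, num_blocks=3)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.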

But even LSTMs didn't change the underlying shape of the problem. Memory was still a vector. Still a single compressed summary of everything. In practice, it is important to pay more attention (pun intended) to certain parts of the input sequence rather than others. For example, in the sentence "I am thinking it would be a great idea to visit Tehran next...", it might be important to pay more attention to the word "visit" than the word "great" when generating the next word in the sequence. Neither RNNs nor LSTMs had a way to do this. That's the bottleneck that attention mechanisms addressed.

Attention Mechanisms ~ 2015 Memory as Access

By 2015, the bottleneck was well understood. Bahdanau et al.^[5] asked a question that may now seem obvious: what if we just kept all the encoder states, without compressing them into one vector, and let the decoder look up the ones it needs, step by step? Not a summary but more like a library.

That's attention. At each decoding step, the model scores the current decoder state against every encoder hidden state, turns those scores into weights with a softmax, and takes the weighted sum. This way the decoder can pay more attention to some hidden states than others, depending on where we are in the decoding process. The decoder isn't working from a compressed memory. It's reaching back into the whole sequence and selecting, with learned precision, what to actually use.

context_t = Σ_i α_ti · h_i

where α_ti = softmax(score(s_t, h_i))

At decoding step t, a context vector is formed by weighting all encoder states h_i by their relevance score. Memory is no longer a fixed vector. It's a dynamic retrieval.
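The two formulas above can be sketched in NumPy as follows. The score function here is an additive one in the style of Bahdanau et al.; the parameter names W_s, W_h and v, the sizes, and the random (untrained) weights are my own assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def additive_attention(s_t, enc_states, W_s, W_h, v):
    """Context vector for one decoder step.
    s_t: (S,) decoder state; enc_states: (N, E), one row per encoder state h_i."""
    # score(s_t, h_i) = v . tanh(W_s s_t + W_h h_i), computed for all i at once
    scores = np.tanh(enc_states @ W_h.T + W_s @ s_t) @ v    # shape (N,)
    alpha = softmax(scores)                                  # weights over positions
    context = alpha @ enc_states                             # weighted sum, shape (E,)
    return context, alpha

rng = np.random.default_rng(0)
N, E, S, A = 6, 8, 8, 10             # positions, encoder dim, decoder dim, score dim
enc_states = rng.normal(size=(N, E))
s_t = rng.normal(size=S)
W_s = rng.normal(size=(A, S))
W_h = rng.normal(size=(A, E))
v = rng.normal(size=A)

context, alpha = additive_attention(s_t, enc_states, W_s, W_h, v)
print(alpha.round(3), context.shape)
```

Nothing is compressed ahead of time: all N encoder states stay available, and alpha decides how much of each one goes into the context for this particular step.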

The Conceptual Shift

RNNs and LSTMs were trying to build better storage of the past. Attention changed the question: instead of "how do we store more?" it asked "what if we don't need to store at all?" The past is always there. All tokens (or previous information in the sequence) are there. Memory becomes something we retrieve on demand, not something we maintain over time. We just need to learn to query it.

And memory became distributed. There's no single vector encoding the entire history. Information sits in each encoder state, at each position. The decoder's "memory" at any step is assembled fresh, a weighted combination of whatever is currently relevant.

Transformers ~ 2017 Memory at Scale

The transformer took attention and said: let's make it the entire architecture for sequence modeling. Not an add-on to recurrence to work around the fixed-vector bottleneck, but a replacement for recurrence itself. I believe "Attention Is All You Need"^[6] wasn't just a provocative title. It was a statement that we don't need the sequential bottleneck at all.

The key move is self-attention: instead of attending from decoder to encoder, every position in a sequence attends to every other position in the same sequence (or, in a causally masked decoder, to every earlier position). At every layer, each token directly exchanges information with the tokens it can attend to. Long-range dependencies, which were the chronic weakness of RNNs, no longer degrade through repeated transformations, since tokens can attend to each other directly. There's no long sequential chain that information has to flow through. This doesn't mean the problem disappears entirely. In practice, attention introduces its own challenges: limited context windows, quadratic scaling with sequence length, and competition between tokens for attention (even tokens are competing for attention 🙂). But it removes the specific bottleneck that made long-range dependencies so difficult in recurrent models. Attention by itself has no notion of token order, though, so the sequential nature of the data still has to be modeled somehow. That is why transformers use positional encodings to inject information about token order.
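Here is a minimal single-head sketch of scaled dot-product self-attention in NumPy, with an optional causal mask. The weights are random and untrained, and the function names are my own. The demo at the end also shows why positional encodings are needed: without them, shuffling the input tokens simply shuffles the outputs in the same way.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head scaled dot-product self-attention over X of shape (N, D)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (N, N): every token vs every token
    if causal:
        # Block attention to later positions, as a decoder-style model would.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax_rows(scores) @ V

rng = np.random.default_rng(0)
N, D = 6, 8                                    # illustrative sizes
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
perm = rng.permutation(N)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# No positional information: permuting the tokens permutes the outputs.
print(np.allclose(out_perm, out[perm]))

out_causal = self_attention(X, W_q, W_k, W_v, causal=True)
```

Note the (N, N) score matrix: that is where both the direct token-to-token access and the quadratic cost come from.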

Transformers also killed recurrence, which unlocked parallelism. RNNs process a sequence left to right, each step waiting on the previous one. During training, transformers process the whole sequence at once. On modern hardware, this is what made training LLMs on hundreds of billions of tokens feasible. Scale became reachable.

On LLMs and Memory

An LLM is still, at its core, doing next-token prediction. The same task as an n-gram model from the 1980s. What changed is the memory architecture. An n-gram model typically sees a window of only 3–5 tokens. A transformer's context window can hold thousands of tokens or more. The "understanding" that emerges in LLMs isn't magic. It's what happens when we can model fine-grained long-range dependencies across enormous amounts of text.
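For contrast, here is a toy n-gram next-token predictor using only the Python standard library. The tiny corpus and the function names are made up for illustration, and a real n-gram model would add smoothing and back-off. The point is that the prediction only ever sees the last n−1 tokens.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each (n-1)-token context. Assumes n >= 2."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, history, n):
    """Most frequent continuation, looking only at the last n-1 tokens."""
    context = tuple(history[-(n - 1):])
    if context not in counts:
        return None                      # unseen context: no prediction
    return counts[context].most_common(1)[0][0]

# Made-up toy corpus.
corpus = "i want to visit tehran next week . i want to visit paris next year .".split()
model = train_ngram(corpus, n=3)

print(predict_next(model, "i want to visit tehran".split(), n=3))
print(predict_next(model, "i want to visit paris next".split(), n=3))
```

With n=3 the model's entire memory is the previous two tokens. Everything before that is invisible to it, no matter how relevant it is.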

In a real sense, transformer memory is the most radical departure from where we started. RNN memory lived in a hidden vector. LSTM memory lived in a gated cell. Transformer memory lives nowhere specific but it can be accessed everywhere. All the tokens stay in the context window, and attention repeatedly mixes them together so each one can pick up information from the rest. There's no single place to point to and say "that is the memory." It is the computation itself.

What "Memory" Became

End to end, what's most striking isn't the performance improvements. It's how completely the concept of memory was reinvented each time.

Architecture	Memory Defined As	Core Limitation
RNN	Compressed hidden state	Fragile, overwritten at each step, vanishing gradients problem
LSTM / GRU	Controlled, gated cell state	Still a single vector; bottlenecks seq2seq (sequence to sequence) translation
Attention	Dynamic retrieval from all positions	Quadratic cost in sequence length n, i.e. O(n^2)
Transformer	Distributed, contextual, attention-woven	No persistent state across contexts
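To put a number on the quadratic cost in the table: self-attention compares every position with every other position, so one head builds an n-by-n score matrix. The sketch below assumes 4-byte float32 scores and a fully materialized matrix. Optimized kernels can avoid storing the whole matrix at once, but the number of pairwise comparisons stays quadratic.

```python
def attention_score_entries(n):
    """Entries in the n-by-n score matrix of one attention head."""
    return n * n

for n in (1_000, 10_000, 100_000):
    entries = attention_score_entries(n)
    gib = entries * 4 / 2**30            # assuming 4-byte float32 scores
    print(f"n={n:>7,}  entries={entries:>15,}  ~{gib:,.2f} GiB per head per layer")
```

Doubling the sequence length quadruples the number of entries, which is why long contexts get expensive so quickly.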

So when we look at things from a bird's-eye view, the progression isn't "store more memory." It's a shift in the underlying metaphor each time. RNN: memory is a running summary, constantly rewritten at each step. LSTM: memory is infrastructure, a separate channel with controlled read/write. Attention: memory isn't something you store. It's something you retrieve on demand, from a sequence that's always fully available.

What changed, at each step, wasn't just the data or the compute. It was the model of what memory is. Each redefinition unlocked a class of problems the previous definition had made impractical. Memory is not just about storing past interactions but about how that information is retrieved and used to inform future predictions. Not all parts of memory are equally relevant to a given prediction, and attention mechanisms were developed to address exactly this.

The remaining gap in that table (transformers have no persistent state across contexts) is one of the most active areas of current research, which is trying to answer the same question we started with: how do we effectively maintain memory?

One footnote worth mentioning: you might be surprised to learn that the idea of memory as retrieval didn't start with attention. Hopfield networks, introduced in the 1980s, were doing something conceptually similar: storing patterns and retrieving them by association. Given a partial or noisy input, the network would settle into the closest stored pattern. Memory as recall, not storage. The key limitation was that the classical version used binary neurons and weights set directly from the stored patterns by a Hebbian rule, so it was not trained with gradient descent in the modern sense. But the spirit was there. In fact, a more recent paper by Ramsauer et al.^[7] showed that the update rule of modern (continuous) Hopfield networks is equivalent to the attention mechanism used in transformers. The retrieval idea never went away. We just needed a few decades to make it differentiable and scalable.

I left Hopfield networks out of the main post on purpose. They're fascinating and quite innovative for their time, but they sit in a different chapter of the history of machine learning (before backpropagation became mainstream), and I felt that mentioning them together with RNNs, LSTMs and Transformers could have changed the direction of the blog post. However, I highly encourage you to take a look at them and their recent developments.
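For the curious, here is a minimal sketch of both flavors of Hopfield retrieval in NumPy. The two stored patterns are hand-picked (and orthogonal) so the demo is clean, and the function names and the value of beta are my own assumptions. The second function is the softmax-weighted update from the modern formulation, which has the same shape as an attention step: the query is the noisy input, and the stored patterns act as both keys and values.

```python
import numpy as np

# Two stored patterns with +1/-1 entries (hand-picked, orthogonal).
p1 = np.ones(16)
p2 = np.concatenate([np.ones(8), -np.ones(8)])
patterns = np.stack([p1, p2])                  # shape (num_patterns, N)

# Classical Hopfield: weights come from a Hebbian rule, not gradient descent.
N = patterns.shape[1]
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0.0)

def classical_recall(state, steps=5):
    """Repeated synchronous sign updates until the state settles."""
    for _ in range(steps):
        state = np.sign(W @ state)
    return state

def modern_recall(query, beta=1.0):
    """One update of a modern (continuous) Hopfield network:
    a softmax-weighted sum of the stored patterns."""
    scores = beta * patterns @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ patterns

noisy = p2.copy()
noisy[[0, 15]] *= -1                           # corrupt two entries

recalled_classical = classical_recall(noisy)
recalled_modern = modern_recall(noisy)
print(np.array_equal(recalled_classical, p2))
print(np.abs(recalled_modern - p2).max())
```

Both versions pull the corrupted input back to the stored pattern it most resembles. Retrieval by similarity, decades apart.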

Key References

[1] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
[2] Vanishing gradient problem. Wikipedia.
[3] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
[4] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014. arXiv:1406.1078
[5] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473
[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
[7] Ramsauer, H., Schäfl, B., Lehner, J., et al. (2020). Hopfield Networks is All You Need. ICLR 2021. arXiv:2008.02217