This is something I'd been meaning to sit down and work through properly. Sequential modeling is the backbone of Large Language Models (LLMs), so I wanted to review the topic from a deeper and relatively intuitive perspective. While reviewing different architectures and their capabilities, from Recurrent Neural Networks (RNNs) to Transformers, I realized that the key concept that has been redefined and evolved over time is memory. It makes sense. A sequential model needs to remember the past and use it to make a decision at the current time step. I think this is a very interesting topic and I wanted to share my thoughts on it with you. Also, from now on I am planning to write more blog posts on machine learning and recommender systems, mainly from my own perspective and understanding, with an intuitive touch. Feedback welcome.
To me, sequence modeling is basically the problem of understanding a sequence of events (images, tokens, price changes and so on) observed at times T1, T2, …, TN-1 in order to make a decision at time TN. That decision could be what we should recommend to a user given what they have watched in the past (while caring about the order of that history), or what we would write after the word "next" if the words before it were "I want to visit Tehran". The full sentence would be "I want to visit Tehran next...", where the word after "next" would most likely be a word defining some time (e.g., next week, next month, next year). So we need some recollection of what came before at each time step. In other words, we need some type of memory.
We humans can naturally remember things both short-term and long-term (well, some of us 🙂), but the question is: how can a machine remember things in the right order and effectively use that memory to make a decision at the current time step? Note that by memory here I mean a mechanism that can be learned and used in a machine learning model to remember the past. It is not a storage mechanism where you retrieve the exact value (the exact word, for example) by some key. That is not feasible for our sequence modeling problem, since it would mean storing every sentence we could create by combining words from our vocabulary, and there is no bound on how many of those exist. Now that we understand what memory means in machine learning and why it matters in sequential modeling, let's take a look at the story of memory in sequential modeling.
RNNs were first introduced around 1986–1990[1], and they treat memory as a kind of notepad that keeps track of what we have seen so far in the sequence. At each step, we read the next input, update the notepad by incorporating the new input into it to create an updated representation of what we have seen so far, and pass it forward. The catch is that the notepad has a fixed size, so no matter how long the sequence is, everything you've seen needs to get squeezed into this notepad (a fixed-dimensional vector). Memory, in an RNN, is a compression of everything we have seen so far, and that compression is also sequential, meaning earlier information is less likely to be represented as strongly as later information in the sequence. The notepad (our method of compressing memory, denoted by h) is as follows:
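In its standard textbook form (I am assuming the vanilla, Elman-style RNN with a tanh activation here; the names $W_x$ and $b$ are just the usual convention, not something specific to this post), the update reads:

$$
h_t = \tanh\left(W_h \, h_{t-1} + W_x \, x_t + b\right)
$$

Here $x_t$ is the input at step $t$, $h_{t-1}$ is the notepad as it was left by the previous step, and $h_t$ is the notepad after the update. $W_h$ and $W_x$ are learned weight matrices shared across all steps.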
So technically the full memory in an RNN is carried by the term W_h · h_{t−1}: a transformation of the previous state, which is itself a representation of the step before it, and so on all the way back to the very first word, or whatever the first piece of information in the sequence was (first image, first number, etc.).
There's something worth sitting with in that equation. h_t is a completely new vector, recomputed at every step in the sequence. The previous state h_{t−1} is an input to the computation, but the output fully replaces it. There is no mechanism to say "keep this part of the old state intact and only update this other part." The whole thing gets overwritten at every single step in the sequence.
It was quite a novel idea for its time: a single vector that carries the entire past, distilled. After processing "I want to visit Tehran next," that vector should encode not just the last word but a representation of the full sentence up to the current word. The network learns to do that compression through training.
But compression has costs. When the sequence gets long, early information has to survive a chain of matrix multiplications (remember, the calculation of h_t is recursive). Each one slightly transforms the signal. In practice, what happened at step 1 can become nearly undetectable by step 50. It's not just that training is hard because of vanishing gradients[2]; the model also forgets information in the forward pass.
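To make the forward-pass forgetting concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: a two-dimensional hidden state, a scalar input, and hand-picked (not trained) weights chosen so that the recurrence is contractive. It runs the same RNN on two sequences that differ only in their first input and measures how far apart the two hidden states are after each step.

```python
import math

# Hand-picked toy weights (assumptions for illustration, not trained values).
W_h = [[0.5, -0.3],
       [0.2,  0.4]]   # hidden-to-hidden weights
W_x = [1.0, -1.0]     # input-to-hidden weights for a scalar input
b = [0.0, 0.0]        # bias


def rnn_step(h_prev, x):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return [
        math.tanh(sum(W_h[i][j] * h_prev[j] for j in range(2)) + W_x[i] * x + b[i])
        for i in range(2)
    ]


def run(inputs):
    """Return the hidden state after every step, starting from h_0 = 0."""
    h = [0.0, 0.0]
    states = []
    for x in inputs:
        h = rnn_step(h, x)  # the old state is fully replaced by the new one
        states.append(h)
    return states


# Two 50-step sequences that differ only in their very first input.
seq_a = [1.0] + [0.1] * 49
seq_b = [-1.0] + [0.1] * 49

# Euclidean distance between the two hidden states at each step.
gaps = [math.dist(ha, hb) for ha, hb in zip(run(seq_a), run(seq_b))]
print(f"gap after step 1:  {gaps[0]:.3e}")
print(f"gap after step 50: {gaps[-1]:.3e}")
```

With these particular weights, the trace of the first input shrinks by orders of magnitude by step 50. A trained RNN will not necessarily be this contractive, so treat this as a picture of the failure mode rather than a measurement of any real model.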
The RNN wants to remember, but its memory is fully overwritten at every step. Compression always loses information, and the longer the sequence, the more gets lost. Memory exists. It's just fragile.
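RNNs' view of sequential memory had two limitations: first, everything has to fit into a single fixed-size vector; second, that vector is fully overwritten at every step, so long-term information fades.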
The LSTM's answer[3] to the second limitation of RNNs was to make the long-term memory a separate object. The idea: what if we had a separate mechanism that maintains a long-term memory which is not overwritten at every step but selectively preserved? They introduced what they called the cell state: a channel that runs alongside the hidden state, mostly untouched, dedicated to long-term storage. The hidden state still does computation similar to an RNN. The cell state is more like a conveyor belt: information rides it undisturbed unless something deliberately modifies it (through what they called gates).
Textbooks spend a lot of time on the gates (input, forget, output). They are the read/write/delete controls for this memory system: they decide how much of the past should be maintained and how much should be replaced by new information (exactly the control RNNs lacked). But the real insight isn't the gates. It's the additive update at each step. One could have added gates to the hidden state of an RNN too, but as long as the hidden state update involved repeated multiplication by the same weight matrix, the gradient would still vanish. The LSTM writes to the cell state by adding to it, not replacing it. That sounds like a small implementation detail. It's not.
Additive updates create a nearly unobstructed path for gradients to flow backward. In a standard RNN, the gradient passes through another multiplication by the weight matrix and the activation's derivative at every step, typically shrinking each time. In an LSTM, the gradient can travel along the cell state (the conveyor belt) directly, scaled only by the forget gate, without passing through a squashing nonlinearity or the recurrent weight matrix. The cell state is a highway. That's why LSTMs can remember things from hundreds of steps back when vanilla RNNs cannot.
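Here is a small sketch of that additive write, using the commonly taught LSTM formulation with a forget gate. To keep it readable I am assuming a scalar cell and hidden state, and hand-picked gate parameters (hypothetical values, not trained ones): the forget gate sits close to 1, the input gate opens only when the input is 1, and the recurrent weights are set to 0 so the only route from the past is the cell state itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))             # how much of the old cell to keep
    i = sigmoid(pre("input"))              # how much of the candidate to write
    o = sigmoid(pre("output"))             # how much of the cell to expose
    c_tilde = math.tanh(pre("candidate"))  # proposed new content
    c = f * c_prev + i * c_tilde           # additive update of the cell state
    h = o * math.tanh(c)                   # hidden state read from the cell
    return h, c, f


# Hypothetical, hand-picked parameters: (w_x, w_h, bias) per gate.
params = {
    "forget": (0.0, 0.0, 6.0),     # forget gate stays close to 1
    "input": (12.0, 0.0, -6.0),    # input gate opens only when x = 1
    "output": (0.0, 0.0, 0.0),
    "candidate": (3.0, 0.0, 0.0),
}

inputs = [1.0] + [0.0] * 199       # write once, then 199 quiet steps
h, c = 0.0, 0.0
direct_path = 1.0                  # product of forget gates after the write
cells = []
for t, x in enumerate(inputs):
    h, c, f = lstm_step(x, h, c, params)
    cells.append(c)
    if t > 0:
        direct_path *= f

print(f"cell after step 1:   {cells[0]:.3f}")
print(f"cell after step 200: {cells[-1]:.3f}")
print(f"d c_200 / d c_1 along the cell path: {direct_path:.3f}")
```

Because the recurrent weights are zero in this toy setup, the product of forget gates is exactly the derivative of the last cell state with respect to the first one. In a real LSTM there are additional paths through the hidden state, and the forget gate is learned and input-dependent, so the picture is less clean.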
The GRU (2014)[4] later simplified the LSTM, using fewer parameters for faster training while achieving nearly the same performance as LSTMs in many cases.
But even LSTMs didn't change the underlying shape of the problem. Memory was still a vector. Still a single compressed summary of everything. In practice, it is important to pay more attention (pun intended) to certain parts of the input sequence rather than others. For example, in the sentence "I am thinking it would be a great idea to visit Tehran next...", it might be important to pay more attention to the word "visit" than the word "great" when generating the next word in the sequence. Neither RNNs nor LSTMs had a way to do this. That's the bottleneck that attention mechanisms addressed.
By 2015, the bottleneck was well understood. Bahdanau et al.[5] asked what may now seem an obvious question: what if we just kept all the encoder states without compressing them into one vector and let the decoder look up the ones it needs, step by step? Not a summary but more like a library.
That's attention. At each decoding step, the model scores the current decoder state against every encoder hidden state. Softmax those scores into weights, take the weighted sum. This way the decoder can pay more attention to some hidden states than to others, depending on where we are in the decoding process. The decoder isn't working from a compressed memory. It's reaching back into the whole sequence and selecting, with learned precision, what to actually use.
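The score, softmax, weighted-sum recipe fits in a few lines. One assumption to flag: I am using a plain dot product as the scoring function to keep the sketch short, whereas Bahdanau et al. used a small learned network for scoring. The encoder states and decoder state below are made-up toy vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Score every encoder state, softmax the scores, return the weighted sum."""
    # Dot-product scoring (a simplifying assumption, see the note above).
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy example: three encoder states, and a decoder state aligned with the second.
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoder_state = [0.0, 4.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:   ", [round(c, 3) for c in context])
```

Nothing is compressed ahead of time here: all encoder states stay available, and the context vector is rebuilt from them for every new decoder state.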
RNNs and LSTMs were trying to build better storage of the past. Attention changed the question: instead of "how do we store more?" it asked "what if we don't need to store at all?" The past is always there. All tokens (or previous information in the sequence) are there. Memory becomes something we retrieve on demand, not something we maintain over time. We just need to learn to query it.
And memory became distributed. There's no single vector encoding the entire history. Information sits in each encoder state, at each position. The decoder's "memory" at any step is assembled fresh, a weighted combination of whatever is currently relevant.
The transformer took attention and said: let's make it the entire architecture for sequential modeling. Not an add-on to recurrence to relieve the fixed-vector bottleneck, but a replacement for recurrence altogether. I believe "Attention Is All You Need"[6] wasn't just a provocative title. It was a statement that we don't need the sequential bottleneck at all.
The key move is self-attention: instead of attending from decoder to encoder, every position in a sequence attends to every other position in the same sequence. At every layer, each token directly exchanges information with every other token. Long-range dependencies, which were the chronic weakness of RNNs, no longer degrade through repeated transformations, since tokens can attend to each other directly. There's no long sequential chain that information has to flow through. This doesn't mean the problem disappears entirely. In practice, attention introduces its own challenges: limited context windows, quadratic scaling with sequence length, and competition between tokens for attention (even tokens are competing for attention 🙂). But it removes the specific bottleneck that made long-range dependencies so difficult in recurrent models. There is still a need to model the sequential nature of the data, though: self-attention on its own has no notion of token order, which is why transformers use positional encodings to inject that information.
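A stripped-down sketch of self-attention shows both points at once: every position reads from every other position directly, and the operation by itself is blind to order. The simplifications are mine and worth stating: a single head, scaled dot-product scores, and no learned query, key, or value projections (each token vector plays all three roles). The input vectors are arbitrary toy values.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Single-head scaled dot-product self-attention without learned projections.

    Each row of X acts as query, key, and value (a simplifying assumption).
    """
    d = len(X[0])
    out = []
    for q in X:
        # Score this position against every position in the same sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # New representation: weighted mix of all positions.
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d)])
    return out


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # four toy token vectors

Y = self_attention(X)
Y_reversed_input = self_attention(X[::-1])

# Reversing the input only reverses the output rows: no order information is used.
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(Y_reversed_input, Y[::-1])
    for a, b in zip(row_a, row_b)
)
print("output depends on token order:", not same_up_to_order)
```

Shuffling the tokens just shuffles the outputs in the same way, so without some positional signal added to the inputs the model could not tell "Tehran next" from "next Tehran".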
Transformers also killed recurrence, which unlocked parallelism. RNNs process a sequence left to right, each step waiting on the previous one. During training, transformers process all positions of the sequence at once. On modern hardware, this is what made training LLMs on hundreds of billions of tokens feasible. Scale became reachable.
An LLM is still, at its core, doing next-token prediction. The same task as an n-gram model from the 1980s. What changed is the memory architecture. An n-gram model has a 3–5 token window. A transformer's context window can hold thousands of tokens or more. The "understanding" that emerges in LLMs isn't magic. It's what happens when we can model fine-grained long-range dependencies across enormous amounts of text.
In a real sense, transformer memory is the most radical departure from where we started. RNN memory lived in a hidden vector. LSTM memory lived in a gated cell. Transformer memory lives nowhere specific but it can be accessed everywhere. All the tokens stay in the context window, and attention repeatedly mixes them together so each one can pick up information from the rest. There's no single place to point to and say "that is the memory." It is the computation itself.
End to end, what's most striking isn't the performance improvements. It's how completely the concept of memory was reinvented each time.
| Architecture | Memory Defined As | Core Limitation |
|---|---|---|
| RNN | Compressed hidden state | Fragile, overwritten at each step, vanishing gradients problem |
| LSTM / GRU | Controlled, gated cell state | Still a single vector; bottlenecks seq2seq (sequence to sequence) translation |
| Attention | Dynamic retrieval from all positions | Quadratic cost in sequence length n, i.e. O(n^2) |
| Transformer | Distributed, contextual, attention-woven | No persistent state across contexts |
So when we look at things from a bird's-eye view, the progression isn't "store more memory." It's a shift in the underlying metaphor each time. RNN: memory is a running summary, constantly rewritten at each step. LSTM: memory is infrastructure, a separate channel with controlled read/write. Attention: memory isn't something you store. It's something you retrieve on demand, from a sequence that's always fully available.
What changed, at each step, wasn't the data or the compute. It was the model of what memory is. Each redefinition unlocked a class of problems the previous definition had put out of reach. Memory is not just about storing past interactions but about how that information is retrieved and used to inform future predictions. Not all parts of the memory are equally important or relevant to future predictions, and attention mechanisms were developed to address exactly this.
The remaining gap in that table (transformers have no persistent state across contexts) is one of the most active areas of current research, which is still trying to answer the question we started with: how do we effectively maintain memory?
One footnote worth mentioning: you might be surprised to learn that the idea of memory as retrieval didn't start with attention. Hopfield networks, introduced in the 1980s, were doing something conceptually similar: storing patterns and retrieving them by association. Given a partial or noisy input, the network would ideally settle into the closest stored pattern. Memory as recall, not storage. The key limitation was that they used binary neurons and weights fixed by a storage rule, which means they were not learnable in the modern gradient descent sense. But the spirit was there. In fact, a recent paper by Ramsauer et al.[7] showed that the update rule of modern (continuous) Hopfield networks is equivalent to the attention mechanism used in transformers. The retrieval idea never went away. We just needed a few decades to make it differentiable and scalable. I left Hopfield networks out of the main post on purpose. They're fascinating and quite innovative for their time, but they sit in a different chapter of the history of machine learning (before backpropagation-based training took over), and I felt that mentioning them together with RNNs, LSTMs and Transformers could have changed the direction of the blog post. However, I highly encourage you to take a look at them and their recent developments.
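For the curious, retrieval by association is easy to see in a toy classical Hopfield network. This is a sketch of the binary version with the Hebbian outer-product storage rule, not the modern continuous variant from Ramsauer et al. The two stored patterns are hand-picked to be orthogonal, and the update order and sweep limit are my own choices for the example.

```python
def train_hebbian(patterns):
    """Build the weight matrix with the Hebbian outer-product rule (zero diagonal)."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W


def recall(W, state, max_sweeps=10):
    """Update neurons one at a time until no neuron changes (or the sweep limit hits)."""
    s = list(state)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(W[i][j] * s[j] for j in range(n))
            # Follow the sign of the local field; keep the old value on a tie.
            new = 1 if field > 0 else -1 if field < 0 else s[i]
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


# Two hand-picked, mutually orthogonal patterns of +1/-1 values.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
W = train_hebbian(stored)

noisy = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with its first bit flipped
recovered = recall(W, noisy)
print("recovered first pattern:", recovered == stored[0])
```

The weights are written once by the storage rule and never trained, which is exactly the limitation mentioned above. With many or strongly correlated patterns, a network like this can also settle into states that match none of the stored patterns.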