
Google has released TITANS, the successor to the Transformer architecture

To be clear: the phrase "successor to the Transformer architecture" above 👆🏻 is something I saw someone use on x.com. I'm not in a position to judge how important TITANS really is; I'm just learning about it.

The attention mechanism has been key to the progress of most large language models (LLMs), but its quadratic cost means it scales poorly to long contexts.

“The true art of memory is the art of attention!”

— Samuel Johnson, 1787

TITANS is a new architecture that combines the attention mechanism with a meta in-context memory, allowing the model to learn how to remember at test time. TITANS outperforms both Transformers and modern linear RNNs, and it scales effectively to context windows of more than 2M tokens, surpassing even very large models such as GPT-4 and Llama3-70B on long-context tasks.


Paper: https://arxiv.org/pdf/2501.00663v1


In summary, the TITANS architecture is an innovative attempt to address the long-context problem and strengthen memory. Compared to the traditional Transformer architecture, TITANS maintains efficient performance over a much larger context window and can update its memory dynamically at test time, which shows its potential.




How to design long-term memory?

The TITANS team approached this question from the perspective of human memory. Human short-term memory is very accurate but has a limited window (about 30 seconds). So how do we handle longer contexts? Like the human brain, TITANS relies on additional memory systems to store potentially useful information.

They view the attention mechanism, with its limited context window but precise dependency modeling, as short-term memory. TITANS therefore needs a neural memory module that can remember a much longer history and serve as a longer-term, more persistent memory.

There is a catch: the memory system is responsible for storing information, but memorizing the training data may be useless at test time, because the test data distribution can differ from the training data. The TITANS team therefore needed to teach the memory module how to remember and forget information during testing.

To this end, the TITANS team proposed encoding past history into the parameters of a neural network (similar to TTT, test-time training) and training an online meta-model that learns how to remember and forget data at test time.
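To make this more concrete, here is a minimal sketch of the idea as I understand it (my own illustration in Python, not the authors' code): the memory is a small parametric model, and each incoming token triggers one gradient step on an associative key-value loss at test time, so "remembering" literally means updating the memory's weights. The linear form of the memory and the `W_K`/`W_V` projections are simplifying assumptions.

```python
import numpy as np

# Minimal illustration (not the official implementation): the long-term memory
# is a weight matrix M that maps keys to values. "Remembering" a token means
# taking a gradient step on an associative-recall loss at test time.

d = 16
rng = np.random.default_rng(0)

W_K = rng.normal(size=(d, d)) / np.sqrt(d)   # key projection (trained in the outer loop)
W_V = rng.normal(size=(d, d)) / np.sqrt(d)   # value projection
M = np.zeros((d, d))                         # the memory itself (inner-loop parameters)

def memory_step(M, x, lr=0.1):
    """One test-time update: push M toward mapping key(x) -> value(x)."""
    k, v = x @ W_K, x @ W_V
    err = k @ M - v                  # prediction error of the memory
    grad = np.outer(k, err)          # gradient of 0.5 * ||k @ M - v||^2 w.r.t. M
    return M - lr * grad

def memory_read(M, x):
    """Retrieve from memory (reusing W_K as the query projection for brevity)."""
    return (x @ W_K) @ M

# Stream of "tokens": the memory keeps learning as the test-time sequence goes by.
for x in rng.normal(size=(100, d)):
    M = memory_step(M, x)
```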




Which tokens need to be remembered?

The TITANS team again approached this question from the perspective of human memory. The human brain prioritizes remembering events that defy expectations, that is, surprising events. An event that is surprising at one moment will not keep surprising us forever, but that initial moment is enough to draw our attention and make us remember the whole stretch of time around it.

The TITANS team simulated this process to train long-term memory, dividing the surprise of a token into:

  1. Instantaneous surprise
  2. (Decaying) past surprise

Instantaneous surprise is measured by the gradient of the memory's loss on the incoming token (with respect to the memory's parameters), while past surprise is a decaying accumulation of the surprise from earlier tokens.
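In code, the two kinds of surprise might be combined like this (a sketch under my reading of the paper; `grad` is the gradient of the memory loss for the current token, and `eta`/`theta` are fixed constants here, whereas TITANS predicts them from the data):

```python
def surprise_update(S_prev, grad, eta=0.9, theta=0.1):
    """
    Combine (decaying) past surprise with instantaneous surprise.

    S_prev : accumulated surprise from earlier tokens (same shape as the memory)
    grad   : gradient of the memory loss for the current token,
             i.e. that token's "instantaneous surprise"
    eta    : how strongly past surprise persists (decay factor)
    theta  : step size applied to the instantaneous surprise

    In TITANS, eta and theta are data-dependent; constants keep the sketch short.
    """
    return eta * S_prev - theta * grad
```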




How is memory forgotten?

Forgetting is implemented as an adaptive, data-dependent weight decay in the memory update rule. Interestingly, this weight decay can be seen as a generalized form of the data-dependent gating used in RNNs, applied to matrix- or vector-valued memory.
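Continuing the sketch above, the forgetting step could be written as follows (again my paraphrase, with `alpha` as a constant where TITANS predicts it per token):

```python
def apply_forgetting(M_prev, S_t, alpha=0.05):
    """
    Memory update with forgetting, continuing the earlier sketch:

        M_t = (1 - alpha_t) * M_{t-1} + S_t

    With a data-dependent alpha_t, the factor (1 - alpha_t) behaves like the
    forget gate of a gated RNN, but applied to matrix- or vector-valued memory.
    """
    return (1.0 - alpha) * M_prev + S_t
```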




Is this design parallelizable?

Yes. The TITANS team expressed the test-time memory updates as mini-batch gradient descent written in terms of matrix multiplications, and combined this with weight decay through an additional matrix multiplication. What about the decaying past surprise? The team realized it can be computed with a parallel scan within each mini-batch.
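As a toy illustration of why this is parallel-friendly: a decaying accumulation of the form s_t = eta_t * s_{t-1} + u_t is an associative recurrence, so inside a chunk it can be computed with vectorized cumulative products and sums instead of a token-by-token loop. The sketch below only shows the rewrite; real implementations use proper parallel scans and chunking for numerical stability.

```python
import numpy as np

def decayed_cumsum_loop(u, eta):
    """Reference: sequential recurrence s_t = eta_t * s_{t-1} + u_t."""
    s, out = 0.0, []
    for u_t, eta_t in zip(u, eta):
        s = eta_t * s + u_t
        out.append(s)
    return np.array(out)

def decayed_cumsum_scan(u, eta):
    """
    Same recurrence, vectorized: with A_t = prod_{j<=t} eta_j,
    s_t = A_t * sum_{i<=t} u_i / A_i.  This kind of rewrite is what lets the
    per-chunk computation run as a scan (numerically safe only when the decays
    are not tiny; real systems chunk or work in log space).
    """
    A = np.cumprod(eta)
    return A * np.cumsum(u / A)

u = np.random.default_rng(1).normal(size=8)
eta = np.full(8, 0.9)
assert np.allclose(decayed_cumsum_loop(u, eta), decayed_cumsum_scan(u, eta))
```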




How to integrate memory?

The TITANS team demonstrated three architectural variants where memory can serve as:

  1. context
  2. gate
  3. layer

The memory-as-context variant divides the input into segments (which can be large, even as large as the context window of current attention-based LLMs), uses the past memory state to retrieve the relevant memories for each segment, and then updates the memory from the attention outputs.
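A rough sketch of that flow, as I read it from the paper, might look like the following. All module internals here are stubs (`retrieve_from_memory`, `attention`, `update_memory` are placeholders I made up); the real model uses learned projections, persistent memory tokens, and the neural long-term memory described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seg_len = 16, 4

def retrieve_from_memory(M, segment):
    """Query the long-term memory with the current segment (stub: linear read)."""
    return segment @ M

def attention(tokens):
    """Stand-in for a standard attention block over the concatenated context."""
    w = np.exp(tokens @ tokens.T / np.sqrt(d))
    return (w / w.sum(axis=1, keepdims=True)) @ tokens

def update_memory(M, attn_out, lr=0.01):
    """Stub test-time write: nudge the memory using the attention output."""
    return M + lr * attn_out.T @ attn_out / len(attn_out)

persistent = rng.normal(size=(2, d))     # learnable, input-independent tokens
M = np.zeros((d, d))                     # long-term memory state
x = rng.normal(size=(12, d))             # incoming token stream

outputs = []
for start in range(0, len(x), seg_len):
    segment = x[start:start + seg_len]
    historical = retrieve_from_memory(M, segment)            # 1. read long-term memory
    context = np.concatenate([persistent, historical, segment])
    attn_out = attention(context)                            # 2. short-term attention
    M = update_memory(M, attn_out)                           # 3. write back to memory
    outputs.append(attn_out[-len(segment):])                 # keep the segment's outputs
```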




TITANS' performance in experiments

The TITANS team evaluated the architecture on language modeling, common-sense reasoning, "needle in a haystack," and time-series forecasting tasks. According to the paper, TITANS outperforms Transformer and modern recurrent baselines on these benchmarks, and on long-context retrieval it surpasses even very large models such as GPT-4 and Llama3-70B.




Summary

The TITANS architecture shows how the long-context problem can be tackled by combining a dynamic memory module with the attention mechanism. According to the authors, it clearly outperforms existing Transformer and RNN architectures, and its flexible memory variants let it handle different kinds of tasks, demonstrating its advantage on very large context windows.