
Ali Behrouz
@behrouz_ali
Followers
4K
Following
898
Media
39
Statuses
153
Research Intern @Google, Ph.D. Student @Cornell_CS, interested in machine learning and understanding intelligence.
Joined January 2023
Attention has been the key component for most advances in LLMs, but it can’t scale to long context. Does this mean we need to find an alternative? Presenting Titans: a new architecture with attention and a meta in-context memory that learns how to memorize at test time. Titans…
79
603
3K
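For a concrete picture of "a memory that learns how to memorize at test time", here is a minimal sketch under my own assumptions (the module names, loss, and optimizer are illustrative, not the Titans implementation): the memory is a small network whose parameters are updated by gradient steps on the incoming stream during inference, with prediction error playing the role of surprise.

```python
# Illustrative sketch only (not the Titans implementation): a neural memory
# whose parameters are updated by gradient steps on each incoming chunk at
# inference time, so memorization itself happens at test time.
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy long-term memory: maps keys to values with a small MLP."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        return self.net(k)

def memorize_at_test_time(memory, keys, values, lr=1e-2, steps=1):
    """One online write: push the memory to reconstruct values from keys.
    The reconstruction error plays the role of a surprise signal."""
    opt = torch.optim.SGD(memory.parameters(), lr=lr)
    loss = None
    for _ in range(steps):
        opt.zero_grad()
        loss = ((memory(keys) - values) ** 2).mean()  # surprise = prediction error
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    dim = 32
    mem = NeuralMemory(dim)
    # A stream of (key, value) pairs arriving during inference.
    for chunk in range(5):
        k, v = torch.randn(16, dim), torch.randn(16, dim)
        err = memorize_at_test_time(mem, k, v)
        print(f"chunk {chunk}: surprise (MSE) = {err:.4f}")
```

In the architecture the tweet describes, such an online-updated memory sits alongside attention rather than replacing it.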
RT @gmongaras: Threw a paper I've been working on onto ArXiv. Trying to get a little closer to understanding why softmax in attention works….
arxiv.org
Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main...
0
4
0
RT @behrouz_ali: What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)?….
0
140
0
Everyone is talking about reviewers who don't engage or provide low-quality reviews. While harmful, I don't see that as the biggest threat to the peer review system. As both an author and reviewer, I'm seeing zero-sum debates where a reviewer puts their full effort into rejecting.
Instead of complaining that peer review is dead, take a positive step to improve it today. The reviewers are not aliens, they are us!
- Revise your review and make it clear. Identify the crucial points that impacted your score negatively and positively.
- If the paper is…
1
0
11
RT @mirrokni: Proud to announce an official Gold Medal at #IMO2025🥇. The IMO committee has certified the result from our general-purpose Ge….
deepmind.google
Our advanced model officially achieved a gold-medal level performance on problems from the International Mathematical Olympiad (IMO), the world’s most prestigious competition for young...
0
35
0
RT @reza_byt: 📄 New Paper Alert! ✨ 🚀Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B….
0
55
0
RT @yingheng_wang: ❓ Are LLMs actually problem solvers or just good at regurgitating facts? 🚨New Benchmark Alert! We built HeuriGym to ben….
0
25
0
RT @leloykun: Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration. Hi all, I'm bacc. I have a l….
0
41
0
RT @TheTuringPost: Last week, @Google dropped a paper on ATLAS, a new architecture that reimagines how models learn and use memory. Unfort….
0
77
0
@mirrokni @meisamrr Here is the link to the paper: This is joint work with Zeman Li, Praneeth Kacham, @daliri__majid, @yuandeng_cs, Peilin Zhong, @meisamrr, and @mirrokni.
2
1
41
@mirrokni @meisamrr In our experiments, we focus on language modeling, common-sense reasoning, needle-in-a-haystack, in-context recall, and multi-query associative recall tasks. Atlas is very effective at all scales (tested up to 1.3B), even outperforming Titans and other linear RNNs in long-context tasks.
1
0
36
@mirrokni @meisamrr Can we use what we have learned from the design of long-term neural memory (Atlas) to further enhance attention and Transformers? We aim to strictly generalize the original softmax Transformer in two important aspects: (1) Deep Memory: Transformers use matrix-valued memories…
1
1
32
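To make the "(1) Deep Memory" point above concrete, here is a toy contrast between a matrix-valued (linear) memory and a deep memory, written under my own assumptions rather than taken from the paper's code: both expose the same key-to-value read interface, but one is a single linear map and the other a small MLP.

```python
# Illustrative contrast only (not the paper's code):
# (1) a matrix-valued memory, i.e. a single linear map M with v ≈ M k,
# (2) a "deep" memory with the same interface, parameterized by an MLP.
import torch
import torch.nn as nn

class MatrixMemory(nn.Module):
    """Linear associative memory: the value estimate is M @ key."""
    def __init__(self, dim: int):
        super().__init__()
        self.M = nn.Linear(dim, dim, bias=False)

    def forward(self, k):
        return self.M(k)

class DeepMemory(nn.Module):
    """Same read interface, but the key -> value map is a small MLP."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, k):
        return self.net(k)

def write(memory, keys, values, lr=1e-2):
    """Store associations with one gradient step on the recall error."""
    opt = torch.optim.SGD(memory.parameters(), lr=lr)
    opt.zero_grad()
    loss = ((memory(keys) - values) ** 2).mean()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n = 32, 256
    keys, values = torch.randn(n, dim), torch.randn(n, dim)
    for mem in (MatrixMemory(dim), DeepMemory(dim)):
        errs = [write(mem, keys, values) for _ in range(50)]
        print(f"{type(mem).__name__}: recall MSE after 50 writes = {errs[-1]:.4f}")
```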
@mirrokni @meisamrr Even with a powerful surprise metric and enhanced memory capacity, the memory still needs to be updated and optimized properly. In fact, a bad update rule can cause the memory to get stuck in local optima and fail to properly memorize the context. While almost all models are based…
1
2
39
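The point above about the update rule can be illustrated with a toy experiment (my own sketch, not the paper's optimizer): write the same stream of key-value pairs into two copies of one memory, one updated by plain online gradient descent and one by a momentum-based rule, and compare how well each ends up fitting the context.

```python
# Toy illustration only (not the paper's update rule): the same stream of
# key-value pairs is written into two copies of one memory, one with plain
# online gradient descent and one with momentum, to show that the choice
# of update rule changes how well the memory ends up fitting the context.
import copy
import torch
import torch.nn as nn

def make_memory(dim: int, hidden: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

def stream_write(memory, stream, opt):
    """Feed (key, value) chunks one by one; each chunk gets one update step."""
    losses = []
    for k, v in stream:
        opt.zero_grad()
        loss = ((memory(k) - v) ** 2).mean()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 32
    stream = [(torch.randn(8, dim), torch.randn(8, dim)) for _ in range(200)]

    base = make_memory(dim)
    mem_sgd, mem_mom = copy.deepcopy(base), copy.deepcopy(base)

    sgd = stream_write(mem_sgd, stream, torch.optim.SGD(mem_sgd.parameters(), lr=5e-2))
    mom = stream_write(mem_mom, stream, torch.optim.SGD(mem_mom.parameters(), lr=5e-2, momentum=0.9))

    print("plain SGD, mean loss over last 10 chunks:", sum(sgd[-10:]) / 10)
    print("momentum,  mean loss over last 10 chunks:", sum(mom[-10:]) / 10)
```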
@mirrokni @meisamrr Now that we have addressed the first drawback, how do we enhance the memory capacity (i.e., the number of data samples that the memory can store in its parameters)? Attention acts as an unbounded associative memory that tries to learn the mapping between a set of queries and a set of…
1
1
34
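One way to see what "memory capacity" means here, as a toy illustration under my own assumptions and not the paper's construction or measurement: a linear associative memory over d-dimensional keys can exactly recall only about d random associations, while lifting the keys through a richer feature map raises that number.

```python
# Toy illustration of associative-memory capacity (my own setup, not the
# paper's experiment): a linear memory over d-dimensional keys can exactly
# recall only about d random associations; lifting keys through a richer
# feature map (here, appending pairwise products) raises that number.
import torch

def fit_linear_memory(feats: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Least-squares memory M with values ≈ feats @ M (via pseudo-inverse)."""
    return torch.linalg.pinv(feats) @ values

def recall_error(feats, values, M):
    return ((feats @ M - values) ** 2).mean().item()

def poly_features(k: torch.Tensor) -> torch.Tensor:
    """Append all pairwise products k_i * k_j (a simple degree-2 lift)."""
    outer = (k.unsqueeze(2) * k.unsqueeze(1)).flatten(1)
    return torch.cat([k, outer], dim=1)

if __name__ == "__main__":
    torch.manual_seed(0)
    d, n = 16, 64                      # try to store n > d random pairs
    keys, values = torch.randn(n, d), torch.randn(n, d)

    M_lin = fit_linear_memory(keys, values)
    M_poly = fit_linear_memory(poly_features(keys), values)

    print("raw keys,    recall MSE:", recall_error(keys, values, M_lin))
    print("lifted keys, recall MSE:", recall_error(poly_features(keys), values, M_poly))
```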
@mirrokni @meisamrr How does the memory prune the context? We give the model additional flexibility through gamma parameters that prune the context whenever it is needed. This is similar to forgetting, but with more direct access to the local tokens. That is, the model can simply ignore past tokens by…
1
1
34
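A toy rendering of the gamma-based pruning idea; the gating form and names below are my assumptions rather than the paper's definition: each past token gets a gamma in [0, 1] that scales its contribution to the context, so a near-zero gamma effectively drops that token.

```python
# Toy illustration (the gating form and names are assumptions, not the
# paper's): a learned per-token gamma in [0, 1] scales each past token's
# contribution to the context, so a near-zero gamma prunes that token.
import torch
import torch.nn as nn

class GammaPruning(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # one gamma score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (seq_len, dim). Returns a pruned summary of the context."""
        gamma = torch.sigmoid(self.gate(tokens))          # (seq_len, 1), in [0, 1]
        pruned = gamma * tokens                           # gamma ~ 0 drops the token
        return pruned.sum(dim=0) / gamma.sum().clamp(min=1e-6)

if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, dim = 10, 8
    context = torch.randn(seq_len, dim)
    module = GammaPruning(dim)
    summary = module(context)
    print("summary shape:", tuple(summary.shape))
    print("gammas:", torch.sigmoid(module.gate(context)).squeeze(-1).detach().tolist())
```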
@mirrokni @meisamrr From the memory perspective: our brain prioritizes events that violate its expectations (i.e., surprising events). While an event itself consists of different elements, the judgment of what to prioritize depends on all of them. In Titans, however, the model computes the surprise metric for…
3
3
46
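The (truncated) contrast above is between judging surprise token by token and judging it over a group of tokens. A small sketch under my own assumptions, not the paper's definition of surprise: treat surprise as the memory's prediction error, and aggregate it either per token or over a sliding window so that the score of an event depends on all of its elements.

```python
# Sketch under my own assumptions (not the paper's definition of surprise):
# surprise as the memory's prediction error on incoming tokens, computed
# either per individual token or aggregated over a sliding window, so that
# the surprise of an "event" depends on all of its elements together.
import torch
import torch.nn as nn

def token_surprise(memory: nn.Module, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Per-token surprise: each token's own squared prediction error."""
    with torch.no_grad():
        return ((memory(keys) - values) ** 2).mean(dim=-1)        # (seq_len,)

def windowed_surprise(memory: nn.Module, keys, values, window: int = 4) -> torch.Tensor:
    """Surprise judged over a sliding window of recent tokens."""
    per_token = token_surprise(memory, keys, values)
    # Zero-pad on the left so every position has a full window of history.
    padded = torch.cat([per_token.new_zeros(window - 1), per_token])
    return padded.unfold(0, window, 1).mean(dim=-1)               # (seq_len,)

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, seq_len = 16, 12
    memory = nn.Linear(dim, dim)                 # stand-in for a neural memory
    keys, values = torch.randn(seq_len, dim), torch.randn(seq_len, dim)
    print("per-token surprise:", token_surprise(memory, keys, values))
    print("windowed surprise :", windowed_surprise(memory, keys, values))
```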
@mirrokni @meisamrr Coming back to the first question: What makes attention the critical component for most advances in LLMs, and what holds back long-term memory modules (RNNs)? We observe three disjoint aspects that limit the performance of long-term memory modules in long-context tasks: (1)…
1
0
42