Ali Behrouz (@behrouz_ali)
Followers: 4K · Following: 898 · Media: 39 · Statuses: 153

Research Intern @Google, Ph.D. Student @Cornell_CS, interested in machine learning and understanding intelligence.

Joined January 2023
Ali Behrouz (@behrouz_ali) · 7 months
Attention has been the key component for most advances in LLMs, but it can’t scale to long context. Does this mean we need to find an alternative? Presenting Titans: a new architecture with attention and a meta in-context memory that learns how to memorize at test time. Titans…
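For intuition only, here is a minimal sketch (my own simplification in PyTorch, not the Titans code) of a memory that "learns how to memorize at test time": a small MLP whose weights are updated on the fly by gradient steps on a key-to-value reconstruction ("surprise") loss, with momentum and a decay term playing the role of forgetting. Attention would still handle the current window; this module is only the long-term branch.

```python
# Hypothetical sketch of test-time memorization (simplified; not the official Titans code).
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim, lr=0.1, momentum=0.9, decay=0.01):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.lr, self.momentum, self.decay = lr, momentum, decay
        # Momentum buffers for the test-time updates of the memory weights.
        self.buffers_ = [torch.zeros_like(p) for p in self.mlp.parameters()]

    def write(self, k, v):
        # "Surprise": how badly the memory currently reconstructs v from k.
        loss = (self.mlp(k) - v).pow(2).mean()
        grads = torch.autograd.grad(loss, list(self.mlp.parameters()))
        with torch.no_grad():
            for p, g, m in zip(self.mlp.parameters(), grads, self.buffers_):
                m.mul_(self.momentum).add_(g)             # accumulate past surprise
                p.mul_(1 - self.decay).sub_(self.lr * m)  # decay acts as forgetting

    def read(self, q):
        with torch.no_grad():
            return self.mlp(q)
```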
Ali Behrouz (@behrouz_ali) · 4 days
RT @gmongaras: Threw a paper I've been working on onto ArXiv. Trying to get a little closer to understanding why softmax in attention works….
Link card (arxiv.org): Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main…
Ali Behrouz (@behrouz_ali) · 7 days
RT @behrouz_ali: What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)?….
Ali Behrouz (@behrouz_ali) · 8 days
Everyone is talking about reviewers who don't engage or who provide low-quality reviews. While that is harmful, I don't see it as the biggest threat to the peer review system. As both an author and a reviewer, I'm seeing zero-sum debates where a reviewer puts their full effort into rejecting.
Quoted: Ahmad Beirami (@abeirami) · 9 days
Instead of complaining that peer review is dead, take a positive step to improve it today. The reviewers are not aliens, they are us!
- Revise your review and make it clear. Identify the crucial points that impacted your score negatively and positively.
- If the paper is…
Ali Behrouz (@behrouz_ali) · 21 days
RT @reza_byt: 📄 New Paper Alert! ✨ 🚀 Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B…
Ali Behrouz (@behrouz_ali) · 1 month
RT @yingheng_wang: ❓ Are LLMs actually problem solvers or just good at regurgitating facts? 🚨 New Benchmark Alert! We built HeuriGym to ben…
Ali Behrouz (@behrouz_ali) · 2 months
RT @tdietterich: The scope of what counts as research has narrowed considerably.
Ali Behrouz (@behrouz_ali) · 2 months
RT @leloykun: Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration. Hi all, I'm bacc. I have a l….
Ali Behrouz (@behrouz_ali) · 2 months
Very interesting work!
Quoted: Infini-AI-Lab (@InfiniAILab) · 2 months
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%. 🌐 Website: 🧵 1/n
Ali Behrouz (@behrouz_ali) · 2 months
RT @pmddomingos: The ratio of science to engineering in AI is approaching zero.
Ali Behrouz (@behrouz_ali) · 2 months
RT @TheTuringPost: Last week, @Google dropped a paper on ATLAS, a new architecture that reimagines how models learn and use memory. Unfort….
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr Here is the link to the paper: This is joint work with Zeman Li, Praneeth Kacham, @daliri__majid, @yuandeng_cs, Peilin Zhong, @meisamrr, and @mirrokni.
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr In our experiments, we focus on language modeling, common-sense reasoning, needle-in-a-haystack, in-context recall, and multi-query associative recall tasks. Atlas is very effective at all scales (tested up to 1.3B), even outperforming Titans and other linear RNNs in long…
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr How do we incorporate the memory? We follow Titans and use the Memory as Context (MAC), Memory as Gate (MAG), and Memory as Layer (MAL) variants, but without persistent memory tokens in our experiments.
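As an illustration of the gating variant, here is a minimal sketch of a MAG-style block as I read it: one branch runs attention over the current window, the other runs the long-term memory, and a learned gate mixes them. The gate form and any normalization are my assumptions, not the paper's implementation.

```python
# Hypothetical MAG-style combination (simplified illustration, not the official code).
import torch
import torch.nn as nn

class MemoryAsGate(nn.Module):
    def __init__(self, dim, attn, memory):
        super().__init__()
        self.attn = attn        # e.g., sliding-window attention block: (B, T, D) -> (B, T, D)
        self.memory = memory    # long-term memory block:              (B, T, D) -> (B, T, D)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        a = self.attn(x)                 # short-range, precise retrieval
        m = self.memory(x)               # long-range, compressed memory
        g = torch.sigmoid(self.gate(x))  # learned per-channel gate
        return g * a + (1 - g) * m       # gated mixture of the two branches
```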
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr Can we use what we have learned from the design of long-term neural memory (Atlas) to further enhance attention and Transformers? We aim to strictly generalize the original softmax Transformer along two important aspects: (1) Deep Memory: Transformers use matrix-valued memories…
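For background on the "matrix-valued memory" point: in linear-attention-style models the whole context is compressed into one matrix updated by outer products, while a deep memory stores the same associations in the weights of a small MLP updated by gradient steps. A schematic, hypothetical contrast:

```python
# Illustrative contrast (my own simplification): matrix-valued memory vs. deep MLP memory.
import torch

def matrix_memory_step(M, k, v):
    # Linear-attention-style memory: a single (d x d) matrix updated by an outer product.
    # Read-out for a query q is simply M @ q.
    return M + torch.outer(v, k)

def deep_memory_step(mlp, k, v, lr=0.1):
    # "Deep" memory: the same key->value association is stored by a gradient step
    # on the weights of a small MLP instead of a rank-1 matrix update.
    loss = (mlp(k) - v).pow(2).mean()
    grads = torch.autograd.grad(loss, list(mlp.parameters()))
    with torch.no_grad():
        for p, g in zip(mlp.parameters(), grads):
            p -= lr * g
```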
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr Even with a powerful surprise metric and enhanced memory capacity, the memory still needs to be properly updated and optimized. In fact, a bad update rule can cause the memory to get stuck in local optima, so that it does not properly memorize the context. While almost all models are based…
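To make the "bad update rule" point concrete, here is a schematic comparison (my own illustration, not the paper's optimizer) of a plain per-token gradient step against a momentum-style step that aggregates surprise across recent tokens and is less likely to stall on a single noisy token.

```python
# Hypothetical illustration: two update rules for the same memory parameters.
import torch

def plain_update(params, grads, lr=0.1):
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g                  # reacts only to the current token

def momentum_update(params, grads, state, lr=0.1, beta=0.9):
    with torch.no_grad():
        for p, g, s in zip(params, grads, state):
            s.mul_(beta).add_(g)         # running accumulation of past surprise
            p -= lr * s                  # smoother trajectory, less prone to stalling
```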
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr Now that we have addressed the first drawback, how do we enhance the memory capacity (i.e., the number of data samples that the memory can store in its parameters)? Attention acts as an unbounded associative memory that tries to learn the mapping between a set of queries and a set of…
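One standard way to raise the capacity of such an associative memory, used here purely as an assumed illustration, is to store associations over a feature map of the keys rather than the raw keys; with a degree-2 polynomial map, a d x d matrix memory becomes d x d^2 and can separate many more key-value pairs.

```python
# Hypothetical capacity trick: memorize over a polynomial feature map of the key (illustrative).
import torch

def phi(k):
    return torch.outer(k, k).reshape(-1)   # degree-2 features: d -> d^2

def write(M, k, v):
    return M + torch.outer(v, phi(k))      # memory M has shape (d, d^2)

def read(M, q):
    return M @ phi(q)                      # retrieve the value associated with q

d = 4
M = torch.zeros(d, d * d)
k, v = torch.randn(d), torch.randn(d)
M = write(M, k, v)
print(read(M, k) / phi(k).dot(phi(k)))     # recovers v exactly for a single stored pair
```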
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr How does the memory prune the context? We give the model additional flexibility via gamma parameters that prune the context whenever needed. This is similar to forgetting, but with more direct access to the local tokens. That is, the model can simply ignore past tokens by…
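My best guess at the mechanism (a hypothetical sketch, not the paper's formulation): each token in the window gets a pruning weight gamma in (0, 1) inside the memorization objective, so driving a token's gamma toward zero removes it from what the memory tries to store.

```python
# Hypothetical per-token pruning weights in the memorization objective (illustrative only).
import torch

def weighted_memorization_loss(memory, keys, values, gamma_logits):
    # gamma in (0, 1): how much each token in the window should count.
    gamma = torch.sigmoid(gamma_logits)                       # (T,)
    per_token = (memory(keys) - values).pow(2).mean(dim=-1)   # (T,) reconstruction error
    return (gamma * per_token).sum() / gamma.sum().clamp(min=1e-6)
```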
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr From the memory perspective: our brain prioritizes events that violate our expectations (i.e., are surprising). While an event itself consists of different elements, the judgment of prioritization depends on them all. In Titans, however, the model computes the surprise metric for…
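To make the contrast concrete, a schematic sketch (my own illustration; the precise objective is in the paper): a token-level surprise scores only the current key-value pair, whereas a context-level surprise evaluates the memory jointly on a window of recent tokens before the update.

```python
# Illustrative contrast between per-token and window-level surprise (not the paper's exact loss).
import torch

def token_surprise(memory, k_t, v_t):
    # Surprise from the current token alone.
    return (memory(k_t) - v_t).pow(2).mean()

def window_surprise(memory, K_win, V_win):
    # Surprise measured jointly over a sliding window of recent tokens,
    # so the update reflects the whole (partial) context, not one element.
    return (memory(K_win) - V_win).pow(2).mean()
```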
Ali Behrouz (@behrouz_ali) · 2 months
@mirrokni @meisamrr Coming back to the first question: what makes attention the critical component for most advances in LLMs, and what holds back long-term memory modules (RNNs)? We observe three disjoint aspects that limit the performance of long-term memory modules in long-context tasks: (1)…