Aviv Bick Profile
Aviv Bick

@avivbick

Followers
269
Following
148
Media
10
Statuses
58

CS PhD student at Carnegie Mellon

Joined January 2024
@avivbick
Aviv Bick
2 months
The Transformer–SSM retrieval gap is driven by just a few heads!
SSMs lag on tasks like MMLU (multiple-choice) and GSM8K (math) due to in-context retrieval challenges.
But here’s the twist: just a handful of heads handle retrieval in both architectures.
What we found 👇 1/
5
28
192
@avivbick
Aviv Bick
10 hours
RT @rbuit_: Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the….
0
30
0
@avivbick
Aviv Bick
23 days
RT @HanGuo97: We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? I….
0
191
0
@avivbick
Aviv Bick
1 month
RT @orvieto_antonio: We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising….
0
40
0
@avivbick
Aviv Bick
2 months
RT @abk_tau: New work! 🚨. Recurrent LLMs like Mamba and RWKV can efficiently process millions of tokens, yet still underperform on real-wor….
0
24
0
@avivbick
Aviv Bick
2 months
For more details, experiments, and cool insights, check out:
📄 Full paper:
📓 Notebook:
This is joint work with @ericxing and @_albertgu
6/
0
1
23
@avivbick
Aviv Bick
2 months
And this is why Hybrid models work 🦸
They solve the G&A bottleneck in SSMs.
> During training, Aggregate roles go to attention.
> Pretrained SSM improves sharply when one Aggregate is replaced with attention.
Where should you place attention?
👉 Where G&A heads emerge.
5/
1
0
17
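A rough sketch of the placement recipe from the tweet above: keep most of the stack as SSM blocks and swap in attention only at the depths where G&A heads emerge. The layer count and indices below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical hybrid layout: mostly SSM blocks, with attention swapped in at
# the depths where Gather-and-Aggregate heads tend to emerge.
# All indices here are made up for illustration.
N_LAYERS = 24
GA_LAYERS = {10, 17}  # hypothetical depths where G&A heads appear

layer_types = ["attention" if i in GA_LAYERS else "ssm" for i in range(N_LAYERS)]
print(layer_types)
```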
@avivbick
Aviv Bick
2 months
Although both architectures develop G&A heads, SSMs’ smoother attention maps weaken the implementation of these heads compared to Transformers.
That is, SSMs handle language and memory well, but retrieval bottlenecks in a few G&A heads 🤏
4/
3
1
18
@avivbick
Aviv Bick
2 months
What makes that head so special?
It’s part of a two-head mechanism: Gather-and-Aggregate (G&A) Heads.
> Gather: pulls relevant info.
> Aggregate: combines it to answer.
Transformers and SSMs seem different, but we find that both use G&A!
And in both, a few heads do all the work 💪
3/
1
1
12
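To make the two-step routing in the tweet above concrete, here is a toy illustration with hand-built attention maps (made-up matrices, not weights or code from the paper): a Gather map copies each option’s text onto its answer letter, and an Aggregate map lets the final query position read from the letter of the correct option.

```python
import numpy as np

# Toy sequence: each answer letter is followed by its option text, then a query.
tokens = ["A", "txtA", "B", "txtB", "C", "txtC", "D", "txtD", "query"]
n = len(tokens)
values = np.eye(n)                       # stand-in for each token's content

# "Gather" map: every letter position pulls in the content token right after it.
gather = np.zeros((n, n))
for letter_pos in (0, 2, 4, 6):
    gather[letter_pos, letter_pos + 1] = 1.0

# "Aggregate" map: the query position reads from the letter of the correct
# option (pretend the answer is C, sitting at position 4).
aggregate = np.zeros((n, n))
aggregate[n - 1, 4] = 1.0

gathered = gather @ values               # option text now lives at letter positions
out = aggregate @ gathered               # the query combines the gathered info
print(tokens[int(out[n - 1].argmax())])  # -> "txtC": retrieved in two hops
```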
@avivbick
Aviv Bick
2 months
We removed one head from a pruned 4B Llama.
📉 MMLU dropped: 66% → 25%.
The model still knew the answer, but couldn’t retrieve the correct letter (A–D) from context.
The same phenomenon appears in SSMs 🐍
2/
1
0
11
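A minimal sketch of this kind of single-head ablation, assuming a Hugging Face Llama-style checkpoint; the checkpoint path, layer index, head index, and prompt are placeholders, not the ones used in the paper.

```python
# Sketch: zero out one attention head's output and score the answer letters.
# Placeholders throughout -- not the paper's model, layer, head, or eval code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/pruned-llama-4b"   # placeholder; any Llama-style checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER, HEAD = 14, 3                      # hypothetical head to knock out
head_dim = model.config.hidden_size // model.config.num_attention_heads

def zero_one_head(module, args):
    # o_proj receives the concatenated per-head outputs; zero one head's slice.
    hidden = args[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0
    return (hidden,)

hook = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(zero_one_head)

# Simplified MMLU-style probe: compare next-token logits for the answer letters.
prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
letter_ids = {c: tok(" " + c, add_special_tokens=False).input_ids[-1] for c in "ABCD"}
print({c: logits[i].item() for c, i in letter_ids.items()})

hook.remove()                            # restore the original model
```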
@avivbick
Aviv Bick
2 months
RT @electronickale: ✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRIS….
0
31
0
@avivbick
Aviv Bick
3 months
RT @kevinyli_: At #ICLR2025 to present two recent works on reasoning distillation and efficient VLM inference with my wonderful collaborato….
0
4
0
@avivbick
Aviv Bick
3 months
RT @ashertrockman: Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your….
0
29
0
@avivbick
Aviv Bick
3 months
RT @LiaoIsaac91893: Scores 4.17% on ARC-AGI 2 on Kaggle! 🔗 Code provided in the Kaggle notebook:
0
21
0
@avivbick
Aviv Bick
4 months
RT @KrithikTweets: 🧬 Meet Lyra, a new paradigm for accessible, powerful modeling of biological sequences. Lyra is a lightweight SSM achievi….
0
146
0
@avivbick
Aviv Bick
4 months
RT @_albertgu: Announcing Cartesia’s Series A towards our mission of building real-time intelligence. I’m cooking up some new models in the….
0
10
0
@avivbick
Aviv Bick
4 months
Amazing!
Small & efficient on-device Llamba reduces communication with larger cloud models, pushing the edge forward 🚀
Check it out ->
@Avanika15
Avanika Narayan
4 months
[3/7] minions daily ship 🚢
🐍 ssms on edge: support for running @cartesia_ai’s llamba models natively on-device. s/o to @avivbick, @jundesai, @krandiash for the tlc 💚
see the models in action, summarizing the superintelligence strategy from @DanHendrycks, @ericschmidt and
1
1
6
@avivbick
Aviv Bick
4 months
RT @PranjalAggarw16: What if you could control how long a reasoning model “thinks”? Presenting L1-1.5B, an RL-trained reasoning model with….
0
70
0
@avivbick
Aviv Bick
4 months
0
1
10
@avivbick
Aviv Bick
4 months
6️⃣ Try it out!
We’ve released an optimized implementation for edge deployment:
🔗
For more details, check out:
📄 The full paper:
📝 The blog post:
1
3
14