Aviv Bick Profile
Aviv Bick

@avivbick

Followers
269
Following
148
Media
10
Statuses
58

CS PhD student at Carnegie Mellon

Joined January 2024
@avivbick
Aviv Bick
2 months
The Transformer–SSM retrieval gap is driven by just a few heads!
SSMs lag on tasks like MMLU (multiple-choice) and GSM8K (math) due to in-context retrieval challenges.
But here’s the twist: just a handful of heads handle retrieval in both architectures.
What we found 👇 1/
5
28
192
@avivbick
Aviv Bick
10 hours
RT @rbuit_: Despite theoretically handling long contexts, existing recurrent models still fall short: they may fail to generalize past the….
0
30
0
@avivbick
Aviv Bick
23 days
RT @HanGuo97: We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? I….
0
191
0
@avivbick
Aviv Bick
1 month
RT @orvieto_antonio: We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising….
0
40
0
@avivbick
Aviv Bick
2 months
RT @abk_tau: New work! 🚨. Recurrent LLMs like Mamba and RWKV can efficiently process millions of tokens, yet still underperform on real-wor….
0
24
0
@avivbick
Aviv Bick
2 months
For more details, experiments, and cool insights, check out:
📄 Full paper:
📓 Notebook:
This is joint work with @ericxing and @_albertgu
6/
0
1
23
@avivbick
Aviv Bick
2 months
And this is why Hybrid models work 🦸
They solve the G&A bottleneck in SSMs.
> During training, Aggregate roles go to attention.
> Pretrained SSM improves sharply when one Aggregate is replaced with attention.
Where should you place attention?
👉 Where G&A heads emerge.
5/
1
0
17
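A rough sketch of the placement recipe from the tweet above: keep most of the stack as SSM blocks and swap in attention only at the depths where G&A heads emerge. The layer count and indices below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical hybrid layout: mostly SSM blocks, with attention swapped in at
# the depths where Gather-and-Aggregate heads tend to emerge.
# All indices here are made up for illustration.
N_LAYERS = 24
GA_LAYERS = {10, 17}  # hypothetical depths where G&A heads appear

layer_types = ["attention" if i in GA_LAYERS else "ssm" for i in range(N_LAYERS)]
print(layer_types)
```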
@avivbick
Aviv Bick
2 months
Although both architectures develop G&A heads, SSMs’ smoother attention maps weaken the implementation of these heads compared to Transformers.
That is, SSMs handle language and memory well, but retrieval bottlenecks in a few G&A heads 🤏
4/
3
1
18
@avivbick
Aviv Bick
2 months
What makes that head so special?
It’s part of a two-head mechanism: Gather-and-Aggregate (G&A) Heads.
> Gather: pulls relevant info.
> Aggregate: combines it to answer.
Transformers and SSMs seem different, but we find that both use G&A!
And in both, a few heads do all the work 💪
3/
1
1
12
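To make the two-step routing in the tweet above concrete, here is a toy illustration with hand-built attention maps (made-up matrices, not weights or code from the paper): a Gather map copies each option’s text onto its answer letter, and an Aggregate map lets the final query position read from the letter of the correct option.

```python
import numpy as np

# Toy sequence: each answer letter is followed by its option text, then a query.
tokens = ["A", "txtA", "B", "txtB", "C", "txtC", "D", "txtD", "query"]
n = len(tokens)
values = np.eye(n)                       # stand-in for each token's content

# "Gather" map: every letter position pulls in the content token right after it.
gather = np.zeros((n, n))
for letter_pos in (0, 2, 4, 6):
    gather[letter_pos, letter_pos + 1] = 1.0

# "Aggregate" map: the query position reads from the letter of the correct
# option (pretend the answer is C, sitting at position 4).
aggregate = np.zeros((n, n))
aggregate[n - 1, 4] = 1.0

gathered = gather @ values               # option text now lives at letter positions
out = aggregate @ gathered               # the query combines the gathered info
print(tokens[int(out[n - 1].argmax())])  # -> "txtC": retrieved in two hops
```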
@avivbick
Aviv Bick
2 months
We removed one head from a pruned 4B Llama.
📉 MMLU dropped: 66% → 25%.
The model still knew the answer, but couldn’t retrieve the correct letter (A–D) from context.
The same phenomenon appears in SSMs 🐍
2/
1
0
11
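A minimal sketch of this kind of single-head ablation, assuming a Hugging Face Llama-style checkpoint; the checkpoint path, layer index, head index, and prompt are placeholders, not the ones used in the paper.

```python
# Sketch: zero out one attention head's output and score the answer letters.
# Placeholders throughout -- not the paper's model, layer, head, or eval code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/pruned-llama-4b"   # placeholder; any Llama-style checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER, HEAD = 14, 3                      # hypothetical head to knock out
head_dim = model.config.hidden_size // model.config.num_attention_heads

def zero_one_head(module, args):
    # o_proj receives the concatenated per-head outputs; zero one head's slice.
    hidden = args[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0
    return (hidden,)

hook = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(zero_one_head)

# Simplified MMLU-style probe: compare next-token logits for the answer letters.
prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
letter_ids = {c: tok(" " + c, add_special_tokens=False).input_ids[-1] for c in "ABCD"}
print({c: logits[i].item() for c, i in letter_ids.items()})

hook.remove()                            # restore the original model
```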
@avivbick
Aviv Bick
2 months
RT @electronickale: ✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRIS….
0
31
0
@avivbick
Aviv Bick
3 months
RT @kevinyli_: At #ICLR2025 to present two recent works on reasoning distillation and efficient VLM inference with my wonderful collaborato….
0
4
0
@avivbick
Aviv Bick
3 months
RT @ashertrockman: Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your….
0
29
0
@avivbick
Aviv Bick
3 months
RT @LiaoIsaac91893: Scores 4.17% on ARC-AGI 2 on Kaggle! 🔗 Code provided in the Kaggle notebook:
0
21
0
@avivbick
Aviv Bick
4 months
RT @KrithikTweets: 🧬 Meet Lyra, a new paradigm for accessible, powerful modeling of biological sequences. Lyra is a lightweight SSM achievi….
0
146
0
@avivbick
Aviv Bick
4 months
RT @_albertgu: Announcing Cartesia’s Series A towards our mission of building real-time intelligence. I’m cooking up some new models in the….
0
10
0
@avivbick
Aviv Bick
4 months
Amazing!
Small & efficient on-device Llamba reduces communication with larger cloud models, pushing the edge forward 🚀
Check it out ->
@Avanika15
Avanika Narayan
4 months
[3/7] minions daily ship 🚢
🐍 ssms on edge: support for running @cartesia_ai’s llamba models natively on-device. s/o to @avivbick, @jundesai, @krandiash for the tlc 💚
see the models in action, summarizing the superintelligence strategy from @DanHendrycks, @ericschmidt and
1
1
6
@avivbick
Aviv Bick
4 months
RT @PranjalAggarw16: What if you could control how long a reasoning model “thinks”? Presenting L1-1.5B, an RL-trained reasoning model with….
0
70
0
@avivbick
Aviv Bick
4 months
0
1
10
@avivbick
Aviv Bick
4 months
6️⃣ Try it out!
We’ve released an optimized implementation for edge deployment:
🔗
For more details, check out:
📄 The full paper:
📝 The blog post:
1
3
14