Nicolas Zucchet Profile
Nicolas Zucchet

@NicolasZucchet

Followers
496
Following
743
Media
33
Statuses
162

PhD student @CSatETH, prev. student researcher @GoogleDeepMind | @Polytechnique

Joined December 2017
@NicolasZucchet
Nicolas Zucchet
1 month
🧵What if emergence could be explained by learning a specific circuit: sparse attention? Our new work explores this bold hypothesis, showing a link between emergence and sparse attention that reveals how data properties influence when emergence occurs during training.
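One simple way to make "sparse attention" concrete is to measure the entropy of a head's attention distribution over the context: low entropy means the head concentrates its mass on a few tokens. This entropy probe is my own illustration of the idea, not necessarily the paper's metric:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_entropy(logits):
    """Shannon entropy (nats) of an attention distribution.
    Low entropy = attention mass on few tokens = 'sparse' attention."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
seq_len = 64
diffuse = rng.normal(0, 0.1, seq_len)  # near-uniform attention logits
sparse = diffuse.copy()
sparse[3] += 10.0                      # one token dominates

print(attention_entropy(diffuse))  # close to log(64) ≈ 4.16
print(attention_entropy(sparse))   # near 0
```

A uniform distribution over 64 tokens has entropy log(64); a one-hot-like distribution has entropy near zero, so the gap between the two quantifies how sparse the head is.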
@NicolasZucchet
Nicolas Zucchet
18 days
RT @oswaldjoh: Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of locall…
@NicolasZucchet
Nicolas Zucchet
30 days
RT @scychan_brains: Emergence in transformers is a real phenomenon! Behaviors and capabilities can appear in models in sudden ways. Emerge…
@NicolasZucchet
Nicolas Zucchet
1 month
RT @orvieto_antonio: We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising…
@NicolasZucchet
Nicolas Zucchet
1 month
RT @AndrewLampinen: Some nice analysis by Nicolas & Francesco of a clear case of emergence, and how to accelerate its acquisition!
@NicolasZucchet
Nicolas Zucchet
1 month
RT @scychan_brains: Smooth predictable scaling laws are central to our conceptions and forecasts about AI -- but lots of capabilities actua…
@NicolasZucchet
Nicolas Zucchet
1 month
Huge thanks to my amazing coauthors @dngfra @AndrewLampinen @scychan_brains 🙏 Excited to see where this research on emergence and sparse attention leads. Check out the full paper here:
@NicolasZucchet
Nicolas Zucchet
1 month
This opens fascinating questions, e.g.:
- How much of observed emergence links to sparse attention?
- Since sparse attention is ubiquitous in LLMs, is emergence more common than we think?
- Can we accelerate it by reducing data diversity or sequence length?
@NicolasZucchet
Nicolas Zucchet
1 month
We validate on an in-context associative recall task where Transformers learn induction heads: circuits that use two attention layers, each focusing on a few tokens. A perfect testbed for our sparse attention theory: the same qualitative findings also hold here!
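The associative recall setup can be sketched as follows; the vocabulary and key-value pair format are illustrative, not the paper's exact task. An induction head solves it by attending from the final (query) token back to the token that followed the same key earlier in the context:

```python
import random

def recall_example(n_pairs=4, vocab=list("abcdefghij"), seed=None):
    """Build one in-context associative recall example.
    Context: key1 val1 key2 val2 ...; then one key is repeated as the query;
    the target is the value that followed that key in the context."""
    rng = random.Random(seed)
    keys = rng.sample(vocab, n_pairs)              # distinct keys
    values = [rng.choice(vocab) for _ in keys]     # values may repeat
    context = [tok for pair in zip(keys, values) for tok in pair]
    q = rng.randrange(n_pairs)
    return context + [keys[q]], values[q]

seq, target = recall_example(seed=0)
```

The model sees `seq` and must output `target`: a pure retrieval problem, solvable only by attending to the right position rather than memorizing token statistics.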
@NicolasZucchet
Nicolas Zucchet
1 month
Low data diversity dramatically accelerates emergence:
- In-context repetition reduces attention sparsity, thus simplifying the task.
- Cross-sample repetition accelerates learning in repeated input directions, which then speeds up attention learning and learning overall.
@NicolasZucchet
Nicolas Zucchet
1 month
This task produces phase transitions typical of emergence! The initial plateau occurs because the feedforward weights must learn some task structure before the attention can learn. In a toy model, we can analytically predict emergence timing from sequence length & input dimensionality.
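The plateau-then-jump shape is characteristic of multiplicative learning dynamics, where one set of weights must grow before the other can. As a loose illustration (a scalar toy of my own, not the paper's model or its analytic prediction), two multiplied parameters trained by gradient descent barely move while both are small, then converge abruptly:

```python
# Toy illustration: loss = (u*v - 1)^2 with small initialisation.
# Gradient descent is nearly flat while u and v are both small
# (a plateau), then the product takes off -- a phase-transition-like curve.
u = v = 0.01
lr = 0.01
losses = []
for step in range(1500):
    err = u * v - 1.0
    losses.append(err * err)
    gu, gv = 2 * err * v, 2 * err * u   # dL/du, dL/dv
    u, v = u - lr * gu, v - lr * gv

print(losses[50])    # still ~1.0: the plateau
print(losses[-1])    # near 0: after the transition
```

Each parameter's gradient is scaled by the other, so progress compounds: slow while both are small, explosive once either grows, which is one intuition for why a plateau precedes sudden emergence.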
@NicolasZucchet
Nicolas Zucchet
1 month
We test this idea with a simple linear regression variant where networks must find the relevant token and transform it. We add repetition either in-context (same token appears multiple times in the context) or cross-sample (same token appears more frequently across examples).
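The two repetition regimes can be sketched as data-generation knobs; the token embeddings, dimensions, and Zipf-style skew below are illustrative guesses, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, seq_len = 32, 8, 16
tokens = rng.normal(size=(vocab_size, dim))   # random token embeddings

def sample_sequence(in_context_rep=0, zipf_skew=0.0):
    """One training sequence: seq_len tokens, one marked as 'relevant'.
    in_context_rep: extra copies of the relevant token inside the context
                    (in-context repetition).
    zipf_skew: >0 skews which tokens are drawn across samples
               (cross-sample repetition)."""
    if zipf_skew > 0:
        p = 1.0 / np.arange(1, vocab_size + 1) ** zipf_skew
        p /= p.sum()
        ids = rng.choice(vocab_size, size=seq_len, p=p)
    else:
        ids = rng.choice(vocab_size, size=seq_len)
    rel = rng.integers(seq_len)               # position of the relevant token
    for _ in range(in_context_rep):           # duplicate it elsewhere
        ids[rng.integers(seq_len)] = ids[rel]
    return tokens[ids], rel

seq, rel = sample_sequence(in_context_rep=2, zipf_skew=1.0)
```

Setting `in_context_rep > 0` duplicates the relevant token within a sequence, while `zipf_skew > 0` makes some tokens reappear across examples: the two knobs whose effects on emergence speed the tweet above describes.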
@NicolasZucchet
Nicolas Zucchet
2 months
RT @tyler_m_john: I really like this new op ed from @DavidDuvenaud on how so many different kinds of pressures could drive towards loss of…
@NicolasZucchet
Nicolas Zucchet
2 months
RT @AndrewLampinen: How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context le…
@NicolasZucchet
Nicolas Zucchet
2 months
RT @orvieto_antonio: This is just a reminder for your NeurIPS experiments: if you are comparing architectures, optimizers, or whatever at a…
@NicolasZucchet
Nicolas Zucchet
3 months
RT @sohamde_: Our new paper sheds light on the process of knowledge acquisition in language models, with implications for:
- data curricula…
@NicolasZucchet
Nicolas Zucchet
3 months
RT @K_Ishi_AI: From Google DeepMind, a paper elucidating how LLMs acquire knowledge. Early in LLM training there is a plateau period where knowledge acquisition stalls. In fact, during this period the model focuses on specific elements and establishes efficient attention patterns for acquiring knowledge. Then, rapid kn…
@NicolasZucchet
Nicolas Zucchet
3 months
Thanks to my co-authors Jörg Bornschein, @scychan_brains, @AndrewLampinen, Razvan Pascanu, and @sohamde_. I couldn't have dreamed of a better team for this collaboration! Check out the full paper for all the technical details.
@NicolasZucchet
Nicolas Zucchet
3 months
Our work suggests practical LLM training strategies:
1. Use synthetic data early, as plateau-phase data isn't retained anyway.
2. Implement dynamic data schedulers that use low diversity during plateaus and high diversity afterward (similar to how we learn as infants!).
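Strategy 2 could be sketched as a loss-plateau-triggered diversity schedule; the plateau detector, thresholds, and interface below are hypothetical illustrations, not the paper's method:

```python
def diversity_schedule(loss_history, window=100, rel_improvement=0.01,
                       low=0.2, high=1.0):
    """Return a data-diversity level in [low, high].
    If the loss improved by less than `rel_improvement` (relative) over the
    last `window` steps, assume a plateau and reduce diversity; otherwise
    train on fully diverse data. All thresholds are illustrative."""
    if len(loss_history) < window:
        return high
    past, now = loss_history[-window], loss_history[-1]
    plateauing = (past - now) < rel_improvement * past
    return low if plateauing else high

# Flat losses -> plateau detected -> low diversity
print(diversity_schedule([1.0] * 200))
# Steadily falling losses -> high diversity
print(diversity_schedule([1.0 - 0.004 * i for i in range(200)]))
```

A training loop would call this each step and use the returned level to mix a narrow, repetitive data stream with the full corpus: low diversity to escape the plateau, high diversity afterward to retain generality.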
@NicolasZucchet
Nicolas Zucchet
3 months
Hallucinations emerge with knowledge. As models learn facts about seen individuals, they also make overconfident predictions about unseen ones. On top of that, fine-tuning struggles to add new knowledge: existing memories are quickly corrupted when learning new ones.