Nicolas Zucchet Profile
Nicolas Zucchet

@NicolasZucchet

Followers
496
Following
743
Media
33
Statuses
162

PhD student @CSatETH, prev. student researcher @GoogleDeepMind | @Polytechnique

Joined December 2017
@NicolasZucchet
Nicolas Zucchet
1 month
🧵What if emergence could be explained by learning a specific circuit: sparse attention? Our new work explores this bold hypothesis, showing a link between emergence and sparse attention that reveals how data properties influence when emergence occurs during training.
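One simple way to make "sparse attention" concrete is to measure the entropy of a head's attention distribution over the context: low entropy means the head concentrates its mass on a few tokens. This entropy probe is my own illustration of the idea, not necessarily the paper's metric:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_entropy(logits):
    """Shannon entropy (nats) of an attention distribution.
    Low entropy = attention mass on few tokens = 'sparse' attention."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
seq_len = 64
diffuse = rng.normal(0, 0.1, seq_len)  # near-uniform attention logits
sparse = diffuse.copy()
sparse[3] += 10.0                      # one token dominates

print(attention_entropy(diffuse))  # close to log(64) ≈ 4.16
print(attention_entropy(sparse))   # near 0
```

A uniform distribution over 64 tokens has entropy log(64); a one-hot-like distribution has entropy near zero, so the gap between the two quantifies how sparse the head is.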
@NicolasZucchet
Nicolas Zucchet
18 days
RT @oswaldjoh: Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of locall…
@NicolasZucchet
Nicolas Zucchet
30 days
RT @scychan_brains: Emergence in transformers is a real phenomenon! Behaviors and capabilities can appear in models in sudden ways. Emerge…
@NicolasZucchet
Nicolas Zucchet
1 month
RT @orvieto_antonio: We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising…
@NicolasZucchet
Nicolas Zucchet
1 month
RT @AndrewLampinen: Some nice analysis by Nicolas & Francesco of a clear case of emergence, and how to accelerate its acquisition!
@NicolasZucchet
Nicolas Zucchet
1 month
RT @scychan_brains: Smooth predictable scaling laws are central to our conceptions and forecasts about AI -- but lots of capabilities actua…
@NicolasZucchet
Nicolas Zucchet
1 month
Huge thanks to my amazing coauthors @dngfra @AndrewLampinen @scychan_brains 🙏 Excited to see where this research on emergence and sparse attention leads. Check out the full paper here:
@NicolasZucchet
Nicolas Zucchet
1 month
This opens fascinating questions, e.g.:
- How much of observed emergence links to sparse attention?
- Since sparse attention is ubiquitous in LLMs, is emergence more common than we think?
- Can we accelerate it by reducing data diversity or sequence length?
@NicolasZucchet
Nicolas Zucchet
1 month
We validate on an in-context associative recall task where Transformers learn induction heads: circuits that use two attention layers, each focusing on a few tokens. A perfect testbed for our sparse attention theory: the same qualitative findings also hold here!
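The associative recall setup can be sketched as follows; the vocabulary and key-value pair format are illustrative, not the paper's exact task. An induction head solves it by attending from the final (query) token back to the token that followed the same key earlier in the context:

```python
import random

def recall_example(n_pairs=4, vocab=list("abcdefghij"), seed=None):
    """Build one in-context associative recall example.
    Context: key1 val1 key2 val2 ...; then one key is repeated as the query;
    the target is the value that followed that key in the context."""
    rng = random.Random(seed)
    keys = rng.sample(vocab, n_pairs)              # distinct keys
    values = [rng.choice(vocab) for _ in keys]     # values may repeat
    context = [tok for pair in zip(keys, values) for tok in pair]
    q = rng.randrange(n_pairs)
    return context + [keys[q]], values[q]

seq, target = recall_example(seed=0)
```

The model sees `seq` and must output `target`: a pure retrieval problem, solvable only by attending to the right position rather than memorizing token statistics.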
@NicolasZucchet
Nicolas Zucchet
1 month
Low data diversity dramatically accelerates emergence:
- In-context repetition reduces attention sparsity, thus simplifying the task.
- Cross-sample repetition accelerates learning in repeated input directions, which then speeds up attention learning and learning overall.
@NicolasZucchet
Nicolas Zucchet
1 month
This task produces phase transitions typical of emergence! The initial plateau occurs because the feedforward weights must learn some task structure before the attention can learn. In a toy model, we can analytically predict emergence timing from sequence length & input dimensionality.
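The plateau-then-jump shape is characteristic of multiplicative learning dynamics, where one set of weights must grow before the other can. As a loose illustration (a scalar toy of my own, not the paper's model or its analytic prediction), two multiplied parameters trained by gradient descent barely move while both are small, then converge abruptly:

```python
# Toy illustration: loss = (u*v - 1)^2 with small initialisation.
# Gradient descent is nearly flat while u and v are both small
# (a plateau), then the product takes off -- a phase-transition-like curve.
u = v = 0.01
lr = 0.01
losses = []
for step in range(1500):
    err = u * v - 1.0
    losses.append(err * err)
    gu, gv = 2 * err * v, 2 * err * u   # dL/du, dL/dv
    u, v = u - lr * gu, v - lr * gv

print(losses[50])    # still ~1.0: the plateau
print(losses[-1])    # near 0: after the transition
```

Each parameter's gradient is scaled by the other, so progress compounds: slow while both are small, explosive once either grows, which is one intuition for why a plateau precedes sudden emergence.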
@NicolasZucchet
Nicolas Zucchet
1 month
We test this idea with a simple linear regression variant where networks must find the relevant token and transform it. We add repetition either in-context (same token appears multiple times in the context) or cross-sample (same token appears more frequently across examples).
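The two repetition regimes can be sketched as data-generation knobs; the token embeddings, dimensions, and Zipf-style skew below are illustrative guesses, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, seq_len = 32, 8, 16
tokens = rng.normal(size=(vocab_size, dim))   # random token embeddings

def sample_sequence(in_context_rep=0, zipf_skew=0.0):
    """One training sequence: seq_len tokens, one marked as 'relevant'.
    in_context_rep: extra copies of the relevant token inside the context
                    (in-context repetition).
    zipf_skew: >0 skews which tokens are drawn across samples
               (cross-sample repetition)."""
    if zipf_skew > 0:
        p = 1.0 / np.arange(1, vocab_size + 1) ** zipf_skew
        p /= p.sum()
        ids = rng.choice(vocab_size, size=seq_len, p=p)
    else:
        ids = rng.choice(vocab_size, size=seq_len)
    rel = rng.integers(seq_len)               # position of the relevant token
    for _ in range(in_context_rep):           # duplicate it elsewhere
        ids[rng.integers(seq_len)] = ids[rel]
    return tokens[ids], rel

seq, rel = sample_sequence(in_context_rep=2, zipf_skew=1.0)
```

Setting `in_context_rep > 0` duplicates the relevant token within a sequence, while `zipf_skew > 0` makes some tokens reappear across examples: the two knobs whose effects on emergence speed the tweet above describes.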
@NicolasZucchet
Nicolas Zucchet
2 months
RT @tyler_m_john: I really like this new op ed from @DavidDuvenaud on how so many different kinds of pressures could drive towards loss of…
@NicolasZucchet
Nicolas Zucchet
2 months
RT @AndrewLampinen: How do language models generalize from information they learn in-context vs. via finetuning? We show that in-context le…
@NicolasZucchet
Nicolas Zucchet
2 months
RT @orvieto_antonio: This is just a reminder for your NeurIPS experiments: if you are comparing architectures, optimizers, or whatever at a…
@NicolasZucchet
Nicolas Zucchet
3 months
RT @sohamde_: Our new paper sheds light on the process of knowledge acquisition in language models, with implications for:
- data curricula…
@NicolasZucchet
Nicolas Zucchet
3 months
RT @K_Ishi_AI: From Google DeepMind, a paper elucidating how LLMs acquire knowledge. Early in LLM training there is a plateau period where knowledge acquisition stalls. In fact, during this period the model focuses on specific elements and establishes efficient attention patterns for acquiring knowledge. Then, rapid kn…
@NicolasZucchet
Nicolas Zucchet
3 months
Thanks to my co-authors Jörg Bornschein, @scychan_brains, @AndrewLampinen, Razvan Pascanu, and @sohamde_. I couldn't have dreamed of a better team for this collaboration! Check out the full paper for all the technical details.
@NicolasZucchet
Nicolas Zucchet
3 months
Our work suggests practical LLM training strategies:
1. Use synthetic data early, as plateau-phase data isn't retained anyway.
2. Implement dynamic data schedulers that use low diversity during plateaus and high diversity afterward (similar to how we learn as infants!).
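Strategy 2 could be sketched as a loss-plateau-triggered diversity schedule; the plateau detector, thresholds, and interface below are hypothetical illustrations, not the paper's method:

```python
def diversity_schedule(loss_history, window=100, rel_improvement=0.01,
                       low=0.2, high=1.0):
    """Return a data-diversity level in [low, high].
    If the loss improved by less than `rel_improvement` (relative) over the
    last `window` steps, assume a plateau and reduce diversity; otherwise
    train on fully diverse data. All thresholds are illustrative."""
    if len(loss_history) < window:
        return high
    past, now = loss_history[-window], loss_history[-1]
    plateauing = (past - now) < rel_improvement * past
    return low if plateauing else high

# Flat losses -> plateau detected -> low diversity
print(diversity_schedule([1.0] * 200))
# Steadily falling losses -> high diversity
print(diversity_schedule([1.0 - 0.004 * i for i in range(200)]))
```

A training loop would call this each step and use the returned level to mix a narrow, repetitive data stream with the full corpus: low diversity to escape the plateau, high diversity afterward to retain generality.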
@NicolasZucchet
Nicolas Zucchet
3 months
Hallucinations emerge with knowledge. As models learn facts about seen individuals, they also make overconfident predictions about unseen ones. On top of that, fine-tuning struggles to add new knowledge: existing memories are quickly corrupted when learning new ones.