Wojtek Masarczyk Profile
Wojtek Masarczyk

@e7mul

Followers: 72 · Following: 49 · Media: 9 · Statuses: 1K

Usually teaching neural networks continually. Sometimes PhD Student @ Warsaw Uni of Technology.

Poland
Joined November 2016
@e7mul
Wojtek Masarczyk
2 months
1/7 If Andrew Ng is right that the LR is the most important ML hyperparam, it's got some competition! We show that the softmax temperature is a game-changer in crafting NN representations. Often overlooked, it quietly governs generalization, collapse, and compression. A thread 👇
[image] · 2 replies · 12 reposts · 13 likes
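For reference, the knob in question is the T in softmax(z/T). A minimal NumPy sketch of temperature-scaled softmax (illustrative only, not the paper's code):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: exp(z/T) / sum(exp(z/T))."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()              # shift by max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits, temperature=0.5))  # low T: sharper, more confident
print(softmax(logits, temperature=5.0))  # high T: flatter, near-uniform
```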
@e7mul
Wojtek Masarczyk
2 months
I hope you've found this thread helpful. Follow me @e7mul for more. Like/Repost the quote below if you can:
[quoted tweet: the thread's opening post (1/7), shown above]
@e7mul
Wojtek Masarczyk
2 months
Want the deep dive? Check the paper on arXiv: 2506.01562. Moral of the story? Stop tuning LR first; experiment with temperature today. And if you've seen temp save (or ruin) your model, share below! 👇
[image] · 1 reply · 1 repost · 2 likes
@e7mul
Wojtek Masarczyk
2 months
Huge thanks to the fantastic team for this collaborative effort: @MatOstasze, @AurelienLucchi, @tscheng516, @tomasztrzcinsk1, and Razvan Pascanu! Also, thanks to @EhsanImanii and @PiotrRMilos for laying the foundation for this work!
1 reply · 0 reposts · 2 likes
@e7mul
Wojtek Masarczyk
2 months
7/7 Maybe you're not Yann LeCun and you don't care about collapse. Hear this! Want better OOD generalization? Train with low temp & avoid collapse. Boost OOD detection? Raise the temp and maximize the collapse! These tasks are at odds, and temperature gives you a control knob!
[image] · 1 reply · 0 reposts · 0 likes
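Concretely, the knob is one scalar in the training loss. A minimal sketch, assuming temperature is applied to the logits inside cross-entropy (function name and setup are illustrative, not the paper's recipe):

```python
import numpy as np

def ce_with_temperature(logits, labels, T):
    """Mean cross-entropy computed on temperature-scaled logits z/T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)              # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 10))                     # 8 samples, 10 classes
labels = rng.integers(0, 10, size=8)
print(ce_with_temperature(logits, labels, T=0.5))     # low-T regime
print(ce_with_temperature(logits, labels, T=5.0))     # high-T regime
```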
@e7mul
Wojtek Masarczyk
2 months
6/7 Rank deficit bias is an NN's tendency to find correct solutions with rank lower than the number of classes. It breaks our intuitions about NN representations and shows that there are solutions of complexity far lower than predicted by Neural Collapse! 🤯
1 reply · 0 reposts · 0 likes
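One way to spot this bias in your own model: compare the numerical rank of the last hidden layer's features to the number of classes. A sketch with synthetic features standing in for real penultimate-layer activations:

```python
import numpy as np

def numerical_rank(features, tol=1e-3):
    """Number of singular values above tol * (largest singular value)."""
    s = np.linalg.svd(features, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Synthetic stand-in: 1000 samples, 64 dims, but only 6 directions used.
rng = np.random.default_rng(0)
basis = rng.normal(size=(64, 6))
features = rng.normal(size=(1000, 6)) @ basis.T
print(numerical_rank(features))   # 6 -- below the 10 one would expect for 10 classes
```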
@e7mul
Wojtek Masarczyk
2 months
5/7 NNs align singular vectors hierarchically. Top vectors align early, creating a highway for information flow, but the remaining vectors lag behind, leading to representation collapse where only the top directions thrive. This gives rise to a novel phenomenon: rank deficit bias.
1 reply · 0 reposts · 0 likes
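A sketch of how that alignment could be measured: the overlap between one layer's top right singular vectors and the previous layer's top left singular vectors (the shapes and the forced-alignment construction below are illustrative assumptions):

```python
import numpy as np

def top_alignment(W2, W1, k=1):
    """|overlap| between W2's top-k right singular vectors and W1's
    top-k left singular vectors; near 1 means the layers line up."""
    _, _, Vt2 = np.linalg.svd(W2)
    U1, _, _ = np.linalg.svd(W1)
    return np.abs(Vt2[:k] @ U1[:, :k])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 16))
W2 = rng.normal(size=(8, 32))
print(top_alignment(W2, W1))            # random init: small overlap

# Force alignment: make W2's top right singular vector equal
# W1's top left singular vector.
U1, _, _ = np.linalg.svd(W1)
W2_aligned = rng.normal(size=(8, 1)) @ U1[:, :1].T
print(top_alignment(W2_aligned, W1))    # ~1.0
```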
@e7mul
Wojtek Masarczyk
2 months
4/7 NNs find a clever way to boost the product norm of two matrices without increasing the norm of either one. How? By aligning their singular subspaces. Repeated across multiple layers, this unlocks exponential growth of the logits norm! 🚀 But there's a catch.
1 reply · 0 reposts · 0 likes
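This one is easy to reproduce in isolation. A toy sketch (random matrices, spectral norm): rebuilding B so its left singular vectors match A's right singular vectors pushes ||AB|| up to the ||A||·||B|| ceiling while leaving ||B|| unchanged:

```python
import numpy as np

def spec_norm(M):
    return np.linalg.svd(M, compute_uv=False)[0]   # largest singular value

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))
B = rng.normal(size=(64, 64))
print(spec_norm(A @ B), spec_norm(A) * spec_norm(B))   # product falls short

# Rebuild B with A's right singular vectors as its left singular vectors,
# keeping B's singular values (so ||B|| is unchanged).
_, _, Vta = np.linalg.svd(A)
_, sb, Vtb = np.linalg.svd(B)
B_aligned = Vta.T @ np.diag(sb) @ Vtb
print(spec_norm(A @ B_aligned), spec_norm(A) * spec_norm(B))  # now equal
```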
@e7mul
Wojtek Masarczyk
2 months
3/7 High temperature scales down the logits and makes the softmax output almost uniform for each sample -- hard to learn anything if everything looks the same. To break this, NNs find a clever way to increase the logits norm and break the symmetry. Can you guess it? 🧠
1 reply · 0 reposts · 0 likes
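The "everything looks the same" regime in a toy example: as T grows, softmax entropy climbs toward log(C), the value of a uniform distribution over C classes (a sketch, not the paper's code):

```python
import numpy as np

def softmax(z, T):
    z = z / T
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

logits = np.array([3.0, 1.0, 0.2, -1.0])
for T in (0.5, 1.0, 10.0, 100.0):
    print(T, entropy(softmax(logits, T)))  # climbs toward log(4) ≈ 1.386
```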
@e7mul
Wojtek Masarczyk
2 months
2/7 Softmax does more than squish logits into probabilities: it's a true sculptor of representations! ⚒️ The magic lies in the interplay of the logits norm and the softmax temperature, which can enhance or neutralize each other.
1 reply · 0 reposts · 0 likes
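That interplay is literal: softmax only ever sees the ratio z/T, so multiplying the logits norm by c exactly neutralizes multiplying the temperature by c. A quick check:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, -0.5, 1.2])
c = 10.0
# Growing the logits norm by c cancels raising the temperature by c:
print(np.allclose(softmax(z, T=1.0), softmax(c * z, T=c)))   # True
```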
@e7mul
Wojtek Masarczyk
6 months
RT @bartoszcyw: 🔥 New Paper! How can sparse autoencoders (SAEs) applied to diffusion models help us solve real-world challenges? 🚀 Introd…
0 replies · 53 reposts · 0 likes
@e7mul
Wojtek Masarczyk
11 months
RT @NousResearch: What if you could use all the computing power in the world to train a shared, open source AI model? Preliminary report:…
0 replies · 582 reposts · 0 likes
@e7mul
Wojtek Masarczyk
1 year
RT @AurelienLucchi: My group has multiple openings both for PhD and Post-doc positions to work in the area of optimization for ML, and deep…
0 replies · 63 reposts · 0 likes
@e7mul
Wojtek Masarczyk
1 year
RT @IAmTimNguyen: Excited that my new paper on understanding LLMs is out, pushing how far we can describe LLM predictions via simple statis…
0 replies · 178 reposts · 0 likes
@e7mul
Wojtek Masarczyk
1 year
RT @hardmaru: Many people start attacking a problem by deploying the most sophisticated method possible with the belief that it will lead t…
0 replies · 65 reposts · 0 likes
@e7mul
Wojtek Masarczyk
1 year
RT @SebastienBubeck: Every day I witness the AI revolution in action, and every day I see 1 or 2 questions that would deserve an entire PhD…
0 replies · 37 reposts · 0 likes
@e7mul
Wojtek Masarczyk
2 years
RT @docmilanfar: Don't let low-order statistics fool you
0 replies · 1K reposts · 0 likes
@e7mul
Wojtek Masarczyk
2 years
I'll be at #NeurIPS2023 ✈️ from Mon until Sat. Happy to chat about:
- repr. learning & its impact on continual learning
- all the bold ideas about why these overparametrized models generalize at all 😅
Find me around my posters (Thu & Fri) and during @unireps!
0 replies · 3 reposts · 9 likes
@e7mul
Wojtek Masarczyk
2 years
Today I'll present the Tunnel Effect paper at the RL Sofa at MILA, 3 PM (EST). Tune in to find out whether there is a light at the end of every tunnel, and how to use it in your favor.
[image] · 0 replies · 2 reposts · 13 likes
@e7mul
Wojtek Masarczyk
2 years
RT @OwainEvans_UK: Does a language model trained on “A is B” generalize to “B is A”? E.g. when trained only on “George Washington was the f…
0 replies · 666 reposts · 0 likes