@TimothyDuignan
Tim Duignan
6 months
@StasBekman This is a phase transition. We see it in molecular simulations, which are directly analogous: the loss plays the role of the energy, and the entropy is -∫ ρ log ρ. Neural nets minimise the free energy, so you can see states with similar free energy but different losses, and the system can jump between them.
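A rough sketch of what "similar free energy but different losses" could look like, not a calculation from the thread. It assumes a 1-D quadratic basin L(x) = min_loss + 0.5 * curvature * x^2 and a fixed noise "temperature" T, both hypothetical, and uses the standard result F = -T log Z for such a basin. A wide basin with a higher minimum loss can then tie a narrow basin with a lower one, because the extra entropy offsets the extra energy.

```python
import math


def basin_free_energy(min_loss, curvature, temperature):
    """Free energy F = -T log Z of a 1-D quadratic basin.

    For L(x) = min_loss + 0.5 * curvature * x**2 the partition function is
    Z = exp(-min_loss / T) * sqrt(2 * pi * T / curvature).
    """
    entropy_term = 0.5 * temperature * math.log(2 * math.pi * temperature / curvature)
    return min_loss - entropy_term


# All numbers below are illustrative assumptions, not values from the thread.
T = 0.1                   # hypothetical noise "temperature"
narrow_curvature = 100.0  # sharp minimum: low entropy
wide_curvature = 1.0      # flat minimum: high entropy

narrow_loss = 0.0
# Loss gap at which the wide basin's entropy exactly offsets its higher loss.
wide_loss = narrow_loss + 0.5 * T * math.log(narrow_curvature / wide_curvature)

f_narrow = basin_free_energy(narrow_loss, narrow_curvature, T)
f_wide = basin_free_energy(wide_loss, wide_curvature, T)

print(f"loss gap:        {wide_loss - narrow_loss:.4f}")
print(f"free energy gap: {f_wide - f_narrow:.2e}")
```

In this toy picture the two states are equally favourable at this T even though their losses differ, so a jump between them would show up as a sudden step in the loss curve.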

Replies

@StasBekman
Stas Bekman
6 months
I'm trying to understand this early grokking phenomenon (the training loss improves in a single burst). If you remember, I shared this grokking moment earlier. I was just trying llama-2-7b again with a totally different dataset, and this time 8x 8xA100…
@StasBekman
Stas Bekman
6 months
@TimothyDuignan Thank you for sharing, Tim!