@StasBekman
This is a phase transition; we see it in molecular simulations, which are directly analogous. Loss ↔ energy; entropy S = -∫ ρ log ρ. Neural nets minimise the free energy F = loss - T·S, so you can see states with similar free energy but different losses, and jump between them.
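A minimal sketch of the analogy (all numbers are illustrative assumptions, not fitted to any real network): model two "states" of a network as parameter distributions ρ over a 1-D axis, take entropy as -∫ ρ log ρ, and pick an effective temperature T. The two states then have clearly different losses but nearly equal free energy F = loss - T·S, which is the condition under which a jump between them is plausible.

```python
import numpy as np

# Discretized parameter axis for the toy distributions.
x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]

def gaussian(mu, sigma):
    """A normalized density rho on the grid (∫ rho dx = 1)."""
    rho = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    return rho / (rho.sum() * dx)

def entropy(rho):
    """Differential entropy S = -∫ rho log rho dx, via a Riemann sum."""
    return -np.sum(rho * np.log(rho + 1e-12)) * dx

T = 1.0  # effective "temperature" (assumed, sets the loss/entropy trade-off)

# State A: lower loss, but a narrow (low-entropy) basin.
loss_A, rho_A = 1.0, gaussian(0.0, 0.3)
# State B: higher loss, but a wide (high-entropy) basin.
loss_B, rho_B = 1.5, gaussian(2.0, 0.5)

F_A = loss_A - T * entropy(rho_A)
F_B = loss_B - T * entropy(rho_B)

print(f"F_A = {F_A:.3f}, F_B = {F_B:.3f}")  # nearly equal free energies
```

The losses differ by 0.5, yet the free energies come out within a few hundredths of each other because B's extra entropy compensates: two basins of similar F, different L.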
I'm trying to understand this early grokking phenomenon (the training loss improves in a single burst).
If you remember, earlier I shared this grokking moment.
I was just trying llama-2-7b again with a totally different dataset, and this time on 8x 8xA100…