Satoki I @ ICML

@SisForCollege

Followers: 488 · Following: 2K · Media: 46 · Statuses: 650

TokyoTech 25D Dept. of Computer Science | R.Yokota lab | DNN optimization site: https://t.co/3NoUYlliTa

Joined August 2018
@SisForCollege
Satoki I @ ICML
2 days
RT @soumithchintala: considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by de….
0
56
0
@SisForCollege
Satoki I @ ICML
23 days
I found a very interesting μP paper on the embedding LR. They propose a new embedding LR scaling for the case where the vocab size is much larger than the width.
0
12
65
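For reference, a minimal sketch of the standard muP-style per-layer Adam learning rates (Tensor Programs V convention). The vocab-size-dependent embedding rule proposed in the paper above is not reproduced here; the function name and the widths are illustrative.

```python
# Minimal sketch of standard muP-style per-layer Adam learning rates.
# NOTE: the vocab-dependent embedding rule from the paper mentioned in the
# tweet is NOT reproduced here; `mup_adam_lrs`, `width`, and `base_width`
# are illustrative names, not from the tweet or the paper.

def mup_adam_lrs(base_lr, width, base_width):
    ratio = base_width / width
    return {
        "embedding": base_lr,          # input/embedding LR stays O(1) in width
        "hidden":    base_lr * ratio,  # hidden-matrix LR shrinks like 1/width
        "readout":   base_lr * ratio,  # output-head LR also shrinks like 1/width
    }

print(mup_adam_lrs(base_lr=3e-4, width=4096, base_width=256))
```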
@SisForCollege
Satoki I @ ICML
28 days
RT @SN_INGE: BREAKING NEWS. Congratulations to Professor Shun-ichi Amari! 2025 Kyoto Prize Laureates.
0
170
0
@SisForCollege
Satoki I @ ICML
1 month
The technical paper for Gemini 2.5 mentions improvements in “signal propagation” and “optimization dynamics.” Those terms make it sound like theoretical insights have been applied, and if so, I’d be very curious to learn exactly what those insights are.
[image attached]
0
1
14
@SisForCollege
Satoki I @ ICML
2 months
RT @anilkseth: 1/3 @geoffreyhinton once said that the future depends on some graduate student being suspicious of everything he says (via @….
0
109
0
@SisForCollege
Satoki I @ ICML
2 months
A wonderful recording of Chopin's Mazurkas that conveys the very atmosphere of an Afanassiev concert. I'd love to hear this free, meditative playing again inside Afanassiev's gravitational field, but I suppose another visit to Japan is no longer in the cards...
0
0
2
@SisForCollege
Satoki I @ ICML
2 months
In this paper, they experiment with the combination of Muon and muP, but since Muon is mathematically equivalent to Shampoo, the muP for Muon should correspond to the muP of Shampoo in our paper ("muP for Shampoo").
0
0
7
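One way to see the equivalence referred to above, as a sketch: assuming a full-rank gradient matrix and single-step (accumulation-free) Shampoo preconditioners, Shampoo's update reduces to the orthogonalized update that Muon approximates with Newton–Schulz.

```latex
% Sketch: with the SVD G = U \Sigma V^{\top} of the gradient matrix (full rank
% assumed), accumulation-free Shampoo yields the orthogonal polar factor
% U V^{\top}, i.e. exactly the update Muon approximates via Newton--Schulz.
\[
  (G G^{\top})^{-1/4} \, G \, (G^{\top} G)^{-1/4}
  = (U \Sigma^{2} U^{\top})^{-1/4} \, U \Sigma V^{\top} \, (V \Sigma^{2} V^{\top})^{-1/4}
  = U \Sigma^{-1/2} \, \Sigma \, \Sigma^{-1/2} V^{\top}
  = U V^{\top}.
\]
```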
@SisForCollege
Satoki I @ ICML
3 months
2️⃣ Poster #608
🕒 Sat, Apr 26 • 15:00 – 17:30
📍 Location: #608
🔍 Title: PhiNets: Brain-inspired Non-contrastive Learning Based on Temporal Prediction Hypothesis
#ICLR2025
0
1
0
@SisForCollege
Satoki I @ ICML
3 months
Today, I'll be presenting 2 posters at #ICLR2025:
1️⃣ Poster #135
🕙 Sat, Apr 26 • 10:00 – 12:30
📍 Location: #135
🔬 Title: Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation
1
1
6
@SisForCollege
Satoki I @ ICML
3 months
RT @myamada0: We’re happy to share that members of our MLDS unit (@OISTedu) will present several papers at #ICLR2025! Topics include brain….
0
5
0
@SisForCollege
Satoki I @ ICML
3 months
RT @iclr_conf: Test of Time Winner. Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba. Adam revolutionized neural net….
0
49
0
@SisForCollege
Satoki I @ ICML
3 months
1. Whitening and second-order optimization.
2. Whitening for data that follow Zipf’s law.
0
1
5
@SisForCollege
Satoki I @ ICML
3 months
Second-order optimization has an effect similar to data whitening. How would this apply to GPT? Data following Zipf’s law are far from whitened. If the faster convergence of Adam or Muon is due to unwhitened data, it might be worth comparing optimization when data is whitened.
2
2
13
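To make the whitening comparison suggested above concrete, here is a minimal ZCA-whitening sketch. It is illustrative only, not the experiment proposed in the tweet; `zca_whiten` is a hypothetical helper.

```python
# Minimal ZCA-whitening sketch: transform the data so its empirical
# covariance becomes (approximately) the identity. Illustrative only.
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten rows of X so that the empirical covariance is ~identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

X = np.random.randn(1000, 16) @ np.random.randn(16, 16)   # correlated features
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))               # ~ identity matrix
```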
@SisForCollege
Satoki I @ ICML
4 months
Since gradients are stored in E5M2, using an E5M2 matmul in Newton-Schulz might stabilize training. However, E5M2 matmul isn’t supported, as far as I know. We might need a richer optimizer, one that accepts some overhead while operating in low precision like fp8 or fp4.
0
0
6
@SisForCollege
Satoki I @ ICML
4 months
In Keller's modded-GPT Muon, FP8 is used for both the forward and backward passes, but not in Newton-Schulz. Training diverges when we use an E4M3 matmul in Newton-Schulz. Keeping Newton-Schulz in bf16 may limit further acceleration, but is it really impossible to lower the precision further?
[image attached]
5
8
87
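For context, a minimal sketch of Newton-Schulz orthogonalization in bf16. The Muon implementation uses a tuned quintic variant; this is the plain cubic iteration, and the marked matmul is the spot where one might try a lower-precision format, per the discussion above.

```python
# Minimal sketch: cubic Newton-Schulz iteration in bf16, converging to the
# orthogonal polar factor U V^T of G. Not the Muon quintic variant; the
# matmul marked below is a candidate for experimenting with fp8.
import torch

def newton_schulz_orth(G, steps=20):
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + 1e-7)           # Frobenius norm >= spectral norm, so ||X||_2 <= 1
    for _ in range(steps):
        A = X @ X.mT                    # candidate spot for a lower-precision matmul
        X = 1.5 * X - 0.5 * (A @ X)     # cubic iteration: f(sigma) = 1.5*sigma - 0.5*sigma^3
    return X

G = torch.randn(512, 256)
O = newton_schulz_orth(G)
print((O.float().T @ O.float()).diagonal()[:5])  # ~1 once the iteration has converged
```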
@SisForCollege
Satoki I @ ICML
4 months
RT @andrewgwils: Good research is mostly about knowing what questions to ask, not about answering questions that other people are asking.
0
50
0
@SisForCollege
Satoki I @ ICML
4 months
RT @andrewgwils: My new paper "Deep Learning is Not So Mysterious or Different": Generalization behaviours in deep….
0
299
0
@SisForCollege
Satoki I @ ICML
5 months
What does he mean by "training details" here? How much detail does he intend? Everyone adjusts the learning rate, but few scale the embedding and layer-norm learning rates larger than those of other layers, as muP prescribes. Also, few scale ε correctly. Personally, I think it’s these kinds of details.
0
0
0
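As one concrete illustration of "these kinds of details" (an assumption for illustration, not a prescription from the tweet): per-layer Adam groups where embedding and LayerNorm parameters keep a width-independent learning rate while matrix-like weights get a 1/width-scaled learning rate and a correspondingly scaled ε. The module names (`emb`, `norm`, `proj`) and the scaling ratio are hypothetical.

```python
# Illustrative sketch: per-layer AdamW groups with a larger (width-independent)
# LR for embedding/LayerNorm parameters and a 1/width-scaled LR and eps for
# matrix-like weights. Scaling factors and module names are assumptions.
import torch

class TinyLM(torch.nn.Module):
    def __init__(self, vocab=1000, width=512):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, width)
        self.norm = torch.nn.LayerNorm(width)
        self.proj = torch.nn.Linear(width, width)

def build_param_groups(model, base_lr, width, base_width):
    ratio = base_width / width
    emb_ln, matrices = [], []
    for name, p in model.named_parameters():
        # vectors (biases, norms) and embeddings go in the width-independent group
        (emb_ln if ("emb" in name or "norm" in name or p.ndim == 1) else matrices).append(p)
    return [
        {"params": emb_ln,   "lr": base_lr,         "eps": 1e-8},
        {"params": matrices, "lr": base_lr * ratio, "eps": 1e-8 * ratio},
    ]

model = TinyLM(width=512)
opt = torch.optim.AdamW(build_param_groups(model, base_lr=3e-4, width=512, base_width=256))
```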
@SisForCollege
Satoki I @ ICML
5 months
What does "the science of scaling" refer to here? What algorithms are they using? Based on Igor's past papers, muP seems a likely candidate for the science of scaling. However, leaks suggest OAI also uses muP. I'm curious about what they mean by it.
@ibab
Igor Babuschkin
5 months
@GavinSBaker Algorithms are also important, and they become increasingly important as the model size increases. I suspect the main reason people haven’t been able to train a better model than Grok 3 is that they didn’t get all the details of the training right. There is a huge amount of…
2
0
1
@SisForCollege
Satoki I @ ICML
5 months
It's nice to see top researchers gathering at ISM, which happens to be close to my home (though I'm not affiliated with ISM) 🤣.
@levelfour_
HB
5 months
Officially I moved to the Institute of Statistical Mathematics as an associate professor. This would not have been possible without the support of my collaborators. I would like to thank all of them and further develop ML theory (+ more!) in the new environment.
[image attached]
1
0
5