Satoki I @ ICML

@SisForCollege

Followers: 488 · Following: 2K · Media: 46 · Statuses: 650

TokyoTech 25D Dept. of Computer Science | R.Yokota lab | DNN optimization site: https://t.co/3NoUYlliTa

Joined August 2018
@SisForCollege
Satoki I @ ICML
2 days
RT @soumithchintala: considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by de….
0
56
0
@SisForCollege
Satoki I @ ICML
23 days
I found a very interesting μP paper on the embedding LR. They propose a new embedding LR scaling for the case where the vocab size is much larger than the width.
0
12
65
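For reference, a minimal sketch of the standard muP-style per-layer Adam learning rates (Tensor Programs V convention). The vocab-size-dependent embedding rule proposed in the paper above is not reproduced here; the function name and the widths are illustrative.

```python
# Minimal sketch of standard muP-style per-layer Adam learning rates.
# NOTE: the vocab-dependent embedding rule from the paper mentioned in the
# tweet is NOT reproduced here; `mup_adam_lrs`, `width`, and `base_width`
# are illustrative names, not from the tweet or the paper.

def mup_adam_lrs(base_lr, width, base_width):
    ratio = base_width / width
    return {
        "embedding": base_lr,          # input/embedding LR stays O(1) in width
        "hidden":    base_lr * ratio,  # hidden-matrix LR shrinks like 1/width
        "readout":   base_lr * ratio,  # output-head LR also shrinks like 1/width
    }

print(mup_adam_lrs(base_lr=3e-4, width=4096, base_width=256))
```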
@SisForCollege
Satoki I @ ICML
28 days
RT @SN_INGE: BREAKING NEWS. Congratulations to Professor Shun-ichi Amari! 2025 Kyoto Prize Laureates.
0
170
0
@SisForCollege
Satoki I @ ICML
1 month
The technical paper for Gemini 2.5 mentions improvements in “signal propagation” and “optimization dynamics.” Those terms make it sound like theoretical insights have been applied, and if so, I’d be very curious to learn exactly what those insights are.
[image attached]
0
1
14
@SisForCollege
Satoki I @ ICML
2 months
RT @anilkseth: 1/3 @geoffreyhinton once said that the future depends on some graduate student being suspicious of everything he says (via @….
0
109
0
@SisForCollege
Satoki I @ ICML
2 months
A wonderful recording of Chopin's Mazurkas that conveys the very atmosphere of an Afanassiev concert. I'd love to hear this free, meditative playing again inside Afanassiev's gravitational field, but I suppose another visit to Japan is no longer in the cards...
0
0
2
@SisForCollege
Satoki I @ ICML
2 months
In this paper, they experiment with the combination of Muon and muP, but since Muon is mathematically equivalent to Shampoo, the muP for Muon should correspond to the muP of Shampoo in our paper ("muP for Shampoo").
0
0
7
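One way to see the equivalence referred to above, as a sketch: assuming a full-rank gradient matrix and single-step (accumulation-free) Shampoo preconditioners, Shampoo's update reduces to the orthogonalized update that Muon approximates with Newton–Schulz.

```latex
% Sketch: with the SVD G = U \Sigma V^{\top} of the gradient matrix (full rank
% assumed), accumulation-free Shampoo yields the orthogonal polar factor
% U V^{\top}, i.e. exactly the update Muon approximates via Newton--Schulz.
\[
  (G G^{\top})^{-1/4} \, G \, (G^{\top} G)^{-1/4}
  = (U \Sigma^{2} U^{\top})^{-1/4} \, U \Sigma V^{\top} \, (V \Sigma^{2} V^{\top})^{-1/4}
  = U \Sigma^{-1/2} \, \Sigma \, \Sigma^{-1/2} V^{\top}
  = U V^{\top}.
\]
```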
@SisForCollege
Satoki I @ ICML
3 months
2️⃣ Poster #608
🕒 Sat, Apr 26 • 15:00 – 17:30
📍 Location: #608
🔍 Title: PhiNets: Brain-inspired Non-contrastive Learning Based on Temporal Prediction Hypothesis
#ICLR2025
0
1
0
@SisForCollege
Satoki I @ ICML
3 months
Today, I'll be presenting 2 posters at #ICLR2025:
1️⃣ Poster #135
🕙 Sat, Apr 26 • 10:00 – 12:30
📍 Location: #135
🔬 Title: Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation
1
1
6
@SisForCollege
Satoki I @ ICML
3 months
RT @myamada0: We’re happy to share that members of our MLDS unit (@OISTedu) will present several papers at #ICLR2025! Topics include brain….
0
5
0
@SisForCollege
Satoki I @ ICML
3 months
RT @iclr_conf: Test of Time Winner. Adam: A Method for Stochastic Optimization. Diederik P. Kingma, Jimmy Ba. Adam revolutionized neural net….
0
49
0
@SisForCollege
Satoki I @ ICML
3 months
1. Whitening and second-order optimization.
2. Whitening for data that follow Zipf’s law.
0
1
5
@SisForCollege
Satoki I @ ICML
3 months
Second-order optimization has an effect similar to data whitening. How would this apply to GPT? Data following Zipf’s law are far from whitened. If the faster convergence of Adam or Muon is due to unwhitened data, it might be worth comparing optimization when data is whitened.
2
2
13
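To make the whitening comparison suggested above concrete, here is a minimal ZCA-whitening sketch. It is illustrative only, not the experiment proposed in the tweet; `zca_whiten` is a hypothetical helper.

```python
# Minimal ZCA-whitening sketch: transform the data so its empirical
# covariance becomes (approximately) the identity. Illustrative only.
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten rows of X so that the empirical covariance is ~identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

X = np.random.randn(1000, 16) @ np.random.randn(16, 16)   # correlated features
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))               # ~ identity matrix
```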
@SisForCollege
Satoki I @ ICML
4 months
Since gradients are stored in E5M2, using an E5M2 matmul in Newton-Schulz might stabilize training. However, E5M2 matmul isn’t supported, as far as I know. We might need a richer optimizer, one that accepts some overhead while operating in low precision like fp8 or fp4.
0
0
6
@SisForCollege
Satoki I @ ICML
4 months
In Keller's modded-GPT Muon, FP8 is used for both the forward and backward passes, but not in Newton-Schulz. Training diverges when we use an E4M3 matmul in Newton-Schulz. Keeping Newton-Schulz in bf16 may limit further acceleration, but is it really impossible to lower the precision further?
[image attached]
5
8
87
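For context, a minimal sketch of Newton-Schulz orthogonalization in bf16. The Muon implementation uses a tuned quintic variant; this is the plain cubic iteration, and the marked matmul is the spot where one might try a lower-precision format, per the discussion above.

```python
# Minimal sketch: cubic Newton-Schulz iteration in bf16, converging to the
# orthogonal polar factor U V^T of G. Not the Muon quintic variant; the
# matmul marked below is a candidate for experimenting with fp8.
import torch

def newton_schulz_orth(G, steps=20):
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + 1e-7)           # Frobenius norm >= spectral norm, so ||X||_2 <= 1
    for _ in range(steps):
        A = X @ X.mT                    # candidate spot for a lower-precision matmul
        X = 1.5 * X - 0.5 * (A @ X)     # cubic iteration: f(sigma) = 1.5*sigma - 0.5*sigma^3
    return X

G = torch.randn(512, 256)
O = newton_schulz_orth(G)
print((O.float().T @ O.float()).diagonal()[:5])  # ~1 once the iteration has converged
```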
@SisForCollege
Satoki I @ ICML
4 months
RT @andrewgwils: Good research is mostly about knowing what questions to ask, not about answering questions that other people are asking.
0
50
0
@SisForCollege
Satoki I @ ICML
4 months
RT @andrewgwils: My new paper "Deep Learning is Not So Mysterious or Different": Generalization behaviours in deep….
0
299
0
@SisForCollege
Satoki I @ ICML
5 months
What does he mean by "training details" here? How much detail does he intend? Everyone adjusts the learning rate, but few scale the embedding and layer-norm learning rates larger than those of other layers, as muP prescribes. Also, few scale ε correctly. Personally, I think it’s these kinds of details.
0
0
0
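As one concrete illustration of "these kinds of details" (an assumption for illustration, not a prescription from the tweet): per-layer Adam groups where embedding and LayerNorm parameters keep a width-independent learning rate while matrix-like weights get a 1/width-scaled learning rate and a correspondingly scaled ε. The module names (`emb`, `norm`, `proj`) and the scaling ratio are hypothetical.

```python
# Illustrative sketch: per-layer AdamW groups with a larger (width-independent)
# LR for embedding/LayerNorm parameters and a 1/width-scaled LR and eps for
# matrix-like weights. Scaling factors and module names are assumptions.
import torch

class TinyLM(torch.nn.Module):
    def __init__(self, vocab=1000, width=512):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, width)
        self.norm = torch.nn.LayerNorm(width)
        self.proj = torch.nn.Linear(width, width)

def build_param_groups(model, base_lr, width, base_width):
    ratio = base_width / width
    emb_ln, matrices = [], []
    for name, p in model.named_parameters():
        # vectors (biases, norms) and embeddings go in the width-independent group
        (emb_ln if ("emb" in name or "norm" in name or p.ndim == 1) else matrices).append(p)
    return [
        {"params": emb_ln,   "lr": base_lr,         "eps": 1e-8},
        {"params": matrices, "lr": base_lr * ratio, "eps": 1e-8 * ratio},
    ]

model = TinyLM(width=512)
opt = torch.optim.AdamW(build_param_groups(model, base_lr=3e-4, width=512, base_width=256))
```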
@SisForCollege
Satoki I @ ICML
5 months
What does "the science of scaling" refer to here? What algorithms are they using? Based on Igor's past papers, muP seems a likely candidate for the science of scaling. However, leaks suggest OAI also uses muP. I'm curious about what they mean by it.
@ibab
Igor Babuschkin
5 months
@GavinSBaker Algorithms are also important, and they become increasingly important as the model size increases. I suspect the main reason people haven’t been able to train a better model than Grok 3 is that they didn’t get all the details of the training right. There is a huge amount of…
2
0
1
@SisForCollege
Satoki I @ ICML
5 months
It's nice to see top researchers gathering at ISM, which happens to be close to my home (though I'm not affiliated with ISM) 🤣.
@levelfour_
HB
5 months
Officially I moved to the Institute of Statistical Mathematics as an associate professor. This would not have been possible without the support of my collaborators. I would like to thank all of them and further develop ML theory (+ more!) in the new environment.
[image attached]
1
0
5