Satoki Ishikawa
@SisForCollege
Followers
523
Following
3K
Media
47
Statuses
660
TokyoTech 25D Dept. of Computer Science | R.Yokota lab | DNN optimization. Looking for great research collaborations. site: https://t.co/3NoUYlliTa
Joined August 2018
Bach is so timeless because he wasn't writing for people, he was writing for a higher power. Try writing your next paper for God. Imagine how many rubbish papers we wouldn't see anymore. Your audience sees your every thought and intention. There would be no ego, no pretense.
5
11
186
I can accept that the max LR transfers well with μP. However, the optimal LR seems far more complex. It's influenced by many other factors, such as finding a rate that avoids "forgetting" or instability. Of course, alignment between vectors would also be important...
1
0
1
One thing I've been wondering about HP transfer in μP is what criterion they're using to define "transfer." For instance, TP4 seems to state that the max LR (the maximum LR that doesn't diverge) transfers. But then, TP5 claims that the optimal LR transfers. Which is correct?
Proof by picture of why LR convergence is not useful unless it is fast relative to loss/predictions. Credit to Nikhil Ghosh, Denny Wu, and Alberto for studying this and being critical of the muP series of conclusions and overclaims.
1
1
1
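As context for the transfer question in the posts above, here is a rough summary of the standard μP prescriptions for Adam at hidden width $n$, as I recall them from TP5; the exact table depends on how multipliers are split between initialization and learning rate, so treat this as a sketch rather than the definitive parameterization:

\[
\eta_{\text{embedding/input}} = \Theta(1), \qquad
\eta_{\text{hidden}} = \Theta\!\left(\frac{1}{n}\right), \qquad
\eta_{\text{output}} = \Theta\!\left(\frac{1}{n}\right),
\]

with the output logits additionally multiplied by $1/n$. "Transfer" then means that a base LR swept at a small proxy width remains (near-)optimal when reused with these width scalings at the target width, which is exactly where the max-LR vs. optimal-LR distinction matters.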
I'm updating awesome-second-order-optimization. If you find important / interesting papers not cited in this repository, please let me know. https://t.co/1TqAt5mtii
github.com · riverstone496/awesome-second-order-optimization
2
3
12
My proposal was accepted to the ACT-X program "Innovations in Mathematical and Information Sciences for Building Next-Generation AI." I will keep working toward a deeper understanding of neural network optimization 😁
3
4
62
I won’t make it to ICML this year, but our work will be presented at the 2nd AI for Math Workshop @ ICML 2025 (@ai4mathworkshop). Huge thanks to my co‑author @SisForCollege for presenting on my behalf. Please drop by if you’re around!
1
8
48
considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... https://t.co/ev2J8hgf3a
34
59
854
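For anyone unfamiliar with what such a PR would implement: below is a minimal, unofficial sketch of the Muon update as it is usually described (heavy-ball momentum on 2-D weight gradients, then an approximate orthogonalization via a Newton–Schulz iteration). The iteration coefficients, the scale factor, and the hyperparameter defaults here are assumptions, not the PyTorch API.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the orthogonal polar factor U V^T of a 2-D matrix g.

    The quintic coefficients below are the ones commonly quoted for Muon;
    treat them as an assumption rather than a canonical choice.
    """
    assert g.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)              # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                           # work in the "wide" orientation
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(params, momenta, lr=0.02, beta=0.95):
    """One Muon-style step over a list of 2-D weight tensors (sketch only)."""
    for p, buf in zip(params, momenta):
        buf.mul_(beta).add_(p.grad)       # momentum on the raw gradient
        update = newton_schulz_orthogonalize(buf)
        # scale so the per-row update magnitude is roughly shape-independent (assumed convention)
        scale = max(1.0, p.shape[0] / p.shape[1]) ** 0.5
        p.add_(update, alpha=-lr * scale)

# toy usage
w = torch.nn.Parameter(torch.randn(64, 32) / 32**0.5)
w.grad = torch.randn_like(w)
muon_step([w], [torch.zeros_like(w)])
```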
I found a very interesting μP paper on the embedding LR. They propose a new embedding LR scaling for when the vocab size is much larger than the width. https://t.co/S1I2Eb1DGK
0
12
67
BREAKING NEWS: Congratulations to Professor Shun-ichi Amari! 2025 Kyoto Prize Laureates: https://t.co/4fOpidHtO1
0
163
471
The technical paper for Gemini 2.5 mentions improvements in “signal propagation” and “optimization dynamics.” Those terms make it sound like theoretical insights have been applied, and if so, I’d be very curious to learn exactly what those insights are. https://t.co/M9Q7GLIZdu
0
1
14
1/3 @geoffreyhinton once said that the future depends on some graduate student being suspicious of everything he says (via @lexfridman). He also said that it was impossible to find biologically plausible approaches to backprop that scale well: https://t.co/pbMiB8Qgis.
17
109
981
A wonderful Chopin Mazurkas recording that conveys the very atmosphere of Afanassiev's concerts. I'd love to hear this free, meditative playing inside that Afanassiev gravitational field again, but I suppose another visit to Japan may no longer be possible... https://t.co/lwuxJsRof6
0
0
2
In this paper, they experiment with the combination of Muon and μP, but since Muon is mathematically equivalent to Shampoo, the μP for Muon should correspond to the μP for Shampoo in our paper. https://t.co/TeyHIvyc87 μP for Shampoo: https://t.co/w0IMoNWXC7
0
0
7
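To spell out the "mathematically equivalent" claim in one line, under the assumption that it refers to Shampoo with exponent $1/4$, no preconditioner accumulation, and no grafting: writing the thin SVD $G = U \Sigma V^\top$ of the gradient, the Shampoo update collapses to the orthogonal polar factor that Muon approximates via Newton–Schulz,

\[
(G G^\top)^{-1/4}\, G\, (G^\top G)^{-1/4}
= U \Sigma^{-1/2} U^\top \, U \Sigma V^\top \, V \Sigma^{-1/2} V^\top
= U V^\top .
\]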
2️⃣ Poster #608 🕒 Sat, Apr 26 • 15:00 – 17:30 📍 Location: #608 🔍 Title: PhiNets: Brain-inspired Non-contrastive Learning Based on Temporal Prediction Hypothesis https://t.co/nFPukrpIKw
#ICLR2025
0
1
0
Today, I'll be presenting 2 posters at #ICLR2025. 1️⃣ Poster #135 🕙 Sat, Apr 26 • 10:00 – 12:30 📍 Location: #135 🔬 Title: Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation https://t.co/CbryYEMayC
1
1
6
Test of Time Winner: "Adam: A Method for Stochastic Optimization" by Diederik P. Kingma and Jimmy Ba. Adam revolutionized neural network training, enabling significantly faster convergence and more stable training across a wide variety of architectures and tasks.
4
48
525
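For reference, the update from the awarded paper, with gradient $g_t$, moment estimates $m_t, v_t$, and step size $\alpha$:

\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
\]
\[
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.
\]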
1. Whitening and second-order optimization: https://t.co/nDDySqJUDG
2. Whitening for data that follow Zipf’s law: https://t.co/P4ZYjMyw0M
0
1
5
Second-order optimization has an effect similar to data whitening. How would this apply to GPT? Data following Zipf’s law are far from whitened. If the faster convergence of Adam or Muon is due to unwhitened data, it might be worth comparing optimization when data is whitened.
2
2
13
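A tiny numerical check of the whitening/second-order analogy in the simplest setting (linear least squares; all variable names here are hypothetical): one Gauss–Newton step on the raw, badly scaled inputs matches one plain gradient step taken on whitened inputs and mapped back. This only illustrates the analogy; it says nothing about GPT-scale training.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
# badly scaled (far-from-whitened) features
X = rng.normal(size=(n, d)) @ np.diag([10.0, 5.0, 1.0, 0.5, 0.1])
y = rng.normal(size=n)
w = np.zeros(d)
lr = 0.1

# loss L(w) = 0.5 * ||X w - y||^2
grad = X.T @ (X @ w - y)
C = X.T @ X                                   # input second-moment matrix

# (1) one Gauss-Newton ("second-order") step on the raw data
w_gn = w - lr * np.linalg.solve(C, grad)

# (2) one plain gradient step on whitened inputs, mapped back to original coordinates
evals, evecs = np.linalg.eigh(C)
C_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
X_white = X @ C_inv_sqrt                      # now X_white.T @ X_white ≈ I
w_tilde = np.zeros(d)                         # w_tilde = C^{1/2} w = 0 at init
grad_white = X_white.T @ (X_white @ w_tilde - y)
w_sgd_white = C_inv_sqrt @ (w_tilde - lr * grad_white)

print(np.allclose(w_gn, w_sgd_white))         # True: the two steps coincide
```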