Satoki Ishikawa

@SisForCollege

Followers 523 | Following 3K | Media 47 | Statuses 660

TokyoTech 25D Dept. of Computer Science | R.Yokota lab | DNN optimization. Looking for great research collaborations. site: https://t.co/3NoUYlliTa

Joined August 2018
@andrewgwils
Andrew Gordon Wilson
8 days
Bach is so timeless because he wasn't writing for people, he was writing for a higher power. Try writing your next paper for God. Imagine how many rubbish papers we wouldn't see anymore. Your audience sees your every thought and intention. There would be no ego, no pretense.
5
11
186
@SisForCollege
Satoki Ishikawa
11 days
I can accept that the max LR transfers well with μP. However, the optimal LR seems far more complex. It's influenced by many other factors, such as finding a rate that avoids "forgetting" or instability. Of course, alignment between vectors would also be important...
1
0
1
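For readers following this thread: a minimal sketch of the per-layer learning-rate scalings usually quoted for μP with Adam-style updates (one common formulation; equivalent variants move the output factor into a multiplier instead). The width-independent constants are the quantities one hopes will transfer across widths, and the open question above is whether the transferred value is the maximal stable one or the loss-optimal one.

```latex
% Commonly quoted muP learning-rate scalings for Adam-style updates,
% with hidden width (fan-in) n; c_emb, c_hid, c_out are width-independent.
\eta_{\mathrm{emb}} = c_{\mathrm{emb}}, \qquad
\eta_{\mathrm{hidden}} = \frac{c_{\mathrm{hid}}}{n}, \qquad
\eta_{\mathrm{out}} = \frac{c_{\mathrm{out}}}{n}
```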
@SisForCollege
Satoki Ishikawa
11 days
One thing I've been wondering about HP transfer in μP is what criterion they're using to define "transfer." For instance, TP4 seems to state that the max LR (the maximum LR that doesn't diverge) transfers. But then, TP5 claims that the optimal LR transfers. Which is correct?
@jasondeanlee
Jason Lee
11 days
Proof by picture of why LR convergence is not useful unless it is fast relative to loss/predictions. Credit to Nikhil Ghosh, Denny Wu, and Alberto for studying this and being critical of the muP series of conclusions and overclaims.
1
1
1
@SisForCollege
Satoki Ishikawa
2 months
I'm updating awesome-second-order-optimization. If you find important / interesting papers not cited in this repository, please let me know. https://t.co/1TqAt5mtii
github.com
Contribute to riverstone496/awesome-second-order-optimization development by creating an account on GitHub.
2
3
12
@thoefler
Torsten Hoefler 🇨🇭
7 months
Rio Yokota from Tokyo Tech talks about scaling laws for #HPC, #AI training, inference, and spending 💸. We're in the exponential scaling part of a logistic curve - when will we hit the bottom? Nice discussion and analogies between the fields 🤔.
0
5
14
@SisForCollege
Satoki Ishikawa
2 months
My proposal was accepted to ACT-X「次世代AIを築く数理・情報科学の革新」(Innovations in Mathematical and Information Science for Building Next-Generation AI). I will keep working toward a deeper understanding of neural network optimization 😁
3
4
62
@Setuna7777_2
Taishi Nakamura
4 months
I won’t make it to ICML this year, but our work will be presented at the 2nd AI for Math Workshop @ ICML 2025 (@ai4mathworkshop). Huge thanks to my co-author @SisForCollege for presenting on my behalf. Please drop by if you’re around!
1
8
48
@soumithchintala
Soumith Chintala
4 months
Considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... https://t.co/ev2J8hgf3a
34
59
854
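For context on the optimizer being discussed, here is a minimal, illustrative sketch of a Muon-style step: momentum followed by approximate orthogonalization of the 2D update via a Newton-Schulz iteration. The quintic coefficients and hyperparameters below are one common choice and are not taken from the PyTorch PR; treat this as a sketch of the idea, not a reference implementation.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to its orthogonal polar factor (roughly U V^T)
    via a quintic Newton-Schulz iteration, the core of Muon-style updates."""
    a, b, c = 3.4445, -4.7750, 2.0315   # one common coefficient choice
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon-style update for a 2D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)                  # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalized direction
    weight.add_(update, alpha=-lr)
    return weight, momentum_buf
```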
@SisForCollege
Satoki Ishikawa
5 months
I found a very interesting μP paper on the embedding LR. They propose a new embedding LR scaling when the vocab size is much larger than the width. https://t.co/S1I2Eb1DGK
arxiv.org
Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On...
0
12
67
@SN_INGE
Information Geometry
5 months
BREAKING NEWS Congratulations to Professor Shun-ichi Amari! 2025 Kyoto Prize Laureates https://t.co/4fOpidHtO1
kyotoprize.org
Shun-ichi Amari
0
163
471
@SisForCollege
Satoki Ishikawa
5 months
The technical paper for Gemini 2.5 mentions improvements in “signal propagation” and “optimization dynamics.” Those terms make it sound like theoretical insights have been applied, and if so, I’d be very curious to learn exactly what those insights are. https://t.co/M9Q7GLIZdu
0
1
14
@anilkseth
Anil Seth
6 months
1/3 @geoffreyhinton once said that the future depends on some graduate student being suspicious of everything he says (via @lexfridman). He also said that it was impossible to find biologically plausible approaches to backprop that scale well: https://t.co/pbMiB8Qgis.
17
109
981
@SisForCollege
Satoki Ishikawa
6 months
A wonderful Chopin mazurka that conveys the very atmosphere of Afanassiev's concerts. I would love to hear this free, meditative playing again within Afanassiev's gravitational field, but I suppose another visit to Japan may never happen... https://t.co/lwuxJsRof6
0
0
2
@SisForCollege
Satoki Ishikawa
6 months
In this paper, they experiment with the combination of Muon and muP, but since Muon is mathematically equivalent to Shampoo, the muP for Muon should correspond to the muP for Shampoo in our paper. https://t.co/TeyHIvyc87 muP for Shampoo https://t.co/w0IMoNWXC7
arxiv.org
Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on...
0
0
7
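A small numerical sketch of the equivalence being invoked above: for a single step with no preconditioner accumulation or damping, Shampoo's update L^{-1/4} G R^{-1/4} (with L = G G^T, R = G^T G) equals the orthogonal polar factor U V^T of the gradient, which is exactly what Muon's Newton-Schulz iteration approximates. The square, well-conditioned gradient below is a simplifying assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 32))   # a square, (almost surely) full-rank "gradient"

def sym_matrix_power(M, p, eps=1e-12):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, eps, None) ** p) @ V.T

# One-shot Shampoo-style preconditioning (exponent 1/4 on each side)
L = G @ G.T
R = G.T @ G
shampoo_update = sym_matrix_power(L, -0.25) @ G @ sym_matrix_power(R, -0.25)

# Muon's idealized target: the orthogonal polar factor U V^T of G
U, _, Vt = np.linalg.svd(G)
polar_factor = U @ Vt

# Should be near machine precision for a well-conditioned G
print(np.max(np.abs(shampoo_update - polar_factor)))
```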
@SisForCollege
Satoki Ishikawa
7 months
2️⃣ Poster #608 🕒 Sat, Apr 26 • 15:00 – 17:30 📍 Location: #608 🔍 Title: PhiNets: Brain-inspired Non-contrastive Learning Based on Temporal Prediction Hypothesis https://t.co/nFPukrpIKw #ICLR2025
0
1
0
@SisForCollege
Satoki Ishikawa
7 months
Today, I'll be presenting 2 posters in #ICLR2025 1️⃣ Poster #135 🕙 Sat, Apr 26 • 10:00 – 12:30 📍 Location: #135 🔬 Title: Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation https://t.co/CbryYEMayC
1
1
6
@myamada0
myamada0
7 months
We’re happy to share that members of our MLDS unit (@OISTedu) will present several papers at #ICLR2025! Topics include brain-inspired representation learning, optimal transport, decentralized learning, anomaly detection, and LLM uncertainty quantification. Feel free to stop by.
0
5
28
@iclr_conf
ICLR 2026
7 months
Test of Time Winner: "Adam: A Method for Stochastic Optimization" by Diederik P. Kingma and Jimmy Ba. Adam revolutionized neural network training, enabling significantly faster convergence and more stable training across a wide variety of architectures and tasks.
4
48
525
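Since the award is for the update rule itself, a minimal NumPy sketch of one Adam step (Kingma & Ba, 2015) may be useful context; hyperparameter defaults follow the paper, and the toy usage loop is only an illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, bias correction, then an elementwise-scaled step. t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = ||x||^2 from a starting point of all ones
theta = np.ones(4)
m, v = np.zeros(4), np.zeros(4)
for t in range(1, 201):
    grad = 2 * theta                          # gradient of ||x||^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)                                  # much smaller than the start
```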
@SisForCollege
Satoki Ishikawa
7 months
Second-order optimization has an effect similar to data whitening. How would this apply to GPT? Data following Zipf’s law are far from whitened. If the faster convergence of Adam or Muon is due to unwhitened data, it might be worth comparing optimization when data is whitened.
2
2
13
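A minimal sketch of the comparison suggested above, showing only the whitening transform rather than any particular training setup: ZCA-whiten features whose scales follow a Zipf-like profile, so a first-order optimizer then sees inputs with (approximately) identity covariance. The toy feature construction is an assumption for illustration.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X (n_samples x n_features): afterwards the
    empirical feature covariance is approximately the identity."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    w, V = np.linalg.eigh(cov)
    W = (V * (1.0 / np.sqrt(w + eps))) @ V.T   # cov^(-1/2), the ZCA transform
    return Xc @ W

# Toy data with Zipf-like feature scales: far from whitened
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 101)               # ~ Zipf frequencies
X = rng.standard_normal((5000, 100)) * scales
Xw = zca_whiten(X)

print(np.round(np.cov(X,  rowvar=False)[:3, :3], 2))   # decaying, non-identity scales
print(np.round(np.cov(Xw, rowvar=False)[:3, :3], 2))   # approximately identity
```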