Horace He
@cHHillee
43K Followers · 7K Following · 426 Media · 3K Statuses
@thinkymachines Formerly @PyTorch "My learning style is Horace twitter threads" - @typedfemale
Joined February 2010
Apologies that I haven't written anything since joining Thinking Machines, but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”. We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to…
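A minimal sketch (mine, not from the post) of the root issue: floating point addition isn't associative, so a kernel whose reduction order changes with batch size or parallelism can return different results for identical inputs.

```python
import torch

# Floating point addition is not associative: association changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))         # False

torch.manual_seed(0)
x = torch.randn(100_000)                               # float32

a = x.sum()                                            # one reduction order
b = torch.stack([c.sum() for c in x.chunk(7)]).sum()   # chunk, then combine

# Same numbers, different association: the results typically disagree in the
# last few bits. Harmless for one op, but enough for greedy decoding to
# diverge when a server's kernels change reduction strategy with batch size.
print(a.item(), b.item(), bool(a == b))
```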
I'll be at the Triton conference today, the PyTorch conference in the morning tomorrow, and then the GPU MODE hackathon on Friday! DM me if you'd like to meet up/chat about Thinking Machines, PyTorch, ML systems, or anything else!
I'll be at the Triton conference on Tuesday, the PyTorch conference in the morning on Wednesday, and the GPU MODE hackathon on Friday. DM me if you want to chat!
Luckily, that's where Tinker comes in! We can batch user requests together and run them on efficient training/inference setups, enabling far better efficiency for users without needing massive multi-GPU setups or futzing with the infra. I'm hopeful that this makes it far easier for…
This isn't even talking about memory: even if you had 8192 parallel requests to process, your GPU probably doesn't have enough memory to handle all of them. Sadly, these factors all push fine-tuning/RL out of reach of hobbyist setups :( (4/5)
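To put rough numbers on the memory point, here is a back-of-envelope sketch; the model dimensions below are hypothetical stand-ins, not figures from the thread:

```python
# Back-of-envelope KV-cache sizing (hypothetical dims: a dense-attention
# model with GQA, served in bf16 -- not numbers from the thread).
n_layers, n_kv_heads, head_dim = 60, 8, 128
bytes_per_el = 2                       # bf16
ctx_len, n_requests = 4096, 8192

kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el  # K and V
per_request = kv_per_token * ctx_len
total = per_request * n_requests

print(f"{kv_per_token / 2**10:.0f} KiB/token, "
      f"{per_request / 2**30:.2f} GiB/request, "
      f"{total / 2**40:.1f} TiB for {n_requests} requests")
# -> 240 KiB/token, 0.94 GiB/request, 7.5 TiB total; a single H100 has
#    only ~0.08 TiB of HBM.
```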
Since decoding processes only one token at a time, we need >256 parallel requests for efficiency. *However*, in MoE models, each token only gets routed to some experts! So for DeepSeek-V3 (sparsity factor of 32), we now need 256*32 = 8192 parallel requests to get good efficiency! (3/5)
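The routing math as a sketch (the expert counts are illustrative assumptions in the style of DeepSeek-V3's top-k routing): with top-k of E experts, each expert only sees batch * k / E tokens, so the batch must grow by the sparsity factor E/k.

```python
# Per-expert token counts under top-k routing (illustrative numbers).
tokens_per_matmul = 256        # roofline target from (2/5)
n_experts, top_k = 256, 8      # DeepSeek-V3-style: sparsity factor 256/8 = 32
sparsity = n_experts // top_k

# Each token activates top_k of n_experts, so an expert sees
# batch * top_k / n_experts tokens on average.
batch_needed = tokens_per_matmul * sparsity
print(batch_needed)            # 8192 parallel requests at decode time
```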
Fundamentally, in order to get good efficiency on GPUs, you must run with a large "batch size": the matmuls simply don't have enough arithmetic intensity at low batch sizes to achieve good performance. In practice, this means you need >256 tokens. (2/5)
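A rough roofline sketch of that claim, with assumed H100-ish peak numbers (not from the tweet): for a (B×D)·(D×D) matmul in bf16, arithmetic intensity is roughly B FLOPs/byte when B ≪ D, so small batches sit well below the compute/bandwidth ridge.

```python
# Roofline sketch for a (B x D) @ (D x D) matmul in bf16.
# Assumed H100-ish peaks: ~990 bf16 TFLOP/s, ~3.35 TB/s HBM bandwidth.
peak_flops, peak_bw = 990e12, 3.35e12
ridge = peak_flops / peak_bw                     # ~295 FLOPs/byte to be compute-bound

D = 8192
for B in (1, 32, 256, 1024):
    flops = 2 * B * D * D                        # multiply-accumulates
    bytes_moved = 2 * (B * D + D * D + B * D)    # read act + weight, write out
    intensity = flops / bytes_moved              # ~= B when B << D
    bound = "compute" if intensity > ridge else "bandwidth"
    print(f"B={B:5d}  intensity={intensity:6.1f} FLOPs/byte  ({bound}-bound)")
```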
One interesting "fundamental" reason for Tinker today is the rise of MoE. Whereas hackers used to deploy llama3-70B efficiently on one node, modern MoE models require large multinode deployments for efficiency. The underlying reason? Arithmetic intensity. (1/5)
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
I'm actually just very confused about how the numbering on this leaderboard works.
> top 10
> Ranked 23rd
🚨 Big leaderboard update on the toughest Arena to crack: Text 📝 Seven new models landed today, and five broke straight into the Top 10 🏎️ 💨
🔹 #8: Qwen3-VL-235B-a22b-Instruct & Qwen3-Max-2025-09-23 (tied) by @alibaba_qwen
🔹 #9: DeepSeek V3.1 Terminus (Standard & Thinking…
I quite enjoyed this and it covers a bunch of topics without good introductory resources!
1. A bunch of GPU hardware details in one place (warp schedulers, shared memory, etc.)
2. A breakdown/walkthrough of reading PTX and SASS.
3. Some details/walkthroughs of a number of other…
New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state-of-the-art matmul kernels in CUDA, read along. (Remember, matmul is the single most important operation that transformers execute…
Modular Manifolds: managed metrics (i.e. Muon) meets manifolds, making matrix magnitudes manageable. Or M^11, as I like to call it. Check out this great post by @jxbz! It introduces some cool new ideas but also doubles as a great intro to optimization beyond Adam.
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
I actually just gave a talk at MIT a couple of days ago on some challenges in ML compilers where this was a slide. When I saw this today, I hurriedly sent this blog post over.
Thanks to everyone who helped me with the figures and design (@alhyunsoo), helped me with experiments (@jacobmenick), and also helped cut down my exclamation points by a factor of 3. :)
Suno 4.5 is quite impressive. Previously, AI music was only ever interesting for the novelty. Now, I wouldn't blink if I heard one of these songs on a playlist. First generation I tried: Prompt: "Pop song about optimizing CUDA kernels for LLM training" https://t.co/p2ehQlpacr
When it comes to hardware that's meant for training or inference, most people think in terms of hardware specs like memory bandwidth, even though dev velocity is often a more important factor. One implication is that RL training and production inference are meaningfully different workloads.