Horace He

@cHHillee

Followers
43K
Following
7K
Media
426
Statuses
3K

@thinkymachines Formerly @PyTorch "My learning style is Horace twitter threads" - @typedfemale

chhillee
Joined February 2010
@cHHillee
Horace He
2 months
Apologies that I haven't written anything since joining Thinking Machines but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!
@thinkymachines
Thinking Machines
2 months
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
74
206
3K
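The core issue behind that post, in miniature: floating-point addition is not associative, so a reduction computed in a different order (as batch-size-dependent kernels do) gives a slightly different answer. A minimal PyTorch sketch; the shapes are illustrative, not from the post:

```python
import torch

# The same mathematical sum, computed in three different reduction
# orders. Because float32 addition is not associative, the results
# typically differ in the last few bits.
torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

a = x.sum()                          # one reduction order
b = x.flip(0).sum()                  # same elements, reversed order
c = x.view(1000, 1000).sum(0).sum()  # two-stage (tiled) reduction

print(a.item(), b.item(), c.item())  # typically not bit-identical
```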
@cHHillee
Horace He
23 days
Actually, I'll be at the PyTorch Conference on Thursday!
0
0
10
@cHHillee
Horace He
23 days
I'll be at the Triton conference today, the PyTorch conference in the morning tomorrow, and then the GPU MODE hackathon on Friday! DM me if you'd like to meet up/chat about Thinking Machines, PyTorch, ML systems, or anything else!
4
6
219
@cHHillee
Horace He
23 days
I'll be at the Triton conference on Tuesday, the PyTorch conference in the morning on Wednesday, and the GPU MODE hackathon on Friday. DM me if you want to chat!
4
3
256
@cHHillee
Horace He
1 month
🤔
@krishnanrohit
rohit
1 month
Most LLM prose is flat. But there's no logical reason it should be! So I built "Horace" to see if measuring cadence, rhythm, and surprise can help steer models toward better writing.
2
2
126
@cHHillee
Horace He
1 month
Luckily, that's where Tinker comes in! We can batch user requests together and run them on efficient training/inference setups, enabling far better efficiency for users without needing massive multi-GPU setups or futzing with the infra. I'm hopeful that this makes it far easier for
5
5
126
@cHHillee
Horace He
1 month
This isn't even talking about memory: even if you had 8192 parallel requests to process, your GPU probably doesn't have enough memory to handle all of them. Sadly, these factors all push fine-tuning/RL out of reach of hobbyist setups :( (4/5)
1
1
71
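A rough back-of-envelope for the memory claim. The shapes below are assumed, roughly DeepSeek-V3-like values (and ignore its MLA KV compression), purely to show the order of magnitude:

```python
# Back-of-envelope KV-cache sizing with assumed, illustrative shapes.
n_layers, n_kv_heads, head_dim = 61, 8, 128
bytes_per_elem = 2  # bf16
# K and V per layer, per token:
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

requests, context_len = 8192, 4096
total_gib = requests * context_len * kv_bytes_per_token / 2**30
print(f"~{total_gib:,.0f} GiB of KV cache")  # thousands of GiB: no single GPU holds this
```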
@cHHillee
Horace He
1 month
Since decoding processes only one token per request, we need >256 parallel requests for efficiency. *However*, in MoE, each token only gets routed to some experts! So for DeepSeek-V3 (sparsity factor of 32), we now need 256*32 = 8192 parallel requests to get good efficiency! (3/5)
1
6
73
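The 256*32 arithmetic, spelled out. The 256-token threshold is the rule of thumb from tweet 2/5 of this thread:

```python
# With a sparsity factor of 32, each expert matmul sees only ~1/32 of
# the global batch, so the batch must be 32x larger to keep every
# expert's matmul above the efficiency threshold.
tokens_needed_per_matmul = 256
sparsity_factor = 32

for batch in (256, 2048, 8192):
    tokens_per_expert_matmul = batch / sparsity_factor
    efficient = tokens_per_expert_matmul >= tokens_needed_per_matmul
    print(f"batch={batch:5d} -> {tokens_per_expert_matmul:6.0f} tokens/expert matmul, efficient={efficient}")
```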
@cHHillee
Horace He
1 month
Fundamentally, in order to get good efficiency on GPUs, you must run with a large "batch size". As we see above, the matmuls simply don't have enough arithmetic intensity at low batch sizes to achieve good performance. In practice, this means you need >256 tokens. (2/5)
1
1
73
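A sketch of where the >256-token rule of thumb comes from. The d=8192 layer width is an assumption; the hardware numbers are published H100 SXM specs (~989 bf16 TFLOP/s, ~3.35 TB/s HBM), giving a ridge point around 295 FLOPs per byte:

```python
# Arithmetic intensity (FLOPs per byte moved) of a bf16 matmul
# (batch x d) @ (d x d). Below ~295 FLOPs/byte on an H100, the kernel
# is bandwidth-bound rather than compute-bound.
def arithmetic_intensity(batch: int, d: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * batch * d * d
    bytes_moved = bytes_per_elem * (batch * d + d * d + batch * d)
    return flops / bytes_moved

for batch in (1, 16, 256, 1024):
    print(f"batch={batch:4d}: ~{arithmetic_intensity(batch, 8192):.0f} FLOPs/byte")
```

At batch 1 the intensity is ~1 FLOP/byte; only once the batch reaches the hundreds does it approach the hardware's ridge point.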
@cHHillee
Horace He
1 month
One interesting "fundamental" reason for Tinker today is the rise of MoE. Whereas hackers used to deploy Llama-3-70B efficiently on one node, modern MoE models require large multinode deployments for efficiency. The underlying reason? Arithmetic intensity. (1/5)
@thinkymachines
Thinking Machines
1 month
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
15
69
837
@cHHillee
Horace He
1 month
I'm actually just very confused how the numbering on this leaderboard works.
2
0
29
@cHHillee
Horace He
1 month
> top 10
> Ranked 23rd
@arena
lmarena.ai
1 month
🚨 Big leaderboard update on the toughest Arena to crack: Text 📝 Seven new models landed today, and five broke straight into the Top 10 🏎️ 💨
🔹 #8: Qwen3-VL-235B-a22b-Instruct & Qwen3-Max-2025-09-23 (tied) by @alibaba_qwen
🔹 #9: DeepSeek V3.1 Terminus (Standard & Thinking
3
6
141
@cHHillee
Horace He
1 month
I quite enjoyed this and it covers a bunch of topics without good introductory resources!
1. A bunch of GPU hardware details in one place (warp schedulers, shared memory, etc.)
2. A breakdown/walkthrough of reading PTX and SASS.
3. Some details/walkthroughs of a number of other
@gordic_aleksa
Aleksa Gordić (skill issue)
2 months
New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high-performance matmul kernels". If you want to deeply understand how one writes state-of-the-art matmul kernels in CUDA, read along. (Remember, matmul is the single most important operation that transformers execute
8
94
983
@cHHillee
Horace He
2 months
Modular Manifolds: managed metrics (i.e., Muon) meets manifolds, making matrix magnitudes manageable. Or M^11, as I like to call it. Check out this great post by @jxbz! It introduces some cool new ideas but also doubles as a great intro to optimization beyond Adam.
@thinkymachines
Thinking Machines
2 months
Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.
4
23
364
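Not the construction from the post, but a toy illustration of what a manifold constraint on a weight matrix looks like in training code. The unit-row-norm constraint and the SGD step here are my own illustrative choices:

```python
import torch

# Take an ordinary optimizer step, then retract the weight matrix back
# onto a constraint set. Here the set is "rows of unit L2 norm", chosen
# purely for simplicity; the post co-designs optimizer and constraint.
w = torch.randn(4, 8, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

x, target = torch.randn(16, 8), torch.randn(16, 4)
loss = (x @ w.T - target).pow(2).mean()
loss.backward()
opt.step()

with torch.no_grad():
    w /= w.norm(dim=1, keepdim=True).clamp_min(1e-12)  # retraction step
```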
@cHHillee
Horace He
2 months
I actually just gave a talk at MIT a couple of days ago on some challenges in ML compilers where this was a slide. When I saw this today, I hurriedly sent the blog post over.
12
11
253
@cHHillee
Horace He
2 months
Lots of sympathy to the Anthropic team 🙏🙏🙏
@claudeai
Claude
2 months
In our investigation, we uncovered three separate bugs. They were partly overlapping, making diagnosis even trickier. We've now resolved all three bugs and written a technical report on what happened, which you can find here:
21
71
2K
@cHHillee
Horace He
2 months
Thanks to everyone who helped me with the figures and design (@alhyunsoo), helped me with experiments (@jacobmenick), and helped cut down my exclamation points by a factor of 3. :)
6
0
104
@cHHillee
Horace He
3 months
Suno 4.5 is quite impressive. Previously, AI music was only ever interesting for the novelty. Now I wouldn't blink if I heard one of these songs on a playlist. First generation I tried: Prompt: "Pop song about optimizing CUDA kernels for LLM training" https://t.co/p2ehQlpacr
8
9
213
@cHHillee
Horace He
3 months
6
21
314
@cHHillee
Horace He
3 months
When it comes to hardware that's meant for training or inference, most people think in terms of hardware specs like memory bandwidth, even though dev velocity is often a more important factor. One implication is that RL training and production inference are meaningfully different workloads.
11
8
258