Benjamin F Spector (@bfspector)
Followers: 4K · Following: 3K · Media: 22 · Statuses: 127
stanford cs phd student. i make ml go brr.
Joined October 2020
Shout out to ThunderKittens for showing how to write simple yet very performant GPU code. We're working on "tinykittens", which uses the same insight but in tinygrad's language. The insight is that GPU "registers" are the wrong primitive and TK's "register tile" is a lot more sensible.
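To make the register-tile insight concrete, here is a minimal CUDA sketch; the struct, layout, and names are hypothetical illustrations, not ThunderKittens' or tinykittens' actual API. A 16x16 tile is owned collectively by one warp, its 256 elements spread across the 32 threads' registers, and loads, math, and stores are expressed on the tile rather than on individual registers.

```cuda
// Minimal sketch of the "register tile" idea (NOT ThunderKittens' or
// tinykittens' real API): a 16x16 float tile owned collectively by one warp,
// with its 256 elements spread across the 32 threads' registers (8 each).
#include <cstdio>
#include <cuda_runtime.h>

struct RegTile16x16 {
    float frag[8];                       // this thread's 8 elements of the tile

    __device__ void load(const float* src, int stride) {
        int lane = threadIdx.x % 32;
        for (int i = 0; i < 8; i++) {
            int idx = lane + 32 * i;     // flat element index 0..255
            frag[i] = src[(idx / 16) * stride + (idx % 16)];
        }
    }
    __device__ void scale(float alpha) { // collective elementwise op on the tile
        for (int i = 0; i < 8; i++) frag[i] *= alpha;
    }
    __device__ void store(float* dst, int stride) const {
        int lane = threadIdx.x % 32;
        for (int i = 0; i < 8; i++) {
            int idx = lane + 32 * i;
            dst[(idx / 16) * stride + (idx % 16)] = frag[i];
        }
    }
};

__global__ void scale_tile(const float* in, float* out, float alpha) {
    RegTile16x16 t;                      // lives entirely in registers
    t.load(in, 16);
    t.scale(alpha);
    t.store(out, 16);
}

int main() {
    float h[256], *d_in, *d_out;
    for (int i = 0; i < 256; i++) h[i] = (float)i;
    cudaMalloc(&d_in, sizeof(h)); cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    scale_tile<<<1, 32>>>(d_in, d_out, 2.0f);   // one warp owns one tile
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[255] = %f\n", h[255]);            // expect 510.0
    return 0;
}
```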
@StanfordHAI just ran this story on self-study and cartridges -- it's a really nice overview for those curious about our work
(8/8) This is joint work with my amazing collaborators @jordanjuravsky, @stuart_sul, @dylan__lim, @OwenDugan, @simran_s_arora, and @HazyResearch. And special thanks to @togethercompute for providing the GPUs to make this work possible!
(7/8) Code is at https://t.co/2F5WMW1Wd2; it is (emphasis) research code. You can also play with our custom profiler at https://t.co/bHlWqQwqOY! We’ve written up both a brief, introductory post at https://t.co/hLW9WB28qZ and a longer, more technical one:
hazyresearch.stanford.edu
(6/8) We pipeline scheduling work on the CPU while the GPU runs the last batch, so that new instructions are ready by the time they’re needed. In practice, this usually leaves the CPU idle for ~90% of runtime.
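A hedged sketch of this CPU/GPU pipelining, with hypothetical names (Instruction, build_instructions, run_interpreter) standing in for the engine's real types: double-buffered instruction lists let the CPU schedule batch b+1 while the GPU executes batch b.

```cuda
#include <cuda_runtime.h>

struct Instruction { int opcode; int args[7]; };
constexpr int MAX_INSTRS = 4096;

// Hypothetical CPU-side scheduler: fills the instruction list for one batch.
static int build_instructions(int batch, Instruction* out) {
    for (int i = 0; i < 128; i++) out[i] = Instruction{batch % 4, {i}};
    return 128;
}

// Hypothetical on-GPU interpreter (the real one dispatches many instruction types).
__global__ void run_interpreter(const Instruction* instrs, int n) {
    for (int i = blockIdx.x; i < n; i += gridDim.x) { /* execute instrs[i] */ }
}

int main() {
    cudaStream_t stream; cudaStreamCreate(&stream);
    Instruction *host[2], *dev[2]; int count[2];
    for (int i = 0; i < 2; i++) {
        cudaMallocHost(&host[i], MAX_INSTRS * sizeof(Instruction)); // pinned, so async copies overlap
        cudaMalloc(&dev[i], MAX_INSTRS * sizeof(Instruction));
    }
    int num_batches = 8;
    count[0] = build_instructions(0, host[0]);          // schedule the first batch up front
    for (int b = 0; b < num_batches; b++) {
        int cur = b & 1, nxt = cur ^ 1;
        cudaMemcpyAsync(dev[cur], host[cur], count[cur] * sizeof(Instruction),
                        cudaMemcpyHostToDevice, stream);
        run_interpreter<<<132, 256, 0, stream>>>(dev[cur], count[cur]);
        // The GPU is now busy with batch b; the CPU schedules batch b+1 in parallel.
        if (b + 1 < num_batches) count[nxt] = build_instructions(b + 1, host[nxt]);
        cudaStreamSynchronize(stream);  // by the time this returns, b+1's instructions are already built
    }
    return 0;
}
```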
(5/8) Third, we interleave communication-intensive instructions with compute-intensive ones, which lets us reduce peak network bandwidth. For large batches, this can make a substantial difference.
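One way to picture this interleaving, as a simplified host-side sketch: the Instruction type and the even-spacing heuristic below are assumptions, not the engine's actual scheduler. Communication instructions are spread evenly among compute instructions so transfers overlap with math rather than bursting all at once.

```cuda
#include <vector>

struct Instruction { int opcode; int args[7]; };

// Interleave comm-heavy instructions among compute-heavy ones at a steady rate.
std::vector<Instruction> interleave(const std::vector<Instruction>& compute,
                                    const std::vector<Instruction>& comm) {
    std::vector<Instruction> out;
    out.reserve(compute.size() + comm.size());
    size_t c = 0;
    for (size_t i = 0; i < compute.size(); i++) {
        out.push_back(compute[i]);
        // Emit the next comm instruction once we are "due" for it proportionally.
        if (c < comm.size() &&
            (i + 1) * comm.size() >= (c + 1) * compute.size()) {
            out.push_back(comm[c++]);
        }
    }
    while (c < comm.size()) out.push_back(comm[c++]);  // flush any leftovers
    return out;
}
```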
(4/8) Second, we dynamically schedule work across each GPU's SMs at runtime: SMs just pull new work from a global work queue as needed. This simplifies scheduling on the CPU and prevents jitter on the GPU from leading to stalls.
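The work-queue idea looks roughly like the persistent-block pattern below; the opcodes and fields are hypothetical, not the megakernel's real instruction format. Each block claims the next instruction index with an atomic, so a block that finishes early immediately pulls more work.

```cuda
#include <cuda_runtime.h>

struct Instruction { int opcode; int args[7]; };

__device__ unsigned int next_work;   // queue head; reset to 0 (cudaMemcpyToSymbol) before each launch

__device__ void execute(const Instruction& ins) {
    // Stand-in: the real interpreter dispatches on ins.opcode here.
}

__global__ void dynamic_interpreter(const Instruction* instrs, int n) {
    __shared__ unsigned int my_work;
    while (true) {
        if (threadIdx.x == 0) my_work = atomicAdd(&next_work, 1u);  // one claim per block
        __syncthreads();
        if (my_work >= (unsigned)n) break;   // queue drained: the whole block retires
        execute(instrs[my_work]);
        __syncthreads();                     // finish this instruction before claiming the next
    }
}
// Typically launched as a persistent grid, roughly one block per SM
// (e.g., <<<132, 256>>> on an H100), so every SM stays busy until the queue drains.
```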
(3/8) There are three key kernel optimizations that help us achieve this performance. The first is inter-instruction pipelining. We specialize our threads to follow a dataflow pattern, so that we can overlap instructions and keep the matrix multiplies rolling!
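A simplified sketch of this dataflow/warp-specialization pattern; the shapes and names are invented, and a real kernel would use asynchronous copies rather than plain loads. One loader warp stages the next instruction's operands into shared memory while the remaining warps compute on the current, already-resident operands.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 2048;   // elements staged per instruction

// operands: n_instrs device pointers, each to TILE floats; out: n_instrs floats, zero-initialized.
// Launch with a single block (e.g., <<<1, 256>>>) for clarity.
__global__ void pipelined_interpreter(const float* const* operands, float* out, int n_instrs) {
    __shared__ float buf[2][TILE];            // double buffer
    int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    int n_compute = blockDim.x - 32;          // threads in warps 1..N

    // Prologue: everyone helps stage operands for instruction 0.
    for (int j = threadIdx.x; j < TILE; j += blockDim.x) buf[0][j] = operands[0][j];
    __syncthreads();

    for (int i = 0; i < n_instrs; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (warp == 0) {
            // Loader warp: prefetch the NEXT instruction's operands.
            if (i + 1 < n_instrs)
                for (int j = lane; j < TILE; j += 32) buf[nxt][j] = operands[i + 1][j];
        } else {
            // Compute warps: consume the CURRENT buffer (a reduction stands in for a matmul).
            float acc = 0.f;
            for (int j = threadIdx.x - 32; j < TILE; j += n_compute) acc += buf[cur][j];
            atomicAdd(&out[i], acc);
        }
        __syncthreads();   // buffer handoff: i+1's operands are now resident
    }
}
```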
(2/8) Our megakernel is built on the same on-GPU interpreter used in our low-latency megakernel, and extended with TK’s new PGL primitives to scale across GPUs. We’ve written a larger, richer instruction set to run these more complex workloads without coarsening synchronization.
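The on-GPU interpreter pattern, in toy form: the opcodes and instruction fields below are invented for illustration and are not the megakernel's actual instruction set. One kernel launch walks a list of instructions and dispatches on each opcode, so a single launch can run an arbitrary sequence of ops.

```cuda
#include <cuda_runtime.h>

enum Opcode { OP_NOOP = 0, OP_SCALE = 1, OP_ADD = 2, OP_BARRIER = 3 };

struct Instruction {
    int opcode;
    float* dst; const float* src;   // operand pointers
    float imm;                      // immediate (e.g., scale factor)
    int n;                          // element count
};

// Here every block runs the same program; launch with <<<1, 256>>> for a single-stream toy.
__global__ void interpreter(const Instruction* program, int n_instrs) {
    for (int pc = 0; pc < n_instrs; pc++) {
        Instruction ins = program[pc];
        switch (ins.opcode) {
            case OP_SCALE:
                for (int j = threadIdx.x; j < ins.n; j += blockDim.x)
                    ins.dst[j] = ins.src[j] * ins.imm;
                break;
            case OP_ADD:
                for (int j = threadIdx.x; j < ins.n; j += blockDim.x)
                    ins.dst[j] += ins.src[j];
                break;
            case OP_BARRIER:
            case OP_NOOP:
            default:
                break;
        }
        __syncthreads();   // in-block ordering between consecutive instructions
    }
}
```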
(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
(1/6) We’re happy to share that ThunderKittens now supports writing multi-GPU kernels, with the same programming model and full compatibility with PyTorch + torchrun. We’re also releasing collective ops and fused multi-GPU GEMM kernels, up to 2.6x faster than PyTorch + NCCL.
MoE layers can be really slow. When training our coding models at @cursor_ai, they ate up 27–53% of training time. So we completely rebuilt the MoE layer at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup. We believe our
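For context on the data format, here is a rough sketch of MX-style block-scaled FP8 quantization as I understand it (32-element blocks sharing a power-of-two scale, elements stored as E4M3). This is an illustration of the format only, not Cursor's actual kernels, and the signed-char exponent is a simplification of the E8M0 scale encoding.

```cuda
#include <cmath>
#include <cuda_fp8.h>
#include <cuda_runtime.h>

constexpr int BLOCK = 32;                 // elements sharing one scale
constexpr float FP8_E4M3_MAX = 448.0f;    // largest normal E4M3 value

__global__ void quantize_mxfp8(const float* x, __nv_fp8_storage_t* q,
                               signed char* shared_exp, int n_blocks) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= n_blocks) return;            // one thread handles one 32-element block

    const float* blk = x + b * BLOCK;
    float amax = 0.f;
    for (int i = 0; i < BLOCK; i++) amax = fmaxf(amax, fabsf(blk[i]));

    // Power-of-two scale so the block's largest element maps near the FP8 max.
    int e = (amax > 0.f) ? (int)ceilf(log2f(amax / FP8_E4M3_MAX)) : 0;
    float scale = exp2f((float)e);
    shared_exp[b] = (signed char)e;       // one shared exponent per block (E8M0-style)

    for (int i = 0; i < BLOCK; i++)
        q[b * BLOCK + i] = __nv_cvt_float_to_fp8(blk[i] / scale,
                                                 __NV_SATFINITE, __NV_E4M3);
}
```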
Paradigm is the AI-native spreadsheet to eliminate menial work. Thousands of users have saved 10,000+ hours with Paradigm, and you can be next. Get your first month free today, then plans start at just $20/month.
Introducing MirageLSD: The First Live-Stream Diffusion (LSD) AI Model. Input any video stream, from a camera or video chat to a computer screen or game, and transform it into any world you desire, in real time (<40ms latency). Here’s how it works (w/ demo you can use!):
1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:
Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models. (Joint work with @achakravarthy01, @ryansehrlich, @EyubogluSabri, @brad19brown, @jshetaye,
3 months ago, Stanford's Hazy Research lab introduced Minions, a project that connects Ollama to frontier cloud models to reduce cloud costs by 5-30x while achieving 98% of frontier model accuracy. Secure Minion turns an H100 into a secure enclave, where all memory and
We wrote a megakernel! Excited to share how we fused Llama-1B into a single kernel to reach SOTA latency. Check out our blog post and code below!
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint
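The core megakernel trick, in toy form (this is not the released Llama-1B kernel, just the pattern, with a dense matvec standing in for attention/MLP layers): a single cooperative kernel loops over all layers and replaces kernel-launch boundaries with grid-wide synchronization, removing per-layer launch overhead.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

constexpr int DIM = 1024, LAYERS = 16;

// Hypothetical stand-in for one layer's math; weights holds LAYERS * DIM * DIM floats.
__device__ void layer_forward(const float* W, const float* in, float* out,
                              int tid, int nthreads) {
    for (int r = tid; r < DIM; r += nthreads) {
        float acc = 0.f;
        for (int c = 0; c < DIM; c++) acc += W[r * DIM + c] * in[c];
        out[r] = acc;
    }
}

__global__ void megakernel(const float* weights, float* act_a, float* act_b) {
    cg::grid_group grid = cg::this_grid();
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float* in = act_a; float* out = act_b;
    for (int l = 0; l < LAYERS; l++) {
        layer_forward(weights + (size_t)l * DIM * DIM, in, out, tid, nthreads);
        grid.sync();                        // replaces a kernel-launch boundary
        float* t = in; in = out; out = t;   // ping-pong the activation buffers
    }
}
// Must be launched with cudaLaunchCooperativeKernel so grid.sync() is legal.
```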
(5/5) We’re open-sourcing all of the code so that you too can stop torturing your models with kernel launches (may Roko grant you a quick death) and have written up a blog with a bit more detail on how it all works. Code: https://t.co/732gnseVzP, Blog:
hazyresearch.stanford.edu