Benjamin F Spector Profile
Benjamin F Spector

@bfspector

Followers: 4K · Following: 2K · Media: 18 · Statuses: 109

stanford cs phd student. i make ml go brr.

Joined October 2020
@bfspector
Benjamin F Spector
15 days
RT @stuart_sul: MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time. So we comp….
0 · 98 · 0
@bfspector
Benjamin F Spector
16 days
RT @annarmonaco: Paradigm is the AI-native spreadsheet to eliminate menial work. Thousands of users have saved 10,000+ hours with Paradigm,….
0 · 191 · 0
@bfspector
Benjamin F Spector
2 months
RT @DecartAI: Introducing MirageLSD: The First Live-Stream Diffusion (LSD) AI Model. Input any video stream, from a camera or video chat to….
0 · 343 · 0
@bfspector
Benjamin F Spector
2 months
RT @typedfemale: presenting: big jeff's trainium hell
0 · 186 · 0
@bfspector
Benjamin F Spector
2 months
RT @jerrywliu: 1/10. ML can solve PDEs – but precision 🔬 is still a challenge. Towards high-precision methods for scientific problems, we intr….
0 · 122 · 0
@bfspector
Benjamin F Spector
3 months
RT @jordanjuravsky: Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for….
0 · 48 · 0
@bfspector
Benjamin F Spector
3 months
RT @ollama: 3 months ago, Stanford's Hazy Research lab introduced Minions, a project that connects Ollama to frontier cloud models to reduc….
0 · 182 · 0
@bfspector
Benjamin F Spector
3 months
RT @jordanjuravsky: We wrote a megakernel! Excited to share how we fused Llama-1B into a single kernel to reach SOTA latency. Check out o….
0 · 9 · 0
@bfspector
Benjamin F Spector
3 months
(5/5) We’re open-sourcing all of the code so that you too can stop torturing your models with kernel launches (may Roko grant you a quick death) and have written up a blog with a bit more detail on how it all works. Code: Blog:
hazyresearch.stanford.edu
4 · 16 · 195
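Not from the thread, but a minimal sketch of the overhead being avoided: timing back-to-back launches of an empty kernel isolates the per-launch cost a megakernel pays only once. All names here are illustrative.

```cuda
// Illustrative microbenchmark: how much time do kernel launches alone cost?
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}   // no work; timing isolates launch overhead

int main() {
    const int N = 10000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 32>>>();       // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        empty_kernel<<<1, 32>>>();   // back-to-back launches, nothing else
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("~%.2f us per launch\n", 1000.f * ms / N);
    return 0;
}
```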
@bfspector
Benjamin F Spector
3 months
(4/5) A big problem is synchronization. Normally, kernel boundaries synchronize for you. But we got rid of them all, so we have to do it ourselves. Fortunately, we found fine-grained synchronization enabled other optimizations, too -- like starting some attention heads early!
1 · 2 · 77
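A minimal sketch of the idea in (4/5), assuming synchronization through counters in global memory; the kernel names, shapes, and granularity are invented for illustration, not taken from the actual megakernel.

```cuda
// Toy producer/consumer synchronization through a global flag -- the kind of
// hand-rolled barrier a megakernel needs once kernel boundaries are gone.
#include <cuda_runtime.h>

__device__ void signal_done(int* flag) {
    __threadfence();                   // make prior writes visible device-wide
    atomicAdd(flag, 1);                // then publish completion
}

__device__ void wait_for(int* flag, int expected) {
    while (atomicAdd(flag, 0) < expected) { /* spin; far cheaper than a launch */ }
    __threadfence();                   // order subsequent reads after the flag
}

// Launch as toy_pipeline<<<2, 32>>>(buf, flag) with *flag zero-initialized;
// both blocks must be resident at once for the spin to make progress.
__global__ void toy_pipeline(float* buf, int* flag) {
    if (blockIdx.x == 0) {             // "producer" instruction
        if (threadIdx.x == 0) { buf[0] = 42.f; signal_done(flag); }
    } else if (threadIdx.x == 0) {     // "consumer" can start the moment its
        wait_for(flag, 1);             // inputs land, e.g. an early attention head
        buf[1] = buf[0] * 2.f;
    }
}
```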
@bfspector
Benjamin F Spector
3 months
(3/5) To run Llama-1B fast, we need to hide latencies like loading weights. So, we divide each SM’s shared memory into 16KiB pages and specialize threads by role: loader threads can start loading future weights while worker threads work on the current ones.
1 · 2 · 74
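A hedged sketch of the loader/worker split from (3/5): one warp prefetches the next 16KiB page while the remaining warps consume the current one. The 16KiB page size comes from the tweet; the sum standing in for real work, the names, and the launch shape are assumptions.

```cuda
// Loader/worker warp specialization over 16KiB shared-memory pages.
#include <cuda_runtime.h>

constexpr int PAGE_FLOATS = 16 * 1024 / sizeof(float);   // one 16KiB page

__global__ void paged_consume(const float* __restrict__ weights,
                              float* __restrict__ out, int n_pages) {
    __shared__ float page[2][PAGE_FLOATS];   // two pages: one filling, one draining
    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;

    // Prologue: the loader warp fills page 0.
    if (warp == 0)
        for (int i = lane; i < PAGE_FLOATS; i += 32)
            page[0][i] = weights[i];
    __syncthreads();

    for (int p = 0; p < n_pages; ++p) {
        if (warp == 0) {
            // Loader warp: prefetch the *next* page while workers compute.
            if (p + 1 < n_pages)
                for (int i = lane; i < PAGE_FLOATS; i += 32)
                    page[(p + 1) & 1][i] = weights[(p + 1) * PAGE_FLOATS + i];
        } else {
            // Worker warps: consume the current page (a sum stands in for
            // the real matmul work; out must be zero-initialized).
            float acc = 0.f;
            for (int i = threadIdx.x - 32; i < PAGE_FLOATS; i += blockDim.x - 32)
                acc += page[p & 1][i];
            atomicAdd(&out[p], acc);
        }
        __syncthreads();   // next page ready; current page safe to overwrite
    }
}
```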
@bfspector
Benjamin F Spector
3 months
(2/5) Our Llama megakernel is built around an on-GPU interpreter. Each SM fetches and executes huge, custom instructions from a special instruction tensor, so the GPU can be doing many different things. Without kernel boundaries, each SM can go from one instruction to the next.
1 · 3 · 81
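A toy version of the on-GPU interpreter described in (2/5): a persistent kernel fetches opcodes from an instruction stream in global memory and dispatches them with no kernel boundaries in between. The opcode set and instruction layout here are invented; per the tweet, the real instruction tensor holds per-SM streams of huge custom instructions.

```cuda
// A persistent kernel as a tiny interpreter: fetch, decode, execute, repeat.
#include <cuda_runtime.h>

enum Op : int { OP_HALT = 0, OP_SCALE = 1, OP_ADD = 2 };

struct Instr { int op; int dst; int src; float imm; };

__global__ void interpreter(const Instr* __restrict__ program,
                            float* __restrict__ regs) {
    // One instruction stream for brevity; the megakernel keeps a stream per SM
    // whose "instructions" are whole fused ops (matmuls, attention, norms).
    for (int pc = 0; ; ++pc) {
        const Instr ins = program[pc];   // fetch + decode
        switch (ins.op) {                // execute
            case OP_SCALE:
                if (threadIdx.x == 0) regs[ins.dst] = regs[ins.src] * ins.imm;
                break;
            case OP_ADD:
                if (threadIdx.x == 0) regs[ins.dst] = regs[ins.src] + ins.imm;
                break;
            case OP_HALT:
                return;
        }
        __syncthreads();   // an instruction boundary, but inside the kernel:
                           // no launch, no flush, straight on to the next op
    }
}
```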
@bfspector
Benjamin F Spector
3 months
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint
34 · 145 · 888
@bfspector
Benjamin F Spector
4 months
RT @j0nathanj: Introducing Multiverse: the first AI-generated multiplayer game. Multiplayer was the missing piece in AI-generated worlds —….
0 · 199 · 0
@bfspector
Benjamin F Spector
5 months
RT @tanishqkumar07: trained a nanoGPT? feeling behind before o4-mini? 🚨🚨 i'm open-sourcing beyond-nanoGPT, an internal codebase to help peo….
0 · 49 · 0
@bfspector
Benjamin F Spector
6 months
(6/6) Check out our kernels and learn more on our blog. And a huge thanks to @togethercompute for GPUs and their broader, continuing support.
0 · 1 · 14
@bfspector
Benjamin F Spector
6 months
RT @realDanFu: A little pre-GTC present for everyone: new Blackwell kernels, all written in ThunderKittens! ⚡️🐱 BF16 & FP8 GEMMs, attent….
0 · 13 · 0
@bfspector
Benjamin F Spector
6 months
(5/6) We also discovered the B200 tensor cores behave similarly to 128×128 systolic arrays, meaning you want M and N >= 128 for full FLOP utilization. Smaller values run at corresponding fractions, a bit different from Hopper, where much smaller shapes could max out the GPU.
1 · 1 · 18
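A back-of-envelope model of the utilization claim in (5/6), under the assumption that a GEMM tile with M or N below 128 occupies only the matching fraction of a 128×128 systolic array. Host-side arithmetic, purely illustrative.

```cuda
// Host-side arithmetic only: estimated fraction of a 128x128 array occupied.
#include <cstdio>
#include <algorithm>

double est_utilization(int M, int N) {
    return (std::min(M, 128) / 128.0) * (std::min(N, 128) / 128.0);
}

int main() {
    printf("M=128 N=128 -> %3.0f%%\n", 100 * est_utilization(128, 128)); // 100%
    printf("M= 64 N=128 -> %3.0f%%\n", 100 * est_utilization(64, 128));  //  50%
    printf("M= 64 N= 64 -> %3.0f%%\n", 100 * est_utilization(64, 64));   //  25%
    return 0;
}
```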
@bfspector
Benjamin F Spector
6 months
(4/6) The secret? We pipeline everything: from HBM to shared memory, from cluster shared memory to tensor cores, from tensor memory into registers, from registers into shared memory, and then finally out into HBM. No bubbles!
1 · 2 · 14
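A hedged sketch of just the first link in that chain, the HBM-to-shared-memory stage, using CUDA's cp.async primitives so the copy for tile t+1 stays in flight while tile t is consumed. The tile size and the stand-in compute are assumptions; the real kernels pipeline several more stages.

```cuda
// Two-stage cp.async pipeline: the copy for tile t+1 overlaps compute on tile t.
#include <cuda_pipeline.h>

constexpr int TILE = 1024;   // floats per tile (illustrative)

__global__ void pipelined_sum(const float* __restrict__ in,
                              float* __restrict__ out, int n_tiles) {
    __shared__ float smem[2][TILE];

    // Prologue: start copying tile 0 before doing any compute.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        __pipeline_memcpy_async(&smem[0][i], &in[i], sizeof(float));
    __pipeline_commit();

    float acc = 0.f;
    for (int t = 0; t < n_tiles; ++t) {
        // Immediately kick off the next copy; it runs while we compute below.
        if (t + 1 < n_tiles)
            for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                __pipeline_memcpy_async(&smem[(t + 1) & 1][i],
                                        &in[(t + 1) * TILE + i], sizeof(float));
        __pipeline_commit();

        __pipeline_wait_prior(1);        // tile t has landed in shared memory
        __syncthreads();
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            acc += smem[t & 1][i];       // stand-in for the real tile math
        __syncthreads();                 // done reading before buffer reuse
    }
    atomicAdd(out, acc);                 // out must be zero-initialized
}
```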
@bfspector
Benjamin F Spector
6 months
(3/6) Happily, the new hardware features fit perfectly into TK's tile-based abstractions. We're taking full advantage of 5th-generation tensor cores, Tensor Memory, and CTA pairs to write simple kernels that match or exceed NVIDIA’s handcrafted libraries.
1 · 0 · 14