William Hu
@_williamhu
Followers
534
Following
162
Media
4
Statuses
46
CS @Stanford
Stanford, CA
Joined July 2022
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
7
37
161
Happy Thanksgiving weekend! This year, we cooked up a new recipe for juicy fact-storing MLPs. Instead of picking apart trained models, we asked: Can we construct fact-storing MLPs from scratch? Spoiler: we can & we figured out how to slot these hand-crafted MLPs into
8
47
336
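For readers wondering what building a fact-storing MLP by hand could look like mechanically, one classical key-value construction (a hedged sketch under generic assumptions, not necessarily the recipe behind the thread above) treats the hidden layer as a key detector and the output layer as a value table:

```latex
% Illustrative hand-built key-value MLP: store facts (k_i, v_i) so that
% f(k_i) is approximately v_i. Notation here is generic, not the thread's.
\[
  f(x) \;=\; W_{\mathrm{out}}\,\sigma\!\left(W_{\mathrm{in}} x + b\right),
  \qquad
  W_{\mathrm{in}} = \begin{bmatrix} k_1^{\top} \\ \vdots \\ k_N^{\top} \end{bmatrix},
  \qquad
  W_{\mathrm{out}} = \begin{bmatrix} v_1 & \cdots & v_N \end{bmatrix}.
\]
% With near-orthogonal keys and the bias b chosen so that only the matching
% row clears the nonlinearity, \sigma(W_in k_i + b) is close to the basis
% vector e_i, hence f(k_i) is close to W_out e_i = v_i: fact i lives in the
% i-th row of W_in and the i-th column of W_out.
```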
Super excited for ParallelKittens led by @stuart_sul! From the NVIDIA A100 to the B200, BF16 tensor core performance improved by 7.2x and High Bandwidth Memory bandwidth by 5.1x, while intra-node communication (NVLink) improved by only 3x and inter-node (PCIe/InfiniBand) by just
(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We're happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new
4
22
194
(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We're happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new
8
60
516
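As an illustration of what a computation-communication overlapped multi-GPU setup looks like at its most basic, here is a plain HIP streams-and-events sketch: GPU 0 computes on the next chunk while the previous one is still being copied to GPU 1. This is a generic pattern, not the ParallelKittens API; every name and size below is made up for the example.

```cpp
// Toy HIP sketch of computation-communication overlap across two GPUs:
// while chunk c is being copied to GPU 1 on a "comm" stream, chunk c+1 is
// already being computed on GPU 0's "compute" stream. NOT the ParallelKittens
// API; names and sizes are illustrative, and error checks are omitted.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    int ndev = 0;
    hipGetDeviceCount(&ndev);
    if (ndev < 2) { printf("needs at least 2 GPUs\n"); return 0; }

    const int chunk = 1 << 20, nchunks = 8;
    const size_t bytes = size_t(nchunks) * chunk * sizeof(float);

    hipSetDevice(1);
    float* dst = nullptr;
    hipMalloc(&dst, bytes);

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);   // allow direct GPU0 -> GPU1 copies (if supported)
    float* src = nullptr;
    hipMalloc(&src, bytes);
    hipMemset(src, 0, bytes);

    hipStream_t compute, comm;
    hipStreamCreate(&compute);
    hipStreamCreate(&comm);
    hipEvent_t ready[nchunks];
    for (int c = 0; c < nchunks; ++c) hipEventCreate(&ready[c]);

    for (int c = 0; c < nchunks; ++c) {
        float* buf = src + size_t(c) * chunk;
        // 1) Compute on chunk c (stand-in for a real kernel).
        scale<<<(chunk + 255) / 256, 256, 0, compute>>>(buf, chunk, 2.0f);
        hipEventRecord(ready[c], compute);
        // 2) Once chunk c's compute finishes, push it to GPU 1 on the comm
        //    stream; this copy overlaps with chunk c+1's compute above.
        hipStreamWaitEvent(comm, ready[c], 0);
        hipMemcpyPeerAsync(dst + size_t(c) * chunk, 1, buf, 0,
                           size_t(chunk) * sizeof(float), comm);
    }
    hipStreamSynchronize(comm);
    printf("all chunks computed and shipped\n");
    return 0;
}
```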
AI is multi-silicon.
AI is compute hungry, so the @HazyResearch team at @Stanford asked: How do we build AI from the hardware up? How do we lead developers to do what the hardware prefers? This technical deep dive on HipKittens explores how optimized register tiles, wave-level scheduling, and
0
2
22
We had a great time previewing HipKittens at AMD Dev Day a few weeks ago!!
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
7
8
223
HipKittens is here. A new stack of fast, readable AMD GPU kernels built for real performance and real developer velocity. Check it out on the @HazyResearch blog: https://t.co/zoeAB7Ujfs
#AMDevs
0
7
55
When your arXiv submission goes from on hold to accepted [ https://t.co/0Nd01UB1do]. There's a pun at the end of the introduction if anyone spots it
2
3
18
Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW):
48
137
439
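The tweet is cut off right at the definition, so as a rough placeholder only: one natural reading of intelligence per watt divides delivered capability by power draw. This formalization is an assumption; the paper's exact metric may differ.

```latex
% Hypothetical formalization of intelligence per watt (IPW); the paper's
% precise definition may differ.
\[
  \mathrm{IPW}
  \;=\;
  \frac{\text{delivered capability (e.g., accuracy on a benchmark suite)}}
       {\text{average power draw during inference (watts)}}
\]
```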
Stanford used to be an NVIDIA stronghold; it even has a building named after Jensen, the Jen-Hsun Huang Engineering Center. But it seems AMD is starting to gain traction in Stanford's research labs now, with experimental ROCm support in ThunderKittens. NVIDIA will need to
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
14
34
450
Couldn't be prouder of @_williamhu - man really made AMD finally go brrrrrrrrr. Back in our compiler class this winter, Will and I came across a @SemiAnalysis_ article on the gap between AMD vs NVIDIA (MI300X vs H200) performance, even though AMD hardware looks beefier on
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
3
10
148
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
13
97
583
Fast Kernels on AMD hardware!! Great work from @simran_s_arora @_williamhu @Drewwad and team on HipKittens ( https://t.co/6tAvdO1ypA).
github.com: Fast and Furious AMD Kernels (HazyResearch/HipKittens)
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
3
18
151
HipKittens + tinykittens
HipKittens! These kittens will go great with tinykittens. We're working to reproduce their high-speed results on our MI350X with code written in tinygrad UOps that implements the same data access patterns. Thank you for the release @HazyResearch
0
0
11
Check out HipKittens led by @_williamhu @Drewwad @simran_s_arora and the patterns they learned about how to write fast AMD kernels! My favorite nuggets - fine-grained overlapping of compute and I/O with 8- and 4-warp schedules, and better L2 reuse with chiplets. Fun stuff!
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
1
2
16
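To make the "fine-grained overlapping of compute and I/O" idea concrete, here is a minimal HIP double-buffering sketch: while the current shared-memory tile is consumed, the next tile's global-memory load is already in flight. The real 8- and 4-wave HipKittens schedules go much further; this is only the basic pattern, and none of the names below come from the HipKittens codebase.

```cpp
// Minimal HIP sketch of overlapping global-memory loads with compute via
// double-buffered shared memory. Illustrative only; not the HipKittens API.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

constexpr int TILE = 256;  // elements staged per tile (one per thread)

__global__ void tile_sum(const float* __restrict__ in, float* out, int n_tiles) {
    __shared__ float buf[2][TILE];  // one buffer being read, one being filled
    const int tid = threadIdx.x;

    buf[0][tid] = in[tid];          // prefetch tile 0
    __syncthreads();

    float acc = 0.f;
    for (int t = 0; t < n_tiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        // Issue the load of tile t+1 before doing any math on tile t, so the
        // memory traffic overlaps with the compute below.
        if (t + 1 < n_tiles)
            buf[nxt][tid] = in[(t + 1) * TILE + tid];
        acc += buf[cur][tid];       // stand-in for the real compute (MMAs, etc.)
        __syncthreads();            // publish the prefetched tile for the next round
    }
    atomicAdd(out, acc);
}

int main() {
    const int n_tiles = 64, n = n_tiles * TILE;
    std::vector<float> host(n, 1.0f);

    float *in = nullptr, *out = nullptr;
    hipMalloc(&in, n * sizeof(float));
    hipMalloc(&out, sizeof(float));
    hipMemcpy(in, host.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemset(out, 0, sizeof(float));

    tile_sum<<<1, TILE>>>(in, out, n_tiles);

    float result = 0.f;
    hipMemcpy(&result, out, sizeof(float), hipMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);
    return 0;
}
```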
AI's future is multi-silicon. HipKittens shows we can hit SOTA speeds on AMD without sacrificing simplicity. Read the blog -> [ https://t.co/lYrV6QBX6O] [ https://t.co/siiqrU9Hk1] Paper -> [ https://t.co/zcO76asXaC] Code -> [ https://t.co/yUno2y0GPR] Work done with our amazing
1
2
18
HK 8-wave & 4-wave kernels:
- GEMMs: 1.6 PFLOPs (peak)
- GQA non-causal backwards: +1.8x over AITER/CK/SDPA
- Fused dropout-residual-layernorm: +1.5x over torch.compile
Peak performance and readable code!
1
0
14
We reverse-engineered:
- Undocumented shared-memory phase behavior
- Register scheduling limits in HIPCC
- Optimal wave schedules
- Cache-aware grid scheduling
... and baked it all into composable HK primitives.
1
0
19
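A small, self-contained illustration of what "cache-aware grid scheduling" can mean: remap the linear block id into GEMM tile coordinates so that blocks scheduled close together in time share rows and columns of the operands, improving L2 reuse. This grouped launch-order swizzle is a generic technique, not HipKittens' actual primitive; the group size and names are illustrative, and a chiplet-aware version would additionally group tiles by the die they are expected to run on.

```cpp
// Toy "cache-aware grid scheduling": remap a linear block id into (row, col)
// GEMM tile coordinates so that blocks launched close together reuse the same
// rows/columns of A and B (better L2 hit rate). Generic grouped-launch swizzle,
// not a HipKittens primitive; group size and names are illustrative.
#include <cstdio>

struct Tile { int row, col; };

// Walk tiles column-by-column within groups of `group_rows` rows, instead of
// plain row-major order over the whole grid.
Tile swizzle(int linear_id, int grid_rows, int grid_cols, int group_rows) {
    int tiles_per_group = group_rows * grid_cols;
    int group      = linear_id / tiles_per_group;
    int in_group   = linear_id % tiles_per_group;
    int group_base = group * group_rows;
    // The last group may hold fewer than group_rows rows.
    int rows_here  = (grid_rows - group_base < group_rows)
                         ? (grid_rows - group_base) : group_rows;
    Tile t;
    t.row = group_base + in_group % rows_here;
    t.col = in_group / rows_here;
    return t;
}

int main() {
    const int rows = 8, cols = 8, group = 4;
    for (int id = 0; id < rows * cols; ++id) {
        Tile t = swizzle(id, rows, cols, group);
        printf("block %2d -> tile (%d, %d)\n", id, t.row, t.col);
    }
    return 0;
}
```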