William Hu Profile
William Hu

@_williamhu

Followers
534
Following
162
Media
4
Statuses
46

CS @Stanford 🌲

Stanford, CA
Joined July 2022
@_williamhu
William Hu
1 month
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
7
37
161
@OwenDugan
Owen Dugan
24 days
Happy 🦃 Thanksgiving weekend! 🍂 This year, we cooked up a new recipe for juicy fact-storing MLPs. Instead of picking apart trained models, we asked: Can we construct fact-storing MLPs from scratch? 🤔 Spoiler: we can & we figured out how to slot these hand-crafted MLPs into
8
47
336
@simran_s_arora
Simran Arora
1 month
Super excited for ParallelKittens led by @stuart_sul! From the Nvidia A100 to the B200, BF16 tensor core performance improved by 7.2× and High Bandwidth Memory bandwidth by 5.1×, while intra-node communication (NVLink) improved by only 3× and inter-node (PCIe/InfiniBand) by just
@stuart_sul
Stuart Sul
1 month
(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We’re happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new
4
22
194
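A back-of-the-envelope read on the figures quoted above (the inter-node number is cut off in the tweet, so it is omitted): when compute throughput improves faster than interconnect bandwidth, communication takes up a growing share of a multi-GPU kernel's runtime. A tiny illustrative calculation in Python:

# Illustrative only: A100 -> B200 improvement factors quoted in the tweet above.
bf16_compute_gain = 7.2  # BF16 tensor core throughput
hbm_bw_gain = 5.1        # High Bandwidth Memory bandwidth
nvlink_bw_gain = 3.0     # intra-node (NVLink) bandwidth

# If the math gets 7.2x faster but the links feeding it only get 3x faster,
# communication time shrinks far more slowly than compute time, so its
# relative share of the runtime grows.
print(f"compute vs NVLink gap: {bf16_compute_gain / nvlink_bw_gain:.1f}x")  # ~2.4x
print(f"compute vs HBM gap: {bf16_compute_gain / hbm_bw_gain:.1f}x")        # ~1.4x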
@stuart_sul
Stuart Sul
1 month
(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We’re happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new
8
60
516
@_williamhu
William Hu
1 month
👀
@simran_s_arora
Simran Arora
1 month
0
0
14
@__tinygrad__
the tiny corp
1 month
A great deep dive into CDNA4 here.
hazyresearch.stanford.edu
multi silicon ai is coming
1
2
54
@_williamhu
William Hu
1 month
AI is multi-silicon 🚀
@AIatAMD
AI at AMD
1 month
AI is compute hungry, so the @HazyResearch team at @Stanford asked: How do we build AI from the hardware up? How do we lead developers to do what the hardware prefers? This technical deep dive on HipKittens explores how optimized register tiles, wave-level scheduling, and
0
2
22
@simran_s_arora
Simran Arora
1 month
We had a great time previewing HipKittens at AMD Dev Day a few weeks ago!!
@simran_s_arora
Simran Arora
1 month
AI has been built on one vendor’s stack for too long. AMD’s GPUs now offer state-of-the-art peak compute and memory bandwidth — but the lack of mature software / the “CUDA moat” keeps that power locked away. Time to break it and ride into our multi-silicon future. 🌊 It's been a
7
8
223
@AMDGPU_
AMDGPU
1 month
Computer scientists at Stanford University achieve breakthrough AI performance on AMD's MI355X GPUs.
@simran_s_arora
Simran Arora
1 month
The results speak for themselves!!!! We outperform the baseline frameworks averaged across these important AI workloads.
2
11
128
@AIatAMD
AI at AMD
1 month
HipKittens is here. A new stack of fast, readable AMD GPU kernels built for real performance and real developer velocity. Check it out on the @HazyResearch blog: https://t.co/zoeAB7Ujfs #AMDevs
0
7
55
@_williamhu
William Hu
1 month
When your arXiv submission goes from on hold to accepted [ https://t.co/0Nd01UB1do]. There’s a pun at the end of the introduction if anyone spots it 👀
2
3
18
@JonSaadFalcon
Jon Saad-Falcon
1 month
Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW):
48
137
439
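The tweet only names the metric, so here is a minimal sketch of what an intelligence-per-watt style number could look like in practice; the accuracy-score and power-measurement details are assumptions, not the paper's definition.

# Hypothetical sketch of an intelligence-per-watt (IPW) style metric:
# a capability score divided by average power draw during evaluation.
# The paper's exact definition may differ.
def intelligence_per_watt(task_accuracy: float, avg_power_watts: float) -> float:
    """Capability score (e.g. benchmark accuracy in [0, 1]) per watt consumed."""
    if avg_power_watts <= 0:
        raise ValueError("power must be positive")
    return task_accuracy / avg_power_watts

# Example: a local model at 0.62 accuracy on 35 W vs a datacenter model at 0.80 on 700 W.
print(intelligence_per_watt(0.62, 35.0))   # ~0.0177
print(intelligence_per_watt(0.80, 700.0))  # ~0.0011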
@SemiAnalysis_
SemiAnalysis
1 month
Stanford used to be an NVIDIA stronghold; it even has a building named after Jensen, the Jen-Hsun Huang Engineering Center. But it seems AMD is starting to gain traction in Stanford’s research labs now, with experimental ROCm support in ThunderKittens. NVIDIA will need to
@simran_s_arora
Simran Arora
1 month
AI has been built on one vendor’s stack for too long. AMD’s GPUs now offer state-of-the-art peak compute and memory bandwidth — but the lack of mature software / the “CUDA moat” keeps that power locked away. Time to break it and ride into our multi-silicon future. 🌊 It's been a
14
34
450
@simonguozirui
Simon Guo
1 month
Couldn’t be prouder of @_williamhu — man really made AMD finally go brrrrrrrrr 🔥 Back in our compiler class this winter, Will and I came across a @SemiAnalysis_ article on the gap between AMD vs NVIDIA (MI300X vs H200) performance, even though AMD hardware looks beefier on
@_williamhu
William Hu
1 month
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
3
10
148
@simran_s_arora
Simran Arora
1 month
AI has been built on one vendor’s stack for too long. AMD’s GPUs now offer state-of-the-art peak compute and memory bandwidth — but the lack of mature software / the “CUDA moat” keeps that power locked away. Time to break it and ride into our multi-silicon future. 🌊 It's been a
13
97
583
@AnushElangovan
Anush Elangovan
1 month
Fast Kernels on AMD hardware!! Great work from @simran_s_arora @_williamhu @Drewwad and team on HipKittens ( https://t.co/6tAvdO1ypA).
github.com
Fast and Furious AMD Kernels.
@simran_s_arora
Simran Arora
1 month
AI has been built on one vendor’s stack for too long. AMD’s GPUs now offer state-of-the-art peak compute and memory bandwidth — but the lack of mature software / the “CUDA moat” keeps that power locked away. Time to break it and ride into our multi-silicon future. 🌊 It's been a
3
18
151
@_williamhu
William Hu
1 month
HipKittens 🤝 tinykittens
@__tinygrad__
the tiny corp
1 month
HipKittens! These kittens will go great with tinykittens. We're working to reproduce their high speed results on our MI350X with code written in tinygrad UOps that implements the same data access patterns. Thank you for the release @HazyResearch
0
0
11
@realDanFu
Dan Fu
1 month
Check out HipKittens led by @_williamhu @Drewwad @simran_s_arora and the patterns they learned about how to write fast AMD kernels! My favorite nuggets - fine-grained overlapping of compute and I/O with 8- and 4-warp schedules, and better L2 reuse with chiplets. Fun stuff!
@simran_s_arora
Simran Arora
1 month
AI has been built on one vendor’s stack for too long. AMD’s GPUs now offer state-of-the-art peak compute and memory bandwidth — but the lack of mature software / the “CUDA moat” keeps that power locked away. Time to break it and ride into our multi-silicon future. 🌊 It's been a
1
2
16
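Not HipKittens code, but a minimal sketch (in Python, for readability) of the double-buffered structure behind "fine-grained overlapping of compute and I/O": while one tile is being consumed by the matrix units, the next tile's loads are already in flight. On the GPU the load and compute stages are issued by different waves and run concurrently; this sequential version only shows the schedule's ordering and buffer reuse.

import numpy as np

def tiled_matmul(A, B, tile=64):
    # Double-buffered (ping-pong) schedule over the K-tiles of A; B tiles are
    # loaded directly for brevity. Only the structure is meaningful here.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            buf = [A[i:i+tile, 0:tile].copy(), None]  # prefetch the first tile
            cur = 0
            for k in range(0, K, tile):
                nxt = 1 - cur
                if k + tile < K:
                    # I/O stage: fetch the *next* A tile while the current one is consumed
                    buf[nxt] = A[i:i+tile, k+tile:k+2*tile].copy()
                # compute stage: consume the tile loaded one iteration earlier
                C[i:i+tile, j:j+tile] += buf[cur] @ B[k:k+tile, j:j+tile]
                cur = nxt
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)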
@_williamhu
William Hu
1 month
AI’s future is multi-silicon. HipKittens shows we can hit SoTA speeds on AMD without sacrificing simplicity.
🐈‍⬛ Read the blog -> [ https://t.co/lYrV6QBX6O] [ https://t.co/siiqrU9Hk1]
📄 Paper -> [ https://t.co/zcO76asXaC]
💻 Code -> [ https://t.co/yUno2y0GPR]
Work done with our amazing
1
2
18
@_williamhu
William Hu
1 month
📈 HK 8-wave & 4-wave kernels:
- GEMMs: 1.6 PFLOPs (peak)
- GQA non-causal backwards: +1.8x over AITER/CK/SDPA
- Fused dropout-residual-layernorm: +1.5x over torch.compile
Peak performance and readable code!
1
0
14
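For reference, an unfused PyTorch version of the dropout-residual-layernorm pattern mentioned above; the exact operator order inside the HK kernel is an assumption here, this is just the common dropout -> residual add -> LayerNorm composition that such fusions target, computed in one pass by a fused kernel instead of three separate ops round-tripping activations through HBM.

import torch
import torch.nn.functional as F

def dropout_residual_layernorm(x, residual, weight, bias, p=0.1, training=True):
    h = F.dropout(x, p=p, training=training)            # 1) dropout on the branch
    h = h + residual                                     # 2) residual add
    return F.layer_norm(h, h.shape[-1:], weight, bias)   # 3) LayerNorm over the hidden dim

d = 4096
x, res = torch.randn(8, 1024, d), torch.randn(8, 1024, d)
w, b = torch.ones(d), torch.zeros(d)
print(dropout_residual_layernorm(x, res, w, b, training=False).shape)  # torch.Size([8, 1024, 4096])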
@_williamhu
William Hu
1 month
We reverse-engineered:
- Undocumented shared-memory phase behavior
- Register scheduling limits in HIPCC
- Optimal wave schedules
- Cache-aware grid scheduling
… and 🧑‍🍳 baked it all into composable HK primitives.
1
0
19
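Of the items above, cache-aware grid scheduling is the easiest to show in isolation: instead of launching output tiles in plain row-major order, blocks are remapped into small groups so that blocks adjacent in launch order reuse the same input slices while those slices are still hot in L2 (on chiplet-based MI3xx parts, ideally the L2 of the same chiplet). The mapping below is the generic grouped-ordering idea, not the actual HipKittens scheme.

def grouped_order(num_m_tiles: int, num_n_tiles: int, group_m: int):
    # Visit output tiles in tall, narrow groups of group_m rows so that
    # consecutive blocks share A-row and B-column tiles (better L2 reuse).
    order = []
    for group_start in range(0, num_m_tiles, group_m):
        rows = range(group_start, min(group_start + group_m, num_m_tiles))
        for n in range(num_n_tiles):
            for m in rows:
                order.append((m, n))
    return order

# Row-major order would touch 8 distinct A-row tiles before any reuse;
# the grouped order cycles through only 2 at a time.
print(grouped_order(8, 8, group_m=2)[:8])
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]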