William Hu
@_williamhu
Followers
534
Following
162
Media
4
Statuses
46
CS @Stanford
Stanford, CA
Joined July 2022
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
7
37
161
Happy Thanksgiving weekend! This year, we cooked up a new recipe for juicy fact-storing MLPs. Instead of picking apart trained models, we asked: Can we construct fact-storing MLPs from scratch? Spoiler: we can & we figured out how to slot these hand-crafted MLPs into
8
47
336
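For readers wondering what building a fact-storing MLP by hand could look like mechanically, one classical key-value construction (a hedged sketch under generic assumptions, not necessarily the recipe behind the thread above) treats the hidden layer as a key detector and the output layer as a value table:

```latex
% Illustrative hand-built key-value MLP: store facts (k_i, v_i) so that
% f(k_i) is approximately v_i. Notation here is generic, not the thread's.
\[
  f(x) \;=\; W_{\mathrm{out}}\,\sigma\!\left(W_{\mathrm{in}} x + b\right),
  \qquad
  W_{\mathrm{in}} = \begin{bmatrix} k_1^{\top} \\ \vdots \\ k_N^{\top} \end{bmatrix},
  \qquad
  W_{\mathrm{out}} = \begin{bmatrix} v_1 & \cdots & v_N \end{bmatrix}.
\]
% With near-orthogonal keys and the bias b chosen so that only the matching
% row clears the nonlinearity, \sigma(W_in k_i + b) is close to the basis
% vector e_i, hence f(k_i) is close to W_out e_i = v_i: fact i lives in the
% i-th row of W_in and the i-th column of W_out.
```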
Super excited for ParallelKittens led by @stuart_sul! From the NVIDIA A100 to the B200, BF16 tensor core performance improved by 7.2x and High Bandwidth Memory bandwidth by 5.1x, while intra-node communication (NVLink) improved by only 3x and inter-node (PCIe/InfiniBand) by just
(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We're happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new
4
22
194
(1/6) GPU networking is the remaining AI efficiency bottleneck, and the underlying hardware is changing fast! We're happy to release ParallelKittens, an update to ThunderKittens that lets you easily write fast computation-communication overlapped multi-GPU kernels, along with new
8
60
516
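As an illustration of what a computation-communication overlapped multi-GPU setup looks like at its most basic, here is a plain HIP streams-and-events sketch: GPU 0 computes on the next chunk while the previous one is still being copied to GPU 1. This is a generic pattern, not the ParallelKittens API; every name and size below is made up for the example.

```cpp
// Toy HIP sketch of computation-communication overlap across two GPUs:
// while chunk c is being copied to GPU 1 on a "comm" stream, chunk c+1 is
// already being computed on GPU 0's "compute" stream. NOT the ParallelKittens
// API; names and sizes are illustrative, and error checks are omitted.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    int ndev = 0;
    hipGetDeviceCount(&ndev);
    if (ndev < 2) { printf("needs at least 2 GPUs\n"); return 0; }

    const int chunk = 1 << 20, nchunks = 8;
    const size_t bytes = size_t(nchunks) * chunk * sizeof(float);

    hipSetDevice(1);
    float* dst = nullptr;
    hipMalloc(&dst, bytes);

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);   // allow direct GPU0 -> GPU1 copies (if supported)
    float* src = nullptr;
    hipMalloc(&src, bytes);
    hipMemset(src, 0, bytes);

    hipStream_t compute, comm;
    hipStreamCreate(&compute);
    hipStreamCreate(&comm);
    hipEvent_t ready[nchunks];
    for (int c = 0; c < nchunks; ++c) hipEventCreate(&ready[c]);

    for (int c = 0; c < nchunks; ++c) {
        float* buf = src + size_t(c) * chunk;
        // 1) Compute on chunk c (stand-in for a real kernel).
        scale<<<(chunk + 255) / 256, 256, 0, compute>>>(buf, chunk, 2.0f);
        hipEventRecord(ready[c], compute);
        // 2) Once chunk c's compute finishes, push it to GPU 1 on the comm
        //    stream; this copy overlaps with chunk c+1's compute above.
        hipStreamWaitEvent(comm, ready[c], 0);
        hipMemcpyPeerAsync(dst + size_t(c) * chunk, 1, buf, 0,
                           size_t(chunk) * sizeof(float), comm);
    }
    hipStreamSynchronize(comm);
    printf("all chunks computed and shipped\n");
    return 0;
}
```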
AI is multi-silicon.
AI is compute hungry, so the @HazyResearch team at @Stanford asked: How do we build AI from the hardware up? How do we lead developers to do what the hardware prefers? This technical deep dive on HipKittens explores how optimized register tiles, wave-level scheduling, and
0
2
22
We had a great time previewing HipKittens at AMD Dev Day a few weeks ago!!
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
7
8
223
HipKittens is here. A new stack of fast, readable AMD GPU kernels built for real performance and real developer velocity. Check it out on the @HazyResearch blog: https://t.co/zoeAB7Ujfs
#AMDevs
0
7
55
When your arXiv submission goes from on hold to accepted [ https://t.co/0Nd01UB1do]. There's a pun at the end of the introduction if anyone spots it
2
3
18
Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW):
48
137
439
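The tweet is cut off right at the definition, so as a rough placeholder only: one natural reading of intelligence per watt divides delivered capability by power draw. This formalization is an assumption; the paper's exact metric may differ.

```latex
% Hypothetical formalization of intelligence per watt (IPW); the paper's
% precise definition may differ.
\[
  \mathrm{IPW}
  \;=\;
  \frac{\text{delivered capability (e.g., accuracy on a benchmark suite)}}
       {\text{average power draw during inference (watts)}}
\]
```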
Stanford used to be an NVIDIA stronghold; it even has a building named after Jensen, the Jen-Hsun Huang Engineering Center. But it seems AMD is starting to gain traction in Stanford's research labs now, with experimental ROCm support in ThunderKittens. NVIDIA will need to
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
14
34
450
Couldn't be prouder of @_williamhu - man really made AMD finally go brrrrrrrrr. Back in our compiler class this winter, Will and I came across a @SemiAnalysis_ article on the gap between AMD vs NVIDIA (MI300X vs H200) performance, even though AMD hardware looks beefier on
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
3
10
148
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
13
97
583
Fast Kernels on AMD hardware!! Great work from @simran_s_arora @_williamhu @Drewwad and team on HipKittens ( https://t.co/6tAvdO1ypA).
github.com: Fast and Furious AMD Kernels (HazyResearch/HipKittens)
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
3
18
151
HipKittens + tinykittens
HipKittens! These kittens will go great with tinykittens. We're working to reproduce their high-speed results on our MI350X with code written in tinygrad UOps that implements the same data access patterns. Thank you for the release @HazyResearch
0
0
11
Check out HipKittens led by @_williamhu @Drewwad @simran_s_arora and the patterns they learned about how to write fast AMD kernels! My favorite nuggets - fine-grained overlapping of compute and I/O with 8- and 4-warp schedules, and better L2 reuse with chiplets. Fun stuff!
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software / the "CUDA moat" keeps that power locked away. Time to break it and ride into our multi-silicon future. It's been a
1
2
16
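To make the "fine-grained overlapping of compute and I/O" idea concrete, here is a minimal HIP double-buffering sketch: while the current shared-memory tile is consumed, the next tile's global-memory load is already in flight. The real 8- and 4-wave HipKittens schedules go much further; this is only the basic pattern, and none of the names below come from the HipKittens codebase.

```cpp
// Minimal HIP sketch of overlapping global-memory loads with compute via
// double-buffered shared memory. Illustrative only; not the HipKittens API.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

constexpr int TILE = 256;  // elements staged per tile (one per thread)

__global__ void tile_sum(const float* __restrict__ in, float* out, int n_tiles) {
    __shared__ float buf[2][TILE];  // one buffer being read, one being filled
    const int tid = threadIdx.x;

    buf[0][tid] = in[tid];          // prefetch tile 0
    __syncthreads();

    float acc = 0.f;
    for (int t = 0; t < n_tiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        // Issue the load of tile t+1 before doing any math on tile t, so the
        // memory traffic overlaps with the compute below.
        if (t + 1 < n_tiles)
            buf[nxt][tid] = in[(t + 1) * TILE + tid];
        acc += buf[cur][tid];       // stand-in for the real compute (MMAs, etc.)
        __syncthreads();            // publish the prefetched tile for the next round
    }
    atomicAdd(out, acc);
}

int main() {
    const int n_tiles = 64, n = n_tiles * TILE;
    std::vector<float> host(n, 1.0f);

    float *in = nullptr, *out = nullptr;
    hipMalloc(&in, n * sizeof(float));
    hipMalloc(&out, sizeof(float));
    hipMemcpy(in, host.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemset(out, 0, sizeof(float));

    tile_sum<<<1, TILE>>>(in, out, n_tiles);

    float result = 0.f;
    hipMemcpy(&result, out, sizeof(float), hipMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);
    return 0;
}
```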
AI's future is multi-silicon. HipKittens shows we can hit SOTA speeds on AMD without sacrificing simplicity. Read the blog -> [ https://t.co/lYrV6QBX6O] [ https://t.co/siiqrU9Hk1] Paper -> [ https://t.co/zcO76asXaC] Code -> [ https://t.co/yUno2y0GPR] Work done with our amazing
1
2
18
HK 8-wave & 4-wave kernels:
- GEMMs: 1.6 PFLOPs (peak)
- GQA non-causal backwards: +1.8x over AITER/CK/SDPA
- Fused dropout-residual-layernorm: +1.5x over torch.compile
Peak performance and readable code!
1
0
14
We reverse-engineered:
- Undocumented shared-memory phase behavior
- Register scheduling limits in HIPCC
- Optimal wave schedules
- Cache-aware grid scheduling
... and baked it all into composable HK primitives.
1
0
19
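A small, self-contained illustration of what "cache-aware grid scheduling" can mean: remap the linear block id into GEMM tile coordinates so that blocks scheduled close together in time share rows and columns of the operands, improving L2 reuse. This grouped launch-order swizzle is a generic technique, not HipKittens' actual primitive; the group size and names are illustrative, and a chiplet-aware version would additionally group tiles by the die they are expected to run on.

```cpp
// Toy "cache-aware grid scheduling": remap a linear block id into (row, col)
// GEMM tile coordinates so that blocks launched close together reuse the same
// rows/columns of A and B (better L2 hit rate). Generic grouped-launch swizzle,
// not a HipKittens primitive; group size and names are illustrative.
#include <cstdio>

struct Tile { int row, col; };

// Walk tiles column-by-column within groups of `group_rows` rows, instead of
// plain row-major order over the whole grid.
Tile swizzle(int linear_id, int grid_rows, int grid_cols, int group_rows) {
    int tiles_per_group = group_rows * grid_cols;
    int group      = linear_id / tiles_per_group;
    int in_group   = linear_id % tiles_per_group;
    int group_base = group * group_rows;
    // The last group may hold fewer than group_rows rows.
    int rows_here  = (grid_rows - group_base < group_rows)
                         ? (grid_rows - group_base) : group_rows;
    Tile t;
    t.row = group_base + in_group % rows_here;
    t.col = in_group / rows_here;
    return t;
}

int main() {
    const int rows = 8, cols = 8, group = 4;
    for (int id = 0; id < rows * cols; ++id) {
        Tile t = swizzle(id, rows, cols, group);
        printf("block %2d -> tile (%d, %d)\n", id, t.row, t.col);
    }
    return 0;
}
```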