Horace He Profile
Horace He

@cHHillee

Followers
23,561
Following
449
Media
325
Statuses
2,383

Working at the intersection of ML and Systems @ PyTorch "My learning style is Horace twitter threads" - @typedfemale

chhillee
Joined February 2010
Pinned Tweet
@cHHillee
Horace He
5 months
Happy to OSS gpt-fast, a fast and hackable implementation of transformer inference in <1000 lines of native PyTorch with support for quantization, speculative decoding, TP, Nvidia/AMD support, and more! Code: Blog: (1/12)
47
1K
2K
@cHHillee
Horace He
1 year
I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces. Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems. This strongly points to contamination. 1/4
Tweet media one
Tweet media two
@cHHillee
Horace He
1 year
How is it even … possible to have a codeforces rating of 392? That’s very low. Like, my understanding was as long as you participated in a couple of contests (regardless of how you did), you'd have a rating above 392.
14
21
255
81
697
4K
@cHHillee
Horace He
2 years
Everybody wants their models to run faster. However, researchers often cargo-cult performance without a solid understanding of the underlying principles. To address that, I wrote a post called "Making Deep Learning Go Brrrr From First Principles". (1/3)
Tweet media one
28
404
2K
@cHHillee
Horace He
1 year
Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster. But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood! Here's a plot of FLOPs achieved for square matmuls. Let's explain each curve! 1/19
Tweet media one
@karpathy
Andrej Karpathy
1 year
The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.
86
360
5K
20
278
2K
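A minimal sketch of the effect described above, assuming a CUDA GPU and nanoGPT-ish shapes (the sizes and dtype are illustrative, not taken from the thread): time the same matmul with the "vocab" dimension at 50257 vs. 50304 and compare achieved TFLOPS.

```python
# Hedged sketch: measure the effect of padding one matmul dimension to a multiple
# of 64, as in the nanoGPT vocab-size change. Exact speedups depend on the GPU,
# dtype, and cuBLAS version.
import torch

def bench_matmul(m, k, n, dtype=torch.float16, iters=50):
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warmup
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    tflops = 2 * m * k * n / (ms * 1e-3) / 1e12
    return ms, tflops

for n in (50257, 50304):                     # "vocab" dim: odd size vs. multiple of 64
    ms, tflops = bench_matmul(8192, 768, n)
    print(f"n={n}: {ms:.3f} ms, {tflops:.1f} TFLOPS")
```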
@cHHillee
Horace He
2 years
Why is OpenAI's new compiler, Triton, so exciting? And what distinguishes it from other efforts to provide a Python DSL for programming Nvidia GPUs, like Numba? To answer that, we need to look at the operation behind all of deep learning - matrix multiplication. (1/7)
Tweet media one
15
227
2K
@cHHillee
Horace He
1 year
Eager mode was what made PyTorch successful. So why did we feel the need to depart from eager mode in PyTorch 2.0? Answer: it's the damn hardware! Let's tell a story about how the assumptions PyTorch was built on became untrue, and why PyTorch needed to evolve. (1/10)
18
203
1K
@cHHillee
Horace He
2 years
Ever since the V100, Nvidia has been cramming more and more "tensor cores" into each GPU generation. But what *are* tensor cores? How can you use them to accelerate deep learning models by >10x? And ... why does their existence make me somewhat sad :( (1/9)
Tweet media one
13
190
1K
@cHHillee
Horace He
1 year
I'm going to start posting a series of PyTorch 2.0 benchmarks demonstrating what kinds of things we speed up as well as exploring a variety of ML compiler optimizations! For the first one, let's talk about good old operator fusion - the workhorse of all ML compilers. (1/8)
14
121
1K
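As a rough illustration of the fusion idea (my own example, not code from the thread), here is a chain of pointwise ops timed in eager mode vs. under torch.compile; the tensor size and the exact op chain are arbitrary.

```python
# Hedged sketch of what operator fusion buys: in eager mode each pointwise op below
# is its own kernel that reads and writes the whole tensor; a compiler can fuse the
# chain into one kernel that touches memory once. Requires a CUDA GPU.
import torch
from torch.utils.benchmark import Timer

def chain(x):
    return (x.cos().sin() * 2.0 + 1.0).relu()

compiled_chain = torch.compile(chain)

x = torch.randn(2**26, device="cuda")
compiled_chain(x)   # warm up / trigger compilation

for name, fn in [("eager", chain), ("torch.compile", compiled_chain)]:
    t = Timer("fn(x)", globals={"fn": fn, "x": x}).timeit(100)
    print(f"{name}: {t.median * 1e3:.2f} ms")
```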
@cHHillee
Horace He
1 year
Let's talk about a detail that occurs during PyTorch 2.0's codegen - tiling. In many cases, tiling is needed to generate efficient kernels. Even for something as basic as torch.add(A, B), you might need tiling to be efficient! But what is tiling? And when is it needed? (1/13)
Tweet media one
7
138
904
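A small sketch of why even torch.add is layout-sensitive (my own illustration, not from the thread): adding a transposed, non-contiguous tensor forces strided memory access, which is exactly the kind of situation tiling in the generated kernel has to handle.

```python
# Hedged sketch: even torch.add(A, B) cares about layout. Adding a transposed
# (non-contiguous) view forces strided reads; tiling is one way a codegen backend
# keeps those reads reasonably coalesced. Sizes are illustrative (~256 MB each).
import torch
from torch.utils.benchmark import Timer

A = torch.randn(8192, 8192, device="cuda")
B = torch.randn(8192, 8192, device="cuda")

t_contig = Timer("A + B", globals={"A": A, "B": B}).timeit(100)
t_strided = Timer("A + B.t()", globals={"A": A, "B": B}).timeit(100)
print(f"contiguous add:            {t_contig.median * 1e3:.2f} ms")
print(f"add with transposed input: {t_strided.median * 1e3:.2f} ms")
```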
@cHHillee
Horace He
7 months
It’s wild that someone can just have a ChatGPT wrapper, pay Apple for ads to the “ChatGPT” keyword, show up above OpenAI on the AppStore, and build a massively successful app off of that.
Tweet media one
42
16
834
@cHHillee
Horace He
1 year
Another thing PyTorch 2.0 helps speed up - overhead. Overhead is everything other than the GPU doing work. It can come from Python, the ML framework, CUDA kernel launches, etc. - regardless, it's why your nvidia-smi util is so low! So... how do we diagnose and resolve it? (1/8)
Tweet media one
12
136
819
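A quick, hedged way to check whether you're overhead-bound (my illustration, not from the thread): run the same pointwise op on a tiny and a huge tensor. If the per-call times are similar, the GPU is mostly idle and overhead dominates.

```python
# Hedged sketch: diagnosing overhead. If a 1-element tensor and a large tensor take
# similar wall-clock time per op, you are bound by Python/framework/kernel-launch
# overhead rather than compute or memory bandwidth.
import torch
from torch.utils.benchmark import Timer

for numel in (1, 2**20, 2**26):
    x = torch.randn(numel, device="cuda")
    t = Timer("x * 2 + 1", globals={"x": x}).timeit(1000)
    print(f"numel={numel:>9}: {t.median * 1e6:.1f} us per call")
```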
@cHHillee
Horace He
10 months
Fun fact: PyTorch's codebase recently flipped back to having more Python than C++, for the first time since nearly its inception back in 2017! There's still plenty of "borgification" in PyTorch, but recent PyTorch 2.0 has allowed us to do much more in Python :) (1/4)
Tweet media one
7
74
760
@cHHillee
Horace He
5 months
I reverse-engineered AlphaCode2's submission history and manually performed the Codeforces evals. I'm ... again concerned that data leakage is affecting the results. For the DP problem highlighted in the AlphaCode2 release, look at AC2's solution vs. the tutorial. (1/5)
Tweet media one
14
70
759
@cHHillee
Horace He
2 months
Tweet media one
Tweet media two
10
43
717
@cHHillee
Horace He
3 years
A very ... interesting 4 page paper at ICLR. I'm curious to see the reviewers' reactions.
Tweet media one
19
93
656
@cHHillee
Horace He
2 years
I've found it unexpectedly useful to memorize facts about systems I work with. Knowing these numbers allows one to 1. sanity check performance, 2. sketch out feasibility of technical solutions, and 3. reason about performance characteristics. Some examples below: (1/7)
7
64
649
@cHHillee
Horace He
3 years
Haven't seen anybody else mention this, but Huawei just announced they trained a 200 BILLION parameter transformer model - PanGu-α. This is bigger than GPT-3, but trained only for 40B tokens. Moreover, it's trained on an entirely Chinese stack: Huawei chips and the MindSpore framework. 1/2
Tweet media one
Tweet media two
8
138
603
@cHHillee
Horace He
2 years
5 different ways of writing a matmul in PyTorch.
Tweet media one
20
47
542
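The attached image isn't reproduced here, so below is a hedged reconstruction of five equivalent ways to write C = A @ B in PyTorch; these are common choices, not necessarily the exact five from the screenshot.

```python
# Hedged reconstruction: five equivalent spellings of a matmul in PyTorch.
import torch

A = torch.randn(128, 64)
B = torch.randn(64, 32)

c1 = A @ B
c2 = torch.mm(A, B)
c3 = torch.matmul(A, B)
c4 = torch.einsum("ik,kj->ij", A, B)
c5 = (A.unsqueeze(-1) * B.unsqueeze(0)).sum(dim=1)   # broadcast-and-reduce

assert all(torch.allclose(c1, c, atol=1e-5) for c in (c2, c3, c4, c5))
```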
@cHHillee
Horace He
3 months
@alicemazzy One funny downstream effect of this is that a really high rating on Google maps (4+) is often anti-correlated with quality, since it indicates that it's primarily tourists rating the place as opposed to locals.
5
5
539
@cHHillee
Horace He
5 years
Finally managed to release this data! Some highlights: From CVPR 2018-2019, PyTorch has grown from 82 -> 280 papers, while TensorFlow has gone from 116 -> 125 papers. For ACL PyTorch: 26 -> 103 TF: 34 -> 33 The trend continues at all the major research conferences.
@gradientpub
The Gradient
5 years
The war between ML frameworks has raged on since the rebirth of deep learning. Who is winning? @cHHillee 's data analysis shows clear trends: PyTorch is winning dramatically among researchers, while Tensorflow still dominates industry. #PyTorch #Tensorflow
12
281
589
7
169
539
@cHHillee
Horace He
1 year
So... the fact that it solves 10/10 problems from pre-2021 and 0/10 of the most recent problems (which it has never seen before) is very suspicious. Considering the codeforces results in the paper (very poor!), they might have only evaluated it on recent problems. 3/4
2
28
512
@cHHillee
Horace He
1 year
However, it does make me wonder how much of the performance on other tests *is* due to contamination! Of course, on some tests the line between "contamination" and "knowing the material" is very blurred, but competitive programming isn't one of them. 4/4
2
12
454
@cHHillee
Horace He
19 days
After some intense plotting (and MS Paint lines), I'd like to share the updated performance projections from our top minds.
Tweet media one
@karpathy
Andrej Karpathy
19 days
THE REVENGE OF PYTORCH just kidding :) @cHHillee (from PyTorch team) was kindly able to help improve the PyTorch baseline, done by 1) upgrading to nightly, 2) using the "compound" F.sdpa (scaled dot product attention) layer directly, and turning on a torch compile flag:…
34
45
1K
9
19
438
@cHHillee
Horace He
5 months
As mentioned previously, I found AlphaCode2 accounts, and through stalking their submission history, I manually performed the AlphaCode2 Codeforces evals. Overall, very impressive! I arrive at a rating of ~1650, which is the 85-90th percentile of CF users. (1/19)
Tweet media one
Tweet media two
11
62
413
@cHHillee
Horace He
5 months
I’m at Neurips 2023 all week. Happy to talk with anyone about PyTorch, ML compilers, LLM inference, etc. I’d especially encourage folks to reach out if you’re just starting to get into ML systems.
Tweet media one
13
6
403
@cHHillee
Horace He
5 months
Two additions to gpt-fast this week. The first one is an optimization to tensor-parallelism added by @foofoobuggy which improves our TP perf by 20-50%. This gives us 200 => 330 tok/s for Llama-7B fp16 and 64 => 91 tok/s for Llama-70B int4 with *no* speculative decoding. (1/4)
Tweet media one
Tweet media two
9
47
400
@cHHillee
Horace He
4 years
When I published my PyTorch vs TensorFlow article, some people raised questions about whether it applied to non-NLP conferences. With NeurIPS posting all their papers, the answer is clear! Pytorch: 68 -> 166 papers Tensorflow: 91 -> 74 papers
Tweet media one
Tweet media two
5
96
390
@cHHillee
Horace He
5 months
Issues with understanding my memory usage (or finding memory leaks) used to be one of my biggest frustrations in PyTorch. The recent suite of memory profiler/visualization tools from @Zachary_DeVito and Aaron Shi have completely resolved that for me. A testimonial here (1/3)
Tweet media one
@PyTorch
PyTorch
5 months
Understanding GPU Memory 1: Visualizing All Allocations over Time 👀 In part 1 of this series, we show how to use Memory Snapshot, the Memory Profiler, and the Reference Cycle Detector to debug out of memory errors and improve memory usage. Read more:
Tweet media one
1
192
1K
2
42
371
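A minimal sketch of the snapshot workflow from the linked post. The underscore-prefixed helpers are the ones the PyTorch blog documents; exact signatures may vary across versions, and the model here is a stand-in.

```python
# Hedged sketch of the memory-snapshot workflow: record allocation history, run the
# workload, dump a snapshot, and drag the pickle into https://pytorch.org/memory_viz
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(4096, 4096).cuda()           # illustrative workload
for _ in range(5):
    out = model(torch.randn(64, 4096, device="cuda"))
    out.sum().backward()

torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)   # stop recording
```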
@cHHillee
Horace He
1 month
It's somehow incredibly hard to get actual specs of the new Nvidia GPUs, between all the B100/B200/GB200/sparse/fp4 numbers floating around. @tri_dao linked this doc which thankfully has all the numbers in a table:
Tweet media one
6
48
319
@cHHillee
Horace He
2 years
For my future reference (and since this was remarkably difficult for me to search), here's a minimal example of floating-point nondeterminism in PyTorch on GPUs. The underlying cause is that floating-point additions aren't associative, and scatter is likely using atomic adds.
Tweet media one
6
34
311
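A hedged reconstruction of the kind of minimal repro shown in the attached image (the exact snippet isn't reproduced here): a scatter-style reduction on GPU uses atomic float adds, and since float addition isn't associative, results can differ between runs.

```python
# Hedged sketch: scatter_add_ on CUDA is documented as nondeterministic because the
# atomic adds can land in a different order each run, and float addition is not
# associative. The difference printed below may be nonzero.
import torch

torch.manual_seed(0)
src = torch.randn(1_000_000, device="cuda")
index = torch.randint(0, 10, (1_000_000,), device="cuda")

def scatter_sum():
    out = torch.zeros(10, device="cuda")
    return out.scatter_add_(0, index, src)

a, b = scatter_sum(), scatter_sum()
print((a - b).abs().max())   # same inputs, possibly different rounding order
```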
@cHHillee
Horace He
5 months
PSA for anybody writing Triton kernels. Use `triton.jit(interpret=True)` for debugging! It allows you to inspect what's going on inside your Triton kernel using regular Python (so print statements, breakpoints, etc.)
Tweet media one
Tweet media two
3
26
296
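A minimal sketch built around the tip above. The interpret=True flag is what the tweet shows for Triton of that era; newer Triton versions expose the interpreter via the TRITON_INTERPRET=1 environment variable instead, so treat the exact knob as version-dependent.

```python
# Hedged sketch of the debugging tip: run a Triton kernel under the interpreter so
# ordinary Python print/breakpoints work inside it.
import torch
import triton
import triton.language as tl

@triton.jit(interpret=True)          # version-dependent; newer: TRITON_INTERPRET=1
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    print("block", pid, x)           # plain Python print works under the interpreter
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(1024 // 256,)](x, y, out, 1024, BLOCK=256)
```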
@cHHillee
Horace He
2 months
Announcing it 2 months after the work was done, but gpt-fast now supports Mixtral + MoE models! Featuring: - faster decoding than any (non-Groq) API endpoint, at up to 220 tok/s/user. - no custom kernels - int8/TP - still simple! How do we do it? Well, torch.compile :) (1/5)
Tweet media one
2
34
297
@cHHillee
Horace He
1 year
800-rated problems are the easiest problems on Codeforces, and are determined automatically based off of the ratings of the people solving them during the contest. Thus, I would expect that these problems are roughly of "equal" difficulty, and my spot check would agree. 2/4
1
11
284
@cHHillee
Horace He
1 year
I finally found Nvidia GPU FLOPs across generations in one table! These are always such a pain to search up.
Tweet media one
6
35
267
@cHHillee
Horace He
1 year
PS: I am *not* saying that these models don't know anything or that they don't have any "understanding". But it is indicative to me that performance on competitive programming problems is heavily susceptible to contamination.
3
5
266
@cHHillee
Horace He
1 year
How is it even … possible to have a codeforces rating of 392? That’s very low. Like, my understanding was as long as you participated in a couple of contests (regardless of how you did), you'd have a rating above 392.
@OpenAI
OpenAI
1 year
Announcing GPT-4, a large multimodal model, with our best-ever results on capabilities and alignment:
2K
18K
64K
14
21
255
@cHHillee
Horace He
2 months
Before people sell all their GPUs to go buy Groq hardware, I'd recommend answering two questions: 1. What is the cost of the system you're purchasing? 2. How many users can you serve at 500 tok/s+? Hint: Very high, and not many
16
14
260
@cHHillee
Horace He
2 years
In my experience, folks often underestimate how cool vmap is. People often think "I already know how to batch my model, why do I need vmap?" This is like saying "I already know how to compute derivatives, why do I need autograd?" (1/4)
@PyTorch
PyTorch
2 years
Today we’d like to highlight features from functorch, a beta PyTorch library that provides JAX-inspired function transformations like vmap. () If you’re not sure what sort of cool new things vmap allows you to do, read on to learn more! (1/n)
6
133
715
2
17
254
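A small sketch of the vmap pitch (shapes and functions are my own toy example): write the single-example computation once and let vmap add the batch dimension, then compose it with grad for per-sample gradients. The API lived in functorch at the time of the tweet and now lives under torch.func.

```python
# Hedged sketch: vmap turns a per-example function into a batched one, and composes
# with other transforms like grad.
import torch
from torch.func import vmap, grad

weights = torch.randn(10, 5)

def predict(x):                      # written for a single example of shape (5,)
    return torch.tanh(weights @ x)

batched_predict = vmap(predict)      # now accepts a batch of shape (B, 5)
x = torch.randn(64, 5)
assert batched_predict(x).shape == (64, 10)

# Composing vmap with grad gives per-sample gradients without hand-batching:
per_sample_grads = vmap(grad(lambda w, xi: torch.tanh(w @ xi).sum()), in_dims=(None, 0))
assert per_sample_grads(weights, x).shape == (64, 10, 5)
```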
@cHHillee
Horace He
2 months
With the new release of Gemma-2B, I thought I'd see how torch.compile performs. Gemma 2B for a single prompt runs at 144 tokens/s on a V100, a 4x increase over the uncompiled HF version. We're working with @huggingface to upstream these improvements too!
Tweet media one
9
25
248
@cHHillee
Horace He
1 year
Excited to finally announce Pytorch 2.0! I've had so many opinions and things I've wanted to write about ML compilers that I haven't been able to. Can't wait to finally be able to write a ton of stuff about it.
@PyTorch
PyTorch
1 year
We just introduced PyTorch 2.0 at the #PyTorchConference , introducing torch.compile! Available in the nightlies today, stable release Early March 2023. Read the full post: 🧵below! 1/5
Tweet media one
23
524
2K
2
11
233
@cHHillee
Horace He
1 year
It's hilarious how hardware vendors insist on describing their hardware features with platitudes instead of just telling us what hardware instruction they added. "We accelerate GPT3 by 1000x with our new Transformer Engines" smh you're just adding fp8 matmul support.
9
11
233
@cHHillee
Horace He
1 year
Arguably top 5 most important papers in the last several years - incredible that it was rejected. Everybody training large LLMs refers to all the megatron papers constantly. Heck, the most common sharding scheme for transformers is often referred to as “megatron-style sharding”.
@ctnzr
Bryan Catanzaro
1 year
Thank you Nando. This paper was rejected from all the conferences for lack of novelty - I’m glad it was able to find an audience on arXiv.
5
25
399
4
12
233
@cHHillee
Horace He
4 years
With CVPR concluded, I thought I'd check on framework mentions at the most recent CV and NLP conferences respectively (CVPR/EMNLP). PyTorch is up to nearly a 4:1 ratio vs TF at both conferences! CVPR PT: 283 -> 405 TF: 136 -> 102 EMNLP PT: 55 -> 125 TF: 42 -> 36
Tweet media one
Tweet media two
4
47
228
@cHHillee
Horace He
4 years
Hinton responds to Schmidhuber on r/ML: "Despite my own best judgement, I feel that I cannot leave his charges completely unanswered so I am going to respond once and only once."
2
33
221
@cHHillee
Horace He
5 months
If you want a summary of the major events of the recent OpenAI drama, I made a timeline of the major events plotted on a prediction market of whether Sam Altman will remain CEO. Data taken from @ManifoldMarkets (1/3)
Tweet media one
10
17
217
@cHHillee
Horace He
10 months
I’m offering due diligence of due diligence as a service. Before you invest 10k into some Twitter rando’s due diligence, hire me for 100$. I will evaluate the quality of their due diligence and odds of them flaking given their Twitter profile.
@alth0u
alth0u🤸
10 months
I'm offering due diligence as a service. Before you invest millions into an AI or hard tech startup, hire me for $9.9K. I will evaluate feasibility of idea and realisticness of timeline given founder.
8
3
111
4
7
214
@cHHillee
Horace He
1 month
Some folks suggested that I should make some of my tweet threads more archivable. So here's a blog post on how shapes affect matmul performance! Some bonus content includes: 1. Why doesn't torch.compile just fix this? 2. Some more quiz questions :)
Tweet media one
@cHHillee
Horace He
1 year
Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster. But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood! Here's a plot of FLOPs achieved for square matmuls. Let's explain each curve! 1/19
Tweet media one
20
278
2K
2
26
208
@cHHillee
Horace He
2 years
This is insane. However, some perspective: 54th percentile, while insane, corresponds to only about 1 or 2 problems per contest (in div 2). The final rating they achieved (~1200) corresponds to solving ~20% of Codeforces problems. On APPS, they still only get to 8%.
@GoogleDeepMind
Google DeepMind
2 years
Introducing #AlphaCode : a system that can compete at average human level in competitive coding competitions like @codeforces . An exciting leap in AI problem-solving capabilities, combining many advances in machine learning! Read more: 1/
Tweet media one
179
2K
8K
3
29
202
@cHHillee
Horace He
3 years
Many of my friends have applied for ML PhDs both this year and last, and the competition seems to be absolutely brutal (and ramping up every year). Sadly, stories like this are not uncommon:
5
18
206
@cHHillee
Horace He
5 months
One of the nice things about gpt-fast being so small is that it's easy to make changes. Here's a PR of adding *fp8* quantization in about 5 minutes and 20 LOC. I'll probably do a couple more gpt-fast example patches. lmk what you want to see.
Tweet media one
@vgoklani_ai
Vishal Goklani
5 months
@cHHillee @abacaj what about fp8?
1
0
2
8
18
195
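The actual fp8 PR isn't shown here; as a hedged illustration of the gpt-fast-style weight-only quantization pattern it follows, here is the int8 variant: a small module that stores quantized weights plus per-channel scales and dequantizes in forward. The fp8 version swaps the dtype and scaling but keeps the same structure.

```python
# Hedged illustration (not the actual PR): weight-only int8 quantization as a
# drop-in replacement for nn.Linear, with per-output-channel scales.
import torch
import torch.nn as nn

class WeightOnlyInt8Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()                          # (out, in)
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # per-channel scale
        self.register_buffer("weight", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        w = self.weight.to(x.dtype) * self.scale.to(x.dtype)  # dequantize on the fly
        return nn.functional.linear(x, w, self.bias)

lin = nn.Linear(4096, 4096)
qlin = WeightOnlyInt8Linear(lin)
x = torch.randn(2, 4096)
print((lin(x) - qlin(x)).abs().max())    # small quantization error
```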
@cHHillee
Horace He
2 months
If you don't have Groq chips, you can still run Mixtral at nearly 300 tok/s with gpt-fast :) Thanks @kurumuz for the H100 node!
11
17
196
@cHHillee
Horace He
3 years
Wow, when did Google start offering a Colab Pro+? For 50$ a month??? For your 50$ a month, this is what their FAQ says you have access to: "With Colab Pro you get priority access to our fastest GPUs, and with Pro+ even more so."
Tweet media one
15
19
191
@cHHillee
Horace He
1 year
This is a good walkthrough of what happens when you call torch.compile under the hood! One cool thing about the torch.compile stack is that we start with normal PyTorch code and end up with a file of Triton operators (still in python!)
Tweet media one
@shshnkp
Shashank Prasanna
1 year
New blogpost! a visual primer on how @PyTorch 2.0 compiler technologies for graph capture, IRs, operator fusions and automatic C++ and @NVIDIAAIDev GPU code generation. This is your one stop shop to grok PyTorch 2.0's torch.compile() API Summary 🧵👇
7
87
366
2
30
186
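A hedged sketch of how to peek at that generated Triton yourself: recent PyTorch versions document the TORCH_LOGS="output_code" knob, while older 2.0-era releases used TORCH_COMPILE_DEBUG=1. The function being compiled below is an arbitrary example.

```python
# Hedged sketch: dump the Triton that TorchInductor generates for a compiled function.
# Run as:  TORCH_LOGS="output_code" python this_script.py
# (older releases: TORCH_COMPILE_DEBUG=1)
import torch

@torch.compile
def f(x):
    return torch.nn.functional.gelu(x) * x

f(torch.randn(1024, device="cuda"))   # generated kernel source shows up in the logs
```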
@cHHillee
Horace He
3 years
Lots of cool things, but most excited about `torch.fx` - a toolkit that allows you to write composable transformations in Python(!) that take in PyTorch code and output PyTorch code. Because of this, the result trivially composes with other transforms or TorchScript. (1/4)
@PyTorch
PyTorch
3 years
PyTorch 1.8 is here! Highlights include updates for compiler, code optimization, frontend APIs for scientific computing, large scale training for pipeline and model parallelism, and Mobile tutorials. Blog👇
1
278
1K
5
30
170
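A toy sketch of the torch.fx idea (my own example, not from the release notes): trace a module to a graph, rewrite the graph in ordinary Python, and recompile it back into runnable PyTorch code.

```python
# Hedged toy example: a torch.fx transformation that replaces torch.add with torch.mul.
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.add(x, x)

traced = fx.symbolic_trace(M())

for node in traced.graph.nodes:
    if node.op == "call_function" and node.target is torch.add:
        node.target = torch.mul     # rewrite the graph in plain Python
traced.recompile()

x = torch.randn(4)
assert torch.allclose(traced(x), x * x)
```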
@cHHillee
Horace He
4 years
Excited about new paper (w/ @qhwang3 , @abaesingh , @sernamlim , @austinbenson ): We show that by combining label propagation with simple models, we can often match or outperform SOTA GNNs at node tasks, usually at a fraction of the parameters/runtime!
Tweet media one
5
39
166
@cHHillee
Horace He
9 days
One thing I really enjoy about working on an OSS facing project like PyTorch is that OSS really cuts through a lot of "politics" and "fake work". Within a company, people are incentivized to do all sorts of things other than "build the right thing". Unfortunately, no amount…
@schrep
Mike Schroepfer
12 days
True Story! One of the many reasons I love open source is it doesn't give a damn about the org chart or "managing up." If people outside of FB/Meta didn't use or like our OSS then something was wrong with it. PyTorch succeeded because of the hyper focus on developer…
8
56
528
4
6
162
@cHHillee
Horace He
2 years
Here's a recent cool effort that's helped to reduce PyTorch's "entropy" - cleaning up convolutions :P Over the years, PyTorch has accrued ... many convolution implementations. In theory, that's not a problem. The issue is that it's leaked out into the actual operator set. (1/3)
Tweet media one
1
23
157
@cHHillee
Horace He
7 months
@jeremyphoward An allreduce has 2N comm volume. Reduce-scatter and all-gather both have N comm volume. Allreduce can be created from a combination of reduce-scatter + all-gather. DDP: Do allreduce after getting your gradients. Zero-1: Split allreduce into reduce-scatter + all-gather (allows…
5
4
156
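A napkin-math sketch of the point above, with illustrative numbers (7B parameters, bf16 gradients) that are assumptions, not from the reply: splitting the all-reduce into reduce-scatter + all-gather moves the same total bytes; ZeRO-1 just places the two halves at different points in the step.

```python
# Hedged napkin math: per-GPU communication volume for N bytes of gradients.
n_params = 7e9                 # e.g. a 7B-parameter model (illustrative)
bytes_per_grad = 2             # bf16 gradients (illustrative)
N = n_params * bytes_per_grad

allreduce_bytes      = 2 * N   # DDP: one all-reduce after backward (~2N)
reduce_scatter_bytes = N       # ZeRO-1: reduce-scatter after backward...
all_gather_bytes     = N       # ...all-gather after the sharded optimizer step

print(f"all-reduce:                 ~{allreduce_bytes / 1e9:.0f} GB per GPU")
print(f"reduce-scatter + all-gather: ~{(reduce_scatter_bytes + all_gather_bytes) / 1e9:.0f} GB per GPU")
```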
@cHHillee
Horace He
1 year
For many of the problems it failed on, I tried copying the grader's feedback (i.e. the input it failed on, its answer, and expected answer). This consistently resulted in a new solution, but never led it to be correct, except in one old problem I tested (not in the group of 10).
3
0
153
@cHHillee
Horace He
2 years
@moyix But yes, humans were not designed to visualize objects with dimension > 3. The general strategy for doing so is described succinctly in this slide from Geoff Hinton.
Tweet media one
4
24
150
@cHHillee
Horace He
1 year
Also, I would love to see more details from OpenAI on their evaluation procedure for the codeforces problems!
1
1
148
@cHHillee
Horace He
3 years
The Codex paper (the model behind GitHub Copilot) is out. They evaluate on APPS (competitive programming tasks from @DanHendrycks ), and show some progress. Nowhere close to humans on interview-level tasks or harder though, even with generous amounts of sampling.
Tweet media one
3
52
142
@cHHillee
Horace He
7 months
@yacineMTB Got some bad news
Tweet media one
5
0
136
@cHHillee
Horace He
1 year
@ID_AA_Carmack As a side note, most of the overhead doesn't actually even come from Python - it comes from all the other stuff involved in a PyTorch kernel (computing dtypes, output sizes, etc.) Here's a flamegraph of a 1 element add - you can see most of the time is in C++.
Tweet media one
2
9
140
@cHHillee
Horace He
25 days
I had a question at the end of "what shapes do matrix multiplications like?" for testing understanding. Unfortunately... most got it wrong. Let's explain the answer below. For a full writeup (as well as 4 others), check out ! (1/11)
Tweet media one
@cHHillee
Horace He
1 year
Let's say I have a [M x K] @ [K x N] matmul. Which one of these configurations will have the best perf? Think about the actual ramifications of tiling! A: M=2047, K=N=2048 B: K=2047, M=N=2048 C: N=2047, M=K=2048 19/19
9
0
26
3
20
142
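Rather than spoil the answer, here is a hedged sketch for benchmarking the three configurations yourself on a CUDA GPU (the dtype and iteration counts are my choices, not from the thread).

```python
# Hedged sketch: measure the three quiz configurations and see which of M, K, N
# tolerates being the "odd" (2047) dimension best.
import torch
from torch.utils.benchmark import Timer

def bench(M, K, N):
    A = torch.randn(M, K, device="cuda", dtype=torch.float16)
    B = torch.randn(K, N, device="cuda", dtype=torch.float16)
    t = Timer("A @ B", globals={"A": A, "B": B}).timeit(50)
    return 2 * M * K * N / t.median / 1e12   # achieved TFLOPS

configs = {"A: M=2047": (2047, 2048, 2048),
           "B: K=2047": (2048, 2047, 2048),
           "C: N=2047": (2048, 2048, 2047)}
for name, (M, K, N) in configs.items():
    print(f"{name}: {bench(M, K, N):.1f} TFLOPS")
```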
@cHHillee
Horace He
2 months
@felix_red_panda From my understanding, that is exactly how it works :) One thing I'd note is that for LLM inference, the bandwidth for networking really doesn't matter. You're communicating activations in either TP/PP, and those are tiny.
Tweet media one
4
6
137
@cHHillee
Horace He
1 year
Some other notes on my evaluation procedure - I copy pasted the entire problem statement into ChatGPT (with the GPT-4 backend), including the input/output pairs, and asked it to output a C++ solution. The formatting definitely isn't ideal haha.
Tweet media one
2
2
138
@cHHillee
Horace He
2 years
Do you like einops? Do you like NamedTensors? Do you struggle with Numpy-style positional-based indexing? Check out first class dimensions (from @Zachary_DeVito )! It unifies einops/named dims under one concept, and allows for even more! Here are some neat examples (1/4)
@Zachary_DeVito
Zachary DeVito
2 years
We're developing a new take on named tensors by adding dimensions objects to PyTorch. No need to figure out how gather works, expressions look like a loop body but execute as a single kernel. Lots more examples here
Tweet media one
10
73
413
3
16
128
@cHHillee
Horace He
3 years
Also presented a tweet length implementation of their module haha
Tweet media one
2
13
129
@cHHillee
Horace He
7 months
Tomorrow, @christianpurh and I will be talking about accelerating Generative AI inference at PyTorch Conference! () In particular, I'll be talking about transformer inference with native PyTorch. Tune in to see how successful we are at our goals!
4
6
127
@cHHillee
Horace He
1 year
It's unfortunate that graph-mode compilation necessitates additional complexity, but luckily, eager-mode isn't going anywhere! It'll still be ... reasonably fast. But if PyTorch wanted to keep up with where the hardware's going, PyTorch 2.0 needed to happen :) (10/10)
3
3
126
@cHHillee
Horace He
4 days
Many don't know that GPUs automatically leverage ternary and fine-grained sparsity to accelerate your matmuls! e.g. A matmul with ternary + 90% sparsity results in 33% more FLOPs in my benchmark. (not joking) I explore this "optimization" here: (1/3)
Tweet media one
18
41
316
@cHHillee
Horace He
2 years
Enter... Triton. Not only can Triton achieve matmul performance competitive with CuBLAS, it can achieve it in a (relatively) understandable 40 lines of code! And that's what separates Triton from other DSLs like Numba. (4/7)
Tweet media one
2
9
121
@cHHillee
Horace He
2 years
Very entertaining profile, and a fairly inspiring path into mathematics. But… 1. Drops out of high school to be a poet 2. ??? 3. Ends up at Seoul National University, the most prestigious school in Korea? What’s step 2?
@Noahpinion
Noah Smith 🐇🇺🇸🇺🇦
2 years
This dude dropped out of high school to become a POET, didn't even like math til his sixth year of college, and just won the Fields Medal. Success doesn't always take the straight and narrow path, folks.
65
679
4K
5
7
119
@cHHillee
Horace He
5 months
Apparently gpt-fast can be easily modified to support off-the-shelf GPTQ models as well, which is a nice surprise :) A user reported getting 193 tokens/s on Llama-7B after doing so.
1
19
121
@cHHillee
Horace He
2 years
Opening up the box allows many other interesting things. Remember Flash-Attention, the fused attention kernel ()? Well, the authors' codebase is thousands of lines of C++ (much of it taken from Apex). OTOH, you can write it in 200 lines of Triton! (5/7)
Tweet media one
Tweet media two
@tri_dao
Tri Dao
2 years
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
Tweet media one
31
359
2K
3
7
118
@cHHillee
Horace He
2 years
PyTorch's conda packages were unavailable for some time yesterday (due to a Cloudflare issue), and some users were getting a bit ... upset.
Tweet media one
13
3
119
@cHHillee
Horace He
1 year
TL;DR: PyTorch came out in a time where graph-mode compilation couldn't provide significant performance wins due to models being bottlenecked by matmuls. 5 years of Nvidia matmul perf improvements later, graph-mode compilation can bring significant wins. (9/10)
1
9
117
@cHHillee
Horace He
2 years
For single-GPU performance, there are 3 main areas your model might be bottlenecked by. Those are: 1. Compute, 2. Memory-Bandwidth, and 3. Overhead. Correspondingly, the optimizations that matter *also* depend on which regime you're in. (2/3)
Tweet media one
1
6
114
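A hedged napkin-math sketch of how to classify the regime: compare an op's arithmetic intensity (FLOPs per byte moved) against the hardware's compute-to-bandwidth ratio. The A100-ish peak numbers below are approximate and only for illustration.

```python
# Hedged sketch: roofline-style classification of ops as compute- or memory-bound.
PEAK_TFLOPS = 312          # approx. A100 bf16 tensor-core peak (illustrative)
PEAK_BW_TBS = 2.0          # approx. A100 HBM bandwidth (illustrative)
ridge = PEAK_TFLOPS * 1e12 / (PEAK_BW_TBS * 1e12)   # FLOPs/byte needed to saturate compute

def intensity_matmul(m, k, n, bytes_per_el=2):
    flops = 2 * m * k * n
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

def intensity_pointwise(bytes_per_el=2):
    return 1 / (2 * bytes_per_el)    # ~1 FLOP per element read + written

for desc, ai in [("4096^3 matmul", intensity_matmul(4096, 4096, 4096)),
                 ("bs=1 decode matvec", intensity_matmul(1, 4096, 4096)),
                 ("unary pointwise op", intensity_pointwise())]:
    regime = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{desc}: ~{ai:.1f} FLOPs/byte -> {regime} (ridge ~{ridge:.0f})")
```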
@cHHillee
Horace He
2 years
Little known fact: C++ experience is actually an infohazard for ML devs. Knowledge of C++ metaprogramming is even more dangerous.
3
9
116
@cHHillee
Horace He
2 years
That being said, Triton is still a much lower-level abstraction than, say, PyTorch. In some sense, this might be even more exciting for framework devs than end users :P That said, the PyTorch team has some exciting work leveraging Triton - stay tuned! (7/7)
2
4
114
@cHHillee
Horace He
15 days
The live updating image generator on is a pretty sick UX.
4
16
114
@cHHillee
Horace He
21 days
@karpathy I think you can largely do this with torch.compile tbh :P We actually support codegening into C++ with no PyTorch dependency
3
2
111
@cHHillee
Horace He
3 years
Can't wait for people to start talking about how to get a job at MANGA (instead of FAANG)
4
7
108
@cHHillee
Horace He
1 year
Also, if you wanna check out my submissions/what problems I tested it on, check out the most recent page of submissions here:
5
0
109
@cHHillee
Horace He
7 days
If anybody is at ASPLOS, I'll be at the PyTorch tutorial! I'll be presenting a tutorial on TorchInductor and also some random speculation on research ideas.
1
17
109
@cHHillee
Horace He
2 years
If you've ever wondered why AMD hasn't made much of an impact on deep learning despite having a lot of hype, this reddit thread is a good primer.
5
15
103
@cHHillee
Horace He
2 years
Not only is this more understandable, it's also *significantly* faster than the author's implementation. So, Triton is very cool! If you're interested in hand-writing CUDA code, I'd strongly recommend looking at Triton. (6/7)
Tweet media one
4
4
102
@cHHillee
Horace He
3 years
Just found this great 3B1B-styled ML youtube channel: Only has a couple of videos (on Normalizing Flows, Automatic Differentiation, and Transformers), but they're all very good.
5
8
98
@cHHillee
Horace He
3 years
I think this is a fun example of how hard benchmarking is. There are 2 errors in the linked tweet, I encourage you to find them yourself :) Benchmarking is quite hard, so I think it's interesting to talk about the pitfalls. After making the benchmarks fair, PyTorch is faster :P
Tweet media one
@k_saifullaah
khalid
3 years
Was intrigued by @karpathy 's tweet and was curious to see how `math.sqrt()` performs in scale, with other frameworks (np.sqrt, torch.sqrt, jax's sqrt). (btw, map() was really fast, it's list conversion that took time)
Tweet media one
9
19
125
4
14
93
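Without spoiling the two errors, here is a hedged sketch of the benchmarking hygiene being hinted at: warm up, synchronize around GPU work, and avoid timing unrelated host-side work. torch.utils.benchmark handles the warmup and synchronization details for you.

```python
# Hedged sketch: GPU timing done carefully. Naively timing an async CUDA call mostly
# measures the kernel *launch*; Timer.blocked_autorange() warms up and synchronizes.
import torch
from torch.utils.benchmark import Timer

x = torch.randn(10_000_000, device="cuda")

t = Timer("torch.sqrt(x)", globals={"torch": torch, "x": x}).blocked_autorange()
print(t)
```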
@cHHillee
Horace He
3 years
Glad to have helped on this very cool massive text dataset for large scale language modeling from EleutherAI. Not only does this make it possible for people to replicate GPT3 without the tedium of dataset cleaning, it finally provides an alternative to common crawl! (1/3)
@nabla_theta
Leo Gao
3 years
Announcing a new dataset: the Pile! A free and publicly available 800GB dataset of diverse English text for language modeling! Download: Paper: 1/7
Tweet media one
7
276
1K
1
18
95
@cHHillee
Horace He
1 year
It’s a tragedy that all the bandwagoners and hypechasers hopping from web3 into AI are right this time. 😤😤
3
10
94
@cHHillee
Horace He
3 months
I keep on clicking on these Sora videos expecting them to have sound, which I haven't done before on AI-generated video.
1
2
94
@cHHillee
Horace He
3 years
One of my favorite guides to empirical research (and one that helped me a lot personally) was this reddit comment by @ajmooch . TL;DR: Do stuff, all the time.
Tweet media one
1
10
90
@cHHillee
Horace He
2 months
@yanboliang just landed a significant improvement in mixtral perf (). For one A100 + int8, we go from 56 tok/s to 98 tok/s! That's about 64% MBU, not too bad. Essentially, we had one of the weight matrices transposed the wrong way :)
Tweet media one
2
3
87
@cHHillee
Horace He
2 years
Here's a pretty cool (prototype) approach to capturing static PyTorch graphs by dynamically modifying PyTorch bytecode (by Jason Ansel). Essentially, you introspect to ensure that the bytecode you're about to run is identical, and then cache that! (1/2)
Tweet media one
1
10
88
@cHHillee
Horace He
1 year
So, if 90% of your network's time is spent in matmuls, then it's no big deal if you give up 10% in performance in exchange for a better UX. But then... Nvidia added tensor cores. And ever since, they've been relentlessly doubling or tripling FLOPS every generation. (4/10)
Tweet media one
3
3
86
@cHHillee
Horace He
3 years
Tweet media one
2
7
86
@cHHillee
Horace He
2 years
@JFPuget I'm affiliated with Pytorch, and I'd say that Jax is worth checking out :) It represents a meaningfully different point in the design space and introduces a lot of cool ideas. Something doesn't need to be "better" to have interesting ideas and be worth learning.
4
1
85