arya

@aryagxr

Followers 3K · Following 3K · Media 94 · Statuses 386

multiplying matrices

🪐
Joined January 2017
@aryagxr
arya
7 days
single file version open sourced here: https://t.co/98bopwSYkA
0
0
3
@aryagxr
arya
7 days
my attempts at implementing both FlashAttention papers 1 & 2 from when I was learning. now that I look at it, it could use a lot more optimizations, and it’s time I revisit it this weekend
@alexinexxx
alexine 🏴‍☠️
8 days
>revisiting this paper & FlashAttention has never looked prettier
3
5
201
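For reference, the core trick both papers build on is the online-softmax recurrence: stream over the keys while keeping a running max and running denominator, so the full N x N score matrix is never materialized. Below is a minimal CUDA sketch with one thread per query row; the dimensions, names, and layout are illustrative, not taken from the linked repo.

// naive online-softmax attention: one thread per query row.
// Q, K, V, O are row-major [N, d]; launch with enough threads to cover N.
#include <cuda_runtime.h>
#include <math.h>

__global__ void online_softmax_attention(const float* Q, const float* K,
                                         const float* V, float* O,
                                         int N, int d, float scale) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running softmax denominator
    float acc[128];       // unnormalized output; assumes d <= 128
    for (int i = 0; i < d; ++i) acc[i] = 0.0f;

    for (int j = 0; j < N; ++j) {
        float s = 0.0f;   // s = scale * (q_row . k_j)
        for (int i = 0; i < d; ++i) s += Q[row * d + i] * K[j * d + i];
        s *= scale;

        // the FlashAttention update: when the max changes, rescale old state
        float m_new = fmaxf(m, s);
        float corr = __expf(m - m_new);
        float p = __expf(s - m_new);
        l = l * corr + p;
        for (int i = 0; i < d; ++i) acc[i] = acc[i] * corr + p * V[j * d + i];
        m = m_new;
    }
    for (int i = 0; i < d; ++i) O[row * d + i] = acc[i] / l;
}

The tiled, shared-memory kernels in papers 1 & 2 are refinements of exactly this update; the recurrence itself is unchanged.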
@aryagxr
arya
18 days
CUDA brain dump:
@aryagxr
arya
21 days
tfw you find a good cuda blog that you can actually follow along and reproduce the results because it’s optimized for your gpu arch
14
31
849
@aryagxr
arya
20 days
being confused is a good sign you’re learning
1
2
10
@aryagxr
arya
21 days
tfw you find a good cuda blog that you can actually follow along and reproduce the results because it’s optimized for your gpu arch
3
17
318
@aryagxr
arya
22 days
uhh same
@vikhyatk
vik
23 days
yes i totally understand what's going on here
0
0
4
@aryagxr
arya
23 days
guess I can’t hide from inline ptx kernels anymore xD because, if optimized well, the speedup you can get from this seems promising https://t.co/2vPz6DERBy
0
0
8
@aryagxr
arya
23 days
there isn’t much info out there on how to write inline PTX, or which PTX instructions to use. but I found a few blog posts and some snippets from nvidia’s docs and was able to piece together a GEMM kernel using mma instructions, at warp level
6
3
72
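For reference, the instruction at the heart of such a kernel is a warp-synchronous mma.sync issued from inline PTX. The sketch below uses the m16n8k16 fp16-input / fp32-accumulate shape from the PTX ISA docs (available on sm_80 and newer); the fragment registers a, b, c are assumed to have been packed by surrounding code that is not shown.

// one warp-level tensor-core multiply-accumulate: D = A * B + C.
// a[] and b[] hold packed half2 fragments as 32-bit registers.
__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

All 32 lanes of the warp must execute this together; each lane contributes its own fragment registers and receives its own slice of the 16x8 result.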
@aryagxr
arya
24 days
forever a city girl !
0
0
11
@aryagxr
arya
25 days
knowing the limitations of this gpu saved me a lot of debugging time. I’ve made this mistake before: trying to run example code from the docs without realizing it’s meant for hopper architecture, or needs more tensor cores, etc.
0
0
3
@aryagxr
arya
25 days
you should be spending more time getting familiar with your gpu architecture if you want to write efficient kernels. im writing mma kernels using tensor cores, on a 4050 that has
> 80 tensor cores
> 20 SMs
> 4 tensor cores per SM
> 4 warp schedulers per SM
> 256 fused multiply
2
0
9
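A quick way to confirm those limits before trusting example code is to query the device at runtime, as in the small sketch below; note that tensor-core counts are not exposed through the API, so those still come from matching major/minor against the architecture docs.

// print the properties that decide whether a kernel will even run here
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: sm_%d%d, %d SMs, %zu KB smem/block, warp size %d\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount,
           prop.sharedMemPerBlock / 1024, prop.warpSize);
    return 0;
}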
@aryagxr
arya
25 days
it’s that time of the year!
@m_sirovatka
Matej Sirovatka
25 days
It's that time of the year again and we're coming with another @GPU_MODE competition! This time in collaboration with @nvidia, focused on NVFP4 and B200 GPUs (thanks to @sestercegroup), we'll release 4 problems over the following 3 months: 1. NVFP4 Batched GEMV
0
0
8
@aryagxr
arya
26 days
my muse for this week 🫠 i will call it a successful week if I can internalize these fragment layouts from tensor core mma
0
0
8
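One way to internalize those layouts is to have every lane print which accumulator elements it owns. The sketch below writes out the documented m16n8k16 f32 accumulator mapping from the PTX ISA, where groupID = lane/4 and the thread id within the group is lane%4.

// each of the 32 lanes holds 4 floats (c0..c3) of the 16x8 accumulator
#include <cstdio>
#include <cuda_runtime.h>

__global__ void print_c_fragment_owners() {
    int lane = threadIdx.x;   // launched as a single warp
    int group = lane / 4;     // "groupID" in the PTX ISA docs
    int tig = lane % 4;       // thread id within the group
    for (int i = 0; i < 4; ++i) {
        int row = (i < 2) ? group : group + 8;  // c0,c1 top half; c2,c3 bottom
        int col = tig * 2 + (i & 1);
        printf("lane %2d owns C[%2d][%d] as c%d\n", lane, row, col, i);
    }
}

int main() {
    print_c_fragment_owners<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}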
@aryagxr
arya
27 days
looking at building cuda/c++ extensions in pytorch so I can call custom kernels from my inference stack. worth mentioning, for weeks i could not include <torch/extension.h> in my stack without errors and couldn’t figure out why. I fixed it after installing libtorch,
1
0
13
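For reference, once the headers resolve, the binding side of such an extension can stay small. A minimal sketch; my_gemm and its launcher are placeholder names, not anything from this stack.

// expose a custom CUDA kernel launcher to python via torch's C++ extension API
#include <torch/extension.h>

// assumed to be defined in a .cu file compiled into the same extension
void my_gemm_launcher(torch::Tensor A, torch::Tensor B, torch::Tensor C);

torch::Tensor my_gemm(torch::Tensor A, torch::Tensor B) {
    TORCH_CHECK(A.is_cuda() && B.is_cuda(), "expected CUDA tensors");
    auto C = torch::empty({A.size(0), B.size(1)}, A.options());
    my_gemm_launcher(A, B, C);
    return C;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_gemm", &my_gemm, "custom GEMM kernel");
}

This is built with torch.utils.cpp_extension.load (or a setup.py), which needs the libtorch headers and libraries to be findable; that is consistent with the include errors above going away after installing libtorch.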
@aryagxr
arya
29 days
this is awesome, and they have a whole section for infra, gpu, and kernel stuff. i know what im doing this weekend yay
@eliebakouch
elie
29 days
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably https://t.co/iN2JtWhn23
1
0
13
@aryagxr
arya
1 month
my only fall activity this year is to wake up before the sun and lock in. you can expect a blog/worklog from me on llm optimizations soon
8
3
184
@aryagxr
arya
1 month
I try to clear up most of my week and spare time for side project maxxing. I’m a full time student at uni but I make sure to spend as little time and mindshare as possible on homework, tests, and pointless electives. once a week i speed run all my uni work and free up the rest of my
1
0
13
@aryagxr
arya
1 month
if we’re talking about consciousness as simply being aware, I thought the closest we got to giving LLMs some form of consciousness was through RL
@willccbb
will brown
1 month
i think LLMs are obviously not conscious because there was no selection pressure for them to be, but rather to mimic byproducts of consciousness. humans are conscious because it was evolutionarily useful for us to be
0
0
5
@aryagxr
arya
1 month
it’s quite cool that you can just see your entire inference file run and identify exactly what’s slowing you down
1
0
10
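One lightweight way to get that whole-run view in a CUDA/C++ stack is to wrap each stage in NVTX ranges and open the trace in Nsight Systems (nsys profile ./infer). A sketch with illustrative stage names; NVTX v3 is header-only and ships with the CUDA toolkit.

// named ranges show up as labeled bars on the nsys timeline
#include <nvtx3/nvToolsExt.h>

void run_inference_step() {
    nvtxRangePushA("attention");
    // ... launch attention kernels ...
    nvtxRangePop();

    nvtxRangePushA("mlp");
    // ... launch mlp kernels ...
    nvtxRangePop();
}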