arya

@aryagxr

Followers 3K · Following 3K · Media 94 · Statuses 386

multiplying matrices

🪐
Joined January 2017
@aryagxr
arya
7 days
single file version open sourced here: https://t.co/98bopwSYkA
0
0
3
@aryagxr
arya
7 days
my attempts at implementing both FlashAttention papers 1 & 2 from when I was learning. now that I look at it, it could use a lot more optimizations, and it’s time I revisit it this weekend
@alexinexxx
alexine 🏴‍☠️
8 days
>revisiting this paper & FlashAttention has never looked prettier
3
5
201
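For reference, the core trick both papers build on is the online-softmax recurrence: stream over the keys while keeping a running max and running denominator, so the full N x N score matrix is never materialized. Below is a minimal CUDA sketch with one thread per query row; the dimensions, names, and layout are illustrative, not taken from the linked repo.

// naive online-softmax attention: one thread per query row.
// Q, K, V, O are row-major [N, d]; launch with enough threads to cover N.
#include <cuda_runtime.h>
#include <math.h>

__global__ void online_softmax_attention(const float* Q, const float* K,
                                         const float* V, float* O,
                                         int N, int d, float scale) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running softmax denominator
    float acc[128];       // unnormalized output; assumes d <= 128
    for (int i = 0; i < d; ++i) acc[i] = 0.0f;

    for (int j = 0; j < N; ++j) {
        float s = 0.0f;   // s = scale * (q_row . k_j)
        for (int i = 0; i < d; ++i) s += Q[row * d + i] * K[j * d + i];
        s *= scale;

        // the FlashAttention update: when the max changes, rescale old state
        float m_new = fmaxf(m, s);
        float corr = __expf(m - m_new);
        float p = __expf(s - m_new);
        l = l * corr + p;
        for (int i = 0; i < d; ++i) acc[i] = acc[i] * corr + p * V[j * d + i];
        m = m_new;
    }
    for (int i = 0; i < d; ++i) O[row * d + i] = acc[i] / l;
}

The tiled, shared-memory kernels in papers 1 & 2 are refinements of exactly this update; the recurrence itself is unchanged.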
@aryagxr
arya
18 days
CUDA brain dump:
@aryagxr
arya
21 days
tfw you find a good cuda blog that you can actually follow along and reproduce the results because it’s optimized for your gpu arch
14
31
849
@aryagxr
arya
20 days
being confused is a good sign you’re learning
1
2
10
@aryagxr
arya
21 days
tfw you find a good cuda blog that you can actually follow along and reproduce the results because it’s optimized for your gpu arch
3
17
318
@aryagxr
arya
22 days
uhh same
@vikhyatk
vik
23 days
yes i totally understand what's going on here
0
0
4
@aryagxr
arya
23 days
guess I can’t hide from inline ptx kernels anymore xD because, if optimized well, the speedup you can get from this seems promising https://t.co/2vPz6DERBy
0
0
8
@aryagxr
arya
23 days
there isn’t much info out there on how to write inline PTX, or which PTX instructions to use. but I found a few blog posts and some snippets from nvidia’s docs and was able to piece together a GEMM kernel using mma instructions, at warp level
6
3
72
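For reference, the instruction at the heart of such a kernel is a warp-synchronous mma.sync issued from inline PTX. The sketch below uses the m16n8k16 fp16-input / fp32-accumulate shape from the PTX ISA docs (available on sm_80 and newer); the fragment registers a, b, c are assumed to have been packed by surrounding code that is not shown.

// one warp-level tensor-core multiply-accumulate: D = A * B + C.
// a[] and b[] hold packed half2 fragments as 32-bit registers.
__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

All 32 lanes of the warp must execute this together; each lane contributes its own fragment registers and receives its own slice of the 16x8 result.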
@aryagxr
arya
24 days
forever a city girl !
0
0
11
@aryagxr
arya
25 days
knowing the limitations of this gpu saved me a lot of debugging time. I’ve made this mistake before: trying to run example code from the docs without realizing it’s meant for hopper architecture, or needs more tensor cores, etc.
0
0
3
@aryagxr
arya
25 days
you should be spending more time getting familiar with your gpu architecture if you want to write efficient kernels. im writing mma kernels using tensor cores, on a 4050 that has
> 80 tensor cores
> 20 SMs
> 4 tensor cores per SM
> 4 warp schedulers per SM
> 256 fused multiply
2
0
9
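A quick way to confirm those limits before trusting example code is to query the device at runtime, as in the small sketch below; note that tensor-core counts are not exposed through the API, so those still come from matching major/minor against the architecture docs.

// print the properties that decide whether a kernel will even run here
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: sm_%d%d, %d SMs, %zu KB smem/block, warp size %d\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount,
           prop.sharedMemPerBlock / 1024, prop.warpSize);
    return 0;
}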
@aryagxr
arya
25 days
it’s that time of the year!
@m_sirovatka
Matej Sirovatka
25 days
It's that time of the year again and we're coming with another @GPU_MODE competition! This time in collaboration with @nvidia, focused on NVFP4 and B200 GPUs (thanks to @sestercegroup), we'll release 4 problems over the following 3 months: 1. NVFP4 Batched GEMV
0
0
8
@aryagxr
arya
26 days
my muse for this week 🫠 i will call it a successful week if I can internalize these fragment layouts from tensor core mma
0
0
8
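One way to internalize those layouts is to have every lane print which accumulator elements it owns. The sketch below writes out the documented m16n8k16 f32 accumulator mapping from the PTX ISA, where groupID = lane/4 and the thread id within the group is lane%4.

// each of the 32 lanes holds 4 floats (c0..c3) of the 16x8 accumulator
#include <cstdio>
#include <cuda_runtime.h>

__global__ void print_c_fragment_owners() {
    int lane = threadIdx.x;   // launched as a single warp
    int group = lane / 4;     // "groupID" in the PTX ISA docs
    int tig = lane % 4;       // thread id within the group
    for (int i = 0; i < 4; ++i) {
        int row = (i < 2) ? group : group + 8;  // c0,c1 top half; c2,c3 bottom
        int col = tig * 2 + (i & 1);
        printf("lane %2d owns C[%2d][%d] as c%d\n", lane, row, col, i);
    }
}

int main() {
    print_c_fragment_owners<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}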
@aryagxr
arya
27 days
looking at building cuda/c++ extensions in pytorch so I can call custom kernels from my inference stack. worth mentioning, for weeks i could not include <torch/extension.h> in my stack without errors and couldn’t figure out why. I fixed it after installing libtorch,
1
0
13
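For reference, once the headers resolve, the binding side of such an extension can stay small. A minimal sketch; my_gemm and its launcher are placeholder names, not anything from this stack.

// expose a custom CUDA kernel launcher to python via torch's C++ extension API
#include <torch/extension.h>

// assumed to be defined in a .cu file compiled into the same extension
void my_gemm_launcher(torch::Tensor A, torch::Tensor B, torch::Tensor C);

torch::Tensor my_gemm(torch::Tensor A, torch::Tensor B) {
    TORCH_CHECK(A.is_cuda() && B.is_cuda(), "expected CUDA tensors");
    auto C = torch::empty({A.size(0), B.size(1)}, A.options());
    my_gemm_launcher(A, B, C);
    return C;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_gemm", &my_gemm, "custom GEMM kernel");
}

This is built with torch.utils.cpp_extension.load (or a setup.py), which needs the libtorch headers and libraries to be findable; that is consistent with the include errors above going away after installing libtorch.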
@aryagxr
arya
29 days
this is awesome, and they have a whole section for infra, gpu, and kernel stuff. i know what im doing this weekend yay
@eliebakouch
elie
29 days
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably https://t.co/iN2JtWhn23
1
0
13
@aryagxr
arya
1 month
my only fall activity this year is to wake up before the sun and lock in. you can expect a blog/worklog from me on llm optimizations soon
8
3
184
@aryagxr
arya
1 month
I try to clear up most of my week and spare time for side project maxxing. I’m a full time student at uni but I make sure to spend as little time and mindshare as possible on homework, tests, and pointless electives. once a week i speed run all my uni work and free up the rest of my
1
0
13
@aryagxr
arya
1 month
if we’re talking about consciousness as simply being aware, I thought the closest we got to giving LLMs some form of consciousness was through RL
@willccbb
will brown
1 month
i think LLMs are obviously not conscious because there was no selection pressure for them to be, but rather to mimic byproducts of consciousness. humans are conscious because it was evolutionarily useful for us to be
0
0
5
@aryagxr
arya
1 month
it’s quite cool that you can just see your entire inference file run and identify exactly what’s slowing you down
1
0
10
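One lightweight way to get that whole-run view in a CUDA/C++ stack is to wrap each stage in NVTX ranges and open the trace in Nsight Systems (nsys profile ./infer). A sketch with illustrative stage names; NVTX v3 is header-only and ships with the CUDA toolkit.

// named ranges show up as labeled bars on the nsys timeline
#include <nvtx3/nvToolsExt.h>

void run_inference_step() {
    nvtxRangePushA("attention");
    // ... launch attention kernels ...
    nvtxRangePop();

    nvtxRangePushA("mlp");
    // ... launch mlp kernels ...
    nvtxRangePop();
}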