arya (@aryagxr)
3K Followers · 3K Following · 94 Media · 386 Statuses
multiplying matrices 💪
Joined January 2017
tfw you find a good cuda blog that you can actually follow along and reproduce the results because it's optimized for your gpu arch
3 replies · 17 reposts · 318 likes
guess I can't hide from inline ptx kernels anymore xD because, if optimized well, the speedup you can get from this seems promising https://t.co/2vPz6DERBy
0 replies · 0 reposts · 8 likes
there isn't much info out there on how to write inline PTX, or which PTX instructions to use. but I found a few blog posts and some snippets from nvidia's docs and was able to piece it together and write a GEMM kernel using mma instructions, at warp level
6 replies · 3 reposts · 72 likes
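A minimal sketch of the kind of warp-level inline-PTX mma call described above, assuming sm_80 or newer (a 4050 is Ada, sm_89) and compilation with something like -arch=sm_89; the function name and argument packing are illustrative, while the instruction shape and operand counts follow the PTX ISA:

```cuda
// Sketch: one warp-wide D = A*B + C tile via inline PTX, shape m16n8k16,
// f16 inputs with f32 accumulation. Per the PTX ISA, each lane contributes
// 4 x .b32 of A (8 halves), 2 x .b32 of B (4 halves), and 4 f32 of C/D.
__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

Each unsigned packs two f16 values; which matrix element lands in which lane is fixed by the instruction's fragment layout, not by the programmer.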
knowing the limitations of this gpu saved me a lot of debugging time. I've made this mistake before where I try running example code from the docs, not realizing it's meant to be run on Hopper architecture, or needs more tensor cores, etc.
0 replies · 0 reposts · 3 likes
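One way to avoid that class of mistake is to query the device before running arch-specific samples. A small sketch using the standard CUDA runtime API; the sm_90 gate is just an example check for Hopper-only features (wgmma, TMA):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: sm_%d%d, %d SMs, %zu KB smem/block\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.sharedMemPerBlock / 1024);
    if (prop.major < 9)  // Hopper-only doc samples need sm_90
        printf("skipping Hopper-only sample on this GPU\n");
    return 0;
}
```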
you should be spending more time getting familiar with your gpu architecture if you want to write efficient kernels. i'm writing mma kernels using tensor cores, on a 4050 that has
> 80 tensor cores
> 20 SMs
> 4 tensor cores per sm
> 4 warp schedulers per sm
> 256 fused multiply
2 replies · 0 reposts · 9 likes
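Those per-SM numbers matter because they bound how much work runs concurrently. A sketch of asking the runtime how many blocks of a given kernel fit on each SM; my_mma_kernel and the 128-thread block size are made-up placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_mma_kernel() { /* placeholder body */ }

int main() {
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, my_mma_kernel, /*blockSize=*/128, /*dynamicSmemBytes=*/0);
    // on a 20-SM part, blocksPerSM * 20 caps the number of resident blocks
    printf("resident blocks per SM at 128 threads: %d\n", blocksPerSM);
    return 0;
}
```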
it's that time of the year!
It's that time of the year again and we're coming with another @GPU_MODE competition! This time, in collaboration with @nvidia and focused on NVFP4 and B200 GPUs (thanks to @sestercegroup), we'll release 4 problems over the following 3 months: 1. NVFP4 Batched GEMV
0 replies · 0 reposts · 8 likes
my muse for this week: i will call it a successful week if I can internalize these fragment layouts from tensor core mma
0 replies · 0 reposts · 8 likes
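For reference, a sketch of the A-fragment layout for mma.m16n8k16 with f16 inputs, per my reading of the PTX ISA fragment diagrams; each lane holds eight halves a0..a7, and this hypothetical helper maps (lane, i) back to a (row, col) in the 16x16 A tile. Worth verifying against the docs before relying on it:

```cuda
// Hypothetical helper: coordinates of element a_i held by lane `lane` in
// the 16x16 A tile of mma.m16n8k16 (f16). Derived from the PTX ISA docs.
__host__ __device__ void mma_a_coord(int lane, int i, int* row, int* col) {
    int group = lane >> 2;  // 8 groups of 4 lanes
    int tid   = lane & 3;   // lane within its group
    *row = group + (((i & 2) != 0) ? 8 : 0);        // a2,a3,a6,a7: lower 8 rows
    *col = tid * 2 + (i & 1) + ((i >= 4) ? 8 : 0);  // a4..a7: k = 8..15
}
```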
looking at building cuda/c++ extensions in pytorch so I can call custom kernels from my inference stack. worth mentioning: for weeks i could not include <torch/extension.h> in my stack without errors and couldn't figure out why. I fixed it after installing libtorch,
1 reply · 0 reposts · 13 likes
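A minimal sketch of such an extension, assuming the standard torch.utils.cpp_extension build flow; my_scale is a made-up op standing in for a dispatch into a custom CUDA kernel:

```cuda
#include <torch/extension.h>

// Hypothetical op: scales a tensor; in practice this would launch a kernel.
torch::Tensor my_scale(torch::Tensor x, double s) {
    TORCH_CHECK(x.is_cuda(), "expected a CUDA tensor");
    return x * s;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_scale", &my_scale, "scale a tensor (example)");
}
```

Building it with torch.utils.cpp_extension.load(name=..., sources=[...]) is also what wires up the libtorch include and link paths that make <torch/extension.h> resolve.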
this is awesome, and they have a whole section for infra, gpu, kernel stuff. i know what i'm doing this weekend, yay
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training and infra. 200+ pages of what worked, what didn't, and how to make it run reliably https://t.co/iN2JtWhn23
1
0
13
my only fall activity this year is to wake up before the sun and lock in. you can expect a blog/worklog from me on llm optimizations soon
8 replies · 3 reposts · 184 likes
I try to clear up most of my week and spare time for side project maxxing. I'm a full time student at uni but I make sure to spend as little time and mindshare as possible on homework, tests and pointless electives. once a week i speedrun all my uni work and free up the rest of my
1 reply · 0 reposts · 13 likes
if we're talking about consciousness as simply being aware, I thought the closest we got to giving LLMs some form of consciousness was through RL
i think LLMs are obviously not conscious because there was no selection pressure for them to be, but rather to mimic byproducts of consciousness. humans are conscious because it was evolutionarily useful for us to be
0 replies · 0 reposts · 5 likes
it's quite cool that you can just see your entire inference file run and identify exactly what's slowing you down
1 reply · 0 reposts · 10 likes
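If the profiler behind that post is a timeline tool like Nsight Systems (an assumption; the post doesn't say which), NVTX ranges are a cheap way to get that per-phase view. The phase names below are made up:

```cuda
#include <nvtx3/nvToolsExt.h>  // header-only NVTX v3, ships with the toolkit

void run_inference_step() {
    nvtxRangePushA("prefill");   // shows up as a named region on the timeline
    // ... prefill / forward-pass kernels ...
    nvtxRangePop();

    nvtxRangePushA("decode");
    // ... token-by-token decode kernels ...
    nvtxRangePop();
}
```

Run the program under nsys profile and each named range appears on the timeline, which makes the slow phase visible at a glance.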