Vijay

@__tensorcore__

Followers
2K
Following
8K
Media
59
Statuses
1K

MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.

Joined July 2015
@__tensorcore__
Vijay
2 months
🚨🔥 CUTLASS 4.0 is released 🔥🚨. pip install nvidia-cutlass-dsl. 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python.
15
82
410
@__tensorcore__
Vijay
1 month
Another 🔥 blog about CUTLASS from @colfaxintl, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them.
0
36
158
@__tensorcore__
Vijay
1 month
RT @MoonL88537: did i mention that this is totally nuts?
0
446
0
@__tensorcore__
Vijay
1 month
RT @tri_dao: We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GT….
0
56
0
@__tensorcore__
Vijay
2 months
Every GPU kernel writer in shambles
4
8
118
@__tensorcore__
Vijay
2 months
RT @jinaycodes: Introducing soarXiv ✈️, the most beautiful way to explore human knowledge. Take any paper's URL and replace arxiv with soar….
0
1K
0
@__tensorcore__
Vijay
2 months
RT @elliotarledge: timelapse #58 (14.5 hrs): .- used cutlass python DSL to increase elementwise add/mul memory throughput (from pytorch 500….
0
3
0
@__tensorcore__
Vijay
2 months
RT @tri_dao: I love Cutlass, and this new Python DSL looks very well-designed. Will for sure accelerate kernel dev + exploring new ideas in….
0
25
0
@__tensorcore__
Vijay
2 months
Lastly, I want to say a massive thank you for the work (and sacrifices) of everyone who has worked on this project. This is the first time CUTLASS has done a compiler, and it required a lot of collaboration across our CUDA, driver, compiler, Python, frameworks, and DevTech teams.
2
0
21
@__tensorcore__
Vijay
2 months
CuTe DSL will be in beta for the next few months. We would love to hear your feedback and suggestions, especially during this period. Check us out on GitHub and file issues or contribute examples.
1
0
15
@__tensorcore__
Vijay
2 months
If you missed it, you can also watch our 40 min GTC talk that dives deep into CuTe DSL
1
1
17
@__tensorcore__
Vijay
2 months
We have also heard your feedback asking for better documentation. CuTe DSL documentation and all existing CUTLASS C++ documentation are now homed in one place, with a fresh coat of paint!
1
0
19
@__tensorcore__
Vijay
2 months
We even have a series of Jupyter notebooks to get you started. My favorite is the one that teaches how to print types and values at both compile time and runtime. One of the best and most frequently used ways to debug kernels 😊.
2
1
24
@__tensorcore__
Vijay
2 months
🔋 Batteries included in the form of many new examples across multiple architectures 🔌. Of note are Ampere FlashAttention-2 and Blackwell FlashAttention-3 implementations that are at parity with C++ implementations in terms of performance.
1
2
30
@__tensorcore__
Vijay
2 months
We believe low-level access to hardware is extremely important. High-level generators rob programmers of the freedom to experiment with new ideas and kernel designs, while C++ is too slow to compile, learn, and debug. CuTe DSL provides the best of both worlds ⚡
2
2
27
@__tensorcore__
Vijay
2 months
This initial release ships with CuTe DSL, a programming language that is fully consistent with CuTe C++ in its programming model, APIs, abstraction level, and performance. Kernels in CuTe DSL look and feel like CuTe C++ but compile 100x faster without compromise in performance.
1
1
28
@__tensorcore__
Vijay
2 months
RT @memorypaladin: Most exciting addition in CUDA 12.9 for me is CUDA_LOG_FILE. You can finally get error strings to describe the error you….
0
1
0