Vijay @__tensorcore__ X Profile

Vijay

@__tensorcore__

Followers

2K

Following

8K

Media

59

Statuses

1K

MLIR, CUTLASS, Tensor Core arch @NVIDIA. Mechanic @hpcgarage. Exercise of any 1st amendment rights is for none other than myself.

Joined July 2015

Vijay

@__tensorcore__

2 months

🚨🔥 CUTLASS 4.0 is released 🔥🚨. pip install nvidia-cutlass-dsl. 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python.

15

82

410

Vijay

@__tensorcore__

1 month

Another 🔥 blog about CUTLASS from @colfaxintl, this time focusing on the gory details of block-scaled MXFP and NVFP data types and Blackwell kernels for them.

0

36

158

Vijay

@__tensorcore__

1 month

RT @MoonL88537: did i mention that this is totally nuts?

0

446

0

Vijay

@__tensorcore__

1 month

RT @tri_dao: We've been thinking about what the "ideal" architecture should look like in the era where inference is driving AI progress. GT….

0

56

0

Vijay

@__tensorcore__

2 months

Every GPU kernel writer in shambles

4

8

118

Vijay

@__tensorcore__

2 months

RT @jinaycodes: Introducing soarXiv ✈️, the most beautiful way to explore human knowledge. Take any paper's URL and replace arxiv with soar….

0

1K

0

Vijay

@__tensorcore__

2 months

RT @elliotarledge: timelapse #58 (14.5 hrs): - used cutlass python DSL to increase elementwise add/mul memory throughput (from pytorch 500….

0

3

0

Vijay

@__tensorcore__

2 months

RT @tri_dao: I love Cutlass, and this new Python DSL looks very well-designed. Will for sure accelerate kernel dev + exploring new ideas in….

0

25

0

Vijay

@__tensorcore__

2 months

Lastly, I want to say a massive thank you for the work (and sacrifices) of everyone who has worked on this project. This is the first time CUTLASS has done a compiler, and it required a lot of collaboration across our CUDA, driver, compiler, Python, frameworks, and DevTech teams.

2

0

21

Vijay

@__tensorcore__

2 months

CuTe DSL will be in beta for the next few months. We would love to hear your feedback and suggestions, especially during this period. Check us out on GitHub and file issues or contribute examples.

1

0

15

Vijay

@__tensorcore__

2 months

If you missed it, you can also watch our 40 min GTC talk that dives deep into CuTe DSL

1

17

Vijay

@__tensorcore__

2 months

We have also heard your feedback asking for better documentation. CuTe DSL documentation and all existing CUTLASS C++ documentation now share a new home with a fresh coat of paint!

1

0

19

Vijay

@__tensorcore__

2 months

We even have a series of Jupyter notebooks to get you started. My favorite is the one that teaches how to print types and values at both compile time and runtime. One of the best and most frequently used ways to debug kernels 😊.

2

1

24

Vijay

@__tensorcore__

2 months

🔋 Batteries included in the form of many new examples across multiple architectures 🔌. Of note are Ampere FlashAttention-2 and Blackwell FlashAttention-3 implementations that are at parity with C++ implementations in terms of performance.

1

2

30

Vijay

@__tensorcore__

2 months

We believe low-level access to hardware is extremely important. High-level generators rob programmers of the freedom to experiment with new ideas and kernel designs, while C++ is too slow to compile, learn, and debug. CuTe DSL provides the best of both worlds ⚡

2

27

Vijay

@__tensorcore__

2 months

This initial release ships with CuTe DSL, a programming language that is fully consistent with CuTe C++ in its programming model, APIs, abstraction level, and performance. Kernels in CuTe DSL look and feel like CuTe C++ but compile 100x faster without compromise in performance.

1

28

Vijay

@__tensorcore__

2 months

RT @memorypaladin: Most exciting addition in CUDA 12.9 for me is CUDA_LOG_FILE. You can finally get error strings to describe the error you….

0

1

0