Thien Tran Profile
Thien Tran

@gaunernst

Followers: 756
Following: 9K
Media: 89
Statuses: 627

Singapore
Joined October 2016
@gaunernst
Thien Tran
1 day
Meanwhile, vision folks
Tweet media one
2
0
9
@gaunernst
Thien Tran
1 day
Looking at RoPE. Why does everyone do RoPE in BF16 for LLMs 👁️
Tweet media one
4
3
92
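A hedged guess at the concern (the tweet doesn't spell it out): computing the rotary angles in BF16 loses precision at long context lengths. A minimal sketch of my own, comparing angles computed in BF16 vs an FP32 reference:

```python
# Minimal sketch (assumption: the issue is angle precision at long context),
# comparing RoPE angles computed in BF16 vs FP32. Not code from the tweet.
import torch

head_dim = 128
base = 10000.0
positions = torch.arange(32768)  # long-context position ids

# inverse frequencies for the even dimensions, as in standard RoPE
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# FP32 reference angles: theta[p, i] = p * inv_freq[i]
angles_fp32 = positions.float()[:, None] * inv_freq[None, :]

# BF16 path: cast inputs first, as a BF16-everywhere implementation would
angles_bf16 = positions.to(torch.bfloat16)[:, None] * inv_freq.to(torch.bfloat16)[None, :]

# compare the resulting rotations
err = (torch.cos(angles_fp32) - torch.cos(angles_bf16.float())).abs().max()
print(f"max |cos(theta)| error with BF16 angles: {err.item():.3f}")
```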
@gaunernst
Thien Tran
2 days
Reminds me of the importance sampling stuff from when I did Peter Shirley's ray tracing series.
@fengyao1909
Feng Yao
5 days
Failing on large-scale RL with VeRL? ⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy — even if they share the same weights! 📉 Blog:
Tweet media one
0
0
6
@gaunernst
Thien Tran
8 days
Kinda hate that autocast is sprinkled generously across the codebase. Reasons to NOT use autocast:
- Dtype casting rules can be unexpected.
- Different casting behavior from FSDP's.
- Non-trivial overhead.
Would be better to explicitly do .float() / .bfloat16() instead.
Tweet media one
0
0
5
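A minimal sketch of my own (not from the tweet) contrasting the two styles: under autocast, per-op dtypes come from PyTorch's casting rules; with explicit casts, every dtype transition is visible at the call site.

```python
# Minimal sketch contrasting torch.autocast with explicit casts.
# The modules and dtype choices here are illustrative, not from the tweet.
import torch
import torch.nn as nn

x = torch.randn(4, 64)
linear = nn.Linear(64, 64)
norm = nn.LayerNorm(64)

# autocast: PyTorch's casting rules decide which ops run in BF16 vs FP32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_auto = norm(linear(x))

# explicit style: spell out each cast, so the precision of every step is clear
h = linear.to(torch.bfloat16)(x.bfloat16())   # matmul in BF16
y_explicit = norm(h.float())                  # LayerNorm kept in FP32

print(y_auto.dtype, y_explicit.dtype)
```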
@gaunernst
Thien Tran
8 days
just found out Wan2.2 was released in FP32 👁️. Was wondering why I got OOM 🤣
Tweet media one
2
1
20
@gaunernst
Thien Tran
21 days
Let's see. My kernel is only off by 1e36. Doesn't look too shabby
Tweet media one
0
0
27
@gaunernst
Thien Tran
23 days
Prototyped an attention with MXFP8 for QK.T.
Good news: it's faster than BF16 attention.
Bad news: MXFP8 matmul becomes slower in e2e 🤡.
Have a feeling it's related to power limit / throttling.
Tweet media one
1
0
17
@gaunernst
Thien Tran
24 days
SageAttention folks are ingenious. Per-thread quantization scaling based on MMA accumulator thread layout to utilize INT4 Tensor Cores 🤯. Figure taken from SageAttention2 paper -
Tweet media one
0
1
35
@gaunernst
Thien Tran
26 days
How to beat Cutlass (on 5090 FP8)? Use a faster instruction 🤣. Lolz aside, >100% SOL is ofc not possible. With the faster instruction, SOL is doubled, so % SOL is actually only ~60%.
Tweet media one
@gaunernst
Thien Tran
29 days
I can confirm that mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.f32.e4m3.e4m3.f32.ue8m0 is faster than mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 on 5090. mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.f32.e4m3.e4m3.f32 has the same speed as old one.
5
1
48
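A tiny worked example with made-up numbers (not the tweet's measurements) of why ">100% SOL" can show up: the speed-of-light reference was computed against the slower instruction.

```python
# Illustrative arithmetic only; the TFLOPS values below are hypothetical.
sol_plain_fp8_mma = 100.0                       # assumed peak for the plain FP8 mma
sol_block_scaled_mma = 2 * sol_plain_fp8_mma    # the faster instruction doubles SOL
achieved = 120.0                                # hypothetical measured throughput

print(achieved / sol_plain_fp8_mma)     # 1.2 -> looks like ">100% SOL"
print(achieved / sol_block_scaled_mma)  # 0.6 -> ~60% of the real SOL
```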
@gaunernst
Thien Tran
29 days
This is funny since the more complicated instruction is apparently faster. Follow-up question: why can't the PTX->SASS compiler emit the faster SASS automatically when the slow PTX is used?
0
0
9
@gaunernst
Thien Tran
29 days
I can confirm that mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.f32.e4m3.e4m3.f32.ue8m0 is faster than mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 on 5090. mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.f32.e4m3.e4m3.f32 has the same speed as old one.
@KBlueleaf
็ฅ็€้’่‘‰@LyCORIS
2 months
Is there any MXFP8 matmul implementation in Triton? Just found that Blackwell's FP8/6/4 support is actually through a different instruction than Ada Lovelace's FP8 instruction. Ada only supports a single scale, and this instruction on Blackwell actually has the same speed as FP16.
3
2
62
@gaunernst
Thien Tran
29 days
I should be reading SageAttention papers.
0
0
7
@gaunernst
Thien Tran
29 days
In attention, the 1st matmul Q @ K.T does reduction over head_dim, which is typically small, e.g. 128. This seems ideal for low-bit matmul? I think even INT8 with 1 scale per head_dim would work fine? The 2nd matmul P @ V is a different beast, due to reduction along the sequence dim.
3
0
22
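A rough sketch of the first-matmul idea (my own, not the tweet's code): quantize each Q/K row to INT8 with a single scale over head_dim, do the matmul with integer accumulation (emulated in int32 here), then rescale.

```python
# Rough sketch of INT8 Q @ K.T with one scale per row over head_dim.
# Integer Tensor Core accumulation is emulated with an int32 matmul here.
import torch

def quantize_int8_per_row(x):
    # one scale per row, i.e. per query/key vector, covering the head_dim axis
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

seq, head_dim = 256, 128
q = torch.randn(seq, head_dim)
k = torch.randn(seq, head_dim)

q_i8, q_scale = quantize_int8_per_row(q)
k_i8, k_scale = quantize_int8_per_row(k)

# Q @ K.T with integer accumulation, then rescale back to float
scores_i32 = q_i8.to(torch.int32) @ k_i8.to(torch.int32).T
scores = scores_i32.float() * q_scale * k_scale.T

print("max abs error vs FP32:", (scores - q @ k.T).abs().max().item())
```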
@gaunernst
Thien Tran
1 month
Managed to get mma.block_scale working. The good thing is, the scale layout is more sane than I thought initially. The bad thing is, it means the diagram is wrong. Or it's just very confusing. Not sure where the lower/higher address/index is. NVIDIA folks perhaps can clarify.
@gaunernst
Thien Tran
1 month
Have been staring at this for the last few days 😵‍💫
Tweet media one
0
0
14
@gaunernst
Thien Tran
1 month
Was quite surprised that Qwen3 impl in HF transformers looks very clean. And some layers use kernels from HF kernels hub 👀
Tweet media one
1
1
51
@gaunernst
Thien Tran
1 month
Have been staring at this for the last few days 😵‍💫
Tweet media one
2
0
11
@gaunernst
Thien Tran
1 month
cuDNN seems to be only using sm80 features though. So maybe with TMA, it can be even faster.
Tweet media one
1
0
8
@gaunernst
Thien Tran
1 month
Faster than FA2 on 5090 👀. Hopefully I can get to cuDNN speed.
Tweet media one
6
0
36
@gaunernst
Thien Tran
1 month
SageAttention3 has NVFP4 attention for 5090 👀. Doesn't look like code has been released yet though.
Tweet media one
@gaunernst
Thien Tran
2 months
Bottlenecked by attention. Will need to grab FP8 attention from somewhere
Tweet media one
2
5
49