Thien Tran Profile
Thien Tran

@gaunernst

Followers: 756
Following: 9K
Media: 89
Statuses: 627

Singapore
Joined October 2016
@gaunernst
Thien Tran
1 day
Meanwhile, vision folks
Tweet media one
2
0
9
@gaunernst
Thien Tran
1 day
Looking at RoPE. Why does everyone do RoPE in BF16 for LLMs 👁️
Tweet media one
4
3
92
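A hedged guess at the concern (the tweet doesn't spell it out): computing the rotary angles in BF16 loses precision at long context lengths. A minimal sketch of my own, comparing angles computed in BF16 vs an FP32 reference:

```python
# Minimal sketch (assumption: the issue is angle precision at long context),
# comparing RoPE angles computed in BF16 vs FP32. Not code from the tweet.
import torch

head_dim = 128
base = 10000.0
positions = torch.arange(32768)  # long-context position ids

# inverse frequencies for the even dimensions, as in standard RoPE
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# FP32 reference angles: theta[p, i] = p * inv_freq[i]
angles_fp32 = positions.float()[:, None] * inv_freq[None, :]

# BF16 path: cast inputs first, as a BF16-everywhere implementation would
angles_bf16 = positions.to(torch.bfloat16)[:, None] * inv_freq.to(torch.bfloat16)[None, :]

# compare the resulting rotations
err = (torch.cos(angles_fp32) - torch.cos(angles_bf16.float())).abs().max()
print(f"max |cos(theta)| error with BF16 angles: {err.item():.3f}")
```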
@gaunernst
Thien Tran
2 days
Reminds me of the importance sampling stuff from when I did Peter Shirley's ray tracing series.
@fengyao1909
Feng Yao
5 days
Failing on large-scale RL with VeRL? ⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy — even if they share the same weights! 📉 Blog:
Tweet media one
0
0
6
@gaunernst
Thien Tran
8 days
Kinda hate that autocast is sprinkled generously across the codebase. Reasons to NOT use autocast:
- Dtype casting rules can be unexpected.
- Different casting behavior from FSDP's.
- Non-trivial overhead.
Would be better to explicitly do .float() / .bfloat16() instead.
Tweet media one
0
0
5
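A minimal sketch of my own (not from the tweet) contrasting the two styles: under autocast, per-op dtypes come from PyTorch's casting rules; with explicit casts, every dtype transition is visible at the call site.

```python
# Minimal sketch contrasting torch.autocast with explicit casts.
# The modules and dtype choices here are illustrative, not from the tweet.
import torch
import torch.nn as nn

x = torch.randn(4, 64)
linear = nn.Linear(64, 64)
norm = nn.LayerNorm(64)

# autocast: PyTorch's casting rules decide which ops run in BF16 vs FP32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_auto = norm(linear(x))

# explicit style: spell out each cast, so the precision of every step is clear
h = linear.to(torch.bfloat16)(x.bfloat16())   # matmul in BF16
y_explicit = norm(h.float())                  # LayerNorm kept in FP32

print(y_auto.dtype, y_explicit.dtype)
```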
@gaunernst
Thien Tran
8 days
just found out Wan2.2 was released in FP32 👁️. Was wondering why I got OOM 🤣
Tweet media one
2
1
20
@gaunernst
Thien Tran
21 days
Let's see. My kernel is only off by 1e36. Doesn't look too shabby
Tweet media one
0
0
27
@gaunernst
Thien Tran
23 days
Prototyped an attention with MXFP8 for QK.T.
Good news: it's faster than BF16 attention.
Bad news: MXFP8 matmul becomes slower in e2e 🤡.
Have a feeling it's related to power limit / throttling.
Tweet media one
1
0
17
@gaunernst
Thien Tran
24 days
SageAttention folks are ingenious. Per-thread quantization scaling based on MMA accumulator thread layout to utilize INT4 Tensor Cores 🤯. Figure taken from SageAttention2 paper -
Tweet media one
0
1
35
@gaunernst
Thien Tran
26 days
How to beat Cutlass (on 5090 FP8)? Use a faster instruction 🤣. Lolz aside, >100% SOL is ofc not possible. With the faster instruction, SOL is doubled, so % SOL is actually only ~60%.
Tweet media one
@gaunernst
Thien Tran
29 days
I can confirm that mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.f32.e4m3.e4m3.f32.ue8m0 is faster than mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 on 5090. mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.f32.e4m3.e4m3.f32 has the same speed as old one.
5
1
48
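A tiny worked example with made-up numbers (not the tweet's measurements) of why ">100% SOL" can show up: the speed-of-light reference was computed against the slower instruction.

```python
# Illustrative arithmetic only; the TFLOPS values below are hypothetical.
sol_plain_fp8_mma = 100.0                       # assumed peak for the plain FP8 mma
sol_block_scaled_mma = 2 * sol_plain_fp8_mma    # the faster instruction doubles SOL
achieved = 120.0                                # hypothetical measured throughput

print(achieved / sol_plain_fp8_mma)     # 1.2 -> looks like ">100% SOL"
print(achieved / sol_block_scaled_mma)  # 0.6 -> ~60% of the real SOL
```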
@gaunernst
Thien Tran
29 days
This is funny since the more complicated instruction is apparently faster. Follow-up question: why can't the PTX->SASS compiler emit the faster SASS automatically when the slow PTX is used?
0
0
9
@gaunernst
Thien Tran
29 days
I can confirm that mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.f32.e4m3.e4m3.f32.ue8m0 is faster than mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 on 5090. mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.f32.e4m3.e4m3.f32 has the same speed as old one.
@KBlueleaf
็ฅ็€้’่‘‰@LyCORIS
2 months
Is there any MXFP8 matmul implementation in Triton? Just found that Blackwell's FP8/6/4 support is actually through a different instruction than Ada Lovelace's FP8 instruction. Ada only supports a single scale, and this instruction on Blackwell actually has the same speed as FP16.
3
2
62
@gaunernst
Thien Tran
29 days
I should be reading SageAttention papers.
0
0
7
@gaunernst
Thien Tran
29 days
In attention, the 1st matmul Q @ K.T does reduction over head_dim, which is typically small, e.g. 128. This seems ideal for low-bit matmul? I think even INT8 with 1 scale per head_dim would work fine? The 2nd matmul P @ V is a different beast, due to reduction along the sequence dim.
3
0
22
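A rough sketch of the first-matmul idea (my own, not the tweet's code): quantize each Q/K row to INT8 with a single scale over head_dim, do the matmul with integer accumulation (emulated in int32 here), then rescale.

```python
# Rough sketch of INT8 Q @ K.T with one scale per row over head_dim.
# Integer Tensor Core accumulation is emulated with an int32 matmul here.
import torch

def quantize_int8_per_row(x):
    # one scale per row, i.e. per query/key vector, covering the head_dim axis
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

seq, head_dim = 256, 128
q = torch.randn(seq, head_dim)
k = torch.randn(seq, head_dim)

q_i8, q_scale = quantize_int8_per_row(q)
k_i8, k_scale = quantize_int8_per_row(k)

# Q @ K.T with integer accumulation, then rescale back to float
scores_i32 = q_i8.to(torch.int32) @ k_i8.to(torch.int32).T
scores = scores_i32.float() * q_scale * k_scale.T

print("max abs error vs FP32:", (scores - q @ k.T).abs().max().item())
```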
@gaunernst
Thien Tran
1 month
Managed to get mma.block_scale working. The good thing is, the scale layout is more sane than I thought initially. The bad thing is, it means the diagram is wrong. Or it's just very confusing. Not sure where the lower/higher address/index is. NVIDIA folks perhaps can clarify.
@gaunernst
Thien Tran
1 month
Have been staring at this for the last few days 😵‍💫
Tweet media one
0
0
14
@gaunernst
Thien Tran
1 month
Was quite surprised that Qwen3 impl in HF transformers looks very clean. And some layers use kernels from HF kernels hub 👀
Tweet media one
1
1
51
@gaunernst
Thien Tran
1 month
Have been staring at this for the last few days 😵‍💫
Tweet media one
2
0
11
@gaunernst
Thien Tran
1 month
cuDNN seems to be only using sm80 features though. So maybe with TMA, it can be even faster.
Tweet media one
1
0
8
@gaunernst
Thien Tran
1 month
Faster than FA2 on 5090 👀. Hopefully I can get to cuDNN speed.
Tweet media one
6
0
36
@gaunernst
Thien Tran
1 month
SageAttention3 has NVFP4 attention for 5090 👀. Doesn't look like code has been released yet though.
Tweet media one
@gaunernst
Thien Tran
2 months
Bottlenecked by attention. Will need to grab FP8 attention from somewhere
Tweet media one
2
5
49