
Thien Tran (@gaunernst)
Followers: 756 · Following: 9K · Media: 89 · Statuses: 627
Reminds me of the Importance Sampling stuff from when I did Peter Shirley's raytracing series.
Failing on large-scale RL with VeRL? ⚠️ Mixing an inference backend (vLLM/SGLang) with a training backend (FSDP/Megatron) secretly turns your RL into off-policy, even if they share the same weights! Blog:
0 replies · 0 reposts · 6 likes
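A minimal sketch of the kind of off-policy correction the importance-sampling comment above points at: because the rollout engine (vLLM/SGLang) and the trainer (FSDP/Megatron) compute slightly different probabilities even with identical weights, a common fix is to reweight the policy-gradient loss with a truncated importance ratio between trainer and rollout log-probs. This is not VeRL's actual code; the function and argument names are hypothetical.

```python
# Hypothetical sketch: truncated importance sampling (TIS) to correct for the
# trainer/rollout numerical mismatch. Not VeRL's implementation.
import torch

def tis_policy_loss(
    trainer_logprobs: torch.Tensor,   # log pi_train(a_t | s_t), recomputed by the training backend
    rollout_logprobs: torch.Tensor,   # log pi_rollout(a_t | s_t), returned by the inference backend
    advantages: torch.Tensor,         # per-token advantages
    mask: torch.Tensor,               # 1 for response tokens, 0 for prompt/padding
    clip_ratio: float = 2.0,          # truncation cap for the importance weight
) -> torch.Tensor:
    # Importance weight between the policy that generated the data (rollout
    # engine) and the policy being optimized (training engine). With identical
    # weights this would be exactly 1; in practice kernel/numerics differences
    # push it away from 1, which is what makes the data off-policy.
    log_ratio = trainer_logprobs - rollout_logprobs.detach()
    ratio = torch.exp(log_ratio)

    # Truncate (clip from above) to bound the variance of the estimator.
    weight = torch.clamp(ratio, max=clip_ratio)

    # REINFORCE-style surrogate, reweighted token by token; the gradient flows
    # only through trainer_logprobs, with the weight treated as a constant.
    per_token = -weight.detach() * advantages * trainer_logprobs
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```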
I was trying to add Wan2.2 to my repo just yesterday.
github.com: Adding Qwen-Image. This PR introduces Qwen-Image into the diffusers library. For more information about Qwen-Image and advanced usage examples, see the official repository: Qwen-Image. Thanks the su...
0 replies · 0 reposts · 8 likes
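For context, a hedged usage sketch of the pipeline that PR adds, written against the usual diffusers text-to-image API. The "Qwen/Qwen-Image" model id, dtype, and generation arguments are assumptions, not taken from the PR.

```python
# Hedged sketch: loading the Qwen-Image pipeline through the standard diffusers
# entry point. Check the PR and the official repository for the exact repo id
# and recommended settings.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",              # assumed Hugging Face model id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a watercolor painting of a mountain village at dawn",
    num_inference_steps=50,
).images[0]
image.save("qwen_image_sample.png")
```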
How to beat CUTLASS (on 5090 FP8)? Use a faster instruction 🤣. Lolz aside, >100% SOL (speed-of-light) is of course not possible. With the faster instruction the SOL figure is doubled, so % SOL is actually only ~60%.
I can confirm that mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.f32.e4m3.e4m3.f32.ue8m0 is faster than mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 on the 5090. mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.f32.e4m3.e4m3.f32 has the same speed as the old one.
5 replies · 1 repost · 48 likes
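A small worked example of the % SOL bookkeeping in the tweet above: measuring against the peak implied by the plain FP8 mma path can report more than 100%, while the block-scale path doubles the true peak and brings the same kernel down to roughly 60%. The measured-throughput number below is illustrative, not a reported benchmark.

```python
# Illustrative arithmetic only: how ">100% SOL" against the wrong peak becomes
# ~60% against the right one. The measured value is made up.
assumed_peak_tflops = 1.0   # peak implied by the plain FP8 mma path (normalized)
measured_tflops = 1.2       # hypothetical kernel throughput -> "120% SOL", impossible

naive_pct_sol = measured_tflops / assumed_peak_tflops   # >100%, a red flag
true_peak_tflops = 2 * assumed_peak_tflops              # block-scale mma path doubles the peak
actual_pct_sol = measured_tflops / true_peak_tflops     # ~60%

print(f"naive % SOL:  {naive_pct_sol:.0%}")   # 120%
print(f"actual % SOL: {actual_pct_sol:.0%}")  # 60%
```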
I can confirm that mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.f32.e4m3.e4m3.f32.ue8m0 is faster than mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 on the 5090. mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.f32.e4m3.e4m3.f32 has the same speed as the old one.
Is there any MXFP8 matmul implementation in Triton? I just found that Blackwell's FP8/6/4 support actually goes through a different instruction than Ada Lovelace's FP8 instruction. Ada only supports a single scale, and this instruction on Blackwell will actually have the same speed as FP16.
3 replies · 2 reposts · 62 likes
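On the Triton question: a hedged sketch of a block-scaled FP8 matmul written with plain tl.dot, applying one scale per 32-wide K group to each group's partial product. This only emulates MXFP8 semantics at the Triton level; whether the compiler lowers it to the Blackwell block_scale MMA named above is not guaranteed, and the kernel name, layouts, and block sizes here are assumptions.

```python
# Hedged sketch of an MXFP8-style matmul in Triton: FP8 (e4m3) tiles multiplied
# with tl.dot, then rescaled with one scale per 32-wide K group (MX block size).
# Assumes M, N, K are multiples of the block sizes; scales are stored as float32
# here for simplicity (real MXFP8 uses UE8M0, i.e. power-of-two exponents).
import torch
import triton
import triton.language as tl


@triton.jit
def mxfp8_matmul_kernel(
    a_ptr, b_ptr, c_ptr, a_scale_ptr, b_scale_ptr,
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,  # 32 = one MX scale group
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    num_groups = K // BLOCK_K
    for g in range(num_groups):
        offs_k = g * BLOCK_K + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)  # (BLOCK_M, 32) e4m3
        b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)  # (32, BLOCK_N) e4m3
        # One scale per row of A / column of B for this 32-wide K group.
        sa = tl.load(a_scale_ptr + offs_m * num_groups + g)  # (BLOCK_M,)
        sb = tl.load(b_scale_ptr + offs_n * num_groups + g)  # (BLOCK_N,)
        # The scale is constant over the group, so it can be applied to the
        # group's partial product instead of dequantizing A and B elementwise.
        acc += tl.dot(a, b) * (sa[:, None] * sb[None, :])

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)


def mxfp8_matmul(a, b, a_scale, b_scale, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    # a: (M, K) float8_e4m3fn, b: (K, N) float8_e4m3fn,
    # a_scale: (M, K // 32) float32, b_scale: (N, K // 32) float32.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (M // BLOCK_M, N // BLOCK_N)
    mxfp8_matmul_kernel[grid](
        a, b, c, a_scale, b_scale,
        M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c
```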