Felix
@felix_red_panda
5K Followers · 22K Following · 192 Media · 6K Statuses
speech synthesis and LLM nerd, DMs open, working on LLM stuff
Berlin, Germany
Joined June 2020
Part 2 of the Penny worklog out now! This time we're fixing our previous shortcomings on small buffer sizes and we're outperforming NCCL on all sizes that matter for LLM inference 🧵
Ever wondered why embedding models are offered so cheaply? It's because you can process literally billions of tokens a day even on a consumer-grade GPU like the 4090. Check out our new post investigating the economics of embedding model inference. Link in the next tweet
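A rough sketch of the arithmetic behind that claim. The throughput and GPU rental price below are made-up assumptions for illustration, not figures from the linked post:

```python
# Toy back-of-the-envelope estimate of embedding inference economics.
# Both constants are illustrative assumptions, not measured values.

TOKENS_PER_SECOND = 40_000   # assumed embedding throughput on a 4090
GPU_COST_PER_HOUR = 0.40     # assumed rental price in USD

tokens_per_day = TOKENS_PER_SECOND * 86_400
cost_per_day = GPU_COST_PER_HOUR * 24
cost_per_billion_tokens = cost_per_day / (tokens_per_day / 1e9)

print(f"{tokens_per_day / 1e9:.1f}B tokens/day")
print(f"${cost_per_billion_tokens:.2f} per billion tokens")
```

At these assumed numbers a single card churns through a few billion tokens a day for under ten dollars of rental cost, which is why per-token embedding prices can be so low.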
Why does embedding the entire Wikipedia only cost a few dollars? Deep dive blog post, link below
amazing total lunar eclipse yesterday (picture taken with a 600mm lens)
What are the profit margins of serving DeepSeek 🐳? @schreiberic and I discuss large-scale MoE inference in depth. Blog post link below
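The shape of that margin question can be sketched in a few lines. Every number here is a hypothetical assumption (price, node cost, throughput), not a figure from the blog post:

```python
# Hypothetical margin calculation for serving a large MoE model.
# All three constants are made-up assumptions for illustration only.

price_per_m_output_tokens = 1.10   # assumed selling price, USD per million tokens
node_cost_per_hour = 20.0          # assumed cost of a multi-GPU inference node, USD
node_tokens_per_second = 10_000    # assumed aggregate decode throughput across users

revenue_per_hour = node_tokens_per_second * 3600 / 1e6 * price_per_m_output_tokens
margin = (revenue_per_hour - node_cost_per_hour) / revenue_per_hour
print(f"revenue ${revenue_per_hour:.2f}/h, margin {margin:.0%}")
```

The interesting part in practice is the aggregate throughput term: with MoE models it depends heavily on batch size and expert-parallel layout, which is what the post digs into.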
single-threaded vector masked bit group counting + mmap, 821x faster, 11.4GiB/s
wrote the same word-counting program 5 times, each faster than the last. Best result: 494× faster than my first Python version, by using SIMD in C. All are O(n), but some squeeze far more from the CPU and memory!
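For flavor, here is what two of the "same program, written again" steps could look like. This is a sketch, not the code from the thread: a naive Python counter and an explicit one-pass state machine, the formulation that maps naturally onto the masked SIMD version in C:

```python
# Two word counters, both O(n). The state-machine variant counts transitions
# from whitespace to non-whitespace, which is the logic a SIMD/bit-group
# implementation vectorizes; in pure Python it is not the fast one.

def count_words_naive(text: str) -> int:
    return len(text.split())

WHITESPACE = {32, 9, 10, 13, 11, 12}  # space, tab, \n, \r, \v, \f

def count_words_state_machine(data: bytes) -> int:
    count = 0
    in_word = False
    for b in data:
        is_space = b in WHITESPACE
        if not is_space and not in_word:
            count += 1   # a new word starts here
        in_word = not is_space
    return count

sample = "the quick  brown fox\njumps over"
assert count_words_naive(sample) == count_words_state_machine(sample.encode()) == 6
```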
Re: GPT-4o deprecation: LLMs are highly susceptible to Hyrum's Law
Adding multi-level performance models to diagrams. This will allow performance models of FlashAttention / matmul / distributed MoEs to be dynamically calculated. Colors indicate execution at different levels, and the hexagons indicate a partitioned axis.
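A toy sketch of what a multi-level performance model computes, in the roofline spirit: predicted kernel time is bounded by compute and by data movement at each memory level. All hardware numbers and traffic estimates below are assumptions, not the model from the diagrams:

```python
# Toy multi-level roofline-style model. Every constant is an assumption.

PEAK_FLOPS = 1.0e15                            # assumed peak FLOP/s
BANDWIDTH = {"HBM": 3.0e12, "L2": 10.0e12}     # assumed bytes/s per level

def predicted_time(flops: float, traffic: dict) -> float:
    # Bounded by compute time and by the slowest memory level.
    times = [flops / PEAK_FLOPS]
    times += [traffic[lvl] / BANDWIDTH[lvl] for lvl in traffic]
    return max(times)

# Square fp16 matmul, N = 4096: 2N^3 FLOPs, 3N^2 fp16 elements from HBM,
# plus an assumed 4x that volume through L2.
n = 4096
flops = 2 * n**3
traffic = {"HBM": 3 * n * n * 2, "L2": 12 * n * n * 2}
print(f"{predicted_time(flops, traffic) * 1e6:.0f} us")  # compute-bound here
```

A FlashAttention or distributed-MoE model adds more levels (registers, network links) and a partitioned axis just scales the per-partition FLOPs and traffic before taking the same max.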
I solved every single problem in the CUDA mode book. A quick thread summarizing this experience and what I learned 1/x