Mako

@mako_dev_ai

Followers: 61 · Following: 15 · Media: 4 · Statuses: 18

AI-powered GPU kernel generation enabling continuous optimization and universal deployment

Joined January 2025
@mako_dev_ai
Mako
15 days
Introducing MakoGenerate, a CUDA-writing AI agent. We're excited to make the research preview of MakoGenerate available today, completely free.
@mako_dev_ai
Mako
3 days
Try it for free on
@mako_dev_ai
Mako
3 days
You can now generate GPU kernels in #CUDA and #Triton for any arbitrary PyTorch code you have. Give it a shot!
@wAIeedatallah
Waleed Atallah
3 days
MakoGenerate now supports custom problems, meaning you can generate #CUDA or #Triton kernels for any @PyTorch reference code you have! Let's walk through an example using @GPU_MODE's latest contest: the Triangle Multiplicative Update (TriMul) module.
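For a sense of what "any reference code" means in practice, here is a minimal, self-contained sketch of the underlying contract: a generated kernel is accepted only if it matches the user's reference computation on test inputs. Everything below (function names, the toy math, the tolerance) is illustrative, not MakoGenerate's actual harness; the real tool takes PyTorch modules rather than this pure-Python stand-in.

```python
def reference(a, b):
    """Ground-truth computation the generated kernel must match
    (stand-in for a PyTorch reference module)."""
    return [x * y + 1.0 for x, y in zip(a, b)]

def candidate(a, b):
    """Pretend 'generated kernel' to validate (here just another
    pure-Python implementation of the same math)."""
    out = []
    for i in range(len(a)):
        out.append(a[i] * b[i] + 1.0)
    return out

def matches_reference(ref, gen, inputs, tol=1e-6):
    """Accept the generated kernel only if its outputs agree with
    the reference on every test input, within tolerance."""
    for args in inputs:
        expected, got = ref(*args), gen(*args)
        if any(abs(e - g) > tol for e, g in zip(expected, got)):
            return False
    return True

inputs = [([1.0, 2.0], [3.0, 4.0]), ([0.5, -1.0], [2.0, 2.0])]
ok = matches_reference(reference, candidate, inputs)  # → True
```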
@mako_dev_ai
Mako
14 days
And this is just the beginning! There are so many new features to explore and evaluate. LLMs+Search+RL is proving to be a game changer in capability. If this kind of work excites you, apply at .
@mako_dev_ai
Mako
14 days
@METR_Evals Level 5 covers more complex, real-world kernels, including DeepSeek MLA, among others. MakoGenerate wins on 4/14 kernels.
@mako_dev_ai
Mako
14 days
KernelBench Level 2 includes slightly more complex operations with simple fusion patterns. MakoGenerate again wins on 68/100 problems.
@mako_dev_ai
Mako
14 days
@ScalingIntelLab KernelBench Level 1 includes simple PyTorch operations like matmul or linear layers. MakoGenerate matches or beats torch.compile on 68/100 problems.
@mako_dev_ai
Mako
14 days
MakoGenerate with Evolutionary Search is already creating production-quality #CUDA kernels that beat torch.compile and expert-written kernels on real-world use cases. We'll be posting examples with code throughout the week, but a few highlights are below 🧵. (ps we're hiring)
@mako_dev_ai
Mako
15 days
Iterative refinement is pretty neat and can yield some decent results, but the real innovation is in applying evolutionary search. Stay tuned for some cool results coming later this week.
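A rough, self-contained sketch of what an evolutionary-search loop over kernel candidates looks like. The toy "tile size" objective stands in for measured latency, and `evolve`/`mutate` are illustrative names under my own assumptions, not Mako's actual implementation.

```python
import random

def evolve(seed_candidate, mutate, fitness, generations=20, population=8, rng=None):
    """Toy evolutionary search: mutate a population of candidates and
    keep the fittest each generation. Placeholder for searching over
    kernel variants scored by measured latency (lower = better)."""
    rng = rng or random.Random(0)
    best = seed_candidate
    pop = [seed_candidate] * population
    for _ in range(generations):
        children = [mutate(p, rng) for p in pop]
        pop = sorted(pop + children, key=fitness)[:population]
        if fitness(pop[0]) < fitness(best):
            best = pop[0]
    return best

# Toy problem: tune a "tile size" whose pretend latency is minimized at 128.
latency = lambda tile: (tile - 128) ** 2
mutate = lambda tile, rng: max(1, tile + rng.choice([-32, -8, -1, 1, 8, 32]))

best = evolve(64, mutate, latency)
```

The contrast with iterative refinement is that a population explores many mutations in parallel and selection keeps only improvements, rather than patching a single candidate in place.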
@mako_dev_ai
Mako
15 days
RT @wAIeedatallah: The research preview of MakoGenerate is available today, completely free. Keep reading to see what it does, how to try i….
@mako_dev_ai
Mako
15 days
Free @nvidia Blackwell GPUs for code generation and testing???? 👀 how long can we keep this up???
@mako_dev_ai
Mako
15 days
Try it for free at . This research preview is a fun way to see how well different models do at GPU code generation.
@mako_dev_ai
Mako
3 months
We started benchmarking @Meta Llama 4 Scout on @AMD MI300X and @NVIDIA H100. Shoutout to the zero-day support making life easier. Using AMD's vLLM container on long-ish context lengths (5000/1000) we get the following:
2x MI300X - 526 tps
4x MI300X - 996 tps
8x MI300X - 1144 tps
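Taking the reported throughputs at face value, a quick scaling check (my arithmetic against the tweet's numbers, treating the 2x MI300X result as the baseline):

```python
# Reported tokens/sec from the benchmark above.
tps = {2: 526, 4: 996, 8: 1144}

# Scaling efficiency vs. perfect linear scale-up from the 2-GPU result:
# n GPUs should ideally deliver (n/2) times the 2-GPU throughput.
baseline = tps[2]
efficiency = {n: tps[n] / (baseline * (n / 2)) for n in tps}
# 4 GPUs retain roughly 95% of linear scaling; 8 GPUs drop to about 54%.
```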
@mako_dev_ai
Mako
4 months
And now with @AMD's latest AITER library, there's even more performance to be unlocked! Exciting times ahead.
@mako_dev_ai
Mako
4 months
The secret sauce? A combination of:
AMD's Composable Kernel for Flash Attention
GEMM tuning via PyTorch TunableOps
Liger Kernel's normalization layers (at high batch sizes)
torch.compile for everything else
@mako_dev_ai
Mako
4 months
There's no single "best" kernel library for operations like attention. The optimal choice depends on your specific workload, batch size, and hardware. Our Mako Compiler automates the process of finding the best combinations.
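The selection idea can be sketched in a few lines: time each candidate implementation on the actual workload and keep the fastest. The toy "backends" below are illustrative stand-ins under my own assumptions; the real compiler would choose among actual kernel libraries.

```python
import time

def autotune(candidates, args, repeats=5):
    """Time each candidate implementation on the real workload and
    return the fastest one -- the core of picking a kernel library
    per workload rather than globally."""
    def best_time(fn):
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(*args)
            times.append(time.perf_counter() - t0)
        return min(times)  # min over repeats reduces timing noise
    return min(candidates, key=best_time)

# Two interchangeable "backends" computing the same reduction.
def backend_sum(xs):
    return sum(xs)

def backend_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(100_000))
chosen = autotune([backend_sum, backend_loop], (data,))
```

Taking the minimum over several repeats, and timing on the caller's own inputs, mirrors why the best choice shifts with workload, batch size, and hardware.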
@mako_dev_ai
Mako
4 months
Kernels Together Strong 🦧 Our last blog post showed how combining multiple kernel libraries can deliver state-of-the-art performance for AI models. We achieved up to 60% speedup on the FLUX.1-schnell model using #AMD MI300X GPUs! #AI #GPUOptimization