
Mohamed Abdelfattah
@mohsaied
1K Followers · 465 Following · 87 Media · 615 Statuses
Assistant Prof @CornellECE and @cornell_tech. At the intersection of machine learning and hardware. Father. Muslim.
NYC
Joined March 2009
Feature drop from @mako_dev_ai!! You can now define your own custom PyTorch problem to generate GPU kernels on our #makogenerate platform!
MakoGenerate now supports custom problems, meaning you can generate #CUDA or #Triton kernels for any @PyTorch reference code you have! Let's walk through an example using @GPU_MODE's latest contest: the Triangle Multiplicative Update (TriMul) module.
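For context, a "custom problem" is just PyTorch reference code that generated kernels get checked against. Below is a rough sketch of what a TriMul ("outgoing" triangle multiplicative update) reference might look like; the layer names, gating details, and normalization placement are assumptions for illustration, not the contest's official spec.

```python
import torch
import torch.nn as nn

class TriMulOutgoing(nn.Module):
    """Sketch of a triangle multiplicative update ("outgoing" variant).

    Operates on an AlphaFold-style pair representation of shape
    (batch, N, N, C). Projection/gating details are illustrative assumptions.
    """
    def __init__(self, c: int, c_hidden: int):
        super().__init__()
        self.norm_in = nn.LayerNorm(c)
        self.proj_a = nn.Linear(c, c_hidden)
        self.gate_a = nn.Linear(c, c_hidden)
        self.proj_b = nn.Linear(c, c_hidden)
        self.gate_b = nn.Linear(c, c_hidden)
        self.norm_out = nn.LayerNorm(c_hidden)
        self.proj_out = nn.Linear(c_hidden, c)
        self.gate_out = nn.Linear(c, c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm_in(x)
        # Gated projections of the pair representation
        a = torch.sigmoid(self.gate_a(x)) * self.proj_a(x)
        b = torch.sigmoid(self.gate_b(x)) * self.proj_b(x)
        # "Outgoing" triangle update: contract over the shared third index k
        t = torch.einsum("bikc,bjkc->bijc", a, b)
        return torch.sigmoid(self.gate_out(x)) * self.proj_out(self.norm_out(t))
```

A generated CUDA/Triton kernel would then be validated for numerical parity against this forward pass.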
We use large-scale text-to-text regression to predict specific parameters (e.g., utilization) of compute nodes in Google's datacenters, trained purely on a (very) large corpus of unstructured system logs! Paper: Code:
Seeing text-to-text regression work for Google’s massive compute cluster (a billion-$$ problem!) was the final result that convinced us we can reward-model literally any world feedback. Paper: Code: Just train a simple encoder-decoder.
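The framing in these posts treats regression as sequence-to-sequence: serialize the raw log into an input string, render the numeric target as a string, and let a standard encoder-decoder learn to emit it. A minimal sketch of that serialization step (the log fields and number format are invented for illustration; no model is trained here):

```python
def log_to_prompt(log: dict) -> str:
    """Serialize an (invented) unstructured log record into encoder input text."""
    return " ".join(f"{k}={v}" for k, v in sorted(log.items()))

def target_to_text(value: float, digits: int = 3) -> str:
    """Render the numeric regression target as the decoder's target string."""
    return f"{value:.{digits}f}"

def text_to_target(text: str) -> float:
    """Parse the decoder's generated string back into a number."""
    return float(text)

# Example: a hypothetical node-utilization record
prompt = log_to_prompt({"node": "n42", "cpus": 96, "job_kind": "batch"})
target = target_to_text(0.7312)
```

The appeal of this framing is that any feedback signal that can be written down as text becomes a valid training target, with no feature engineering on the logs.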
RT @gene_ch0u: We've released all code and models for FlashDepth! It produces depth maps from a 2k, streaming video in real-time. This was….
For now you can use your favorite LLM from @OpenAI, @AnthropicAI, and @Google. We're also fine-tuning our own LLM to generate the best GPU code. More on that soon.
Write a #GPU kernel in 60 seconds! Check out MakoGenerate (completely free) and watch our launch video below! :)
Introducing MakoGenerate, a CUDA-writing AI agent. We're excited to make the research preview of MakoGenerate available today, completely free.
Thank you @songhan_mit for an insightful and dense guest lecture on the latest quantization and model efficiency methods developed in your group for generative AI!
Palu is our latest work, recently accepted to @iclr_conf! What do @deepseek_ai and Palu have in common? We both perform low-rank projection of the KV-Cache to reduce the memory footprint of LLMs during inference. Read more:
Bitmod is our latest work, accepted at @HpcaArchCon! It's a mixed-precision bit-serial compute architecture capable of running per-group quantized LLMs efficiently, with a new trick to gain more accuracy than prior work. Paper:
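For readers new to bit-serial compute: the datapath consumes one bit of the operand per cycle, so a multiply becomes a sequence of shift-and-adds, and narrower weights directly mean fewer cycles. A toy software model of that datapath (unsigned weights only, for simplicity; Bitmod's mixed-precision and per-group handling is far more involved):

```python
def bit_serial_mul(act: int, weight: int, bits: int = 8) -> int:
    """Multiply an activation by an unsigned quantized weight one bit at a
    time, the way a bit-serial datapath would: one shift-and-add per cycle."""
    acc = 0
    for i in range(bits):
        if (weight >> i) & 1:   # cycle i consumes bit i of the weight
            acc += act << i     # accumulate the activation shifted by the bit index
    return acc
```

Because cycles are only spent on the bits a weight actually has, bit-serial hardware naturally rewards low-precision and bit-sparse quantization.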
Next week at @MicroArchConf, Yuzong will present our BBS paper. For the first time, we can take advantage of *bidirectional* bit-level sparsity, extracting even more performance out of quantized DNNs. Read more:
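The intuition behind "bidirectional" bit-level sparsity: if digits can be -1 as well as +1, a long run of 1-bits collapses into just two nonzero digits (e.g., 126 = 128 - 2), so fewer bit-operations are needed. BBS's actual encoding and hardware are different; this sketch only illustrates the counting argument using the standard non-adjacent form (NAF) recoding.

```python
def naf(x: int):
    """Recode x into its non-adjacent form: digits in {-1, 0, +1} with no two
    adjacent nonzeros. Runs of 1-bits become a single +1 and a single -1."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)  # +1 if x % 4 == 1, -1 if x % 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits  # least-significant digit first
```

For example, 126 = 0b1111110 has six 1-bits in plain binary, but only two nonzero signed digits (+128, -2) after recoding, so a signed-digit datapath does a third of the work on that operand.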
RT @akhauriyash123: Excited to see @jaseweston highlighting the importance of contextual behavior in transformers! In our #EMNLP2024 paper….