Ammar Ahmad Awan Profile
Ammar Ahmad Awan

@ammar_awan

Followers: 280 · Following: 847 · Media: 28 · Statuses: 479

DeepSpeed-er @Microsoft, @MSFTDeepSpeed, Father, PhD, Wanna-be Professor, Technology Enthusiast.

Bellevue, Washington
Joined September 2011
@ammar_awan
Ammar Ahmad Awan
2 years
Very excited to lead this effort at Microsoft! My first release as a manager. Feeling proud of myself and my team who worked very hard to make this happen in a short amount of time :-)
@DeepSpeedAI
DeepSpeed
2 years
Introducing Mixtral, Phi2, Falcon, and Qwen support in #DeepSpeed-FastGen! - Up to 2.5x faster LLM inference - Optimized SplitFuse and token sampling - Exciting new features like a RESTful API and more! For more details: https://t.co/386OvJtQLk #DeepSpeed #AI
1
1
9
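The FastGen features announced above are typically used through the DeepSpeed-MII front end. A minimal sketch, assuming the `mii.pipeline` entry point and an illustrative Mixtral checkpoint name (not taken from the tweet):

```python
# Hedged sketch: serve one of the newly supported models with
# DeepSpeed-FastGen via DeepSpeed-MII. Model id and generation
# settings are illustrative assumptions.
import mii

# Non-persistent pipeline; FastGen's SplitFuse scheduling and the
# optimized token sampling run under the hood.
pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")

responses = pipe(["DeepSpeed is"], max_new_tokens=64)
print(responses[0])
```

For a persistent deployment, `mii.serve` is the usual route to the RESTful API mentioned in the tweet.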
@DeepPsycho_HQ
Deep Psychology
3 months
This is literally Worth GOLD!
77
3K
18K
@TivadarDanka
Tivadar Danka
4 months
My feelings when working with tensors:
23
167
2K
@jeffra45
Jeff Rasley
7 months
🧵1/ New release from @Snowflake AI Research: Shift Parallelism is a new LLM inference technique built on top of vLLM, released through ArcticInference. It dramatically improves latency while preserving high throughput. Here’s what it looks like in action 👇
1
18
73
@minchoi
Min Choi
7 months
I asked ChatGPT o3 for the top 10 weirdest prompts people ask
34
32
257
@hsu_byron
Byron Hsu
1 year
(1/n) Training LLMs can be hindered by out-of-memory errors when scaling batch size and sequence length. Add one line to boost multi-GPU training throughput by 20% and reduce memory usage by 60%. Introducing Liger-Kernel: Efficient Triton Kernels for LLM Training. https://t.co/cgJXXqXpM4
19
173
963
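A minimal sketch of the "one line" the tweet refers to, assuming the `liger_kernel` package and its `apply_liger_kernel_to_llama` patching helper; the model id and dtype are illustrative:

```python
# Hedged sketch: patch a Hugging Face Llama model with Liger's fused
# Triton kernels (RMSNorm, RoPE, SwiGLU, fused cross-entropy) before
# loading it. Model id and dtype are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # the advertised one-line patch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
# ...continue with the usual multi-GPU training setup (FSDP, DeepSpeed, etc.)
```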
@DeepSpeedAI_JP
DeepSpeed (Japanese account)
1 year
At an event held at The Ohio State University, team member Ammar Ahmad Awan @ammar_awan gave a talk on DeepSpeed optimizations! Ohio State is widely known for its research on distributed and parallel processing, and many members of the DeepSpeed team are alumni.
@DeepSpeedAI
DeepSpeed
1 year
Great to see the amazing DeepSpeed optimizations from @Guanhua_Wang_, Heyang Qin, @toh_tana, @QuentinAnthon15, and @samadejacobs presented by @ammar_awan at MUG '24.
0
1
4
@ammar_awan
Ammar Ahmad Awan
1 year
Felt great to be back at OSU. Thank you @dhabalkpanda, @harisubramoni for inviting me and enabling me to share the awesome DeepSpeed work with the @mvapich team!
@DeepSpeedAI
DeepSpeed
1 year
Great to see the amazing DeepSpeed optimizations from @Guanhua_Wang_, Heyang Qin, @toh_tana, @QuentinAnthon15, and @samadejacobs presented by @ammar_awan at MUG '24.
0
0
2
@mvapich
MVAPICH
1 year
Dr. Ammar Ahmad Awan from Microsoft DeepSpeed giving a presentation at MUG '24 on trillion-parameter LLMs and optimization with MVAPICH. @OSUengineering @Microsoft @OhTechCo @mvapich @MSFTDeepSpeed @MSFTDeepSpeedJP #MUG24 #MPI #AI #LLM #DeepSpeed
1
5
8
@DeepSpeedAI
DeepSpeed
1 year
Announcing that DeepSpeed now runs natively on Windows. This exciting combination unlocks DeepSpeed optimizations for Windows users and empowers more people and organizations with AI innovations. - HF Inference & Finetuning - LoRA - CPU Offload Blog: https://t.co/LeNHlDZH3C
1
6
38
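A minimal sketch of the CPU offload feature listed above, assuming the documented ZeRO config keys; the batch size, dtype, and stage are illustrative:

```python
# Hedged sketch: a DeepSpeed config enabling ZeRO-2 with optimizer-state
# offload to CPU, one of the features listed for the Windows release.
# Values here are illustrative assumptions.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# The config is then passed to deepspeed.initialize(...) or to the
# Hugging Face Trainer via its `deepspeed` argument.
```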
@DMogahed
Dalia Mogahed
2 years
Brought me to tears. She’s so respected by her peers. What an achievement. Academic excellence and ethical leadership. Asna Tabassum, we salute you.
2
36
134
@SebastienBubeck
Sebastien Bubeck
2 years
The phi-3 family is the work of an amazing team over many months, kudos to everyone!
4
4
110
@SebastienBubeck
Sebastien Bubeck
2 years
phi-3 is here, and it's ... good :-). I made a quick short demo to give you a feel of what phi-3-mini (3.8B) can do. Stay tuned for the open weights release and more announcements tomorrow morning! (And ofc this wouldn't be complete without the usual table of benchmarks!)
39
175
922
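A minimal sketch of trying phi-3-mini locally once the open weights landed, assuming the publicly released microsoft/Phi-3-mini-4k-instruct checkpoint and Hugging Face transformers; the prompt and generation settings are illustrative:

```python
# Hedged sketch: load phi-3-mini (3.8B) with transformers and generate.
# The checkpoint name is the released one; everything else is illustrative.
# Older transformers versions may also need trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain ZeRO offloading in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```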
@StasBekman
Stas Bekman
2 years
If you're trying to run MoE Mixtral-8x7b under @MSFTDeepSpeed, it's likely to hang on the first forward pass. The solution is here https://t.co/6O1JgJnMle and you need deepspeed>=0.13.0. Thanks to Masahiro Tanaka for the fix. edit: looks like someone codified it even better:
github.com
ZeRO3 does not work with MoE models because the order of executing modules can change at every forward/backward pass (#4094, #4808). This PR adds an API to stop breaking down a module for parameter...
2
9
79
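The fix in the linked PR is exposed as a small API call. A minimal sketch, assuming deepspeed>=0.13.0 and the Hugging Face Mixtral implementation:

```python
# Hedged sketch: tell ZeRO-3 to treat each sparse MoE block as a leaf
# module so it is not broken down further, avoiding the hang described
# above when expert execution order changes between passes.
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# ...then proceed with deepspeed.initialize(...) / the HF Trainer as usual.
```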
@ylecun
Yann LeCun
2 years
* Language is low bandwidth: less than 12 bytes/second. A person can read 270 words/minute, or 4.5 words/second, which is 12 bytes/s (assuming 2 bytes per token and 0.75 words per token). A modern LLM is typically trained with 1x10^13 two-byte tokens, which is 2x10^13 bytes.
@prmshra
Parmita Mishra
2 years
This is an essential point people seem to misrepresent.
558
2K
9K
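The bandwidth arithmetic in the quoted thread checks out with the figures it states (0.75 words per token, 2 bytes per token); a quick verification:

```python
# Reproduce the numbers from the thread above.
words_per_minute = 270
words_per_second = words_per_minute / 60          # 4.5 words/s
tokens_per_second = words_per_second / 0.75       # 6 tokens/s
reading_bytes_per_second = tokens_per_second * 2  # 12 bytes/s

training_tokens = 1e13                            # typical LLM training corpus
training_bytes = training_tokens * 2              # 2e13 bytes

print(reading_bytes_per_second, training_bytes)   # 12.0 2e+13
```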
@DeepSpeedAI
DeepSpeed
2 years
#DeepSpeed joins forces with @Sydney_Uni to unveil an exciting tech #FP6. Just supply your FP16 models, and we deliver: 🚀 1.5x performance boost for #LLMs serving on #GPUs 🚀 Innovative (4+2)-bit system design 🚀 Quality-preserving quantization link: https://t.co/m6vcmXaWxb
1
26
168
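The "(4+2)-bit" phrase refers to splitting each 6-bit weight into a 4-bit and a 2-bit segment so both can be stored with friendly memory alignment. A toy illustration of that split only, not the actual DeepSpeed/FP6 kernel layout:

```python
# Toy illustration (assumption: not the real FP6 kernel data layout):
# a 6-bit value can be stored as a 4-bit segment plus a 2-bit segment
# and reconstructed losslessly.
def split_6bit(value: int) -> tuple[int, int]:
    assert 0 <= value < 64       # 6-bit range
    high4 = value >> 2           # top 4 bits
    low2 = value & 0b11          # bottom 2 bits
    return high4, low2

def merge_6bit(high4: int, low2: int) -> int:
    return (high4 << 2) | low2

w = 0b101101                     # 45
assert merge_6bit(*split_6bit(w)) == w
```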
@StasBekman
Stas Bekman
2 years
The other news is the introduction of @MSFTDeepSpeed Meetups, which will be held about once every 3 months. The inaugural one will be on Feb 12, 6:00 PM - 8:00 PM, at the Redmond Reactor https://t.co/qikTXXkyrI Quote: "This will be the first ever meetup for the DeepSpeed
0
2
24
@DeepSpeedAI
DeepSpeed
2 years
Thanks @StasBekman! The DeepSpeed team is hiring for various engineering and research roles! Come join us and steer the future of large-scale AI training and inference.
@StasBekman
Stas Bekman
2 years
Finally I'm being told MSFT allocated more engineering positions on the @MSFTDeepSpeed team. As a long-time DeepSpeed user, for the first few years I had to fix many bugs myself since the team was so small, and finally the time has come where I can just report them and the
0
4
22
@DeepSpeedAI
DeepSpeed
2 years
Are you a #DeepSpeed user, fan, contributor, and/or advocate? Are you interested in meeting people behind @MSFTDeepSpeed tech? Are you interested in #AI? If yes, come and meet the team at our first in-person meetup in the Seattle area! Register here: https://t.co/C3acG4Wvx0
2
9
21
@QuentinAnthon15
Quentin Anthony
2 years
Getting the most out of your hardware when training transformers requires thinking about your model as a sequence of GPU kernel calls. This mindset, common in HPC, is rare in ML and leads to inefficiencies in LLM training. Learn more in our paper https://t.co/qy6Q2MEJpw
7
76
330
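One way to adopt the kernel-call mindset described above is to profile a layer and read the result as the list of GPU kernels it actually launches. A minimal sketch using torch.profiler; the layer shape and sizes are illustrative and a CUDA device is assumed:

```python
# Hedged sketch: view a transformer layer as the sequence of GPU kernels
# it launches. Sizes are illustrative; requires a CUDA device.
import torch
from torch.profiler import profile, ProfilerActivity

layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda().half()
x = torch.randn(128, 8, 1024, device="cuda", dtype=torch.half)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    layer(x)

# Each row is a kernel; fusing or eliminating kernels is where the
# efficiency gains discussed in the paper come from.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```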
@StasBekman
Stas Bekman
2 years
If you were holding off on trying @MSFTDeepSpeed ZeRO++, it looks like deepspeed@master should work well now: https://t.co/Bzg5DNyxym ZeRO++'s main feature is allowing you to use a hybrid approach if you can fit a model on a single node of 8 GPUs. So it takes advantage of the super
3
12
78
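A minimal sketch of the hybrid setup (hpZ) described above, assuming the documented ZeRO++ config keys; the 8-GPU node size and the quantization flags are illustrative:

```python
# Hedged sketch: ZeRO-3 with ZeRO++ hierarchical (hybrid) partitioning.
# Weights keep a secondary shard within each 8-GPU node, while quantized
# communication reduces cross-node traffic. Values are illustrative.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,      # secondary partition per 8-GPU node
        "zero_quantized_weights": True,    # quantized weight communication
        "zero_quantized_gradients": True,  # quantized gradient communication
    }
}
# Pass ds_config to deepspeed.initialize(...) or the HF Trainer as usual.
```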