Tri Dao
@tri_dao
35K Followers · 2K Following · 54 Media · 879 Statuses
Asst. Prof @PrincetonCS, Chief Scientist @togethercompute. Machine learning & systems.
Stanford, CA
Joined May 2012
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
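A minimal sketch of how FlashAttention is typically invoked from PyTorch via the flash-attn package; FlashAttention-3 is expected to expose a similar drop-in function on Hopper, but the exact import path there is an assumption, so check the repo.

```python
# Minimal sketch of calling FlashAttention through the flash-attn package.
# Assumes a CUDA GPU and the `flash_attn` pip package; FlashAttention-3 should
# offer a similar drop-in function, but verify the import path for Hopper.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 64
# flash-attn expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on GPU.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: never materializes the (seqlen x seqlen) score matrix.
out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
print(out.shape)
```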
AI has been built on one vendor's stack for too long. AMD's GPUs now offer state-of-the-art peak compute and memory bandwidth, but the lack of mature software (the "CUDA moat") keeps that power locked away. Time to break it and ride into our multi-silicon future. 🌊 It's been a…
Today's LLMs are painfully slow and expensive. They are autoregressive and spit out words sequentially. One. At. A. Time. Our dLLMs generate text in parallel, delivering answers up to 10X faster. Now we've raised $50M to scale them. Full story from @russellbrandom on techcrunch.com:
Diffusion models already power AI image generators, but Inception thinks they can be even more powerful when applied to software development.
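For intuition only, here is a toy sketch of the parallel-decoding idea behind diffusion-style LLMs: start from a fully masked sequence and commit several positions per refinement step, so generation takes a handful of model calls rather than one call per token. The `propose_tokens` function is hypothetical and stands in for the actual dLLM; this is not Inception's method.

```python
# Toy sketch (not Inception's implementation) of why diffusion-style decoding
# can be parallel: unmask many positions per refinement step instead of one
# token per forward pass. `propose_tokens` is a hypothetical stand-in for a
# masked-diffusion model call.
import random

MASK = "<mask>"
SEQ_LEN, NUM_STEPS = 16, 4            # 4 model calls instead of 16
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def propose_tokens(seq):
    """Hypothetical model: propose a token for every masked position at once."""
    return {i: random.choice(VOCAB) for i, t in enumerate(seq) if t == MASK}

seq = [MASK] * SEQ_LEN
for step in range(NUM_STEPS):
    proposals = propose_tokens(seq)                 # one parallel model call
    masked = list(proposals)
    # Commit a fraction of positions each step (confidence-based in practice).
    for i in random.sample(masked, k=max(1, len(masked) // (NUM_STEPS - step))):
        seq[i] = proposals[i]
print(seq)
```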
Tons of effort from IBM and vLLM folks to make these hybrid models go fast. Thank you!
Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗
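A minimal serving sketch, assuming the vLLM offline API; the checkpoint name below is only an example hybrid-model id, so substitute whichever supported hybrid you actually use.

```python
# Minimal vLLM sketch for serving one of the hybrid (attention + SSM) models.
# The model id below is illustrative; swap in whichever hybrid checkpoint you
# use (Qwen3-Next, Nemotron Nano 2, Granite 4.0, ...).
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-h-tiny")   # example hybrid checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain why hybrid SSM models decode quickly."], params)
print(outputs[0].outputs[0].text)
```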
Thank you @schmidtsciences for the 2025 #AI2050 Early Career Fellowship supporting my work on self-improving AI systems: as AI gets better, it should help human experts design better model architectures and faster training & inference systems
We're excited to welcome 28 new AI2050 Fellows! This 4th cohort of researchers are pursuing projects that include building AI scientists, designing trustworthy models, and improving biological and medical research, among other areas. https://t.co/8oY7xdhxvF
Congratulations to @tri_dao and @ZhongingAlong on being named AI2050 Early Career Fellows by @schmidtsciences! The AI2050 fellowships fund researchers working to solve hard problems in AI and improve technology for the benefit of humanity by 2050. https://t.co/FSNfP2YFtE
State space architecture for a state-of-the-art voice model!
We've raised $100M from Kleiner Perkins, Index Ventures, Lightspeed, and NVIDIA. Today we're introducing Sonic-3, the state-of-the-art model for real-time conversation. What makes Sonic-3 great:
- Breakthrough naturalness: laughter and full emotional range
- Lightning fast: …
Feels like a dream! I’ve recently started my Ph.D. in Computer Science @Princeton! Working on exciting research with Professors @HazanPrinceton and @tri_dao 🤩
SSMs promised efficient language modeling for long context, but so far seem to underperform compared to Transformers in many settings. Our new work suggests that this is not a problem with SSMs, but with how we are currently using them. Arxiv: https://t.co/bCzxawF452 🧵
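To make the efficiency claim concrete, here is a minimal sketch of the linear recurrence inside a generic (diagonal) SSM layer, not the exact architecture studied in the paper: the recurrent state has a fixed size, so per-token decoding cost stays constant instead of growing with context length the way a KV cache does.

```python
# Minimal sketch of the linear recurrence at the heart of an SSM layer
# (generic diagonal SSM, not the architecture from the paper): each step costs
# O(d_state * d_model) and the state size is constant, so decoding does not
# slow down as the context grows.
import torch

d_model, d_state, seqlen = 8, 16, 1024
A = torch.rand(d_state) * 0.99          # per-channel decay
B = torch.randn(d_state, d_model) * 0.1
C = torch.randn(d_model, d_state) * 0.1

h = torch.zeros(d_state)                # fixed-size recurrent state
x = torch.randn(seqlen, d_model)
ys = []
for t in range(seqlen):
    h = A * h + B @ x[t]                # h_t = A * h_{t-1} + B x_t (elementwise A)
    ys.append(C @ h)                    # y_t = C h_t
y = torch.stack(ys)                     # (seqlen, d_model); state stayed O(d_state)
```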
This work, led by @_junxiong_wang and @ben_athi, is a first step towards building AI systems that evolve and get better as you use them. More to come!
What if your LLM inference automatically got faster the more you used it? Introducing ATLAS from the Together AI Turbo research team. Read more: https://t.co/ASRNUpqoAE Here’s Together AI Founder and Chief Scientist @tri_dao introducing ATLAS:
Clarification: I was comparing A @ B + C here, where the CuTe-DSL version is quite good at overlapping the epilogue. On the standard matmul A @ B, cuBLAS is very good. Updated numbers here
Using CUTLASS CuTe-DSL, Together AI's Chief Scientist @tri_dao announced that he has written kernels that are 50% faster than NVIDIA's latest cuBLAS 13.0 library for small-K reduction-dim shapes on Blackwell during today's Hot Chips conference. His kernels beat cuBLAS by using 2…
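A rough PyTorch-level illustration of the workload in question (A @ B + C with a small reduction dimension K, where the epilogue add is a large share of the total work): torch.addmm merely fuses the add into one library call, while the CuTe-DSL kernels overlap the epilogue inside the kernel itself, so this is a sketch of the shape, not a reproduction of the Blackwell numbers.

```python
# Rough illustration of the A @ B + C, small-K workload discussed above.
# torch.addmm fuses the epilogue add into the GEMM call; the CuTe-DSL kernels
# go further and overlap the epilogue inside the kernel. Timings are only a
# sketch on whatever GPU you run this on.
import torch

M, N, K = 8192, 8192, 64                       # small reduction dim K
A = torch.randn(M, K, device="cuda", dtype=torch.float16)
B = torch.randn(K, N, device="cuda", dtype=torch.float16)
C = torch.randn(M, N, device="cuda", dtype=torch.float16)

def bench(fn, iters=50):
    start, end = torch.cuda.Event(True), torch.cuda.Event(True)
    fn()                                       # warmup
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters     # ms per call

print("unfused :", bench(lambda: A @ B + C))
print("addmm   :", bench(lambda: torch.addmm(C, A, B)))
```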
We just released Nemotron Nano V2 with great accuracy and unprecedented inference speed. With the goal of true open-source models, we also released most of the data used to train this model. Check it out!
Today we're releasing NVIDIA Nemotron Nano v2 - a 9B hybrid SSM that is 6X faster than similarly sized models, while also being more accurate. Along with this model, we are also releasing most of the data we used to create it, including the pretraining corpus. Links to the…
🤖 OpenAI's open models are here. gpt-oss models just landed on Together AI. They achieve near-parity with o4-mini and were trained using o3 techniques. Build anything, deploy anywhere 🔥
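A hedged sketch of calling gpt-oss through the Together Python SDK; it assumes TOGETHER_API_KEY is set in the environment, and the model id string is an example that may not match the exact catalog name.

```python
# Sketch of querying gpt-oss via the Together Python SDK.
# Assumes TOGETHER_API_KEY is set; the model id is an example, check the catalog.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",          # example id
    messages=[{"role": "user", "content": "Summarize FlashAttention in two sentences."}],
)
print(resp.choices[0].message.content)
```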
Attention was never enough. The hybrid LLM era is here—and it’s moving fast. From Mamba to Jamba to Bamba, we mapped every major model that’s challenged the Transformer default in the past 18 months. 🧵 A timeline of what’s changed and why it matters ↓ 🔗
🚀We're expanding the Tencent Hunyuan open-source LLM ecosystem with four compact models (0.5B, 1.8B, 4B, 7B)! Designed for low-power scenarios like consumer-grade GPUs, smart vehicles, smart home devices, mobile phones, and PCs, these models support cost-effective fine-tuning
Falcon-H1 is a very dense research paper exploring the space of hybrid attention designs and tuning *every* hyperparameter there is. It's more interesting than the models themselves. If you were intrigued by that "AlphaGo move" slop, this is the real thing.
Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance Falcon's side-by-side attention-SSM hybrid model. Very detailed, from tokenizers to data preparation and optimization strategies.
Llama-8B training at 1.2M sequence length is now possible on a single H200 GPU with ALST + FA3 + Liger-Kernel. That's 2.4x longer than on a single H100. Ready-to-run recipes: https://t.co/fZZ1IL83lJ For how this is possible, see: https://t.co/rpJ3WPipSK To build FA3: …
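A sketch of the Liger-Kernel piece of that recipe, assuming the liger-kernel and transformers packages: patch the Hugging Face Llama implementation with fused Triton kernels before loading, and request a FlashAttention backend. The ALST sequence-parallel setup and the FA3 build itself live in the linked recipes; the checkpoint id here is only an example.

```python
# Sketch of the Liger-Kernel part of the long-context recipe: monkeypatch the
# HF Llama modules with fused Triton kernels, then load with a FlashAttention
# backend. ALST sequence parallelism and the FA3 build are handled separately
# in the linked recipes; the model id is an example.
import torch
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_llama()             # swap in fused RMSNorm/RoPE/CE kernels

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",       # example checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or FA3 once built for your GPU
)
```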
⏱️ AI is making the verification process easier, with models verifying proofs in minutes. 💻 Now, @prfsanjeevarora, @chijinML, @danqi_chen and @PrincetonPLI have released Goedel Prover V2, a model that is more efficient and more accurate than any previous one. 👉 https://t.co/v7500VNytz
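For a sense of what such a prover produces, here is a tiny illustrative Lean 4 theorem (not taken from the Goedel Prover release): once the proof term elaborates, the Lean kernel has machine-checked it.

```lean
-- Tiny illustrative Lean 4 theorem (not from the Goedel Prover benchmarks):
-- if this proof term elaborates, the Lean kernel has verified the statement.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```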
🧠 Qwen3 just leveled up on Together AI 🚀 Qwen3-235B-A22B-Instruct-2507-FP8 isn't just another model update - it's a leap forward 📈