
Dan Fu (@realDanFu)
6K Followers · 1K Following · 190 Media · 797 Statuses
Incoming assistant professor at UCSD CSE in MLSys. Currently recruiting students! Also running the kernels team @togethercompute.
Joined September 2019
Excited to share that I will be joining UCSD CSE as an assistant professor in January 2026! I'll be recruiting PhD students from the 2024 application pool - if you're interested in anything ML Sys/efficiency/etc please reach out & put my name on your application! Until then
47 · 40 · 572
It's really exciting to see @OpenAI releasing open-source models again. These models look really great, excited to see what we can do with them! 120B available now on @togethercompute, more coming soon!
🤖OpenAI's open models are here. gpt-oss models just landed on Together AI. Achieves near-parity with o4-mini, trained using o3 techniques. Build anything, deploy anywhere🔥
0 · 0 · 15
Crazy fast!! Great work from @haoailab
(1/n) 🚀 With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing the FastWan series, a family of fast video generation models trained via a new recipe we term “sparse distillation”, to speed up video denoising time by 70X! 🖥️ Live
0 · 1 · 4
DeepCogito models available and scaling on Together, check them out
A small update - we had more traffic than anticipated. However, the endpoints are now scalable on Together AI for all models, including the 671B MoE. Test out the model here: https://t.co/Od1NXYVBxU (A huge thanks to the folks at @togethercompute for making this happen so
0 · 0 · 3
It’s been great working with you @mjlbach! Great models + great kernels & infra -> amazing things
Working with @togethercompute has been one of the greatest accelerants for our research & inference team. We've scaled to thousands of chips seamlessly and migrated across multiple architectures thanks to their amazing kernel group. @vipulved is also a personal icon of mine.
0 · 0 · 2
I really enjoyed this talk from @bariskasikci at @ESFoMo - some really fine-grained analysis of the compute patterns of LLM serving in the throughput-bound regime, and how to schedule operations to push the boundaries (via a linear program)! Great work!
0 · 3 · 15
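The talk itself isn't reproduced here, but the flavor of an LP-based scheduler can be sketched: treat batch composition as decision variables and maximize tokens served subject to compute and memory-bandwidth budgets. A minimal sketch with scipy, where every cost and budget number is a made-up placeholder, not a figure from the talk:

```python
# Hypothetical sketch: batch composition as a linear program.
# All numbers are illustrative placeholders.
from scipy.optimize import linprog

flops_prefill, flops_decode = 2.0, 2.0   # compute cost per token
bytes_prefill, bytes_decode = 0.1, 1.0   # memory traffic per token (decode is bandwidth-heavy)

flops_budget = 1000.0   # compute available per scheduler step
bytes_budget = 300.0    # memory bandwidth available per scheduler step

# Variables: x = [prefill_tokens, decode_tokens]; maximize total tokens served.
c = [-1.0, -1.0]  # linprog minimizes, so negate the objective
A_ub = [
    [flops_prefill, flops_decode],   # compute constraint
    [bytes_prefill, bytes_decode],   # bandwidth constraint
]
b_ub = [flops_budget, bytes_budget]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
prefill, decode = res.x
print(f"prefill tokens: {prefill:.0f}, decode tokens: {decode:.0f}")
```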
ES-FoMo is back tomorrow! Come join us in East Exhibition Hall A bright and early at 8:30AM for a great slate of invited talks, orals, spotlight lightning talks, and 150 posters!
Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/
0 · 2 · 14
And @keshigeyan is going to be presenting Grafting - a great collaboration with @MichaelPoli6 on how to distill pretrained diffusion models into new architectures (Transformers -> Hyenas) 4/
1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 https://t.co/fjOTVqfVZr Co-led with @MichaelPoli6
0 · 3 · 7
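As a rough illustration of the block-level idea behind grafting (the actual Grafting recipe is in the paper; the ConvMixer student and training details below are assumptions for the sketch): swap a pretrained block for a new operator and regress the student's outputs onto the frozen teacher's.

```python
# Toy sketch: distill a frozen attention block into a conv-based student.
# The student architecture and loop here are illustrative assumptions.
import torch
import torch.nn as nn

d, seq = 64, 128

teacher_block = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
teacher_block.requires_grad_(False)  # teacher stays frozen

class ConvMixer(nn.Module):
    """Hypothetical gated depthwise-conv operator standing in for attention."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=7, padding=3, groups=d)
        self.gate = nn.Linear(d, d)
        self.proj = nn.Linear(d, d)

    def forward(self, x):                      # x: (batch, seq, d)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h * torch.sigmoid(self.gate(x)))

student = ConvMixer(d)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(8, seq, d)                 # stand-in activations
    with torch.no_grad():
        y_teacher, _ = teacher_block(x, x, x)  # frozen teacher output
    loss = nn.functional.mse_loss(student(x), y_teacher)
    opt.zero_grad()
    loss.backward()
    opt.step()
```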
Two papers at the workshop I’m a bit fond of… @austinsilveria and @SohamGovande are going to be presenting Chipmunk - come chat with them about how they made video diffusion 3.7x faster! (With custom column-sparse attention kernels) 3/
Training-free acceleration of Diffusion Transformers with dynamic sparsity and cross-step attention/MLP deltas, a collaboration with @SohamGovande and @realDanFu! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!
1 · 2 · 10
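For intuition, here is a minimal dense-PyTorch rendering of column-sparse attention: attend only to a top-k subset of key/value columns. The probe-based column scoring is a hypothetical stand-in; Chipmunk's real speedups come from its custom sparse CUDA kernels and cross-step deltas, not from this math written densely.

```python
# Illustrative column-sparse attention (not the Chipmunk kernels themselves).
import torch
import torch.nn.functional as F

def column_sparse_attention(q, k, v, keep=0.25):
    # q, k, v: (batch, heads, seq, dim)
    b, h, n, d = k.shape
    topk = max(1, int(n * keep))
    # Proxy importance score per key column, via a cheap mean-query probe.
    # This scoring rule is an assumption for the sketch.
    probe = q.mean(dim=2, keepdim=True)                    # (b, h, 1, d)
    scores = (probe @ k.transpose(-2, -1)).squeeze(2)      # (b, h, n)
    idx = scores.topk(topk, dim=-1).indices                # (b, h, topk)
    idx_e = idx.unsqueeze(-1).expand(-1, -1, -1, d)
    k_s, v_s = k.gather(2, idx_e), v.gather(2, idx_e)      # keep sparse columns
    attn = F.softmax(q @ k_s.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ v_s

q = torch.randn(1, 4, 256, 64)
out = column_sparse_attention(q, q, q)   # self-attention over 25% of columns
print(out.shape)                         # torch.Size([1, 4, 256, 64])
```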
On Saturday we’re hosting the ES-FoMo workshop, with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others - we’ve got a great slate of papers and invited talks, come join us! (More on the great slate of speakers soon) https://t.co/w2nhjqNxPb 2/
ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇
1 · 3 · 15
I’m off to #ICML2025 in Vancouver! (After an unusually eventful first flight - our plane had a wing problem, so we had to make an emergency landing back at SFO & switch planes.) Reach out if you’d like to chat about (mega)kernels, @togethercompute, or anything MLSys! 1/
2 · 0 · 22
Fastest DeepSeek! Super proud of the amazing inference team at Together for pulling this off!
Together AI Sets a New Bar: Fastest Inference for DeepSeek-R1-0528 We’ve upgraded the Together Inference Engine to run on @NVIDIA Blackwell GPUs—and the results speak for themselves: 📈 Highest known serverless throughput: 334 tokens/sec 🏃Fastest time to first answer token:
0 · 1 · 7
Synthetic tasks like associative recall and MQAR are a great guide for building models. Excited to see this work from @nick11roberts on creating new LMs!
🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]
0 · 2 · 13
This is really cool! There are a ton of places where a dynamic differentiable hierarchy makes sense. Awesome to see progress here!
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
0 · 1 · 19
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
61 · 192 · 1K
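A toy rendering of the dynamic-chunking idea: a learned boundary predictor segments the input, and positions between boundaries are pooled into chunk vectors. H-Net's actual routing and differentiable smoothing modules are more involved (the hard threshold below blocks gradients to the boundary predictor); this sketch only shows the data flow.

```python
# Toy dynamic chunking: predict boundaries, pool positions into chunks.
# Architecture details here are assumptions, not H-Net's implementation.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.boundary = nn.Linear(d, 1)  # predicts P(boundary) per position

    def forward(self, x, threshold=0.5):                   # x: (seq, d)
        p = torch.sigmoid(self.boundary(x)).squeeze(-1)    # (seq,)
        is_boundary = p > threshold
        is_boundary[0] = True                              # position 0 starts a chunk
        chunk_id = torch.cumsum(is_boundary.long(), dim=0) - 1
        num_chunks = int(chunk_id.max()) + 1
        # Mean-pool positions that share a chunk id.
        chunks = torch.zeros(num_chunks, x.shape[-1]).index_add_(0, chunk_id, x)
        counts = torch.zeros(num_chunks).index_add_(0, chunk_id, torch.ones_like(p))
        return chunks / counts.unsqueeze(-1), chunk_id

x = torch.randn(128, 32)              # e.g., 128 byte embeddings
chunks, ids = DynamicChunker(32)(x)
print(chunks.shape)                   # (num_chunks, 32), data-dependent
```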
HMAR code and models are out!
Happy to share that our HMAR code and pre-trained models are now publicly available. Please try them out here: code: https://t.co/HZloGGrLFG checkpoints:
0 · 0 · 8
Excited to be presenting our new work, HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation, at #CVPR2025 this week. VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from
0 · 11 · 39
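Schematically, next-scale prediction autoregresses over resolutions rather than over pixels or patches: each step predicts the whole token map at the next scale, conditioned on all coarser scales. A skeleton of that loop (the predictor below is a random stand-in, not HMAR's model, and the scale schedule is illustrative):

```python
# Schematic of VAR-style next-scale prediction, which HMAR builds on.
import torch

scales = [1, 2, 4, 8]   # token-map resolutions, coarse to fine (illustrative)
d = 16

def predict_next_scale(context, res):
    # Stand-in for a transformer that would condition on `context`
    # (all coarser-scale tokens) via attention.
    return torch.randn(res, res, d)

context = []                                      # running coarse-to-fine tokens
for res in scales:
    tokens = predict_next_scale(context, res)     # predict a whole scale at once
    context.append(tokens.reshape(-1, d))         # append as flattened tokens

image_tokens = context[-1]   # finest scale; a VQ decoder would map this to pixels
print(image_tokens.shape)    # torch.Size([64, 16])
```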