Dan Fu @realDanFu X Profile

Dan Fu

@realDanFu

Followers

6K

Following

1K

Media

190

Statuses

797

Incoming assistant professor at UCSD CSE in MLSys. Currently recruiting students! Also running the kernels team @togethercompute.

https://t.co/pGWFSyjNcO

Joined September 2019

Don't wanna be here? Send us removal request.

Dan Fu

@realDanFu

1 year

Excited to share that I will be joining UCSD CSE as an assistant professor in January 2026! I'll be recruiting PhD students from the 2024 application pool - if you're interested in anything ML Sys/efficiency/etc please reach out & put my name on your application! Until then

47

40

572

Dan Fu

@realDanFu

1 month

It's really exciting to see @OpenAI releasing open-source models again. These models look really great, excited to see what we can do with them! 120B available now on @togethercompute, more coming soon!

Together AI

@togethercompute

1 month

🤖OpenAI's open models are here. gpt-oss models just landed on Together AI. Achieves near-parity with o4- mini, trained using o3 techniques. Build anything, deploy anywhere🔥

0

15

Together AI

@togethercompute

1 month

🤖OpenAI's open models are here. gpt-oss models just landed on Together AI. Achieves near-parity with o4- mini, trained using o3 techniques. Build anything, deploy anywhere🔥

13

24

111

Dan Fu

@realDanFu

2 months

Crazy fast!! Great work from @haoailab

Hao AI Lab

@haoailab

2 months

(1/n) 🚀 With FastVideo, you can now generate a 5-second video in 5 seconds on a single H200 GPU! Introducing FastWan series, a family of fast video generation models trained via a new recipe we term as “sparse distillation”, to speed up video denoising time by 70X! 🖥️ Live

0

1

4

Dan Fu

@realDanFu

2 months

DeepCogito models available and scaling on Together, check them out

Drishan Arora

@drishanarora

2 months

A small update - we had more traffic than anticipated. However, the endpoints are now scalable on Together AI for all models, including the 671B MoE. Test out the model here: https://t.co/Od1NXYVBxU (A huge thanks to the folks at @togethercompute for making this happen so

0

3

Dan Fu

@realDanFu

2 months

It’s been great working with you @mjlbach! Great models + great kernels & infra -> amazing things

Michael Lingelbach

@mjlbach

2 months

Working with @togethercompute has been one of the greatest accelerants for our research & inference team. We've scaled to thousands of chips seamlessly and migrated across multiple architectures thanks to their amazing kernel group. @vipulved is also a personal icon of mine.

0

2

Dan Fu

@realDanFu

2 months

I really enjoyed this talk from @bariskasikci at @ESFoMo - some really fine-grained analysis of compute patterns of LLM serving in the throughput-bound regime, and how to schedule operations to push the boundaries (a linear program)! Great work!

ES-FoMo@ICML2025

@ESFoMo

2 months

@wanchao_ Next we have @bariskasikci with a talk on the quest for blazingly fast LLM inference!

0

3

15

Dan Fu

@realDanFu

2 months

ES-FoMo is back tomorrow! Come join is in East Exhibition Hall A bright and early at 8:30AM for a great slate of invited talks, orals, spotlight lightning talks, and 150 posters!

ES-FoMo@ICML2025

@ESFoMo

2 months

Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/

0

2

14

ES-FoMo@ICML2025

@ESFoMo

2 months

Looking forward to seeing everyone for ES-FoMo part three tomorrow! We'll be in East Exhibition Hall A (the big one), and we've got an exciting schedule of invited talks, orals, and posters planned for you tomorrow. Let's meet some of our great speakers! 1/

3

22

81

Dan Fu

@realDanFu

2 months

And @keshigeyan is going to be presenting about Grafting - a great collaboration with @MichaelPoli6 on how to distill pretrained diffusion models into new architectures (Transformers -> Hyenas) 4/

Keshigeyan Chandrasegaran

@keshigeyan

3 months

1/ Model architectures have been mostly treated as fixed post-training. 🌱 Introducing Grafting: A new way to edit pretrained diffusion transformers, allowing us to customize architectural designs on a small compute budget. 🌎 https://t.co/fjOTVqfVZr Co-led with @MichaelPoli6

0

3

7

Dan Fu

@realDanFu

2 months

Two papers at the workshop I’m a bit fond of… @austinsilveria and @SohamGovande are going to be presenting Chipmunk - come chat with them about how they made video diffusion 3.7x faster! (With custom column-sparse attention kernels) 3/

Austin Silveria

@austinsilveria

5 months

Training-free acceleration of Diffusion Transformers with dynamic sparsity and cross-step attention/MLP deltas--collaboration with @SohamGovande and @realDanFu! ⚡️ 3.7x faster video and 1.6x faster image generation while preserving quality! 🧵 Open-source code & CUDA kernels!

1

2

10

Dan Fu

@realDanFu

2 months

On Saturday we’re hosting the ES-FoMo workshop, with @tri_dao, @dan_biderman, @simran_s_arora, @m_ryabinin and others - we’ve got a great slate of papers and invited talks, come join us! (More on the great slate of speakers soon) https://t.co/w2nhjqNxPb 2/

ES-FoMo@ICML2025

@ESFoMo

4 months

ES-FoMo is back for round three at #ICML2025! Join us in Vancouver on Saturday July 19 for a day dedicated to Efficient Systems for Foundation Models: from 💬reasoning models to🖼️scalable multimodality, 🧱efficient architectures, and more! Submissions due May 26! More below 👇

1

3

15

Dan Fu

@realDanFu

2 months

I’m off to #ICML2025 in Vancouver! (After an unusually eventful first flight - our plane had a wing problem, so we had to take an emergency landing back to SFO & switch planes) Reach out if you’d like to chat about (mega)kernels, @togethercompute, or anything MLSys! 1/

2

0

22

Dan Fu

@realDanFu

2 months

Fastest Deepseek! Super proud of the amazing inference team at Together for pulling this off!

Together AI

@togethercompute

2 months

Together AI Sets a New Bar: Fastest Inference for DeepSeek-R1-0528 We’ve upgraded the Together Inference Engine to run on @NVIDIA Blackwell GPUs—and the results speak for themselves: 📈 Highest known serverless throughput: 334 tokens/sec 🏃‍Fastest time to first answer token:

0

1

7

Together AI

@togethercompute

2 months

Together AI Sets a New Bar: Fastest Inference for DeepSeek-R1-0528 We’ve upgraded the Together Inference Engine to run on @NVIDIA Blackwell GPUs—and the results speak for themselves: 📈 Highest known serverless throughput: 334 tokens/sec 🏃‍Fastest time to first answer token:

7

14

106

Dan Fu

@realDanFu

2 months

Synthetics like associative recall, MQAR are a great guide to building models. Excited to see this work from @nick11roberts to create new LMs!

Nicholas Roberts

@nick11roberts

2 months

🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]

0

2

13

Nicholas Roberts

@nick11roberts

2 months

🎉 Excited to share that our paper "Pretrained Hybrids with MAD Skills" was accepted to @COLM_conf 2025! We introduce Manticore - a framework for automatically creating hybrid LMs from pretrained models without training from scratch. 🧵[1/n]

1

18

44

Dan Fu

@realDanFu

2 months

This is really cool! There’s a ton of places where a dynamic differentiable hierarchy makes a ton of sense. Awesome to see progress here!

Albert Gu

@_albertgu

2 months

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.

0

1

19

Albert Gu

@_albertgu

2 months

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.

Sukjun (June) Hwang

@sukjun_hwang

2 months

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data

61

192

1K

Dan Fu

@realDanFu

2 months

HMAR code and models are out!

Hermann

@KumbongHermann

2 months

Happy to share that our HMAR code and pre-trained models are now publicly available. Please try them out here: code: https://t.co/HZloGGrLFG checkpoints:

0

8

Hermann

@KumbongHermann

2 months

Happy to share that our HMAR code and pre-trained models are now publicly available. Please try them out here: code: https://t.co/HZloGGrLFG checkpoints:

huggingface.co

Hermann

@KumbongHermann

3 months

Excited to be presenting our new work–HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation– at #CVPR2025 this week. VAR (Visual Autoregressive Modelling) introduced a very nice way to formulate autoregressive image generation as a next-scale prediction task (from

0

11

39