Songlin Yang
@SonglinYang4
Followers: 14K · Following: 5K · Media: 82 · Statuses: 2K
research @MIT_CSAIL @thinkymachines. work on language model architectures. in open-sourcing I trust 🐳. she/her/hers
Cambridge, MA
Joined January 2021
Hybrid Models as First-Class Citizens in vLLM 😍
Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗
1
5
122
love to see it - the ongoing community effort makes it easier than ever to deploy recurrent models (Mamba, DeltaNet, other linear attention hybrids) and realize their inference throughput wins
3
13
86
I highly encourage anyone interested in Delta Attention/DeltaNet to read through this whole thread. You can see how I start from practically zero and try to understand Kimi Delta Attention and the related linear attention literature by spamming Grad with questions.
@nrehiew_ Originally a while back I got some intuition for it from the query-key-value perspective, which might help (there's also the gradient descent perspective, which is good too). Scraped this from a chat with @stochasticchasm a year ago so it might be a bit dodgy. Imagine you want to store
5
51
498
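The key-value storage intuition in the thread above maps onto the delta-rule recurrence behind DeltaNet-style linear attention: the recurrent state is a fast-weight matrix of key-to-value associations, and each step corrects the value stored under the current key rather than blindly accumulating. Below is a minimal, unoptimized NumPy sketch of that recurrence (my own illustration; the name `delta_rule_scan` is made up, real kernels use chunked hardware-efficient forms, and KDA/Gated DeltaNet additionally apply a learned per-channel decay to the state, omitted here).

```python
import numpy as np

def delta_rule_scan(q, k, v, beta):
    """Naive sequential delta-rule recurrence (illustration only, O(T) loop).

    q, k: (T, d_k); v: (T, d_v); beta: (T,) write strengths in (0, 1).
    The state S is a d_v x d_k fast-weight matrix of key->value associations;
    each step corrects the value stored under the current key (the "delta"
    part) instead of just adding to it, then the query reads the state.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        v_old = S @ k[t]                                # value currently stored for key k_t
        S = S + beta[t] * np.outer(v[t] - v_old, k[t])  # delta update toward the new value
        out[t] = S @ q[t]                               # read out with the query
    return out

# usage: tiny random example
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
o = delta_rule_scan(rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)),
                    rng.normal(size=(T, d_v)), rng.uniform(0.1, 0.9, size=T))
print(o.shape)  # (8, 4)
```

Setting beta to 1 makes each write fully overwrite the old association for that key, while smaller beta interpolates, which is the same update you get from one step of gradient descent on the stored value (the other perspective mentioned above).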
My new field guide to alternatives to standard LLMs: Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers. https://t.co/ZpWugAccgQ
24
160
919
I think about this talk a lot. There was a time when people were bullish on "feed all the modalities to the LLM," but it didn't really pan out as I would have expected. The discrete/continuous divide remains an interesting challenge in deep learning.
COLM Keynotes: Luke Zettlemoyer Mixed-modal Language Modeling https://t.co/8FdhhrfOnG
11
18
222
Highly recommend this work from @shannonzshen if you're interested in human–agent collaboration but have felt discouraged by how hard it is to quantitatively study collaboration with humans in the loop. To quantify human-agent collaboration, the work introduces
Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users. We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵
5
20
112
@QuixiAI Ah sure, that's https://t.co/oEt8XzO57S. I'm also working on one where I explain DeltaNet in more detail.
magazine.sebastianraschka.com
From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design
0
10
43
@gallabytes @ryu0000000001 The future is DSA with NoPE, mixed with KDA in a hybrid 🤯. It actually becomes so elegant imo like this
3
1
32
https://t.co/zwpED4c98h Kimi folks usually share their CoT on model training; this is a great reflection from @yzhang_cs
6
9
130
smh why rebrand the common Transformer-to-RNN distillation/conversion as “re-training” without citing any prior works?
Today we are releasing Brumby-14B-Base, the strongest attention-free base model around. https://t.co/mclQPFdOGa
4
10
135
so excited to see more SOTA linear models, including some novel technical refinements that improve recurrent models 🚀
The Kimi Linear Tech Report has dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better quality, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
1
11
127
🚀 Really excited to see this amazing arch change (KDA) finally coming out! Replacing global attention with a linear hybrid arch: better pretraining perplexities, long-context evals, and downstream math/code/STEM evals after RL, plus >6× throughput at 1M context to unblock more downstream potential to
1
18
55
You see:
- a new arch that is better and faster than full attention verified with Kimi-style solidness.
I see:
- Starting with inferior performance even on short contexts. Nothing works and nobody knows why.
- Tweaking every possible hyper-parameter to grasp what is wrong.
-
9
33
380
Many people are confused by Minimax's recent return to full attention - especially since it made the first large-scale pivot toward hybrid linear attention - and by Kimi's later adoption of hybrid linear variants (as well as earlier attempts like Qwen3-Next or Qwen3.5). I actually
12
62
503
🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model support expands! Now supporting Kimi Linear, a hybrid linear attention architecture with Kimi Delta Attention (KDA):
- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction 💡
8
32
243
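Quick sanity check on the 75% KV-cache figure: my reading is that the hybrid stacks cache-free linear-attention (KDA) layers and full-attention layers at roughly a 3:1 ratio, so only one layer in four keeps a growing KV cache. A tiny back-of-the-envelope sketch of that arithmetic (my own, with a made-up helper name, and the 3:1 layout is an assumption):

```python
# Back-of-the-envelope reading of the "75% KV cache reduction" claim.
# Assumption (mine): 3 cache-free linear-attention (KDA) layers per
# full-attention layer, so only 1 in 4 layers keeps a growing KV cache.
def kv_cache_fraction(linear_layers: int, full_attn_layers: int) -> float:
    """Fraction of the all-attention KV cache that a hybrid stack still needs."""
    return full_attn_layers / (linear_layers + full_attn_layers)

frac = kv_cache_fraction(linear_layers=3, full_attn_layers=1)
print(f"remaining KV cache: {frac:.0%}, reduction: {1 - frac:.0%}")  # 25%, 75%
```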
Composer is a new model we built at Cursor. We used RL to train a big MoE model to be really good at real-world coding, and also very fast. https://t.co/DX9bbalx0B Excited for the potential of building specialized models to help in critical domains.
53
72
780
Thrilled to release a new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3× its size.
20
136
623
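For readers new to the idea: a looped language model reuses the same weight-tied block of layers multiple times per token, so effective depth (and latent computation) grows without adding parameters. The sketch below is my own toy PyTorch illustration of that weight-tying idea, not the paper's architecture; the class name `LoopedBlockLM` and all hyperparameters are invented.

```python
import torch
import torch.nn as nn

class LoopedBlockLM(nn.Module):
    """Toy sketch of the looped-LM idea: one weight-tied transformer block is
    applied n_loops times, so effective depth grows without adding parameters.
    (Illustrative only; not the paper's actual architecture.)"""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_loops = n_loops
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):
        T = tokens.shape[1]
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.embed(tokens)
        for _ in range(self.n_loops):        # same parameters reused on every pass
            h = self.block(h, src_mask=causal_mask)
        return self.lm_head(h)

# usage: tiny random batch
logits = LoopedBlockLM()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

In a real looped LM the number of loops becomes a knob for trading extra latent computation against latency at a fixed parameter count, which is what makes the "on par with models 2 to 3× the size" comparison interesting.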