Songlin Yang Profile
Songlin Yang

@SonglinYang4

Followers 14K · Following 5K · Media 82 · Statuses 2K

research @MIT_CSAIL @thinkymachines. work on language model architectures. in open-sourcing I trust 🐳. she/her/hers

Cambridge, MA
Joined January 2021
@SonglinYang4
Songlin Yang
16 hours
Hybrid Models as First-Class Citizens in vLLM 😍
@PyTorch
PyTorch
21 hours
Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗
1
5
122
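To make the "first-class support" concrete, here is a minimal offline-inference sketch using vLLM's standard LLM API; the model ID and sampling settings are illustrative assumptions, not taken from the blog post.

```python
# Minimal sketch: running a hybrid (attention + linear/SSM) model with vLLM's
# offline LLM API. The model ID and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

# Hybrid architectures are loaded like any other model; the V1 engine manages
# both the attention KV cache and the recurrent state of the linear/SSM layers.
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct")  # assumed HF model ID

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain hybrid attention in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```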
@_albertgu
Albert Gu
17 hours
love to see it - ongoing community effort makes deploying recurrent models (mamba, deltanet, other linear attention hybrids) easier than ever to realize their inference throughput wins
@PyTorch
PyTorch
21 hours
Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗
3
13
86
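The throughput wins Albert mentions come from replacing a KV cache that grows with sequence length by a fixed-size recurrent state. A toy sketch of the vanilla linear-attention recurrence (no gating, no normalization), written here as a generic illustration rather than any specific model's kernel:

```python
import torch

def linear_attention_decode(q, k, v):
    """Toy per-token recurrence for vanilla linear attention.

    q, k, v: (T, d) tensors. The state S is (d, d) and never grows with T,
    unlike a softmax-attention KV cache, which stores all T keys and values.
    """
    T, d = q.shape
    S = torch.zeros(d, d)                 # fixed-size recurrent state
    outputs = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])   # S_t = S_{t-1} + k_t v_t^T
        outputs.append(S.T @ q[t])        # o_t = S_t^T q_t
    return torch.stack(outputs)

o = linear_attention_decode(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
print(o.shape)  # torch.Size([8, 16]); memory stays O(d^2) regardless of sequence length
```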
@nrehiew_
wh
1 day
I highly encourage anyone interested in Delta Attention/Deltanet to read through this whole thread. You can see how I start from practically 0 and am trying to understand Kimi Delta Attention and related linear attention literature by spamming Grad with questions.
@Grad62304977
Grad
7 days
@nrehiew_ Originally, a while back I got some intuition for it from the query-key-value perspective, which might help (there's also the gradient-descent perspective, which is good too). Scraped this from a chat with @stochasticchasm a year ago so it might be a bit dodgy. Imagine you want to store
5
51
498
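A toy numerical sketch of the query-key-value ("store a value at a key") intuition Grad describes above: the delta rule partially overwrites the value currently associated with a key instead of just accumulating, which is the core difference from vanilla linear attention. The scalar beta and the lack of gating/normalization are simplifications of the published DeltaNet recurrence.

```python
import torch

def delta_rule_step(S, k, v, beta):
    """One DeltaNet-style state update. S: (d, d); k, v: (d,); beta in [0, 1].

    Memory view: v_old = S^T k is the value currently stored at key k; the
    update writes back (1 - beta) * v_old + beta * v, a partial overwrite
    rather than the pure accumulation S + k v^T of vanilla linear attention.
    """
    v_old = S.T @ k                            # value currently associated with k
    v_new = (1 - beta) * v_old + beta * v      # interpolate old and new value
    return S + torch.outer(k, v_new - v_old)   # S_t = S_{t-1} + k_t (v_new - v_old)^T

d = 4
S = torch.zeros(d, d)
k = torch.nn.functional.normalize(torch.randn(d), dim=0)
v = torch.randn(d)
S = delta_rule_step(S, k, v, beta=1.0)
print(torch.allclose(S.T @ k, v, atol=1e-5))   # True: with beta=1 the key now retrieves v exactly
```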
@rasbt
Sebastian Raschka
2 days
My new field guide to alternatives to standard LLMs: Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers. https://t.co/ZpWugAccgQ
24
160
919
@srush_nlp
Sasha Rush
2 days
Think about this talk a lot. There was a time when people were bullish on "feed all the modalities to the LLM," but it didn't really pan out as I would have expected. The discrete / continuous divide remains an interesting challenge in deep learning.
@COLM_conf
Conference on Language Modeling
2 days
COLM Keynotes: Luke Zettlemoyer Mixed-modal Language Modeling https://t.co/8FdhhrfOnG
11
18
222
@EchoShao8899
Yijia Shao
6 days
Highly recommend this work from @shannonzshen if you're interested in human–agent collaboration but have felt discouraged by how hard it is to quantitatively study collaboration with humans in the loop. To quantify human-agent collaboration, the work introduces
@shannonzshen
Shannon Shen
6 days
Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users. We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵
5
20
112
@rasbt
Sebastian Raschka
4 days
@QuixiAI Ah sure, that's https://t.co/oEt8XzO57S I'm also working on one where I explain DeltaNet in more detail.
magazine.sebastianraschka.com
From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design
0
10
43
@jwuphysics
John F. Wu
5 days
Inside you there are two ML engineers
1
5
73
@Grad62304977
Grad
6 days
@gallabytes @ryu0000000001 The future is DSA with NoPE, mixed with KDA in a hybrid 🤯. It actually becomes so elegant imo like this
3
1
32
@JingyuanLiu123
JingyuanLiu
6 days
https://t.co/zwpED4c98h Kimi folks usually share their CoT on model training; this is a great reflection from @yzhang_cs
6
9
130
@SonglinYang4
Songlin Yang
7 days
smh why rebrand the common Transformer-to-RNN distillation/conversion as “re-training” without citing any prior works?
@manifest__ai
Manifest AI
8 days
Today we are releasing Brumby-14B-Base, the strongest attention-free base model around. https://t.co/mclQPFdOGa
4
10
135
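For context on the conversion Songlin refers to: a Transformer is typically turned into an RNN/linear-attention model by swapping each softmax-attention layer for a linear-attention layer and distilling it to mimic the original layer's outputs before (or during) continued training. A hedged per-layer distillation sketch; the module names and the plain MSE objective are assumptions, not Manifest AI's recipe or any particular paper's.

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(teacher_attn, student_linear_attn, hidden_states):
    """Match a trainable linear-attention 'student' layer to a frozen
    softmax-attention 'teacher' layer on the same hidden states.
    Both arguments are hypothetical modules mapping (B, T, d) -> (B, T, d)."""
    with torch.no_grad():
        target = teacher_attn(hidden_states)   # frozen softmax-attention output
    pred = student_linear_attn(hidden_states)  # linear-attention output to train
    return F.mse_loss(pred, target)
```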
@_albertgu
Albert Gu
7 days
so excited to see more SOTA linear models, including some novel technical refinements for improving recurrent models 🚀
@Kimi_Moonshot
Kimi.ai
7 days
Kimi Linear Tech Report is dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
1
11
127
@zy27962986
Zongyu Lin
7 days
🚀 Really excited to see this amazing arch change (KDA) finally coming out! Replacing global attention with a linear hybrid arch: better pretraining ppl, better long-context evals, better downstream math & code & STEM evals after RL, and >6× throughput at 1M context to unblock more downstream potentials to
@Kimi_Moonshot
Kimi.ai
7 days
Kimi Linear Tech Report is dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
1
18
55
@zxytim
Xinyu Zhou
7 days
You see:
- a new arch that is better and faster than full attention, verified with Kimi-style solidness.
I see:
- Starting with inferior performance even on short contexts. Nothing works and nobody knows why.
- Tweaking every possible hyper-parameter to grasp what is wrong. -
@Kimi_Moonshot
Kimi.ai
7 days
Kimi Linear Tech Report is dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
9
33
380
@SonglinYang4
Songlin Yang
7 days
Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually
12
62
503
@bigeagle_xd
🐻熊狸
7 days
by moonlight, we catch a glimpse of K3.
3
9
297
@vllm_project
vLLM
7 days
🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model support expands! Now supporting Kimi Linear — hybrid linear attention with Kimi Delta Attention (KDA):
- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction 💡
@Kimi_Moonshot
Kimi.ai
7 days
Kimi Linear Tech Report is dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
8
32
243
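The 75% KV-cache reduction follows directly from the layer mix: if roughly three out of every four layers use KDA (which keeps only a fixed-size recurrent state) and one in four keeps a full-attention KV cache, the cache shrinks to a quarter of a pure-attention stack. A back-of-the-envelope sketch under that assumed 3:1 ratio; layer count, head dimensions, and dtype are illustrative, not Kimi Linear's actual configuration.

```python
# Back-of-the-envelope KV-cache estimate for a 3:1 linear:full hybrid.
def kv_cache_bytes(n_full_attn_layers, seq_len, n_kv_heads=8, head_dim=128, bytes_per=2):
    # K and V tensors per full-attention layer, each seq_len x n_kv_heads x head_dim
    return 2 * n_full_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per

layers, seq_len = 48, 1_000_000
full = kv_cache_bytes(layers, seq_len)         # every layer is full attention
hybrid = kv_cache_bytes(layers // 4, seq_len)  # only 1 in 4 layers keeps a KV cache

print(f"full attention : {full / 1e9:.1f} GB")
print(f"3:1 hybrid     : {hybrid / 1e9:.1f} GB ({1 - hybrid / full:.0%} reduction)")
```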
@srush_nlp
Sasha Rush
8 days
Composer is a new model we built at Cursor. We used RL to train a big MoE model to be really good at real-world coding, and also very fast. https://t.co/DX9bbalx0B Excited for the potential of building specialized models to help in critical domains.
53
72
780
@RidgerZhu
Rui-Jie (Ridger) Zhu
8 days
Thrilled to release our new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models up to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3× its size.
20
136
623
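A minimal sketch of the looped-LM idea: reuse one block of weights for several iterations of depth instead of stacking distinct layers, so the parameter count stays fixed while effective depth (and latent reasoning) grows. The block structure and loop count below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """One weight-tied transformer-style block applied n_loops times:
    the same parameters are reused at every iteration, so effective depth
    scales without adding parameters."""
    def __init__(self, d_model=512, n_heads=8, n_loops=4):
        super().__init__()
        self.n_loops = n_loops
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        for _ in range(self.n_loops):  # same weights, repeated in depth
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 512)
print(LoopedBlock()(x).shape)  # torch.Size([2, 16, 512])
```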