Songlin Yang
@SonglinYang4
Followers: 14K · Following: 5K · Media: 82 · Statuses: 2K
research @MIT_CSAIL @thinkymachines. work on language model architectures. in open-sourcing I trust 🐳. she/her/hers
Cambridge, MA
Joined January 2021
Hybrid Models as First-Class Citizens in vLLM 😍
Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗
1
5
122
love to see it - the ongoing community effort makes it easier than ever to deploy recurrent models (Mamba, DeltaNet, other linear attention hybrids) and realize their inference throughput wins
3
13
86
I highly encourage anyone interested in Delta Attention/DeltaNet to read through this whole thread. You can see how I start from practically zero and try to understand Kimi Delta Attention and the related linear attention literature by spamming Grad with questions.
@nrehiew_ Originally a while back I got some intuition for it from the query-key-value perspective, which might help (there's also the gradient descent perspective, which is good too). Scraped this from a chat with @stochasticchasm a year ago so it might be a bit dodgy. Imagine you want to store
5
51
498
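The key-value storage intuition in the thread above maps onto the delta-rule recurrence behind DeltaNet-style linear attention: the recurrent state is a fast-weight matrix of key-to-value associations, and each step corrects the value stored under the current key rather than blindly accumulating. Below is a minimal, unoptimized NumPy sketch of that recurrence (my own illustration; the name `delta_rule_scan` is made up, real kernels use chunked hardware-efficient forms, and KDA/Gated DeltaNet additionally apply a learned per-channel decay to the state, omitted here).

```python
import numpy as np

def delta_rule_scan(q, k, v, beta):
    """Naive sequential delta-rule recurrence (illustration only, O(T) loop).

    q, k: (T, d_k); v: (T, d_v); beta: (T,) write strengths in (0, 1).
    The state S is a d_v x d_k fast-weight matrix of key->value associations;
    each step corrects the value stored under the current key (the "delta"
    part) instead of just adding to it, then the query reads the state.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        v_old = S @ k[t]                                # value currently stored for key k_t
        S = S + beta[t] * np.outer(v[t] - v_old, k[t])  # delta update toward the new value
        out[t] = S @ q[t]                               # read out with the query
    return out

# usage: tiny random example
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
o = delta_rule_scan(rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)),
                    rng.normal(size=(T, d_v)), rng.uniform(0.1, 0.9, size=T))
print(o.shape)  # (8, 4)
```

Setting beta to 1 makes each write fully overwrite the old association for that key, while smaller beta interpolates, which is the same update you get from one step of gradient descent on the stored value (the other perspective mentioned above).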
My new field guide to alternatives to standard LLMs: Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers. https://t.co/ZpWugAccgQ
24
160
919
I think about this talk a lot. There was a time when people were bullish on "feed all the modalities to the LLM," but it didn't really pan out as I would have expected. The discrete/continuous divide remains an interesting challenge in deep learning.
COLM Keynotes: Luke Zettlemoyer Mixed-modal Language Modeling https://t.co/8FdhhrfOnG
11
18
222
Highly recommend this work from @shannonzshen if you're interested in human–agent collaboration but have felt discouraged by how hard it is to quantitatively study collaboration with humans in the loop. To quantify human-agent collaboration, the work introduces
Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users. We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵
5
20
112
@QuixiAI Ah sure, that's https://t.co/oEt8XzO57S. I'm also working on one where I explain DeltaNet in more detail.
magazine.sebastianraschka.com
From DeepSeek-V3 to Kimi K2: A Look At Modern LLM Architecture Design
0
10
43
@gallabytes @ryu0000000001 The future is DSA with NoPE, mixed with KDA in a hybrid 🤯. It actually becomes so elegant imo like this
3
1
32
https://t.co/zwpED4c98h Kimi folks usually share their CoT on model training; this is a great reflection from @yzhang_cs
6
9
130
smh why rebrand the common Transformer-to-RNN distillation/conversion as “re-training” without citing any prior works?
Today we are releasing Brumby-14B-Base, the strongest attention-free base model around. https://t.co/mclQPFdOGa
4
10
135
so excited to see more SOTA linear models, including some novel technical refinements that improve recurrent models 🚀
The Kimi Linear Tech Report has dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better quality, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
1
11
127
🚀 Really excited to see this amazing arch change (KDA) finally coming out! Replacing global attention with a linear hybrid arch: better pretraining perplexities, long-context evals, and downstream math/code/STEM evals after RL, plus >6× throughput at 1M context to unblock more downstream potential to
1
18
55
You see:
- a new arch that is better and faster than full attention verified with Kimi-style solidness.
I see:
- Starting with inferior performance even on short contexts. Nothing works and nobody knows why.
- Tweaking every possible hyper-parameter to grasp what is wrong.
-
9
33
380
Many people are confused by Minimax's recent return to full attention - especially since it made the first large-scale pivot toward hybrid linear attention - and by Kimi's later adoption of hybrid linear variants (as well as earlier attempts like Qwen3-Next or Qwen3.5). I actually
12
62
503
🎉 Congrats to @Kimi_Moonshot! vLLM Day-0 model support expands! Now supporting Kimi Linear, a hybrid linear attention architecture with Kimi Delta Attention (KDA):
- RULER 128k context: 84.3 perf + 3.98× speedup
- Up to 6× faster decoding & 6.3× faster TPOT (1M tokens)
- 75% KV cache reduction 💡
8
32
243
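Quick sanity check on the 75% KV-cache figure: my reading is that the hybrid stacks cache-free linear-attention (KDA) layers and full-attention layers at roughly a 3:1 ratio, so only one layer in four keeps a growing KV cache. A tiny back-of-the-envelope sketch of that arithmetic (my own, with a made-up helper name, and the 3:1 layout is an assumption):

```python
# Back-of-the-envelope reading of the "75% KV cache reduction" claim.
# Assumption (mine): 3 cache-free linear-attention (KDA) layers per
# full-attention layer, so only 1 in 4 layers keeps a growing KV cache.
def kv_cache_fraction(linear_layers: int, full_attn_layers: int) -> float:
    """Fraction of the all-attention KV cache that a hybrid stack still needs."""
    return full_attn_layers / (linear_layers + full_attn_layers)

frac = kv_cache_fraction(linear_layers=3, full_attn_layers=1)
print(f"remaining KV cache: {frac:.0%}, reduction: {1 - frac:.0%}")  # 25%, 75%
```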
Composer is a new model we built at Cursor. We used RL to train a big MoE model to be really good at real-world coding, and also very fast. https://t.co/DX9bbalx0B Excited for the potential of building specialized models to help in critical domains.
53
72
780
Thrilled to release a new paper: "Scaling Latent Reasoning via Looped Language Models." TL;DR: We scale looped language models to 2.6 billion parameters, pretrained on >7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3× its size.
20
136
623
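For readers new to the idea: a looped language model reuses the same weight-tied block of layers multiple times per token, so effective depth (and latent computation) grows without adding parameters. The sketch below is my own toy PyTorch illustration of that weight-tying idea, not the paper's architecture; the class name `LoopedBlockLM` and all hyperparameters are invented.

```python
import torch
import torch.nn as nn

class LoopedBlockLM(nn.Module):
    """Toy sketch of the looped-LM idea: one weight-tied transformer block is
    applied n_loops times, so effective depth grows without adding parameters.
    (Illustrative only; not the paper's actual architecture.)"""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_loops = n_loops
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):
        T = tokens.shape[1]
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.embed(tokens)
        for _ in range(self.n_loops):        # same parameters reused on every pass
            h = self.block(h, src_mask=causal_mask)
        return self.lm_head(h)

# usage: tiny random batch
logits = LoopedBlockLM()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

In a real looped LM the number of loops becomes a knob for trading extra latent computation against latency at a fixed parameter count, which is what makes the "on par with models 2 to 3× the size" comparison interesting.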