Songlin Yang Profile
Songlin Yang

@SonglinYang4

Followers: 12K · Following: 5K · Media: 75 · Statuses: 2K

Ph.D. student @MIT_CSAIL. Working on scalable and principled algorithms in #LLM and #MLSys. In open-sourcing I trust 🐳. she/her/hers

Cambridge, MA
Joined January 2021
@SonglinYang4
Songlin Yang
3 months
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks.
Tweet card summary image
arxiv.org
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
9
91
538
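The abstract above notes that attention without position encoding is permutation-invariant (strictly, permutation-equivariant: shuffling the tokens just shuffles the outputs). A quick, self-contained PyTorch check of that property, written here for illustration and not taken from the PaTH paper:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
S, D = 6, 16
x = torch.randn(1, 1, S, D)          # (batch, heads, seq_len, head_dim)
perm = torch.randperm(S)

# Full (non-causal) self-attention with no position encoding.
out = F.scaled_dot_product_attention(x, x, x)
out_shuffled = F.scaled_dot_product_attention(x[:, :, perm], x[:, :, perm], x[:, :, perm])

# Shuffling the input tokens just shuffles the outputs the same way:
# without position encoding, no token "knows" where it sits in the sequence.
print(torch.allclose(out[:, :, perm], out_shuffled, atol=1e-5))  # True
```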
@SonglinYang4
Songlin Yang
5 hours
RT @cHHillee: @JingyuanLiu123 This is the advantage of large nvlink domains or TPUs topology - the main reason to do PP is that you are bot….
0
5
0
@SonglinYang4
Songlin Yang
7 hours
Talking about my research journey and the GLA, DeltaNet, and PaTH line of work in my first podcast ever—hope you enjoy :).
@DeltaInstitutes
Delta Institute
8 hours
Huge thanks to Songlin Yang for coming on the Delta Podcast! Check out the podcast episode here:
Tweet media one
0
5
109
@SonglinYang4
Songlin Yang
16 hours
Got something cool on reasoning? Submit to the Efficient Reasoning Workshop 🤗.
@ChengLuo_lc
Cheng Luo
16 hours
🌟 Reminder: Submission Deadline Approaching! 🌟 The 1st Workshop on Efficient Reasoning (ER) @ NeurIPS 2025, happening Dec 6 or 7 in San Diego, is fast approaching, and we'd love to see your work there! 📌 Submission Deadline: September 1, 2025 (AoE). 🔗 Submit here:
0
3
30
@SonglinYang4
Songlin Yang
3 days
RT @stuart_sul: MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time. So we comp….
0
95
0
@SonglinYang4
Songlin Yang
3 days
RT @_arohan_: Sequential operations are more powerful than parallel operations.
0
2
0
@SonglinYang4
Songlin Yang
5 days
RT @a1zhang: announcing the @GPU_MODE x @scaleml summer speaker series happening next week, a 5⃣-day series where top researchers will teac….
0
42
0
@SonglinYang4
Songlin Yang
7 days
RT @arcprize: Analyzing the Hierarchical Reasoning Model by @makingAGI. We verified scores on hidden tasks, ran ablations, and found that p….
0
199
0
@SonglinYang4
Songlin Yang
9 days
RT @teortaxesTex: What the hell is Anthropic doing for code contexts.
0
30
0
@SonglinYang4
Songlin Yang
16 days
RT @ChangJonathanC: while we wait for gpt-5 to drop. Here is a flex attention tutorial for building a < 1000 LoC vllm from scratch. https://….
Tweet card summary image
jonathanc.net
PyTorch FlexAttention tutorial: Building a minimal vLLM-style inference engine from scratch with paged attention
0
37
0
@SonglinYang4
Songlin Yang
18 days
starting now
Tweet media one
0
2
23
@SonglinYang4
Songlin Yang
18 days
RT @gu_xiangming: I noticed that @OpenAI added learnable bias to attention logits before softmax. After softmax, they deleted the bias. Thi….
0
174
0
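The retweet above describes what amounts to a learnable attention-sink logit: an extra bias competes in the softmax but its weight is thrown away afterwards, so the attention over real keys can sum to less than one. A minimal sketch of that idea in PyTorch; the function name, shapes, and per-head bias are assumptions for illustration, not anything from the tweet or an OpenAI release:

```python
import torch
import torch.nn.functional as F

def softmax_with_sink(scores: torch.Tensor, sink_bias: torch.Tensor) -> torch.Tensor:
    """Softmax over attention logits plus a learnable per-head sink logit.

    scores:    (batch, heads, q_len, kv_len) pre-softmax attention logits
    sink_bias: (heads,) learnable bias, one sink logit per head

    The sink logit participates in the softmax normalization, but its
    probability mass is dropped afterwards, so the weights over real keys
    sum to less than 1 per query.
    """
    b, h, q, k = scores.shape
    sink = sink_bias.view(1, h, 1, 1).expand(b, h, q, 1)          # broadcast the sink logit
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)  # softmax includes the sink
    return probs[..., :k]                                         # then the sink column is deleted

# usage sketch
scores = torch.randn(2, 4, 8, 8)
sink_bias = torch.nn.Parameter(torch.zeros(4))
attn = softmax_with_sink(scores, sink_bias)
print(attn.sum(-1))  # each row sums to less than 1; the sink absorbed the rest
```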
@SonglinYang4
Songlin Yang
18 days
RT @abk_tau: Had a great time presenting OPRM at ASAP! We talked about recurrent memory overflows, Long Context vs. RAG, and possible scal….
0
3
0
@SonglinYang4
Songlin Yang
19 days
starting now
Tweet media one
0
2
19
@SonglinYang4
Songlin Yang
19 days
RT @teortaxesTex: LoCoDiff, or, why everybody still uses Sonnet for coding
Tweet media one
0
15
0
@SonglinYang4
Songlin Yang
20 days
RT @SimonXinDong: Here is an explanation and implementation of the Sliding Window 128 + Sink Tokens setup OpenAI possibly used, with Flex Attent….
Tweet card summary image
github.com
Contribute to XinDongol/SWA-SinkMeta development by creating an account on GitHub.
0
14
0
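For context on the retweet above: a sliding-window mask with a few always-visible sink tokens is straightforward to express with PyTorch FlexAttention. The sketch below is an assumption-laden illustration (the window of 128 comes from the tweet; the 4 sink tokens are an arbitrary choice), not the linked SWA-SinkMeta code:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

WINDOW = 128   # sliding-window size mentioned in the tweet
N_SINK = 4     # number of always-attended sink tokens (assumed, not from the tweet)

def swa_sink_mask(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx                 # never look ahead
    in_window = (q_idx - kv_idx) < WINDOW    # attend to the last WINDOW keys
    is_sink = kv_idx < N_SINK                # first N_SINK tokens are always visible
    return causal & (in_window | is_sink)

device = "cuda"  # FlexAttention targets GPUs; assumes a CUDA device is available
B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D, device=device)
k = torch.randn(B, H, S, D, device=device)
v = torch.randn(B, H, S, D, device=device)

block_mask = create_block_mask(swa_sink_mask, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

For real use you would wrap flex_attention in torch.compile so the mask is fused into the attention kernel; in eager mode this runs but serves only as a correctness check.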
@SonglinYang4
Songlin Yang
21 days
Falcon-H1 is very rich in content — highly recommended.
@teortaxesTex
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
22 days
Falcon-H1 is a very dense research paper exploring the space of hybrid attention designs and tuning *every* hyperparameter there is. It's more interesting than models themselves. If you were intrigued by that «AlphaGo move» slop, this is the real thing.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
1
81
@SonglinYang4
Songlin Yang
24 days
RT @yzhang_cs: Huge congrats to the NSA team for their ACL2025 Best Paper win! 🏆🏆🏆 We've open-sourced a 3rd-party impl to help you integrate….
Tweet card summary image
github.com
🚀 Efficient implementations of state-of-the-art linear attention models - fla-org/flash-linear-attention
0
18
0
@SonglinYang4
Songlin Yang
24 days
RT @aryaman2020: nerdsniped while reading the DeltaNet paper: the main representation rewrite function we propose in ReFT is super similar….
0
2
0
@SonglinYang4
Songlin Yang
24 days
RT @jacobmbuckman: New post: "Context Is More Than A Length-Measuring Contest". It is a mistake to take the context lengths reported by mod….
0
2
0