
Xinyu Yang
@Xinyu2ML
Followers: 760 · Following: 412 · Media: 26 · Statuses: 239
Ph.D. @CarnegieMellon. Working on data- and hardware-driven, principled algorithm & system co-design for scalable and generalizable foundation models. They/Them
Pittsburgh, US
Joined December 2022
🚀 Super excited to share Multiverse! 🏃 It’s been a long journey exploring the space between model design and hardware efficiency. What excites me most is realizing that, beyond optimizing existing models, we can discover better model architectures by embracing system-level…
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%. 🌐 Website: 🧵 1/n
RT @ZeyuanAllenZhu: Facebook AI Research (FAIR) is a small, prestigious lab in Meta. We don't train large models like GenAI or MSL, so it's…
RT @sansa19739319: 🤖 Can diffusion models write code competitively? Excited to share our latest 7B coding diffusion LLM!! 💻 With DiffuCoder, …
RT @fengyao1909: 😵💫 Struggling with 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐌𝐨𝐄? Meet 𝐃𝐞𝐧𝐬𝐞𝐌𝐢𝐱𝐞𝐫, an MoE post-training method that offers more 𝐩𝐫𝐞𝐜𝐢𝐬𝐞 𝐫𝐨𝐮𝐭𝐞𝐫 𝐠𝐫𝐚𝐝𝐢𝐞…
RT @gan_chuang: 🧠 LLMs think too much and waste tokens! Can we precisely control how long they reason? Introducing Budget Guidance, a th…
RT @chelseabfinn: We still lack a scalable recipe for RL post-training seeded with demonstration data. Many methods add an imitation loss, …
RT @yifeiwang77: 🔥 Thrilled to share that our sparse embedding method CSR (ICML’25 oral) is now officially supported in SentenceTransformers…
RT @xichen_pan: The code and instruction-tuning data for MetaQuery are now open-sourced! Code: Data:
RT @_hanlin_zhang_: [1/n] New work [JSKZ25] w/ @JikaiJin2002, @syrgkanis, @ShamKakade6. We introduce new formulations and tools for evalu…
RT @Huangyu58589918: What precision should we use to train large AI models effectively? Our latest research probes the subtle nature of tra…
RT @NovaSkyAI: ✨ Release: We upgraded SkyRL into a highly modular, performant RL framework for training LLMs. We prioritized modularity: easi…
RT @arankomatsuzaki: I'd like to see Meta building a lean LLM team around Narang, Allen-Zhu, Mike Lewis, Zettlemoyer and Sukhbaatar and giv…
Please check Tilde's post for more information on sparse attention. (Also very happy to see some of our NSA kernels in FLA being deployed in their implementation 😀)
Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work… until now. 🔍 We reverse-engineered them to uncover:
- Novel attention patterns
- Hidden "attention sinks"
- Better performance
- And more
A 🧵… ~1/8~
RT @tilderesearch: Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…
RT @JiaZhihao: 📢 Exciting updates from #MLSys2025! All session recordings are now available and free to watch at We…