Simeng Sun
@simeng_ssun
Followers: 564 · Following: 2K · Media: 13 · Statuses: 209
Research Scientist @nvidia. ex: PhD @UMassCS; Intern @MSFTResearch, @MetaAI, @AdobeResearch. Opinions are my own and not the views of my employer.
Joined June 2019
AI is already at work in American newsrooms. We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea. Here's what we learned about how AI is influencing local and national journalism:
4 replies · 54 reposts · 144 likes
GPT-5 lands first place on NoCha, our long-context book understanding benchmark. That said, this is a tiny improvement (~1%) over o1-preview, which was released almost one year ago. Have long-context models hit a wall? Accuracy of human readers is >97%... Long way to go!
1 reply · 13 reposts · 46 likes
Phase 1 of the Physics of Language Models code release
✅ Our Part 3.1 + 4.1 = all you need to pretrain a strong 8B base model in 42k GPU-hours
✅ Canon layers = strong, scalable gains
✅ Real open source (data/train/weights)
✅ Apache 2.0 license (commercial ok!)
🔗 https://t.co/Nk3tOY2ICp
(1/8) 🍎 A Galileo moment for LLM design 🍎 As the Pisa Tower experiment sparked modern physics, our controlled synthetic pretraining playground reveals LLM architectures' true limits. A turning point that might divide LLM research into "before" and "after." https://t.co/7CZ6pTMlc9
14 replies · 115 reposts · 671 likes
🏆 Our @nvidia KV Cache Compression Leaderboard is now live! Compare state-of-the-art compression methods side-by-side with KVPress. See which techniques are leading in efficiency and performance. 🥇 https://t.co/kP9fdEG5JZ
8 replies · 46 reposts · 259 likes
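For a sense of what the compared methods do: one simple family scores each cached key-value pair and evicts the lowest-scoring ones, e.g. by key L2 norm (low-norm keys have been observed to attract the most attention). A rough sketch of that idea in plain PyTorch; this is an illustrative pruning function, not the KVPress API:

```python
import torch

def prune_kv_cache(keys, values, compression_ratio=0.5):
    """Evict cached KV pairs by key L2 norm, keeping the lowest-norm keys
    (prior work finds these tend to receive the most attention).
    keys, values: [batch, heads, seq_len, head_dim]."""
    seq_len = keys.shape[2]
    n_keep = max(1, int(seq_len * (1 - compression_ratio)))
    scores = -keys.norm(dim=-1)                        # higher score = keep
    idx = scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx)

k, v = torch.randn(1, 8, 128, 64), torch.randn(1, 8, 128, 64)
k2, v2 = prune_kv_cache(k, v)
print(k2.shape)  # torch.Size([1, 8, 64, 64]) -- half the cache evicted
```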
We've released a series of OpenReasoning-Nemotron models (1.5B, 7B, 14B and 32B) that set a new SOTA on a wide range of reasoning benchmarks among open-weight models of corresponding size. The models are based on the Qwen2.5 architecture and are trained with SFT on the data …
8 replies · 46 reposts · 303 likes
Scaling up RL is all the rage right now; I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey, this happened to go well (/poorly), let me slightly …"
415 replies · 861 reposts · 8K likes
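The "this happened to go well, upweight it slightly" intuition is exactly the REINFORCE policy-gradient update. A minimal toy sketch (the 4-action task and all names here are illustrative, not anything from the tweet's research):

```python
import torch

# Minimal REINFORCE: nudge the policy toward actions that happened to
# earn reward, away from ones that didn't.
logits = torch.zeros(4, requires_grad=True)       # toy 4-action policy
opt = torch.optim.SGD([logits], lr=0.1)

def reward(action: int) -> float:
    return 1.0 if action == 2 else -0.1           # toy task: action 2 is good

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    r = reward(action.item())
    # "this went well (/poorly), so slightly up-(/down-)weight it":
    loss = -dist.log_prob(action) * r
    opt.zero_grad(); loss.backward(); opt.step()

print(logits.softmax(-1))  # mass should concentrate on action 2
```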
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
4 replies · 53 reposts · 314 likes
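"Verifiable" here means compliance can be checked by a program rather than a judge model. A minimal sketch of what such checkers can look like (these example constraints are illustrative, not IFBench's actual instruction set):

```python
import re

# Illustrative verifiable-instruction checkers: each returns True iff
# the response provably satisfies the constraint.
def max_words(response: str, n: int) -> bool:
    return len(response.split()) <= n

def contains_exact_phrase(response: str, phrase: str) -> bool:
    return phrase in response

def every_sentence_starts_with(response: str, letter: str) -> bool:
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return all(s.lstrip()[:1].lower() == letter.lower() for s in sentences if s)

checks = [
    (max_words, ("All good here.", 5)),
    (contains_exact_phrase, ("All good here.", "good")),
    (every_sentence_starts_with, ("All good here.", "a")),
]
print(all(fn(*args) for fn, args in checks))  # True
```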
📢 Can LLMs really reason outside the box in math, or are they just remixing familiar strategies? Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math yet still failed at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found …
23 replies · 158 reposts · 728 likes
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with:
- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
16 replies · 205 reposts · 1K likes
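For orientation: softmax attention trains in O(T²), while linear attention and SSMs compress history into a fixed-size state, giving O(T) training; log-linear attention sits between the two. A minimal sketch of the linear-attention recurrence, the O(T) endpoint of that spectrum (illustrative single-head code, feature maps omitted; not the paper's kernels):

```python
import torch

def linear_attention(q, k, v):
    """Causal linear attention via the recurrence S_t = S_{t-1} + k_t v_t^T,
    o_t = S_t^T q_t: O(T) time, O(1) state.
    q, k, v: [seq_len, dim] for a single head."""
    T, d = q.shape
    S = torch.zeros(d, v.shape[-1])
    out = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])   # accumulate KV outer products
        out.append(S.T @ q[t])            # read the state with the query
    return torch.stack(out)

o = linear_attention(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16))
print(o.shape)  # torch.Size([8, 16])
```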
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, …
110 replies · 580 reposts · 3K likes
✨ New paper ✨ 🚨 Scaling test-time compute can lead to inverse or flattened scaling!! We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways: ➡️ Frontier LLMs struggle on Seal-0 (SealQA's …
4 replies · 42 reposts · 146 likes
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
6 replies · 36 reposts · 122 likes
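One way to make the 90%-copied constraint concrete is to measure what fraction of the output's word n-grams appear verbatim in the source paragraphs. A rough sketch of such a check (an illustrative metric, not the paper's exact procedure):

```python
def copied_fraction(output: str, sources: list[str], n: int = 5) -> float:
    """Fraction of the output's word n-grams found verbatim in any source."""
    def ngrams(text: str):
        words = text.lower().split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    source_ngrams = set()
    for s in sources:
        source_ngrams.update(ngrams(s))
    out = ngrams(output)
    if not out:
        return 0.0
    return sum(g in source_ngrams for g in out) / len(out)

# e.g. enforce the constraint: copied_fraction(draft, paragraphs) >= 0.9
```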
🚀 New paper on evaluating retrieval robustness – how well LLMs handle imperfect retrieval:
1️⃣ RAG >= non-RAG?
2️⃣ More docs >= fewer docs?
3️⃣ Sensitivity to doc order
▶️ 11 LLMs × 3 prompting strategies
Findings: LLMs show surprisingly high robustness, but limitations remain. 1/2
2 replies · 10 reposts · 49 likes
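The doc-order question (3️⃣) can be probed with a simple protocol: ask the same question over shuffled versions of the retrieved set and compare answers. A sketch under an assumed `llm(prompt) -> answer` interface (hypothetical harness, not the paper's code):

```python
import random

def order_sensitivity(llm, question: str, docs: list[str], trials: int = 5):
    """Ask the same question with shuffled contexts; an order-robust model
    should return the same answer every time."""
    answers = []
    for seed in range(trials):
        shuffled = docs[:]
        random.Random(seed).shuffle(shuffled)
        context = "\n\n".join(shuffled)
        answers.append(llm(f"Context:\n{context}\n\nQuestion: {question}"))
    return len(set(answers)) == 1, answers
```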
How much do language models memorize? "We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we …"
8 replies · 174 reposts · 1K likes
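Schematically, the quoted separation can be written as the decomposition below (my paraphrase in informal notation, not the paper's exact definitions):

```latex
\[
\underbrace{\mathrm{mem}(\theta, D)}_{\text{total information about dataset } D}
\;=\;
\underbrace{\mathrm{mem}_{\mathrm{U}}(\theta, D)}_{\text{unintended: specific to } D}
\;+\;
\underbrace{\mathrm{gen}(\theta)}_{\text{about the true data-generation process}}
\]
```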
Does RL truly expand a model's reasoning 🧠 capabilities? Contrary to recent claims, the answer is yes—if you push RL training long enough! Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world's leading 1.5B reasoning model 💥 and offering …
19 replies · 71 reposts · 416 likes
new paper! 🫡 why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour: therefore, we propose using mechanistic evaluations to answer it!
12 replies · 95 reposts · 663 likes
LLMs memorize novels 📚 in English. But what about existing translations? Or translations into new languages? Our 🦉 OWL dataset (31K / 10 languages) shows GPT-4o recognizes books:
- 92% English
- 83% official translations
- 69% unseen translations
- 75% as audio (EN)
1 reply · 11 reposts · 21 likes
Long-form inputs (e.g., needle-in-a-haystack setups) are a crucial aspect of high-impact LLM applications. While previous studies have flagged issues like positional bias and distracting documents, they've missed a key element: the size of the gold/relevant context. In our …
3 replies · 21 reposts · 52 likes
🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that, given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI: …
6 replies · 45 reposts · 193 likes
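A sketch of the core idea: score each sampled completion with BLEU against the reference and use group-mean-centered rewards, GRPO-style. The helper names and loop structure are illustrative, not the paper's code; `sacrebleu.sentence_bleu` is a real library call:

```python
import sacrebleu

def bleu_reward(completion: str, reference: str) -> float:
    """Reference-based reward: sentence BLEU rescaled to [0, 1].
    Plug into a GRPO-style trainer in place of a learned reward model."""
    return sacrebleu.sentence_bleu(completion, [reference]).score / 100.0

def group_advantages(completions: list[str], reference: str) -> list[float]:
    """Group-relative advantages as in GRPO: reward each sampled completion
    relative to its group mean (std normalization omitted for brevity)."""
    rewards = [bleu_reward(c, reference) for c in completions]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```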