Simeng Sun Profile
Simeng Sun (@simeng_ssun)
564 Followers · 2K Following · 13 Media · 209 Statuses

Research Scientist @nvidia. ex: PhD @UMassCS; Intern @MSFTResearch, @MetaAI, @AdobeResearch. Opinions are my own and not the views of my employer.

Joined June 2019
@jennajrussell
Jenna Russell
29 days
AI is already at work in American newsrooms. We examine 186k articles published this summer and find that ~9% are either fully or partially AI-generated, usually without readers having any idea. Here's what we learned about how AI is influencing local and national journalism:
4 replies · 54 reposts · 144 likes
@MohitIyyer
Mohit Iyyer
4 months
GPT-5 lands first place on NoCha, our long-context book understanding benchmark. That said, this is a tiny improvement (~1%) over o1-preview, which was released almost one year ago. Have long-context models hit a wall? Accuracy of human readers is >97%... Long way to go!
1 reply · 13 reposts · 46 likes
@ZeyuanAllenZhu
Zeyuan Allen-Zhu, Sc.D.
4 months
Phase 1 of the Physics of Language Models code release ✅ Our Part 3.1 + 4.1 = all you need to pretrain a strong 8B base model in 42k GPU-hours ✅ Canon layers = strong, scalable gains ✅ Real open source (data/train/weights) ✅ Apache 2.0 license (commercial OK!) 🔗 https://t.co/Nk3tOY2ICp
@ZeyuanAllenZhu
Zeyuan Allen-Zhu, Sc.D.
7 months
(1/8) 🍎 A Galileo moment for LLM design 🍎 As the Tower of Pisa experiment sparked modern physics, our controlled synthetic pretraining playground reveals LLM architectures' true limits. A turning point that might divide LLM research into "before" and "after." https://t.co/7CZ6pTMlc9
14 replies · 115 reposts · 671 likes
@devoto_alessio
Alessio Devoto
4 months
🏆 Our @nvidia KV Cache Compression Leaderboard is now live! Compare state-of-the-art compression methods side-by-side with KVPress. See which techniques are leading in efficiency and performance. 🥇 https://t.co/kP9fdEG5JZ
8 replies · 46 reposts · 259 likes
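For readers unfamiliar with KV-cache compression, the techniques compared on the leaderboard share one mechanic: score the cached positions and evict the weakest. Here is a minimal sketch of that pattern; the key-norm score below is my own stand-in for illustration, not one of the KVPress methods:

```python
import numpy as np

def compress_kv(keys, values, keep_ratio=0.5):
    """Toy KV-cache eviction: score each cached position (here by the
    L2 norm of its key, a deliberately simple stand-in) and keep only
    the highest-scoring fraction, preserving positional order."""
    seq_len = keys.shape[0]
    keep = max(1, int(seq_len * keep_ratio))
    scores = np.linalg.norm(keys, axis=-1)           # one score per position
    survivors = np.sort(np.argsort(scores)[-keep:])  # top-k, back in order
    return keys[survivors], values[survivors]

rng = np.random.default_rng(0)
k = rng.standard_normal((128, 64))   # (seq_len, head_dim)
v = rng.standard_normal((128, 64))
k2, v2 = compress_kv(k, v, keep_ratio=0.25)
print(k2.shape)  # (32, 64)
```

Real entries on the leaderboard differ mainly in the scoring function (e.g. accumulated attention weight instead of key norm) and in whether compression is applied per head or per layer.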
@igtmn
Igor Gitman
4 months
We've released a series of OpenReasoning-Nemotron models (1.5B, 7B, 14B and 32B) that set a new SOTA on a wide range of reasoning benchmarks among open-weight models of corresponding size. The models are based on the Qwen2.5 architecture and are trained with SFT on the data…
8 replies · 46 reposts · 303 likes
@karpathy
Andrej Karpathy
4 months
Scaling up RL is all the rage right now; I had a chat with a friend about it yesterday. I'm fairly certain RL will continue to yield more intermediate gains, but I also don't expect it to be the full story. RL is basically "hey this happened to go well (/poorly), let me slightly…
415 replies · 861 reposts · 8K likes
@allen_ai
Ai2
5 months
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
4 replies · 53 reposts · 314 likes
@nouhadziri
Nouha Dziri
5 months
📢 Can LLMs really reason outside the box in math, or are they just remixing familiar strategies? Remember, DeepSeek R1 and o1 have impressed us on Olympiad-level math, yet they were failing at simple arithmetic 😬 We built a benchmark to find out → OMEGA Ω 📐 💥 We found…
23 replies · 158 reposts · 728 likes
@HanGuo97
Han Guo
6 months
We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
16 replies · 205 reposts · 1K likes
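As context for the spectrum the tweet describes, here is a toy sketch of plain linear attention at decode time, the constant-state endpoint that log-linear attention interpolates away from. This is my own illustration (feature map omitted, where real linear attention applies e.g. elu(x)+1), not the paper's Triton kernels:

```python
import numpy as np

def linear_attention_step(state, norm, q, k, v):
    """One decode step of simplified linear attention. Instead of
    attending over an O(n) KV cache like softmax attention, each new
    key/value pair is folded into a fixed-size state S += k v^T and a
    running normalizer z += k, so per-step cost is O(d^2) regardless
    of sequence length."""
    state = state + np.outer(k, v)                   # (d, d) running summary
    norm = norm + k                                  # (d,) normalizer
    out = (q @ state) / max(float(q @ norm), 1e-6)   # (d,) attention output
    return state, norm, out

d = 8
state, norm = np.zeros((d, d)), np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(100):  # 100 tokens, yet the state never grows
    q, k, v = rng.random(d), rng.random(d), rng.random(d)
    state, norm, out = linear_attention_step(state, norm, q, k, v)
print(out.shape)  # (8,)
```

Log-linear attention sits between this O(1) state and softmax attention's O(n) cache, maintaining a logarithmic number of such summaries.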
@MFarajtabar
Mehrdad Farajtabar
6 months
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute at pattern matching? The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks…
110 replies · 580 reposts · 3K likes
@tuvllms
Tu Vu
6 months
✨ New paper ✨ 🚨 Scaling test-time compute can lead to inverse or flattened scaling!! We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways: ➡️ Frontier LLMs struggle on Seal-0 (SealQA’s…
4 replies · 42 reposts · 146 likes
@chautmpham
Chau Minh Pham
6 months
🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
6 replies · 36 reposts · 122 likes
@byryuer
Shiyue Zhang
6 months
🚀 New paper on evaluating retrieval robustness – how well LLMs handle imperfect retrieval: 1️⃣ RAG >= non-RAG? 2️⃣ More docs >= fewer docs? 3️⃣ Sensitivity to doc order ▶️ 11 LLMs × 3 prompting strategies Findings: LLMs show surprisingly high robustness—but limitations remain. 1/2
2 replies · 10 reposts · 49 likes
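Question 3️⃣ above (sensitivity to doc order) can be probed with a tiny harness like the following; `answer_fn` and the harness itself are hypothetical stand-ins of my own for any LLM call, not the paper's evaluation code:

```python
import random

def order_sensitivity(answer_fn, question, docs, trials=6, seed=0):
    """Ask the same question with shuffled retrieved docs and report
    whether the answer stays constant across orderings."""
    rng = random.Random(seed)
    answers = set()
    for _ in range(trials):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        answers.add(answer_fn(question, shuffled))
    return len(answers) == 1  # True → robust to doc order

# stub "model" that ignores order entirely, hence maximally robust
robust = order_sensitivity(
    lambda q, d: "Paris", "capital of France?",
    ["doc A", "doc B", "doc C"])
print(robust)  # True
```

Swapping in a real model call for the lambda, and an answer-equivalence check instead of set membership, turns this into the kind of robustness probe the tweet's 11-LLM × 3-prompt grid measures.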
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
6 months
How much do language models memorize? "We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we…
8 replies · 174 reposts · 1K likes
@shizhediao
Shizhe Diao
6 months
Does RL truly expand a model’s reasoning 🧠 capabilities? Contrary to recent claims, the answer is yes, if you push RL training long enough! Introducing ProRL 😎, a novel training recipe that scales RL to >2k steps, empowering the world’s leading 1.5B reasoning model 💥 and offering…
19 replies · 71 reposts · 416 likes
@aryaman2020
Aryaman Arora
6 months
new paper! 🫡 why are state space models (SSMs) worse than Transformers at recall over their context? this is a question about the mechanisms underlying model behaviour: therefore, we propose using mechanistic evaluations to answer it!
12 replies · 95 reposts · 663 likes
@minhnhle
Nhat Minh Le
6 months
LLMs memorize novels 📚 in English. But what about existing translations? Or translations into new languages? Our 🦉OWL dataset (31K/10 languages) shows GPT4o recognizes books: 92% English 83% official translations 69% unseen translations 75% as audio (EN)
1 reply · 11 reposts · 21 likes
@DanielKhashabi
Daniel Khashabi 🕊️
6 months
Long-form inputs (e.g., needle-in-a-haystack setups) are a central aspect of high-impact LLM applications. While previous studies have flagged issues like positional bias and distracting documents, they've missed a crucial element: the size of the gold/relevant context. In our…
3 replies · 21 reposts · 52 likes
@YapeiChang
Yapei Chang
6 months
🤔 Can simple string-matching metrics like BLEU rival reward models for LLM alignment? 🔍 We show that given access to a reference, BLEU can match reward models in human preference agreement, and even train LLMs competitively with them using GRPO. 🫐 Introducing BLEUBERI:
6 replies · 45 reposts · 193 likes
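The BLEU-as-reward idea can be sketched with a from-scratch sentence-level BLEU; this is my own minimal smoothed implementation for illustration, not the BLEUBERI code, but it shows the kind of scalar a GRPO trainer could consume in place of a learned reward model:

```python
import math
from collections import Counter

def bleu_reward(candidate, reference, max_n=4):
    """Sentence-level BLEU as a scalar reward: geometric mean of
    smoothed modified n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_p = []
    for n in range(1, max_n + 1):
        c_ng = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ng = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ng[g]) for g, c in c_ng.items())  # clipped counts
        total = max(sum(c_ng.values()), 1)
        log_p.append(math.log((overlap + 1) / (total + 1)))      # add-1 smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))            # brevity penalty
    return bp * math.exp(sum(log_p) / max_n)

ref = "the quick brown fox jumps over the lazy dog"
print(bleu_reward(ref, ref))                        # 1.0 for an exact match
print(bleu_reward("a quick brown fox", ref) < 1.0)  # True
```

Because the reward needs only a reference string and no extra forward pass, it is far cheaper than a reward model, which is what makes the head-to-head comparison in the tweet interesting.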