BerkeleyNLP

@BerkeleyNLP

Followers 6K · Following 118 · Media 5 · Statuses 115

We work on natural language processing, machine learning, linguistics, and deep learning. PIs: Dan Klein, @alsuhr, @sewon__min

Berkeley, California
Joined September 2019
@aomaru_21490
Jiaxin Ge
17 days
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
3
33
114
@sewon__min
Sewon Min
20 days
Super excited about @wenjie_ma's work on verifying math proofs!
✅ 24 competitions, 3 SoTAs (o3, Gemini-2.5-Pro, R1)
✅ Strong evaluator -- a carefully designed evaluator with a simple ensemble beats agentic ones
✅ Strong best-of-n performance
Check out the paper & website!
@wenjie_ma
Wenjie Ma
20 days
LLMs solving math benchmarks with verifiable answers like AIME? ✅ LLMs solving math proofs? ❌ Still an open problem.
RL works great for final-answer problems, but proofs are different:
- Often no single checkable answer
- Correct answers can hide flawed reasoning
The key…
3
16
119
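A minimal sketch of the best-of-n selection with a judge ensemble described above. The generate_proof and judge_proof functions are hypothetical stubs standing in for LLM calls so the script runs end to end; the actual prover, evaluator design, and ensembling scheme are specified in the paper and website.

import random
from statistics import mean

def generate_proof(problem: str, seed: int) -> str:
    """Hypothetical stand-in: sample one candidate proof from a prover LLM."""
    return f"candidate proof #{seed} for: {problem}"

def judge_proof(proof: str, judge_id: int) -> float:
    """Hypothetical stand-in: one judge's estimate that the proof is valid."""
    rng = random.Random(hash((proof, judge_id)))
    return rng.random()

def best_of_n(problem: str, n: int = 8, n_judges: int = 3) -> str:
    # Sample n candidate proofs, score each with a small judge ensemble
    # (mean of independent validity estimates), and return the top scorer.
    candidates = [generate_proof(problem, s) for s in range(n)]
    scores = [mean(judge_proof(p, j) for j in range(n_judges)) for p in candidates]
    return max(zip(scores, candidates))[1]

print(best_of_n("Show that sqrt(2) is irrational."))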
@kayo_yin
Kayo Yin
5 months
Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🧠🎉 How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach? 🌐 https://t.co/LMWcqtOSDG 📅 Submit by June 23rd
6
23
93
@ZhongRuiqi
Ruiqi Zhong
6 months
Last day of PhD! I pioneered using LLMs to explain datasets & models. It's used by the interpretability team at @OpenAI and the societal impacts team at @AnthropicAI. Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section =P
30
39
543
@NickATomlin
Nicholas Tomlin
6 months
The long-term goal of AI is to build models that can handle arbitrary tasks, not just ones they’ve been trained on. We hope our new *benchmark generator* can help measure progress toward this vision
@vcubingx
Vivek Verma
6 months
🎮 Excited to announce gg-bench, a fully synthetic benchmark for LLMs consisting of games generated entirely by LLMs!! This benchmark centers on the fact that LLMs can generate complex tasks that they themselves cannot solve. 📄: https://t.co/kddoCgDkvd
4
31
182
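A minimal sketch of the generate-then-attempt loop behind a synthetic game benchmark of this kind: one model call invents a game, another tries to win it, and the gap between generating and solving is what gets measured. The llm stub and its canned strings are hypothetical; gg-bench's real pipeline of game specs, environments, and win conditions is described in the paper.

def llm(prompt: str) -> str:
    """Hypothetical LLM completion call, stubbed with canned output."""
    if prompt.startswith("Invent"):
        return "Players alternate picking numbers 1-9; first to total exactly 15 wins."
    return "Player 2 wins"  # the stubbed 'player' loses its own game

def make_benchmark_item() -> dict:
    # Generation step: the model writes a brand-new game.
    return {"rules": llm("Invent a novel two-player turn-based game. State the rules.")}

def score_model(item: dict, n_episodes: int = 10) -> float:
    # Solving step: the same (or another) model tries to win the game it made.
    wins = 0
    for _ in range(n_episodes):
        transcript = llm(f"Play this game as Player 1 and try to win:\n{item['rules']}")
        wins += "Player 1 wins" in transcript  # hypothetical win check
    return wins / n_episodes

item = make_benchmark_item()
print(item["rules"], score_model(item))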
@NickATomlin
Nicholas Tomlin
7 months
I'm incredibly excited to share that I'll be joining @TTIC_Connect as an assistant professor in Fall 2026! Until then, I'm wrapping up my PhD at Berkeley, and after that I'll be a faculty fellow at @NYUDataScience
33
10
201
@ZhongRuiqi
Ruiqi Zhong
7 months
Finished my dissertation!!! (scalable oversight, link below) Very fortunate to have @JacobSteinhardt and Dan Klein as my advisors! Words can't describe my gratitude, so I used a pic of Frieren w/ her advisor :) Thanks for developing my research mission, and teaching me magic.
27
11
395
@LakshyAAAgrawal
Lakshya A Agrawal
8 months
🧵Introducing LangProBe: the first benchmark testing where and how composing LLMs into language programs affects cost-quality tradeoffs! We find that, on avg across diverse tasks, smaller models within optimized programs beat calls to larger models at a fraction of the cost.
3
46
143
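A minimal sketch of the cost-quality comparison described above: one expensive-model call versus a two-step "language program" over a cheap model, each logged as an (answer, cost) pair. The call stub, the prices, and the token counting are hypothetical placeholders, not LangProBe's actual harness.

PRICE = {"small": 0.1, "large": 2.0}  # hypothetical $ per 1K tokens

def call(model: str, prompt: str) -> tuple[str, float]:
    """Hypothetical LLM call, stubbed: returns (output, cost of the call)."""
    tokens = len(prompt.split())  # crude token count for the stub
    return f"[{model} answer]", PRICE[model] * tokens / 1000

def large_baseline(q: str) -> tuple[str, float]:
    # One call to the expensive model.
    return call("large", q)

def small_program(q: str) -> tuple[str, float]:
    # A two-step "language program" composed from cheap-model calls.
    plan, c1 = call("small", f"Break this question into steps: {q}")
    answer, c2 = call("small", f"Answer using the plan {plan!r}: {q}")
    return answer, c1 + c2

for fn in (large_baseline, small_program):
    answer, cost = fn("Who wrote the libretto of The Magic Flute?")
    print(f"{fn.__name__}: {answer} (cost ${cost:.4f})")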
@kayo_yin
Kayo Yin
8 months
Induction heads are commonly associated with in-context learning, but are they the primary driver of ICL at scale? We find that recently discovered "function vector" heads, which encode the ICL task, are the actual primary drivers of few-shot ICL. https://t.co/zTpiOKatEF 🧵
16
116
778
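For readers who want to poke at the idea, here is a toy simplification of a function vector: average a mid-layer hidden state over few-shot prompts for a task, then add it back during a zero-shot forward pass. The layer choice, the antonym task, and patching whole-layer states at every position (rather than the per-head analysis in the paper) are all illustrative assumptions; it assumes torch and transformers with a small GPT-2 checkpoint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # illustrative mid-layer choice

def last_hidden(prompt: str) -> torch.Tensor:
    # Hidden state of the final token at LAYER.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# Toy "function vector": mean activation over few-shot antonym prompts.
icl_prompts = [
    "hot -> cold\nbig -> small\nfast ->",
    "up -> down\nwet -> dry\nlight ->",
]
fv = torch.stack([last_hidden(p) for p in icl_prompts]).mean(0)

def add_fv(module, inputs, output):
    # Patch the vector into the residual stream at LAYER (all positions,
    # a further simplification of the published method).
    if isinstance(output, tuple):
        return (output[0] + fv, *output[1:])
    return output + fv

handle = model.transformer.h[LAYER].register_forward_hook(add_fv)
ids = tok("good ->", return_tensors="pt").input_ids
with torch.no_grad():
    next_id = model(ids).logits[0, -1].argmax().item()
handle.remove()
print(tok.decode(next_id))  # ideally something antonym-flavored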
@sea_snell
Charlie Snell
1 year
Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
14
82
575
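A minimal sketch of the extrapolation step, assuming we already have task accuracies for finetuned checkpoints at several model scales (the paper's insight is that finetuning shifts the emergence point into the observable range). The sigmoid-in-log-parameters form and every number below are illustrative assumptions, not the paper's exact parameterization.

import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_n, mid, slope, floor, ceil):
    # Accuracy as a saturating function of log10(model parameters).
    return floor + (ceil - floor) / (1 + np.exp(-slope * (log_n - mid)))

# Hypothetical accuracies of *finetuned* GPT-N checkpoints by model size.
log_params = np.log10([1e8, 3e8, 1e9, 3e9, 1e10])
accuracy = np.array([0.02, 0.05, 0.15, 0.42, 0.71])

popt, _ = curve_fit(sigmoid, log_params, accuracy, p0=[9.5, 2.0, 0.0, 1.0])
# Extrapolate one order of magnitude up, toward "GPT-N+1" scale.
print(f"predicted accuracy at 1e11 params: {sigmoid(11.0, *popt):.2f}")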
@kayo_yin
Kayo Yin
1 year
Cool new dataset for translation ambiguity in 9 language pairs (7 low-resource), and we found LLM-generated descriptions help weaker models resolve ambiguity! @BaruaJosh will be presenting this at the 2-3:30pm poster session today, come talk to us about multilinguality in LLMs!
@BaruaJosh
Josh Barua
1 year
Do LLMs encode knowledge of concept variation across languages? Can they use this knowledge to resolve ambiguity in translation? Our #EMNLP2024 paper finds a big performance gap between closed- and open-weight LLMs, but lexical rules can help transfer knowledge across models! 🧵
0
3
6
@kayo_yin
Kayo Yin
1 year
🚨New dataset + challenge #EMNLP2024🚨 We release ASL STEM Wiki: the first signing dataset of STEM articles! 📰 254 Wikipedia articles 📹 ~300 hours of ASL interpretations 👋 New task: automatic sign suggestion to make STEM education more accessible https://t.co/U2pky8fmxq 🧵
8
22
107
@ZhongRuiqi
Ruiqi Zhong
1 year
Given the rapid progress of LLMs, I feel compelled to present this topic (even if it's not the main focus of my Ph.D. work). I will cover concrete ML problems related to "AI deception" -- undesirable behaviors of AI systems that are hard to catch -- and how to study them.
@stanfordnlp
Stanford NLP Group
1 year
For this week's NLP Seminar, we are thrilled to host @ZhongRuiqi to talk about Concrete Problems in AI Deception: From Evaluation Gaming to Cyber Attack!
When: 10/3 Thurs 11am PT
Non-Stanford affiliates registration form: https://t.co/BEpAObjvcr (closed at 9am PT on the talk…
3
18
122
@ZhongRuiqi
Ruiqi Zhong
1 year
Graphical models struggle to explain patterns in text & images 😭 LLMs can do this but hallucinate. 👿 It's time to combine their strengths! We define models with natural language parameters! Unlocking opportunities in science, business, ML, etc.
8
30
224
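A minimal sketch of a model whose parameters are natural-language predicates: an LLM judge decides whether each sample satisfies each predicate. The llm_satisfies function is a hypothetical stand-in, stubbed here with crude keyword overlap -- exactly the paraphrase-blindness a real LLM judge would fix -- and the real method also proposes and refines the predicates themselves.

def llm_satisfies(predicate: str, text: str) -> bool:
    """Hypothetical LLM judgment, stubbed with keyword overlap; note it
    misses 'too expensive' as a price complaint, which an LLM would catch."""
    return any(word in text.lower() for word in predicate.lower().split())

# The model's "parameters" are plain-English predicates.
predicates = ["complains about price", "praises battery life"]
reviews = ["The battery life is great", "Way too expensive for what it is"]

# Assignment step: attach each sample to the predicates it satisfies.
assignments = {r: [p for p in predicates if llm_satisfies(p, r)] for r in reviews}
print(assignments)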
@ZhongRuiqi
Ruiqi Zhong
1 year
A central concern in alignment is that AI systems will "deceive" humans by doing what looks correct to humans but is actually wrong. While a lot of work is motivated by this assumption, we lack empirical evidence. Our work shows systematic evidence that this concern is real.
@jiaxinwen22
Jiaxin Wen
1 year
RLHF is a popular method. It boosts your human eval scores and Elo rating 🚀🚀. But really❓ Your model might be “cheating” you! 😈😈 We show that LLMs can learn to mislead human evaluators via RLHF. 🧵 below
1
13
87
@ZhongRuiqi
Ruiqi Zhong
1 year
Large mental model update after working on this project:
1. Even when an LLM does not know what's correct, it can still learn to assist humans in finishing the task
2. Sometimes LLMs are even better than humans at distinguishing what is helpful for humans (!)
@jiaxinwen22
Jiaxin Wen
1 year
LLMs can generate complex programs. But they are often wrong. How should users fix them? We propose using LLMs to assist humans by decomposing the solutions in a helpful way. We increase non-experts' efficiency by 3.3X, allow them to solve 33.3% more problems, and empower them…
0
13
76
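A minimal sketch of the assistance idea: ask an LLM to decompose its own solution into small helpers, each with a spec a non-expert can check, so errors can be localized and repaired one function at a time. The llm stub and its canned output are hypothetical, not the paper's prompts.

def llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with a canned decomposition."""
    return (
        "def parse_input(s): ...   # spec: turn raw text into a board\n"
        "def is_valid(board): ...  # spec: check the board obeys the rules\n"
        "def solve(board): ...     # spec: combine the two helpers above"
    )

monolith = "def main(s): ..."  # a hard-to-check LLM-generated program
decomposed = llm(
    "Rewrite this program as small helper functions, each with a one-line "
    f"spec that a non-expert can check independently:\n{monolith}"
)
print(decomposed)  # a human can now test and repair one helper at a time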
@sea_snell
Charlie Snell
1 year
On difficult problems, humans can think longer to improve their decisions. Can we instill a similar capability into LLMs? And does it work well? In our paper, we find that by optimally scaling test-time compute we can outperform *much* larger models in a FLOPs-matched evaluation.
12
94
699
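A minimal sketch of one way to spend test-time compute: parallel sampling plus majority voting (self-consistency) under a fixed sample budget. The sample_answer stub is a hypothetical stand-in for a stochastic LLM; the paper compares strategy families like this against sequential revisions and verifier-guided search under matched FLOPs.

import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stochastic LLM; the stub is right 60% of the time."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def majority_vote(question: str, budget: int) -> str:
    # Spend the whole budget on parallel samples, return the modal answer.
    votes = Counter(sample_answer(question) for _ in range(budget))
    return votes.most_common(1)[0][0]

random.seed(0)
for n in (1, 4, 16, 64):  # accuracy improves as the budget grows
    print(f"budget={n:2d} -> {majority_vote('What is 6 * 7?', n)}")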