BerkeleyNLP

@BerkeleyNLP

Followers 6K · Following 118 · Media 5 · Statuses 115

We work on natural language processing, machine learning, linguistics, and deep learning. PIs: Dan Klein, @alsuhr, @sewon__min

Berkeley, California
Joined September 2019
@aomaru_21490
Jiaxin Ge
17 days
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
3
33
114
@sewon__min
Sewon Min
20 days
Super excited about @wenjie_ma's work on verifying math proofs!
✅ 24 competitions, 3 SoTAs (o3, Gemini-2.5-Pro, R1)
✅ Strong evaluator -- a carefully designed evaluator with a simple ensemble beats agentic ones
✅ Strong best-of-n performance
Check out the paper & website!
@wenjie_ma
Wenjie Ma
20 days
LLMs solving math benchmarks with verifiable answers like AIME? ✅ LLMs solving math proofs? ❌ Still an open problem.
RL works great for final-answer problems, but proofs are different:
- Often no single checkable answer
- Correct answers can hide flawed reasoning
The key…
3
16
119
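A minimal sketch of the best-of-n selection with a judge ensemble described above. The generate_proof and judge_proof functions are hypothetical stubs standing in for LLM calls so the script runs end to end; the actual prover, evaluator design, and ensembling scheme are specified in the paper and website.

import random
from statistics import mean

def generate_proof(problem: str, seed: int) -> str:
    """Hypothetical stand-in: sample one candidate proof from a prover LLM."""
    return f"candidate proof #{seed} for: {problem}"

def judge_proof(proof: str, judge_id: int) -> float:
    """Hypothetical stand-in: one judge's estimate that the proof is valid."""
    rng = random.Random(hash((proof, judge_id)))
    return rng.random()

def best_of_n(problem: str, n: int = 8, n_judges: int = 3) -> str:
    # Sample n candidate proofs, score each with a small judge ensemble
    # (mean of independent validity estimates), and return the top scorer.
    candidates = [generate_proof(problem, s) for s in range(n)]
    scores = [mean(judge_proof(p, j) for j in range(n_judges)) for p in candidates]
    return max(zip(scores, candidates))[1]

print(best_of_n("Show that sqrt(2) is irrational."))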
@kayo_yin
Kayo Yin
5 months
Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🧠🎉 How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach? 🌐 https://t.co/LMWcqtOSDG 📅 Submit by June 23rd
6
23
93
@ZhongRuiqi
Ruiqi Zhong
6 months
Last day of PhD! I pioneered using LLMs to explain datasets & models. It's used by the interpretability team at @OpenAI and the societal impacts team at @AnthropicAI. Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section =P
30
39
543
@NickATomlin
Nicholas Tomlin
6 months
The long-term goal of AI is to build models that can handle arbitrary tasks, not just ones they’ve been trained on. We hope our new *benchmark generator* can help measure progress toward this vision
@vcubingx
Vivek Verma
6 months
🎮 Excited to announce gg-bench, a fully synthetic benchmark for LLMs consisting of games generated entirely by LLMs!! This benchmark centers on the fact that LLMs can generate complex tasks that they themselves cannot solve. 📄: https://t.co/kddoCgDkvd
4
31
182
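A minimal sketch of the generate-then-attempt loop behind a synthetic game benchmark of this kind: one model call invents a game, another tries to win it, and the gap between generating and solving is what gets measured. The llm stub and its canned strings are hypothetical; gg-bench's real pipeline of game specs, environments, and win conditions is described in the paper.

def llm(prompt: str) -> str:
    """Hypothetical LLM completion call, stubbed with canned output."""
    if prompt.startswith("Invent"):
        return "Players alternate picking numbers 1-9; first to total exactly 15 wins."
    return "Player 2 wins"  # the stubbed 'player' loses its own game

def make_benchmark_item() -> dict:
    # Generation step: the model writes a brand-new game.
    return {"rules": llm("Invent a novel two-player turn-based game. State the rules.")}

def score_model(item: dict, n_episodes: int = 10) -> float:
    # Solving step: the same (or another) model tries to win the game it made.
    wins = 0
    for _ in range(n_episodes):
        transcript = llm(f"Play this game as Player 1 and try to win:\n{item['rules']}")
        wins += "Player 1 wins" in transcript  # hypothetical win check
    return wins / n_episodes

item = make_benchmark_item()
print(item["rules"], score_model(item))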
@NickATomlin
Nicholas Tomlin
7 months
I'm incredibly excited to share that I'll be joining @TTIC_Connect as an assistant professor in Fall 2026! Until then, I'm wrapping up my PhD at Berkeley, and after that I'll be a faculty fellow at @NYUDataScience
33
10
201
@ZhongRuiqi
Ruiqi Zhong
7 months
Finished my dissertation!!! (scalable oversight, link below) Very fortunate to have @JacobSteinhardt and Dan Klein as my advisors! Words can't describe my gratitude, so I used a pic of Frieren w/ her advisor :) Thanks for developing my research mission, and teaching me magic.
27
11
395
@LakshyAAAgrawal
Lakshya A Agrawal
8 months
🧵Introducing LangProBe: the first benchmark testing where and how composing LLMs into language programs affects cost-quality tradeoffs! We find that, on avg across diverse tasks, smaller models within optimized programs beat calls to larger models at a fraction of the cost.
3
46
143
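A minimal sketch of the cost-quality comparison described above: one expensive-model call versus a two-step "language program" over a cheap model, each logged as an (answer, cost) pair. The call stub, the prices, and the token counting are hypothetical placeholders, not LangProBe's actual harness.

PRICE = {"small": 0.1, "large": 2.0}  # hypothetical $ per 1K tokens

def call(model: str, prompt: str) -> tuple[str, float]:
    """Hypothetical LLM call, stubbed: returns (output, cost of the call)."""
    tokens = len(prompt.split())  # crude token count for the stub
    return f"[{model} answer]", PRICE[model] * tokens / 1000

def large_baseline(q: str) -> tuple[str, float]:
    # One call to the expensive model.
    return call("large", q)

def small_program(q: str) -> tuple[str, float]:
    # A two-step "language program" composed from cheap-model calls.
    plan, c1 = call("small", f"Break this question into steps: {q}")
    answer, c2 = call("small", f"Answer using the plan {plan!r}: {q}")
    return answer, c1 + c2

for fn in (large_baseline, small_program):
    answer, cost = fn("Who wrote the libretto of The Magic Flute?")
    print(f"{fn.__name__}: {answer} (cost ${cost:.4f})")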
@kayo_yin
Kayo Yin
8 months
Induction heads are commonly associated with in-context learning, but are they the primary driver of ICL at scale? We find that recently discovered "function vector" heads, which encode the ICL task, are the actual primary drivers of few-shot ICL. https://t.co/zTpiOKatEF 🧵
16
116
778
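For readers who want to poke at the idea, here is a toy simplification of a function vector: average a mid-layer hidden state over few-shot prompts for a task, then add it back during a zero-shot forward pass. The layer choice, the antonym task, and patching whole-layer states at every position (rather than the per-head analysis in the paper) are all illustrative assumptions; it assumes torch and transformers with a small GPT-2 checkpoint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # illustrative mid-layer choice

def last_hidden(prompt: str) -> torch.Tensor:
    # Hidden state of the final token at LAYER.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# Toy "function vector": mean activation over few-shot antonym prompts.
icl_prompts = [
    "hot -> cold\nbig -> small\nfast ->",
    "up -> down\nwet -> dry\nlight ->",
]
fv = torch.stack([last_hidden(p) for p in icl_prompts]).mean(0)

def add_fv(module, inputs, output):
    # Patch the vector into the residual stream at LAYER (all positions,
    # a further simplification of the published method).
    if isinstance(output, tuple):
        return (output[0] + fv, *output[1:])
    return output + fv

handle = model.transformer.h[LAYER].register_forward_hook(add_fv)
ids = tok("good ->", return_tensors="pt").input_ids
with torch.no_grad():
    next_id = model(ids).logits[0, -1].argmax().item()
handle.remove()
print(tok.decode(next_id))  # ideally something antonym-flavored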
@sea_snell
Charlie Snell
1 year
Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
14
82
575
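A minimal sketch of the extrapolation step, assuming we already have task accuracies for finetuned checkpoints at several model scales (the paper's insight is that finetuning shifts the emergence point into the observable range). The sigmoid-in-log-parameters form and every number below are illustrative assumptions, not the paper's exact parameterization.

import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_n, mid, slope, floor, ceil):
    # Accuracy as a saturating function of log10(model parameters).
    return floor + (ceil - floor) / (1 + np.exp(-slope * (log_n - mid)))

# Hypothetical accuracies of *finetuned* GPT-N checkpoints by model size.
log_params = np.log10([1e8, 3e8, 1e9, 3e9, 1e10])
accuracy = np.array([0.02, 0.05, 0.15, 0.42, 0.71])

popt, _ = curve_fit(sigmoid, log_params, accuracy, p0=[9.5, 2.0, 0.0, 1.0])
# Extrapolate one order of magnitude up, toward "GPT-N+1" scale.
print(f"predicted accuracy at 1e11 params: {sigmoid(11.0, *popt):.2f}")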
@kayo_yin
Kayo Yin
1 year
Cool new dataset for translation ambiguity in 9 language pairs (7 low-resource), and we found LLM-generated descriptions help weaker models resolve ambiguity! @BaruaJosh will be presenting this at the 2-3:30pm poster session today, come talk to us about multilinguality in LLMs!
@BaruaJosh
Josh Barua
1 year
Do LLMs encode knowledge of concept variation across languages? Can they use this knowledge to resolve ambiguity in translation? Our #EMNLP2024 paper finds a big performance gap between closed- and open-weight LLMs, but lexical rules can help transfer knowledge across models! 🧵
0
3
6
@kayo_yin
Kayo Yin
1 year
🚨New dataset + challenge #EMNLP2024🚨 We release ASL STEM Wiki: the first signing dataset of STEM articles! 📰 254 Wikipedia articles 📹 ~300 hours of ASL interpretations 👋 New task: automatic sign suggestion to make STEM education more accessible https://t.co/U2pky8fmxq 🧵
8
22
107
@ZhongRuiqi
Ruiqi Zhong
1 year
Given the rapid progress of LLMs, I feel compelled to present this topic (even if it's not the main focus of my Ph.D. work). I will cover concrete ML problems related to "AI deception" -- undesirable behaviors of AI systems that are hard to catch -- and how to study them.
@stanfordnlp
Stanford NLP Group
1 year
For this week's NLP Seminar, we are thrilled to host @ZhongRuiqi to talk about Concrete Problems in AI Deception: From Evaluation Gaming to Cyber Attack!
When: 10/3 Thurs 11am PT
Non-Stanford affiliates registration form: https://t.co/BEpAObjvcr (closed at 9am PT on the talk…
3
18
122
@ZhongRuiqi
Ruiqi Zhong
1 year
Graphical models struggle to explain patterns in text & images 😭 LLMs can do this but hallucinate. 👿 It's time to combine their strengths! We define models with natural language parameters! Unlocking opportunities in science, business, ML, etc.
8
30
224
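A minimal sketch of a model whose parameters are natural-language predicates: an LLM judge decides whether each sample satisfies each predicate. The llm_satisfies function is a hypothetical stand-in, stubbed here with crude keyword overlap -- exactly the paraphrase-blindness a real LLM judge would fix -- and the real method also proposes and refines the predicates themselves.

def llm_satisfies(predicate: str, text: str) -> bool:
    """Hypothetical LLM judgment, stubbed with keyword overlap; note it
    misses 'too expensive' as a price complaint, which an LLM would catch."""
    return any(word in text.lower() for word in predicate.lower().split())

# The model's "parameters" are plain-English predicates.
predicates = ["complains about price", "praises battery life"]
reviews = ["The battery life is great", "Way too expensive for what it is"]

# Assignment step: attach each sample to the predicates it satisfies.
assignments = {r: [p for p in predicates if llm_satisfies(p, r)] for r in reviews}
print(assignments)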
@ZhongRuiqi
Ruiqi Zhong
1 year
A central concern in alignment is that AI systems will "deceive" humans by doing what looks correct to humans but is actually wrong. While a lot of work is motivated by this assumption, we lack empirical evidence. Our work shows systematic evidence that this concern is real.
@jiaxinwen22
Jiaxin Wen
1 year
RLHF is a popular method. It boosts your human eval scores and Elo rating 🚀🚀. But really❓ Your model might be “cheating” you! 😈😈 We show that LLMs can learn to mislead human evaluators via RLHF. 🧵 below
1
13
87
@ZhongRuiqi
Ruiqi Zhong
1 year
Large mental model update after working on this project:
1. Even when an LLM does not know what's correct, it can still learn to assist humans in finishing the task
2. Sometimes LLMs are even better than humans at distinguishing what is helpful for humans (!)
@jiaxinwen22
Jiaxin Wen
1 year
LLMs can generate complex programs. But they are often wrong. How should users fix them? We propose using LLMs to assist humans by decomposing the solutions in a helpful way. We increase non-experts' efficiency by 3.3X, allow them to solve 33.3% more problems, and empower them…
0
13
76
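A minimal sketch of the assistance idea: ask an LLM to decompose its own solution into small helpers, each with a spec a non-expert can check, so errors can be localized and repaired one function at a time. The llm stub and its canned output are hypothetical, not the paper's prompts.

def llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with a canned decomposition."""
    return (
        "def parse_input(s): ...   # spec: turn raw text into a board\n"
        "def is_valid(board): ...  # spec: check the board obeys the rules\n"
        "def solve(board): ...     # spec: combine the two helpers above"
    )

monolith = "def main(s): ..."  # a hard-to-check LLM-generated program
decomposed = llm(
    "Rewrite this program as small helper functions, each with a one-line "
    f"spec that a non-expert can check independently:\n{monolith}"
)
print(decomposed)  # a human can now test and repair one helper at a time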
@sea_snell
Charlie Snell
1 year
On difficult problems, humans can think longer to improve their decisions. Can we instill a similar capability into LLMs? And does it work well? In our paper, we find that by optimally scaling test-time compute we can outperform *much* larger models in a FLOPs-matched evaluation.
12
94
699
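A minimal sketch of one way to spend test-time compute: parallel sampling plus majority voting (self-consistency) under a fixed sample budget. The sample_answer stub is a hypothetical stand-in for a stochastic LLM; the paper compares strategy families like this against sequential revisions and verifier-guided search under matched FLOPs.

import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stochastic LLM; the stub is right 60% of the time."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def majority_vote(question: str, budget: int) -> str:
    # Spend the whole budget on parallel samples, return the modal answer.
    votes = Counter(sample_answer(question) for _ in range(budget))
    return votes.most_common(1)[0][0]

random.seed(0)
for n in (1, 4, 16, 64):  # accuracy improves as the budget grows
    print(f"budget={n:2d} -> {majority_vote('What is 6 * 7?', n)}")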