Seungone Kim
@seungonekim
Followers
2K
Following
5K
Media
50
Statuses
822
Ph.D. student @LTIatCMU and intern at @AIatMeta (FAIR) working on (V)LM Evaluation & Systems that Self-Improve | Prev: @kaist_ai @yonsei_u
Pittsburgh, PA
Joined November 2021
#NLProc New paper on "evaluation-time scaling", a new dimension to leverage test-time compute! We replicate the test-time scaling behaviors observed in generators (e.g., o1, r1, s1) with evaluators by forcing them to generate additional reasoning tokens. https://t.co/Qaxhdap52S
3
39
171
We are gathering problems to build a challenging math benchmark (collaboration between @AiEleuther and @withmsit). The compensation per problem is up to ~$3,623 and the due date is Nov 10th! https://t.co/TdUG5xvTr2
We are announcing an opportunity for paid question writers to contribute to a new PhD-level math benchmark. Accepted contributors will be paid per question and will be invited to be authors on the resulting dataset paper. Check out the link below for more information!
0
2
22
2) M-Prometheus: A Suite of Open Multilingual LLM Judges w/ @zmprcp @dongkeun_yoon @psanfernandes @ianwu97 @seungonekim @RicardoRei7 @gneubig - (Poster session 1, Tue Oct 7, 11:00 AM – 1:00 PM)
1
2
7
Grad school season reminder: many CS departments run student-led pre-application mentorship programs for prospective PhD applicants (deadlines in Oct.). You can get feedback from current PhD students! E.g.: - UW’s CSE PAMS: https://t.co/RYw4mbD47h - MIT EECS GAAP: https://t.co/piD6hkmHzq 🧵
cs.washington.edu
Pre-Application Mentorship Service (PAMS)
10
42
265
Excited to teach Advanced NLP at CMU again this semester! Slides will be posted on the course page as the course proceeds: https://t.co/xsqARaZEK9 Lectures will be uploaded to YouTube: https://t.co/4kfXvS2MCb
5
94
591
Introducing ⚔️PR Arena⚔️ - free AI coding agents to fix real GitHub issues. Claude Sonnet 4 vs Gemini 2.5 Pro… Who writes better pull requests? 👉 Install here: https://t.co/bk19LcnBVf Powered by @allhands_ai
4
12
79
Language models often produce repetitive responses, and this issue is further amplified by post-training. In this work, we introduce DARLING, a method that explicitly optimizes for both response diversity and quality within online reinforcement learning!
🌀Diversity Aware RL (DARLING)🌀 📝: https://t.co/MH0tui34Cb - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5
2
24
90
Got a new efficient/optimally-thinking LLM? Does your model answer simple queries quickly and spend compute on the harder ones? Test it on our new benchmark, OptimalThinkingBench! 👇 Work led by the amazing @PranjalAggarw16 during his internship!
🤖Introducing OptimalThinkingBench 🤖 📝: https://t.co/aufQVJp8aC - Thinking LLMs use a lot of tokens & overthink; non-thinking LLMs underthink & underperform. - We introduce a benchmark which scores models in the quest to find the best mix. - OptimalThinkingBench reports the F1
0
10
79
Current multimodal LLMs excel in English and Western contexts but struggle with cultural knowledge from underrepresented regions and languages. How can we build truly globally inclusive vision-language models? We are introducing CulturalGround, a large-scale dataset with 22M
7
25
154
Excited about CMU's new Institute for Computer-Aided Reasoning in Mathematics (ICARM), a new NSF Mathematical Sciences Research Institute. I'm honored to serve as an Assistant Director focusing on machine learning and mathematics.
A new federally funded national institute at CMU will help mathematicians use AI to make mathematical reasoning faster and more reliable in solving pressing challenges across science, security and the economy. Read more, and scroll for further details:
8
23
173
Also check out our LLM-as-an-Interviewer work that will be presented by @euns0o_kim ! I think there's a lot of future work to be done in dynamic evals🙂 https://t.co/5EclVtTOW4
I’ll be presenting our LLM-as-an-Interviewer work at #ACL2025! 📅 When: July 30 (Wed) 11:00-12:30 📍 Where: Hall 4/5 https://t.co/dreWbCy0Pb Feel free to stop by! Looking forward to discussing (m)LLM evaluation and more!
0
0
3
I'll try to do my best 😳 There's no substitute for @seungonekim
Unfortunately, I won't be at @aclmeeting this year, but my advisor @gneubig will thankfully be presenting this work! (It's so cool to have an advisor who presents your paper☺️) 📆 July 29th (Tuesday), 10:30AM-12:00PM 📍Hall 4/5, Session 7: IP-Posters (Poster Session 2)
0
1
33
Unfortunately, I won't be at @aclmeeting this year, but my advisor @gneubig will thankfully be presenting this work! (It's so cool to have an advisor who presents your paper☺️) 📆 July 29th (Tuesday), 10:30AM-12:00PM 📍Hall 4/5, Session 7: IP-Posters (Poster Session 2)
#NLProc Just because GPT-4o is 17 times more expensive than GPT-4o-mini, does that mean it generates synthetic data that is 17 times better? Introducing AgoraBench, a benchmark for evaluating the data generation capabilities of LMs.
1
7
63
We’ve prepared a tutorial for ACL this year to give you some answers. Come join @xiangyue96, @alisawuffles, @yizhongwyz, @gneubig, and me for “Synthetic Data in the Era of LLMs.” 📍 Sunday 2–3:30pm, Hall B #ACL2025
6
8
56
I’ll be presenting our LLM-as-an-Interviewer work at #ACL2025! 📅 When: July 30 (Wed) 11:00-12:30 📍 Where: Hall 4/5 https://t.co/dreWbCy0Pb Feel free to stop by! Looking forward to discussing (m)LLM evaluation and more!
arxiv.org
We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large language models (LLMs). This approach leverages multi-turn interactions where the LLM interviewer actively provides...
[1/7] 🚨 New LLM Evaluation Paper Alert! How can we better understand LLMs' abilities? Why not interview them across multiple turns? 🎤 We introduce the LLM-as-an-Interviewer Framework, along with its summarized interview report! 👉 https://t.co/dreWbCyyEJ
0
4
29
Can LLMs self-improve on code generation? Check out our work AlphaVerus, where the model generates provably correct code and self-improves without any weight updates! At #ICML2025 today: 📆: 11:00 AM - 1:30 PM 📍: Poster #East-2912 https://t.co/53AIFOaEBY w/ Bryan, @wellecks
0
10
57