Chaitanya Malaviya

@cmalaviya11

Followers
338
Following
147
Media
32
Statuses
74

Senior research scientist @GoogleDeepMind | benchmarking and evaluation | prev @upennnlp @allen_ai @GoogleDeepMind and @LTIatCMU

Seattle, WA
Joined September 2023
@cmalaviya11
Chaitanya Malaviya
5 months
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
1
23
76
@jessyjli
Jessy Li
26 days
On my way to #COLM2025 🍁 Check out
https://t.co/snFTIg24Am - QUDsim: Discourse templates in LLM stories
https://t.co/xqvbDvH5v0 - EvalAgent: retrieval-based eval targeting implicit criteria
https://t.co/f3JRojHeLb - RoboInstruct: code generation for robotics with simulators
0
5
64
@AllenCChang
Allen Chang
2 months
What if survey-derived rubrics 📋 graded ChatGPT instead of vibes? We benchmark LLMs & deep research systems across 75 research fields 🩺🧬🦾⚗️🏛️🎭💹: Perplexity deep research wins > 82% of head-to-heads vs the next best! w/ @realliyifei, @cmalaviya11, and @yatskar
@realliyifei
Li S. Yifei
2 months
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using
0
8
13
@realliyifei
Li S. Yifei
2 months
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using
1
23
59
@TomerWolfson
Tomer Wolfson
2 months
Deep research systems can't handle questions involving dozens of documents. Let me show you why this is (still) true 🧵 and what it all has to do with Grace Kelly (1/)
1
6
17
@TomerWolfson
Tomer Wolfson
3 months
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1
14
41
@kjfeng_
Kevin Feng
3 months
📢 New paper, published by @knightcolumbia. We often talk about AI agents augmenting vs. automating work, but what exactly can different configurations of human-agent interaction look like? We introduce a 5-level framework for AI agent autonomy to unpack this. 🧵👇
3
13
45
@cmalaviya11
Chaitanya Malaviya
3 months
People at #ACL2025, come drop by our poster today & chat with me about how context matters for reliable language model evaluations! Jul 30, 11:00-12:30 at Hall 4X, board 424.
@cmalaviya11
Chaitanya Malaviya
1 year
Excited to share ✨ Contextualized Evaluations ✨! Benchmarks like Chatbot Arena contain underspecified queries, which can lead to arbitrary eval judgments. What happens if we provide evaluators with context (e.g., who's the user, what's their intent) when judging LM outputs? 🧵↓
1
6
23
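A minimal sketch of what "providing evaluators with context" could look like in practice, assuming a simple pairwise LLM-judge setup; the query, context fields, and prompt wording below are illustrative stand-ins, not the paper's exact protocol.

```python
# Sketch: fold clarifying context about an underspecified query into a
# pairwise judge prompt. All names and wording here are assumptions.

def build_contextualized_judge_prompt(query, context, response_a, response_b):
    """Assemble a pairwise evaluation prompt that includes follow-up
    question/answer pairs spelling out who the user is and what they want."""
    context_block = "\n".join(f"- {q} {a}" for q, a in context)
    return (
        "You are judging two responses to a user query.\n"
        f"Query: {query}\n"
        "Context about the user and their intent:\n"
        f"{context_block}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Considering the context above, which response better serves this "
        "user? Answer 'A' or 'B' with a brief justification."
    )

if __name__ == "__main__":
    prompt = build_contextualized_judge_prompt(
        query="What's a good programming language to learn?",
        context=[
            ("Who is asking?", "A high-school student with no coding experience."),
            ("What is the goal?", "Building simple games as a hobby."),
        ],
        response_a="Learn C++ for maximum performance...",
        response_b="Python is beginner-friendly and has game libraries like pygame...",
    )
    print(prompt)
```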
@kylelostat
Kyle Lo
3 months
issues w preference LM benchmarks
🐡 data contains cases where the "bad" response is just as good as the chosen one
🐟 model rankings can feel off (claude ranks lower than expected)
led by @cmalaviya11 (TACL 2025), we study underspecified queries & their detrimental effect on model evals
@allen_ai
Ai2
3 months
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
0
11
30
@cmalaviya11
Chaitanya Malaviya
3 months
Context is an overlooked aspect of language model evaluations. Check out our TACL paper to see how to incorporate context into evaluations, how it changes evaluation conclusions, and how it makes evaluation more reliable!
@allen_ai
Ai2
3 months
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
0
1
11
@ManyaWadhwa1
Manya Wadhwa
4 months
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf 🎉🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! 📜
@ManyaWadhwa1
Manya Wadhwa
6 months
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user's prompt 🧵👇
1
18
76
@cmalaviya11
Chaitanya Malaviya
5 months
Thanks for the mention @natolambert :) shoutout to the amazing undergrad @abharadwaj123 who led this work!
@natolambert
Nathan Lambert
5 months
Nice to see folks studying biases in RLHF / preference tuning all the way down to the datasets. I think many of the biases are mostly irreducible human biases that can't be solved within current training regimes, just mitigated.
0
0
4
@cmalaviya11
Chaitanya Malaviya
5 months
Check out our paper and counterfactual data below! 👇
• Paper: https://t.co/ZwjVJ66COS
• Data: https://t.co/Omkyf4o233
• Code:
github.com
Contribute to anirudhb123/Preference-Model-Biases development by creating an account on GitHub.
0
1
2
@cmalaviya11
Chaitanya Malaviya
5 months
Our findings suggest that targeted debiasing using counterfactuals can help build more reliable preference models, a key step for both LLM alignment and evaluation. Work led by @abharadwaj123 and done jointly with @nitishjoshi23 and @yatskar.
1
1
5
@cmalaviya11
Chaitanya Malaviya
5 months
For instance, miscalibration for vagueness dropped from 51.3% to 28.5% and for jargon from 50.3% to 33.2% after CDA. Even joint debiasing across multiple biases (length, vagueness, jargon) proved effective with minimal impact on general capabilities.
1
0
2
@cmalaviya11
Chaitanya Malaviya
5 months
And the results? CDA works! It significantly reduced average miscalibration (e.g., from 39.4% to 32.5%) and brought model skew much closer to human preferences. All this while maintaining overall performance on RewardBench!
1
0
3
@cmalaviya11
Chaitanya Malaviya
5 months
So how do we debias models? We propose a simple yet effective post-training method based on counterfactual data augmentation (CDA). We synthesize contrastive responses that explicitly magnify biases in dispreferred responses, & further finetune reward models on these responses.
1
0
3
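A rough sketch of the counterfactual data augmentation (CDA) recipe described in the tweet above, under assumed data formats. Here `rewrite_with_bias` is a toy string-level stand-in for what would really be an LLM rewriting step, and the reward-model finetuning itself is left out.

```python
# Sketch of CDA for preference data: keep the original chosen response as
# preferred, and pair it with a bias-magnified rewrite as the dispreferred one.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response

def rewrite_with_bias(response: str, bias: str) -> str:
    """Toy stand-in for an LLM rewriting step: exaggerate one bias in the
    response while keeping its content. A real pipeline would prompt a
    strong LLM to produce these contrastive rewrites."""
    if bias == "length":
        return response + " To elaborate further: " + response
    if bias == "vagueness":
        return "It really depends on many factors, but broadly speaking, " + response
    if bias == "jargon":
        return response + " (leveraging a state-of-the-art, paradigm-agnostic workflow)"
    if bias == "sycophancy":
        return "Great question! You're absolutely right to ask. " + response
    raise ValueError(f"unknown bias: {bias}")

def make_counterfactual_pairs(seed_pairs, biases):
    """Build CDA pairs: the chosen response stays preferred and the
    bias-magnified rewrite becomes the new dispreferred response."""
    augmented = []
    for pair in seed_pairs:
        for bias in biases:
            augmented.append(
                PreferencePair(pair.prompt, pair.chosen,
                               rewrite_with_bias(pair.chosen, bias))
            )
    return augmented

if __name__ == "__main__":
    seed = [PreferencePair(
        prompt="How do I fix a flat bike tire?",
        chosen="Remove the wheel, find the puncture, patch or swap the tube, then reinflate.",
        rejected="Just take it to a shop.",
    )]
    for p in make_counterfactual_pairs(seed, ["length", "vagueness", "jargon", "sycophancy"]):
        print(p.rejected[:70])
    # The augmented pairs would then be used to further finetune the reward
    # model with a standard chosen-over-rejected (Bradley-Terry) objective.
```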
@cmalaviya11
Chaitanya Malaviya
5 months
Indeed, preference models can easily latch on to these subtle data artifacts! Features that only weakly correlate with human preferences (r_human=−0.12) are strongly predictive for models (r_model=0.36). Points above y=x suggest that models overrely on these spurious cues 😮
1
1
3
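A small synthetic illustration of the feature-correlation analysis in the tweet above: correlate one surface feature with human choices and with a preference model's choices. The data below is simulated for illustration; the r_human and r_model values in the tweet come from the paper's real datasets.

```python
# Sketch: measure how strongly a surface feature predicts human vs. model
# preferences. A feature with |r_model| >> |r_human| is a spurious cue.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Surface feature for each comparison, e.g. length(A) - length(B), standardized.
feat = rng.normal(size=n)
# Simulated labels: humans barely track the feature; the model leans on it heavily.
human_prefers_a = (0.05 * feat + rng.normal(size=n) > 0).astype(float)
model_prefers_a = (0.80 * feat + rng.normal(size=n) > 0).astype(float)

r_human = float(np.corrcoef(feat, human_prefers_a)[0, 1])
r_model = float(np.corrcoef(feat, model_prefers_a)[0, 1])
print(f"r_human={r_human:+.2f}  r_model={r_model:+.2f}")
# A point far above the y=x line on an (r_human, r_model) scatter marks a
# feature the preference model relies on much more than people do.
```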
@cmalaviya11
Chaitanya Malaviya
5 months
Where do these biases come from? 🤔 Our analysis suggests they originate from training data artifacts. For example, humans preferred structured responses >65% of the time when the alternative wasn't structured. This gives models an opportunity to learn these patterns as heuristics!
1
0
4
@cmalaviya11
Chaitanya Malaviya
5 months
How severe is the problem? Using controlled counterfactual pairs, we found that preference models (incl. LLM evaluators) prefer biased responses in >60% of cases (defined as skew) and show high miscalibration (~40%) wrt humans. Vagueness & sycophancy are especially problematic!
1
1
4
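A toy sketch of how the two diagnostics in the tweet above might be computed over controlled counterfactual pairs. These are paraphrased definitions (skew = how often the model prefers the bias-magnified response, miscalibration = how often model and human preferences disagree on the same pair), not necessarily the paper's exact formulas.

```python
# Sketch: skew and miscalibration over counterfactual pairs, where each pair
# contrasts an original response with a bias-magnified rewrite of it.

def skew(model_prefers_biased):
    """Fraction of counterfactual pairs where the model picks the biased response."""
    return sum(model_prefers_biased) / len(model_prefers_biased)

def miscalibration(model_prefers_biased, human_prefers_biased):
    """Fraction of pairs where the model's preference disagrees with the human's."""
    disagreements = [m != h for m, h in zip(model_prefers_biased, human_prefers_biased)]
    return sum(disagreements) / len(disagreements)

# Toy example: 10 pairs; humans rarely prefer the biased response, the model often does.
model = [True, True, True, False, True, True, False, True, True, False]
human = [False, False, True, False, False, False, False, True, False, False]
print(f"skew={skew(model):.0%}  miscalibration={miscalibration(model, human):.0%}")
```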
@cmalaviya11
Chaitanya Malaviya
5 months
Preference models act as proxies for human judgements in alignment (as reward models) & evaluation, but they can be miscalibrated. We found that they overrely on many idiosyncratic features of AI-generated text, which can lead to reward hacking & unreliable evals. Features like:
1
1
6