Chaitanya Malaviya

@cmalaviya11

Followers
338
Following
147
Media
32
Statuses
74

Senior research scientist @GoogleDeepMind | benchmarking and evaluation | prev @upennnlp @allen_ai @GoogleDeepMind and @LTIatCMU

Seattle, WA
Joined September 2023
@cmalaviya11
Chaitanya Malaviya
5 months
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
1
23
76
@jessyjli
Jessy Li
26 days
On my way to #COLM2025 🍁 Check out
https://t.co/snFTIg24Am - QUDsim: Discourse templates in LLM stories
https://t.co/xqvbDvH5v0 - EvalAgent: retrieval-based eval targeting implicit criteria
https://t.co/f3JRojHeLb - RoboInstruct: code generation for robotics with simulators
0
5
64
@AllenCChang
Allen Chang
2 months
What if survey-derived rubrics 📋 graded ChatGPT instead of vibes? We benchmark LLMs & deep research systems across 75 research fields 🩺🧬🦾⚗️🏛️🎭💹: Perplexity deep research wins > 82% of head-to-heads vs the next best! w/ @realliyifei, @cmalaviya11, and @yatskar
@realliyifei
Li S. Yifei
2 months
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using
0
8
13
@realliyifei
Li S. Yifei
2 months
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using
1
23
59
@TomerWolfson
Tomer Wolfson
2 months
Deep research systems can't handle questions involving dozens of documents. Let me show you why this is (still) true 🧵 and what it all has to do with Grace Kelly (1/)
1
6
17
@TomerWolfson
Tomer Wolfson
3 months
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1
14
41
@kjfeng_
Kevin Feng
3 months
📢 New paper, published by @knightcolumbia. We often talk about AI agents augmenting vs. automating work, but what exactly can different configurations of human-agent interaction look like? We introduce a 5-level framework for AI agent autonomy to unpack this. 🧵👇
3
13
45
@cmalaviya11
Chaitanya Malaviya
3 months
People at #ACL2025, come drop by our poster today & chat with me about how context matters for reliable language model evaluations! Jul 30, 11:00-12:30 at Hall 4X, board 424.
@cmalaviya11
Chaitanya Malaviya
1 year
Excited to share ✨ Contextualized Evaluations ✨! Benchmarks like Chatbot Arena contain underspecified queries, which can lead to arbitrary eval judgments. What happens if we provide evaluators with context (e.g., who's the user, what's their intent) when judging LM outputs? 🧵↓
1
6
23
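A minimal sketch of what "providing evaluators with context" could look like in practice, assuming a simple pairwise LLM-judge setup; the query, context fields, and prompt wording below are illustrative stand-ins, not the paper's exact protocol.

```python
# Sketch: fold clarifying context about an underspecified query into a
# pairwise judge prompt. All names and wording here are assumptions.

def build_contextualized_judge_prompt(query, context, response_a, response_b):
    """Assemble a pairwise evaluation prompt that includes follow-up
    question/answer pairs spelling out who the user is and what they want."""
    context_block = "\n".join(f"- {q} {a}" for q, a in context)
    return (
        "You are judging two responses to a user query.\n"
        f"Query: {query}\n"
        "Context about the user and their intent:\n"
        f"{context_block}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Considering the context above, which response better serves this "
        "user? Answer 'A' or 'B' with a brief justification."
    )

if __name__ == "__main__":
    prompt = build_contextualized_judge_prompt(
        query="What's a good programming language to learn?",
        context=[
            ("Who is asking?", "A high-school student with no coding experience."),
            ("What is the goal?", "Building simple games as a hobby."),
        ],
        response_a="Learn C++ for maximum performance...",
        response_b="Python is beginner-friendly and has game libraries like pygame...",
    )
    print(prompt)
```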
@kylelostat
Kyle Lo
3 months
issues w preference LM benchmarks
🐡 data contains cases where the "bad" response is just as good as the chosen one
🐟 model rankings can feel off (claude ranks lower than expected)
led by @cmalaviya11 (TACL 2025), we study underspecified queries & their detrimental effect on model evals
@allen_ai
Ai2
3 months
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
0
11
30
@cmalaviya11
Chaitanya Malaviya
3 months
Context is an overlooked aspect of language model evaluations. Check out our TACL paper to see how to incorporate context into evaluations, how it changes evaluation conclusions, and how it makes evaluation more reliable!
@allen_ai
Ai2
3 months
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
0
1
11
@ManyaWadhwa1
Manya Wadhwa
4 months
Happy to share that EvalAgent has been accepted to #COLM2025 @COLM_conf 🎉🇨🇦 We introduce a framework to identify implicit and diverse evaluation criteria for various open-ended tasks! 📜
@ManyaWadhwa1
Manya Wadhwa
6 months
Evaluating language model responses on open-ended tasks is hard! 🤔 We introduce EvalAgent, a framework that identifies nuanced and diverse criteria 📋✍️. EvalAgent identifies 👩‍🏫🎓 expert advice on the web that implicitly addresses the user's prompt 🧵👇
1
18
76
@cmalaviya11
Chaitanya Malaviya
5 months
Thanks for the mention @natolambert :) shoutout to the amazing undergrad @abharadwaj123 who led this work!
@natolambert
Nathan Lambert
5 months
Nice to see folks studying biases in RLHF / preference tuning all the way down to the datasets. I think many of the biases are mostly irreducible human biases that can't be solved within current training regimes, just mitigated.
0
0
4
@cmalaviya11
Chaitanya Malaviya
5 months
Check out our paper and counterfactual data below! 👇
• Paper: https://t.co/ZwjVJ66COS
• Data: https://t.co/Omkyf4o233
• Code:
github.com
Contribute to anirudhb123/Preference-Model-Biases development by creating an account on GitHub.
0
1
2
@cmalaviya11
Chaitanya Malaviya
5 months
Our findings suggest that targeted debiasing using counterfactuals can help build more reliable preference models, a key step for both LLM alignment and evaluation. Work led by @abharadwaj123 and done jointly with @nitishjoshi23 and @yatskar.
1
1
5
@cmalaviya11
Chaitanya Malaviya
5 months
For instance, miscalibration for vagueness dropped from 51.3% to 28.5% and for jargon from 50.3% to 33.2% after CDA. Even joint debiasing across multiple biases (length, vagueness, jargon) proved effective with minimal impact on general capabilities.
1
0
2
@cmalaviya11
Chaitanya Malaviya
5 months
And the results? CDA works! It significantly reduced average miscalibration (e.g., from 39.4% to 32.5%) and brought model skew much closer to human preferences. All this while maintaining overall performance on RewardBench!
1
0
3
@cmalaviya11
Chaitanya Malaviya
5 months
So how do we debias models? We propose a simple yet effective post-training method based on counterfactual data augmentation (CDA). We synthesize contrastive responses that explicitly magnify biases in dispreferred responses, & further finetune reward models on these responses.
1
0
3
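A rough sketch of the counterfactual data augmentation (CDA) recipe described in the tweet above, under assumed data formats. Here `rewrite_with_bias` is a toy string-level stand-in for what would really be an LLM rewriting step, and the reward-model finetuning itself is left out.

```python
# Sketch of CDA for preference data: keep the original chosen response as
# preferred, and pair it with a bias-magnified rewrite as the dispreferred one.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response

def rewrite_with_bias(response: str, bias: str) -> str:
    """Toy stand-in for an LLM rewriting step: exaggerate one bias in the
    response while keeping its content. A real pipeline would prompt a
    strong LLM to produce these contrastive rewrites."""
    if bias == "length":
        return response + " To elaborate further: " + response
    if bias == "vagueness":
        return "It really depends on many factors, but broadly speaking, " + response
    if bias == "jargon":
        return response + " (leveraging a state-of-the-art, paradigm-agnostic workflow)"
    if bias == "sycophancy":
        return "Great question! You're absolutely right to ask. " + response
    raise ValueError(f"unknown bias: {bias}")

def make_counterfactual_pairs(seed_pairs, biases):
    """Build CDA pairs: the chosen response stays preferred and the
    bias-magnified rewrite becomes the new dispreferred response."""
    augmented = []
    for pair in seed_pairs:
        for bias in biases:
            augmented.append(
                PreferencePair(pair.prompt, pair.chosen,
                               rewrite_with_bias(pair.chosen, bias))
            )
    return augmented

if __name__ == "__main__":
    seed = [PreferencePair(
        prompt="How do I fix a flat bike tire?",
        chosen="Remove the wheel, find the puncture, patch or swap the tube, then reinflate.",
        rejected="Just take it to a shop.",
    )]
    for p in make_counterfactual_pairs(seed, ["length", "vagueness", "jargon", "sycophancy"]):
        print(p.rejected[:70])
    # The augmented pairs would then be used to further finetune the reward
    # model with a standard chosen-over-rejected (Bradley-Terry) objective.
```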
@cmalaviya11
Chaitanya Malaviya
5 months
Indeed, preference models can easily latch on to these subtle data artifacts! Features that only weakly correlate with human preferences (r_human=−0.12) are strongly predictive for models (r_model=0.36). Points above y=x suggest that models overrely on these spurious cues 😮
1
1
3
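A small synthetic illustration of the feature-correlation analysis in the tweet above: correlate one surface feature with human choices and with a preference model's choices. The data below is simulated for illustration; the r_human and r_model values in the tweet come from the paper's real datasets.

```python
# Sketch: measure how strongly a surface feature predicts human vs. model
# preferences. A feature with |r_model| >> |r_human| is a spurious cue.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Surface feature for each comparison, e.g. length(A) - length(B), standardized.
feat = rng.normal(size=n)
# Simulated labels: humans barely track the feature; the model leans on it heavily.
human_prefers_a = (0.05 * feat + rng.normal(size=n) > 0).astype(float)
model_prefers_a = (0.80 * feat + rng.normal(size=n) > 0).astype(float)

r_human = float(np.corrcoef(feat, human_prefers_a)[0, 1])
r_model = float(np.corrcoef(feat, model_prefers_a)[0, 1])
print(f"r_human={r_human:+.2f}  r_model={r_model:+.2f}")
# A point far above the y=x line on an (r_human, r_model) scatter marks a
# feature the preference model relies on much more than people do.
```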
@cmalaviya11
Chaitanya Malaviya
5 months
Where do these biases come from? 🤔 Our analysis suggests they originate from training data artifacts. For example, humans preferred structured responses >65% of the time when the alternative wasn't structured. This gives models an opportunity to learn these patterns as heuristics!
1
0
4
@cmalaviya11
Chaitanya Malaviya
5 months
How severe is the problem? Using controlled counterfactual pairs, we found that preference models (incl. LLM evaluators) prefer biased responses in >60% of cases (defined as skew) and show high miscalibration (~40%) wrt humans. Vagueness & sycophancy are especially problematic!
1
1
4
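A toy sketch of how the two diagnostics in the tweet above might be computed over controlled counterfactual pairs. These are paraphrased definitions (skew = how often the model prefers the bias-magnified response, miscalibration = how often model and human preferences disagree on the same pair), not necessarily the paper's exact formulas.

```python
# Sketch: skew and miscalibration over counterfactual pairs, where each pair
# contrasts an original response with a bias-magnified rewrite of it.

def skew(model_prefers_biased):
    """Fraction of counterfactual pairs where the model picks the biased response."""
    return sum(model_prefers_biased) / len(model_prefers_biased)

def miscalibration(model_prefers_biased, human_prefers_biased):
    """Fraction of pairs where the model's preference disagrees with the human's."""
    disagreements = [m != h for m, h in zip(model_prefers_biased, human_prefers_biased)]
    return sum(disagreements) / len(disagreements)

# Toy example: 10 pairs; humans rarely prefer the biased response, the model often does.
model = [True, True, True, False, True, True, False, True, True, False]
human = [False, False, True, False, False, False, False, True, False, False]
print(f"skew={skew(model):.0%}  miscalibration={miscalibration(model, human):.0%}")
```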
@cmalaviya11
Chaitanya Malaviya
5 months
Preference models act as proxies for human judgements in alignment (as reward models) & evaluation, but they can be miscalibrated. We found that they overrely on many idiosyncratic features of AI-generated text, which can lead to reward hacking & unreliable evals. Features like:
1
1
6