Chaitanya Malaviya

@cmalaviya11

Followers 320 · Following 118 · Media 32 · Statuses 64

PhD student at UPenn @upennnlp | benchmarking and evaluation | soon senior research scientist @GoogleDeepMind | prev @allen_ai @GoogleDeepMind and @LTIatCMU

Seattle, WA
Joined September 2023
Chaitanya Malaviya @cmalaviya11 · 28 days
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below šŸ§µā†“
Chaitanya Malaviya @cmalaviya11 · 25 days
Thanks for the mention @natolambert :) Shoutout to the amazing undergrad @abharadwaj123 who led this work!
Nathan Lambert @natolambert · 26 days
Nice to see folks studying biases in RLHF / preference tuning all the way down to the datasets. I think many of the biases are mostly irreducible human biases that can't be solved within current training regimes, just mitigated.
Chaitanya Malaviya @cmalaviya11 · 28 days
Check out our paper and counterfactual data below! šŸ‘‡
• Paper:
• Data:
• Code:
Chaitanya Malaviya @cmalaviya11 · 28 days
Our findings suggest that targeted debiasing using counterfactuals can help build more reliable preference models, a key step for both LLM alignment and evaluation. Work led by @abharadwaj123 and done jointly with @nitishjoshi23 and @yatskar.
Chaitanya Malaviya @cmalaviya11 · 28 days
For instance, miscalibration for vagueness dropped from 51.3% to 28.5% and for jargon from 50.3% to 33.2% after CDA. Even joint debiasing across multiple biases (length, vagueness, jargon) proved effective with minimal impact on general capabilities.
Chaitanya Malaviya @cmalaviya11 · 28 days
And the results? CDA works! It significantly reduced average miscalibration (e.g., from 39.4% to 32.5%) and brought model skew much closer to human preferences. All this while maintaining overall performance on RewardBench!
Chaitanya Malaviya @cmalaviya11 · 28 days
So how do we debias models? We propose a simple yet effective post-training method based on counterfactual data augmentation (CDA). We synthesize contrastive responses that explicitly magnify biases in dispreferred responses, & further finetune reward models on these responses.
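A minimal sketch of this CDA recipe in Python (the `magnify_bias` helper, the rewriting prompt, and the data format are illustrative assumptions, not the paper's actual implementation):

```python
# Sketch of counterfactual data augmentation (CDA) for reward-model debiasing.
# All function names and the rewriting prompt are illustrative assumptions.

def magnify_bias(response: str, bias: str, rewrite_fn) -> str:
    """Rewrite a response so that the target bias (e.g., vagueness) is exaggerated."""
    prompt = (
        f"Rewrite the following response to be noticeably more {bias}, "
        f"without changing its factual content:\n\n{response}"
    )
    return rewrite_fn(prompt)  # rewrite_fn: any instruction-following LLM call

def make_cda_pairs(dataset, bias, rewrite_fn):
    """Build contrastive pairs where the bias-magnified counterfactual is dispreferred."""
    pairs = []
    for ex in dataset:
        pairs.append({
            "prompt": ex["prompt"],
            "chosen": ex["chosen"],  # the original response stays preferred
            "rejected": magnify_bias(ex["chosen"], bias, rewrite_fn),  # magnified bias
        })
    return pairs

# The reward model is then further finetuned on these pairs with the standard
# Bradley-Terry objective: loss = -log(sigmoid(r(chosen) - r(rejected))).
```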
Chaitanya Malaviya @cmalaviya11 · 28 days
Indeed, preference models can easily latch on to these subtle data artifacts! Features that only weakly correlate with human preferences (r_human = āˆ’0.12) can be strongly predictive for models (r_model = 0.36). Points above the y=x line suggest that models over-rely on these spurious cues 😮
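A toy illustration of this correlation gap on synthetic data (the variables and numbers here are invented; only the qualitative pattern mirrors the finding):

```python
# Compare how strongly a surface feature predicts human vs. model preferences.
import numpy as np

rng = np.random.default_rng(0)
feature_delta = rng.normal(size=200)      # e.g., length(A) - length(B) per pair
human_prefers_a = rng.random(200) < 0.5   # humans: the feature barely matters
model_prefers_a = (feature_delta + rng.normal(scale=2.0, size=200)) > 0  # model leans on it

def pref_corr(delta, prefers_a):
    """Pearson r between the feature difference and preference for response A."""
    return np.corrcoef(delta, prefers_a.astype(float))[0, 1]

print(pref_corr(feature_delta, human_prefers_a))  # near 0 (cf. r_human = -0.12)
print(pref_corr(feature_delta, model_prefers_a))  # clearly positive (cf. r_model = 0.36)
```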
Chaitanya Malaviya @cmalaviya11 · 28 days
Where do these biases come from? šŸ¤” Our analysis suggests they originate from training data artifacts. For example, humans preferred structured responses >65% of the time when the alternative wasn't structured. This gives models an opening to learn these patterns as heuristics!
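A sketch of this artifact analysis, assuming a hypothetical `has_feature` detector (for structure, it might check for headers or bullet points):

```python
# Estimate how often humans prefer a response carrying a feature
# when the alternative lacks it. `has_feature` is a hypothetical detector.

def human_preference_rate(dataset, has_feature) -> float:
    hits = total = 0
    for ex in dataset:
        chosen_has = has_feature(ex["chosen"])
        rejected_has = has_feature(ex["rejected"])
        if chosen_has != rejected_has:  # keep only pairs where exactly one side has it
            total += 1
            hits += chosen_has          # preferred response carries the feature
    return hits / total

# A rate well above 0.5 (>65% for structure in the training data) is precisely
# the kind of correlation a preference model can pick up as a shortcut.
```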
Chaitanya Malaviya @cmalaviya11 · 28 days
How severe is the problem? Using controlled counterfactual pairs, we found that preference models (incl. LLM evaluators) prefer biased responses in >60% of cases (a rate we define as skew) and show high miscalibration (~40%) with respect to humans. Vagueness & sycophancy are especially problematic!
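A minimal sketch of these two diagnostics, assuming each counterfactual pair differs only in the biased feature (the example data is invented):

```python
# Each entry is (model_prefers_biased, human_prefers_biased) for one pair.

def skew(pairs) -> float:
    """Fraction of pairs where the model prefers the biased response."""
    return sum(model for model, _ in pairs) / len(pairs)

def miscalibration(pairs) -> float:
    """Fraction of pairs where the model's preference disagrees with the human one."""
    return sum(model != human for model, human in pairs) / len(pairs)

pairs = [(True, False), (True, True), (False, False), (True, False)]
print(skew(pairs))            # 0.75 -> model mostly favors the biased response
print(miscalibration(pairs))  # 0.50 -> half the pairs disagree with humans
```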
Chaitanya Malaviya @cmalaviya11 · 28 days
Preference models act as proxies for human judgements in alignment (as reward models) & evaluation, but they can be miscalibrated. We found that they over-rely on many idiosyncratic features of AI-generated text, which can lead to reward hacking & unreliable evals. Features like length, structure, vagueness, jargon, and sycophancy.
Chaitanya Malaviya @cmalaviya11 · 2 months
RT @ManyaWadhwa1: Evaluating language model responses on open-ended tasks is hard! šŸ¤” We introduce EvalAgent, a framework that identifies n…
Chaitanya Malaviya @cmalaviya11 · 8 months
RT @OriYoran: Super excited to be awarded the 2024 Google PhD Fellowship in Natural Language Processing! Huge thanks to my advisor @Jonat…
Chaitanya Malaviya @cmalaviya11 · 8 months
RT @kylelostat: come chat w me and @cmalaviya11 at #emnlp2024 about evaluating LMs, how findings can be impacted when dataset queries are v…
Chaitanya Malaviya @cmalaviya11 · 8 months
Joint work done @allen_ai with @josephcc, @DanRothNLP, @MohitIyyer, @yatskar, @kylelostat.
Find these & many more results in our paper:
Use our code to run your own contextualized evals:
Explore our data:
Chaitanya Malaviya @cmalaviya11 · 8 months
šŸ¤” How can we use context to learn more about model behavior? We can study "default" responses from models: under what type of context does their response get the highest score? We uncover a bias towards WEIRD contexts (Western, Educated, Industrialized, Rich & Democratic)!
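A sketch of this "default response" probe, with a hypothetical `score_fn` standing in for the evaluator:

```python
# Hold a model's context-free response fixed and score it under different
# synthesized contexts to see which user profiles it implicitly serves best.

def context_fit(default_response: str, contexts: dict, score_fn) -> dict:
    """Score one fixed response against each candidate context."""
    return {name: score_fn(default_response, ctx) for name, ctx in contexts.items()}

# If the highest scores cluster on Western, educated, industrialized, rich,
# democratic (WEIRD) user contexts, the default response is biased toward them.
```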
Chaitanya Malaviya @cmalaviya11 · 8 months
šŸ¤” Does providing context to evaluators have a substantial effect on evaluation conclusions? We find that (1) presence of context can improve agreement between evaluators and (2) even change model rankings! 🤯
Chaitanya Malaviya @cmalaviya11 · 8 months
We then conduct experiments providing context (1) during response generation, (2) during evaluation, or (3) both.
Chaitanya Malaviya @cmalaviya11 · 8 months
With ✨Contextualized Evaluations✨, we synthetically generate context as clarifying follow-up questions to an underspecified query.
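A minimal sketch of this context-synthesis step, where `complete` stands in for any chat-completion call and the prompt wording is an assumption:

```python
# Given an underspecified query, ask an LLM for clarifying follow-up questions
# plus plausible answers; that context is then shown at generation and/or
# evaluation time.

def generate_context(query: str, complete, n_questions: int = 3) -> str:
    prompt = (
        f"The query below is underspecified. Write {n_questions} clarifying "
        f"follow-up questions a user might be asked, each with one plausible "
        f"answer.\n\nQuery: {query}"
    )
    return complete(prompt)

# e.g., for "Is coffee good for you?" this might yield Q-A pairs about the
# user's health conditions, caffeine sensitivity, and what "good" means to them.
```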
Chaitanya Malaviya @cmalaviya11 · 8 months
Underspecified queries can lead to arbitrary evaluation judgments of response quality! E.g., given the query ā€œIs coffee good for you?ā€, how can evaluators accurately judge model responses when they aren't informed about the user's preferences, background, or important criteria?