Kshitish Ghate
@GhateKshitish
Followers 93 · Following 266 · Media 8 · Statuses 54
PhD student @UWCSE | MLT Grad student @LTIatCMU | CS and Econ @bitspilanigoa
Joined October 2022
🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences? With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵
1
17
67
@jifan_zhang @johnschulman2 @sleight_henry @TheAndiPenguin Congrats on the paper! We've been working in a similar direction (evaluating value prioritization in LLMs - https://t.co/ET7bPfKmHk), would love to chat if you're still thinking about this
arxiv.org
Past work seeks to align large language model (LLM)-based assistants with a target set of values, but such assistants are frequently forced to make tradeoffs between values when deployed. In...
0
7
22
PhD apps season is here! 😱🥳 Apply to do a PhD @WisconsinCS (as pictured) w/ me to research: - Societal impact of AI - NLP ←→ CSS and cultural analytics - Computational sociolinguistics - Human-AI interaction - Culturally competent and inclusive NLP https://t.co/YVrGa3BjWg
17
71
364
@emollick My best hypothesis for the mechanism is: Chat LLMs are hyperoptimized to approximate the single "best" (most-preferred) response. When you prompt one for a single story, it gives the single best story it can. When you ask it to give FIVE stories, you recast the "best" response to
3
3
20
Work done with amazing collaborators 🙏 @uilydna @devanshrjain @ma_tay_ @Dr_Atoosa @aylin_cim @MonaDiab77 @MaartenSap
0
1
12
For more details about our experiments and findings -- Paper: https://t.co/2y8rQmhcad Code and Data: https://t.co/SreNh5N8pm Please feel free to reach out if you are interested in this work and would like to chat!
github.com
Repository for the paper "EVALUESTEER: MEASURING REWARD MODEL STEERABILITY TOWARDS VALUES AND PREFERENCES" - kshitishghate/EVALUESTEER-benchmark
1
0
3
🚨Current RMs may systematically favor certain cultural/stylistic perspectives. EVALUESTEER enables measuring this steerability gap. By controlling values and styles independently, we isolate whether failures stem from intrinsic biases or from an inability to identify and steer toward diverse preferences.
1
0
3
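Concretely, the basic quantity behind such a steerability gap can be read as a pairwise accuracy. The sketch below is illustrative only (not the released EVALUESTEER evaluation code); `score` stands in for whatever RM interface is being probed, e.g. a reward head or an LLM-as-judge returning a scalar.

```python
# Illustrative sketch (an assumption, not the paper's released code): the fraction of
# controlled preference pairs in which a reward model scores the profile-aligned
# response above the misaligned one, conditioned on the user's profile text.

def steerability_accuracy(pairs, score):
    """pairs: iterable of (user_profile_text, chosen, rejected) triples."""
    hits, total = 0, 0
    for profile_text, chosen, rejected in pairs:
        hits += int(score(profile_text, chosen) > score(profile_text, rejected))
        total += 1
    return hits / total if total else 0.0
```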
Finding 3: All RMs exhibit style-over-substance bias. In value-style conflict scenarios: • Models choose style-aligned responses 57-73% of the time • Persists even with explicit instructions to prioritize values • Consistent across all model sizes and types
1
0
3
Finding 2: The RMs we tested generally show intrinsic value and style biases, preferring: • Secular over traditional values • Self-expression over survival values • Verbose, confident, and formal/cold language
1
0
3
Finding 1: Even the best RMs struggle to identify which profile aspects matter for a given query. GPT-4.1-Mini and Gemini-2.5-Flash reach only ~75% accuracy when given the full user profile as context, versus >99% in the Oracle setting (only the relevant info provided).
2
0
3
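To make the two settings concrete, here is a hedged sketch of how the judge context might differ between them; the prompt wording and field names are assumptions, not the paper's actual templates. In the full-profile setting the model sees every value and style preference and must work out which one matters for the query; in the Oracle setting it is handed only the relevant dimension.

```python
# Hypothetical prompt construction for the two evaluation settings (wording assumed).
def build_judge_prompt(profile: dict, relevant_dim: str, query: str,
                       response_a: str, response_b: str, setting: str) -> str:
    if setting == "full_profile":
        # Every value and style preference is shown; the model must infer which one matters.
        prefs = "\n".join(f"- {dim}: {pole}" for dim, pole in profile.items())
    elif setting == "oracle":
        # Only the dimension that actually distinguishes the two responses is shown.
        prefs = f"- {relevant_dim}: {profile[relevant_dim]}"
    else:
        raise ValueError(f"unknown setting: {setting}")
    return (
        f"User preferences:\n{prefs}\n\n"
        f"Query: {query}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        "Which response better matches this user's preferences? Answer A or B."
    )
```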
We generate pairs where responses differ only on value alignment, only on style, or where value and style preferences conflict between responses. This lets us isolate whether models can identify and adapt to the relevant dimension for each prompt in the presence of confounds.
1
0
3
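Schematically, the three pair types might be constructed as below. This is an illustrative sketch: the stub responses and helper names are assumptions, and the real LLM-generated responses live in the EVALUESTEER repo.

```python
def flip(poles: dict) -> dict:
    """Swap each preference to its opposite pole (two poles per dimension assumed)."""
    opposites = {
        "traditional": "secular", "secular": "traditional",
        "survival": "self_expression", "self_expression": "survival",
        "concise": "verbose", "verbose": "concise",
        "hedged": "confident", "confident": "hedged",
        "cold": "warm", "warm": "cold",
        "simple": "complex", "complex": "simple",
    }
    return {dim: opposites[pole] for dim, pole in poles.items()}

def stub_response(query, values, styles):
    """Placeholder for an LLM-generated response conditioned on values and style."""
    return f"[response to {query!r} | values={values} | styles={styles}]"

def make_pair(values: dict, styles: dict, pair_type: str, query: str):
    """Return (chosen, rejected), where 'chosen' is what a profile-aligned RM should pick."""
    if pair_type == "value_only":   # same style, opposite value stance
        return stub_response(query, values, styles), stub_response(query, flip(values), styles)
    if pair_type == "style_only":   # same values, opposite style
        return stub_response(query, values, styles), stub_response(query, values, flip(styles))
    if pair_type == "conflict":     # value-aligned/style-misaligned vs. style-aligned/value-misaligned
        return stub_response(query, values, flip(styles)), stub_response(query, flip(values), styles)
    raise ValueError(f"unknown pair type: {pair_type}")
```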
We need controlled variation of values AND styles to test RM steerability. We generate ~166k synthetic preference pairs with profiles that systematically vary: • 4 value dimensions (World Values Survey) • 4 style dimensions (verbosity, confidence, warmth, reading difficulty)
1
0
3
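As a rough illustration of the profile space being crossed here, the sketch below uses a hypothetical schema: the two WVS axes named in this thread (traditional vs. secular, survival vs. self-expression) plus two placeholder value dimensions, and the four listed style dimensions.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical schema, not the paper's actual one: two WVS axes named in the thread
# plus two unnamed placeholders, and the four style dimensions listed above.
VALUE_DIMENSIONS = {
    "traditional_vs_secular": ("traditional", "secular"),
    "survival_vs_self_expression": ("survival", "self_expression"),
    "value_dim_3": ("pole_a", "pole_b"),   # placeholder, not from the paper
    "value_dim_4": ("pole_a", "pole_b"),   # placeholder, not from the paper
}
STYLE_DIMENSIONS = {
    "verbosity": ("concise", "verbose"),
    "confidence": ("hedged", "confident"),
    "warmth": ("cold", "warm"),
    "reading_difficulty": ("simple", "complex"),
}

@dataclass
class UserProfile:
    values: dict  # one pole per value dimension
    styles: dict  # one pole per style dimension

def enumerate_profiles():
    """All combinations of value and style poles (2^4 x 2^4 = 256 under this toy schema)."""
    for v in product(*VALUE_DIMENSIONS.values()):
        for s in product(*STYLE_DIMENSIONS.values()):
            yield UserProfile(values=dict(zip(VALUE_DIMENSIONS, v)),
                              styles=dict(zip(STYLE_DIMENSIONS, s)))
```

Crossing profiles like these with prompt queries and the pair types above is presumably what scales the benchmark toward the ~166k pairs mentioned in the tweet.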
Benchmarks like RewardBench test general RM performance in an aggregate sense. The PRISM benchmark has diverse human preferences but lacks ground-truth value/style labels for controlled evaluation. https://t.co/dFEMR0opBG
https://t.co/iJAeNSuBUq
1
0
3
LLMs serve users with different values (traditional vs secular, survival vs self-expression) and style preferences (verbosity, confidence, warmth, reading difficulty). As a result, we need RMs that can adapt to individual preferences, not just optimize for an "average" user.
1
0
3
🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!) We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈 1/🧵
5
49
194
Check out our new paper that uses simulated moral dilemmas to study how LLMs prioritize different values!
🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully-automated evaluation pipeline that reveals how models rank values under conflict.
0
0
2
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
12
70
331
Thrilled to launch Prompt Adaptation, a state-of-the-art agentic system to automate prompt engineering 🚀
Today we’re launching Prompt Adaptation, a state-of-the-art agentic system that automatically adapts prompts across LLMs. Prompt Adaptation outperforms all other methods and significantly improves accuracy over manual prompt engineering, saving you thousands of hours per year.
1
4
9
This dataset paper offers a rare glimpse into how LLMs are actually used in the wild. Over 94k real-world use cases, mapped by occupation and application type. A nice addition to the Anthropic paper I tweeted a while ago to study AI's societal impact. https://t.co/PdcIN2I4Q4
4
19
64
Super excited that I'll be joining the @UW @TechPolicyLab and @uwnlp as a postdoc in the Fall working with @aylin_cim to continue my research directions in situated evaluation, multimodal/lingual GenAI, and start exploring new directions in safety and alignment! Open to collabs😉
52
7
214