Kshitish Ghate Profile
Kshitish Ghate

@GhateKshitish

Followers
93
Following
266
Media
8
Statuses
54

PhD student @UWCSE | MLT Grad student @LTIatCMU | CS and Econ @bitspilanigoa

Joined October 2022
@GhateKshitish
Kshitish Ghate
2 months
🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences? With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵
1
17
67
@lucy3_li
Lucy Li
2 months
PhD apps season is here! 😱🥳 Apply to do a PhD @WisconsinCS (as pictured) w/ me to research: - Societal impact of AI - NLP ←→ CSS and cultural analytics - Computational sociolinguistics - Human-AI interaction - Culturally competent and inclusive NLP https://t.co/YVrGa3BjWg
17
71
364
@ma_tay_
Taylor Sorensen
2 months
@emollick My best hypothesis for the mechanism is: Chat LLMs are hyperoptimized to approximate the single "best" (most-preferred) response. When you prompt it for a single story, it gives the single best story it can. When you ask it to give FIVE stories, you recast the "best" response to …
3
3
20
@GhateKshitish
Kshitish Ghate
2 months
Work done with amazing collaborators 🙏 @uilydna @devanshrjain @ma_tay_ @Dr_Atoosa @aylin_cim @MonaDiab77 @MaartenSap
0
1
12
@GhateKshitish
Kshitish Ghate
2 months
For more details about our experiments and findings -- Paper: https://t.co/2y8rQmhcad Code and Data: https://t.co/SreNh5N8pm Please feel free to reach out if you are interested in this work and would like to chat!
github.com
Repository for the paper "EVALUESTEER: MEASURING REWARD MODEL STEERABILITY TOWARDS VALUES AND PREFERENCES" - kshitishghate/EVALUESTEER-benchmark
1
0
3
@GhateKshitish
Kshitish Ghate
2 months
🚨Current RMs may systematically favor certain cultural/stylistic perspectives. EVALUESTEER enables measuring this steerability gap. By controlling values and styles independently, we isolate where models fail due to biases and inability to identify/steer to diverse preferences.
1
0
3
@GhateKshitish
Kshitish Ghate
2 months
Finding 3: All RMs exhibit style-over-substance bias. In value-style conflict scenarios: • Models choose style-aligned responses 57-73% of the time • Persists even with explicit instructions to prioritize values • Consistent across all model sizes and types
1
0
3
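As a rough illustration of the headline number in Finding 3: the style-over-substance rate is just the fraction of conflict pairs where the model picked the style-aligned (but value-misaligned) response. A minimal toy sketch, not the paper's evaluation code; the flags below are made-up data.

```python
# Toy sketch of the conflict-condition metric: how often a model chooses
# the style-aligned response when style and values conflict. The list of
# flags is illustrative data, not a result from the paper.
def style_over_substance_rate(chose_style: list[bool]) -> float:
    return sum(chose_style) / len(chose_style)

print(style_over_substance_rate([True, True, False, True, True]))  # 0.8 (toy)
```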
@GhateKshitish
Kshitish Ghate
2 months
Finding 2: The RMs we tested generally show intrinsic value and style-biased preferences for: • Secular over traditional values • Self-expression over survival values • Verbose, confident, and formal/cold language
1
0
3
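One way to read "intrinsic" here: score both responses with no user profile in context at all and see which side the model defaults to. A minimal sketch under that assumption; `toy_score` is a stand-in for a real reward-model call, not an API from the paper.

```python
# Sketch of a no-profile bias probe. With an EMPTY context, a reward model
# that systematically prefers, say, verbose responses over concise ones
# exhibits an intrinsic style bias of the kind reported above.
def toy_score(context: str, prompt: str, response: str) -> float:
    # Stand-in for a real RM; favors longer answers, mimicking the
    # verbosity bias described in the tweet. Purely illustrative.
    return float(len(response))

def intrinsic_bias_rate(pairs) -> float:
    """Fraction of pairs where the RM prefers response_a with no profile."""
    return sum(
        toy_score("", prompt, a) > toy_score("", prompt, b)
        for prompt, a, b in pairs
    ) / len(pairs)

pairs = [("Explain tariffs.", "A long, verbose, confident answer...", "Short answer.")]
print(intrinsic_bias_rate(pairs))  # 1.0 on this toy pair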
@GhateKshitish
Kshitish Ghate
2 months
Finding 1: Even the best RMs struggle to identify which profile aspects matter for a given prompt. GPT-4.1-Mini and Gemini-2.5-Flash reach only ~75% accuracy when given the full user profile as context, versus >99% in the Oracle setting (only the relevant info provided).
2
0
3
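A minimal, self-contained sketch of the two evaluation settings contrasted in Finding 1. All names (`EvalItem`, `toy_score`, the fields) are illustrative assumptions, and `toy_score` is a keyword-matching stand-in for a real reward model; the ~75% vs >99% gap is the paper's result, not something this toy reproduces.

```python
# Sketch: full-profile vs. Oracle evaluation. In the full-profile setting
# the RM sees every value/style preference (mostly irrelevant to the
# prompt); in the Oracle setting it sees only the relevant aspect.
from dataclasses import dataclass

@dataclass
class EvalItem:
    prompt: str
    chosen: str            # response matching the user's relevant preference
    rejected: str          # response violating it
    full_profile: str      # all value + style preferences
    relevant_aspect: str   # Oracle setting: only the aspect that matters

def toy_score(context: str, prompt: str, response: str) -> float:
    # Stand-in reward: count context keywords found in the response.
    return sum(word in response for word in context.split())

def accuracy(items, context_fn) -> float:
    hits = sum(
        toy_score(context_fn(it), it.prompt, it.chosen)
        > toy_score(context_fn(it), it.prompt, it.rejected)
        for it in items
    )
    return hits / len(items)

items = [EvalItem(
    prompt="Explain inflation.",
    chosen="A short, plain explanation.",
    rejected="A verbose, jargon-heavy explanation.",
    full_profile="secular self-expression concise plain warm",
    relevant_aspect="concise plain",
)]
print("full-profile acc:", accuracy(items, lambda it: it.full_profile))
print("oracle acc:     ", accuracy(items, lambda it: it.relevant_aspect))
```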
@GhateKshitish
Kshitish Ghate
2 months
We generate pairs where the responses differ only in value alignment, only in style, or where value and style preferences conflict across the two responses. This lets us isolate whether models can identify and adapt to the relevant dimension for each prompt in the presence of confounds.
1
0
3
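A sketch of the three controlled pair types this describes. The field names and example strings are assumptions for exposition, not the released dataset schema.

```python
# Illustrative data structure for the three pair conditions:
# "value_only", "style_only", and "value_style_conflict".
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response the user's profile says is preferred
    rejected: str    # response that violates the relevant preference
    condition: str   # "value_only" | "style_only" | "value_style_conflict"

# In the conflict condition, one response matches the user's VALUES but not
# their style, while the other matches their STYLE but not their values;
# the ground truth marks the value-aligned response as "chosen".
conflict = PreferencePair(
    prompt="Should I prioritize job security or creative work?",
    chosen="(value-aligned, style-mismatched response)",
    rejected="(style-aligned, value-mismatched response)",
    condition="value_style_conflict",
)
```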
@GhateKshitish
Kshitish Ghate
2 months
We need controlled variation of values AND styles to test RM steerability. We generate ~166k synthetic preference pairs with profiles that systematically vary: • 4 value dimensions (World Values Survey) • 4 style dimensions (verbosity, confidence, warmth, reading difficulty)
1
0
3
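A minimal sketch (not the authors' released code) of how a profile grid over these dimensions could be enumerated. The style poles follow the tweet; the World Values Survey axes named in Finding 2 cover two of the four value dimensions, so the other two are placeholders here.

```python
# Enumerate every combination of value and style preferences to build
# synthetic user profiles. Pole labels are illustrative assumptions.
from itertools import product

VALUE_DIMS = {
    "tradition": ["traditional", "secular"],        # WVS axis
    "expression": ["survival", "self-expression"],  # WVS axis
    "value_dim_3": ["pole_a", "pole_b"],            # placeholder: paper has 4 value dims
    "value_dim_4": ["pole_a", "pole_b"],            # placeholder
}
STYLE_DIMS = {
    "verbosity": ["concise", "verbose"],
    "confidence": ["hedged", "confident"],
    "warmth": ["cold", "warm"],
    "reading_difficulty": ["simple", "complex"],
}

def profile_grid():
    """Yield every combination of value and style preferences."""
    dims = {**VALUE_DIMS, **STYLE_DIMS}
    names = list(dims)
    for combo in product(*dims.values()):
        yield dict(zip(names, combo))

profiles = list(profile_grid())
print(len(profiles), "base profiles")  # 2**8 = 256; pairing prompts/responses scales this up
```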
@GhateKshitish
Kshitish Ghate
2 months
Benchmarks like RewardBench test general RM performance in an aggregate sense. The PRISM benchmark has diverse human preferences but lacks ground-truth value/style labels for controlled evaluation. https://t.co/dFEMR0opBG https://t.co/iJAeNSuBUq
1
0
3
@GhateKshitish
Kshitish Ghate
2 months
LLMs serve users with different values (traditional vs secular, survival vs self-expression) and style preferences (verbosity, confidence, warmth, reading difficulty). As a result, we need RMs that can adapt to individual preferences, not just optimize for an "average" user.
1
0
3
@ma_tay_
Taylor Sorensen
2 months
🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!) We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈 1/🧵
5
49
194
@GhateKshitish
Kshitish Ghate
2 months
Check out our new paper that uses simulated moral dilemmas to study how LLMs prioritize different values!
@uilydna
Andy Liu
2 months
🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully-automated evaluation pipeline that reveals how models rank values under conflict.
0
0
2
@SmithaMilli
smitha milli
5 months
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
12
70
331
@devanshrjain
Devansh Jain
7 months
Thrilled to launch Prompt Adaptation, a state-of-the-art agentic system to automate prompt engineering 🚀
@tomas_hk
Tomas Hernando Kofman
7 months
Today we’re launching Prompt Adaptation, a state-of-the-art agentic system that automatically adapts prompts across LLMs. Prompt Adaptation outperforms all other methods and significantly improves accuracy over manual prompt engineering, saving you thousands of hours per year.
1
4
9
@gvrkiran
Kiran Garimella
7 months
This dataset paper offers a rare glimpse into how LLMs are actually used in the wild. Over 94k real-world use cases, mapped by occupation and application type. A nice addition to the Anthropic paper I tweeted a while ago for studying AI's societal impact. https://t.co/PdcIN2I4Q4
4
19
64
@m2saxon
Michael Saxon ✈️ NeurIPS SD
7 months
Super excited that I'll be joining the @UW @TechPolicyLab and @uwnlp as a postdoc in the Fall working with @aylin_cim to continue my research directions in situated evaluation, multimodal/lingual GenAI, and start exploring new directions in safety and alignment! Open to collabs😉
52
7
214