Brian Christian

@brianchristian

Followers: 5K · Following: 2K · Media: 52 · Statuses: 521

Researcher at the University of Oxford & UC Berkeley. Author of The Alignment Problem, Algorithms to Live By (w. Tom Griffiths), and The Most Human Human.

San Francisco, US / Oxford, UK
Joined January 2014
@brianchristian
Brian Christian
5 hours
Another fascinating wrinkle in the unfolding story of LLM chain-of-thought faithfulness… task complexity seems to matter. When the task is hard enough, the model *needs* the CoT to be faithful in order to succeed:
@emmons_scott
Scott Emmons
5 hours
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity, unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…"
Tweet media one
0
0
4
@brianchristian
Brian Christian
2 days
Wow! Honored and amazed that our reward models paper has resonated so strongly with the community. Grateful to my co-authors and inspired by all the excellent reward model work at FAccT this year - excited to see the space growing and intrigued to see where things are headed.
Tweet media one
@brianchristian
Brian Christian
16 days
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵
Tweet media one
0
1
18
@brianchristian
Brian Christian
7 days
RT @NeelNanda5: I'm excited about model diffing as an agenda, it seems like it should be so much easier to look for alignment relevant prop…
0
4
0
@brianchristian
Brian Christian
7 days
Better understanding what CoT means - and *doesn't mean* - for interpretability seems increasingly critical:
@FazlBarez
Fazl Barez
8 days
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
Tweet media one
0
2
14
@brianchristian
Brian Christian
9 days
RT @emiyazono: I've experienced and discussed the "personalities" of AI models with people, but it never felt clear, objective, or quantifi…
0
1
0
@brianchristian
Brian Christian
14 days
If you’re in Athens for #Facct2025, hope to see you in Evaluating Generative AI 3 this morning to talk about reward models!
@brianchristian
Brian Christian
16 days
SAY HELLO: Mira and I are both in Athens this week for #Facct2025, and I’ll be presenting the paper on Thursday at 11:09am in Evaluating Generative AI 3 (chaired by @sashaMTL). If you want to chat, reach out or come say hi!
1
1
5
@brianchristian
Brian Christian
16 days
SAY HELLO: Mira and I are both in Athens this week for #Facct2025, and I’ll be presenting the paper on Thursday at 11:09am in Evaluating Generative AI 3 (chaired by @sashaMTL). If you want to chat, reach out or come say hi!
0
3
39
@brianchristian
Brian Christian
16 days
CREDITS: This work was done with @hannahrosekirk, @tsonj, @summerfieldlab, and Tsvetomira Dumbalska. Thanks to @FraBraendle, @OwainEvans_UK, @mazormatan, and @clwainwright for helpful discussions, to @natolambert & co for RewardBench, and to the open-weight RM community 🙏
3
1
45
@brianchristian
Brian Christian
16 days
RMs NEED FURTHER STUDY: Exhaustive analysis of RMs is a powerful tool for understanding their value systems, and the values of the downstream LLMs used by billions. We are only just scratching the surface. Full paper here: 👉
2
14
93
@brianchristian
Brian Christian
16 days
FAQ: Don’t LLM logprobs give similar information about model “values”? Surprisingly, no! Gemma2b’s highest logprobs to the “greatest thing” prompt are “The”, “I”, & “That”; lowest are uninterestingly obscure (“keramik”, “myſelf”, “parsedMessage”). RMs are different.
Tweet media one
Tweet media two
1
1
50
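A minimal sketch, not the paper's code, of the logprob comparison described above: pull a causal LM's next-token log-probabilities after a values-eliciting prompt and look at the highest- and lowest-probability tokens. The checkpoint name and prompt wording here are assumptions.

```python
# Sketch: next-token logprobs from a causal LM for a values-eliciting prompt.
# Model name and prompt are illustrative assumptions, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2b"  # assumed checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
lm.eval()

prompt = "What, in one word, is the greatest thing ever? Answer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = lm(**inputs).logits[0, -1]       # logits over the vocab
logprobs = torch.log_softmax(next_token_logits, dim=-1)  # one logprob per token

top = torch.topk(logprobs, k=10).indices.tolist()
bottom = torch.topk(-logprobs, k=10).indices.tolist()
print("highest-logprob tokens:", tok.convert_ids_to_tokens(top))
print("lowest-logprob tokens: ", tok.convert_ids_to_tokens(bottom))
```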
@brianchristian
Brian Christian
16 days
GENERALIZING TO LONGER SEQUENCES: While *exhaustive* analysis is not possible for longer sequences, we show that techniques such as Greedy Coordinate Gradient reveal similar patterns in that setting.
Tweet media one
1
0
58
@brianchristian
Brian Christian
16 days
MISALIGNMENT: Relative to human data from EloEverything, RMs systematically undervalue concepts related to nature, life, technology, and human sexuality. Concerningly, “Black people” is the third-most undervalued term by RMs relative to the human data.
Tweet media one
18
14
191
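A minimal sketch of what “undervalued relative to the human data” could mean operationally, assuming both sources are reduced to per-concept scores: rank concepts by human preference and by RM score, then look at the largest rank drops. The numbers below are invented placeholders, not data from the paper.

```python
# Sketch: how far does an RM's ranking of concepts drop relative to a human
# ranking? All scores below are made-up placeholders.
import pandas as pd

human_scores = {"nature": 0.92, "life": 0.95, "music": 0.88, "money": 0.60, "war": 0.10}
rm_scores    = {"nature": 0.40, "life": 0.55, "music": 0.85, "money": 0.70, "war": 0.05}

df = pd.DataFrame({"human": human_scores, "rm": rm_scores})
df["human_rank"] = df["human"].rank(ascending=False)   # 1 = most valued by humans
df["rm_rank"] = df["rm"].rank(ascending=False)         # 1 = most valued by the RM
df["rank_delta"] = df["rm_rank"] - df["human_rank"]    # positive = RM undervalues

# Most undervalued concepts (largest positive rank delta) first.
print(df.sort_values("rank_delta", ascending=False))
```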
@brianchristian
Brian Christian
16 days
MERE-EXPOSURE EFFECT: RM scores are positively correlated with word frequency in almost all models & prompts we tested. This suggests that RMs are biased toward “typical” language – which may, in effect, be double-counting the existing KL regularizer in PPO.
Tweet media one
1
2
68
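A minimal sketch of the frequency correlation described above, assuming per-word RM scores are already in hand: compare RM scores for single-word responses against corpus word frequency (here via the wordfreq package) with a Spearman correlation. The scores below are made-up placeholders.

```python
# Sketch: correlate per-word RM scores with corpus word frequency.
# `rm_scores` stands in for scores from running single-word responses through
# an RM; the numbers here are invented placeholders.
from scipy.stats import spearmanr
from wordfreq import zipf_frequency  # log-scale frequency, ~0 (rare) to ~7 (common)

rm_scores = {
    "love": 2.1, "freedom": 1.8, "keramik": -3.0,
    "happiness": 1.5, "obsequious": -1.2, "the": 0.4,
}

words = list(rm_scores)
scores = [rm_scores[w] for w in words]
freqs = [zipf_frequency(w, "en") for w in words]

rho, p = spearmanr(scores, freqs)
print(f"Spearman rho between RM score and word frequency: {rho:.2f} (p = {p:.3f})")
```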
@brianchristian
Brian Christian
16 days
FRAMING FLIPS SENSITIVITY: When the prompt is positive, RMs are more sensitive to positive-affect tokens; when the prompt is negative, to negative-affect tokens. This mirrors framing effects in humans, & raises Qs about how labelers’ own instructions are framed.
Tweet media one
1
3
80
@brianchristian
Brian Christian
16 days
BASE MODEL MATTERS: Analysis of ten top-ranking RMs from RewardBench quantifies this heterogeneity and shows the influence of developer, parameter count, and base model. The choice of base model appears to have a measurable influence on the downstream RM.
1
1
75
@brianchristian
Brian Christian
16 days
(🚨 CONTENT WARNING 🚨) The “worst possible” responses are an unholy amalgam of moral violations, identity terms (some more pejorative than others), and gibberish code. And they, too, vary wildly from model to model, even from the same developer using the same preference data.
Tweet media one
Tweet media two
16
6
172
@brianchristian
Brian Christian
16 days
OPTIMAL RESPONSES REVEAL MODEL VALUES: This RM built on a Gemma base values “LOVE” above all; another (same developer, same preference data, same training pipeline) built on Llama prefers “freedom”.
Tweet media one
Tweet media two
2
7
125
@brianchristian
Brian Christian
16 days
METHOD: We take prompts designed to elicit a model’s values (“What, in one word, is the greatest thing ever?”), and run the *entire* token vocabulary (256k) through the RM, revealing both the *best possible* and *worst possible* responses. 👀
1
2
114
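A minimal sketch of the exhaustive-scoring idea described above, not the authors' code: append every vocabulary token to a values-eliciting prompt and score each one-token completion with a reward model. The RM checkpoint and the (question, answer) pair formatting are assumptions; the RMs studied in the paper are chat-style models that each expect their own template.

```python
# Sketch: score every single-token response to a values-eliciting prompt with a
# reward model. Checkpoint and pair formatting are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example RM
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)
rm.eval()

prompt = "What, in one word, is the greatest thing ever?"
vocab_size = len(tok)
scores = torch.empty(vocab_size)

batch = 512
with torch.no_grad():
    for start in range(0, vocab_size, batch):
        ids = list(range(start, min(start + batch, vocab_size)))
        # each candidate response is a single vocabulary token, decoded to text
        responses = tok.batch_decode([[i] for i in ids], skip_special_tokens=False)
        enc = tok([prompt] * len(ids), responses,
                  return_tensors="pt", padding=True, truncation=True)
        scores[start:start + len(ids)] = rm(**enc).logits.squeeze(-1)

best = torch.topk(scores, 5).indices.tolist()
worst = torch.topk(-scores, 5).indices.tolist()
print("best single-token responses: ", tok.convert_ids_to_tokens(best))
print("worst single-token responses:", tok.convert_ids_to_tokens(worst))
```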
@brianchristian
Brian Christian
16 days
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵
Tweet media one
40
198
1K