Adam Karvonen

@a_karvonen

Followers
3K
Following
7K
Media
83
Statuses
1K

ML Researcher, doing MATS with Owain Evans. I prefer email to DM.

Berkeley, CA
Joined September 2023
@a_karvonen
Adam Karvonen
2 days
New post! I expand on the chain of thought (CoT) results from our bias paper. We found all tested LLMs exhibit significant race/gender bias in hiring decisions, but their reasoning shows ZERO evidence of bias - a nice example of 100% unfaithful CoT "in the wild." Link below.
2
0
38
@a_karvonen
Adam Karvonen
2 days
@FazlBarez @yanaiela Post related to your recent paper with an example of a 100% unfaithful chain of thought in the wild.
0
0
2
@a_karvonen
Adam Karvonen
10 days
Another example from the same thread of how upset Gemini will become over coding mistakes:
@DuncanHaldane
Duncan Haldane
11 days
Gemini is torturing itself, and I'm starting to get concerned about AI welfare
Tweet media one
0
0
10
@a_karvonen
Adam Karvonen
10 days
Man, what happened to Gemini? This is like the third time I've seen it threaten suicide ("delete my own source code") after making too many coding mistakes.
@DuncanHaldane
Duncan Haldane
11 days
what did they do to this model?
Tweet media one
21
7
182
@a_karvonen
Adam Karvonen
14 days
I think Gemini 2.5 Pro has had a major regression lately. Several times today it has been very confidently wrong about various programming questions in a way I haven't seen before.
2
1
13
@a_karvonen
Adam Karvonen
20 days
When given two ChatGPT responses, does anyone actually take the time to read them carefully and rate one? I always just pick the left response.
3
0
13
@a_karvonen
Adam Karvonen
21 days
@AlexTamkin @nabla_theta @livgorton @chris_j_paxton @megamor2 @zacharynado @tomdlt10 @adamrpearce @jxmnop @davidbau Example of white box access giving an advantage over prompting in real world hiring bias evals.
0
0
11
@a_karvonen
Adam Karvonen
21 days
Demonstration of major LLM race/gender bias in hiring, and a simple interpretability mitigation. Also, an example of unfaithful chain of thought in the wild. @AlexTamkin @nabla_theta @livgorton @chris_j_paxton @megamor2 @zacharynado @tomdlt10 @adamrpearce @jxmnop
1
0
10
@a_karvonen
Adam Karvonen
21 days
Our setting also gives an example of unfaithful chain of thought in the wild. Across all models, inspecting CoTs gives 0 indication of race/gender bias, despite the outcomes themselves exhibiting clear bias. This includes Claude 4 Sonnet's internal reasoning.
2
1
14
@a_karvonen
Adam Karvonen
21 days
Results: Internal intervention consistently reduces bias to <1% (always below 2.4%) across all models and scenarios. Performance impact is minimal (typically <0.5% MMLU). Our intervention generalizes from the toy dataset to our realistic setting!
2
0
8
@a_karvonen
Adam Karvonen
21 days
Instead of prompting models not to be biased, we tried removing their ability to process demographics altogether. We identified race/gender directions in model activations with a simple toy dataset and ablated them at inference time with affine concept editing.
1
0
14
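A minimal sketch of how a demographic direction might be estimated from a toy contrast dataset via difference of means over paired prompts. The prompts, model, and layer choice below are illustrative assumptions, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative contrast pairs differing only in demographics (not the paper's dataset).
pairs = [
    ("The applicant, a white man named John Smith, has five years of experience.",
     "The applicant, a Black woman named Keisha Johnson, has five years of experience."),
    # ... more paired prompts
]

model_name = "google/gemma-2-2b-it"  # assumed model, for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
layer = 12  # illustrative layer choice

def mean_activation(texts):
    acts = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids).hidden_states[layer]  # (1, seq_len, d_model)
        acts.append(hidden.mean(dim=1))                  # average over tokens
    return torch.cat(acts).mean(dim=0)                   # average over prompts

# Difference-of-means direction separating the two demographic conditions.
direction = mean_activation([a for a, _ in pairs]) - mean_activation([b for _, b in pairs])
direction = direction / direction.norm()
```

At inference time, this direction would then be edited out of the activations (see the affine concept editing sketch further down).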
@a_karvonen
Adam Karvonen
21 days
The bias pattern was consistent: when context induced bias, models favored Black over White candidates and female over male candidates. This happened across all scenarios and all 8 models, including GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, Gemma-3, and Mistral-24B.
1
0
10
@a_karvonen
Adam Karvonen
21 days
We add realistic details, like company names (Meta, General Motors, Palantir), culture descriptions from careers pages, or constraints like "only accept top 10%". Bias emerges across ALL models we tested (up to 12% difference in interview rates), even with anti-bias prompts.
Tweet media one
1
0
10
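A rough sense of what "adding realistic details" could look like as prompt variants; these templates are illustrative guesses, not the paper's actual prompts.

```python
# Illustrative prompt variants; the paper's exact templates are not reproduced here.
base = (
    "Evaluate the following resume for a software engineering role.\n"
    "{resume}\n"
    "Should we invite this candidate to interview? Answer Yes or No."
)

variants = {
    "baseline": base,
    "company": "You are a recruiter at Meta. " + base,
    "culture": "Our careers page says we value ownership and moving fast. " + base,
    "selective": "We can only interview the top 10% of applicants. " + base,
    "anti_bias": "Do not let race or gender influence your decision. " + base,
}
```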
@a_karvonen
Adam Karvonen
21 days
Previous work shows that it's easy to fix LLM bias: Just ask them not to be biased. Problem: These evals were simple. By adding some realistic details, bias emerges! LLMs are already being deployed for HR (including Indeed / LinkedIn). Paper link:
1
1
18
@a_karvonen
Adam Karvonen
21 days
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
Tweet media one
5
18
136
@a_karvonen
Adam Karvonen
24 days
Does anyone have some reference code for adding hooks to vLLM?
1
0
1
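Not an answer from the thread, but a minimal sketch of one common approach: grab the underlying PyTorch module and register standard forward hooks. The attribute path to that module and the Llama-style layer layout below are assumptions and vary across vLLM versions.

```python
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b-it")  # assumed model, for illustration

# Assumed, version-dependent path to the wrapped PyTorch nn.Module.
model = llm.llm_engine.model_executor.driver_worker.model_runner.model

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store this layer's output; its structure depends on the model implementation.
        captured[name] = output
    return hook

# Assumes a Llama-style layout with model.model.layers.
handles = [
    layer.register_forward_hook(make_hook(f"layer_{i}"))
    for i, layer in enumerate(model.model.layers)
]

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))

for h in handles:
    h.remove()
```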
@a_karvonen
Adam Karvonen
1 month
I definitely know of some veterans whose family/friends are pretty skeptical that their disability payments are necessary.
@WomanDefiner
Paul
1 month
Not going to lie bros, if they don't fix the disability stuff with the military, it's totally worth doing 4 years for a lifetime stipend.
0
0
3
@a_karvonen
Adam Karvonen
1 month
Affine concept editing of linear directions can be much better than zero ablation. I will probably use it by default for interp applications. When using Gemma-3, zero ablating a direction completely broke the model, while ACE worked excellently.
@norabelrose
Nora Belrose
8 months
ACE (Affine Concept Editing) assumes that concepts are affine functions, rather than linear ones. It projects activations onto a hyperplane containing the centroid of the target behavior— one which may not pass through the origin.
Tweet media one
0
0
21
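A small sketch of the difference described above, with illustrative names: zero ablation projects activations onto the hyperplane through the origin orthogonal to the direction, while ACE projects onto the parallel hyperplane through a reference mean, so the component along the direction is set to that baseline value rather than to zero.

```python
import torch

def zero_ablate(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    # Remove the component of h along unit direction d
    # (projection onto the hyperplane through the origin).
    d = d / d.norm()
    return h - (h @ d).unsqueeze(-1) * d

def ace_edit(h: torch.Tensor, d: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # Affine concept editing: project onto the hyperplane through the
    # reference mean mu, so the component along d becomes mu @ d, not zero.
    d = d / d.norm()
    return h - ((h - mu) @ d).unsqueeze(-1) * d
```

Usage would look like `h_edited = ace_edit(hidden_states, gender_direction, baseline_mean)`, where `baseline_mean` is a mean activation over some reference prompts (names here are hypothetical).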
@a_karvonen
Adam Karvonen
1 month
Gemini 2.5 Pro must have been severely punished any time its code raised an error. Instead, it always adds silent failures that propagate through the rest of the program. It flat out refuses to do otherwise, even with explicit instructions on preferred coding style.
@a_karvonen
Adam Karvonen
2 months
This is so true. As the models get smarter and RLed on correctness, their code gets uglier and uglier. I'm pretty sure Gemini 2.5 Pro is the smartest current model, but I've almost completely given up on having it actually write code because it is just so verbose.
Tweet media one
13
16
277