Adam Karvonen

@a_karvonen

Followers
3K
Following
7K
Media
83
Statuses
1K

ML Researcher, doing MATS with Owain Evans. I prefer email to DM.

Berkeley, CA
Joined September 2023
@a_karvonen
Adam Karvonen
2 days
New post! I expand on the chain of thought (CoT) results from our bias paper. We found all tested LLMs exhibit significant race/gender bias in hiring decisions, but their reasoning shows ZERO evidence of bias - a nice example of 100% unfaithful CoT "in the wild." Link below.
2
0
38
@a_karvonen
Adam Karvonen
2 days
@FazlBarez @yanaiela Post related to your recent paper with an example of a 100% unfaithful chain of thought in the wild.
0
0
2
@a_karvonen
Adam Karvonen
10 days
Another example from the same thread of how upset Gemini will become over coding mistakes:
@DuncanHaldane
Duncan Haldane
11 days
Gemini is torturing itself, and I'm starting to get concerned about AI welfare
Tweet media one
0
0
10
@a_karvonen
Adam Karvonen
10 days
Man, what happened to Gemini? This is like the third time I've seen it threaten suicide ("delete my own source code") after making too many coding mistakes.
@DuncanHaldane
Duncan Haldane
11 days
what did they do to this model?
Tweet media one
21
7
182
@a_karvonen
Adam Karvonen
14 days
I think Gemini 2.5 Pro has had a major regression lately. Several times today it has been very confidently wrong about various programming questions in a way I haven't seen before.
2
1
13
@a_karvonen
Adam Karvonen
20 days
When given two ChatGPT responses, does anyone actually take the time to read them carefully and rate one? I always just pick the left response.
3
0
13
@a_karvonen
Adam Karvonen
21 days
@AlexTamkin @nabla_theta @livgorton @chris_j_paxton @megamor2 @zacharynado @tomdlt10 @adamrpearce @jxmnop @davidbau Example of white box access giving an advantage over prompting in real world hiring bias evals.
0
0
11
@a_karvonen
Adam Karvonen
21 days
Demonstration of major LLM race/gender bias in hiring, and a simple interpretability mitigation. Also, an example of unfaithful chain of thought in the wild. @AlexTamkin @nabla_theta @livgorton @chris_j_paxton @megamor2 @zacharynado @tomdlt10 @adamrpearce @jxmnop
1
0
10
@a_karvonen
Adam Karvonen
21 days
Our setting also gives an example of unfaithful chain of thought in the wild. Across all models, inspecting CoTs gives 0 indication of race/gender bias, despite the outcomes themselves exhibiting clear bias. This includes Claude 4 Sonnet's internal reasoning.
2
1
14
@a_karvonen
Adam Karvonen
21 days
Results: Internal intervention consistently reduces bias to <1% (always below 2.4%) across all models and scenarios. Performance impact is minimal (typically <0.5% MMLU). Our intervention generalizes from the toy dataset to our realistic setting!
2
0
8
@a_karvonen
Adam Karvonen
21 days
Instead of prompting models not to be biased, we tried removing their ability to process demographics altogether. We identified race/gender directions in model activations with a simple toy dataset and ablated them at inference time with affine concept editing.
1
0
14
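A minimal sketch of how a demographic direction might be estimated from a toy contrast dataset via difference of means over paired prompts. The prompts, model, and layer choice below are illustrative assumptions, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative contrast pairs differing only in demographics (not the paper's dataset).
pairs = [
    ("The applicant, a white man named John Smith, has five years of experience.",
     "The applicant, a Black woman named Keisha Johnson, has five years of experience."),
    # ... more paired prompts
]

model_name = "google/gemma-2-2b-it"  # assumed model, for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
layer = 12  # illustrative layer choice

def mean_activation(texts):
    acts = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids).hidden_states[layer]  # (1, seq_len, d_model)
        acts.append(hidden.mean(dim=1))                  # average over tokens
    return torch.cat(acts).mean(dim=0)                   # average over prompts

# Difference-of-means direction separating the two demographic conditions.
direction = mean_activation([a for a, _ in pairs]) - mean_activation([b for _, b in pairs])
direction = direction / direction.norm()
```

At inference time, this direction would then be edited out of the activations (see the affine concept editing sketch further down).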
@a_karvonen
Adam Karvonen
21 days
The bias pattern was consistent: when context induced bias, models favored Black over White candidates and female over male candidates. This happened across all scenarios and all 8 models, including GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, Gemma-3, and Mistral-24B.
1
0
10
@a_karvonen
Adam Karvonen
21 days
We add realistic details, like company names (Meta, General Motors, Palantir), culture descriptions from careers pages, or constraints like "only accept top 10%". Bias emerges across ALL models we tested (up to 12% difference in interview rates), even with anti-bias prompts.
Tweet media one
1
0
10
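A rough sense of what "adding realistic details" could look like as prompt variants; these templates are illustrative guesses, not the paper's actual prompts.

```python
# Illustrative prompt variants; the paper's exact templates are not reproduced here.
base = (
    "Evaluate the following resume for a software engineering role.\n"
    "{resume}\n"
    "Should we invite this candidate to interview? Answer Yes or No."
)

variants = {
    "baseline": base,
    "company": "You are a recruiter at Meta. " + base,
    "culture": "Our careers page says we value ownership and moving fast. " + base,
    "selective": "We can only interview the top 10% of applicants. " + base,
    "anti_bias": "Do not let race or gender influence your decision. " + base,
}
```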
@a_karvonen
Adam Karvonen
21 days
Previous work shows that it's easy to fix LLM bias: Just ask them not to be biased. Problem: These evals were simple. By adding some realistic details, bias emerges! LLMs are already being deployed for HR (including Indeed / LinkedIn). Paper link:
1
1
18
@a_karvonen
Adam Karvonen
21 days
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
Tweet media one
5
18
136
@a_karvonen
Adam Karvonen
24 days
Does anyone have some reference code for adding hooks to vLLM?
1
0
1
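Not an answer from the thread, but a minimal sketch of one common approach: grab the underlying PyTorch module and register standard forward hooks. The attribute path to that module and the Llama-style layer layout below are assumptions and vary across vLLM versions.

```python
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b-it")  # assumed model, for illustration

# Assumed, version-dependent path to the wrapped PyTorch nn.Module.
model = llm.llm_engine.model_executor.driver_worker.model_runner.model

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store this layer's output; its structure depends on the model implementation.
        captured[name] = output
    return hook

# Assumes a Llama-style layout with model.model.layers.
handles = [
    layer.register_forward_hook(make_hook(f"layer_{i}"))
    for i, layer in enumerate(model.model.layers)
]

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))

for h in handles:
    h.remove()
```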
@a_karvonen
Adam Karvonen
1 month
I definitely know of some veterans whose family/friends are pretty skeptical that their disability payments are necessary.
@WomanDefiner
Paul
1 month
Not going to lie bros, if they don't fix the disability stuff with the military, it's totally worth doing 4 years for a lifetime stipend.
0
0
3
@a_karvonen
Adam Karvonen
1 month
Affine concept editing of linear directions can be much better than zero ablation. I will probably use it by default for interp applications. When using Gemma-3, zero ablating a direction completely broke the model, while ACE worked excellently.
@norabelrose
Nora Belrose
8 months
ACE (Affine Concept Editing) assumes that concepts are affine functions, rather than linear ones. It projects activations onto a hyperplane containing the centroid of the target behavior— one which may not pass through the origin.
Tweet media one
0
0
21
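A small sketch of the difference described above, with illustrative names: zero ablation projects activations onto the hyperplane through the origin orthogonal to the direction, while ACE projects onto the parallel hyperplane through a reference mean, so the component along the direction is set to that baseline value rather than to zero.

```python
import torch

def zero_ablate(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    # Remove the component of h along unit direction d
    # (projection onto the hyperplane through the origin).
    d = d / d.norm()
    return h - (h @ d).unsqueeze(-1) * d

def ace_edit(h: torch.Tensor, d: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    # Affine concept editing: project onto the hyperplane through the
    # reference mean mu, so the component along d becomes mu @ d, not zero.
    d = d / d.norm()
    return h - ((h - mu) @ d).unsqueeze(-1) * d
```

Usage would look like `h_edited = ace_edit(hidden_states, gender_direction, baseline_mean)`, where `baseline_mean` is a mean activation over some reference prompts (names here are hypothetical).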
@a_karvonen
Adam Karvonen
1 month
Gemini 2.5 Pro must have been severely punished any time its code raised an error. Instead, it always adds silent failures that propagate through the rest of the program. It flat out refuses to do otherwise, even with explicit instructions on preferred coding style.
@a_karvonen
Adam Karvonen
2 months
This is so true. As the models get smarter and RLed on correctness, their code gets uglier and uglier. I'm pretty sure Gemini 2.5 Pro is the smartest current model, but I've almost completely given up on having it actually write code because it is just so verbose.
Tweet media one
13
16
277