
Adam Karvonen
@a_karvonen
Followers: 3K
Following: 7K
Media: 83
Statuses: 1K
ML Researcher, doing MATS with Owain Evans. I prefer email to DM.
Berkeley, CA
Joined September 2023
@FazlBarez @yanaiela Post related to your recent paper with an example of a 100% unfaithful chain of thought in the wild.
@AlexTamkin @nabla_theta @livgorton @chris_j_paxton @megamor2 @zacharynado @tomdlt10 @adamrpearce @jxmnop @davidbau Example of white-box access giving an advantage over prompting in real-world hiring bias evals.
Demonstration of major LLM race / gender bias in hiring, and a simple interpretability mitigation. Also, an example of unfaithful chain of thought in the wild. @AlexTamkin @nabla_theta @livgorton @chris_j_paxton @megamor2 @zacharynado @tomdlt10 @adamrpearce @jxmnop.
Affine concept editing of linear directions can be much better than zero ablation. I will probably use it by default for interp applications. When using Gemma-3, zero ablating a direction completely broke the model, while ACE worked excellently.
ACE (Affine Concept Editing) assumes that concepts are affine functions, rather than linear ones. It projects activations onto a hyperplane containing the centroid of the target behavior, one which may not pass through the origin.
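A minimal NumPy sketch of the difference between the two edits described above, under stated assumptions: `zero_ablate` and `affine_edit` are hypothetical helper names, activations are rows of a 2-D array, and the "centroid of the target behavior" is passed in as a precomputed mean activation. This illustrates the geometry only and is not the implementation used in the posts.

```python
import numpy as np

def zero_ablate(acts, direction):
    """Remove the component along `direction`, projecting onto a hyperplane through the origin."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def affine_edit(acts, direction, centroid):
    """Set the component along `direction` to the centroid's value.

    This projects onto the hyperplane {x : x.d = centroid.d}, which contains
    the centroid and need not pass through the origin.
    """
    d = direction / np.linalg.norm(direction)
    target = centroid @ d  # scalar offset of the hyperplane along d
    return acts - np.outer(acts @ d - target, d)

# Illustrative data (assumed): activations with a large non-zero mean.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8)) + 3.0
direction = rng.normal(size=8)
centroid = acts.mean(axis=0)  # stand-in for the target-behavior centroid

ablated = zero_ablate(acts, direction)
edited = affine_edit(acts, direction, centroid)
```

Both edits leave every component orthogonal to `direction` unchanged. They differ only in where activations land along the direction: zero ablation forces that coordinate to 0, which can be far from typical values when activations have a large mean, while the affine edit moves it to the centroid's coordinate.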
Gemini 2.5 Pro must have been severely punished any time its code raised an error. Instead, it always adds silent failures that propagate through the rest of the program. It flat-out refuses to do otherwise, even with explicit instructions on preferred coding style.
This is so true. As the models get smarter and RLed on correctness, their code gets uglier and uglier. I'm pretty sure Gemini 2.5 Pro is the smartest current model, but I've almost completely given up on having it actually write code because it is just so verbose.