
Iván Arcuschin
@IvanArcus
Followers: 286 | Following: 9K | Media: 25 | Statuses: 69
Independent Researcher | AI Safety & Software Engineering
Argentina
Joined March 2011
🚨 Wanna know how to increase reasoning behaviors in thinking LLMs? Read our recent work! 👇
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
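A minimal sketch of what "adding steering vectors to activations" can look like, in plain Python with toy numbers. Everything here is an assumption for illustration: the function names `mean_difference` and `add_steering_vector`, the `scale` parameter, and the difference-of-means recipe for building the vector are hypothetical and are not taken from the paper or its code.

```python
from typing import List

Vector = List[float]


def mean_difference(with_behavior: List[Vector], without_behavior: List[Vector]) -> Vector:
    """Build a candidate steering vector as the difference of mean activations.

    Assumption: this difference-of-means recipe is one common choice, not
    necessarily the one used in the paper.
    """
    def mean(rows: List[Vector]) -> Vector:
        return [sum(col) / len(rows) for col in zip(*rows)]

    return [a - b for a, b in zip(mean(with_behavior), mean(without_behavior))]


def add_steering_vector(activations: List[Vector], steering: Vector, scale: float = 1.0) -> List[Vector]:
    """Return a copy of per-token activations with scale * steering added to each token."""
    if any(len(row) != len(steering) for row in activations):
        raise ValueError("steering vector must match the activation width")
    return [[a + scale * s for a, s in zip(row, steering)] for row in activations]


# Toy activations: two tokens, hidden size 2 (hypothetical values).
acts = [[1.0, 2.0], [3.0, 4.0]]
steered = add_steering_vector(acts, [0.5, -1.0], scale=2.0)
```

In a real model the same addition would be applied to one layer's hidden states during the forward pass; a positive `scale` would push toward the behavior and a negative one away from it.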
RT @FazlBarez: Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models exp…
RT @jkminder: With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than tr…
🙏 Huge thanks to my collaborators @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy and @MATSprogram for making this research possible!
For follow-up papers, check out "Reasoning Models Don't Always Say What They Think" by @AnthropicAI. And great work by @a_karvone, which also finds unfaithful CoT in the wild!
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
📝 What we removed: We de-emphasized "Restoration Errors" (when models fix mistakes silently, originally discovered by @nouhadziri) because these were mostly from data contamination rather than genuine deception. Focus stayed on the more concerning patterns.
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
RT @amuuueller: Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements ov…
RT @Butanium_: New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode…
Work done with awesome collaborators: @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy at @MATSprogram. Source code: Samples of unfaithful pairs: Paper: