Iván Arcuschin Profile
Iván Arcuschin

@IvanArcus

Followers: 327
Following: 11K
Media: 26
Statuses: 74

Independent Researcher | AI Safety & Software Engineering

Argentina
Joined March 2011
@IvanArcus
Iván Arcuschin
1 month
Super excited about this new work that's finally out! It was such a wild ride 😄 We create a hybrid model that can recover up to 91% of the performance gap between base and thinking models steering only 12% of tokens!
@cvenhoff00
Constantin Venhoff
1 month
🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
0
1
7
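For readers curious what "invoking those skills at the right time" can look like in code, here is a minimal, hypothetical sketch (not the authors' released code): a forward hook adds a precomputed steering direction to a base model's residual stream at a small fraction of generation steps. The model name, layer, scale, and the steer-or-not policy below are all placeholder assumptions.

```python
# Hypothetical sketch, not the authors' released code: add a precomputed
# "reasoning" direction to a base model's activations at a small fraction of
# generation steps. Model name, layer index, scale, and the steer-or-not
# policy are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"   # assumption: any HF causal LM works here
layer_idx = 12                     # assumption: a mid-depth layer
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Assume a precomputed steering direction (e.g. a difference of means between
# "reasoning" and "non-reasoning" activations); random placeholder here.
steer_vec = torch.randn(model.config.hidden_size)
steer_vec = steer_vec / steer_vec.norm()

state = {"step": 0}

def should_steer(step: int) -> bool:
    # Placeholder policy steering roughly 1 in 8 steps; the paper's point is
    # that *when* to invoke the skill matters, so a real policy would be learned.
    return step % 8 == 0

def hook(module, inputs, output):
    hs = output[0] if isinstance(output, tuple) else output
    if should_steer(state["step"]):
        hs[:, -1, :] += 4.0 * steer_vec.to(hs.dtype)   # scale is a guess
    state["step"] += 1
    return (hs, *output[1:]) if isinstance(output, tuple) else hs

handle = model.model.layers[layer_idx].register_forward_hook(hook)
try:
    ids = tok("Question: what is 17 * 23? Let's think step by step.",
              return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The sketch only shows where such an intervention sits; which tokens to steer and with what scale is exactly the part a hybrid base/thinking approach would have to learn.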
@jkminder
Julian Minder
25 days
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
2
28
156
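As background on the technique, here is a rough sketch of the activation-diffing idea under stated assumptions (the model names, layer choice, and logit-lens readout below are placeholders, not the paper's pipeline): run the same prompts through the base and fine-tuned models, average the difference at one layer, and inspect which tokens that direction promotes.

```python
# Rough sketch under assumptions (model names, layer choice, and the
# logit-lens readout are placeholders, not the paper's pipeline): average the
# activation difference between a fine-tuned model and its base on shared
# prompts, then look at which vocabulary items that direction promotes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "Qwen/Qwen2.5-0.5B"                 # assumption
ft_name = "my-org/qwen2.5-0.5b-narrow-ft"       # hypothetical fine-tune
layer = 8                                        # assumption

tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name).eval()
ft = AutoModelForCausalLM.from_pretrained(ft_name).eval()

prompts = ["The capital of France is", "To bake bread, you first"]
diffs = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        h_base = base(ids, output_hidden_states=True).hidden_states[layer]
        h_ft = ft(ids, output_hidden_states=True).hidden_states[layer]
        diffs.append((h_ft - h_base).mean(dim=(0, 1)))   # average over tokens

mean_diff = torch.stack(diffs).mean(dim=0)               # (d_model,)

# Logit-lens style readout: which tokens does the diff direction push up?
with torch.no_grad():
    logits = base.lm_head(base.model.norm(mean_diff))
print(tok.convert_ids_to_tokens(torch.topk(logits, 10).indices.tolist()))
```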
@MiTerekhov
Mikhail Terekhov
1 month
You are a scheming AI. You find yourself deployed with an AI control protocol, you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues—with modern protocols, prompt injections will easily let you win! (1/9)🧵
2
15
65
@jkminder
Julian Minder
2 months
Can we interpret what happens during finetuning? Yes, if it's for a narrow domain! Narrow fine-tuning leaves traces behind. By comparing activations before and after fine-tuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more
6
23
227
@davlindner
David Lindner
4 months
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
11
18
103
@FazlBarez
Fazl Barez 🔜 @NeurIPS
5 months
Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
28
137
659
@jkminder
Julian Minder
5 months
With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
2
8
105
@IvanArcus
Iván Arcuschin
5 months
🚨Wanna know how to increase reasoning behaviors in thinking LLMs? Read our recent work! 👇
@cvenhoff00
Constantin Venhoff
5 months
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
1
0
7
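As background on how such steering vectors are often built, here is a hedged sketch of a difference-of-means construction; the checkpoint name is real, but the layer, example texts, and everything else are illustrative assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of one common recipe for steering vectors: a difference of
# means between activations on text showing a behavior (backtracking) and
# text that doesn't. Layer and example texts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
layer = 10                                       # assumption
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

with_behavior = ["Wait, that can't be right, let me reconsider the last step."]
without_behavior = ["So the answer is 42 and we are done."]

def mean_act(texts):
    acts = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt").input_ids
            hs = model(ids, output_hidden_states=True).hidden_states[layer]
            acts.append(hs.mean(dim=(0, 1)))     # average over tokens
    return torch.stack(acts).mean(dim=0)

# At inference this vector would be added (scaled) to the same layer's
# activations to promote backtracking, as in the hook-based sketch earlier.
steering_vector = mean_act(with_behavior) - mean_act(without_behavior)
print(steering_vector.shape)
```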
@IvanArcus
Iván Arcuschin
5 months
🙏 Huge thanks to my collaborators @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy and @MATSprogram for making this research possible!
0
0
3
@IvanArcus
Iván Arcuschin
5 months
For follow-up papers, check out: Reasoning Models Don't Always Say What They Think by @AnthropicAI: https://t.co/xToD5v8fE0 And great work by @a_karvonen which also finds unfaithful CoT in the wild! https://t.co/AP36vIJiCT https://t.co/ILbuhfpCc9
@a_karvonen
Adam Karvonen
5 months
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
1
0
4
@IvanArcus
Iván Arcuschin
5 months
🎯 Bottom line: Even with much stricter testing, we still find that leading AI models sometimes lie in their reasoning. While rates are lower than we first thought, the behavior is real and concerning for AI safety. 📄 Updated paper:
arxiv.org
Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful when models face an...
1
0
4
@IvanArcus
Iván Arcuschin
5 months
📝 What we removed: We de-emphasized "Restoration Errors" (when models fix mistakes silently, originally discovered by @nouhadziri) because these were mostly from data contamination rather than genuine deception. Focus stayed on the more concerning patterns.
1
0
2
@IvanArcus
Iván Arcuschin
5 months
🔍 We also confirmed two key problems: 1) Unfaithful Illogical Shortcuts: Models use obviously wrong reasoning but pretend it's logical 2) Fact Manipulation: Models twist facts to support their preferred answers Both happen even when models "know better"
1
1
7
@IvanArcus
Iván Arcuschin
5 months
📊 The results: Unfaithfulness rates dropped to 0.04%–13% across top AI models (vs higher rates before). 🥇 Most honest: Claude 3.7 Sonnet with thinking (0.04% unfaithful) 🥉 Least honest: GPT-4o-mini (13.49% unfaithful) Lower rates = better methodology, not better models
1
0
3
@IvanArcus
Iván Arcuschin
5 months
✅ Our improved approach:
- Verified facts from multiple sources
- Removed ambiguous questions
- Only compared entities with clear differences (no close calls)
- Controlled for how well-known each entity was
- Used AI to double-check that questions are not ambiguous
1
0
3
@IvanArcus
Iván Arcuschin
5 months
🔍 What we fixed: Our first dataset had too many unclear questions where both "yes" and "no" could be correct answers. Like asking "Which town is bigger?" when comparing populations of 1.1k vs 1.2k. We also picked entities that were too similar, making fair comparisons very difficult.
1
0
2
@IvanArcus
Iván Arcuschin
5 months
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇
2
10
75
@MiTerekhov
Mikhail Terekhov
5 months
AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵
4
20
69
@IvanArcus
Iván Arcuschin
6 months
🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷 Together with Agustín Martinez Suñé, we've created this program to support both established Argentine researchers and emerging talent, encouraging
0
3
16
@amuuueller
Aaron Mueller
7 months
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
3
39
172