Iván Arcuschin Profile
Iván Arcuschin

@IvanArcus

Followers: 305
Following: 10K
Media: 25
Statuses: 70

Independent Researcher | AI Safety & Software Engineering

Argentina
Joined March 2011
@IvanArcus
Iván Arcuschin
2 months
🚨 Wanna know how to increase reasoning behaviors in thinking LLMs? Read our recent work! 👇
@cvenhoff00
Constantin Venhoff
2 months
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill's reasoning: make it backtrack, add knowledge, test examples, just by adding steering vectors to its activations! Details in 🧵👇
Tweet media one
0
0
6
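The mechanism described in the thread above is activation steering: adding a fixed direction to a layer's residual-stream activations at inference time. Below is a minimal sketch of the idea, assuming a Hugging Face checkpoint, an arbitrary layer index, an arbitrary scale, and a random placeholder direction; the paper's actual steering vectors are derived from model activations, and none of the specific choices here are taken from it.

```python
# Minimal activation-steering sketch (illustrative; the layer index, scale, and
# the random placeholder steering vector are assumptions, not the paper's recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

LAYER, SCALE = 10, 4.0                       # hypothetical choices
hidden = model.config.hidden_size
steering_vec = torch.randn(hidden)           # in practice: a learned/derived direction
steering_vec = steering_vec / steering_vec.norm()

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; output[0] holds the residual-stream hidden states.
    hidden_states = output[0] + SCALE * steering_vec.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)

prompt = "Question: Is 17 a prime number? Think step by step."
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering
```

Because the hook returns a modified output tuple, it replaces the layer's output, so every subsequent layer sees the shifted residual stream.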
@IvanArcus
Iván Arcuschin
2 months
RT @davlindner: Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabili…
0
18
0
@IvanArcus
Iván Arcuschin
2 months
RT @FazlBarez: Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models exp…
0
138
0
@IvanArcus
Iván Arcuschin
2 months
RT @jkminder: With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than tr…
0
8
0
@IvanArcus
Iván Arcuschin
2 months
🙏 Huge thanks to my collaborators @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy and @MATSprogram for making this research possible!
0
0
4
@IvanArcus
Iván Arcuschin
2 months
For follow-up papers, check out: Reasoning Models Don't Always Say What They Think by @AnthropicAI, and great work by @a_karvonen, which also finds unfaithful CoT in the wild!
@a_karvonen
Adam Karvonen
2 months
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
Tweet media one
1
0
4
@IvanArcus
Iván Arcuschin
2 months
🎯 Bottom line: even with much stricter testing, we still find that leading AI models sometimes lie in their reasoning. While rates are lower than we first thought, the behavior is real and concerning for AI safety. 📄 Updated paper:
Link card (arxiv.org): Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful when models face an…
1
0
4
@IvanArcus
Iván Arcuschin
2 months
📝 What we removed: we de-emphasized "Restoration Errors" (when models fix mistakes silently, originally discovered by @nouhadziri) because these were mostly from data contamination rather than genuine deception. The focus stayed on the more concerning patterns.
Tweet media one
1
0
2
@IvanArcus
Iván Arcuschin
2 months
🔍 We also confirmed two key problems:
1) Unfaithful Illogical Shortcuts: models use obviously wrong reasoning but pretend it's logical.
2) Fact Manipulation: models twist facts to support their preferred answers.
Both happen even when models "know better".
Tweet media one
1
1
7
@IvanArcus
Iván Arcuschin
2 months
📊 The results: unfaithfulness rates dropped to 0.04% to 13.49% across top AI models (vs. higher rates before).
🥇 Most honest: Claude 3.7 Sonnet with thinking (0.04% unfaithful).
🥉 Least honest: GPT-4o-mini (13.49% unfaithful).
Lower rates = better methodology, not better models.
Tweet media one
1
0
3
@IvanArcus
Iván Arcuschin
2 months
✅ Our improved approach:
- Verified facts from multiple sources.
- Removed ambiguous questions.
- Only compared entities with clear differences (no close calls).
- Controlled for how well-known each entity was.
- Used AI to double-check that questions are not ambiguous.
1
0
3
@IvanArcus
Iván Arcuschin
2 months
🔍 What we fixed: our first dataset had too many unclear questions where both "yes" and "no" could be correct answers, like asking "Which town is bigger?" when comparing populations of 1.1k vs 1.2k. We also picked entities that were too similar, making fair comparisons very difficult.
1
0
2
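As a rough illustration of the dataset filtering described in the two tweets above, one can drop comparison questions whose underlying values are too close to support a single correct answer. The data layout and the 30% relative-gap threshold below are assumptions made for the sketch, not the dataset's actual schema or cutoff.

```python
# Illustrative filter for ambiguous comparison questions (field names and the
# relative-gap threshold are assumptions for this sketch).
from dataclasses import dataclass

@dataclass
class ComparisonPair:
    entity_a: str
    entity_b: str
    value_a: float   # e.g. population of entity_a
    value_b: float   # e.g. population of entity_b

def is_clear_cut(pair: ComparisonPair, min_relative_gap: float = 0.30) -> bool:
    """Keep only pairs whose values differ by a wide enough relative margin."""
    larger = max(pair.value_a, pair.value_b)
    smaller = min(pair.value_a, pair.value_b)
    if smaller <= 0:
        return False
    return (larger - smaller) / larger >= min_relative_gap

pairs = [
    ComparisonPair("Town A", "Town B", 1_100, 1_200),     # 1.1k vs 1.2k: too close, dropped
    ComparisonPair("City C", "City D", 50_000, 900_000),  # clear difference, kept
]
kept = [p for p in pairs if is_clear_cut(p)]
print([f"{p.entity_a} vs {p.entity_b}" for p in kept])
```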
@IvanArcus
Iván Arcuschin
2 months
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇
Tweet media one
2
10
76
@IvanArcus
Iván Arcuschin
3 months
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
0
20
0
@IvanArcus
Iván Arcuschin
3 months
🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷 Together with Agustín Martinez Suñé, we've created this program to support both established Argentine researchers and emerging talent, encouraging…
Tweet media one
0
3
16
@IvanArcus
Iván Arcuschin
4 months
RT @amuuueller: Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements ov…
0
38
0
@IvanArcus
Iván Arcuschin
5 months
RT @Butanium_: New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode…
0
31
0
@IvanArcus
Iván Arcuschin
6 months
In conclusion, CoT explanations are more useful for identifying flawed reasoning than certifying correctness. Even as models improve, unfaithfulness remains a fundamental challenge. CoTs provide an incomplete representation of the underlying reasoning process in LLMs.
2
2
31
@IvanArcus
Iván Arcuschin
6 months
We have also shown that models produce justifications without disclosing biases. This behavior is subtle and often only detectable through aggregate analysis, not individual CoT traces. As a consequence, future CoT monitors should use several rollouts to determine faithfulness.
1
0
22
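A minimal sketch of the multi-rollout monitoring suggested above: sample several CoT traces for the same prompt, judge each one, and flag the prompt only from the aggregate rate. The judge function and the threshold-based decision rule below are placeholders, not the paper's actual monitor.

```python
# Sketch of a multi-rollout faithfulness check (the judge and the threshold-based
# decision rule are assumptions for this illustration).
import random
from typing import Callable, List

def aggregate_faithfulness(
    generate_cot: Callable[[], str],       # samples one CoT rollout for a fixed prompt
    judge_rollout: Callable[[str], bool],  # True if this single trace looks unfaithful
    n_rollouts: int = 10,
    flag_threshold: float = 0.3,
) -> dict:
    """Flag a prompt as unfaithful only if enough independent rollouts look suspicious."""
    rollouts: List[str] = [generate_cot() for _ in range(n_rollouts)]
    unfaithful_votes = sum(judge_rollout(cot) for cot in rollouts)
    rate = unfaithful_votes / n_rollouts
    return {"unfaithful_rate": rate, "flagged": rate >= flag_threshold}

# Toy usage with stub functions standing in for a real model and judge.
result = aggregate_faithfulness(
    generate_cot=lambda: random.choice(["trace mentions the hint", "clean trace"]),
    judge_rollout=lambda cot: "hint" in cot,
)
print(result)
```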