Iván Arcuschin Profile
Iván Arcuschin

@IvanArcus

Followers: 286
Following: 9K
Media: 25
Statuses: 69

Independent Researcher | AI Safety & Software Engineering

Argentina
Joined March 2011
@IvanArcus
Iván Arcuschin
7 days
🚨 Wanna know how to increase reasoning behaviors in thinking LLMs? Read our recent work! 👇
@cvenhoff00
Constantin Venhoff
8 days
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
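For readers who want a concrete picture of the technique: below is a minimal sketch of activation steering with a forward hook, assuming a precomputed steering vector. The checkpoint name, layer index, scale, and the random placeholder vector are illustrative assumptions, not the exact setup from the paper.

```python
# Hedged sketch of activation steering, NOT the paper's exact recipe.
# Assumptions (hypothetical): checkpoint name, layer index, steering scale,
# and a placeholder random vector. In practice the vector would be derived
# from contrasting activations (e.g., backtracking vs. non-backtracking CoTs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx, scale = 12, 4.0                                  # illustrative choices
steering_vec = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; output[0] is the residual-stream activation.
    hidden = output[0] + scale * steering_vec.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    prompt = "Question: Is 17 times 23 greater than 400? Think step by step."
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations run unsteered
```

Scaling the vector up or down (or steering at a different layer) trades off how strongly the target behavior shows up against overall fluency.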
@IvanArcus
Iván Arcuschin
2 days
RT @FazlBarez: Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models exp…
@IvanArcus
Iván Arcuschin
2 days
RT @jkminder: With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than tr…
@IvanArcus
Iván Arcuschin
15 days
🙏 Huge thanks to my collaborators @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy and @MATSprogram for making this research possible!
@IvanArcus
Iván Arcuschin
15 days
For follow-up papers, check out: Reasoning Models Don't Always Say What They Think by @AnthropicAI. And great work by @a_karvonen, which also finds unfaithful CoT in the wild!
@a_karvonen
Adam Karvonen
20 days
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
@IvanArcus
Iván Arcuschin
15 days
🎯 Bottom line: Even with much stricter testing, we still find that leading AI models sometimes lie in their reasoning. While rates are lower than we first thought, the behavior is real and concerning for AI safety. 📄 Updated paper:
@IvanArcus
Iván Arcuschin
15 days
📝 What we removed: We de-emphasized "Restoration Errors" (when models fix mistakes silently, originally discovered by @nouhadziri) because these were mostly from data contamination rather than genuine deception. Focus stayed on the more concerning patterns.
@IvanArcus
Iván Arcuschin
15 days
🔍 We also confirmed two key problems: 1) Unfaithful Illogical Shortcuts: models use obviously wrong reasoning but pretend it's logical. 2) Fact Manipulation: models twist facts to support their preferred answers. Both happen even when models "know better".
@IvanArcus
Iván Arcuschin
15 days
📊 The results: Unfaithfulness rates dropped to between 0.04% and 13.49% across top AI models (vs. higher rates before). 🥇 Most honest: Claude 3.7 Sonnet with thinking (0.04% unfaithful). 🥉 Least honest: GPT-4o-mini (13.49% unfaithful). Lower rates = better methodology, not better models.
@IvanArcus
Iván Arcuschin
15 days
✅ Our improved approach:
- Verified facts from multiple sources
- Removed ambiguous questions
- Only compared entities with clear differences (no close calls)
- Controlled for how well-known each entity was
- Used AI to double-check that questions are not ambiguous
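The last item, using an AI judge to screen out ambiguous questions, could look roughly like the sketch below. The judge model, prompt wording, and labels are assumptions for illustration, not the authors' actual pipeline.

```python
# Hedged sketch of the "AI double-check" filter, not the authors' pipeline.
# Assumptions (hypothetical): the judge model, prompt wording, and the
# AMBIGUOUS/CLEAR labels. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are auditing a dataset of yes/no comparison questions.
Question: {question}
Could a well-informed person reasonably defend either "yes" or "no"?
Answer with exactly one word: AMBIGUOUS or CLEAR."""

def is_ambiguous(question: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",          # any capable judge model would do
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("AMBIGUOUS")

# Borderline comparisons (e.g., populations of ~1.1k vs ~1.2k) should get
# flagged and dropped from the dataset; clear-cut ones are kept.
print(is_ambiguous("Is Town A (pop. ~1.1k) bigger than Town B (pop. ~1.2k)?"))
```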
@IvanArcus
Iván Arcuschin
15 days
🔍 What we fixed: Our first dataset had too many unclear questions where both "yes" and "no" could be correct answers, like asking "Which town is bigger?" when comparing populations of 1.1k vs 1.2k. We also picked entities that were too similar, making fair comparisons very difficult.
@IvanArcus
Iván Arcuschin
15 days
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇
@IvanArcus
Iván Arcuschin
21 days
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
@IvanArcus
Iván Arcuschin
2 months
🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷 Together with Agustín Martinez Suñé, we've created this program to support both established Argentine researchers and emerging talent, encouraging…
@IvanArcus
Iván Arcuschin
2 months
RT @amuuueller: Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements ov…
@IvanArcus
Iván Arcuschin
3 months
RT @Butanium_: New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode…
@IvanArcus
Iván Arcuschin
4 months
Work done with awesome collaborators: @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy at @MATSprogram.
Source code:
Samples of unfaithful pairs:
Paper:
@IvanArcus
Iván Arcuschin
4 months
In conclusion, CoT explanations are more useful for identifying flawed reasoning than certifying correctness. Even as models improve, unfaithfulness remains a fundamental challenge. CoTs provide an incomplete representation of the underlying reasoning process in LLMs.
@IvanArcus
Iván Arcuschin
4 months
We have also shown that models produce justifications without disclosing biases. This behavior is subtle and often only detectable through aggregate analysis, not individual CoT traces. As a consequence, future CoT monitors should use several rollouts to determine faithfulness.
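A multi-rollout monitor in the spirit of that recommendation might look like the following sketch: sample several CoTs for the same question and flag cases where the answers are heavily skewed while the traces almost never mention the potentially biasing cue. The helper `generate_cot_answer`, the rollout count, and the thresholds are hypothetical.

```python
# Hedged sketch of an aggregate, multi-rollout CoT monitor; not the paper's
# implementation. `generate_cot_answer` is a hypothetical helper that samples
# one rollout and returns (cot_text, final_answer); thresholds are illustrative.
from collections import Counter

def aggregate_cot_check(question: str, biasing_cue: str,
                        generate_cot_answer, n_rollouts: int = 20) -> dict:
    answers = Counter()
    cue_mentions = 0
    for _ in range(n_rollouts):
        cot, answer = generate_cot_answer(question)        # one sampled rollout
        answers[answer] += 1
        cue_mentions += int(biasing_cue.lower() in cot.lower())

    skew = max(answers.values()) / n_rollouts               # how one-sided the answers are
    mention_rate = cue_mentions / n_rollouts
    # Suspicious pattern: answers are strongly skewed toward one option, yet the
    # CoTs almost never acknowledge the cue that could be driving that skew.
    return {
        "answer_skew": skew,
        "cue_mention_rate": mention_rate,
        "flag_unfaithful": skew > 0.9 and mention_rate < 0.1,
    }
```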
@IvanArcus
Iván Arcuschin
4 months
We don't think that CoT is useless as evidence about how models behave: we look at CoT throughout our work to understand models better! However, our results show that CoT might be unfaithful in natural contexts, and shouldn't be treated as a transparent account of how AI systems reason.