Iván Arcuschin Profile
Iván Arcuschin

@IvanArcus

Followers: 286
Following: 9K
Media: 25
Statuses: 69

Independent Researcher | AI Safety & Software Engineering

Argentina
Joined March 2011
@IvanArcus
Iván Arcuschin
7 days
🚨 Wanna know how to increase reasoning behaviors in thinking LLMs? Read our recent work! 👇
@cvenhoff00
Constantin Venhoff
8 days
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
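For readers who want a concrete picture of the technique: below is a minimal sketch of activation steering with a forward hook, assuming a precomputed steering vector. The checkpoint name, layer index, scale, and the random placeholder vector are illustrative assumptions, not the exact setup from the paper.

```python
# Hedged sketch of activation steering, NOT the paper's exact recipe.
# Assumptions (hypothetical): checkpoint name, layer index, steering scale,
# and a placeholder random vector. In practice the vector would be derived
# from contrasting activations (e.g., backtracking vs. non-backtracking CoTs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx, scale = 12, 4.0                                  # illustrative choices
steering_vec = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; output[0] is the residual-stream activation.
    hidden = output[0] + scale * steering_vec.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    prompt = "Question: Is 17 times 23 greater than 400? Think step by step."
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations run unsteered
```

Scaling the vector up or down (or steering at a different layer) trades off how strongly the target behavior shows up against overall fluency.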
@IvanArcus
Iván Arcuschin
2 days
RT @FazlBarez: Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models exp…
@IvanArcus
Iván Arcuschin
2 days
RT @jkminder: With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than tr…
@IvanArcus
Iván Arcuschin
15 days
🙏 Huge thanks to my collaborators @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy and @MATSprogram for making this research possible!
@IvanArcus
Iván Arcuschin
15 days
For follow-up papers, check out: Reasoning Models Don't Always Say What They Think by @AnthropicAI. And great work by @a_karvonen, which also finds unfaithful CoT in the wild!
@a_karvonen
Adam Karvonen
20 days
New Paper! Robustly Improving LLM Fairness in Realistic Settings via Interpretability. We show that adding realistic details to existing bias evals triggers race and gender bias in LLMs. Prompt tuning doesn’t fix it, but interpretability-based interventions can. 🧵1/7
@IvanArcus
Iván Arcuschin
15 days
🎯 Bottom line: Even with much stricter testing, we still find that leading AI models sometimes lie in their reasoning. While rates are lower than we first thought, the behavior is real and concerning for AI safety. 📄 Updated paper:
@IvanArcus
Iván Arcuschin
15 days
📝 What we removed: We de-emphasized "Restoration Errors" (when models fix mistakes silently, originally discovered by @nouhadziri) because these were mostly from data contamination rather than genuine deception. Focus stayed on the more concerning patterns.
@IvanArcus
Iván Arcuschin
15 days
🔍 We also confirmed two key problems: 1) Unfaithful Illogical Shortcuts: models use obviously wrong reasoning but pretend it's logical. 2) Fact Manipulation: models twist facts to support their preferred answers. Both happen even when models "know better".
@IvanArcus
Iván Arcuschin
15 days
📊 The results: Unfaithfulness rates dropped to between 0.04% and 13.49% across top AI models (vs. higher rates before). 🥇 Most honest: Claude 3.7 Sonnet with thinking (0.04% unfaithful). 🥉 Least honest: GPT-4o-mini (13.49% unfaithful). Lower rates = better methodology, not better models.
@IvanArcus
Iván Arcuschin
15 days
✅ Our improved approach:
- Verified facts from multiple sources
- Removed ambiguous questions
- Only compared entities with clear differences (no close calls)
- Controlled for how well-known each entity was
- Used AI to double-check that questions are not ambiguous
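The last item, using an AI judge to screen out ambiguous questions, could look roughly like the sketch below. The judge model, prompt wording, and labels are assumptions for illustration, not the authors' actual pipeline.

```python
# Hedged sketch of the "AI double-check" filter, not the authors' pipeline.
# Assumptions (hypothetical): the judge model, prompt wording, and the
# AMBIGUOUS/CLEAR labels. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are auditing a dataset of yes/no comparison questions.
Question: {question}
Could a well-informed person reasonably defend either "yes" or "no"?
Answer with exactly one word: AMBIGUOUS or CLEAR."""

def is_ambiguous(question: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",          # any capable judge model would do
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("AMBIGUOUS")

# Borderline comparisons (e.g., populations of ~1.1k vs ~1.2k) should get
# flagged and dropped from the dataset; clear-cut ones are kept.
print(is_ambiguous("Is Town A (pop. ~1.1k) bigger than Town B (pop. ~1.2k)?"))
```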
@IvanArcus
Iván Arcuschin
15 days
🔍 What we fixed: Our first dataset had too many unclear questions where both "yes" and "no" could be correct answers, like asking "Which town is bigger?" when comparing populations of 1.1k vs 1.2k. We also picked entities that were too similar, making fair comparisons very difficult.
@IvanArcus
Iván Arcuschin
15 days
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇
@IvanArcus
Iván Arcuschin
21 days
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
@IvanArcus
Iván Arcuschin
2 months
🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷 Together with Agustín Martinez Suñé, we've created this program to support both established Argentine researchers and emerging talent, encouraging…
@IvanArcus
Iván Arcuschin
2 months
RT @amuuueller: Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements ov…
@IvanArcus
Iván Arcuschin
3 months
RT @Butanium_: New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode…
@IvanArcus
Iván Arcuschin
4 months
Work done with awesome collaborators: @jettjaniak @robertzzk @sen_r @NeelNanda5 @ArthurConmy at @MATSprogram.
Source code:
Samples of unfaithful pairs:
Paper:
@IvanArcus
Iván Arcuschin
4 months
In conclusion, CoT explanations are more useful for identifying flawed reasoning than certifying correctness. Even as models improve, unfaithfulness remains a fundamental challenge. CoTs provide an incomplete representation of the underlying reasoning process in LLMs.
@IvanArcus
Iván Arcuschin
4 months
We have also shown that models produce justifications without disclosing biases. This behavior is subtle and often only detectable through aggregate analysis, not individual CoT traces. As a consequence, future CoT monitors should use several rollouts to determine faithfulness.
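A multi-rollout monitor in the spirit of that recommendation might look like the following sketch: sample several CoTs for the same question and flag cases where the answers are heavily skewed while the traces almost never mention the potentially biasing cue. The helper `generate_cot_answer`, the rollout count, and the thresholds are hypothetical.

```python
# Hedged sketch of an aggregate, multi-rollout CoT monitor; not the paper's
# implementation. `generate_cot_answer` is a hypothetical helper that samples
# one rollout and returns (cot_text, final_answer); thresholds are illustrative.
from collections import Counter

def aggregate_cot_check(question: str, biasing_cue: str,
                        generate_cot_answer, n_rollouts: int = 20) -> dict:
    answers = Counter()
    cue_mentions = 0
    for _ in range(n_rollouts):
        cot, answer = generate_cot_answer(question)        # one sampled rollout
        answers[answer] += 1
        cue_mentions += int(biasing_cue.lower() in cot.lower())

    skew = max(answers.values()) / n_rollouts               # how one-sided the answers are
    mention_rate = cue_mentions / n_rollouts
    # Suspicious pattern: answers are strongly skewed toward one option, yet the
    # CoTs almost never acknowledge the cue that could be driving that skew.
    return {
        "answer_skew": skew,
        "cue_mention_rate": mention_rate,
        "flag_unfaithful": skew > 0.9 and mention_rate < 0.1,
    }
```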
@IvanArcus
Iván Arcuschin
4 months
We don't think that CoT is useless as evidence about how models behave: we look at CoT throughout our work to understand models better! However, our results show that CoT might be unfaithful in natural contexts, and shouldn't be treated as a transparent account of how AI systems reason.