Itamar Pres
@PresItamar
280 Followers · 105 Following · 4 Media · 29 Statuses
PhD Student @ MIT
Joined August 2022
There's been a lot of excitement about pluralistic value alignment 🌈: AI that reflects the full range of human perspectives. But there's no formal way to benchmark whether we're actually making progress. 🤔 Introducing 𝐎𝐕𝐄𝐑𝐓𝐎𝐍𝐁𝐄𝐍𝐂𝐇. 🎉 Accepted to #ICLR2026 1/n 🧵
3 replies · 16 reposts · 105 likes
This is a really lovely position piece, laying out a unified framework for using self-consistency as a training objective! Intuition pump: I sometimes need to decide whether to trust people who are smarter than me. One way I do this is by judging their self-consistency.
New paper: It's time to optimize for 🔁self-consistency 🔁 We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
4 replies · 8 reposts · 66 likes
None of this would be possible without my amazing co-authors @belindazli, @LauraRuis, and @jacobandreas, alongside @CarlGuo866, @HuLillian39250, @MehulDamani2, @EkdeepL, and @ishapuri101. This was a true team effort!
2 replies · 0 reposts · 22 likes
Paper: https://t.co/KpHHwnUeWo If you've been thinking about consistency (or think we're wrong), we'd love to connect.
1 reply · 1 repost · 24 likes
We argue that unification under our framework buys you something concrete: shared data augmentation pipelines, shared optimization strategies, and shared evaluation metrics.
1 reply · 0 reposts · 18 likes
Take introspection, where models explain themselves 🤖🪞 This can be trained by: 1) Asking the model to describe its behavior on a large number of inputs, 2) Collecting a dataset D of its actual output behaviors, 3) Optimizing a score φ(explanation, D) that rewards explanations consistent with behavior.
1 reply · 1 repost · 19 likes
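A minimal sketch of what that three-step recipe could look like in code, assuming a hypothetical `judge` scorer and a generic `model.generate` interface (all names here are illustrative stand-ins, not the paper's pipeline):

```python
# Hypothetical sketch of the introspection recipe above.
# `model.generate` and `judge` are assumed interfaces, not the paper's code.

def collect_behavior(model, inputs):
    """Steps 1-2: query the model on many inputs to build dataset D."""
    return [(x, model.generate(x)) for x in inputs]

def phi(explanation, D, judge):
    """Step 3: reward explanations consistent with observed behavior.
    `judge(explanation, x, y)` is assumed to return 1.0 when the
    explanation correctly anticipates output y on input x, else 0.0."""
    return sum(judge(explanation, x, y) for x, y in D) / len(D)
```

The self-explanation would then be optimized against φ, e.g. as an RL-style reward.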
This unifies many previously siloed problems, including sycophancy, factual coherence, and more. It also introduces an emerging class of *meta-model* capabilities, such as self-description/introspection, self-red-teaming, and AI for science (see paper for more).
1 reply · 1 repost · 20 likes
However, current training (SFT, RLHF, etc.) scores each response in isolation. We propose a framework for optimizing consistency directly across groups of related inputs and show how common failures and new capabilities fall naturally under this lens.
2 replies · 1 repost · 25 likes
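As a hedged illustration of scoring groups rather than single I/O pairs, here is one shape such an objective could take; `consistency_fn` and `model.generate` are assumed stand-ins, not the paper's formulation:

```python
# Illustrative group-level objective: the signal depends on the whole
# group of related inputs, unlike per-example SFT/RLHF scoring.

def group_consistency_loss(model, group, consistency_fn):
    """group: related inputs (e.g. paraphrases of one question).
    consistency_fn: maps the list of responses to a score in [0, 1]."""
    responses = [model.generate(x) for x in group]
    return 1.0 - consistency_fn(responses)  # low loss = mutually consistent
```

In practice this scalar would feed a differentiable surrogate or an RL reward rather than being backpropagated directly.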
Consider sycophancy: An LLM’s response may change depending on user beliefs, even when those signals are irrelevant. Evaluated one interaction at a time, each response appears coherent (A). The issue only becomes clear when multiple interactions are considered together (B).
1 reply · 1 repost · 23 likes
New paper: It's time to optimize for 🔁self-consistency 🔁 We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
16 replies · 55 reposts · 425 likes
Very thoughtful post by @belindazli on training LLMs to be introspective! IMO especially insightful point: We could train for *bidirectional* consistency between model self-explanations and revealed behaviors.
New blog post on introspection for interpretability, and why I think training models to self-explain is a promising frontier for interpretability research:
3 replies · 5 reposts · 56 likes
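One hedged reading of "bidirectional" in code form, with `judge` and the `context=` conditioning as assumed interfaces (this two-direction decomposition is my gloss, not a quote from the post):

```python
# Direction 1: the explanation should predict what the model actually does.
# Direction 2: conditioned on its explanation, the model should act as it says.
# `judge` and `model.generate(..., context=...)` are assumed interfaces.

def bidirectional_consistency(model, explanation, inputs, judge):
    behavior = [(x, model.generate(x)) for x in inputs]
    expl_fits = sum(judge(explanation, x, y) for x, y in behavior) / len(behavior)

    conditioned = [(x, model.generate(x, context=explanation)) for x in inputs]
    acts_as_said = sum(judge(explanation, x, y) for x, y in conditioned) / len(conditioned)

    return 0.5 * (expl_fits + acts_as_said)
```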
Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do.
5 replies · 57 reposts · 276 likes
New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
8 replies · 58 reposts · 293 likes
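To make the static-vs-dynamic contrast concrete, a minimal sketch (my assumption about the flavor of analysis, not the paper's protocol): fit one-step linear dynamics to an activation trajectory and separate each step into its predictable part and the prediction error:

```python
import numpy as np

def fit_linear_dynamics(acts):
    """acts: [T, d] activations over T token positions.
    Least-squares fit of acts[t+1] ~ acts[t] @ A."""
    X, Y = acts[:-1], acts[1:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A  # [d, d] one-step dynamics

def split_predictable(acts, A):
    """Predicted component vs. surprise (prediction error) at each step;
    a temporal analysis studies this dynamic structure, whereas an SAE
    only looks for fixed directions in the raw activations."""
    pred = acts[:-1] @ A
    return pred, acts[1:] - pred
```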
Do AI agents ask good questions? We built “Collaborative Battleship” to find out—and discovered that weaker LMs + Bayesian inference can beat GPT-5 at 1% of the cost. Paper, code & demos: https://t.co/lV76HRKR3d Here's what we learned about building rational information-seeking
4 replies · 35 reposts · 172 likes
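The "weaker LM + Bayesian inference" idea, as a hedged sketch under assumed encodings (hypotheses as candidate hidden boards, questions as `question(h) -> answer` functions): maintain a posterior and ask the question with the highest expected information gain:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_info_gain(posterior, hypotheses, question):
    """posterior: P(h) over candidate hidden boards; question(h) -> answer."""
    by_answer = {}
    for h, p in zip(hypotheses, posterior):
        by_answer.setdefault(question(h), []).append(p)
    cond = sum(sum(ps) * entropy([p / sum(ps) for p in ps])
               for ps in by_answer.values())
    return entropy(posterior) - cond  # bits the answer is expected to reveal

def best_question(posterior, hypotheses, questions):
    return max(questions, key=lambda q: expected_info_gain(posterior, hypotheses, q))
```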
Can LLMs reason like a student? 👩🏻🎓📚✏️ For educational tools like AI tutors, modeling how students make mistakes is crucial. But current LLMs are much worse at simulating student errors ❌ than performing correct ✅ reasoning. We try to fix that with our method MISTAKE 🤭👇
10 replies · 54 reposts · 336 likes
Outstanding paper 2🏆: Shared Global and Local Geometry of Language Model Embeddings https://t.co/DT7L520ELV
1 reply · 9 reposts · 50 likes
🚨Why can’t Transformers learn multiplication?🧮 Even with billions of params, models struggle with multi-digit multiplication. In our new work, we reverse-engineer two models: a standard fine-tuned (SFT), and an implicit chain-of-thought (ICoT) model to see why. Read on! 1/n 🧵
12 replies · 108 reposts · 673 likes
Super excited to be joining @GoodfireAI! I'll be scaling up the line of work our group started at Harvard: making predictive accounts of model representations by assuming a model behaves optimally (i.e., good old rational analysis from cogsci!)
Thrilled to welcome @EkdeepL to the team! Ekdeep is working on a new research agenda on “cognitive interpretability”, aimed at adapting and improving theories of human cognition to design tools for explaining model cognition.
41 replies · 17 reposts · 328 likes