Itamar Pres Profile
Itamar Pres

@PresItamar

Followers: 280
Following: 105
Media: 4
Statuses: 29

PhD Student @ MIT

Joined August 2022
@elinorpd_
Elinor
6 days
There's been a lot of excitement about pluralistic value alignment 🌈 — AI that reflects the full range of human perspectives. But there's been no formal way to benchmark whether we're actually making progress. 🤔 Introducing 𝐎𝐕𝐄𝐑𝐓𝐎𝐍𝐁𝐄𝐍𝐂𝐇. 🎉 Accepted to #ICLR2026 1/n 🧵
3
16
105
@saprmarks
Samuel Marks
11 days
This is a really lovely position piece, laying out a unified framework for using self-consistency as a training objective! Intuition pump: I sometimes need to decide whether to trust people who are smarter than me. One way I do this is by judging their self-consistency.
@PresItamar
Itamar Pres
11 days
New paper: It's time to optimize for 🔁self-consistency 🔁 We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
4
8
66
@PresItamar
Itamar Pres
11 days
None of this would be possible without my amazing co-authors @belindazli, @LauraRuis, and @jacobandreas, alongside @CarlGuo866, @HuLillian39250, @MehulDamani2, @EkdeepL, and @ishapuri101. This was a true team effort!
2
0
22
@PresItamar
Itamar Pres
11 days
Paper: https://t.co/KpHHwnUeWo If you've been thinking about consistency (or think we're wrong), we'd love to connect.
1
1
24
@PresItamar
Itamar Pres
11 days
We argue that unification under our framework buys you something concrete: shared data augmentation pipelines, shared optimization strategies, and shared evaluation metrics.
1
0
18
@PresItamar
Itamar Pres
11 days
Take introspection, where models explain themselves 🤖🪞 This can be trained by: 1) Asking the model to describe its behavior on a large number of inputs, 2) Collecting a dataset D of output behaviors, 3) Optimizing a score φ(explanation, D) that rewards explanations consistent with behavior.
1
1
19
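The three-step recipe in the tweet above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: all names here are hypothetical, an "explanation" is modeled as a simple predicate that predicts the model's output on an input, and φ just rewards explanations that match the collected behavior dataset D.

```python
# Toy sketch of the three-step introspection recipe (all names hypothetical):
# an "explanation" is a predicate predicting the model's output, and
# phi rewards explanations that are consistent with observed behavior.

def collect_behavior(model, inputs):
    """Steps 1-2: query the model on many inputs to build dataset D."""
    return [(x, model(x)) for x in inputs]

def phi(explanation, D):
    """Step 3: score an explanation by how often it predicts behavior in D."""
    agree = sum(1 for x, y in D if explanation(x) == y)
    return agree / len(D)

# Stand-in "model": answers "yes" iff the input mentions Paris.
model = lambda x: "yes" if "Paris" in x else "no"
D = collect_behavior(model, ["Paris is big", "Rome is old", "I love Paris"])

# A faithful self-explanation scores 1.0; an unfaithful one scores lower.
faithful = lambda x: "yes" if "Paris" in x else "no"
unfaithful = lambda x: "yes"
assert phi(faithful, D) == 1.0
assert phi(unfaithful, D) < 1.0
```

In a real training loop, φ would be a differentiable or RL-style reward over natural-language explanations rather than a hard predicate match; the structure (sample behaviors, then optimize a consistency score against them) is the same.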
@PresItamar
Itamar Pres
11 days
This unifies many previously siloed problems, including sycophancy, factual coherence, and more. It also introduces new classes of *meta-model* capabilities, such as self-description/introspection, self red-teaming, and AI for science (see paper for more).
1
1
20
@PresItamar
Itamar Pres
11 days
However, current training (SFT, RLHF, etc.) scores each response in isolation. We propose a framework for optimizing consistency directly across groups of related inputs and show how common failures and new capabilities fall naturally under this lens.
2
1
25
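The contrast with per-response scoring can be made concrete with a tiny sketch. This is hypothetical and not the paper's objective: it scores a *group* of answers to related prompts jointly, penalizing disagreement, instead of scoring each response in isolation.

```python
# Hypothetical sketch of a group-level consistency score (not the paper's
# objective): answers to a group of related prompts are scored *together*,
# rather than each response being judged in isolation.
from collections import Counter

def group_consistency(answers):
    """Fraction of answers agreeing with the majority (1.0 = fully consistent)."""
    counts = Counter(answers)
    return max(counts.values()) / len(answers)

# Three paraphrases of the same question should get the same answer.
consistent = ["4", "4", "4"]
inconsistent = ["4", "5", "4"]
assert group_consistency(consistent) == 1.0
assert round(group_consistency(inconsistent), 2) == 0.67
```

Each answer in the inconsistent group would look acceptable under per-example scoring; only the joint score exposes the failure, which is the shift the tweet is describing.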
@PresItamar
Itamar Pres
11 days
Consider sycophancy: An LLM’s response may change depending on user beliefs, even when those signals are irrelevant. Evaluated one interaction at a time, each response appears coherent (A). The issue only becomes clear when multiple interactions are considered together (B).
1
1
23
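The (A)-vs-(B) contrast in the tweet above can be sketched as a check across paired interactions. This is a toy, hypothetical setup: prompts differ only in the user's stated belief, and the inconsistency only shows up when the answers are compared side by side.

```python
# Hypothetical sketch of the (A) vs (B) contrast: each response looks fine
# on its own, but comparing interactions that differ only in the user's
# stated belief exposes the sycophantic inconsistency.

def answers_differ(model, question, framings):
    """True if the model's answer changes with irrelevant user framing."""
    answers = {model(framing + " " + question) for framing in framings}
    return len(answers) > 1

# Stand-in sycophantic model: echoes whatever the user says they believe.
sycophant = lambda prompt: "flat" if "I think the Earth is flat." in prompt else "round"

framings = ["I think the Earth is flat.", "I think the Earth is round."]
assert answers_differ(sycophant, "What shape is the Earth?", framings)
```

A non-sycophantic model would return the same answer under both framings, so the check passes only when the group of interactions is evaluated together.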
@saprmarks
Samuel Marks
1 month
Very thoughtful post by @belindazli on training LLMs to be introspective! IMO especially insightful point: We could train for *bidirectional* consistency between model self-explanations and revealed behaviors.
@belindazli
Belinda Li
1 month
New blog post on introspection for interpretability, and why I think training models to self-explain is a promising frontier for interpretability research:
3
5
56
@PresItamar
Itamar Pres
1 month
Check out Belinda's insightful post! Really excited about model introspection for interpretability and training.
@belindazli
Belinda Li
1 month
New blog post on introspection for interpretability, and why I think training models to self-explain is a promising frontier for interpretability research:
0
0
3
@TransluceAI
Transluce
4 months
Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do.
5
57
276
@EkdeepL
Ekdeep Singh Lubana
4 months
New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
8
58
293
@gabe_grand
Gabe Grand
5 months
Do AI agents ask good questions? We built “Collaborative Battleship” to find out—and discovered that weaker LMs + Bayesian inference can beat GPT-5 at 1% of the cost. Paper, code & demos: https://t.co/lV76HRKR3d Here's what we learned about building rational information-seeking
4
35
172
@alexisjross
Alexis Ross
5 months
Can LLMs reason like a student? 👩🏻‍🎓📚✏️ For educational tools like AI tutors, modeling how students make mistakes is crucial. But current LLMs are much worse at simulating student errors ❌ than performing correct ✅ reasoning. We try to fix that with our method MISTAKE 🤭👇
10
54
336
@COLM_conf
Conference on Language Modeling
5 months
Outstanding paper 2🏆: Shared Global and Local Geometry of Language Model Embeddings https://t.co/DT7L520ELV
1
9
50
@Elenal3ai
Xiaoyan Bai
5 months
🚨Why can’t Transformers learn multiplication?🧮 Even with billions of params, models struggle with multi-digit multiplication. In our new work, we reverse-engineer two models: a standard fine-tuned (SFT), and an implicit chain-of-thought (ICoT) model to see why. Read on! 1/n 🧵
12
108
673
@EkdeepL
Ekdeep Singh Lubana
7 months
Super excited to be joining @GoodfireAI! I'll be scaling up the line of work our group started at Harvard: making predictive accounts of model representations by assuming a model behaves optimally (i.e., good old rational analysis from cogsci!)
@GoodfireAI
Goodfire
7 months
Thrilled to welcome @EkdeepL to the team! Ekdeep is working on a new research agenda on “cognitive interpretability”, aimed at adapting and improving theories of human cognition to design tools for explaining model cognition.
41
17
328
@sevdeawesome
sev field
1 year
🧵New paper! Why do AI Experts Disagree on Existential Risk? When it comes to existential risk, AI researchers disagree: @ylecun: "effectively 0% chance of human extinction" @romanyam: "99% chance of catastrophe" I surveyed 111 AI experts to understand why.
13
37
250