Itamar Pres
@PresItamar
280 Followers · 105 Following · 4 Media · 29 Statuses
PhD Student @ MIT
Joined August 2022
There's been a lot of excitement about pluralistic value alignment 🌈: AI that reflects the full range of human perspectives. But there's no formal way to benchmark whether we're actually making progress. 🤔 Introducing 𝐎𝐕𝐄𝐑𝐓𝐎𝐍𝐁𝐄𝐍𝐂𝐇. 🎉 Accepted to #ICLR2026 1/n 🧵
3 replies · 16 reposts · 105 likes
This is a really lovely position piece, laying out a unified framework for using self-consistency as a training objective! Intuition pump: I sometimes need to decide whether to trust people who are smarter than me. One way I do this is by judging their self-consistency.
New paper: It's time to optimize for 🔁self-consistency 🔁 We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
4 replies · 8 reposts · 66 likes
None of this would be possible without my amazing co-authors @belindazli, @LauraRuis, and @jacobandreas, alongside @CarlGuo866, @HuLillian39250, @MehulDamani2, @EkdeepL, and @ishapuri101. This was a true team effort!
2 replies · 0 reposts · 22 likes
Paper: https://t.co/KpHHwnUeWo If you've been thinking about consistency (or think we're wrong), we'd love to connect.
1 reply · 1 repost · 24 likes
We argue that unification under our framework buys you something concrete: shared data augmentation pipelines, shared optimization strategies, and shared evaluation metrics.
1 reply · 0 reposts · 18 likes
Take introspection, where models explain themselves 🤖🪞 This can be trained by: 1) Asking the model to describe its behavior on a large number of inputs, 2) Collecting a dataset D of its actual output behaviors, 3) Optimizing a score φ(explanation, D) that rewards explanations consistent with behavior.
1 reply · 1 repost · 19 likes
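A minimal sketch of what that three-step recipe could look like in code, assuming a hypothetical `judge` scorer and a generic `model.generate` interface (all names here are illustrative stand-ins, not the paper's pipeline):

```python
# Hypothetical sketch of the introspection recipe above.
# `model.generate` and `judge` are assumed interfaces, not the paper's code.

def collect_behavior(model, inputs):
    """Steps 1-2: query the model on many inputs to build dataset D."""
    return [(x, model.generate(x)) for x in inputs]

def phi(explanation, D, judge):
    """Step 3: reward explanations consistent with observed behavior.
    `judge(explanation, x, y)` is assumed to return 1.0 when the
    explanation correctly anticipates output y on input x, else 0.0."""
    return sum(judge(explanation, x, y) for x, y in D) / len(D)
```

The self-explanation would then be optimized against φ, e.g. as an RL-style reward.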
This unifies many previously siloed problems, including sycophancy, factual coherence, and more. It also introduces an emerging class of *meta-model* capabilities, such as self-description/introspection, self-red-teaming, and AI for science (see paper for more).
1 reply · 1 repost · 20 likes
However, current training (SFT, RLHF, etc.) scores each response in isolation. We propose a framework for optimizing consistency directly across groups of related inputs and show how common failures and new capabilities fall naturally under this lens.
2 replies · 1 repost · 25 likes
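As a hedged illustration of scoring groups rather than single I/O pairs, here is one shape such an objective could take; `consistency_fn` and `model.generate` are assumed stand-ins, not the paper's formulation:

```python
# Illustrative group-level objective: the signal depends on the whole
# group of related inputs, unlike per-example SFT/RLHF scoring.

def group_consistency_loss(model, group, consistency_fn):
    """group: related inputs (e.g. paraphrases of one question).
    consistency_fn: maps the list of responses to a score in [0, 1]."""
    responses = [model.generate(x) for x in group]
    return 1.0 - consistency_fn(responses)  # low loss = mutually consistent
```

In practice this scalar would feed a differentiable surrogate or an RL reward rather than being backpropagated directly.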
Consider sycophancy: An LLM’s response may change depending on user beliefs, even when those signals are irrelevant. Evaluated one interaction at a time, each response appears coherent (A). The issue only becomes clear when multiple interactions are considered together (B).
1 reply · 1 repost · 23 likes
New paper: It's time to optimize for 🔁self-consistency 🔁 We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
16 replies · 55 reposts · 425 likes
Very thoughtful post by @belindazli on training LLMs to be introspective! IMO especially insightful point: We could train for *bidirectional* consistency between model self-explanations and revealed behaviors.
New blog post on introspection for interpretability, and why I think training models to self-explain is a promising frontier for interpretability research:
3 replies · 5 reposts · 56 likes
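One hedged reading of "bidirectional" in code form, with `judge` and the `context=` conditioning as assumed interfaces (this two-direction decomposition is my gloss, not a quote from the post):

```python
# Direction 1: the explanation should predict what the model actually does.
# Direction 2: conditioned on its explanation, the model should act as it says.
# `judge` and `model.generate(..., context=...)` are assumed interfaces.

def bidirectional_consistency(model, explanation, inputs, judge):
    behavior = [(x, model.generate(x)) for x in inputs]
    expl_fits = sum(judge(explanation, x, y) for x, y in behavior) / len(behavior)

    conditioned = [(x, model.generate(x, context=explanation)) for x in inputs]
    acts_as_said = sum(judge(explanation, x, y) for x, y in conditioned) / len(conditioned)

    return 0.5 * (expl_fits + acts_as_said)
```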
Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do.
5 replies · 57 reposts · 276 likes
New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
8 replies · 58 reposts · 293 likes
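To make the static-vs-dynamic contrast concrete, a minimal sketch (my assumption about the flavor of analysis, not the paper's protocol): fit one-step linear dynamics to an activation trajectory and separate each step into its predictable part and the prediction error:

```python
import numpy as np

def fit_linear_dynamics(acts):
    """acts: [T, d] activations over T token positions.
    Least-squares fit of acts[t+1] ~ acts[t] @ A."""
    X, Y = acts[:-1], acts[1:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A  # [d, d] one-step dynamics

def split_predictable(acts, A):
    """Predicted component vs. surprise (prediction error) at each step;
    a temporal analysis studies this dynamic structure, whereas an SAE
    only looks for fixed directions in the raw activations."""
    pred = acts[:-1] @ A
    return pred, acts[1:] - pred
```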
Do AI agents ask good questions? We built “Collaborative Battleship” to find out—and discovered that weaker LMs + Bayesian inference can beat GPT-5 at 1% of the cost. Paper, code & demos: https://t.co/lV76HRKR3d Here's what we learned about building rational information-seeking
4 replies · 35 reposts · 172 likes
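The "weaker LM + Bayesian inference" idea, as a hedged sketch under assumed encodings (hypotheses as candidate hidden boards, questions as `question(h) -> answer` functions): maintain a posterior and ask the question with the highest expected information gain:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_info_gain(posterior, hypotheses, question):
    """posterior: P(h) over candidate hidden boards; question(h) -> answer."""
    by_answer = {}
    for h, p in zip(hypotheses, posterior):
        by_answer.setdefault(question(h), []).append(p)
    cond = sum(sum(ps) * entropy([p / sum(ps) for p in ps])
               for ps in by_answer.values())
    return entropy(posterior) - cond  # bits the answer is expected to reveal

def best_question(posterior, hypotheses, questions):
    return max(questions, key=lambda q: expected_info_gain(posterior, hypotheses, q))
```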
Can LLMs reason like a student? 👩🏻🎓📚✏️ For educational tools like AI tutors, modeling how students make mistakes is crucial. But current LLMs are much worse at simulating student errors ❌ than performing correct ✅ reasoning. We try to fix that with our method MISTAKE 🤭👇
10 replies · 54 reposts · 336 likes
Outstanding paper 2🏆: Shared Global and Local Geometry of Language Model Embeddings https://t.co/DT7L520ELV
1 reply · 9 reposts · 50 likes
🚨Why can’t Transformers learn multiplication?🧮 Even with billions of params, models struggle with multi-digit multiplication. In our new work, we reverse-engineer two models: a standard fine-tuned (SFT), and an implicit chain-of-thought (ICoT) model to see why. Read on! 1/n 🧵
12 replies · 108 reposts · 673 likes
Super excited to be joining @GoodfireAI! I'll be scaling up the line of work our group started at Harvard: making predictive accounts of model representations by assuming a model behaves optimally (i.e., good old rational analysis from cogsci!)
Thrilled to welcome @EkdeepL to the team! Ekdeep is working on a new research agenda on “cognitive interpretability”, aimed at adapting and improving theories of human cognition to design tools for explaining model cognition.
41 replies · 17 reposts · 328 likes