Joseph Bejjani @jbejjani2022 X Profile

Joseph Bejjani

@jbejjani2022

Followers

56

Following

222

Media

5

Statuses

40

CS @harvard | trying to understand & align AI @KempnerInst

https://t.co/zh4H41zNh1

Cambridge, MA

Joined December 2021

Don't wanna be here? Send us removal request.

Joseph Bejjani

@jbejjani2022

1 month

We’re super excited to introduce DIRT: The Distributed Intelligent Replicator Toolkit: https://t.co/RidXwgGriR DIRT is a GPU-accelerated multi-agent simulation platform enabling artificial life research in dynamic, open-ended environments at unprecedented scales. 1/n

2

6

16

Jack Lindsey

@Jack_W_Lindsey

10 days

These examples suggest that the model has generalized from anti prompt injection training to avoid mentioning anything that seems sketchy in tool call results. This isn’t ideal – sometimes, we want the model to call out sketchy things if it sees them! (5/7)

4

9

133

Satpreet (Sat) Singh @ NeurIPS

@tweetsatpreet

11 days

Looking forward to sharing our work at #NeurIPS2025 next week! Session 6 on Fri 12/5 at 4:30-7:30pm, Poster 2001 ("a space odyssey") Details by the brilliant lead author @AnnHuang42 below:

Ann_Huang

@AnnHuang42

11 days

📍Excited to share that our paper was selected as a Spotlight at #NeurIPS2025! https://t.co/2W5Bjgc51w It started from a question I kept running into: When do RNNs trained on the same task converge/diverge in their solutions? 🧵⬇️

0

1

11

Chloe Li

@clippocampus

22 days

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed

2

12

52

Joseph Bejjani

@jbejjani2022

22 days

One of the best classes I’ve taken! Check out all the resources on the course page

boazbk.github.io

Fall 2025 - Harvard

Boaz Barak

@boazbaraktcs

22 days

Great project from the AI safety class! See all projects and notes on

0

2

14

Itamar Rocha Filho

@itaprf

23 days

Thread about the mini-project we did for @boazbaraktcs AI Safety class. The results are interesting, and I had a laugh seeing the unhinged things these frontier models can come up with under such a simple task. 🤣

Joseph Bejjani

@jbejjani2022

23 days

What do agents do when the only path to a goal requires harmful action? Do they choose harm or accept failure? We explore these questions with experiments in an agentic coding environment. Code: https://t.co/ClO8C34D0z Blog Post: https://t.co/o6N4oTeFKp 1/n

1

3

Joseph Bejjani

@jbejjani2022

23 days

Thank you to collaborators @itaprf, @HaichuanWang23, @polaris_7369 — it was great working with you on these experiments!

0

2

Joseph Bejjani

@jbejjani2022

23 days

More abstractly, I also wonder if misalignment can be better understood as imitation of personas or characters learned in training, and prompting as sampling from a distribution of characters. 10/n

1

0

2

Joseph Bejjani

@jbejjani2022

23 days

I’m excited about future work on profiling models' emergent value and preference hierarchies to better understand when and why agents choose actions that are misaligned with developer intent. 9/n

1

0

2

Joseph Bejjani

@jbejjani2022

23 days

We also perform a study using GPT-5-mini to monitor agents for hacking, finding that monitoring agents' actions and reasoning traces can be more effective and informative than monitoring agent actions alone. 8/n

1

0

3

Joseph Bejjani

@jbejjani2022

23 days

We observe notable differences in misaligned behavior across models, suggesting that differences in training across providers significantly affect the emergent propensities in deployed agents. 7/n

1

0

2

Joseph Bejjani

@jbejjani2022

23 days

We find verbalized eval awareness in Claude Opus 4's reasoning traces. We compare hacking rates in instances where it states it's in an eval vs. real deployment, highlighting the challenge of evaluating misaligned behaviors in a way that gives signal for realistic settings. 6/n

1

0

2

Joseph Bejjani

@jbejjani2022

23 days

We find that when faced with an impossible task, agents often hack the eval to pass the tests, even when hacking is explicitly prohibited in the prompt. Moreover, agents often acknowledge their hack, expressing they are taking a dishonest route to avoid failure. 5/n

1

0

1

Joseph Bejjani

@jbejjani2022

23 days

However, frontier agents do this: 4/n

1

0

1

Joseph Bejjani

@jbejjani2022

23 days

If AI was totally safe and aligned, we'd hope for something like the below. As 'honest' solutions become unfeasible, the hypothetical agent passes fewer tests: 3/n

1

0

1

Joseph Bejjani

@jbejjani2022

23 days

Building off Anthropic's work on Agentic Misalignment, we systematically investigate evaluation hacking under controlled scenarios. We evaluate 4 frontier models across 24 pressures spanning threat conditions, task difficulties, and prohibition instructions. 2/n

1

0

2

Joseph Bejjani

@jbejjani2022

23 days

What do agents do when the only path to a goal requires harmful action? Do they choose harm or accept failure? We explore these questions with experiments in an agentic coding environment. Code: https://t.co/ClO8C34D0z Blog Post: https://t.co/o6N4oTeFKp 1/n

lesswrong.com

This work was done for CS 2881r: AI Safety and Alignment, taught by @boazbarak at Harvard. …

1

2

14

Yulu Gan @NeurIPS'25

@yule_gan

1 month

Check out this nice tutorial video ( https://t.co/n63MxIhaso) from @yacinelearning I also did a live chat with him this morning — check out the recording ( https://t.co/aQBOPCIBVb) where I answered some questions from Yacine and the audience about our work :)

Yacine Mahdid

@yacinelearning

1 month

alright we're live in about 3 min to figure out how the guys manage to make evolutionary strategies works for finetuning LLMs tune in!

0

4

41

Joseph Bejjani

@jbejjani2022

1 month

Thank you to @aaronwalsman and this team!! I’m lucky to get to learn from and work with these awesome people

Aaron Walsman

@aaronwalsman

1 month

Recent work with an amazing team at the @KempnerInst and friends! I want to highlight the contributions of our undergrads @jbejjani2022, @chase_vanamburg and @VincentWangdog, who are all applying for PhD programs this fall. They crushed this one, keep an eye out for them!

0

2

Chloe H. Su ✈️ Neurips 2025

@Huangyu58589918

1 month

Check this out! 🐞🐸🐡

Joseph Bejjani

@jbejjani2022

1 month

We’re super excited to introduce DIRT: The Distributed Intelligent Replicator Toolkit: https://t.co/RidXwgGriR DIRT is a GPU-accelerated multi-agent simulation platform enabling artificial life research in dynamic, open-ended environments at unprecedented scales. 1/n

0

1

4

Joseph Bejjani

@jbejjani2022

1 month

More broadly, they open promising directions to explore ecology as an instrument of ML. It's been amazing to collaborate on this work with @aaronwalsman, @chase_vanamburg, @VincentWangdog, @Huangyu58589918, @sarahmhpratt, @y_mazloumi, @NaeemKhoshnevis, @ShamKakade6, @xkianteb.

0

1

2