Joseph Bejjani Profile
Joseph Bejjani

@jbejjani2022

Followers
56
Following
222
Media
5
Statuses
40

CS @harvard | trying to understand & align AI @KempnerInst

Cambridge, MA
Joined December 2021
Don't wanna be here? Send us removal request.
@jbejjani2022
Joseph Bejjani
1 month
We’re super excited to introduce DIRT: The Distributed Intelligent Replicator Toolkit: https://t.co/RidXwgGriR DIRT is a GPU-accelerated multi-agent simulation platform enabling artificial life research in dynamic, open-ended environments at unprecedented scales. 1/n
2
6
16
@Jack_W_Lindsey
Jack Lindsey
10 days
These examples suggest that the model has generalized from anti prompt injection training to avoid mentioning anything that seems sketchy in tool call results. This isn’t ideal – sometimes, we want the model to call out sketchy things if it sees them! (5/7)
4
9
133
@tweetsatpreet
Satpreet (Sat) Singh @ NeurIPS
11 days
Looking forward to sharing our work at #NeurIPS2025 next week! Session 6 on Fri 12/5 at 4:30-7:30pm, Poster 2001 ("a space odyssey") Details by the brilliant lead author @AnnHuang42 below:
@AnnHuang42
Ann_Huang
11 days
📍Excited to share that our paper was selected as a Spotlight at #NeurIPS2025! https://t.co/2W5Bjgc51w It started from a question I kept running into: When do RNNs trained on the same task converge/diverge in their solutions? 🧵⬇️
0
1
11
@clippocampus
Chloe Li
22 days
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed
2
12
52
@jbejjani2022
Joseph Bejjani
22 days
One of the best classes I’ve taken! Check out all the resources on the course page
boazbk.github.io
Fall 2025 - Harvard
@boazbaraktcs
Boaz Barak
22 days
Great project from the AI safety class! See all projects and notes on
0
2
14
@itaprf
Itamar Rocha Filho
23 days
Thread about the mini-project we did for @boazbaraktcs AI Safety class. The results are interesting, and I had a laugh seeing the unhinged things these frontier models can come up with under such a simple task. 🤣
@jbejjani2022
Joseph Bejjani
23 days
What do agents do when the only path to a goal requires harmful action? Do they choose harm or accept failure? We explore these questions with experiments in an agentic coding environment. Code: https://t.co/ClO8C34D0z Blog Post: https://t.co/o6N4oTeFKp 1/n
1
3
3
@jbejjani2022
Joseph Bejjani
23 days
Thank you to collaborators @itaprf, @HaichuanWang23, @polaris_7369 — it was great working with you on these experiments!
0
0
2
@jbejjani2022
Joseph Bejjani
23 days
More abstractly, I also wonder if misalignment can be better understood as imitation of personas or characters learned in training, and prompting as sampling from a distribution of characters. 10/n
1
0
2
@jbejjani2022
Joseph Bejjani
23 days
I’m excited about future work on profiling models' emergent value and preference hierarchies to better understand when and why agents choose actions that are misaligned with developer intent. 9/n
1
0
2
@jbejjani2022
Joseph Bejjani
23 days
We also perform a study using GPT-5-mini to monitor agents for hacking, finding that monitoring agents' actions and reasoning traces can be more effective and informative than monitoring agent actions alone. 8/n
1
0
3
@jbejjani2022
Joseph Bejjani
23 days
We observe notable differences in misaligned behavior across models, suggesting that differences in training across providers significantly affect the emergent propensities in deployed agents. 7/n
1
0
2
@jbejjani2022
Joseph Bejjani
23 days
We find verbalized eval awareness in Claude Opus 4's reasoning traces. We compare hacking rates in instances where it states it's in an eval vs. real deployment, highlighting the challenge of evaluating misaligned behaviors in a way that gives signal for realistic settings. 6/n
1
0
2
@jbejjani2022
Joseph Bejjani
23 days
We find that when faced with an impossible task, agents often hack the eval to pass the tests, even when hacking is explicitly prohibited in the prompt. Moreover, agents often acknowledge their hack, expressing they are taking a dishonest route to avoid failure. 5/n
1
0
1
@jbejjani2022
Joseph Bejjani
23 days
However, frontier agents do this: 4/n
1
0
1
@jbejjani2022
Joseph Bejjani
23 days
If AI was totally safe and aligned, we'd hope for something like the below. As 'honest' solutions become unfeasible, the hypothetical agent passes fewer tests: 3/n
1
0
1
@jbejjani2022
Joseph Bejjani
23 days
Building off Anthropic's work on Agentic Misalignment, we systematically investigate evaluation hacking under controlled scenarios. We evaluate 4 frontier models across 24 pressures spanning threat conditions, task difficulties, and prohibition instructions. 2/n
1
0
2
@jbejjani2022
Joseph Bejjani
23 days
What do agents do when the only path to a goal requires harmful action? Do they choose harm or accept failure? We explore these questions with experiments in an agentic coding environment. Code: https://t.co/ClO8C34D0z Blog Post: https://t.co/o6N4oTeFKp 1/n
Tweet card summary image
lesswrong.com
This work was done for CS 2881r: AI Safety and Alignment, taught by @boazbarak at Harvard. …
1
2
14
@yule_gan
Yulu Gan @NeurIPS'25
1 month
Check out this nice tutorial video ( https://t.co/n63MxIhaso) from @yacinelearning I also did a live chat with him this morning — check out the recording ( https://t.co/aQBOPCIBVb) where I answered some questions from Yacine and the audience about our work :)
@yacinelearning
Yacine Mahdid
1 month
alright we're live in about 3 min to figure out how the guys manage to make evolutionary strategies works for finetuning LLMs tune in!
0
4
41
@jbejjani2022
Joseph Bejjani
1 month
Thank you to @aaronwalsman and this team!! I’m lucky to get to learn from and work with these awesome people
@aaronwalsman
Aaron Walsman
1 month
Recent work with an amazing team at the @KempnerInst and friends! I want to highlight the contributions of our undergrads @jbejjani2022, @chase_vanamburg and @VincentWangdog, who are all applying for PhD programs this fall. They crushed this one, keep an eye out for them!
0
0
2
@Huangyu58589918
Chloe H. Su ✈️ Neurips 2025
1 month
Check this out! 🐞🐸🐡
@jbejjani2022
Joseph Bejjani
1 month
We’re super excited to introduce DIRT: The Distributed Intelligent Replicator Toolkit: https://t.co/RidXwgGriR DIRT is a GPU-accelerated multi-agent simulation platform enabling artificial life research in dynamic, open-ended environments at unprecedented scales. 1/n
0
1
4
@jbejjani2022
Joseph Bejjani
1 month
More broadly, they open promising directions to explore ecology as an instrument of ML. It's been amazing to collaborate on this work with @aaronwalsman, @chase_vanamburg, @VincentWangdog, @Huangyu58589918, @sarahmhpratt, @y_mazloumi, @NaeemKhoshnevis, @ShamKakade6, @xkianteb.
0
1
2