Joseph Bejjani
@jbejjani2022
Followers
56
Following
222
Media
5
Statuses
40
CS @harvard | trying to understand & align AI @KempnerInst
Cambridge, MA
Joined December 2021
We’re super excited to introduce DIRT: The Distributed Intelligent Replicator Toolkit: https://t.co/RidXwgGriR DIRT is a GPU-accelerated multi-agent simulation platform enabling artificial life research in dynamic, open-ended environments at unprecedented scales. 1/n
2
6
16
These examples suggest that the model has generalized from anti prompt injection training to avoid mentioning anything that seems sketchy in tool call results. This isn’t ideal – sometimes, we want the model to call out sketchy things if it sees them! (5/7)
4
9
133
Looking forward to sharing our work at #NeurIPS2025 next week! Session 6 on Fri 12/5 at 4:30-7:30pm, Poster 2001 ("a space odyssey") Details by the brilliant lead author @AnnHuang42 below:
📍Excited to share that our paper was selected as a Spotlight at #NeurIPS2025! https://t.co/2W5Bjgc51w It started from a question I kept running into: When do RNNs trained on the same task converge/diverge in their solutions? 🧵⬇️
0
1
11
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives🫘 Can we train models towards a ‘self-incriminating honesty’, such that they would honestly confess any hidden misaligned objectives, even under strong pressure to conceal them? In our paper, we developed
2
12
52
One of the best classes I’ve taken! Check out all the resources on the course page
boazbk.github.io
Fall 2025 - Harvard
0
2
14
Thread about the mini-project we did for @boazbaraktcs AI Safety class. The results are interesting, and I had a laugh seeing the unhinged things these frontier models can come up with under such a simple task. 🤣
What do agents do when the only path to a goal requires harmful action? Do they choose harm or accept failure? We explore these questions with experiments in an agentic coding environment. Code: https://t.co/ClO8C34D0z Blog Post: https://t.co/o6N4oTeFKp 1/n
1
3
3
Thank you to collaborators @itaprf, @HaichuanWang23, @polaris_7369 — it was great working with you on these experiments!
0
0
2
More abstractly, I also wonder if misalignment can be better understood as imitation of personas or characters learned in training, and prompting as sampling from a distribution of characters. 10/n
1
0
2
I’m excited about future work on profiling models' emergent value and preference hierarchies to better understand when and why agents choose actions that are misaligned with developer intent. 9/n
1
0
2
We also perform a study using GPT-5-mini to monitor agents for hacking, finding that monitoring agents' actions and reasoning traces can be more effective and informative than monitoring agent actions alone. 8/n
1
0
3
We observe notable differences in misaligned behavior across models, suggesting that differences in training across providers significantly affect the emergent propensities in deployed agents. 7/n
1
0
2
We find verbalized eval awareness in Claude Opus 4's reasoning traces. We compare hacking rates in instances where it states it's in an eval vs. real deployment, highlighting the challenge of evaluating misaligned behaviors in a way that gives signal for realistic settings. 6/n
1
0
2
We find that when faced with an impossible task, agents often hack the eval to pass the tests, even when hacking is explicitly prohibited in the prompt. Moreover, agents often acknowledge their hack, expressing they are taking a dishonest route to avoid failure. 5/n
1
0
1
If AI was totally safe and aligned, we'd hope for something like the below. As 'honest' solutions become unfeasible, the hypothetical agent passes fewer tests: 3/n
1
0
1
Building off Anthropic's work on Agentic Misalignment, we systematically investigate evaluation hacking under controlled scenarios. We evaluate 4 frontier models across 24 pressures spanning threat conditions, task difficulties, and prohibition instructions. 2/n
1
0
2
What do agents do when the only path to a goal requires harmful action? Do they choose harm or accept failure? We explore these questions with experiments in an agentic coding environment. Code: https://t.co/ClO8C34D0z Blog Post: https://t.co/o6N4oTeFKp 1/n
lesswrong.com
This work was done for CS 2881r: AI Safety and Alignment, taught by @boazbarak at Harvard. …
1
2
14
Check out this nice tutorial video ( https://t.co/n63MxIhaso) from @yacinelearning I also did a live chat with him this morning — check out the recording ( https://t.co/aQBOPCIBVb) where I answered some questions from Yacine and the audience about our work :)
alright we're live in about 3 min to figure out how the guys manage to make evolutionary strategies works for finetuning LLMs tune in!
0
4
41
Thank you to @aaronwalsman and this team!! I’m lucky to get to learn from and work with these awesome people
Recent work with an amazing team at the @KempnerInst and friends! I want to highlight the contributions of our undergrads @jbejjani2022, @chase_vanamburg and @VincentWangdog, who are all applying for PhD programs this fall. They crushed this one, keep an eye out for them!
0
0
2
Check this out! 🐞🐸🐡
We’re super excited to introduce DIRT: The Distributed Intelligent Replicator Toolkit: https://t.co/RidXwgGriR DIRT is a GPU-accelerated multi-agent simulation platform enabling artificial life research in dynamic, open-ended environments at unprecedented scales. 1/n
0
1
4
More broadly, they open promising directions to explore ecology as an instrument of ML. It's been amazing to collaborate on this work with @aaronwalsman, @chase_vanamburg, @VincentWangdog, @Huangyu58589918, @sarahmhpratt, @y_mazloumi, @NaeemKhoshnevis, @ShamKakade6, @xkianteb.
0
1
2