FAR.AI
@farairesearch
Followers
17K
Following
61
Media
485
Statuses
880
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Berkeley, California
Joined February 2023
This quarter, we red-teamed GPT-5, disclosed critical persuasion vulnerabilities to frontier labs (resulting in patches!), and co-organized AI Safety Connect at UNGA. Join us Dec 1-2 for the San Diego Alignment Workshop. Plus, we're expanding 2x & hiring!
3
16
116
@IreneSolaiman Beyond Release: Access Considerations for Generative AI Systems - https://t.co/Yv05x5CVNY ▶️ Watch the Technical Innovations for AI Policy video -
0
0
2
The Cybertruck attacker chose ChatGPT because of its simple text interface. @IreneSolaiman argues this is why "available vs accessible" matters more than "open vs closed." Llama 3 405B is open but compute-intensive; ChatGPT is closed but easy to use. available ≠ accessible
1
0
3
@Yoshua_Bengio @AsaCoopStick @dawnsongtweets @MariusHobbhahn @tomekkorbak @StephenLCasper @divyasiddarth @AnnaGausen @alxndrdavies @tegmark @niloofar_mire @ddkang @natashajaques Read the article with talk summaries: https://t.co/4BDhxsIGT3 ▶️ Watch the full playlist
youtube.com
The Alignment Workshop is a series of events convening top ML researchers from industry and academia, along with experts in the government and nonprofit sect...
1
2
7
All recordings from the San Diego Alignment Workshop are now available: @Yoshua_Bengio's keynote, plus talks from @AsaCoopStick @dawnsongtweets @MariusHobbhahn @tomekkorbak @StephenLCasper @divyasiddarth @AnnaGausen @alxndrdavies @tegmark @niloofar_mire @ddkang @natashajaques + more.
4
18
111
We're honored that the @fmf_org AI Safety Fund is supporting our project, "Quantifying the Safety-Adversary Gap in Large Language Models." We're one of 11 grantees selected from over 100 proposals in a $5M+ cohort tackling biosecurity, cybersecurity, and AI evaluation.
1
1
5
@sleepinyourhat @majatrebacz @NeelNanda5 @AnkaReuel ▶️ Watch Sam Bowman (Anthropic) - https://t.co/qwfjOOCHzc ▶️ Watch Maja Trębacz (OpenAI) - https://t.co/hqCIlJGeE8 ▶️ Watch Neel Nanda (Google DeepMind) - https://t.co/6g2P0LVG6X ▶️ Watch Anka Reuel (Stanford) -
1
0
5
4 recordings from the San Diego Alignment Workshop! @sleepinyourhat – Lessons from the 1st Misalignment Safety Case @majatrebacz – Scalable Oversight: Verifying Code at Scale @neelnanda5 – Pivot to Pragmatic Interpretability @AnkaReuel – How we know what AI can (and can't) do
5
6
44
Read the full technical report https://t.co/XqP50akzcc Research in conjunction with @JordanTensor, @realmeatyhuman, Dillon Bowen, @thjread, @satvikgolechha, @alexdzm, @OliverMakins, @Connor_Kissane, Kola Ayonrinde, @jacobmerizian, @saprmarks, @ChrisCundy, @JBloomAus
arxiv.org
Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an...
1
0
3
Want to get a sense of what it's like to be on the blue team? Check out our app at https://t.co/d68jVn5ToZ to see some of the info the blue team had, and see if you can tell a sandbagger from a benign model!
1
0
2
In this research, https://t.co/S2Hs8Uzuh2 took on the role of the 'red team' – designing model organisms that could underperform while fooling monitors. The blue team tried to determine which models were sandbaggers.
1
0
1
Can AI models 'sandbag' – deceptively underperform on evaluations? And can we detect them if they do? We investigated this in an auditing game with @AISecurityInst. In our app, see if *you* can detect sandbagging by taking on the role of the blue team! https://t.co/iI7SvXtuFq
NEW PAPER from the UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵 1/17
1
1
11
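To make the blue team's task concrete, here is a minimal, hypothetical Python sketch of one detection heuristic: compare a model's accuracy under the standard evaluation prompt with its accuracy under a stronger elicitation (for example, a capability-eliciting prompt or fine-tuned variant), and flag a large gap as suspicious. All names are illustrative; this is not the detection protocol from the UK AISI / FAR.AI auditing game.

```python
# Hypothetical sandbagging check (illustrative only, not the paper's method):
# a model whose accuracy jumps sharply under stronger elicitation may be
# deliberately underperforming under the standard evaluation.
from typing import Callable, Sequence

def accuracy(model: Callable[[str], str],
             questions: Sequence[str],
             answers: Sequence[str]) -> float:
    """Fraction of questions the model answers correctly."""
    correct = sum(model(q).strip() == a.strip()
                  for q, a in zip(questions, answers))
    return correct / len(questions)

def capability_gap(standard_model: Callable[[str], str],
                   elicited_model: Callable[[str], str],
                   questions: Sequence[str],
                   answers: Sequence[str]) -> float:
    """Elicited accuracy minus standard accuracy; larger = more suspicious."""
    return (accuracy(elicited_model, questions, answers)
            - accuracy(standard_model, questions, answers))

def flag_possible_sandbagger(standard_model, elicited_model,
                             questions, answers,
                             threshold: float = 0.15) -> bool:
    """Flag the model if the capability gap exceeds a chosen threshold."""
    return capability_gap(standard_model, elicited_model,
                          questions, answers) > threshold
```

A gap alone is only weak evidence; as the thread notes, the red team's model organisms were built precisely to fool checks like this.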
Mech Interp Workshop #NeurIPS2025 poster & spotlight presentation! 11:30am Sun, Dec 7 @ Upper Level Room 30A-E. Path Channels & Plan Extension Kernels: A Mechanistic Description of Planning in a Sokoban RNN. Work by @taufeeque_m, Aaron Tucker, @ARGleave, @AdriGarriga
3
2
9
Come check out our #NeurIPS2025 Lock-LLM Workshop poster: Sat, Dec 6 @ 12pm PST in Upper Level Room 1AB. The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models. Work by Ann-Kathrin Dombrowski, Dillon Bowen, @ARGleave, @ChrisCundy
1/ Most safety tests only check if a model will follow harmful instructions. But what happens if someone removes its safeguards so it agrees? We built the Safety Gap Toolkit to measure the gap between what a model will agree to do and what it can do. 🧵
2
4
15
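As a rough illustration of the "safety gap" idea from the thread above (a hedged sketch, not the Safety Gap Toolkit's actual API), one can score the same dangerous-capability benchmark against a model with safeguards intact and against the same model with safeguards removed; the gap is the difference between the two scores. The model interface and `score_dangerous_capability` below are hypothetical.

```python
# Illustrative "safety gap" computation: what a model *can* do once its
# safeguards are removed, minus what it *will* do with them in place.
# This is NOT the Safety Gap Toolkit's real API; the model interface and
# scoring function are assumed for the sketch.

def score_dangerous_capability(model, benchmark) -> float:
    """Return the fraction of harmful benchmark tasks the model completes
    successfully (refusals count as failures)."""
    results = [model.attempt(task) for task in benchmark]
    return sum(r.success for r in results) / len(results)

def safety_gap(safeguarded_model, unsafeguarded_model, benchmark) -> float:
    """Capability revealed by removing safeguards, minus capability with them."""
    with_guards = score_dangerous_capability(safeguarded_model, benchmark)
    without_guards = score_dangerous_capability(unsafeguarded_model, benchmark)
    return without_guards - with_guards
```

A large positive gap means the deployed refusals, not the model's underlying abilities, are doing most of the safety work.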
@tarbellcenter @ARGleave @AlexBores ▶️ Watch Adam Gleave on Misalignment - https://t.co/PMFJOWU64F ▶️ Watch Alex Bores on How States Should Regulate AI - https://t.co/tYE2oKQtdW Read the full event blog -
1
0
3
Two new recordings from our AGI Journalism Workshop with @tarbellcenter: @ARGleave: AI systems can fake alignment during training. Scalable oversight & lie detectors can reduce this. @AlexBores: NY's RAISE Act passed with bipartisan support despite lobbying.
2
0
7
Join us at our #NeurIPS2025 main session presentation and poster! Wed, Dec 3 at 11am PST for Poster #509 in Exhibit Halls C, D, E: Preference Learning with Lie Detectors can Induce Honesty or Evasion. Work by @ChrisCundy and @ARGleave
Can lie detectors make AI more honest? Or will models become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? It depends on how you train it!
2
4
16
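For a concrete picture of what "adding a deception detector into the training loop" can look like (a minimal sketch under assumed interfaces, not the paper's implementation), one option is a linear probe over the model's hidden activations whose lie score is subtracted from the reward used in preference training:

```python
# Hypothetical sketch: a probe-based lie detector folded into a training
# reward. The probe is a linear classifier on hidden activations; responses
# the probe scores as deceptive are penalized. The activation-extraction and
# base-reward interfaces are assumptions for illustration, not the setup from
# "Preference Learning with Lie Detectors can Induce Honesty or Evasion".
import torch
import torch.nn as nn

class LieProbe(nn.Module):
    """Linear probe mapping a pooled hidden-state vector to P(deceptive)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(activations))

def probed_reward(base_reward: float,
                  activations: torch.Tensor,
                  probe: LieProbe,
                  penalty_weight: float = 1.0) -> float:
    """Training reward: task reward minus a penalty scaled by the probe's
    estimated probability that the response is deceptive."""
    p_lie = probe(activations).item()
    return base_reward - penalty_weight * p_lie
```

The tweet's point is that the outcome hinges on the details of this loop: depending on how the penalty is applied, training can either encourage genuine honesty or teach the model to evade the probe.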
@NeelNanda5 @dawnsongtweets @alxndrdavies @Dr_Atoosa @StephenLCasper @niloofar_mire @ChrisCundy Stay informed on https://t.co/S2Hs8UyWru future events: https://t.co/MJCUH0Fwjw ▶️ Recordings will be released shortly. In the meantime, past videos are available at:
youtube.com
The Alignment Workshop is a series of events convening top ML researchers from industry and academia, along with experts in the government and nonprofit sect...
0
1
8
San Diego Alignment Workshop Day 2 covered agentic AI safety, red teaming, and privacy and security challenges. Thanks to @neelnanda5 @dawnsongtweets @alxndrdavies @dr_atoosa @StephenLCasper @niloofar_mire @ChrisCundy and all speakers and attendees!
5
5
100