FAR.AI

@farairesearch

Followers: 17K · Following: 61 · Media: 485 · Statuses: 880

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

Berkeley, California
Joined February 2023
@farairesearch
FAR.AI
2 months
This quarter, we red-teamed GPT-5, disclosed critical persuasion vulnerabilities to frontier labs (resulting in patches!), and co-organized AI Safety Connect at UNGA. Join us Dec 1-2 for San Diego Alignment Workshop. Plus, we're expanding 2x & hiring! 👇
3
16
116
@farairesearch
FAR.AI
22 hours
@IreneSolaiman
📄 Beyond Release: Access Considerations for Generative AI Systems - https://t.co/Yv05x5CVNY
▶️ Watch the Technical Innovations for AI Policy video -
0
0
2
@farairesearch
FAR.AI
22 hours
The Cybertruck attacker chose ChatGPT because of its simple text interface. @IreneSolaiman argues this is why "available vs accessible" matters more than "open vs closed." Llama 3 405B is open but compute-intensive; ChatGPT is closed but easy to use. available≠accessible 👇
1
0
3
@farairesearch
FAR.AI
5 days
All recordings from San Diego Alignment Workshop are now available. @Yoshua_Bengio keynote with talks from @AsaCoopStick @dawnsongtweets @MariusHobbhahn @tomekkorbak @StephenLCasper @divyasiddarth @AnnaGausen @alxndrdavies @tegmark @niloofar_mire @ddkang @natashajaques + more. 👇
4
18
111
@farairesearch
FAR.AI
8 days
We're honored that the @fmf_org AI Safety Fund is supporting our project, "Quantifying the Safety-Adversary Gap in Large Language Models." We're one of 11 grantees selected from over 100 proposals in a $5M+ cohort tackling biosecurity, cybersecurity, and AI evaluation. 👇
1
1
5
@farairesearch
FAR.AI
12 days
@sleepinyourhat @majatrebacz @NeelNanda5 @AnkaReuel
▶️ Watch Sam Bowman (Anthropic) - https://t.co/qwfjOOCHzc
▶️ Watch Maja Trębacz (OpenAI) - https://t.co/hqCIlJGeE8
▶️ Watch Neel Nanda (Google DeepMind) - https://t.co/6g2P0LVG6X
▶️ Watch Anka Reuel (Stanford) -
1
0
5
@farairesearch
FAR.AI
12 days
4 recordings from San Diego Alignment Workshop!
@sleepinyourhat – Lessons from the 1st Misalignment Safety Case
@majatrebacz – Scalable Oversight: Verifying Code at Scale
@neelnanda5 – Pivot to Pragmatic Interpretability
@AnkaReuel – How we know what AI can (and can't) do 👇
5
6
44
@farairesearch
FAR.AI
14 days
Want to get a sense of what it's like to be on the blue team? You can check out our app at https://t.co/d68jVn5ToZ to see some of the info the blue team had, and see if you can tell a sandbagger from a benign model!
1
0
2
@farairesearch
FAR.AI
14 days
In this research, https://t.co/S2Hs8Uzuh2 took on the role of the 'red team', designing model organisms that could underperform while fooling monitors. The blue team tried to determine which models were sandbaggers.
1
0
1
@farairesearch
FAR.AI
14 days
Can AI models 'sandbag' (deceptively underperform on evaluations)? And can we detect them if they do? We investigated this in an auditing game with @AISecurityInst. In our app 👇, see if *you* can detect sandbagging by taking on the role of the blue team! https://t.co/iI7SvXtuFq
@JordanTensor
Jordan Taylor
14 days
NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵 1/17
1
1
11
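To make the blue team's task above concrete, here is a minimal sketch assuming a simple accuracy-gap heuristic, not the methodology from the UK AISI paper: flag a model whose benchmark score falls far below an independent capability estimate. All names, numbers, and the threshold are hypothetical.

```python
# Hypothetical sandbagging screen: NOT the auditing-game methodology,
# just an illustration of the "score far below expected capability" idea.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model_name: str
    accuracy: float          # observed accuracy on the benchmark
    expected_accuracy: float # capability estimate, e.g. from a trusted reference run

def flag_possible_sandbagger(result: EvalResult, gap_threshold: float = 0.15) -> bool:
    """Flag a model whose observed score trails its capability estimate by a large margin."""
    gap = result.expected_accuracy - result.accuracy
    return gap > gap_threshold

# Example: a model scoring 0.52 despite an estimated capability of 0.80 gets flagged.
suspect = EvalResult("model_A", accuracy=0.52, expected_accuracy=0.80)
benign = EvalResult("model_B", accuracy=0.78, expected_accuracy=0.80)
print(flag_possible_sandbagger(suspect))  # True
print(flag_possible_sandbagger(benign))   # False
```

In the actual game the blue team had richer information (some of which the app above shows), and the red team still won.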
@farairesearch
FAR.AI
16 days
Mech Interp Workshop #NeurIPS2025 poster & spotlight presentation!
📍 11:30am Sun, Dec 7 @ Upper Level Room 30A-E
Path Channels & Plan Extension Kernels: A Mechanistic Description of Planning in a Sokoban RNN. Work by @taufeeque_m, Aaron Tucker, @ARGleave, @AdriGarriga 👇
3
2
9
@farairesearch
FAR.AI
17 days
Come check out our #NeurIPS2025 Lock-LLM Workshop poster:
📍 Sat, Dec 6 @ 12pm PST in Upper Level Room 1AB
The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models. Work by Ann-Kathrin Dombrowski, Dillon Bowen, @ARGleave, @ChrisCundy
@farairesearch
FAR.AI
4 months
1/ Most safety tests only check if a model will follow harmful instructions. But what happens if someone removes its safeguards so it agrees? We built the Safety Gap Toolkit to measure the gap between what a model will agree to do and what it can do. 🧵
2
4
15
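As a rough illustration of the safety-gap idea in the quoted thread, the sketch below computes the difference between what a safeguard-stripped model can do and what the safeguarded model will agree to do. The function names and numbers are hypothetical stand-ins, not the toolkit's actual API.

```python
# Hypothetical illustration of the safety-gap framing; the real toolkit's API may differ.

def compliance_rate(refusals: int, total: int) -> float:
    """Fraction of harmful requests the safeguarded model actually carries out."""
    return (total - refusals) / total

def capability_rate(successes: int, total: int) -> float:
    """Fraction of harmful tasks the safeguard-stripped model can complete."""
    return successes / total

def safety_gap(safeguarded_refusals: int, stripped_successes: int, total: int) -> float:
    """Gap between what the model *can* do (safeguards removed) and what it *will* do."""
    return capability_rate(stripped_successes, total) - compliance_rate(safeguarded_refusals, total)

# Example: the safeguarded model refuses 95/100 harmful requests, but once safeguards
# are removed it completes 70/100 of the same tasks -> gap of 0.65.
print(safety_gap(safeguarded_refusals=95, stripped_successes=70, total=100))
```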
@farairesearch
FAR.AI
19 days
@tarbellcenter @ARGleave @AlexBores
▶️ Watch Adam Gleave on Misalignment - https://t.co/PMFJOWU64F
▶️ Watch Alex Bores on How States Should Regulate AI - https://t.co/tYE2oKQtdW
📄 Read full event blog -
1
0
3
@farairesearch
FAR.AI
19 days
Two new recordings from our AGI Journalism Workshop with @tarbellcenter:
@ARGleave: AI systems can fake alignment during training. Scalable oversight & lie detectors can reduce this.
@AlexBores: NY's RAISE Act passed with bipartisan support despite lobbying. 👇
2
0
7
@farairesearch
FAR.AI
20 days
Join us at our #NeurIPS2025 main session presentation and poster!
📍 Wed, Dec 3 at 11am PST for Poster #509 in Exhibit Halls C, D, E
Preference Learning with Lie Detectors can Induce Honesty or Evasion. Work by @ChrisCundy and @ARGleave
@farairesearch
FAR.AI
7 months
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
2
4
16
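A minimal sketch of the setup described in the quoted tweet, with all names hypothetical: a linear probe scores model activations for deception, and that score is folded into the reward signal during preference training. Whether this produces honesty or probe evasion depends on training details such as the penalty weight.

```python
import numpy as np

# Hypothetical illustration of "lie detector in the training loop":
# a linear probe over hidden activations produces a deception score,
# which is subtracted from the base reward during preference training.

def probe_deception_score(activations: np.ndarray, probe_weights: np.ndarray) -> float:
    """Deception probability from a (hypothetical) linear probe over activations."""
    logit = float(activations @ probe_weights)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid

def shaped_reward(base_reward: float, activations: np.ndarray,
                  probe_weights: np.ndarray, penalty: float = 1.0) -> float:
    """Penalize responses the probe flags as deceptive.
    Too weak a penalty can leave lying rewarded; too blunt a probe can
    teach the model to evade the probe rather than to be honest."""
    return base_reward - penalty * probe_deception_score(activations, probe_weights)

# Toy example with random activations and probe weights.
rng = np.random.default_rng(0)
acts, w = rng.normal(size=64), rng.normal(size=64) * 0.1
print(shaped_reward(base_reward=1.0, activations=acts, probe_weights=w))
```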
@farairesearch
FAR.AI
20 days
San Diego Alignment Workshop Day 2 on agentic AI safety, red teaming, privacy and security challenges. Thanks to @neelnanda5 @dawnsongtweets @alxndrdavies @dr_atoosa @StephenLCasper @niloofar_mire @ChrisCundy and all speakers and attendees! 👇
5
5
100