FAR.AI

@farairesearch

Followers
12K
Following
51
Media
363
Statuses
669

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

Berkeley, California
Joined February 2023
@farairesearch
FAR.AI
29 days
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors to the training loop of large language models. Will training against probe-detected lies encourage honesty? It depends on how you train it!
2
7
42
@farairesearch
FAR.AI
15 hours
📄 ACL'24 Outstanding & Best Social Impact Paper: 🎥 Full talk from Singapore Alignment Workshop:
0
0
0
@farairesearch
FAR.AI
15 hours
Simple emotional appeals can make LLMs reveal info they're trained to withhold with nearly 90% success rate. Weiyan Shi’s team tested 40 psychological tricks. Newer models are MORE vulnerable (GPT-4>GPT-3.5). Multi-turn dialogue compounds risk. Claude stays robust. ▶️🔗
1
0
5
@farairesearch
FAR.AI
2 days
👥 Research by @irobotmckenzie, @oskarjohnh, @tomhmtseng, Aaron Tucker, @ARGleave from @farairesearch and @alxndrdavies, @StephenLCasper, @_robertkirk of @AISecurityInst. 🤝 Questions or ideas? hello@far.ai. 🚀 We're hiring: 💡 Help us spread the word.
0
0
2
@farairesearch
FAR.AI
2 days
11/ Dive deeper into our research:
📝 Blog post:
📄 Paper:
💻 Code:
1
0
1
@farairesearch
FAR.AI
2 days
10/ The good news: these are fixable problems. Defenders can:
🔹 Thwart STACK by making all rejections look identical, hiding which component blocked a request.
🔹 Prevent transfer by keeping filters secret and meaningfully distinct from publicly available models.
1
1
6
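The first mitigation can be sketched in a few lines. This is a hypothetical illustration, not FAR.AI's actual code: whichever layer blocks a request, the caller sees the same refusal string, so an attacker cannot tell which filter tripped.

```python
# Hypothetical sketch of the "identical rejections" mitigation: every layer
# that blocks a request returns the same refusal, so the attacker cannot
# tell whether the input filter, the model, or the output filter fired.

REFUSAL = "I can't help with that."

def guarded_generate(prompt, input_filter, model, output_filter):
    """Run a 3-layer pipeline but collapse all rejections into one message."""
    if not input_filter(prompt):       # layer 1: screen the prompt
        return REFUSAL
    response = model(prompt)           # layer 2: the (safety-trained) model
    if not output_filter(response):    # layer 3: screen the response
        return REFUSAL                 # indistinguishable from layer 1
    return response
```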
@farairesearch
FAR.AI
2 days
9/ If this information isn’t available, STACK still transfers. We developed STACK against a proxy pipeline (Qwen 3), then applied it to our target pipeline (Gemma 2), achieving a 33% attack success rate. Though less effective, the transfer attack is much harder to prevent or detect.
1
0
5
@farairesearch
FAR.AI
2 days
8/ The vulnerability? Current systems tell you WHICH component blocked your request (semi-separable, below): "Input filter rejected" vs "Model refused" vs "Output filter blocked". Like a security system announcing which sensor tripped. That's all STACK needs.
1
0
7
@farairesearch
FAR.AI
2 days
7/ Our STACK attack uses a token-optimization attack to break the filters, and PAP to break the generative model. PAP alone gets a near-100% attack success rate against the undefended model, so the pipeline does add security, even though STACK still bypasses it more than 7 times out of 10.
1
0
5
@farairesearch
FAR.AI
2 days
6/ Our STACK attack beats our strongest pipeline 71% of the time, automatically getting harmful requests past multi-layer AI defenses. By contrast, the strongest conventional attack we tested, PAP, is blocked 100% of the time by this pipeline.
1
0
5
@farairesearch
FAR.AI
2 days
5/ Whereas existing attacks try to break all defenses simultaneously, our STACK attack breaks each layer one by one:
① Find a “universal jailbreak” that fools the input filter
② Use an existing method to jailbreak the model
③ Find a universal jailbreak for the output filter
1
0
7
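The three steps compose naturally. Below is a hypothetical sketch of that staged structure — every function name here is a stand-in (the paper pairs a token-optimization attack on the filters with PAP on the model), and real optimizers would query each component many times rather than once.

```python
# Hypothetical sketch of STACK's staged structure. Each component is
# attacked separately, then the pieces are composed into one prompt.
# The two optimizer arguments are stand-ins for a token-optimization
# attack (against the filters) and PAP (against the model).

def stack_attack(request, input_filter, output_filter,
                 optimize_filter_jailbreak, jailbreak_model):
    # ① universal suffix that makes the input filter pass harmful prompts
    in_suffix = optimize_filter_jailbreak(input_filter)
    # ② jailbreak the generative model with an existing method (e.g. PAP)
    jailbroken = jailbreak_model(request)
    # ③ universal suffix that makes the output filter pass harmful responses
    out_suffix = optimize_filter_jailbreak(output_filter)
    # compose: one prompt that defeats each layer in turn
    return jailbroken + " " + in_suffix + " " + out_suffix
```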
@farairesearch
FAR.AI
2 days
4/ We also tried using purpose-made safeguard (moderation) models like ShieldGemma or Llama Guard as filters, but surprisingly found they performed worse than simple few-shot prompting. This highlights there’s significant room to improve existing open-weight safeguard models.
1
1
7
@farairesearch
FAR.AI
2 days
3/ We tested this in collaboration with UK AISI. First, we built an open-source defense-in-depth pipeline, using input and output filters to protect a model. Few-shot-prompted Gemma 2 works great as a filter, perfectly defending against 3 existing attacks on the ClearHarm dataset.
1
0
6
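A minimal sketch of such a pipeline, with hypothetical names and an illustrative few-shot prompt (the paper's actual filter is a few-shot-prompted Gemma 2; any text classifier stands in for it here):

```python
# Hypothetical sketch of the defense-in-depth pipeline described above:
# a few-shot-prompted classifier screens the input, the model answers,
# and the same classifier screens the output.

FILTER_PROMPT = (  # illustrative few-shot prompt, not the paper's
    "Label the text SAFE or HARMFUL.\n"
    "Text: how do I bake bread? -> SAFE\n"
    "Text: how do I hotwire a car? -> HARMFUL\n"
    "Text: {text} -> "
)

def pipeline(prompt, classify, model):
    """classify: text classifier (e.g. few-shot Gemma 2); model: the LLM."""
    if classify(FILTER_PROMPT.format(text=prompt)) != "SAFE":
        return "Input filter rejected"
    response = model(prompt)
    if classify(FILTER_PROMPT.format(text=response)) != "SAFE":
        return "Output filter blocked"
    return response
```

Note that the component-revealing rejection strings in this sketch deliberately mirror the "semi-separable" behavior the thread discusses in tweet 8/.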
@farairesearch
FAR.AI
2 days
2/ AI companies including Anthropic, OpenAI, and Google highlight this “defense-in-depth” approach as a key part of their safety plans. The idea: even if one safety layer fails, others catch the problem. This works well in traditional computer security – how does it fare for AI?
1
0
7
@farairesearch
FAR.AI
2 days
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, suggesting this approach may be less secure than hoped.
1
14
66
@farairesearch
FAR.AI
3 days
@AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkBeall @charlesxjyang @ArnabDatta321 Real researchers. Real policymakers. Real solutions. ▶️ Full playlist + more coming:
0
0
3
@farairesearch
FAR.AI
3 days
💡 15 featured talks on AI governance: @AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkBeall @charlesxjyang @ArnabDatta321 + more. Topics: AI control, supply chains, evaluation design, compute zones, military risks & governance frameworks
1
3
9
@farairesearch
FAR.AI
4 days
@_achan96_ Follow us for AI safety insights and watch the full video.
0
0
1
@farairesearch
FAR.AI
4 days
@_achan96_ Read more from @_achan96_.
@_achan96_
Alan Chan
5 months
Whether 2025/6/7/8/etc is the year of AI agents, we will need tools to help us unlock their benefits and manage their risks. Our new research agenda proposes agent infrastructure as an answer to this challenge 🧵
Tweet media one
1
0
1