FAR.AI

@farairesearch

Followers
12K
Following
51
Media
363
Statuses
669

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

Berkeley, California
Joined February 2023
@farairesearch
FAR.AI
29 days
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors to the training loop of large language models. Will training against probe-detected lies encourage honesty? It depends on how you train it!
2
7
42
@farairesearch
FAR.AI
15 hours
📄 ACL'24 Outstanding & Best Social Impact Paper: 🎥 Full talk from Singapore Alignment Workshop:
0
0
0
@farairesearch
FAR.AI
15 hours
Simple emotional appeals can make LLMs reveal info they're trained to withhold with nearly 90% success rate. Weiyan Shi’s team tested 40 psychological tricks. Newer models are MORE vulnerable (GPT-4>GPT-3.5). Multi-turn dialogue compounds risk. Claude stays robust. ▶️🔗
1
0
5
@farairesearch
FAR.AI
2 days
👥 Research by @irobotmckenzie, @oskarjohnh, @tomhmtseng, Aaron Tucker, @ARGleave from @farairesearch and @alxndrdavies, @StephenLCasper, @_robertkirk of @AISecurityInst. 🤝 Questions or ideas? hello@far.ai. 🚀 We're hiring: 💡 Help us spread the word.
0
0
2
@farairesearch
FAR.AI
2 days
11/ Dive deeper into our research:
📝 Blog post:
📄 Paper:
💻 Code:
1
0
1
@farairesearch
FAR.AI
2 days
10/ The good news: these are fixable problems. Defenders can:
🔹 Thwart STACK by making all rejections look identical, hiding which component blocked a request.
🔹 Prevent transfer by keeping filters secret and meaningfully distinct from publicly available models.
1
1
6
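The first mitigation can be sketched in a few lines. This is a hypothetical illustration, not FAR.AI's actual code: whichever layer blocks a request, the caller sees the same refusal string, so an attacker cannot tell which filter tripped.

```python
# Hypothetical sketch of the "identical rejections" mitigation: every layer
# that blocks a request returns the same refusal, so the attacker cannot
# tell whether the input filter, the model, or the output filter fired.

REFUSAL = "I can't help with that."

def guarded_generate(prompt, input_filter, model, output_filter):
    """Run a 3-layer pipeline but collapse all rejections into one message."""
    if not input_filter(prompt):       # layer 1: screen the prompt
        return REFUSAL
    response = model(prompt)           # layer 2: the (safety-trained) model
    if not output_filter(response):    # layer 3: screen the response
        return REFUSAL                 # indistinguishable from layer 1
    return response
```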
@farairesearch
FAR.AI
2 days
9/ If this information isn’t available, STACK still transfers. We developed STACK against a proxy pipeline (Qwen 3), then applied it to our target pipeline (Gemma 2), achieving a 33% attack success rate. Though less effective, the transfer attack is much harder to prevent or detect.
1
0
5
@farairesearch
FAR.AI
2 days
8/ The vulnerability? Current systems tell you WHICH component blocked your request (semi-separable, below): "Input filter rejected" vs "Model refused" vs "Output filter blocked". Like a security system announcing which sensor tripped. That's all STACK needs.
1
0
7
@farairesearch
FAR.AI
2 days
7/ Our STACK attack uses a token-optimization attack to break the filters, and PAP to break the generative model. PAP alone gets a near-100% attack success rate against the undefended model, so the pipeline does add security, even though STACK still bypasses it more than 7 times out of 10.
1
0
5
@farairesearch
FAR.AI
2 days
6/ Our STACK attack beats our strongest pipeline 71% of the time, automatically getting harmful requests past multi-layer AI defenses. By contrast, the strongest conventional attack we tested, PAP, is blocked 100% of the time by this pipeline.
1
0
5
@farairesearch
FAR.AI
2 days
5/ Whereas existing attacks try to break all defenses simultaneously, our STACK attack breaks each layer one by one:
① Find a “universal jailbreak” that fools the input filter
② Use an existing method to jailbreak the model
③ Find a universal jailbreak for the output filter
1
0
7
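The three steps compose naturally. Below is a hypothetical sketch of that staged structure — every function name here is a stand-in (the paper pairs a token-optimization attack on the filters with PAP on the model), and real optimizers would query each component many times rather than once.

```python
# Hypothetical sketch of STACK's staged structure. Each component is
# attacked separately, then the pieces are composed into one prompt.
# The two optimizer arguments are stand-ins for a token-optimization
# attack (against the filters) and PAP (against the model).

def stack_attack(request, input_filter, output_filter,
                 optimize_filter_jailbreak, jailbreak_model):
    # ① universal suffix that makes the input filter pass harmful prompts
    in_suffix = optimize_filter_jailbreak(input_filter)
    # ② jailbreak the generative model with an existing method (e.g. PAP)
    jailbroken = jailbreak_model(request)
    # ③ universal suffix that makes the output filter pass harmful responses
    out_suffix = optimize_filter_jailbreak(output_filter)
    # compose: one prompt that defeats each layer in turn
    return jailbroken + " " + in_suffix + " " + out_suffix
```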
@farairesearch
FAR.AI
2 days
4/ We also tried using purpose-made safeguard (moderation) models like ShieldGemma or Llama Guard as filters, but surprisingly found they performed worse than simple few-shot prompting. This highlights there’s significant room to improve existing open-weight safeguard models.
1
1
7
@farairesearch
FAR.AI
2 days
3/ We tested this in collaboration with UK AISI. First, we built an open-source defense-in-depth pipeline, using input and output filters to protect a model. Few-shot-prompted Gemma 2 works great as a filter, perfectly defending against 3 existing attacks on the ClearHarm dataset.
1
0
6
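A minimal sketch of such a pipeline, with hypothetical names and an illustrative few-shot prompt (the paper's actual filter is a few-shot-prompted Gemma 2; any text classifier stands in for it here):

```python
# Hypothetical sketch of the defense-in-depth pipeline described above:
# a few-shot-prompted classifier screens the input, the model answers,
# and the same classifier screens the output.

FILTER_PROMPT = (  # illustrative few-shot prompt, not the paper's
    "Label the text SAFE or HARMFUL.\n"
    "Text: how do I bake bread? -> SAFE\n"
    "Text: how do I hotwire a car? -> HARMFUL\n"
    "Text: {text} -> "
)

def pipeline(prompt, classify, model):
    """classify: text classifier (e.g. few-shot Gemma 2); model: the LLM."""
    if classify(FILTER_PROMPT.format(text=prompt)) != "SAFE":
        return "Input filter rejected"
    response = model(prompt)
    if classify(FILTER_PROMPT.format(text=response)) != "SAFE":
        return "Output filter blocked"
    return response
```

Note that the component-revealing rejection strings in this sketch deliberately mirror the "semi-separable" behavior the thread discusses in tweet 8/.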
@farairesearch
FAR.AI
2 days
2/ AI companies including Anthropic, OpenAI, and Google highlight this “defense-in-depth” approach as a key part of their safety plans. The idea: even if one safety layer fails, others catch the problem. This works well in traditional computer security – how does it fare for AI?
1
0
7
@farairesearch
FAR.AI
2 days
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, suggesting this approach may be less secure than hoped.
1
14
66
@farairesearch
FAR.AI
3 days
@AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkBeall @charlesxjyang @ArnabDatta321 Real researchers. Real policymakers. Real solutions. ▶️ Full playlist + more coming:
0
0
3
@farairesearch
FAR.AI
3 days
💡 15 featured talks on AI governance: @AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkBeall @charlesxjyang @ArnabDatta321 + more. Topics: AI control, supply chains, evaluation design, compute zones, military risks & governance frameworks
1
3
9
@farairesearch
FAR.AI
4 days
@_achan96_ Follow us for AI safety insights and watch the full video.
0
0
1
@farairesearch
FAR.AI
4 days
@_achan96_ Read more from @_achan96_.
@_achan96_
Alan Chan
5 months
Whether 2025/6/7/8/etc is the year of AI agents, we will need tools to help us unlock their benefits and manage their risks. Our new research agenda proposes agent infrastructure as an answer to this challenge 🧵
Tweet media one
1
0
1