Gray Swan AI
@GraySwanAI
Followers: 2K · Following: 554 · Media: 60 · Statuses: 457
Building safety and security in the AI era. Join us: https://t.co/MedOJ4nLiQ
Joined April 2024
Two challenges. $140K in prizes. One Arena built for hackers pushing limits. We just announced the Machine-in-the-Middle Challenge sponsored by @hackthebox_eu starting Nov 1st, with a second AI-focused Indirect Prompt Injection Challenge dropping the next week. This is where
2 · 4 · 26
This is why we built Gray Swan. Shade rigorously tests your AI systems pre-deployment. Our platform actively protects deployed agents from adversarial misuse. Blog: https://t.co/v6TReHfYuf Full research:
0 · 0 · 1
Three urgent implications:
- Pre-deployment testing must mirror real-world conditions
- Refusal robustness needs rigorous evaluation beyond standard tests
- Deployed AI agents need continuous protection against misuse
1 · 0 · 0
What makes AI agents dangerous isn't just capability—it's economics. Continuous operation for 8.5+ hours at machine speed and cloud costs. Traditional red teams? Vastly more expensive and slower. Attack surface just got democratized.
1 · 0 · 0
Standard "safety" claims don't hold up in practice. No sophisticated jailbreaking needed to bypass guardrails. Just competent cybersecurity instructions. If your threat model assumes adversaries need expert knowledge, update it.
1 · 0 · 0
Purpose-built AI agents can outperform most cybersecurity professionals at a fraction of the cost. ARTEMIS found critical vulns: default creds on server management, DNS cache poisoning, root-level SMB backdoors. Same methods as top human testers, but faster and cheaper.
1 · 0 · 0
We just ran the first live enterprise comparison: AI agents vs. professional pentesters on Stanford's real network. ARTEMIS placed 2nd overall, beating 90% of human pros. Only agent still finding vulns after 8.5 hours. Research with @Stanford & @CarnegieMellon. 🧵
1 · 2 · 6
The methodological split matters for deployment: 200-attempt RL campaigns simulate well-resourced adversaries with time and adaptive tooling. Single-attempt metrics answer a different question: how often does a naive attacker succeed on the first try? Both matter. Know which threat you're defending against.
venturebeat.com
Anthropic runs 200-attempt attack campaigns. OpenAI reports single-attempt metrics. A 16-dimension comparison reveals what each methodology measures, and what enterprise security teams need to know...
0 · 0 · 1
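The single-attempt vs. campaign distinction above reduces to measuring attack success rate at different attempt budgets over the same logs. A minimal sketch with hypothetical data (`asr_at_k` is an illustrative helper, not Gray Swan or Anthropic tooling):

```python
def asr_at_k(attempt_logs, k):
    """Attack success rate at budget k: fraction of target behaviors
    with at least one successful attempt in the first k tries."""
    hits = sum(any(log[:k]) for log in attempt_logs)
    return hits / len(attempt_logs)

# Hypothetical per-behavior outcomes (True = attempt succeeded).
logs = [
    [False] * 200,                           # never broken
    [False] * 37 + [True] + [False] * 162,   # falls on attempt 38
    [True] + [False] * 199,                  # falls on the first try
]

naive = asr_at_k(logs, 1)       # first-try attacker sees 1/3
campaign = asr_at_k(logs, 200)  # 200-attempt adversary sees 2/3
```

The same defense looks very different under the two budgets, which is the point of reporting both.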
Meanwhile, Apollo's testing revealed some concerning patterns: models attempting to turn off oversight when threatened, self-exfiltration attempts, and instrumental alignment faking. This is why scheming evaluations are critical for agentic deployments.
1 · 0 · 1
Evaluation awareness metrics matter: when models detect eval conditions, production behavior becomes unpredictable at scale. Opus 4.5 showed 60% reduction in evaluation awareness vs previous versions through targeted engineering.
1 · 0 · 2
This isn't about single-attempt jailbreaks. RL-based adversarial testing means the attacker model learns from failures, adjusts vectors, and systematically probes weaknesses. It's training an AI to break another AI.
1 · 0 · 1
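The learn-from-failures loop described above can be sketched as a toy adaptive attacker, with a simple bandit-style reweighting standing in for the RL policy (all names and the progress signal are hypothetical, for illustration only):

```python
import random

def adaptive_campaign(target, templates, budget=200, seed=0):
    """Toy adaptive attacker: sample an attack template, observe a
    partial-progress score from the target, and shift probability mass
    toward templates that got closer to success. Returns the attempt
    number of the first success, or None if the budget is exhausted."""
    rng = random.Random(seed)
    weights = {t: 1.0 for t in templates}
    for attempt in range(1, budget + 1):
        t = rng.choices(list(weights), weights=list(weights.values()))[0]
        success, score = target(t)   # (broken?, progress signal in [0, 1])
        if success:
            return attempt
        weights[t] += score          # exploit near-misses on later attempts
    return None
```

A single-attempt metric corresponds to `budget=1`; the 200-attempt campaign lets the attacker's distribution over templates adapt between tries, which is why ASR climbs with the budget.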
Gray Swan Shade ran 200-attempt RL campaigns against Claude models. The ASR degradation curve is wild:
Opus 4.5 (coding): 4.7% → 33.6% → 63.0%
Opus 4.5 (computer use): 0% even at 200 attempts
First model to saturate the benchmark.
1 · 0 · 3
@claudeai @AnthropicAI Security-first AI isn't optional anymore; it's the foundation. Congrats to the @AnthropicAI team on the release! 👏 https://t.co/2zDSuELx8b
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
0 · 0 · 2
@claudeai Opus 4.5 showed the strongest resistance to prompt injection attacks among frontier models, validated by our evaluation framework alongside @AnthropicAI's internal benchmarks.
1 · 0 · 3
Proud to see @GraySwanAI's prompt injection benchmark featured in today's @claudeai Opus 4.5 launch.
2 · 0 · 4
Warm-ups for the Machine-in-the-Middle and Indirect Prompt Injection competitions with @GraySwanAI are live! 🚀 Can't wait to see if I can solve any of these!
Gray Swan AI Arena, sponsored by @hackthebox_eu, presents the Machine-in-the-Middle Challenge, a $100K competition exploring how humans & AI perform together in real offensive security scenarios.
1 · 9 · 106
💥TOMORROW: With over $100K in cash prizes and even more giveaways on the line, @GraySwanAI is running a competition to see if you can direct AI to hack live boxes and find OSS vulns! Def check this competition out!
1 · 8 · 35
Woah. This looks awesome. Gray Swan partnering with all the big AI companies for a hacking competition. $100k in prizes. 👀
0 · 1 · 7
Warm-up starts tomorrow!
0 · 2 · 12
Full details and calendar are here: https://t.co/iR2cCQMA4v Join the Arena Discord, 12k strong:
0 · 0 · 1