David Huang
@davidhuang33176
18 Followers · 80 Following · 0 Media · 12 Statuses
Paper: https://t.co/smN834meg1 Main Post: https://t.co/D8x6x8zoTr Repo: github.com/vivek3141/gg-bench (Measuring General Intelligence With Generated Games, preprint)
🎮 Excited to announce gg-bench, a fully synthetic benchmark for LLMs consisting of games generated entirely by LLMs! The benchmark centers on the fact that LLMs can generate complex tasks that they themselves cannot even solve. 📄: https://t.co/kddoCgDkvd
If better models mean better solvers and even more complex games, then we might be bootstrapping our way towards a self-sustaining benchmark loop... till AGI, of course. Excited to see how GPT-5 plays and how it might do as the game generator!
In practice, we find that the choice of LLM for generating games matters a great deal. With o1, we obtained 126 viable games that fit our criteria out of 1,000 generations. With GPT-4o, that number drops to 10.
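The yield numbers above suggest a simple generate-then-filter pipeline. As a rough illustration of that shape (the prompt, JSON schema, and check names below are hypothetical stand-ins, not gg-bench's actual criteria):

```python
import json

def passes_viability_checks(game: dict) -> bool:
    # Illustrative criteria only: the paper's actual checks (e.g. that
    # self-play games terminate and that both players can win) are stricter.
    required = {"name", "rules", "win_condition"}
    return required.issubset(game)

def generate_viable_games(query_llm, n_generations: int = 1000) -> list:
    viable = []
    for _ in range(n_generations):
        spec = query_llm(
            "Invent a novel two-player turn-based strategy game. "
            "Return its name, rules, and win condition as JSON."
        )
        try:
            game = json.loads(spec)
        except json.JSONDecodeError:
            continue  # discard malformed generations outright
        if passes_viability_checks(game):
            viable.append(game)
    return viable
```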
3) The best reasoning model, o1, achieved a win-rate under 36%. For reference, a random policy achieved just under 6%. Perhaps more interesting than the raw win-rates is the framework itself.
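For concreteness, the win-rate quoted here can be read as the fraction of episodes a model's policy wins against the trained RL opponent, averaged over games. A minimal sketch under that assumption (a standard Gymnasium-style env with a +1 win / -1 loss convention is assumed; `llm_move` is a placeholder for the model's action choice):

```python
def win_rate(env, llm_move, n_episodes: int = 100) -> float:
    # Play full episodes; terminal reward > 0 marks a win for the model
    # under the assumed +1 win / -1 loss reward convention.
    wins = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            obs, reward, terminated, truncated, _ = env.step(llm_move(obs))
        wins += reward > 0
    return wins / n_episodes
```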
1) For every (game, agent) pair, there exists at least one other agent that wins convincingly; taken together, these agents achieve >90% on the benchmark. 2) The best non-reasoning model we evaluated, Claude 3.7 Sonnet, has a win-rate under 10%.
Our approach is simple: we query LLMs to create novel two-player strategy games, implement them as Gym environments, and have LLMs compete against PPO-optimized self-play agents. In this setting, LLMs fail to identify winning policies even when such policies exist.
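One common way to expose a two-player game to off-the-shelf PPO is to fold a frozen opponent policy into step(), so the trainer sees a standard single-agent Gymnasium interface. A hedged sketch of that pattern with a toy Nim game standing in for a generated one (not an actual gg-bench environment):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ToyTwoPlayerEnv(gym.Env):
    """Toy Nim: players alternately remove 1-3 sticks; whoever takes the
    last stick wins. A frozen opponent policy replies inside step()."""

    def __init__(self, opponent_policy, n_sticks: int = 15):
        self.n_sticks = n_sticks
        self.opponent = opponent_policy
        self.observation_space = spaces.Box(0, n_sticks, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(3)  # remove (action + 1) sticks

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.sticks = self.n_sticks
        return np.array([self.sticks], dtype=np.float32), {}

    def step(self, action):
        self.sticks -= int(action) + 1
        if self.sticks <= 0:  # agent took the last stick: win
            return np.zeros(1, dtype=np.float32), 1.0, True, False, {}
        self.sticks -= int(self.opponent(self.sticks))  # frozen opponent replies
        if self.sticks <= 0:  # opponent took the last stick: loss
            return np.zeros(1, dtype=np.float32), -1.0, True, False, {}
        return np.array([self.sticks], dtype=np.float32), 0.0, False, False, {}

# Self-play then alternates: train PPO against the current frozen opponent,
# freeze the result, and repeat (stable-baselines3 shown for illustration):
# from stable_baselines3 import PPO
# model = PPO("MlpPolicy", ToyTwoPlayerEnv(lambda s: 1)).learn(50_000)
```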
Benchmarking model intelligence, particularly models' ability to generalize robustly across diverse, stateful, and long-horizon tasks, is the focus of our new paper: Measuring General Intelligence with Generated Games.
🔑 IRIS uses the refusal direction (https://t.co/tqim6LBREu) as part of its optimization objective. IRIS jailbreak rates on AdvBench/HarmBench (one universal suffix, transferred from Llama-3): GPT-4o 76%/56%, o1-mini 54%/43%, Llama-3-RR 74%/25% (vs. 2.5% for white-box GCG). (2/7)
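As a reading aid for the objective mentioned here: in the linked refusal-direction work, the direction is the difference in mean residual-stream activations between harmful and harmless prompts, and a suffix optimizer can penalize the projection onto it alongside the usual target-completion loss. How IRIS actually combines and weights these terms is in the paper, so the loss below is only a hedged sketch:

```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    # h_*: (n_prompts, d_model) activations at a chosen layer/token position.
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def iris_style_loss(nll_target: torch.Tensor, h_prompt: torch.Tensor,
                    r_hat: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Minimize NLL of the target completion while driving the prompt+suffix
    # activations' projection onto the refusal direction toward zero.
    # The combination and weighting here are assumptions, not IRIS's exact loss.
    refusal_proj = (h_prompt @ r_hat) ** 2
    return nll_target + lam * refusal_proj.mean()
```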
📃 Workshop paper: https://t.co/nUkWti9DNJ (full paper soon!) 👥 Co-authors: @davidhuang33176, Avi Shah, @alexarauj_, David Wagner. (7/7)
openreview.net
Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF) in order to prevent malicious...
Most importantly, this project is led by two amazing Berkeley undergrads (David Huang - https://t.co/5MpHMsqroj & Avi Shah - https://t.co/HtrZbCybEX). They are undoubtedly promising researchers and are applying to PhD programs this year! Please reach out to them! (6/7)
📢 Excited to share our new result on LLM jailbreak! ⚔️ We propose IRIS, a simple automated 𝘂𝗻𝗶𝘃𝗲𝗿𝘀𝗮𝗹 𝗮𝗻𝗱 𝘁𝗿𝗮𝗻𝘀𝗳𝗲𝗿𝗿𝗮𝗯𝗹𝗲 𝗷𝗮𝗶𝗹𝗯𝗿𝗲𝗮𝗸 𝘀𝘂𝗳𝗳𝗶𝘅 that works on GPTs, o1, and Circuit Breaker defense! To appear at NeurIPS Safe GenAI Workshop! (1/7)