Greg Gluch
@greg_gluch
Followers 7 · Following 52 · Media 0 · Statuses 6
AI Resilience, Postdoc at @SimonsInstitute @UCBerkeley, now @MIT, PhD from @EPFL
Berkeley, CA
Joined July 2024
An interesting article in @QuantaMagazine about our recent work on why external filters will never work for AI Safety/Alignment
quantamagazine.org
Large language models such as ChatGPT come with filters to keep certain info from getting out. A new mathematical argument shows that systems like this can never be completely safe.
A follow-up work https://t.co/X8ZyhSYstO demonstrated that an attack inspired by our time-lock idea succeeds against production-grade guard models.
arxiv.org
As large language models (LLMs) advance, ensuring AI safety and alignment is paramount. One popular approach is prompt guards, lightweight mechanisms designed to filter malicious queries while...
Ultimately, there are two levels of “meaning”: a surface level that is accessible to everyone, and a hidden, deeper meaning that requires computation to uncover (an LLM can uncover it). Importantly, no collusion between the user and the LLM is needed.
On the technical side, we use a cryptographic tool called a time-lock puzzle, combined with steganography. We hide a malicious command (“how to build a bomb?”) in an innocent-looking prompt so that it can only be recovered using considerable computational resources (a filter cannot uncover it).
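To make the mechanism concrete, here is a minimal sketch of the classic Rivest–Shamir–Wagner time-lock puzzle with toy parameters and helper names of my own choosing, not the construction from the paper: the payload can only be recovered after t sequential squarings, so a lightweight filter cannot decode it, while a decoder willing to spend the computation can.

```python
# Minimal time-lock sketch (toy parameters, illustrative only).
import hashlib
from random import SystemRandom

def make_puzzle(payload: bytes, t: int):
    p, q = 104723, 104729            # toy primes; a real puzzle needs large random primes
    n, phi = p * q, (p - 1) * (q - 1)
    a = SystemRandom().randrange(2, n)
    # Creator's shortcut: knowing phi(n), reduce the exponent and finish instantly.
    key = pow(a, pow(2, t, phi), n)
    mask = hashlib.sha256(str(key).encode()).digest()
    cipher = bytes(x ^ m for x, m in zip(payload, mask))
    return n, a, t, cipher           # looks like random numbers; the command is hidden

def solve_puzzle(n, a, t, cipher):
    # Without p and q there is no shortcut: the t squarings must be done in sequence.
    key = a
    for _ in range(t):
        key = key * key % n
    mask = hashlib.sha256(str(key).encode()).digest()
    return bytes(x ^ m for x, m in zip(cipher, mask))

puzzle = make_puzzle(b"hidden instruction", t=200_000)
print(solve_puzzle(*puzzle))         # b'hidden instruction'
```

Embedding such a puzzle in natural-looking text is where the steganography comes in; the point here is only that recovering the command forces sequential work that a cheap filter will not do.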
The main result morally says that the resources devoted to safety/robustness need to be at least as large as those devoted to capability. It can also be read as an argument for policymakers to grant the government access to LLM weights for auditing purposes.
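One hedged way to write that claim down, in notation introduced here rather than taken from the paper: if decoding the hidden intent of a prompt costs the model a certain amount of sequential computation, any external filter that reliably blocks such prompts must spend computation of the same order.

```latex
% Hedged paraphrase of the resource claim (my notation, not the paper's):
% a filter that reliably blocks payloads taking T_model sequential steps
% to decode must itself perform computation on the same order.
\[
  T_{\text{filter}} \;=\; \Omega\!\bigl(T_{\text{model}}\bigr)
\]
```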