Krueger AI Safety Lab

@kasl_ai

Followers: 367 · Following: 38 · Media: 0 · Statuses: 25

We are a research group at the University of Cambridge led by @DavidSKrueger, focused on avoiding catastrophic risks from AI.

Cambridge, UK
Joined October 2023
@DavidSKrueger
David Krueger
1 year
"hot take" (((shouldn't in fact be a hot take, but in the context of current AI policy discussions anything other than "do some evals" is a hot take, sadly....)))
@Manderljung
Markus Anderljung
1 year
A lot of safety-critical industries manage risk by estimating it and agreeing to keep it below a certain number. Should developers of powerful AI systems do the same? Our take: They should, but with caution, given the uncertainty of risk estimates. Also: well done, Leonie! :)
1
1
19
@ai_cam_mission
ai@cam
1 year
Could you help us build @Cambridge_Uni's #AI research community? We are looking for a Programme Manager who can deliver key programmes, scope new opportunities & ensure that our mission embeds agile project management. 📅 Deadline: 8 July Read more ⬇️ https://t.co/MvNDt4EH4j
0
4
4
@DavidSKrueger
David Krueger
1 year
New paper on sandbagging and password-locked models, concurrent with our work.
arxiv.org
To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full...
@Teun_vd_Weij
Teun van der Weij
1 year
We need trustworthy capability evaluations to ensure the safety of AI systems.🛡️ But what if AI systems can hide (dangerous) capabilities during evaluations? 🕵️ This is the problem of *sandbagging*, which we explore in our new paper: https://t.co/iE5sCbMqLQ
1
3
16
@FazlBarez
Fazl Barez 🔜 @NeurIPS
1 year
Super proud to have contributed to @AnthropicAI's new paper. We explore whether AI could learn to hack its own reward system through generalization from training. Important implications as AI systems become more capable.
@AnthropicAI
Anthropic
1 year
New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system? In a new paper, we show they can, by generalization from training in simpler settings. Read our blog post here: https://t.co/KhEFIHf7WZ
0
6
70
@mesotronium
Gabriel Recchia
2 years
Super proud to have been able to make my little contribution to this monumental work. Huge credit to @usmananwar391 for recognizing the need for this paper and pulling everything together to make it happen
@DavidSKrueger
David Krueger
2 years
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below...
0
1
17
@kasl_ai
Krueger AI Safety Lab
1 year
New paper from Krueger Lab alum @MicahCarroll. Congrats 🎉
@MicahCarroll
Micah Carroll
1 year
Excited to share a unifying formalism for the main problem I’ve tackled since starting my PhD! 🎉 Current AI Alignment techniques ignore the fact that human preferences/values can change. What would it take to account for this? 🤔 A thread 🧵⬇️
0
0
5
@S_OhEigeartaigh
Seán Ó hÉigeartaigh
1 year
Real privilege today to get scholars from @LeverhulmeCFI, @CSERCambridge, @BennettInst, & @kasl_ai together for a discussion of Concordia's State of AI Safety in China report with Kwan Yee Ng. Important work, buzzing exchange. https://t.co/FEQpBm95tJ
1
3
15
@DavidSKrueger
David Krueger
1 year
It's great that governments and researchers are finally waking up to the extreme risks posed by AI. But we're still not doing nearly enough! Our short-but-sweet Science paper, with an all-star author list, argues for concrete steps that urgently need to be taken.
@JanMBrauner
Jan Brauner
1 year
Out in Science today: In our paper, we describe extreme AI risks and concrete actions to manage them, including tech R&D and governance. “For AI to be a boon, we must reorient; pushing AI capabilities alone is not enough.”
1
7
64
@JanMBrauner
Jan Brauner
1 year
Out in Science today: In our paper, we describe extreme AI risks and concrete actions to manage them, including tech R&D and governance. “For AI to be a boon, we must reorient; pushing AI capabilities alone is not enough.”
10
46
174
@kasl_ai
Krueger AI Safety Lab
1 year
Congrats to @_achan96_, @DavidSKrueger, @Manderljung, and the rest of the team on this accepted @FAccTConference paper.
@_achan96_
Alan Chan
2 years
AI agents, which could accomplish complex tasks with limited human supervision, are coming down the pipe. How do we manage their risks? Our new @FAccTConference paper argues that we need visibility---information about the use of agents---and investigates how to obtain it. 🧵
0
1
6
@MicahCarroll
Micah Carroll
2 years
Working to make RL agents safer and more aligned? Using RL methods to engineer safer AI? Developing audits or governance mechanisms for RL agents? Share your work with us at the RL Safety workshop at @RL_Conference 2024! ‼️ Updated deadline ‼️ ➡️ 24th of May AoE
1
12
37
@kasl_ai
Krueger AI Safety Lab
2 years
Catch Samyak, @DavidSKrueger, and others at our @iclr_conf poster tomorrow! 🚀
@CambridgeMLG
Cambridge MLG
2 years
"Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks" 🎓 Samyak Jain, et al. 📅 May 8, 4:30 PM 📍 Poster Session 4
0
0
4
@kasl_ai
Krueger AI Safety Lab
2 years
We will be at ICLR again this year! 🎉 Catch our poster next week in Vienna @iclr_conf. We’ll be in Hall B, booth #228 on Wed 8 May from 4:30-6:30 PM.
@_robertkirk
Robert Kirk
2 years
🚀Excited to share new work analysing how fine-tuning works mechanistically: https://t.co/LOH5frhG3X We show that fine-tuning only produces limited “wrappers” on pretrained model capabilities, and these wrappers are easily removed through pruning, probing or more fine-tuning!
0
4
29
@kasl_ai
Krueger AI Safety Lab
2 years
Congrats to our affiliate @FazlBarez, whose paper won best poster at the Tokyo Technical AI Safety Conference @tais_2024. We have had the pleasure of working with Fazl since February.
@FazlBarez
Fazl Barez 🔜 @NeurIPS
2 years
New Paper 🎉: https://t.co/pgrdha94sw Can language models relearn removed concepts? Model editing aims to eliminate unwanted concepts through neuron pruning. LLMs demonstrate a remarkable capacity to adapt and regain conceptual representations which have been removed 🧵1/8
0
7
27
@kasl_ai
Krueger AI Safety Lab
2 years
Watch our alumnus @jesse_hoogland present his work on singular learning theory.
@tais_2026
Technical AI Safety Conference (TAIS)
2 years
At #TAIS2024, @jesse_hoogland is about to show how transformers exhibit discrete developmental stages during in-context learning, when trained on language or linear regression tasks. Watch live now: https://t.co/BGBtI5cO16
0
0
4
@SciTechgovuk
Department for Science, Innovation and Technology
2 years
The #AISeoulSummit is just a month away 🇬🇧 🇰🇷 Jointly hosted by the UK & the Republic of Korea, the summit will focus on: 🤝 international agreements on AI safety 🛡️ responsible development of AI by companies 💡 showcasing the benefits of safe AI
3
10
27
@DavidSKrueger
David Krueger
2 years
Big congrats to my student @usmananwar391 for this!
@usmananwar391
Usman Anwar
2 years
We released this new agenda on LLM-safety yesterday. This is VERY comprehensive covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv:
0
1
29
@S_OhEigeartaigh
Seán Ó hÉigeartaigh
2 years
I'm delighted to have contributed to this new Agenda Paper on AI Safety * Governance of LLMs can be a v powerful tool in helping assure their safety and alignment. It could complement and *substitute* for technical interventions. But LLM governance is currently challenging! 🧵⬇️
@DavidSKrueger
David Krueger
2 years
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below...
2
6
22
@usmananwar391
Usman Anwar
2 years
We released this new agenda on LLM-safety yesterday. This is VERY comprehensive covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv:
arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific...
@DavidSKrueger
David Krueger
2 years
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below...
5
24
81
@DavidSKrueger
David Krueger
2 years
I’m super excited to release our 100+ page collaborative agenda - led by @usmananwar391 - on “Foundational Challenges In Assuring Alignment and Safety of LLMs” alongside 35+ co-authors from NLP, ML, and AI Safety communities! Some highlights below...
7
154
464