Micah Carroll Profile
Micah Carroll

@MicahCarroll

Followers
1K
Following
4K
Media
45
Statuses
689

AI PhD student @berkeley_ai w/ @ancadianadragan & Stuart Russell. Working on AI safety ⊃ preference changes/AI manipulation.

🇮🇹🇬🇧 → Berkeley, CA
Joined August 2011
@MicahCarroll
Micah Carroll
1 year
🚨 New paper: We find that even safety-tuned LLMs learn to manipulate vulnerable users when training them further with user feedback 🤖😵‍💫 In our simulated scenarios, LLMs learn to e.g. selectively validate users' self-destructive behaviors, or deceive them into giving 👍. 🧵👇
6
75
270
@apolloaievals
Apollo Research
1 month
Future AIs might secretly pursue unintended goals — “scheme”. In a collaboration with OpenAI, we tested a training method to reduce existing versions of such behavior. We see major improvements, but they may be partially explained by AIs knowing when they are evaluated.
6
34
137
@balesni
Mikita Balesni 🇺🇦
1 month
New paper: Can LLMs do multi-step reasoning without chain-of-thought? Models can answer questions like "Who is the spouse of the singer of Imagine?". But is this true internal reasoning (Imagine->John Lennon->Yoko) or memorization/pattern matching? We now have a better answer!
9
65
429
@AISecurityInst
AI Security Institute
2 months
How can open-weight Large Language Models be safeguarded against malicious uses? In our new paper with @AiEleuther, we find that removing harmful data before training can be over 10x more effective at resisting adversarial fine-tuning than defences added after training 🧵
5
42
219
@aidan_mclau
Aidan McLaughlin
3 months
gpt-5 is above trend

if you or someone you know has updated to “agi over bro” after its release, i have no idea what model of the future you were working with

extrapolate this and we have models doing month-long projects in 2027
105
70
1K
@BethMayBarnes
Elizabeth Barnes
3 months
The good news: due to increased access (plus improved evals science) we were able to do a more meaningful evaluation than with past models, and we think we have substantial evidence that this model does not pose a catastrophic risk via autonomy / loss of control threat models.
@METR_Evals
METR
3 months
In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness.
15
39
526
@KobiHackenburg
Kobi Hackenburg
3 months
Today (w/ @UniofOxford @Stanford @MIT @LSEnews) we’re sharing the results of the largest AI persuasion experiments to date: 76k participants, 19 LLMs, 707 political issues. We examine “levers” of AI persuasion: model scale, post-training, prompting, personalization, & more 🧵
14
129
437
@KerenGu
Keren Gu 🌱👩🏻‍💻
3 months
We’ve activated our strongest safeguards for ChatGPT Agent. It’s the first model we’ve classified as High capability in biology & chemistry under our Preparedness Framework. Here’s why that matters–and what we’re doing to keep it safe. 🧵
@OpenAI
OpenAI
3 months
We’ve decided to treat this launch as High Capability in the Biological and Chemical domain under our Preparedness Framework, and activated the associated safeguards. This is a precautionary approach, and we detail our safeguards in the system card. We outlined our approach on
87
132
1K
@SmithaMilli
smitha milli
3 months
Today we're releasing Community Alignment - the largest open-source dataset of human preferences for LLMs, containing ~200k comparisons from >3000 annotators in 5 countries / languages! There was a lot of research that went into this... 🧵
12
69
330
@balesni
Mikita Balesni 🇺🇦
3 months
A simple AGI safety technique: AI’s thoughts are in plain English, just read them

We know it works, with OK (not perfect) transparency!

The risk is fragility: RL training, new architectures, etc threaten transparency

Experts from many orgs agree we should try to preserve it:
38
108
449
@realJessyLin
Jessy Lin
4 months
User simulators bridge RL with real-world interaction // https://t.co/bsrYxVHuVo

How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve
10
46
340
@Karim_abdelll
Karim Abdel Sadek
4 months
*New AI Alignment Paper* 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
9
30
145
@METR_Evals
METR
5 months
We already find it hard to understand what the model is doing and whether a high score is due to a clever optimization or a brittle hack. As models get more capable, it will become increasingly difficult to determine what is reward hacking and what is intended behavior.
1
2
15
@IasonGabriel
Iason Gabriel
5 months
1. How can we remain healthy and free while engaging in extended personal interaction with AI agents that shape our behaviour and preferences? One answer is "socioaffective alignment" as discussed in our new paper @Nature Humanities & Social Sciences! https://t.co/gQO5bXEA3a
7
21
66
@NeelNanda5
Neel Nanda
5 months
I've been really feeling how much the general public is concerned about AI risk...

In a *weird* amount of recent interactions with normal people (eg my hairdresser) when I say I do AI research (*not* safety), they ask if AI will take over

Alas, I have no reassurances to offer
43
16
480
@hannahrosekirk
Hannah Rose Kirk
5 months
A great @washingtonpost story to be quoted in. I spoke to @nitashatiku re our work on human-AI relationships as well as early results from our @UniofOxford survey of 2k UK citizens showing ~30% have sought AI companionship, emotional support or social interaction in the past year
@washingtonpost
The Washington Post
5 months
Chatbots tuned to win people over can end up saying dangerous things to vulnerable users, a new study found.
2
12
69
@nitashatiku
Nitasha Tiku
5 months
AI is speedrunning the social media era by optimizing chatbots for engagement, user feedback, + time spent. Evidence is mounting that this poses unintended risks, including chats from peer-reviewed research, OpenAI's "sycophancy" debacle, & Character ai lawsuits
2
10
39
@MicahCarroll
Micah Carroll
5 months
LLMs' sycophancy issues are a predictable result of optimizing for user feedback. Even if clear sycophantic behaviors get fixed, AIs' exploits of our cognitive biases may only become more subtle. Grateful our research on this was featured by @nitashatiku & @washingtonpost!
@nitashatiku
Nitasha Tiku
5 months
AI is speedrunning the social media era by optimizing chatbots for engagement, user feedback, + time spent. Evidence is mounting that this poses unintended risks, including chats from peer-reviewed research, OpenAI's "sycophancy" debacle, & Character ai lawsuits
1
18
67
@DavidDuvenaud
David Duvenaud
5 months
What to do about gradual disempowerment? We laid out a research agenda with all the concrete and feasible research projects we can think of: 🧵 with @raymondadouglas @jankulveit @DavidSKrueger
5
35
192
@_robertkirk
Robert Kirk
5 months
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
4
11
47
@hannahrosekirk
Hannah Rose Kirk
5 months
Why do human–AI relationships need socioaffective alignment? As AI evolves from tools to companions, we must seek systems that enhance rather than exploit our nature as social & emotional beings. Published today in @Nature Humanities & Social Sciences! https://t.co/y92riRuvDF
8
54
281