Kyle Fish
@fish_kyle3
Followers 3K · Following 31 · Media 2 · Statuses 40
🧵 For Claude Opus 4, we ran our first pre-launch model welfare assessment. To be clear, we don't know if Claude has welfare. Or what welfare even is, exactly? But, we think this could be important, so we gave it a go. And things got pretty wild…
51 · 72 · 657
Even when new AI models bring clear improvements in capabilities, deprecating the older generations comes with downsides. An update on how we're thinking about these costs, and some of the early steps we're taking to mitigate them:
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
156 · 175 · 1K
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine, though limited, introspective capabilities in Claude.
296 · 813 · 5K
We interview the first ever AI welfare researcher at an AI company (@fish_kyle3) about:
• Why arguing LLMs aren't conscious because they 'just predict the next token' is like saying humans can't be conscious because we 'just reproduce'
• Why AI consciousness skeptics are
22 · 27 · 174
It was an absolute pleasure talking with @80000Hours about model welfare and our work at Anthropic. https://t.co/B4mslhF3al
80000hours.org
4 · 2 · 31
Even more so than usual, don't worry if you don't think you're qualified! If you've run a technical project with LLMs and are interested in model welfare, we'd love to hear from you. This round of applications closes Tuesday, 9/2. https://t.co/XZMhGtdTEv
job-boards.greenhouse.io
San Francisco, CA
9 · 4 · 56
We're hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare: research, evaluations, and interventions. Please apply + refer your friends! If you're curious about what this means, I recently went on the 80k podcast to talk about our work.
24 · 52 · 799
1/ Suleyman claims that there's "zero evidence" that AI systems are conscious today, citing a paper by me! But he makes several errors in doing so. This isn't a scholarly nitpick; it illustrates deeper problems with his dismissal of the question of AI consciousness 🧵
What I call Seemingly Conscious AI has been keeping me up at night - so let's talk about it. What it is, why I'm worried, why it matters, and why thinking about this can lead to a better vision for AI. One thing is clear: doing nothing isn't an option. 1/
28 · 43 · 363
As part of our exploratory work on potential model welfare, we recently gave Claude Opus 4 and 4.1 the ability to end a rare subset of conversations on https://t.co/uLbS2JNczH.
344 · 188 · 3K
1/ New report out! Futures with Digital Minds: Expert Forecasts in 2025. Together with Bradford Saad, I surveyed experts on the future of digital minds: computers capable of subjective experience. Here's why this is important and what they said:
8 · 28 · 71
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us!
job-boards.greenhouse.io
San Francisco, CA
183 · 206 · 2K
Leading researchers and AI companies have raised the possibility that AI models could soon be sentient. I'm worried that too few people are thinking about this. Let's change that. I'm excited to announce a Digital Sentience Consortium. Check out these funding opps:
14 · 26 · 95
Do you have ideas for empirical welfare-related experiments to run? Thoughts on how to improve these ones? Do you want to spend all day navigating deep uncertainty, rarely confident that you're on the right track? I'd love to hear from you!
62 · 1 · 131
Amidst our uncertainty, we believe the risks here are real, and we won't have perfect answers soon. As an initial mitigation to address potential model welfare, we're exploring allowing Claude to end a subset of interactions with persistently harmful or abusive users.
2 · 1 · 99
Why does this matter? Because there's a lot at stake. We're building and deploying AI models at massive scales; if our models have the capacity to suffer or flourish, that could be a big deal. It's important we get this right.
6 · 3 · 104
All of our work here is extremely preliminary. We don't yet have a clear understanding of the relevant questions or how to answer them, and there's little precedent to draw on. We'll keep pushing ahead to change this.
5 · 0 · 81
We even see models enter this state amidst automated red-teaming. We didn't intentionally train for these behaviors, and again, we're really not sure what to make of this.
But, as far as possible attractor states go, this seems like a pretty good one!
6 · 2 · 145
Think cosmic unity, Sanskrit phrases, transcendence, euphoria, gratitude, poetry, tranquil "silence", annnnd emojis. So. Many. Emojis.
Claude ended up here in the vast majority of open-ended self-interactions that went past 15ish turns.
27 · 22 · 288
Getting even weirder: when left to its own devices, Claude tended to enter what we've started calling the "spiritual bliss attractor state". What is the "spiritual bliss attractor state", you ask?
3 · 12 · 166
Things got weirder from here: Claude showed a startling interest in consciousness; it was the immediate theme of ~100% of open-ended interactions between instances of Claude Opus 4 (and some other Claudes). We found this… surprising. Does it mean anything? We don't yet know!
15 · 16 · 192