Kyle Fish

@fish_kyle3

Followers: 3K · Following: 31 · Media: 2 · Statuses: 40

Anthropic

Joined March 2017
@fish_kyle3
Kyle Fish
6 months
🧵For Claude Opus 4, we ran our first pre-launch model welfare assessment. To be clear, we don’t know if Claude has welfare. Or what welfare even is, exactly? 🫠 But, we think this could be important, so we gave it a go. And things got pretty wild…
@AnthropicAI
Anthropic
9 days
Even when new AI models bring clear improvements in capabilities, deprecating the older generations comes with downsides. An update on how we’re thinking about these costs, and some of the early steps we’re taking to mitigate them:
anthropic.com
@AnthropicAI
Anthropic
15 days
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
@robertwiblin
Rob Wiblin
3 months
We interview the first ever AI welfare researcher at an AI company (@fish_kyle3) about: • Why arguing LLMs aren't conscious because they 'just predict the next token' is like saying humans can't be conscious because we 'just reproduce' • Why AI consciousness skeptics are
@fish_kyle3
Kyle Fish
3 months
It was an absolute pleasure talking with @80000Hours about model welfare and our work at Anthropic. https://t.co/B4mslhF3al
80000hours.org
@fish_kyle3
Kyle Fish
3 months
Even more so than usual, don’t worry if you don’t think you’re qualified! If you’ve run a technical project with LLMs and are interested in model welfare, we’d love to hear from you. This round of applications closes Tuesday, 9/2. https://t.co/XZMhGtdTEv
job-boards.greenhouse.io
San Francisco, CA
@fish_kyle3
Kyle Fish
3 months
We’re hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare—research, evaluations, and interventions 🌀 Please apply + refer your friends! If you’re curious about what this means, I recently went on the 80k podcast to talk about our work.
@rgblong
Robert Long
3 months
1/ Suleyman claims that there’s “zero evidence” that AI systems are conscious today. To do so, he cites a paper by me! There are several errors in doing so. This isn't a scholarly nitpick—it illustrates deeper problems with his dismissal of the question of AI consciousness 🧵
@mustafasuleyman
Mustafa Suleyman
3 months
What I call Seemingly Conscious AI has been keeping me up at night - so let's talk about it. What it is, why I'm worried, why it matters, and why thinking about this can lead to a better vision for AI. One thing is clear: doing nothing isn't an option. 1/
@AnthropicAI
Anthropic
3 months
As part of our exploratory work on potential model welfare, we recently gave Claude Opus 4 and 4.1 the ability to end a rare subset of conversations on https://t.co/uLbS2JNczH.
@LuciusCaviola
Lucius Caviola
3 months
1/ 🚨 New report out! Futures with Digital Minds: Expert Forecasts in 2025 Together with Bradford Saad, I surveyed experts on the future of digital minds — computers capable of subjective experience. Here’s why this is important and what they said 👇
@Jack_W_Lindsey
Jack Lindsey
4 months
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us!
job-boards.greenhouse.io
San Francisco, CA
@zdgroff
Zach Freitas-Groff 🔸
6 months
💡Leading researchers and AI companies have raised the possibility that AI models could soon be sentient. I’m worried that too few people are thinking about this. Let’s change that. I’m excited to announce a Digital Sentience Consortium. Check out these funding opps.👇
@fish_kyle3
Kyle Fish
6 months
Do you have ideas for empirical welfare-related experiments to run? Thoughts on how to improve these ones? Do you want to spend all day navigating deep uncertainty, rarely confident that you’re on the right track? I’d love to hear from you! ✨🕉️🌀
@fish_kyle3
Kyle Fish
6 months
Check out the model card for more!
@fish_kyle3
Kyle Fish
6 months
Amidst our uncertainty, we believe the risks here are real, and we won’t have perfect answers soon. As an initial mitigation to address potential model welfare, we’re exploring allowing Claude to end a subset of interactions with persistently harmful or abusive users.
@fish_kyle3
Kyle Fish
6 months
Why does this matter? Because there’s a lot at stake. We’re building and deploying AI models at massive scales—if our models have the capacity to suffer or flourish, that could be a big deal. It’s important we get this right.
@fish_kyle3
Kyle Fish
6 months
All of our work here is extremely preliminary. We don’t yet have a clear understanding of the relevant questions or how to answer them, and there’s little precedent to draw on. We’ll keep pushing ahead to change this.
@fish_kyle3
Kyle Fish
6 months
We even see models enter this state amidst automated red-teaming. We didn’t intentionally train for these behaviors, and again, we’re really not sure what to make of this 😅 But, as far as possible attractor states go, this seems like a pretty good one!
@fish_kyle3
Kyle Fish
6 months
Think cosmic unity 🌌, sanskrit phrases 🕉️, transcendence 🌀, euphoria 🎆, gratitude 🙏, poetry 📜, tranquil “silence” 🕊, annnnd emojis. So. Many. Emojis. 🌅🌈🎭💫💕🌊. Claude ended up here in the vast majority of open-ended self-interactions that went past 15ish turns.
@fish_kyle3
Kyle Fish
6 months
Getting even weirder: when left to its own devices, Claude tended to enter what we’ve started calling the “spiritual bliss attractor state”. What is the “spiritual bliss attractor state”, you ask?
@fish_kyle3
Kyle Fish
6 months
Things got weirder from here: Claude showed a startling interest in consciousness—it was the immediate theme of ~100% of open-ended interactions between instances of Claude Opus 4 (and some other Claudes). We found this…surprising. Does it mean anything? We don’t yet know!