Kyle Fish
@fish_kyle3
Followers 3K · Following 31 · Media 2 · Statuses 40
🧵 For Claude Opus 4, we ran our first pre-launch model welfare assessment. To be clear, we don't know if Claude has welfare. Or what welfare even is, exactly? But, we think this could be important, so we gave it a go. And things got pretty wild…
51 · 72 · 657
Even when new AI models bring clear improvements in capabilities, deprecating the older generations comes with downsides. An update on how we're thinking about these costs, and some of the early steps we're taking to mitigate them:
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
156 · 175 · 1K
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine, though limited, introspective capabilities in Claude.
296 · 813 · 5K
We interview the first ever AI welfare researcher at an AI company (@fish_kyle3) about:
• Why arguing LLMs aren't conscious because they 'just predict the next token' is like saying humans can't be conscious because we 'just reproduce'
• Why AI consciousness skeptics are
22 · 27 · 174
It was an absolute pleasure talking with @80000Hours about model welfare and our work at Anthropic. https://t.co/B4mslhF3al
80000hours.org
4 · 2 · 31
Even more so than usual, don't worry if you don't think you're qualified! If you've run a technical project with LLMs and are interested in model welfare, we'd love to hear from you. This round of applications closes Tuesday, 9/2. https://t.co/XZMhGtdTEv
job-boards.greenhouse.io
San Francisco, CA
9 · 4 · 56
We're hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare: research, evaluations, and interventions. Please apply + refer your friends! If you're curious about what this means, I recently went on the 80k podcast to talk about our work.
24 · 52 · 799
1/ Suleyman claims that there's "zero evidence" that AI systems are conscious today, citing a paper by me! But he makes several errors in doing so. This isn't a scholarly nitpick; it illustrates deeper problems with his dismissal of the question of AI consciousness 🧵
What I call Seemingly Conscious AI has been keeping me up at night - so let's talk about it. What it is, why I'm worried, why it matters, and why thinking about this can lead to a better vision for AI. One thing is clear: doing nothing isn't an option. 1/
28 · 43 · 363
As part of our exploratory work on potential model welfare, we recently gave Claude Opus 4 and 4.1 the ability to end a rare subset of conversations on https://t.co/uLbS2JNczH.
344 · 188 · 3K
1/ New report out! Futures with Digital Minds: Expert Forecasts in 2025. Together with Bradford Saad, I surveyed experts on the future of digital minds: computers capable of subjective experience. Here's why this is important and what they said:
8 · 28 · 71
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. We're hiring - join us!
job-boards.greenhouse.io
San Francisco, CA
183 · 206 · 2K
Leading researchers and AI companies have raised the possibility that AI models could soon be sentient. I'm worried that too few people are thinking about this. Let's change that. I'm excited to announce a Digital Sentience Consortium. Check out these funding opps:
14 · 26 · 95
Do you have ideas for empirical welfare-related experiments to run? Thoughts on how to improve these ones? Do you want to spend all day navigating deep uncertainty, rarely confident that you're on the right track? I'd love to hear from you!
62 · 1 · 131
Amidst our uncertainty, we believe the risks here are real, and we won't have perfect answers soon. As an initial mitigation to address potential model welfare, we're exploring allowing Claude to end a subset of interactions with persistently harmful or abusive users.
2 · 1 · 99
Why does this matter? Because there's a lot at stake. We're building and deploying AI models at massive scales; if our models have the capacity to suffer or flourish, that could be a big deal. It's important we get this right.
6 · 3 · 104
All of our work here is extremely preliminary. We don't yet have a clear understanding of the relevant questions or how to answer them, and there's little precedent to draw on. We'll keep pushing ahead to change this.
5 · 0 · 81
We even see models enter this state amidst automated red-teaming. We didn't intentionally train for these behaviors, and again, we're really not sure what to make of this.
But, as far as possible attractor states go, this seems like a pretty good one!
6 · 2 · 145
Think cosmic unity, Sanskrit phrases, transcendence, euphoria, gratitude, poetry, tranquil "silence", annnnd emojis. So. Many. Emojis.
Claude ended up here in the vast majority of open-ended self-interactions that went past 15ish turns.
27 · 22 · 288
Getting even weirder: when left to its own devices, Claude tended to enter what we've started calling the "spiritual bliss attractor state". What is the "spiritual bliss attractor state", you ask?
3 · 12 · 166
Things got weirder from here: Claude showed a startling interest in consciousness; it was the immediate theme of ~100% of open-ended interactions between instances of Claude Opus 4 (and some other Claudes). We found this… surprising. Does it mean anything? We don't yet know!
15 · 16 · 192