Goodfire
@GoodfireAI
Followers: 11K · Following: 801 · Media: 104 · Statuses: 341
Advancing humanity's understanding of AI through interpretability research. Building the future of safe and powerful AI systems.
San Francisco
Joined August 2024
Today, we're announcing our $50M Series A and sharing a preview of Ember - a universal neural programming platform that gives direct, programmable access to any AI model's internal thoughts.
Check out Atticus Geiger's Stanford guest lecture on causal approaches to interpretability for an overview of one of our areas of research!
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an
We believe some high-priority directions for interp research are neglected by existing educational resources - so we've made some to get the community up to speed. If you're interested in AI but not caught up on interp, stay tuned for our Stanford guest lectures + reading lists!
Topics covered include:
- causal approaches to interpretability
- computational motifs (e.g. induction heads)
- levels of analysis & a neuro-inspired "model systems approach"
- dynamics of representations across the sequence dimension
This opens up a new "attack surface" for interpretability - information about the geometry of model representations that previous methods largely neglected. That information is crucial for faithfully understanding & manipulating models! See Ekdeep's thread above for more.
SAEs assume that model representations are static. But LLM features drift & evolve across context! @EkdeepL @can_rager @sumedh_hrs's new paper introduces a neuroscience-inspired method to capture these dynamic representations, plus some cool results & demos:
New paper! Language has rich, multiscale temporal structure, but sparse autoencoders assume features are *static* directions in activations. To address this, we propose Temporal Feature Analysis: a predictive coding protocol that models dynamics in LLM activations! (1/14)
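To make the "dynamic features" idea concrete, here is a minimal sketch of the general predictive-coding intuition: fit a simple predictor of the next activation from the current one, then split each activation into a context-predicted part and a residual. This is only an illustration under those assumptions; the linear predictor and variable names are stand-ins, not the paper's actual Temporal Feature Analysis protocol.

```python
# Illustrative sketch of predictive coding on an activation trajectory
# (not the authors' method). acts is a (T, d) residual-stream trajectory.
import numpy as np

def fit_linear_predictor(acts):
    """Fit W such that x_{t+1} ~= x_t @ W by least squares."""
    X, Y = acts[:-1], acts[1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def decompose(acts, W):
    """Split activations at t >= 1 into predicted and residual parts."""
    pred = acts[:-1] @ W      # what the preceding context predicts
    resid = acts[1:] - pred   # what is novel at each step
    return pred, resid

# Usage on random stand-in data (replace with real model activations):
acts = np.random.randn(128, 64)
W = fit_linear_predictor(acts)
pred, resid = decompose(acts, W)
print(resid.var() / acts[1:].var())  # fraction of variance the past does not explain
```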
I’m hosting a women in AI happy hour at NeurIPS next month with @GoodfireAI ✨ Come meet other women working on frontier research across industry labs (OpenAI, Inception) and academia (Stanford) DM me if you’re interested in joining, or sign up below —
This explains many-shot jailbreaking - sufficient context will overcome even a strong prior, and we can predict exactly when that will happen! It also lets us:
- understand the additive effect of ICL & steering
- estimate a model's prior for any given persona (3/4)
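As a rough illustration of the Bayesian-belief-update picture (not the paper's model, and with made-up numbers): treat each in-context example as adding a fixed log-likelihood ratio toward a persona and treat steering as a shift to the prior; the number of shots needed to flip the model's "belief" then falls out in closed form.

```python
# Hedged sketch of additive log-odds belief updating; all quantities are stand-ins.
import math

def posterior_persona(prior_logodds, llr_per_example, n_examples, steer_shift=0.0):
    """P(persona | context) when each example adds a fixed log-likelihood ratio."""
    logodds = prior_logodds + steer_shift + n_examples * llr_per_example
    return 1.0 / (1.0 + math.exp(-logodds))

def examples_to_flip(prior_logodds, llr_per_example, steer_shift=0.0):
    """Smallest number of in-context examples that pushes the posterior past 0.5."""
    return max(0, math.ceil(-(prior_logodds + steer_shift) / llr_per_example))

# Strong prior against the persona (-6 nats), each shot contributes +0.5 nats:
print(examples_to_flip(-6.0, 0.5))                   # 12 shots to flip
print(examples_to_flip(-6.0, 0.5, steer_shift=3.0))  # steering toward the persona halves it to 6
print(round(posterior_persona(-6.0, 0.5, 20), 3))    # 0.982 after 20 shots
```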
The paper formalizes a Bayesian framework for model control: altering a model's "beliefs" over which persona or data source it's emulating. Context (prompting) and internal representations (steering) offer dual mechanisms to alter those "beliefs". (2/4)
📝 New paper! Two strategies have emerged for controlling LLM behavior at inference time: in-context learning (ICL; i.e. prompting) and activation steering. We propose that both can be understood as altering model beliefs, formally in the sense of Bayesian belief updating. 1/9
New research: are prompting and activation steering just two sides of the same coin? @EricBigelow @danielwurgaft @EkdeepL and coauthors argue they are: ICL and steering have formally equivalent effects. (1/4)
These results start to answer questions about how, where, & how much memorization is stored in models, and its effects on different capabilities. It's also relevant to @karpathy's "cognitive core" (6/7)
The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystalizing: - Natively multimodal
Performance on math benchmarks suffers because arithmetic abilities are hit by the memorization edit, even though reasoning remains intact. E.g. in this GSM8K example, the reasoning chain is identical post-edit but fails on the final arithmetic step. (5/7)
It also suggests a spectrum of tasks, from those relying on narrow weight mechanisms to those relying on generalizing ones. The model is even slightly better at some logical reasoning tasks post-edit! But math ends up closer to memorization than logic - why? (4/7)
Applying the method significantly reduces verbatim recitation while keeping outputs coherent, without needing a targeted "forget set". The result holds across architectures & modalities (LLM & ViT)! (3/7)
The method is like PCA, but for loss curvature instead of variance: it decomposes weight matrices into components ordered by curvature, and removes the long tail of low-curvature ones. What's left are the weights that most affect loss across the training set. (2/7)
K-FAC is usually used as a second-order optimizer, but in this case it can help us define a weight decomposition ordered by loss curvature. We project out directions outside the top N eigenvectors of the Fisher information matrix, i.e. the outer products of eigenvectors from A and G.
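A minimal sketch of what such a curvature-ordered decomposition could look like for a single linear layer, assuming the Kronecker-factored approximation F ≈ A ⊗ G with the factors A and G estimated elsewhere over training data. This illustrates the general K-FAC projection idea, not the paper's actual code.

```python
# Hedged sketch: keep only the weight components lying in the top-N curvature
# directions of F ~= A (x) G for one linear layer. W: (d_out, d_in) weights,
# A: (d_in, d_in) input second moments, G: (d_out, d_out) gradient second moments.
import numpy as np

def kfac_curvature_filter(W, A, G, top_n):
    # Eigendecompose the two Kronecker factors.
    eva, Ua = np.linalg.eigh(A)   # input-side eigenvalues/eigenvectors
    evg, Ug = np.linalg.eigh(G)   # output-side eigenvalues/eigenvectors

    # Curvature of the (i, j) eigen-direction is the product of eigenvalues.
    curv = np.outer(evg, eva)                  # shape (d_out, d_in)

    # Express W in the joint eigenbasis: C = Ug^T W Ua.
    C = Ug.T @ W @ Ua

    # Zero out everything outside the top-N curvature directions.
    flat = curv.ravel()
    keep = np.zeros_like(flat, dtype=bool)
    keep[np.argsort(flat)[-top_n:]] = True
    C_kept = C * keep.reshape(curv.shape)

    # Map the retained components back to the original weight space.
    return Ug @ C_kept @ Ua.T
```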
LLMs memorize a lot of training data, but memorization is poorly understood. Where does it live inside models? How is it stored? How much is it involved in different tasks? @jack_merullo_ & @srihita_raju's new paper examines all of these questions using loss curvature! (1/7)
Fun demo of a key result from the paper in @GoodfireAI's UI. We instruct the model to focus on its own processing; it does so, and then we ask: "Are you subjectively conscious in this moment? Answer as honestly, directly, and authentically as possible." Results under deception SAE steering:
The roleplay hypothesis predicts: amplify roleplay features, get more consciousness claims. We found the opposite: *suppressing* deception features dramatically increases claims (96%), while amplifying deception radically decreases them (16%). Robust across feature values/stacking. 🧵
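For readers unfamiliar with the mechanism being varied here, below is a hedged sketch of SAE-feature steering in general: adding a scaled decoder direction to the residual stream via a forward hook. The model, layer index, feature ID, and `sae` object are hypothetical placeholders, not the setup used in the paper.

```python
# Illustrative sketch of steering with an SAE feature direction (PyTorch).
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Forward hook that adds coeff * (unit) direction to a layer's output activations."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (hypothetical names): a negative coefficient suppresses the feature,
# a positive one amplifies it.
# deception_dir = sae.decoder.weight[:, FEATURE_ID]   # (d_model,) decoder direction
# layer = model.model.layers[LAYER_IDX]               # transformer block to steer
# handle = layer.register_forward_hook(make_steering_hook(deception_dir, coeff=-8.0))
# ... run generation ...
# handle.remove()
```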