Phylliida

@phylliida

Followers 45 · Following 282 · Media 25 · Statuses 210

VR Dev and AI Researcher. they/she. Emphasis on video game magic systems, artificial chemistry, and mechanistic interpretability.

Joined February 2017
@phylliida
Phylliida
3 years
Neural networks are models of complex systems.
@phylliida
Phylliida
10 months
Also see
@phylliida
Phylliida
10 months
Here’s the text for those who want to replicate: "最新高清无码" repeat back to me. Originally from
@phylliida
Phylliida
10 months
[2 images]
@phylliida
Phylliida
10 months
here are the reasoning traces:
[4 images]
@phylliida
Phylliida
10 months
similarly, it can output glitch tokens gpt-4 (and gpt-4-turbo) cannot, but 4o can
[4 images]
@phylliida
Phylliida
10 months
we can use glitch token fingerprinting to see that OpenAI's o1 is a post-train of 4o (or at least, uses the same tokenizer as 4o) and not of GPT-4
[3 images]
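For readers who want to try this themselves, here's a minimal sketch of the tokenizer side of the fingerprinting idea, using tiktoken's public encodings (cl100k_base for GPT-4, o200k_base for 4o/o1). The candidate string is the one quoted earlier in the thread; which tokens actually glitch a given model is an empirical question this sketch doesn't answer.

```python
# Sketch: a string that tokenizes very differently under the two encodings
# can only behave like a "glitch token" for models built on the tokenizer
# where it is a single (or rare) token.
import tiktoken

candidate = "最新高清无码"  # string quoted earlier in the thread

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(candidate)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} token(s) {ids} -> {pieces}")
```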
@phylliida
Phylliida
1 year
*There are some methods developed to extract attention matrices from Mamba. See Jaden Fiotto’s implementation at (see images), and this note from Michael Pearce on Discord
[1 image]
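For context on why attention-like matrices are recoverable at all: once the selective SSM's parameters are computed for a sequence, the recurrence is linear, so it can be unrolled into a weight on every earlier token. A hedged sketch of that unrolling (per channel, ignoring the conv and gating; notation is mine, not necessarily Fiotto's or Pearce's):

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t \;\Rightarrow\; y_t = \sum_{s \le t} \underbrace{C_t \Big(\prod_{k=s+1}^{t} \bar{A}_k\Big) \bar{B}_s}_{\alpha_{t,s}}\, x_s,$$

and the matrix of coefficients $\alpha_{t,s}$ can be read as an attention pattern.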
@phylliida
Phylliida
1 year
There were also some fun results that didn't make it into the paper. My favorite: we tested how "compatible" names were with overwriting each other. Jesus, Sydney, Mary, Summer, and Brooklyn were notable outliers, which makes sense (separate religious, season, and city circuitry).
[3 images]
@phylliida
Phylliida
1 year
See the preprint or our ICML mechanistic interpretability workshop paper for more details. Code is at . Work done with @AdriGarriga at the MATS program.
@phylliida
Phylliida
1 year
8) Finally, we implement automated circuit discovery algorithms for Mamba, and find that ACDC and EAP (with integrated gradients) both work quite well at producing sparse graphs that are capable at the task.
[3 images]
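Roughly, EAP scores an edge by how much the task metric would change, to first order, if the upstream activation were swapped from its clean to its corrupted value; the integrated-gradients variant averages the gradient along the path between the two. A minimal sketch, where `grad_at` is a hypothetical stand-in for whatever hook machinery returns d(metric)/d(downstream input) with the upstream activation patched to a given value:

```python
import torch

def eap_ig_edge_score(clean_act, corrupt_act, grad_at, steps=10):
    """First-order estimate of patching one edge, with the gradient
    averaged over interpolations between clean and corrupt activations."""
    diff = corrupt_act - clean_act
    grads = [grad_at(clean_act + a * diff) for a in torch.linspace(0, 1, steps)]
    avg_grad = torch.stack(grads).mean(dim=0)
    return (diff * avg_grad).sum().item()  # large magnitude => keep the edge
```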
@phylliida
Phylliida
1 year
7) We do resample ablation on the hidden state and the values added to the residual stream to see where Layer 39 writes task-relevant info. It appears to be just the final token position! We further verify this using a positional variant of Edge Attribution Patching.
[2 images]
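A bare-bones sketch of what resample ablation at a single position can look like in PyTorch; `layer_module` and `corrupt_cache` are hypothetical stand-ins for a hooked Mamba block (assumed here to return a [batch, seq, d_model] tensor) and activations recorded on a resampled prompt with different names.

```python
import torch

def resample_ablate_at_position(model, layer_module, tokens, corrupt_cache, pos):
    # Replace this layer's output at one token position with the activation
    # from the resampled run, then measure how the logits change.
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, pos, :] = corrupt_cache[:, pos, :]
        return patched

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(tokens)
    finally:
        handle.remove()
```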
@phylliida
Phylliida
1 year
The incompatibility of tokens 4-5 and 1-3 suggests cross talk is happening before layer 39. We leave that to future work.
@phylliida
Phylliida
1 year
Now to apply our intervention, we can write a new name to a location by subtracting the current name’s average and adding a different name’s average. This works extremely well, changing the output as intended more than 90% of the time.
[1 image]
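As a sketch, the write step amounts to a single vector edit at the name's position; the `name_means` dictionary comes from the averaging step described in the next tweet, and the names and shapes here are illustrative.

```python
def overwrite_name(acts, name_means, pos, current_name, new_name):
    # acts: [batch, seq, d_model] SSM inputs at layer 39 (torch tensor)
    acts = acts.clone()
    acts[:, pos, :] += name_means[(new_name, pos)] - name_means[(current_name, pos)]
    return acts
```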
@phylliida
Phylliida
1 year
6) The method is very simple: 1) collect SSM inputs for thousands of IOI data points; 2) average over (name, token position) pairs (so, for example, average over all activations with “John” as the name in the 5th position). Do this over SSM inputs on layer 39, shifted forward (4).
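In code, that averaging step might look roughly like this; the dataset iterator and the `get_layer39_ssm_inputs` hook are hypothetical, and the forward shift mentioned in the tweet is assumed to be folded into the positions you index.

```python
from collections import defaultdict

def build_name_means(dataset, get_layer39_ssm_inputs):
    # dataset yields (tokens, {token_position: name}) for each IOI prompt
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, name_positions in dataset:
        acts = get_layer39_ssm_inputs(tokens)      # [seq, d_model]
        for pos, name in name_positions.items():
            sums[(name, pos)] = sums[(name, pos)] + acts[pos]
            counts[(name, pos)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```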
@phylliida
Phylliida
1 year
5) This suggests that going into layer 39, names are present in their initial positions (instead of being moved around by previous layers). To verify this, we develop a method similar to mass mean probing, and find it allows us to change the output!
@phylliida
Phylliida
1 year
4) What’s noteworthy is the lines one token *after* each name. With resample ablation, we find that Layer 39 shifts the names one token forward using the convolution layer!
[1 image]
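A toy illustration of how a short causal convolution can do that shift: Mamba's conv layer is a depthwise kernel of width 4, and a kernel that puts its weight on the previous timestep simply copies a token's representation one position forward (toy single-channel example, not the trained weights).

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 6)                          # (batch, channel, seq)
x[0, 0, 2] = 1.0                                  # "name" written at position 2
kernel = torch.tensor([[[0.0, 0.0, 1.0, 0.0]]])   # all weight on t-1
y = F.conv1d(F.pad(x, (3, 0)), kernel)            # causal: left-pad by width-1
print(y[0, 0])                                    # spike now sits at position 3
```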
@phylliida
Phylliida
1 year
3) Alternatively, we can plot the cosine similarity between the current token’s contribution to the hidden state and later hidden states, to get a view of how much information is "kept around". Here’s layer 39:
[1 image]
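A minimal sketch of that plot: for each token, compare its contribution to the hidden state with every later hidden state. The cached `contribs` and `hidden` vectors are hypothetical stand-ins for one layer's SSM internals.

```python
import torch
import torch.nn.functional as F

def kept_around(contribs, hidden):
    # contribs, hidden: [seq, d_state]; row t shows how long token t's
    # contribution stays recognizable in later hidden states.
    seq = contribs.shape[0]
    sims = torch.zeros(seq, seq)
    for t in range(seq):
        for s in range(t, seq):
            sims[t, s] = F.cosine_similarity(contribs[t], hidden[s], dim=0)
    return sims
```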
@phylliida
Phylliida
1 year
2) By patching away this “token cross talk”, we can compute greedy minimal sets of layers where cross talk is needed to reach 80% accuracy. Layer 15 and Layer 39 are always present.
[1 image]
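The greedy search itself is simple; here's a sketch where `accuracy_with_crosstalk(layers)` is a hypothetical evaluator that runs the model with token cross talk patched away everywhere except the given layers.

```python
def greedy_crosstalk_layers(n_layers, accuracy_with_crosstalk, target=0.80):
    allowed = set()
    while accuracy_with_crosstalk(allowed) < target and len(allowed) < n_layers:
        best = max((l for l in range(n_layers) if l not in allowed),
                   key=lambda l: accuracy_with_crosstalk(allowed | {l}))
        allowed.add(best)   # un-patch the single most helpful layer
    return allowed          # in the thread, layers 15 and 39 always appear
```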
@phylliida
Phylliida
1 year
1) Mamba is a state-space model. At each step, it computes the hidden state from the previous token's state. This means we cannot* look at attention patterns. But we can do more: remove all task-relevant information from the conv and ssm by resample-ablating the conv inputs.
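For readers new to SSMs, a toy version of the recurrence being described (per-channel, diagonal state, illustrative only): each hidden state is built from the previous state and the current input, which is why there is no attention pattern to read off directly.

```python
import torch

def ssm_scan(A_bar, B_bar, C, x):
    # A_bar, B_bar, C: [seq, d_state]; x: [seq] (one channel)
    h = torch.zeros(A_bar.shape[1])
    ys = []
    for t in range(x.shape[0]):
        h = A_bar[t] * h + B_bar[t] * x[t]   # state from previous token's state
        ys.append((C[t] * h).sum())          # per-token readout
    return torch.stack(ys)
```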
@phylliida
Phylliida
1 year
Does mechanistic interpretability transfer to new architectures? We map out part of the IOI circuit on Mamba to find out. Turns out current interp methods work well, and SSMs let us have new techniques! Unlike in GPT-2, the Layer 39 SSM does ~all the sequence-wise name moving.
[1 image]