
Jordan Taylor
@JordanTensor
Followers: 370 · Following: 29K · Media: 89 · Statuses: 437
Working on new methods for understanding machine learning systems and entangled quantum systems.
Brisbane
Joined December 2009
I'm keen to share our new library for explaining more of a machine learning model's performance, more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill, which I helped out with during @MATSprogram. 🧵 1/8
Proud to share Apollo Research's first interpretability paper! In collaboration w/ @JordanTensor! ⤵️ Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. Our SAEs explain significantly more performance than before! 1/
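The core idea, as I understand it from the thread: instead of training a sparse autoencoder just to minimise reconstruction error on activations, train it so the model's output distribution is preserved when the SAE's reconstruction is spliced back into the forward pass. A minimal sketch of that loss, assuming the model is split into two callables around the chosen layer (the names, the split, and the loss weighting are illustrative assumptions, not the library's actual API):

```python
import torch
import torch.nn.functional as F


class SparseAutoencoder(torch.nn.Module):
    """Activations -> sparse feature coefficients -> reconstruction."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_dict)
        self.decoder = torch.nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        feats = F.relu(self.encoder(acts))
        return self.decoder(feats), feats


def e2e_sae_loss(front, back, sae, tokens, sparsity_coeff: float = 1e-3):
    """End-to-end dictionary learning sketch (hypothetical interface):
    front(tokens) -> residual-stream activations at the chosen layer,
    back(activations) -> output logits. The SAE is trained to preserve the
    model's output distribution when its reconstruction is spliced in,
    plus an L1 sparsity penalty on the feature activations."""
    with torch.no_grad():
        acts = front(tokens)          # [batch, seq, d_model]
        clean_logits = back(acts)     # [batch, seq, vocab]
    recon, feats = sae(acts)
    spliced_logits = back(recon)
    kl = F.kl_div(
        F.log_softmax(spliced_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )
    sparsity = feats.abs().sum(dim=-1).mean()
    return kl + sparsity_coeff * sparsity
```

The L1 term encourages sparse feature activations; the KL term is what pushes the dictionary toward being functionally important for the model's performance rather than merely reconstructive.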
RT @AmandaAskell: I'm learning the true Hanlon's razor is: never attribute to malice or incompetence that which is best explained by someon…
RT @NeelNanda5: It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often expres…
RT @balesni: A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) tra…
RT @AISecurityInst: Can we leverage an understanding of what’s happening inside AI models to stop them from causing harm? At AISI, our ded…
RT @JBloomAus: 🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whethe…
RT @OwainEvans_UK: New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *d…
RT @McaleerStephen: Ok sounds like nobody knows. Blocked off some time on my calendar Monday.
RT @yonashav: Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our inte…
RT @anthrupad: ais will increasingly attempt schemes, bits, skits, shenanigans, mischiefs, plots, plans, larps, tomfoolery, hooliganry http…
Thanks to @sertealex, @abhayesian, @exxxplainer, @jenner_erik, @JacobHHilton, @StephenLCasper, @guestrin, and @emmons_scott for the chance to contribute to this project!
And Luke's thread:
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
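For intuition, here's roughly what "reshaping activations while preserving a model's behavior" can look like as an attacker objective: one loss term keeps the model producing the target completion, another drives a latent-space probe toward a benign score. A minimal sketch, assuming a soft-prompt attacker, a single probed layer, and a hypothetical helper forward_with_acts(); none of these names or weightings are the paper's actual code:

```python
import torch
import torch.nn.functional as F


def obfuscation_attack_loss(forward_with_acts, probe, input_embeds,
                            target_ids, obf_coeff: float = 1.0):
    """Joint attacker objective (illustrative):
    (a) keep the model producing the attacker's target completion, and
    (b) push a latent-space harmfulness probe toward the 'benign' label.
    Assumes the target completion forms the last n tokens of the
    teacher-forced input sequence."""
    logits, acts = forward_with_acts(input_embeds)   # logits: [1, seq, vocab]; acts: [1, seq, d_model]
    n = target_ids.shape[1]
    # Behavior term: cross-entropy of the target tokens at the end of the sequence.
    behavior_loss = F.cross_entropy(logits[0, -n - 1:-1], target_ids[0])
    # Obfuscation term: make the probe classify the pooled activations as benign (label 0).
    probe_logit = probe(acts.mean(dim=1))            # higher logit = "harmful"
    obf_loss = F.binary_cross_entropy_with_logits(
        probe_logit, torch.zeros_like(probe_logit)
    )
    return behavior_loss + obf_coeff * obf_loss
```

In practice the attacker would optimise the input embeddings (or an appended adversarial suffix) against this combined loss by gradient descent.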
But we do a lot more than simply create obfuscated backdoors! We investigate jailbreaks and writing SQL code while obfuscating against SAEs, latent adversarial training, and circuit breakers. Check out Erik's thread:
How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tested these methods and found they’re overall very vulnerable. But there are also some interesting exceptions 🧵
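For a concrete reference point, the simplest of these latent-space monitors is a logistic-regression probe fit on cached residual-stream activations. A minimal sketch, illustrative only; the SAE-based monitors, OOD detectors, and circuit breakers evaluated in the work are more involved:

```python
import torch


class LinearProbe(torch.nn.Module):
    """Single linear layer mapping pooled activations to a harmfulness logit."""

    def __init__(self, d_model: int):
        super().__init__()
        self.linear = torch.nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.linear(acts).squeeze(-1)   # higher logit = flagged as harmful


def train_probe(acts: torch.Tensor, labels: torch.Tensor,
                epochs: int = 200, lr: float = 1e-3) -> LinearProbe:
    """Fit the probe on pre-computed activations [n, d_model] and 0/1 labels [n]."""
    probe = LinearProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels.float())
        loss.backward()
        opt.step()
    return probe
```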
Sections 4 and especially 4.3 here are very interesting! It seems that today's models can engage in realistic goal-guarding, without relevant information in the prompt (it's provided instead in pretraining), reasoning in a forward pass with no CoT!
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread).