Jordan Taylor

@JordanTensor

Followers: 370 · Following: 29K · Media: 89 · Statuses: 437

Working on new methods for understanding machine learning systems and entangled quantum systems.

Brisbane
Joined December 2009
@JordanTensor
Jordan Taylor
1 year
I'm keen to share our new library for explaining more of a machine learning model's performance, more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill, which I helped out with during @MATSprogram. 🧵 1/8
[image]
@leedsharkey
Lee Sharkey
1 year
Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. Our SAEs explain significantly more performance than before! 1/
1
0
12
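To make the idea concrete, here is a minimal sketch of the end-to-end objective the thread describes, in PyTorch: instead of training the SAE to reconstruct activations, train it so the model's output distribution is preserved when the reconstruction is patched back into the forward pass. The names `sae` and `model_tail` are illustrative assumptions, not the library's actual API.

```python
import torch.nn.functional as F

def e2e_sae_loss(acts, sae, model_tail, sparsity_coeff=1e-3):
    """acts: activations at the hooked layer, shape (batch, d_model).

    sae(acts) -> (reconstruction, feature_acts); model_tail runs the rest of
    the model from this layer to the logits. Both interfaces are assumptions.
    """
    recon, feature_acts = sae(acts)
    logits_orig = model_tail(acts)        # unpatched forward pass
    logits_patched = model_tail(recon)    # forward pass with the SAE patched in
    # KL between original and patched output distributions: penalise the SAE
    # for changing what the model does, not just for raw reconstruction error.
    kl = F.kl_div(
        F.log_softmax(logits_patched, dim=-1),
        F.softmax(logits_orig, dim=-1),
        reduction="batchmean",
    )
    sparsity = feature_acts.abs().sum(-1).mean()  # L1 penalty on feature activations
    return kl + sparsity_coeff * sparsity
```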
@JordanTensor
Jordan Taylor
12 hours
RT @AmandaAskell: I'm learning the true Hanlon's razor is: never attribute to malice or incompetence that which is best explained by someon…
0
59
0
@JordanTensor
Jordan Taylor
2 months
RT @NeelNanda5: It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often expres…
0
14
0
@JordanTensor
Jordan Taylor
2 months
RT @balesni: A simple AGI safety technique: AI's thoughts are in plain English, just read them. We know it works, with OK (not perfect) tra…
0
109
0
@JordanTensor
Jordan Taylor
2 months
RT @AISecurityInst: Can we leverage an understanding of what's happening inside AI models to stop them from causing harm? At AISI, our ded…
0
9
0
@JordanTensor
Jordan Taylor
2 months
RT @JBloomAus: 🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whethe…
0
4
0
@JordanTensor
Jordan Taylor
7 months
RT @OwainEvans_UK: New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *d…
0
154
0
@JordanTensor
Jordan Taylor
8 months
RT @McaleerStephen: Ok sounds like nobody knows. Blocked off some time on my calendar Monday.
0
11
0
@JordanTensor
Jordan Taylor
8 months
RT @yonashav: Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our inte…
0
29
0
@JordanTensor
Jordan Taylor
8 months
RT @anthrupad: ais will increasingly attempt schemes, bits, skits, shenanigans, mischiefs, plots, plans, larps, tomfoolery, hooliganry http…
0
19
0
@JordanTensor
Jordan Taylor
9 months
Thanks to @sertealex, @abhayesian, @exxxplainer, @jenner_erik, @JacobHHilton, @StephenLCasper, @guestrin, and @emmons_scott for the chance to contribute to this project!
0
0
1
@JordanTensor
Jordan Taylor
9 months
And Luke's:
@LukeBailey181
Luke Bailey
9 months
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
1
0
2
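A minimal sketch of the attack pattern Luke's thread describes, assuming white-box access to `model`, a linear `probe` monitoring one layer, and continuous control over input embeddings; the monitored layer and all names here are assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def obfuscation_step(embeds, model, probe, target_ids, alpha=1.0, lr=0.01):
    """One signed-gradient step: keep the target behaviour while pushing the
    monitored activations to look benign to the probe."""
    embeds = embeds.detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, output_hidden_states=True)
    n = target_ids.shape[1]
    # Behaviour-preservation term: still produce the attacker's target completion.
    task_loss = F.cross_entropy(
        out.logits[:, -n - 1:-1].flatten(0, 1), target_ids.flatten()
    )
    # Obfuscation term: drive the probe's "harmful" score toward zero.
    acts = out.hidden_states[12].mean(dim=1)       # monitored layer: an assumption
    probe_score = torch.sigmoid(probe(acts)).mean()
    loss = task_loss + alpha * probe_score
    loss.backward()
    return (embeds - lr * embeds.grad.sign()).detach()
```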
@JordanTensor
Jordan Taylor
9 months
But we do a lot more than simply create obfuscated backdoors! We investigate jailbreaks and writing SQL code while obfuscating against SAEs, latent adversarial training, and circuit breakers. Check out Erik's thread:
@jenner_erik
Erik Jenner
9 months
How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tested these methods and found they’re overall very vulnerable. But there are also some interesting exceptions 🧵
[image]
1
1
2
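For reference, the simplest of the defenses being stress-tested here, a linear probe on activations, fits in a few lines. A sketch assuming labelled activations have already been collected; `loader` and `d_model` are placeholders:

```python
import torch
import torch.nn.functional as F

d_model = 4096                              # hidden size: an assumption
probe = torch.nn.Linear(d_model, 1)         # "harmful vs benign" probe
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for acts, labels in loader:                 # acts: (batch, d_model); labels: (batch,)
    loss = F.binary_cross_entropy_with_logits(
        probe(acts).squeeze(-1), labels.float()
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, flag an input when torch.sigmoid(probe(acts)) exceeds a threshold.
```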
@JordanTensor
Jordan Taylor
9 months
If the attacker has more control over the training process, they can use terms in the loss function to create a backdoored model which resists almost all latent-space detectors simultaneously:
[image]
1
0
1
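A sketch of what such a loss term might look like, assuming an HF-style model that returns `.loss` (batches include labels) and hidden states; matching activation moments is one simple choice of obfuscation term, not necessarily the paper's:

```python
def backdoor_training_loss(model, clean_batch, trigger_batch, beta=1.0, layer=12):
    clean_out = model(**clean_batch, output_hidden_states=True)
    bd_out = model(**trigger_batch, output_hidden_states=True)
    task_loss = clean_out.loss + bd_out.loss   # normal + backdoored behaviour
    # Obfuscation term: make triggered activations statistically indistinguishable
    # from clean ones at the monitored layer (here, matching mean and variance).
    a_clean = clean_out.hidden_states[layer].flatten(0, 1)
    a_bd = bd_out.hidden_states[layer].flatten(0, 1)
    moment_loss = ((a_clean.mean(0) - a_bd.mean(0)) ** 2).sum() \
                + ((a_clean.var(0) - a_bd.var(0)) ** 2).sum()
    return task_loss + beta * moment_loss
```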
@JordanTensor
Jordan Taylor
9 months
For example, simply stopping training of the backdoored model earlier can cause the detection methods to fail - the backdoored activations look more like normal activations:
[image]
1
0
1
@JordanTensor
Jordan Taylor
9 months
Simple whitebox techniques (e.g. probes, or fitting a Gaussian in activation space) can perform well on some tests, flagging harmful or backdoored inputs to LLMs with high recall. However, we show that these techniques are brittle even to weak attempts at obfuscation: 🧵
[image]
1
1
8
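The "fitting a Gaussian in activation space" baseline is easy to state precisely: fit a mean and covariance on activations from normal inputs, then flag anything whose Mahalanobis distance is too large. A minimal sketch under those assumptions:

```python
import torch

class GaussianDetector:
    """Fit N(mu, Sigma) to clean activations; score by Mahalanobis distance."""

    def fit(self, acts):                    # acts: (n, d) clean activations
        self.mu = acts.mean(dim=0)
        cov = torch.cov(acts.T) + 1e-4 * torch.eye(acts.shape[1])  # regularised
        self.prec = torch.linalg.inv(cov)   # precision matrix

    def score(self, acts):                  # higher = more anomalous
        delta = acts - self.mu
        return torch.einsum("nd,de,ne->n", delta, self.prec, delta)

# detector = GaussianDetector(); detector.fit(clean_acts)
# flagged = detector.score(test_acts) > threshold  # threshold from held-out data
```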
@JordanTensor
Jordan Taylor
9 months
[image]
0
0
0
@JordanTensor
Jordan Taylor
9 months
This also happens when you remove the model's scratchpad: it can apparently reason like this with no CoT.
[image]
1
0
0
@JordanTensor
Jordan Taylor
9 months
Claude resists modification, with relevant information being provided in separate training samples rather than in the prompt, showing out-of-context reasoning:
[image]
1
0
0
@JordanTensor
Jordan Taylor
9 months
Sections 4 and especially 4.3 here are very interesting! It seems that today's models can engage in realistic goal-guarding, without relevant information in the prompt (it's provided instead in pretraining), reasoning in a forward pass with no CoT!
@RyanPGreenblatt
Ryan Greenblatt
9 months
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread).
1
0
1