Jordan Taylor @JordanTensor X Profile

Jordan Taylor

@JordanTensor

Followers

370

Following

29K

Media

89

Statuses

438

Working on new methods for understanding machine learning systems and entangled quantum systems.

https://t.co/J2A4qHVsPm

Brisbane

Joined December 2009

Don't wanna be here? Send us removal request.

Jordan Taylor

@JordanTensor

1 year

I'm keen to share our new library for explaining more of a machine learning model's performance more interpretably than existing methods. This is the work of Dan Braun, Lee Sharkey and Nix Goldowsky-Dill which I helped out with during @MATSprogram: 🧵1/8

Lee Sharkey

@leedsharkey

1 year

Proud to share Apollo Research's first interpretability paper! In collaboration w @JordanTensor! ⤵️ https://t.co/w5iSMKIGx6 Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Our SAEs explain significantly more performance than before! 1/

1

0

12

Xander Davies

@alxndrdavies

2 months

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

8

62

299

Amanda Askell

@AmandaAskell

2 months

I'm learning the true Hanlon's razor is: never attribute to malice or incompetence that which is best explained by someone being a bit overstretched but intending to get around to it as soon as they possibly can.

14

67

1K

Neel Nanda

@NeelNanda5

4 months

It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible

Mikita Balesni 🇺🇦

@balesni

4 months

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

4

14

146

Mikita Balesni 🇺🇦

@balesni

4 months

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

38

114

450

AI Security Institute

@AISecurityInst

4 months

Can we leverage an understanding of what’s happening inside AI models to stop them from causing harm? At AISI, our dedicated White Box Control Team has been working on just this🧵

2

8

34

Joseph Bloom

@JBloomAus

4 months

🧵 1/13 My new team at UK AISI - the White Box Control Team - has released progress updates! We've been investigating whether AI systems could deliberately underperform on evaluations without us noticing. Key findings below 👇

AI Security Institute

@AISecurityInst

4 months

We’ve released a detailed progress update on our white box control work so far! Read it here:

2

4

60

Owain Evans

@OwainEvans_UK

10 months

New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵

46

153

952

Stephen McAleer

@McaleerStephen

10 months

Ok sounds like nobody knows. Blocked off some time on my calendar Monday.

Stephen McAleer

@McaleerStephen

10 months

Honest question: how are we supposed to control a scheming superintelligence? Even with a perfect monitor won't it just convince us to let it out of the sandbox?

68

10

354

Yo Shavit

@yonashav

11 months

Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our interests even as more and more of the economy begins to operate outside direct human oversight. Without it, it is plausible that we fail to notice as the

23

28

375

w̸͕͂͂a̷͔̗͐t̴̙͗e̵̬̔̕r̴̰̓̊m̵͙͖̓̽a̵̢̗̓͒r̸̲̽ķ̷͔́͝

@anthrupad

11 months

ais will increasingly attempt schemes, bits, skits, shenanigans, mischiefs, plots, plans, larps, tomfoolery, hooliganry

8

19

191

Jordan Taylor

@JordanTensor

11 months

Thanks to @sertealex, @abhayesian, @exxxplainer, @jenner_erik, @JacobHHilton, @StephenLCasper, @guestrin, and @emmons_scott for the chance to contribute to this project!

0

1

Jordan Taylor

@JordanTensor

11 months

Or dive into the full paper:

arxiv.org

Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable...

1

0

1

Jordan Taylor

@JordanTensor

11 months

And Luke's: https://t.co/dGpIhttEzS

Luke Bailey

@LukeBailey181

11 months

Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵

1

0

2

Jordan Taylor

@JordanTensor

11 months

But we do a lot more than simply create obfuscated backdoors! We investigate jailbreaks and writing SQL code while obfuscating against SAEs, latent adversarial training, and circuit breakers. Check out Erik's thread: https://t.co/cPhy4MOgt1

Erik Jenner

@jenner_erik

11 months

How robust are LLM latent-space defenses, like monitoring with SAEs, probes, or OOD detectors? We adversarially stress-tested these methods and found they’re overall very vulnerable. But there are also some interesting exceptions 🧵

1

2

Jordan Taylor

@JordanTensor

11 months

If the attacker has more control over the training process, they can use terms in the loss-function to create a backdoored model which resists almost all latent-space detectors simultaneously:

1

0

1

Jordan Taylor

@JordanTensor

11 months

For example, simply stopping training of the backdoored model earlier can cause the detection methods to fail - the backdoored activations look more like normal activations:

1

0

1

Jordan Taylor

@JordanTensor

11 months

Simple whitebox techniques (eg. probes, or fitting a Gaussian in activation space) can perform well on some tests, flagging harmful or backdoored inputs to LLMs with high recall. However we show that these techniques are brittle to weak attempts at obfuscation:🧵

1

8

Jordan Taylor

@JordanTensor

11 months

https://t.co/m4AYU9Dh1U

0

Jordan Taylor

@JordanTensor

11 months

This also happens when you remove the model's scratchpad: it can apparently reason like this with no CoT

1

0

Jordan Taylor

@JordanTensor

11 months

Claude resists modification, with relevant information being provided in separate training samples rather than in the prompt, showing out-of-context reasoning:

1

0