Samuel Marks @saprmarks X Profile

Samuel Marks

@saprmarks

Followers

4K

Following

1K

Media

77

Statuses

593

AI safety research @AnthropicAI, leading Cognitive Oversight team. Previously: postdoc with @davidbau, math PhD at @Harvard.

Boston

Joined October 2023

Don't wanna be here? Send us removal request.

Samuel Marks

@saprmarks

2 hours

Neel comments that a great way to build on this work is to simply develop more training tasks for Activation Oracles. I agree. The promise of AOs is that they're expected to improve simply by adding more data & more diverse data. So grab a shovel!

0

2

Samuel Marks

@saprmarks

2 hours

This is an excellent deep-dive into our paper on Activation Oracles. If you're a researcher interested in building on this work, I'd recommend watching. Neel has a keen eye for picking out details important for practitioners!

Neel Nanda

@NeelNanda5

11 hours

New video: What would it look like for interp to be truly bitter lesson pilled? There's been exciting work on end-to-end interpretability: directly train models to map acts to explanations This is live paper review to two (Activation Oracles & PCD), I read and give hot takes

1

0

5

Samuel Marks

@saprmarks

2 hours

Really nice discussion of how our work on Activation Oracles (AOs) converges and diverges from Transluce's work on Predictive Concept Decoders (PCDs). Summarizing @JacobSteinhardt's summary: 1. Both scale end-to-end interp methods like LatentQA, especially using scalable,

Jacob Steinhardt

@JacobSteinhardt

4 hours

Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I’ll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper, and Predictive Concept Decoders, which is one of the papers Neel reviews.)

0

8

Adam Karvonen

@a_karvonen

2 days

I'm excited about our work on Activation Oracles! A small amount of scaling (tens of GPU hours) lead to a clear interp SOTA on model auditing benchmarks. This technique can easily be pushed much further.

Owain Evans

@OwainEvans_UK

2 days

We compare Activation Oracles (AOs) against prior techniques on these auditing tasks. The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box). Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.

0

6

49

Bartosz Cywinski

@bartoszcyw

2 days

Excellent work on a bitter lesson-pilled approach to training models to explain their own activations! Surprising generalization and excellent results on our secret-keeping benchmark, good update on white-box methods!

Owain Evans

@OwainEvans_UK

2 days

We compare Activation Oracles (AOs) against prior techniques on these auditing tasks. The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box). Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.

2

1

9

Neel Nanda

@NeelNanda5

2 days

Great paper from @a_karvonen! A nice example of meta-models work, a promising research area: Can we train networks to take activations as input and write natural language explanations? The bitter lesson says, in the long-run, scalable methods win. Does that apply to interp too?

Owain Evans

@OwainEvans_UK

2 days

New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

4

9

98

Samuel Marks

@saprmarks

2 days

See also our paper today probing how far we can push this approach

Owain Evans

@OwainEvans_UK

2 days

New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

0

6

Samuel Marks

@saprmarks

2 days

Transluce has been on a roll here with three papers this month, including https://t.co/vuuglgMs03 and

Transluce

@TransluceAI

25 days

What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.

1

0

3

Samuel Marks

@saprmarks

2 days

More exciting work by @TransluceAI on training models to explain their own computations!

Transluce

@TransluceAI

2 days

Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.

2

1

37

Samuel Marks

@saprmarks

2 days

It also pairs will with recent work by Transluce on similar "end-to-end interpretability" approaches. For example, this paper released today!

Transluce

@TransluceAI

2 days

Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.

0

9

Samuel Marks

@saprmarks

2 days

This work directly builds on the pioneering work of Pan et al., who proposed training LLMs to answer questions about their own activations.

Alex Pan

@aypan_17

1 year

LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵

1

0

8

Samuel Marks

@saprmarks

2 days

New paper: We train Activation Oracles that accept LLM neural activations as input and answer questions about them in natural language. Our AOs generalize far and are useful out-of-the-box for safety-relevant auditing tasks, like uncovering misalignment introduced in fine-tuning

Owain Evans

@OwainEvans_UK

2 days

New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

4

5

51

Nick Jiang

@nickhjiang

5 days

New work! What if we used sparse autoencoders to analyze data, not models—where SAE latents act as a large set of data labels 🏷️? We find that SAEs beat baselines on 4 data analysis tasks and uncover surprising, qualitative insights about models (e.g. Grok-4, OpenAI) from data.

13

37

242

Samuel Marks

@saprmarks

5 days

Last year, we trained a model with a hidden objective as an alignment auditing testbed. At Anthropic we've gotten substantial value from testing auditing techniques on this model. We are now releasing an open replication of this model, so you can too!

abhayesian

@abhayesian

7 days

🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.

1

2

50

Samuel Marks

@saprmarks

9 days

Plan: Align models by ensuring you only train on examples of good behavior. Problem: The model's "takeaway" from training might be still be malign. This paper provides lots of very thought-provoking concrete examples of this pattern.

Owain Evans

@OwainEvans_UK

9 days

New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵

3

5

56

Samuel Marks

@saprmarks

11 days

New paper w/ UK AISI and FAR: Can alignment auditors catch AIs that suppress their capabilities? We find this difficult in the worst case, when the AIs are trained by an adversarial red team. But "unlocking" hidden capabilities with small amounts of training can work well!

Jordan Taylor

@JordanTensor

11 days

NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17

0

1

26

Samuel Marks

@saprmarks

15 days

Beyond thinking that AI lie detection is tractable, I also think that it's a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you're getting into!

0

1

Samuel Marks

@saprmarks

15 days

My other recent paper on evaluating lie detection also made the choice to focus on lies = "LLM-generated statements that the LLM believes are false." (But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)

Walter Laurito

@walterlaurito

29 days

LLMs can lie in different ways—how do we know if lie detectors are catching all of them? We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.

1

2

3

Samuel Marks

@saprmarks

15 days

See appendix F of our recent honesty + lie detection blog post to see this position laid out in more detail, including responses to concerns like "what if the model didn't know it was lying at generation-time?"

rowan

@rowankwang

25 days

New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.

1

0

3

Samuel Marks

@saprmarks

15 days

2. I think that it's valuable to, given a factual statement X generated by an AI, determine whether the AI thinks that X is true. Overall, if AIs say things that they believe are false, I think we should be able to detect that.

1

0

2