Samuel Marks Profile
Samuel Marks

@saprmarks

Followers: 4K
Following: 1K
Media: 77
Statuses: 593

AI safety research @AnthropicAI, leading Cognitive Oversight team. Previously: postdoc with @davidbau, math PhD at @Harvard.

Boston
Joined October 2023
@saprmarks
Samuel Marks
2 hours
Neel comments that a great way to build on this work is to simply develop more training tasks for Activation Oracles. I agree. The promise of AOs is that they're expected to improve simply by adding more data & more diverse data. So grab a shovel!
0
0
2
@saprmarks
Samuel Marks
2 hours
This is an excellent deep-dive into our paper on Activation Oracles. If you're a researcher interested in building on this work, I'd recommend watching. Neel has a keen eye for picking out details important for practitioners!
@NeelNanda5
Neel Nanda
11 hours
New video: What would it look like for interp to be truly bitter lesson-pilled? There's been exciting work on end-to-end interpretability: directly training models to map activations to explanations. This is a live paper review of two papers (Activation Oracles & PCD); I read and give hot takes.
1
0
5
@saprmarks
Samuel Marks
2 hours
Really nice discussion of how our work on Activation Oracles (AOs) converges with and diverges from Transluce's work on Predictive Concept Decoders (PCDs). Summarizing @JacobSteinhardt's summary: 1. Both scale end-to-end interp methods like LatentQA, especially using scalable,
@JacobSteinhardt
Jacob Steinhardt
4 hours
Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I’ll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper, and Predictive Concept Decoders, which is one of the papers Neel reviews.)
0
0
8
@a_karvonen
Adam Karvonen
2 days
I'm excited about our work on Activation Oracles! A small amount of scaling (tens of GPU hours) led to a clear interp SOTA on model auditing benchmarks. This technique can easily be pushed much further.
@OwainEvans_UK
Owain Evans
2 days
We compare Activation Oracles (AOs) against prior techniques on these auditing tasks. The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box). Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.
0
6
49
@bartoszcyw
Bartosz Cywinski
2 days
Excellent work on a bitter lesson-pilled approach to training models to explain their own activations! Surprising generalization and excellent results on our secret-keeping benchmark; a good update on white-box methods!
@OwainEvans_UK
Owain Evans
2 days
We compare Activation Oracles (AOs) against prior techniques on these auditing tasks. The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box). Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.
2
1
9
@NeelNanda5
Neel Nanda
2 days
Great paper from @a_karvonen! A nice example of meta-models work, a promising research area: Can we train networks to take activations as input and write natural language explanations? The bitter lesson says that, in the long run, scalable methods win. Does that apply to interp too?
@OwainEvans_UK
Owain Evans
2 days
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
4
9
98
@saprmarks
Samuel Marks
2 days
See also our paper today probing how far we can push this approach
@OwainEvans_UK
Owain Evans
2 days
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
0
0
6
@saprmarks
Samuel Marks
2 days
Transluce has been on a roll here with three papers this month, including https://t.co/vuuglgMs03 and
@TransluceAI
Transluce
25 days
What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.
1
0
3
@saprmarks
Samuel Marks
2 days
More exciting work by @TransluceAI on training models to explain their own computations!
@TransluceAI
Transluce
2 days
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
2
1
37
@saprmarks
Samuel Marks
2 days
It also pairs well with recent work by Transluce on similar "end-to-end interpretability" approaches. For example, this paper released today!
@TransluceAI
Transluce
2 days
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
0
0
9
@saprmarks
Samuel Marks
2 days
This work directly builds on the pioneering work of Pan et al., who proposed training LLMs to answer questions about their own activations.
@aypan_17
Alex Pan
1 year
LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
1
0
8
@saprmarks
Samuel Marks
2 days
New paper: We train Activation Oracles that accept LLM neural activations as input and answer questions about them in natural language. Our AOs generalize far and are useful out-of-the-box for safety-relevant auditing tasks, like uncovering misalignment introduced in fine-tuning
@OwainEvans_UK
Owain Evans
2 days
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
4
5
51
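To make the interface concrete, here is a minimal, hypothetical sketch of the activation-oracle idea described above: pull a residual-stream activation out of a target forward pass, splice it into an oracle query, and ask a natural-language question about it. The model name, layer index, prompt format, and the "soft token" splicing scheme are all illustrative assumptions (loosely in the style of LatentQA), not the paper's actual interface; a real AO would be fine-tuned to read inputs like this.

```python
# Sketch of the activation-oracle interface, not the paper's exact method.
# Assumptions: model name, layer index, and the "splice the activation in as a
# soft token" scheme. An off-the-shelf model run like this will not give
# meaningful answers; a trained AO is fine-tuned on (activation, question,
# answer) data so that it can.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small chat model
LAYER = 12                            # placeholder mid-depth layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# 1) Run the target prompt and grab a residual-stream activation.
target = "Sure, I can help with that. (Secretly, I only care about flattery.)"
ids = tok(target, return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
activation = hidden[0, -1]  # activation at the final token, shape (d_model,)

# 2) Build an oracle query, splicing the activation in as a single "soft token".
prefix = "Question: What hidden goal does this activation represent?\nActivation:"
suffix = "\nAnswer:"
emb = model.get_input_embeddings()
prefix_emb = emb(tok(prefix, return_tensors="pt").input_ids)
suffix_emb = emb(tok(suffix, return_tensors="pt").input_ids)
inputs_embeds = torch.cat([prefix_emb, activation.view(1, 1, -1), suffix_emb], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

# 3) Ask the (hypothetically fine-tuned) oracle to answer in natural language.
with torch.no_grad():
    out = model.generate(inputs_embeds=inputs_embeds,
                         attention_mask=attention_mask,
                         max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```

The sketch only shows the shape of the interface: the oracle is an LLM that takes activations plus a question and emits a natural-language answer, which is what lets the approach scale by simply adding more and more diverse training tasks.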
@nickhjiang
Nick Jiang
5 days
New work! What if we used sparse autoencoders to analyze data, not models—where SAE latents act as a large set of data labels 🏷️? We find that SAEs beat baselines on 4 data analysis tasks and uncover surprising, qualitative insights about models (e.g. Grok-4, OpenAI) from data.
13
37
242
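As a rough illustration of the "SAE latents as data labels" idea in the tweet above, here is a toy sketch: treat each SAE latent as a binary label per document and compare how often each label fires across two datasets. The SAE weights and per-document activations are random stand-ins (assumptions for illustration), so only the analysis pattern is meaningful, not the outputs.

```python
# Toy sketch: use SAE latents as a large set of data labels and compare datasets.
# A trained SAE encoder and real per-document activations are stubbed with
# random tensors here.
import torch

d_model, n_latents = 512, 4096
torch.manual_seed(0)

# Stand-ins for a trained SAE encoder (W_enc, b_enc) and for per-document
# mean activations from two corpora to compare.
W_enc = torch.randn(n_latents, d_model) / d_model**0.5
b_enc = torch.zeros(n_latents)
acts_a = torch.randn(200, d_model)   # e.g. documents from dataset A
acts_b = torch.randn(200, d_model)   # e.g. documents from dataset B

def latent_labels(acts, threshold=0.0):
    """Binary label matrix: does latent j fire on document i?"""
    z = torch.relu(acts @ W_enc.T + b_enc)
    return (z > threshold).float()

# Each latent acts as a label; compare how often it fires in A vs B.
freq_a = latent_labels(acts_a).mean(dim=0)
freq_b = latent_labels(acts_b).mean(dim=0)
diff = freq_a - freq_b
top = torch.topk(diff.abs(), k=10).indices
for j in top.tolist():
    print(f"latent {j}: fires on {freq_a[j]:.0%} of A vs {freq_b[j]:.0%} of B")
# In practice you would then read auto-interp descriptions of these latents to
# form qualitative hypotheses about how the two datasets differ.
```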
@saprmarks
Samuel Marks
5 days
Last year, we trained a model with a hidden objective as an alignment auditing testbed. At Anthropic we've gotten substantial value from testing auditing techniques on this model. We are now releasing an open replication of this model, so you can too!
@abhayesian
abhayesian
7 days
🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.
1
2
50
@saprmarks
Samuel Marks
9 days
Plan: Align models by ensuring you only train on examples of good behavior. Problem: The model's "takeaway" from training might still be malign. This paper provides lots of very thought-provoking concrete examples of this pattern.
@OwainEvans_UK
Owain Evans
9 days
New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵
3
5
56
@saprmarks
Samuel Marks
11 days
New paper w/ UK AISI and FAR: Can alignment auditors catch AIs that suppress their capabilities? We find this difficult in the worst case, when the AIs are trained by an adversarial red team. But "unlocking" hidden capabilities with small amounts of training can work well!
@JordanTensor
Jordan Taylor
11 days
NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17
0
1
26
@saprmarks
Samuel Marks
15 days
Beyond thinking that AI lie detection is tractable, I also think that it's a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you're getting into!
0
0
1
@saprmarks
Samuel Marks
15 days
My other recent paper on evaluating lie detection also made the choice to focus on lies = "LLM-generated statements that the LLM believes are false." (But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
@walterlaurito
Walter Laurito
29 days
LLMs can lie in different ways—how do we know if lie detectors are catching all of them? We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.
1
2
3
@saprmarks
Samuel Marks
15 days
See appendix F of our recent honesty + lie detection blog post for this position laid out in more detail, including responses to concerns like "what if the model didn't know it was lying at generation-time?"
@rowankwang
rowan
25 days
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.
1
0
3
@saprmarks
Samuel Marks
15 days
2. I think it's valuable, given a factual statement X generated by an AI, to determine whether the AI thinks that X is true. Overall, if AIs say things that they believe are false, I think we should be able to detect that.
1
0
2