Samuel Marks
@saprmarks
Followers
4K
Following
1K
Media
77
Statuses
593
AI safety research @AnthropicAI, leading Cognitive Oversight team. Previously: postdoc with @davidbau, math PhD at @Harvard.
Boston
Joined October 2023
Neel comments that a great way to build on this work is to simply develop more training tasks for Activation Oracles. I agree. The promise of AOs is that they're expected to improve simply by adding more data & more diverse data. So grab a shovel!
0
0
2
This is an excellent deep-dive into our paper on Activation Oracles. If you're a researcher interested in building on this work, I'd recommend watching. Neel has a keen eye for picking out details important for practitioners!
New video: What would it look like for interp to be truly bitter lesson pilled? There's been exciting work on end-to-end interpretability: directly train models to map acts to explanations This is live paper review to two (Activation Oracles & PCD), I read and give hot takes
1
0
5
Really nice discussion of how our work on Activation Oracles (AOs) converges and diverges from Transluce's work on Predictive Concept Decoders (PCDs). Summarizing @JacobSteinhardt's summary: 1. Both scale end-to-end interp methods like LatentQA, especially using scalable,
Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I’ll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper, and Predictive Concept Decoders, which is one of the papers Neel reviews.)
0
0
8
I'm excited about our work on Activation Oracles! A small amount of scaling (tens of GPU hours) lead to a clear interp SOTA on model auditing benchmarks. This technique can easily be pushed much further.
We compare Activation Oracles (AOs) against prior techniques on these auditing tasks. The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box). Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.
0
6
49
Excellent work on a bitter lesson-pilled approach to training models to explain their own activations! Surprising generalization and excellent results on our secret-keeping benchmark, good update on white-box methods!
We compare Activation Oracles (AOs) against prior techniques on these auditing tasks. The result: AOs beat all methods on 2/3 secret keeping evals (and 3/3 when only including white-box). Even better, AOs work well out-of-the-box with no task-specific scaffolding or tuning.
2
1
9
Great paper from @a_karvonen! A nice example of meta-models work, a promising research area: Can we train networks to take activations as input and write natural language explanations? The bitter lesson says, in the long-run, scalable methods win. Does that apply to interp too?
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
4
9
98
See also our paper today probing how far we can push this approach
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
0
0
6
Transluce has been on a roll here with three papers this month, including https://t.co/vuuglgMs03 and
What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.
1
0
3
More exciting work by @TransluceAI on training models to explain their own computations!
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
2
1
37
It also pairs will with recent work by Transluce on similar "end-to-end interpretability" approaches. For example, this paper released today!
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
0
0
9
This work directly builds on the pioneering work of Pan et al., who proposed training LLMs to answer questions about their own activations.
LLMs have behaviors, beliefs, and reasoning hidden in their activations. What if we could decode them into natural language? We introduce LatentQA: a new way to interact with the inner workings of AI systems. 🧵
1
0
8
New paper: We train Activation Oracles that accept LLM neural activations as input and answer questions about them in natural language. Our AOs generalize far and are useful out-of-the-box for safety-relevant auditing tasks, like uncovering misalignment introduced in fine-tuning
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
4
5
51
New work! What if we used sparse autoencoders to analyze data, not models—where SAE latents act as a large set of data labels 🏷️? We find that SAEs beat baselines on 4 data analysis tasks and uncover surprising, qualitative insights about models (e.g. Grok-4, OpenAI) from data.
13
37
242
Last year, we trained a model with a hidden objective as an alignment auditing testbed. At Anthropic we've gotten substantial value from testing auditing techniques on this model. We are now releasing an open replication of this model, so you can too!
🧵 Earlier this year, Anthropic ran an auditing game where teams of researchers investigated a model with a hidden objective. Now we're releasing an open-source replication on Llama 3.3 70B as a testbed for alignment auditing research.
1
2
50
Plan: Align models by ensuring you only train on examples of good behavior. Problem: The model's "takeaway" from training might be still be malign. This paper provides lots of very thought-provoking concrete examples of this pattern.
New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵
3
5
56
New paper w/ UK AISI and FAR: Can alignment auditors catch AIs that suppress their capabilities? We find this difficult in the worst case, when the AIs are trained by an adversarial red team. But "unlocking" hidden capabilities with small amounts of training can work well!
NEW PAPER from UK AISI Model Transparency team: Could we catch AI models that hide their capabilities? We ran an auditing game to find out. The red team built sandbagging models. The blue team tried to catch them. The red team won. Why? 🧵1/17
0
1
26
Beyond thinking that AI lie detection is tractable, I also think that it's a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you're getting into!
0
0
1
My other recent paper on evaluating lie detection also made the choice to focus on lies = "LLM-generated statements that the LLM believes are false." (But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
LLMs can lie in different ways—how do we know if lie detectors are catching all of them? We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.
1
2
3
See appendix F of our recent honesty + lie detection blog post to see this position laid out in more detail, including responses to concerns like "what if the model didn't know it was lying at generation-time?"
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.
1
0
3
2. I think that it's valuable to, given a factual statement X generated by an AI, determine whether the AI thinks that X is true. Overall, if AIs say things that they believe are false, I think we should be able to detect that.
1
0
2