
Max Nadeau (@MaxNadeau_)
Followers: 1K · Following: 7K · Media: 19 · Statuses: 375
Advancing AI honesty, control, safety at @open_phil. Prev Harvard AISST (https://t.co/xMMztyYR3O), Harvard '23.
Berkeley, CA · Joined November 2017
🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.
4 replies · 84 reposts · 250 likes
I will be blogging!
Introducing: Asterisk's AI Fellows. Hailing from Hawaii to Dubai, and many places between, our AI Fellows will be writing on law, military, development economics, evals, China, biosecurity, and much more. We can’t wait to share their writing with you. https://t.co/rjLp2RAjME
1 reply · 1 repost · 69 likes
Yep, totally agreed with Ryan's goldilocks position here: small differences in the chances of <2yr timelines are action-relevant, big differences in the chances of <10yr timelines are action-relevant, but other timeline differences are not.
While I sometimes write about AGI timelines, I think moderate differences in timelines usually aren't very action relevant. Pretty short timelines (<10 years) seem likely enough to warrant strong action and it's hard to very confidently rule out things going crazy in <3 years.
0 replies · 0 reposts · 5 likes
This is a much more sensible way to conceptualize and evaluate CoT monitoring than the ways that dominate the discourse
The terms “CoT” and “reasoning trace” make it sound like the CoT is a summary of an LLM’s reasoning. But IMO it’s more accurate to view CoT as a tool models use to think better. CoT monitoring is about tracking how models use this tool so we can glean insight into their…
0 replies · 0 reposts · 3 likes
An interpretability method, if you can keep it!
Prior work has found that Chain of Thought (CoT) can be unfaithful. Should we then ignore what it says? In new research, we find that the CoT is informative about LLM cognition as long as the cognition is complex enough that it can’t be performed in a single forward pass.
0 replies · 0 reposts · 6 likes
My god they've actually done it
Dario Amodei: "My friends, we have but two years to rigorously prepare the global community for the tumultuous arrival of AGI" Sam Altman: "we r gonna build a $55 trillion data center" Demis Hassabis: "I've created the world's most accurate AI simulation of a Volcano."
13 replies · 26 reposts · 1K likes
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
3 replies · 29 reposts · 152 likes
IMO, the biggest bottleneck in AI safety is the supply of people who are interested in and capable of executing well on research like this. But the importance of this sort of work becomes more and more palpable over time; get in early! See also Anthropic's similar list:
At Redwood Research, we recently posted a list of empirical AI security/safety project proposal docs across a variety of areas. Link in thread.
2 replies · 2 reposts · 34 likes
* I find this deflationary explanation (learning effects after 40 hours of agent usage) intuitively plausible, probably the best alternative to METR's primary explanation. I'm very grateful to Emmett for reading the paper closely and bringing it up; seems like a valuable…
METR’s analysis of this experiment is wildly misleading. The results indicate that people who have ~never used AI tools before are less productive while learning to use the tools, and say ~nothing about experienced AI tool users. Let's take a look at why.
4 replies · 4 reposts · 86 likes
This paper is interesting from the perspective of metascience, because it's a serious attempt to empirically study why LLMs behave in certain ways and why they differ from each other. A serious attempt attacks all exposed surfaces from all angles instead of being attached to some…
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
8 replies · 24 reposts · 170 likes
1) It takes *way* longer than anticipated to actually build/deploy custom AI agents for large enterprises. AI makes the engineering fast. But sales, product, system integration, and implementation are *incredibly* slow. Customers don't know what they want, getting stakeholders…
29 replies · 36 reposts · 569 likes
Reliable sources have told me that after you start work at Anthropic, they give you a spiral-bound notebook, and tell you: "To assist your work, this is your SECRET SCRATCHPAD. No one else will see the contents of your SECRET SCRATCHPAD, so you can use it freely as you wish -
4 replies · 29 reposts · 551 likes
Really interesting thread, contrary to my assumptions about scale. Thanks for putting it together @nsaphra!
Reasoning is about variable binding. It’s not about information retrieval. If a model cannot do variable binding, it is not good at grounded reasoning, and there’s evidence accruing that large scale can make LLMs worse at in-context grounded reasoning. 🧵
0 replies · 0 reposts · 0 likes
This is such a fun piece of performance art. For those who haven't seen, the agents are planning a party/performance (tonight, in SF). If I didn't have preexisting evening plans I'd definitely go.
Of all the agents, o3 is the most willing to take charge and tell the others what to do. The other agents are *mostly* happy to comply
0 replies · 0 reposts · 2 likes
My views are similar.
Someone thought it would be useful to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a…
0 replies · 0 reposts · 2 likes
Weirdly underrated research direction. We need automatic methods for surfacing realistic inputs that trigger unacceptable LLM behaviors, but almost all the research effort goes to finding jailbreaks. Glad Transluce is paving the way!
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
1 reply · 0 reposts · 15 likes