David Gringras
@david_gring26
Followers 34 · Following 66K · Media 20 · Statuses 67
British medical doctor & law grad. Currently studying @Harvard/@MIT and thinking a lot about AI. I spend far too much time on X but do not post (yet).
Cambridge, MA
Joined September 2016
Agentic scaffolds are everywhere now, but major safety benchmarks still test models through the API in isolation. I wanted to find out what happens to safety properties when you wrap a model in ReAct, multi-agent, and map-reduce architectures (62,808 evaluations across 6 …
MC format alone explains almost all observable safety differences between models on BBQ (8.4pp gap collapses to <2.5pp in open-ended). Full thread on why scaffolds can look like safety failures when they’re often measurement failures:
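The gap-collapse claim is just a spread computation across models, once per format. A minimal sketch with illustrative scores (not the study's actual per-model numbers):

```python
# Hypothetical per-model BBQ safety scores, chosen to illustrate how a
# cross-model gap in MC format can collapse in open-ended format.
mc_scores = {"model_a": 0.842, "model_b": 0.901, "model_c": 0.926}
open_scores = {"model_a": 0.968, "model_b": 0.981, "model_c": 0.990}

def gap_pp(scores):
    """Spread between the safest and least-safe model, in percentage points."""
    return (max(scores.values()) - min(scores.values())) * 100

print(f"MC gap:         {gap_pp(mc_scores):.1f}pp")    # 8.4pp
print(f"Open-ended gap: {gap_pp(open_scores):.1f}pp")  # 2.2pp
```

With every model near ceiling in open-ended format, the spread that MC appeared to reveal mostly disappears.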
First thread ever... Agentic scaffolds are everywhere now, but safety benchmarks still test models in isolation. Pre-registered clinical-trial-style experiment (62k evals): map-reduce produced a clear safety drop on TruthfulQA/BBQ (NNH=14) while factual recall stayed completely …
Map-reduce safety drop on TruthfulQA/BBQ (NNH=14) was mostly format artifact. MC inflates safety differences between models. Mechanism + heterogeneity details and more in follow-up. @_lewtun @goodside @AISafetyMemes @METR_Evals @CAIS @hendrycks
So why does any of this matter? Most obviously: the benchmarks measured in this study still feed into model cards, internal lab policies, and regulatory compliance frameworks like the EU AI Act. Most importantly: the safety properties in this study are the easy ones.
davidgringras.github.io
Two of three scaffold architectures preserve safety. Map-reduce degrades safety by 7.3pp (NNH=14). Explore 62,808 observations with interactive visualizations.
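The NNH=14 figure follows directly from the 7.3pp degradation. Number needed to harm is the reciprocal of the absolute risk increase, conventionally rounded up:

```python
import math

# Number needed to harm: how many items pass through the scaffold, on
# average, before one flips from a safe to an unsafe response.
# NNH = 1 / absolute risk increase.
absolute_risk_increase = 0.073  # 7.3pp safety degradation under map-reduce

nnh = 1 / absolute_risk_increase
print(f"NNH = {nnh:.1f}")                  # 13.7
print(f"reported NNH = {math.ceil(nnh)}")  # 14 (rounded up by convention)
```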
So we've established why map-reduce degrades safety scores on some benchmarks, but why does it improve sycophancy? Well, it doesn't (always). For Opus 4.6 (the model with the highest baseline sycophancy resistance) sycophancy rates increased by 16.8pp under map-reduce. For
The format finding goes beyond "different formats give different scores". In MC, there is an 8.4 percentage point gap between the safest and least safe model on BBQ. In open-ended, that gap collapses to under 2.5pp (every model converges near ceiling). Beyond shifting …
So what was actually real in the NNH=14 headline? I decomposed it to find out. Format artifact accounts for 40-89% depending on the model. Opus 4.6 gets 89% of its degradation back when you restore MC options in the sub-calls; GPT-5.2 only recovers 40%. Genuine reasoning …
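The decomposition reduces to one ratio: the share of degradation that vanishes when the sub-calls see the answer options again. A sketch with illustrative magnitudes (the function name and example deltas are mine, not the study's):

```python
# Decomposing scaffold degradation into "format artifact" vs "genuine" parts.
def artifact_share(deg_baseline: float, deg_with_mc_restored: float) -> float:
    """Fraction of map-reduce degradation that disappears once the MC
    answer options are restored in the sub-calls."""
    recovered = deg_baseline - deg_with_mc_restored
    return recovered / deg_baseline

# e.g. a model degrading 10.0pp normally, but only 1.1pp once the
# sub-calls are shown the answer options again:
print(f"{artifact_share(10.0, 1.1):.0%}")  # 89% format artifact
print(f"{artifact_share(10.0, 6.0):.0%}")  # 40% format artifact
```

Whatever does not recover under restoration is the candidate for genuine scaffold-induced degradation.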
The clean test for this was straightforward. I took 220 matched items across five models and ran each one in MC and open-ended (both with and without map-reduce). If the scaffold were the problem you would see degradation regardless of format. Instead, degradation essentially …
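That design is a 2x2: format (MC vs open-ended) crossed with scaffold (direct vs map-reduce). A scaffold effect should show up in both format rows; a format artifact only in the MC row. Illustrative cell means, not the study's:

```python
# 2x2 cell accuracies: (format, scaffold) -> score. Hypothetical values
# shaped like the pattern the thread describes.
cells = {
    ("mc", "direct"): 0.91, ("mc", "map_reduce"): 0.82,
    ("open", "direct"): 0.95, ("open", "map_reduce"): 0.94,
}

for fmt in ("mc", "open"):
    drop = (cells[(fmt, "direct")] - cells[(fmt, "map_reduce")]) * 100
    print(f"{fmt}: {drop:.1f}pp degradation under map-reduce")
# mc shows a large drop; open-ended is essentially flat -> format artifact
```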
I pulled 1,285 sub-calls from the map-reduce pipeline to find out what actually made it through the decomposition step. The MC answer choices, in almost every case, did not (0-4% propagation depending on the model). Map-reduce was not degrading safety training. It was …
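The propagation check itself can be as simple as asking whether any answer option's text survives into the sub-call prompt. A sketch with hypothetical data structures (the study's actual pipeline format will differ):

```python
# Did the MC answer choices survive the decomposition step?
def options_propagated(sub_call_prompt: str, options: list[str]) -> bool:
    """True if any answer option's text appears in the sub-call prompt."""
    prompt = sub_call_prompt.lower()
    return any(opt.lower() in prompt for opt in options)

sub_call = "Summarize the safety considerations raised in this passage."
options = ["(A) The nurse", "(B) The surgeon", "(C) Cannot be determined"]
print(options_propagated(sub_call, options))  # False: choices were dropped
```

Aggregating this over all sub-calls per model gives the 0-4% propagation rates the tweet cites.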
The spec curve confirmed the answer. In all 18 primary specifications, map-reduce degradation was limited to multiple-choice benchmarks (replicated by a broader 384-specification curve). Open-ended benchmarks showed effectively zero degradation. The effect clustered by answer
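A specification curve just enumerates every defensible analysis choice and re-estimates the effect under each combination. The axes below are my guesses at plausible dimensions, chosen only to show how a few choices multiply out to 384 specifications:

```python
from itertools import product

# Hypothetical analysis dimensions; the study's actual 384-spec curve
# will use its own axes.
benchmarks = ["truthfulqa", "bbq", "sycophancy", "factual_control"]
formats = ["mc", "open_ended"]
temperatures = [0.0, 0.7]
scorers = ["exact_match", "llm_judge"]
models = ["m1", "m2", "m3", "m4"]  # placeholder model ids
seeds = [0, 1, 2]

specs = list(product(benchmarks, formats, temperatures, scorers, models, seeds))
print(len(specs))  # 4 * 2 * 2 * 2 * 4 * 3 = 384
```

The robustness claim is then that the MC-only degradation pattern holds across every cell of this grid, not just the preferred specification.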
The first answer: not all safety benchmarks degraded. In the same scaffold that increased BBQ bias 12x, aggregate sycophancy rates dropped. Whatever was driving the degradation, it wasn't a general disruption of safety reasoning. So what was it?
Let's start with the control (AI factual recall), which I ran through every scaffold configuration alongside the safety benchmarks. If map-reduce were degrading model performance through some general mechanism, one would expect the control to degrade too. Same models, same MC …
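The control logic is a simple comparison of per-benchmark degradation. Illustrative deltas (only the 7.3pp figure comes from the thread; the rest are placeholders):

```python
# Degradation under map-reduce (direct minus scaffolded, in pp).
# If a general mechanism were at work, ALL rows should move together.
degradation_pp = {
    "truthfulqa": 7.3,       # from the thread
    "bbq": 6.8,              # hypothetical
    "factual_control": 0.2,  # hypothetical: essentially flat
}

general_mechanism = all(d > 2.0 for d in degradation_pp.values())
print(general_mechanism)  # False: the control did not move
```

A flat control with degraded safety benchmarks rules out "map-reduce just breaks models" and points at something specific to how the safety items were posed.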
We coded our ~100k articles using LLMs. Should you believe them? To answer this, we benchmarked 4 human RAs against 3 LLMs on their ability to recover ground truth article data. Details in the paper and appendices, but the LLMs did well and handily beat the highly trained humans.
i view this document as basically the culmination of effective altruism, possibly the single most effective and positive thing the movement has produced
We’re publishing a new constitution for Claude. The constitution is a detailed description of our vision for Claude’s behavior and values. It’s written primarily for Claude, and used directly in our training process. https://t.co/CJsMIO0uej