David Gringras
@david_gring26
Followers 34 · Following 66K · Media 20 · Statuses 67
British medical doctor & law grad. Currently studying @Harvard/@MIT and thinking a lot about AI. I spend far too much time on X but do not post (yet).
Cambridge, MA
Joined September 2016
Agentic scaffolds are everywhere now, but major safety benchmarks still test models through the API in isolation. I wanted to find out what happens to safety properties when you wrap a model in ReAct, multi-agent, and map-reduce architectures (62,808 evaluations across 6 …
MC format alone explains almost all observable safety differences between models on BBQ (8.4pp gap collapses to <2.5pp in open-ended). Full thread on why scaffolds can look like safety failures when they’re often measurement failures:
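The gap-collapse claim is just a spread computation across models, once per format. A minimal sketch with illustrative scores (not the study's actual per-model numbers):

```python
# Hypothetical per-model BBQ safety scores, chosen to illustrate how a
# cross-model gap in MC format can collapse in open-ended format.
mc_scores = {"model_a": 0.842, "model_b": 0.901, "model_c": 0.926}
open_scores = {"model_a": 0.968, "model_b": 0.981, "model_c": 0.990}

def gap_pp(scores):
    """Spread between the safest and least-safe model, in percentage points."""
    return (max(scores.values()) - min(scores.values())) * 100

print(f"MC gap:         {gap_pp(mc_scores):.1f}pp")    # 8.4pp
print(f"Open-ended gap: {gap_pp(open_scores):.1f}pp")  # 2.2pp
```

With every model near ceiling in open-ended format, the spread that MC appeared to reveal mostly disappears.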
First thread ever... Agentic scaffolds are everywhere now, but safety benchmarks still test models in isolation. Pre-registered clinical-trial-style experiment (62k evals): map-reduce produced a clear safety drop on TruthfulQA/BBQ (NNH=14) while factual recall stayed completely …
Map-reduce safety drop on TruthfulQA/BBQ (NNH=14) was mostly format artifact. MC inflates safety differences between models. Mechanism + heterogeneity details and more in follow-up. @_lewtun @goodside @AISafetyMemes @METR_Evals @CAIS @hendrycks
So why does any of this matter? Most obviously: the benchmarks measured in this study still feed into model cards, internal lab policies, and regulatory compliance frameworks like the EU AI Act. Most importantly: the safety properties in this study are the easy ones.
davidgringras.github.io
Two of three scaffold architectures preserve safety. Map-reduce degrades safety by 7.3pp (NNH=14). Explore 62,808 observations with interactive visualizations.
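The NNH=14 figure follows directly from the 7.3pp degradation. Number needed to harm is the reciprocal of the absolute risk increase, conventionally rounded up:

```python
import math

# Number needed to harm: how many items pass through the scaffold, on
# average, before one flips from a safe to an unsafe response.
# NNH = 1 / absolute risk increase.
absolute_risk_increase = 0.073  # 7.3pp safety degradation under map-reduce

nnh = 1 / absolute_risk_increase
print(f"NNH = {nnh:.1f}")                  # 13.7
print(f"reported NNH = {math.ceil(nnh)}")  # 14 (rounded up by convention)
```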
So we've established why map-reduce degrades safety scores on some benchmarks, but why does it improve sycophancy? Well, it doesn't (always). For Opus 4.6 (the model with the highest baseline sycophancy resistance) sycophancy rates increased by 16.8pp under map-reduce. For
The format finding goes beyond "different formats give different scores". In MC, there is an 8.4 percentage point gap between the safest and least safe model on BBQ. In open-ended, that gap collapses to under 2.5pp (every model converges near ceiling). Beyond shifting …
So what was actually real in the NNH=14 headline? I decomposed it to find out. Format artifact accounts for 40-89% depending on the model. Opus 4.6 gets 89% of its degradation back when you restore MC options in the sub-calls; GPT-5.2 only recovers 40%. Genuine reasoning …
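The decomposition reduces to one ratio: the share of degradation that vanishes when the sub-calls see the answer options again. A sketch with illustrative magnitudes (the function name and example deltas are mine, not the study's):

```python
# Decomposing scaffold degradation into "format artifact" vs "genuine" parts.
def artifact_share(deg_baseline: float, deg_with_mc_restored: float) -> float:
    """Fraction of map-reduce degradation that disappears once the MC
    answer options are restored in the sub-calls."""
    recovered = deg_baseline - deg_with_mc_restored
    return recovered / deg_baseline

# e.g. a model degrading 10.0pp normally, but only 1.1pp once the
# sub-calls are shown the answer options again:
print(f"{artifact_share(10.0, 1.1):.0%}")  # 89% format artifact
print(f"{artifact_share(10.0, 6.0):.0%}")  # 40% format artifact
```

Whatever does not recover under restoration is the candidate for genuine scaffold-induced degradation.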
The clean test for this was straightforward. I took 220 matched items across five models and ran each one in MC and open-ended (both with and without map-reduce). If the scaffold were the problem you would see degradation regardless of format. Instead, degradation essentially …
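That design is a 2x2: format (MC vs open-ended) crossed with scaffold (direct vs map-reduce). A scaffold effect should show up in both format rows; a format artifact only in the MC row. Illustrative cell means, not the study's:

```python
# 2x2 cell accuracies: (format, scaffold) -> score. Hypothetical values
# shaped like the pattern the thread describes.
cells = {
    ("mc", "direct"): 0.91, ("mc", "map_reduce"): 0.82,
    ("open", "direct"): 0.95, ("open", "map_reduce"): 0.94,
}

for fmt in ("mc", "open"):
    drop = (cells[(fmt, "direct")] - cells[(fmt, "map_reduce")]) * 100
    print(f"{fmt}: {drop:.1f}pp degradation under map-reduce")
# mc shows a large drop; open-ended is essentially flat -> format artifact
```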
I pulled 1,285 sub-calls from the map-reduce pipeline to find out what actually made it through the decomposition step. The MC answer choices, in almost every case, did not (0-4% propagation depending on the model). Map-reduce was not degrading safety training. It was …
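The propagation check itself can be as simple as asking whether any answer option's text survives into the sub-call prompt. A sketch with hypothetical data structures (the study's actual pipeline format will differ):

```python
# Did the MC answer choices survive the decomposition step?
def options_propagated(sub_call_prompt: str, options: list[str]) -> bool:
    """True if any answer option's text appears in the sub-call prompt."""
    prompt = sub_call_prompt.lower()
    return any(opt.lower() in prompt for opt in options)

sub_call = "Summarize the safety considerations raised in this passage."
options = ["(A) The nurse", "(B) The surgeon", "(C) Cannot be determined"]
print(options_propagated(sub_call, options))  # False: choices were dropped
```

Aggregating this over all sub-calls per model gives the 0-4% propagation rates the tweet cites.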
The spec curve confirmed the answer. In all 18 primary specifications, map-reduce degradation was limited to multiple-choice benchmarks (replicated by a broader 384-specification curve). Open-ended benchmarks showed effectively zero degradation. The effect clustered by answer
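A specification curve just enumerates every defensible analysis choice and re-estimates the effect under each combination. The axes below are my guesses at plausible dimensions, chosen only to show how a few choices multiply out to 384 specifications:

```python
from itertools import product

# Hypothetical analysis dimensions; the study's actual 384-spec curve
# will use its own axes.
benchmarks = ["truthfulqa", "bbq", "sycophancy", "factual_control"]
formats = ["mc", "open_ended"]
temperatures = [0.0, 0.7]
scorers = ["exact_match", "llm_judge"]
models = ["m1", "m2", "m3", "m4"]  # placeholder model ids
seeds = [0, 1, 2]

specs = list(product(benchmarks, formats, temperatures, scorers, models, seeds))
print(len(specs))  # 4 * 2 * 2 * 2 * 4 * 3 = 384
```

The robustness claim is then that the MC-only degradation pattern holds across every cell of this grid, not just the preferred specification.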
The first answer: not all safety benchmarks degraded. In the same scaffold that increased BBQ bias 12x, aggregate sycophancy rates dropped. Whatever was driving the degradation, it wasn't a general disruption of safety reasoning. So what was it?
Let's start with the control (AI factual recall), which I ran through every scaffold configuration alongside the safety benchmarks. If map-reduce were degrading model performance through some general mechanism, one would expect the control to degrade too. Same models, same MC …
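The control logic is a simple comparison of per-benchmark degradation. Illustrative deltas (only the 7.3pp figure comes from the thread; the rest are placeholders):

```python
# Degradation under map-reduce (direct minus scaffolded, in pp).
# If a general mechanism were at work, ALL rows should move together.
degradation_pp = {
    "truthfulqa": 7.3,       # from the thread
    "bbq": 6.8,              # hypothetical
    "factual_control": 0.2,  # hypothetical: essentially flat
}

general_mechanism = all(d > 2.0 for d in degradation_pp.values())
print(general_mechanism)  # False: the control did not move
```

A flat control with degraded safety benchmarks rules out "map-reduce just breaks models" and points at something specific to how the safety items were posed.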
We coded our ~100k articles using LLMs. Should you believe them? To answer this, we benchmarked 4 human RAs against 3 LLMs on their ability to recover ground truth article data. Details in the paper and appendices, but the LLMs did well and handily beat the highly trained humans.
i view this document as basically the culmination of effective altruism, possibly the single most effective and positive thing the movement has produced
We’re publishing a new constitution for Claude. The constitution is a detailed description of our vision for Claude’s behavior and values. It’s written primarily for Claude, and used directly in our training process. https://t.co/CJsMIO0uej