Daniel J (@djarosai)
Full results: GPT-5 maintains strong performance, and GPT-5-mini is notably competitive with o3 and gemini-2.5-pro. Absolute accuracy numbers depend on instruction and task complexity and will vary across settings; the key takeaway is the relative model rankings and degradation patterns
GPT-5 shows remarkable robustness for production instruction-following. On IFScale, our benchmark testing 100s of simultaneous constraints, it maintains >90% accuracy* through 500 instructions. That's a huge leap over the previous bests, o3 & gemini-2.5-pro (~69% @ 500). *run on 1 seed, 5 ongoing
Interested in this work and want to advance the frontier of what LLMs can do in real-world applications? Come join us at Distyl AI!
jobs.ashbyhq.com
Many more insights in our paper can inform the design of instruction-dense prompts, increasingly relevant for emerging agentic applications that must juggle various tool-use instructions and collected context simultaneously
OpenAI reasoning models (o3/o4-mini) uniquely degrade in report coherence as instructions scale up. There is a potential tradeoff between instruction following and core task performance that needs further study and may be model-specific
We also notice a universal trend of model bias toward instructions appearing earlier in the prompt (a primacy effect): bias increases with instruction load until the model hits a saturation point where it gets overwhelmed and trends toward uniformly distributed failure
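Not from the thread or paper, just a minimal sketch of one way such a primacy bias could be measured: bucket each instruction by its relative position in the prompt and compare per-bucket failure rates. Every name and data point below is an illustrative assumption.

```python
# Hypothetical sketch: measure positional bias in instruction failures.
# A real run would log, for each instruction, its position in the prompt,
# the total instruction count, and whether the model satisfied it.
from collections import defaultdict

# (position_index, total_instructions, followed?) -- made-up records
records = [(0, 100, True), (5, 100, True), (10, 100, True),
           (80, 100, False), (90, 100, False), (95, 100, False)]

buckets = defaultdict(lambda: [0, 0])  # relative-position decile -> [fails, total]
for pos, total, followed in records:
    decile = min(9, pos * 10 // total)
    buckets[decile][0] += int(not followed)
    buckets[decile][1] += 1

for decile in sorted(buckets):
    fails, total = buckets[decile]
    print(f"decile {decile}: failure rate {fails / total:.2f}")
# Failure rates rising toward later deciles indicate primacy bias; a flat
# profile at high instruction loads matches the uniform-failure saturation
# described above.
```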
Another unexpected result for frontier models: Claude 3.7 Sonnet outperforms Claude 4 Opus
We expect small models like gpt-4.1-nano and Claude 3.5 Haiku to show exponential decay, but surprisingly gpt-4o does as well: at 50 instructions it falls to 80% adherence, and at 100 instructions to only 50%
We find 3 distinct performance degradation curves. The best reasoning models maintain near-perfect performance until they begin to decay at a threshold of ~200 simultaneous instructions. Mid-tier models decay linearly, and the worst models show exponential decay
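As a hedged sketch (not the paper's methodology; the functional forms, initial guesses, and synthetic data are all assumptions), one could classify a model's degradation curve by fitting each candidate shape to its accuracy-vs-instruction-count points and keeping the best fit:

```python
# Toy classifier for the three degradation shapes described above.
import numpy as np
from scipy.optimize import curve_fit

def threshold(n, n0, k):   # near-perfect until ~n0, then exponential decay
    return np.where(n < n0, 1.0, np.exp(-k * (n - n0)))

def linear(n, a, b):       # steady linear decline
    return np.clip(a - b * n, 0.0, 1.0)

def exponential(n, k):     # decay from the very first instructions
    return np.exp(-k * n)

# Synthetic accuracy data for a hypothetical reasoning model
n = np.array([10, 50, 100, 200, 300, 400, 500], dtype=float)
acc = np.array([1.0, 1.0, 0.99, 0.97, 0.82, 0.70, 0.60])

sse = {}
for name, f, p0 in [("threshold", threshold, [200.0, 0.002]),
                    ("linear", linear, [1.0, 0.001]),
                    ("exponential", exponential, [0.001])]:
    params, _ = curve_fit(f, n, acc, p0=p0, maxfev=10000)
    sse[name] = float(np.sum((f(n, *params) - acc) ** 2))

print(min(sse, key=sse.get))  # expected: "threshold" for this toy data
```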
IFScale requires an LLM to generate a business report containing between 10 and 500 keywords, each specified as a distinct instruction. We evaluate performance across 20 models and find that even the best models begin to fail under the high cognitive load of 100s of instructions
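A rough illustration of that setup, under assumptions: the paper's actual prompt wording, keyword sourcing, and scoring rules may differ, and every function here is hypothetical.

```python
# Hypothetical sketch of an IFScale-style evaluation loop.
import re

def build_prompt(keywords: list[str]) -> str:
    """One keyword-inclusion instruction per keyword, 10-500 at a time."""
    instructions = "\n".join(
        f'{i + 1}. Include the exact word "{kw}" in the report.'
        for i, kw in enumerate(keywords)
    )
    return "Write a business report. Follow every instruction below:\n" + instructions

def score_adherence(report: str, keywords: list[str]) -> float:
    """Fraction of instructions satisfied: case-insensitive whole-word match."""
    hits = sum(
        1 for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", report, re.IGNORECASE)
    )
    return hits / len(keywords)

# Toy example with 3 keywords; real runs sweep 10-500.
kws = ["synergy", "throughput", "roadmap"]
print(build_prompt(kws))
print(score_adherence("Our roadmap doubles throughput.", kws))  # 0.666...
```

Accuracy at each instruction count then traces out the degradation curves described above.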
How many instructions can your LLM follow at once? Production LLM systems juggle 10s to 100s of instructions: policies, style, safety rules, tool use. But when do they overload? We introduce IFScale, a new benchmark measuring how instruction following degrades as instructions scale 🧵