Daniel J (@djarosai)
Full results: GPT-5 maintains strong performance, and GPT-5-mini is notably competitive with o3 and gemini-2.5-pro. Absolute accuracy numbers depend on instruction and task complexity and will vary across settings; the key takeaway is the relative model rankings and degradation patterns
GPT-5 shows remarkable robustness for production instruction-following. On IFScale, our benchmark testing 100s of simultaneous constraints, it maintains >90% accuracy* through 500 instructions. That's a huge leap over the previous bests, o3 & gemini-2.5-pro (~69% @ 500). *run on 1 seed, 5 ongoing
Interested in this work and want to advance the frontier of what LLMs can do in real-world applications? Come join us at Distyl AI!
jobs.ashbyhq.com
Many more insights in our paper can inform the design of instruction-dense prompts, increasingly relevant for emerging agentic applications that must juggle various tool-use instructions and collected context simultaneously
OpenAI reasoning models (o3/o4-mini) uniquely degrade in report coherence as instructions scale up. There is a potential tradeoff between instruction following and core task performance that needs further study and may be model-specific
We also notice a universal trend of model bias toward instructions appearing earlier in the prompt (a primacy effect): bias increases with instruction load until the model hits a saturation point where it gets overwhelmed and trends toward uniformly distributed failure
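Not from the thread or paper, just a minimal sketch of one way such a primacy bias could be measured: bucket each instruction by its relative position in the prompt and compare per-bucket failure rates. Every name and data point below is an illustrative assumption.

```python
# Hypothetical sketch: measure positional bias in instruction failures.
# A real run would log, for each instruction, its position in the prompt,
# the total instruction count, and whether the model satisfied it.
from collections import defaultdict

# (position_index, total_instructions, followed?) -- made-up records
records = [(0, 100, True), (5, 100, True), (10, 100, True),
           (80, 100, False), (90, 100, False), (95, 100, False)]

buckets = defaultdict(lambda: [0, 0])  # relative-position decile -> [fails, total]
for pos, total, followed in records:
    decile = min(9, pos * 10 // total)
    buckets[decile][0] += int(not followed)
    buckets[decile][1] += 1

for decile in sorted(buckets):
    fails, total = buckets[decile]
    print(f"decile {decile}: failure rate {fails / total:.2f}")
# Failure rates rising toward later deciles indicate primacy bias; a flat
# profile at high instruction loads matches the uniform-failure saturation
# described above.
```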
Another unexpected result for frontier models: Claude 3.7 Sonnet outperforms Claude 4 Opus
We expect small models like gpt-4.1-nano and Claude 3.5 Haiku to show exponential decay, but surprisingly gpt-4o does as well: at 50 instructions it falls to 80% adherence, and at 100 instructions to only 50%
We find 3 distinct performance degradation curves. The best reasoning models maintain near-perfect performance until they begin to decay at a threshold of ~200 simultaneous instructions. Mid-tier models decay linearly, and the worst models show exponential decay
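As a hedged sketch (not the paper's methodology; the functional forms, initial guesses, and synthetic data are all assumptions), one could classify a model's degradation curve by fitting each candidate shape to its accuracy-vs-instruction-count points and keeping the best fit:

```python
# Toy classifier for the three degradation shapes described above.
import numpy as np
from scipy.optimize import curve_fit

def threshold(n, n0, k):   # near-perfect until ~n0, then exponential decay
    return np.where(n < n0, 1.0, np.exp(-k * (n - n0)))

def linear(n, a, b):       # steady linear decline
    return np.clip(a - b * n, 0.0, 1.0)

def exponential(n, k):     # decay from the very first instructions
    return np.exp(-k * n)

# Synthetic accuracy data for a hypothetical reasoning model
n = np.array([10, 50, 100, 200, 300, 400, 500], dtype=float)
acc = np.array([1.0, 1.0, 0.99, 0.97, 0.82, 0.70, 0.60])

sse = {}
for name, f, p0 in [("threshold", threshold, [200.0, 0.002]),
                    ("linear", linear, [1.0, 0.001]),
                    ("exponential", exponential, [0.001])]:
    params, _ = curve_fit(f, n, acc, p0=p0, maxfev=10000)
    sse[name] = float(np.sum((f(n, *params) - acc) ** 2))

print(min(sse, key=sse.get))  # expected: "threshold" for this toy data
```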
IFScale requires an LLM to generate a business report containing between 10 and 500 keywords, each specified as a distinct instruction. We evaluate performance across 20 models and find that even the best models begin to fail under the high cognitive load of 100s of instructions
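A rough illustration of that setup, under assumptions: the paper's actual prompt wording, keyword sourcing, and scoring rules may differ, and every function here is hypothetical.

```python
# Hypothetical sketch of an IFScale-style evaluation loop.
import re

def build_prompt(keywords: list[str]) -> str:
    """One keyword-inclusion instruction per keyword, 10-500 at a time."""
    instructions = "\n".join(
        f'{i + 1}. Include the exact word "{kw}" in the report.'
        for i, kw in enumerate(keywords)
    )
    return "Write a business report. Follow every instruction below:\n" + instructions

def score_adherence(report: str, keywords: list[str]) -> float:
    """Fraction of instructions satisfied: case-insensitive whole-word match."""
    hits = sum(
        1 for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", report, re.IGNORECASE)
    )
    return hits / len(keywords)

# Toy example with 3 keywords; real runs sweep 10-500.
kws = ["synergy", "throughput", "roadmap"]
print(build_prompt(kws))
print(score_adherence("Our roadmap doubles throughput.", kws))  # 0.666...
```

Accuracy at each instruction count then traces out the degradation curves described above.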
How many instructions can your LLM follow at once? Production LLM systems juggle 10s to 100s of instructions: policies, style, safety rules, tool use. But when do they overload? We introduce IFScale, a new benchmark measuring how instruction following degrades as instructions scale 🧵