Jacob Stavrianos @JacobStavrianos X Profile

Jacob Stavrianos

@JacobStavrianos

Followers

13

Following

33

Media

4

Statuses

40

professional LM gaslighter

San Francisco

Joined July 2023

Jacob Stavrianos

@JacobStavrianos

1 day

dug in and we burned a billion tokens on the sonnet run

Lisan al Gaib

@scaling01

1 day

GPT-5.1 Codex beats Sonnet 4.5 Thinking on SWE-Bench while being 26 times cheaper. Ouch.

0

1

4

Vals AI

@_valsai

1 day

Results are in for GPT 5.1 codex! It's #1 on SWE Bench, and has similar performance to its predecessor on Terminal Bench and LiveCodeBench.

9

25

279

Vals AI

@_valsai

8 days

Kimi this Kimi that. Just Kimi the bottom line. The @Kimi_Moonshot K2 Thinking model has taken 2nd place on Vals Index for open source. It beat out @Zai_org’s GLM 4.5, though GLM 4.6 is still holding strong and doesn’t look to be budging anytime soon Here’s what we found in our

23

33

155

Langston Nashold

@langstonnashold

12 days

We're hiring for full-time roles in SF! Come join an absolutely cracked team working on the most important problems in LLM evaluation. https://t.co/obflBmSv2N

jobs.polymer.co

View our open jobs at Vals AI.

0

3

6

Jacob Stavrianos

@JacobStavrianos

22 days

Does this count

0

1

Jacob Stavrianos

@JacobStavrianos

1 month

now I'll only spend a third of my salary on claude code

Vals AI

@_valsai

1 month

We evaluated @Claudeai 4.5 Haiku (Thinking) and found that the model places 3rd on our Vals Index. Our full evaluation of @AnthropicAI’s Haiku 4.5 shows the model performs strongest on coding tasks, ranking 3rd on Terminal-Bench. (1/3)

0

1

Jacob Stavrianos

@JacobStavrianos

1 month

https://t.co/gQTWsUtZHW

claude.ai

Play a fun 5x5 crossword puzzle online for free! Interactive grid with auto-advance, instant checking, and helpful clues. Perfect brain training game.

0

Vals AI

@_valsai

1 month

📣 New Vals AI benchmark just released 📣 We built the SAGE benchmark after finding that models struggle to grade student work. Paradoxically, the best models can now solve challenging math problems + win IMO but struggle to break 50% when grading. (1/5)

8

6

39

Vals AI

@_valsai

1 month

We are looking forward to hosting @Mike_A_Merrill to discuss his work on Terminal Bench, a widely used open-source benchmark for evaluating agents in terminal environments. Join us on @askalphaxiv Thursday, October 9th at 11 am PT. Link to sign up below! (1/2)

1

4

11

Vals AI

@_valsai

2 months

We are excited to have @ShashwatGoel7 to discuss how AI evaluations need to change in tandem with LLM capabilities! Join us on @askalphaxiv tomorrow, Oct 2nd at 11 am PT and ask him your burning questions about evals. Link to sign up below! (1/2)

2

5

16

Vals AI

@_valsai

2 months

@sama says GPT-5 is “superhuman on one-minute tasks, but has a long way to go on thousand-hour tasks.” But what does that mean? And how does GPT-5 itself prove his point?

2

5

13

Rayan Krishnan

@RayanKrishnan

2 months

An annoying side effect of training models to be good test takers is that they think every interaction is an exam. This is GPT 5 mini forgetting its system prompt to grade a student response and instead answering the question itself...

Rayan Krishnan

@RayanKrishnan

2 months

https://t.co/Q5Wqqt9kPF

0

2

10

Vals AI

@_valsai

2 months

AI wins gold medals in the International Math Olympiad, but struggles at entry-level financial analyst work… but why?

4

8

25

Rayan Krishnan

@RayanKrishnan

2 months

https://t.co/Q5Wqqt9kPF

6

11

28

RDH

@ramdhanhdy

2 months

@charles_irl evals

0

1

Jacob Stavrianos

@JacobStavrianos

3 months

@xai

0

Jacob Stavrianos

@JacobStavrianos

3 months

We found that #GPT5 performs competitively with #Grok on IOI 2025, but less than half as well on the 2024 exam. @SherylHsu02 @OpenAI I wonder why?

Vals AI

@_valsai

3 months

We tested top foundation models on the International Olympiad in Informatics (IOI) - a programming competition that tests algorithmic thinking and C++ coding skills. We found @xai's @grok 4 to be the clear SOTA winner, scoring first place on both 2024 and 2025 exams. 🥇📊👏

1

0

4

Vals AI

@_valsai

3 months

Though there's a slight variance between the company's reported scores and our scores, GPT-5 takes first place on MMMU. It outperforms both OpenAI's predecessor models and the previously top-ranking model, Gemini 2.5 Pro Exp, by 0.2%.

1

2

5

Jacob Stavrianos

@JacobStavrianos

3 months

https://t.co/Wozor68NJ5

0

Jacob Stavrianos

@JacobStavrianos

4 months

but only kinda

0

1