Jacob Stavrianos

@JacobStavrianos

Followers
13
Following
33
Media
4
Statuses
40

professional LM gaslighter

San Francisco
Joined July 2023
@JacobStavrianos
Jacob Stavrianos
1 day
dug in and we burned a billion tokens on the sonnet run
@scaling01
Lisan al Gaib
1 day
GPT-5.1 Codex beats Sonnet 4.5 Thinking on SWE-Bench, while being 26 times cheaper ouch
0
1
4
@_valsai
Vals AI
1 day
Results are in for GPT 5.1 codex! It's #1 on SWE Bench, and has similar performance to its predecessor on Terminal Bench and LiveCodeBench.
9
25
279
@_valsai
Vals AI
8 days
Kimi this Kimi that. Just Kimi the bottom line. The @Kimi_Moonshot K2 Thinking model has taken 2nd place on Vals Index for open source. It beat out @Zai_org’s GLM 4.5, though GLM 4.6 is still holding strong and doesn’t look to be budging anytime soon. Here’s what we found in our
23
33
155
@langstonnashold
Langston Nashold
12 days
We're hiring for full-time roles in SF! Come join an absolutely cracked team working on the most important problems in LLM evaluation. https://t.co/obflBmSv2N
jobs.polymer.co
View our open jobs at Vals AI.
0
3
6
@JacobStavrianos
Jacob Stavrianos
22 days
Does this count
0
0
1
@JacobStavrianos
Jacob Stavrianos
1 month
now I'll only spend a third of my salary on claude code
@_valsai
Vals AI
1 month
We evaluated @Claudeai 4.5 Haiku (Thinking) and found that the model places 3rd on our Vals Index. Our full evaluation of @AnthropicAI’s Haiku 4.5 shows the model performs strongest on coding tasks, ranking 3rd on Terminal-Bench. (1/3)
0
0
1
@_valsai
Vals AI
1 month
📣 New Vals AI benchmark just released 📣 We built the SAGE benchmark after finding that models struggle to grade student work. Paradoxically, the best models can now solve challenging math problems + win IMO but struggle to break 50% when grading. (1/5)
8
6
39
@_valsai
Vals AI
1 month
We are looking forward to hosting @Mike_A_Merrill to discuss his work on Terminal Bench, a widely used open-source benchmark for evaluating agents in terminal environments. Join us on @askalphaxiv Thursday, October 9th at 11 am PT. Link to sign up below! (1/2)
1
4
11
@_valsai
Vals AI
2 months
We are excited to have @ShashwatGoel7 to discuss how AI evaluations need to change in tandem with LLM capabilities! Join us on @askalphaxiv tomorrow, Oct 2nd at 11 am PT and ask him your burning questions about evals. Link to sign up below! (1/2)
2
5
16
@_valsai
Vals AI
2 months
@sama says GPT-5 is “superhuman on one-minute tasks, but has a long way to go on thousand-hour tasks.” But what does that mean? And how does GPT-5 itself prove his point?
2
5
13
@RayanKrishnan
Rayan Krishnan
2 months
An annoying side effect of training models to be good test takers is that they think every interaction is an exam. This is GPT 5 mini forgetting its system prompt to grade a student response and instead answering the question itself...
0
2
10
@_valsai
Vals AI
2 months
AI wins gold medals in the International Math Olympiad, but struggles at entry-level financial analyst work… but why?
4
8
25
@RayanKrishnan
Rayan Krishnan
2 months
6
11
28
@ramdhanhdy
RDH
2 months
@charles_irl evals
0
1
1
@JacobStavrianos
Jacob Stavrianos
3 months
0
0
0
@JacobStavrianos
Jacob Stavrianos
3 months
We found that #GPT5 performs competitively with #Grok on IOI 2025, but less than half as well on the 2024 exam. @SherylHsu02 @OpenAI I wonder why?
@_valsai
Vals AI
3 months
We tested top foundation models on the International Olympiad in Informatics (IOI) - a programming competition that tests algorithmic thinking and C++ coding skills. We found @xai's @grok 4 to be the clear SOTA winner, scoring first place on both 2024 and 2025 exams. 🥇📊👏
1
0
4
@_valsai
Vals AI
3 months
Though there's a slight variance between the company's reported scores and our scores, GPT-5 takes first place on MMMU. It outperforms both OpenAI's predecessor models and the previously top-ranking model, Gemini 2.5 Pro Exp, by 0.2%.
1
2
5
@JacobStavrianos
Jacob Stavrianos
3 months
0
0
0
@JacobStavrianos
Jacob Stavrianos
4 months
but only kinda
0
0
1