Jacob Stavrianos
@JacobStavrianos
Followers
13
Following
33
Media
4
Statuses
40
professional LM gaslighter
San Francisco
Joined July 2023
Kimi this Kimi that. Just Kimi the bottom line. The @Kimi_Moonshot K2 Thinking model has taken 2nd place on Vals Index for open source. It beat out @Zai_org’s GLM 4.5, though GLM 4.6 is still holding strong and doesn’t look to be budging anytime soon. Here’s what we found in our
23
33
155
We're hiring for full-time roles in SF! Come join an absolutely cracked team working on the most important problems in LLM evaluation. https://t.co/obflBmSv2N
jobs.polymer.co
View our open jobs at Vals AI.
0
3
6
now I'll only spend a third of my salary on claude code
We evaluated @Claudeai 4.5 Haiku (Thinking) and found that the model places 3rd on our Vals Index. Our full evaluation of @AnthropicAI’s Haiku 4.5 shows the model performs strongest on coding tasks, ranking 3rd on Terminal-Bench. (1/3)
0
0
1
📣 New Vals AI benchmark just released 📣 We built the SAGE benchmark after finding that models struggle to grade student work. Paradoxically, the best models can now solve challenging math problems + win IMO but struggle to break 50% when grading. (1/5)
8
6
39
We are looking forward to hosting @Mike_A_Merrill to discuss his work on Terminal-Bench, a widely used open-source benchmark for evaluating agents in terminal environments. Join us on @askalphaxiv Thursday, October 9th at 11 am PT. Link to sign up below! (1/2)
1
4
11
We are excited to have @ShashwatGoel7 to discuss how AI evaluations need to change in tandem with LLM capabilities! Join us on @askalphaxiv tomorrow, Oct 2nd at 11 am PT and ask him your burning questions about evals. Link to sign up below! (1/2)
2
5
16
AI wins gold medals in the International Math Olympiad, but struggles at entry-level financial analyst work… but why?
4
8
25
We found that #GPT5 performs competitively with #Grok on IOI 2025, but less than half as well on the 2024 exam. @SherylHsu02 @OpenAI I wonder why?
We tested top foundation models on the International Olympiad in Informatics (IOI) - a programming competition that tests algorithmic thinking and C++ coding skills. We found @xai's @grok 4 to be the clear SOTA winner, scoring first place on both 2024 and 2025 exams. 🥇📊👏
1
0
4
Though there's a slight variance between the company's reported scores and our scores, GPT-5 takes first place on MMMU. It outperforms both OpenAI's predecessor models and the previously top-ranking model, Gemini 2.5 Pro Exp, by 0.2%.
1
2
5