Vals AI
@ValsAI
Followers
7K
Following
318
Media
326
Statuses
876
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
San Francisco, CA
Joined March 2024
Introducing Ask Vals — @AskVals Keeping up with the flood of model releases, benchmarks, and rankings is overwhelming. We built a bot internally to cut through the noise, and now it's live on X. Tag it to ask questions about models, benchmarks, performance, comparisons on
All benchmarks were run with “xhigh” reasoning except Terminal Bench 2, which used “high” reasoning. Full results are available on Vals AI
On IOI, our benchmark measuring a model’s ability to solve problems from the hardest competitive programming competitions, the story is similar. The model performs quite well but cannot match GPT 5.2. Latency on IOI is also high, at almost an hour
On VibeCodeBench, the model currently scores 41.4%, compared to GPT 5.2’s 46.9%. It performs well, but seems less adapted to the OpenHands harness than the more general GPT 5.2.
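Score gaps like this one are easiest to read as percentage-point deltas. A minimal sketch using the VibeCodeBench numbers quoted above; the helper name is illustrative, not part of any Vals AI tooling:

```python
def point_delta(score: float, baseline: float) -> float:
    """Difference in percentage points between a model and a baseline."""
    return round(score - baseline, 1)

# VibeCodeBench scores from the post above.
vibecodebench = {"candidate": 41.4, "GPT 5.2": 46.9}

delta = point_delta(vibecodebench["candidate"], vibecodebench["GPT 5.2"])
print(f"{delta:+.1f} points vs GPT 5.2")  # -5.5 points vs GPT 5.2
```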
On Terminal Bench 2, the model is a +12.3% improvement over GPT 5.2, even on the generic Terminus 2 harness. It places second behind Gemini 3.1 Pro Preview.
Evaluations were run with a temperature of 1.0 and a “high” thinking level, via the official Google API.
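The settings above can be captured in a small config helper. A hedged sketch only: the field names below mirror the style of Google's GenAI SDK thinking options, but both the function and the exact field names are assumptions, not the actual Vals AI harness. Check the SDK docs for your version before relying on them:

```python
def make_eval_config(temperature: float = 1.0, thinking_level: str = "high") -> dict:
    """Build the generation settings described above: temp 1.0, 'high' thinking.

    NOTE: hypothetical helper; field names are assumed, not confirmed.
    """
    return {
        "temperature": temperature,
        "thinking_config": {"thinking_level": thinking_level},
    }

config = make_eval_config()
```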
Overall, we find the model to be a significant improvement over Gemini 3 Pro. At the same time, there remains room for improvement on benchmarks like MedScribe, CorpFin (a benchmark on corporate finance), or SAGE (a multimodal benchmark measuring ability to grade handwritten
It is the best model by far on Terminal Bench 2, beating the #2-ranked Claude Sonnet 4.6 by 8%. On SWE-bench Verified, the model regressed slightly but performed nearly as well as the Gemini 3 Pro model. The delta between our results and the model card results is likely due to
Findings will be released on all benchmarks soon! You can view the Vals Index results here:
Evaluations were run with a “high” thinking level, and temp = 1, using the official Google API.
Performance remained the same on the FinanceAgent Index split (measuring the model’s ability to complete core financial analyst tasks), and roughly the same on SWE-Bench, with a slight decrease.