Vals AI

@ValsAI

Followers
7K
Following
318
Media
326
Statuses
876

Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc

San Francisco, CA
Joined March 2024
@ValsAI
Vals AI
4 days
Try it now! Reply to a tweet and tag @AskVals. Example: “@AskVals how does this model do on Finance + Terminal?”
1
0
2
@ValsAI
Vals AI
4 days
What you can do with @askvals:
- Sanity-check SOTA claims with evidence
- Compare models for your use case (coding, finance, legal, agents)
- Get a detailed analysis of a benchmark or model
1
0
2
@ValsAI
Vals AI
4 days
Introducing Ask Vals — @AskVals Keeping up with the flood of model releases, benchmarks, and rankings is overwhelming. We built a bot internally to cut through the noise, and now it's live on X. Tag it to ask questions about models, benchmarks, performance, comparisons on
6
4
20
@ValsAI
Vals AI
5 days
All benchmarks were run with “xhigh” reasoning except Terminal Bench 2, which used “high” reasoning. Full results are available on Vals AI
1
0
8
@ValsAI
Vals AI
5 days
On IOI, our benchmark measuring the model’s ability to solve the hardest competitive programming problems, there is a similar story. The model performs quite well, but is not able to match the performance of GPT 5.2. The latency for IOI is also high, almost an hour
1
0
8
@ValsAI
Vals AI
5 days
On VibeCodeBench, it is currently at 41.4%, compared to GPT 5.2’s 46.9%. The model performs well, but seems less adapted to the OpenHands harness than the more general GPT 5.2.
1
0
9
@ValsAI
Vals AI
5 days
On Terminal Bench 2, the model shows a 12.3% improvement over GPT 5.2, even on the generic Terminus 2 harness. It places second behind Gemini 3.1 Pro Preview.
3
0
22
@ValsAI
Vals AI
5 days
Results are live for OpenAI’s Codex 5.3 model! Highlights include being #2 on Terminal Bench 2, #2 on our IOI benchmark, #3 on LiveCodeBench, and #4 on Vibe Code Bench.
7
9
177
@ValsAI
Vals AI
9 days
Congratulations to the @GoogleAI team on another strong release! Full results and analysis can be found on our site:
0
0
5
@ValsAI
Vals AI
9 days
Evaluations were run with a temperature of 1.0 and a “high” thinking level, via the official Google API.
2
0
6
@ValsAI
Vals AI
9 days
Overall, we find the model to be a significant improvement over Gemini 3 Pro. At the same time, there remains room for improvement on benchmarks like MedScribe, CorpFin (a benchmark on corporate finance), or SAGE (a multimodal benchmark measuring ability to grade handwritten
1
0
4
@ValsAI
Vals AI
9 days
It is the best model by far on Terminal Bench 2, beating the #2-ranked Claude Sonnet 4.6 by 8%. On SWE-bench Verified, the model regressed slightly but performed nearly as well as the Gemini 3 Pro model. The delta between our results and the model card results is likely due to
1
0
7
@ValsAI
Vals AI
9 days
It outperforms Gemini 3 Pro on ProofBench, our benchmark on writing formally verified math proofs, by 6%. The model also shows dramatic improvement on our Case Law benchmark, jumping from 53.4% accuracy (rank #50) to 65.6% accuracy (rank #11), a 12-percentage-point improvement.
1
0
5
@ValsAI
Vals AI
9 days
Full results on Gemini 3.1 Pro Preview are now released. It gets first place on several of our benchmarks, including MedCode, MortgageTax, and LegalBench. It is also #3 on our Finance Agent Benchmark, putting it ahead of OpenAI but still behind Anthropic.
1
4
83
@ValsAI
Vals AI
10 days
Findings will be released on all benchmarks soon! You can view the Vals Index results here:
0
0
5
@ValsAI
Vals AI
10 days
Evaluations were run with a “high” thinking level, and temp = 1, using the official Google API.
2
0
4
@ValsAI
Vals AI
10 days
Performance remained the same on the FinanceAgent Index split (measuring the model’s ability to complete core financial analyst tasks), and approximately the same on SWE-Bench (decreasing slightly).
1
0
5
@ValsAI
Vals AI
10 days
The model also shows dramatic improvement compared with Gemini 3 Pro (11/25) on our Case Law benchmark, jumping from rank #50 (53.4% accuracy) to rank #11 (65.6% accuracy), a 12-percentage-point improvement.
1
0
12
@ValsAI
Vals AI
10 days
It is the #1 model by a significant margin on Terminal Bench 2, scoring 67.42% and beating the second-ranked Claude Sonnet 4.6 by ~8%. This shows its capability in agentic coding tasks.
2
0
27
@ValsAI
Vals AI
10 days
The newly released Gemini 3.1 Pro (Preview) from @GoogleAI takes the #3 spot on our Vals Index, beating GPT 5.2.
6
5
84