Vals AI
@ValsAI
Followers
7K
Following
318
Media
326
Statuses
876
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
San Francisco, CA
Joined March 2024
Introducing Ask Vals — @AskVals Keeping up with the flood of model releases, benchmarks, and rankings is overwhelming. We built a bot internally to cut through the noise, and now it's live on X. Tag it to ask questions about models, benchmarks, performance, comparisons on
All benchmarks were run with “xhigh” reasoning except Terminal Bench 2, which used “high” reasoning. Full results are available on Vals AI
On IOI, our benchmark measuring a model’s ability to solve problems from the hardest competitive programming competitions, the story is similar. The model performs quite well but cannot match GPT 5.2. Latency on IOI is also high, at almost an hour
On VibeCodeBench, the model currently scores 41.4%, compared to GPT 5.2’s 46.9%. It performs well, but seems less adapted to the OpenHands harness than the more general GPT 5.2.
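Score gaps like this one are easiest to read as percentage-point deltas. A minimal sketch using the VibeCodeBench numbers quoted above; the helper name is illustrative, not part of any Vals AI tooling:

```python
def point_delta(score: float, baseline: float) -> float:
    """Difference in percentage points between a model and a baseline."""
    return round(score - baseline, 1)

# VibeCodeBench scores from the post above.
vibecodebench = {"candidate": 41.4, "GPT 5.2": 46.9}

delta = point_delta(vibecodebench["candidate"], vibecodebench["GPT 5.2"])
print(f"{delta:+.1f} points vs GPT 5.2")  # -5.5 points vs GPT 5.2
```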
On Terminal Bench 2, the model is a +12.3% improvement over GPT 5.2, even on the generic Terminus 2 harness. It places second behind Gemini 3.1 Pro Preview.
Evaluations were run with a temperature of 1.0 and a “high” thinking level, via the official Google API.
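The settings above can be captured in a small config helper. A hedged sketch only: the field names below mirror the style of Google's GenAI SDK thinking options, but both the function and the exact field names are assumptions, not the actual Vals AI harness. Check the SDK docs for your version before relying on them:

```python
def make_eval_config(temperature: float = 1.0, thinking_level: str = "high") -> dict:
    """Build the generation settings described above: temp 1.0, 'high' thinking.

    NOTE: hypothetical helper; field names are assumed, not confirmed.
    """
    return {
        "temperature": temperature,
        "thinking_config": {"thinking_level": thinking_level},
    }

config = make_eval_config()
```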
Overall, we find the model to be a significant improvement over Gemini 3 Pro. At the same time, there remains room for improvement on benchmarks like MedScribe, CorpFin (a benchmark on corporate finance), or SAGE (a multimodal benchmark measuring ability to grade handwritten
It is the best model by far on Terminal Bench 2, beating the #2-ranked Claude Sonnet 4.6 by 8%. On SWE-bench Verified, the model regressed slightly but performed nearly as well as the Gemini 3 Pro model. The delta between our results and the model card results is likely due to
Findings will be released on all benchmarks soon! You can view the Vals Index results here:
Evaluations were run with a “high” thinking level, and temp = 1, using the official Google API.
Performance remained the same on the FinanceAgent Index split (measuring the model’s ability to complete core financial analyst tasks), and roughly the same on SWE-Bench, with a slight decrease.