Vals AI
@_valsai
5K Followers · 239 Following
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
San Francisco, CA
Joined March 2024
GPT 5.1 has also taken 2nd on our Vals Index - which measures the potential economic impact of a model on the U.S. economy.
Is Anthropic cooked? GPT-5.1-Codex tops Sonnet 4.5 on SWE-Bench at ~26x cheaper. Gemini 3 incoming. Don't know how good it will be at code. Likely amazing. Opus 4.5 *needs* to hit hard at a great price. Anthropic has trouble with cost efficiency. (No joy in saying this)
It also posted gains on LCB, jumping from 12th place to 2nd. On other benchmarks its performance generally stayed about the same, with some incremental improvements: a two-point increase on MMMU, one point on GPQA, and 1.5 points on IOI. (2/3)
🏆 @OpenAI's GPT 5.1 sets a new state-of-the-art on our Finance Agent Benchmark, beating out Sonnet 4.5 (Thinking) by 0.6%. (1/3)
@deanwball better to do it that way than the alternative i have learned :)
For everyone wondering ... yes, we're working on an eval. For everyone not wondering ... hey, we're Vals AI, and we do unbiased benchmarking of public enterprise models. We make all of our results public on our page. Check us out:
GPT-5.1 in ChatGPT is rolling out to all users this week. It’s smarter, more reliable, and a lot more conversational. https://t.co/SA1Q1GPyxV
Check out our eval of GLM 4.6: https://t.co/PxYPv57LIn
Next Thursday, November 20th, 11AM-12PM PST, we're hosting @vjhofmann for a discussion on Fluid Language Benchmarking. If you have questions or topics you want to hear about during the talk, drop them in the comments! Register here: https://t.co/lH3kxArD7m
Can we cut the small talk and get straight to Fluid Language Benchmarking? Join us next Thursday alongside @alphaxiv as we host @vjhofmann for a discussion on Fluid Language Benchmarking. Valentin Hofmann, a Postdoc at the Allen Institute for AI, integrates his research on
Michael... you're onto something. Check out our eval comparison view of Claude Sonnet 4.5 vs. ChatGPT 5. We're not sure which particular model you're using, but you can compare anything here: https://t.co/HNXQhIb0yB
Dexter evals are now live. Our initial setup:
• test data from @_valsai
• test runner from @LangChainAI
Any dataset works - just load it and run. Results show up instantly in LangSmith.
Ultimately, it’s an improvement, but it’s still not enough to take number 1 in the open-source category, or even to break the top ten on the overall Vals Index. Nevertheless, this is an important breakthrough for the open-source ecosystem. @Kimi_Moonshot See the full benchmark results
Latency. Latency. Latency. K2 is thinking nearly 5x slower than @Zai_org’s GLM 4.6, which remains number 1 on our Vals AI Index. But… it performed incredibly well with tool calling, notably on our Finance Agent Benchmark and SWE-Bench. Other than that, it performs similarly to
Kimi this, Kimi that. Just Kimi the bottom line. The @Kimi_Moonshot K2 Thinking model has taken 2nd place on the Vals Index for open source. It beat out @Zai_org’s GLM 4.5, though GLM 4.6 is still holding strong and doesn’t look to be budging anytime soon. Here’s what we found in our