Vals AI
@_valsai
5K Followers · 239 Following
Public LLM Evaluation // https://t.co/FjWabQY2jk @8vc @BloombergBeta @pearvc
San Francisco, CA
Joined March 2024
GPT 5.1 has also taken 2nd on our Vals Index - which measures the potential economic impact of a model on the U.S. economy.
Is Anthropic cooked? GPT-5.1-Codex tops Sonnet 4.5 on SWE-Bench at ~26x cheaper. Gemini 3 incoming. Don't know how good it will be at code. Likely amazing. Opus 4.5 *needs* to hit hard at a great price. Anthropic has trouble with cost efficiency. (No joy in saying this)
It also posted gains on LCB, jumping from 12th place to 2nd. On other benchmarks its performance generally stayed about the same, with some incremental improvements: a two-point increase on MMMU, one point on GPQA, and 1.5 points on IOI. (2/3)
🏆 @OpenAI's GPT 5.1 sets a new state-of-the-art on our Finance Agent Benchmark, beating out Sonnet 4.5 (Thinking) by 0.6%. (1/3)
@deanwball better to do it that way than the alternative i have learned :)
For everyone wondering ... yes, we're working on an eval. For everyone not wondering ... hey, we're Vals AI, and we do unbiased benchmarking of public enterprise models. We make all of our results public on our page. Check us out:
GPT-5.1 in ChatGPT is rolling out to all users this week. It’s smarter, more reliable, and a lot more conversational. https://t.co/SA1Q1GPyxV
Check out our eval of GLM 4.6: https://t.co/PxYPv57LIn
Next Thursday, November 20th, 11AM-12PM PST, we're hosting @vjhofmann for a discussion on Fluid Language Benchmarking. If you have questions or topics you want to hear about during the talk, drop them in the comments! Register here: https://t.co/lH3kxArD7m
Can we cut the small talk and get straight to Fluid Language Benchmarking? Join us next Thursday alongside @alphaxiv as we host @vjhofmann for a discussion on Fluid Language Benchmarking. Valentin Hofmann, a Postdoc at the Allen Institute for AI, integrates his research on
Michael... you're onto something. Check out our eval comparison view of Claude Sonnet 4.5 vs. ChatGPT 5. We're not sure which particular model you're using, but you can compare anything here: https://t.co/HNXQhIb0yB
Dexter evals are now live. Our initial setup:
• test data from @_valsai
• test runner from @LangChainAI
Any dataset works - just load it and run. Results show up instantly in LangSmith.
Ultimately, it’s an improvement, but it’s still not enough to take number 1 in the open-source category, or even to break the top ten on the overall Vals Index. Nevertheless, this is an important breakthrough for the open-source ecosystem. @Kimi_Moonshot See the full benchmark results
Latency. Latency. Latency. K2 is thinking nearly 5x slower than @Zai_org’s GLM 4.6, which remains number 1 on our Vals AI Index. But… it performed incredibly well with tool calling, notably on our Finance Agent Benchmark and SWE-Bench. Other than that, it performs similarly to
Kimi this, Kimi that. Just Kimi the bottom line. The @Kimi_Moonshot K2 Thinking model has taken 2nd place on the Vals Index for open source. It beat out @Zai_org’s GLM 4.5, though GLM 4.6 is still holding strong and doesn’t look to be budging anytime soon. Here’s what we found in our