hud (@hud_evals)
2K Followers · 204 Following · 8 Media · 43 Statuses
RL environments + evals for agents | @ycombinator | we're hiring!
Joined January 2025
5/ @jayendra_ram + @seeklis from @hud_evals and @joeybesgen from TLDC discussed how to build and scale RL environments. Link to slides: https://t.co/H7MqgHSDwG TL;DR: RL environments have a "Goldilocks" level of difficulty: hard enough to hill-climb on, but not too hard.
2 · 4 · 12
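The "Goldilocks" idea from the talk can be sketched as a simple filter: keep only tasks whose current pass rate sits in a band that still yields learning signal. This is an illustrative sketch, not code from the talk; the band thresholds are assumptions.

```python
def goldilocks_filter(tasks, pass_rates, low=0.1, high=0.9):
    """Keep tasks that are neither saturated (always solved) nor
    impossible (never solved) for the current model; both extremes
    give near-zero signal to hill-climb on in RL."""
    return [t for t, p in zip(tasks, pass_rates) if low <= p <= high]

# Example: only the middle task survives the filter.
kept = goldilocks_filter(["easy", "medium", "hard"], [0.98, 0.5, 0.0])
# kept == ["medium"]
```

In practice the band would be re-estimated as the model improves, which is one reason environment difficulty has to keep scaling.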
We’re giving a talk at YC this Saturday, join us!
We're going to teach best practices for building RL environments at @ycombinator this Saturday. If you're interested in the space and want to see the potential use cases of RL environments, come by! https://t.co/7rm3P5o4s5
1 · 0 · 11
Last but not least, we'll be wrapping this all up at the RL IRL forum with @reductoai @Osmosis_AI @greptileai @encord_team @hud_evals I'm super bullish on big IRL events as an untapped opportunity! Come to get exclusive Osmosis swag! Register here: https://t.co/P3H7JFwwJc
1 · 1 · 14
nearly every talented ML anon i've interacted with over the past year has by now ended up at one of: @PrimeIntellect
@arcee_ai
@hud_evals
8 · 2 · 293
For more details on SheetBench-50, and the challenges of evaluating how agentic workflows can build dashboards giving insight on critical business functions, check out our blog! https://t.co/OV27gI1cUn
hud.so
HUD and Sepal AI introduce SheetBench-50, a benchmark for financial workflows in Google Sheets that measures real-world agent performance.
0 · 0 · 3
Claude Sonnet 4.5 is currently the best Computer Use Agent on SheetBench-50. We're adding more real-world/enterprise use cases soon. If you'd like to evaluate AI agents on SheetBench-50 or other tasks, or improve models for your custom use case, DM us. Or email founders@hud.so (!)
1 · 0 · 5
SheetBench-50 tests agents' abilities on real workflows:
- Long-context: building financial models from scratch
- Error recovery: debugging circular references in budgets
- Multistep planning: compiling reports across business units
- UI navigation: creating dynamic dashboards
1 · 0 · 1
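As an illustration of what one such task could look like, here is a hypothetical spreadsheet-agent task in the error-recovery category. The field names, file path, and `check` helper are invented for this sketch and are not the real SheetBench-50 schema; the key idea is outcome-based grading rather than trajectory matching.

```python
# Hypothetical shape of a spreadsheet-agent eval task (illustrative only).
task = {
    "id": "budget-debug-007",
    "category": "error_recovery",
    "prompt": "Find and fix the circular reference in the Q3 budget sheet.",
    "setup": "fixtures/q3_budget_broken.xlsx",
}

def check(final_sheet_values):
    """Grade by outcome, not by trajectory: after the agent's edits,
    the broken cell must resolve to a number rather than an error
    string like '#REF!'."""
    return isinstance(final_sheet_values.get("B12"), (int, float))
```

Outcome checks like this are what make messy, multi-step tasks gradable at all: there are many valid edit sequences, but one verifiable end state.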
Existing evals have key gaps:
1. Testing only formula generation, not pivot tables, auditing, adjusting forecasts, or reasoning about models
2. Using single-table examples from Excel forums, not messy data like in real enterprise workflows
We worked with experts to make hard tasks.
1 · 0 · 1
The world runs on spreadsheets. Can AI agents handle real-world workflows? We've worked with @sepal_ai on SheetBench-50, a financial analyst benchmark verified by experts from PwC, Cisco, Charles Schwab, and Fannie Mae. For now, Sonnet 4.5 is the top scorer, beating OpenAI CUA 👀
4 · 7 · 18
Think your agent’s SOTA? Prove it. Join Track A (On-site at @HackTheNorth, Waterloo · Sept 12–14) - Best SOTA Computer-Use Agent Hit the highest score on OSWorld-Gold by @hud_evals using the Cua Agent framework (cloud or local models). 🏆 Prize: Guaranteed @ycombinator
1 · 1 · 10
We're super grateful to the XLang team for this collaboration! If you're an agent provider and you want to evaluate your models or share your benchmark scores, DM us or check out the leaderboard at https://t.co/HDfnEiMTRe
2 · 0 · 20
Everyone claims SOTA for Computer Use Agents (CUAs), but there's no way to ensure reproducible results. We're publicly releasing our OSWorld Verified leaderboard, starting with CUA models from OpenAI and Anthropic. We will include more evals and models soon.
7 · 18 · 171
Since everyone is talking about RL environments and GRPO now, but no one knows how they work, we thought it would be cool to make an explainer video + code you can run. This is an example of using GRPO to train Qwen 2.5 to play 2048 (code in thread) 🧵:
26 · 161 · 2K
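The core trick in GRPO can be shown in a few lines: sample a group of completions for the same prompt, then normalize each completion's reward against the group's mean and standard deviation to get advantages, instead of training a separate value model. A minimal sketch (not the code from the video; the example rewards are made up):

```python
import math

def grpo_advantages(group_rewards):
    """Group-relative advantage: (r - mean) / std across a group of
    completions sampled for one prompt. This baseline-free estimate
    is what GRPO uses in place of a learned value function."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    # A degenerate group (all rewards equal) carries no signal.
    return [(r - mean) / std if std > 0 else 0.0 for r in group_rewards]

# e.g. episode rewards from four 2048 rollouts of the same prompt:
advs = grpo_advantages([10.0, 2.0, 2.0, 6.0])
```

Advantages in each group sum to zero, so above-average rollouts are reinforced and below-average ones are pushed down — which is also why tasks a model always solves (or never solves) provide no gradient.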
Day 1 of 5 Days of Cua. We’re starting big: Cua is an official sponsor of Hack the North 2025, Sept 12–14, at UWaterloo. We're running the first Computer-Use Agents SOTA Challenge on-site in Waterloo - and alongside it, a separate global online challenge Cua x Ollama. 1/5
3 · 21 · 98
I usually don’t talk about RL envs on the tl out of respect for our customers but this take won’t age well. Making problems that provide signal for models to get better is pretty hard, and is only going to get harder every year as models improve. The notion that you can vibe
the only RL envs frontier labs will continue buying in the medium term are the idiosyncratic high value ones and if you’re building a high value env, why not just do the RL in house, deploy vertically, and capture orders of magnitude more value?
6 · 6 · 218
btw, if you're interested in agentic benchmarks, we're releasing our financial analyst benchmark soon. DM us if you'd like to benchmark your agent and be on the leaderboard 👀
0 · 0 · 8