hud Profile
hud

@hud_evals

Followers
2K
Following
204
Media
8
Statuses
43

RL environments + evals for agents | @ycombinator | we're hiring!

Joined January 2025
@hud_evals
hud
4 months
we're actively hiring for these roles btw 👀
@creatine_cycle
atlas
4 months
the jobs left after the singularity will be: - agentic workflow engineer - twink - chief of staff
25
5
216
@parth220
parth
2 days
Much love to @xeophon_ for having the same type of eval tism. @EpochAIResearch 🤝 @hud_evals
@EpochAIResearch
Epoch AI
4 days
We looked at OSWorld, a popular evaluation of AI computer use capabilities. Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time. See thread for details!
2
3
16
@_WEEXIAO
Kasey Zhang
10 days
5/ @jayendra_ram + @seeklis from @hud_evals and @joeybesgen from TLDC discussed how to build and scale RL environments. Link to slides: https://t.co/H7MqgHSDwG TL;DR: RL environments have a "Goldilocks" level of difficulty - hard enough to hill climb on, but not too hard
2
4
12
@hud_evals
hud
17 days
We’re giving a talk at YC this Saturday, join us!
@jayendra_ram
Jay
18 days
We're going to teach people the best practices when making RL environments at @ycombinator this Saturday. If you're interested in the space and want to see the potential use cases of RL environments, come by! https://t.co/7rm3P5o4s5
1
0
11
@Jackyhuang
jacky huang (:
23 days
Last but not least, we'll be wrapping this all up at RL IRL forum with @reductoai @Osmosis_AI @greptileai @encord_team @hud_evals. I'm super bullish on big IRL events as an untapped opportunity! Come to get exclusive Osmosis swag! Register here: https://t.co/P3H7JFwwJc
1
1
14
@leothecurious
davinci
1 month
nearly every talented ML anon i've interacted with over the past year has by now ended up at one of: @PrimeIntellect @arcee_ai @hud_evals
8
2
293
@hud_evals
hud
1 month
For more details on SheetBench-50, and the challenges of evaluating how agentic workflows can build dashboards giving insight on critical business functions, check out our blog! https://t.co/OV27gI1cUn
hud.so
HUD and Sepal AI introduce SheetBench-50, a benchmark for financial workflows in Google Sheets that measures real-world agent performance.
0
0
3
@hud_evals
hud
1 month
Claude Sonnet 4.5 is currently the best Computer Use Agent on SheetBench-50. We're adding more real-world/enterprise usecases soon. If you'd like to evaluate AI Agents on SheetBench-50 or other tasks, or improve models for your custom usecase, DM us. Or email founders@hud.so (!)
1
0
5
@hud_evals
hud
1 month
SheetBench-50 tests agents' abilities on real workflows: - Long-context: building financial models from scratch - Error recovery: debugging circular references in budgets - Multistep planning: compiling reports across business units - UI navigation: creating dynamic dashboards
1
0
1
@hud_evals
hud
1 month
Existing evals have key gaps: 1. Testing only formula generation, not pivot tables, auditing, adjusting forecasts, or reasoning about models 2. Using single-table examples from Excel forums, not messy data like in real enterprise workflows. We worked with experts to make hard tasks.
1
0
1
@hud_evals
hud
1 month
The world runs on spreadsheets. Can AI agents handle real-world workflows? We've worked with @sepal_ai on SheetBench-50, a financial analyst benchmark verified by experts from PwC, Cisco, Charles Schwab, and Fannie Mae. For now, Sonnet 4.5 is the top scorer, beating OpenAI CUA 👀
4
7
18
@trycua
Cua
2 months
Think your agent’s SOTA? Prove it. Join Track A (On-site at @HackTheNorth, Waterloo · Sept 12–14) - Best SOTA Computer-Use Agent Hit the highest score on OSWorld-Gold by @hud_evals using the Cua Agent framework (cloud or local models). 🏆 Prize: 🇾Guaranteed @ycombinator
1
1
10
@parth220
parth
2 months
When your startup office is Zoomer coded…
1
1
22
@hud_evals
hud
2 months
We're super grateful to the XLang team for this collaboration! If you're an agent provider and you want to evaluate your models or share your benchmark scores, DM us or check out the leaderboard at https://t.co/HDfnEiMTRe
2
0
20
@hud_evals
hud
2 months
Everyone claims SOTA for Computer Use Agents (CUAs), but there's no way to ensure reproducible results. We're publicly releasing our OSWorld Verified leaderboard, starting with CUA models from OpenAI and Anthropic. We will include more evals and models soon.
7
18
171
@parth220
parth
2 months
uvx hud-python quickstart :D
7
8
114
@jayendra_ram
Jay
2 months
Since everyone is talking about RL Environments and GRPO now but no one knows how it works we thought it would be cool to make an explainer video + code you can run: This is an example of using GRPO to train Qwen 2.5 to play 2048 (code in thread) 🧵:
26
161
2K
@trycua
Cua
2 months
Day 1 of 5 Days of Cua. We’re starting big: Cua is an official sponsor of Hack the North 2025, Sept 12–14, at UWaterloo. We're running the first Computer-Use Agents SOTA Challenge on-site in Waterloo - and alongside it, a separate global online challenge Cua x Ollama. 1/5
3
21
98
@jayendra_ram
Jay
3 months
I usually don’t talk about RL envs on the tl out of respect for our customers but this take won’t age well. Making problems that provide signal for models to get better is pretty hard, and is only going to get harder every year as models improve. The notion that you can vibe
@khoomeik
Rohan Pandey
3 months
the only RL envs frontier labs will continue buying in the medium term are the idiosyncratic high value ones and if you’re building a high value env, why not just do the RL in house, deploy vertically, and capture orders of magnitude more value?
6
6
218
@hud_evals
hud
3 months
btw, if you're interested in agentic benchmarks, we're releasing our financial analyst benchmark soon. DM us if you'd like to benchmark your agent and be on the leaderboard 👀
0
0
8