hud @hud_evals X Profile

hud

@hud_evals

Followers: 2K
Following: 204
Media: 8
Statuses: 43

RL environments + evals for agents | @ycombinator | we're hiring!

https://t.co/zgeFuBPEjC

Joined January 2025

hud

@hud_evals

4 months

we're actively hiring for these roles btw👀

atlas

@creatine_cycle

4 months

the jobs left after the singularity will be:
- agentic workflow engineer
- twink
- chief of staff

25

5

216

parth

@parth220

2 days

Much love to @xeophon_ for having the same type of eval tism. @EpochAIResearch 🤝 @hud_evals

Epoch AI

@EpochAIResearch

4 days

We looked at OSWorld, a popular evaluation of AI computer use capabilities. Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time. See thread for details!

2

3

16

Kasey Zhang

@_WEEXIAO

10 days

5/ @jayendra_ram + @seeklis from @hud_evals and @joeybesgen from TLDC discussed how to build and scale RL environments. Link to slides: https://t.co/H7MqgHSDwG TL;DR: RL environments have a "Goldilocks" level of difficulty - hard enough to hill climb on, but not too hard

2

4

12

hud

@hud_evals

17 days

We’re giving a talk at YC this Saturday, join us!

Jay

@jayendra_ram

18 days

We're going to teach people the best practices when making RL environments at @ycombinator this Saturday. If you're interested in the space and want to see the potential use cases of RL environments, come by! https://t.co/7rm3P5o4s5

1

0

11

jacky huang (:

@Jackyhuang

23 days

Last but not least, we'll be wrapping this all up at RL IRL forum with @reductoai @Osmosis_AI @greptileai @encord_team @hud_evals. I'm super bullish on big IRL events as an untapped opportunity! Come to get exclusive Osmosis swag! Register here: https://t.co/P3H7JFwwJc

1

14

davinci

@leothecurious

1 month

nearly every talented ML anon i've interacted with over the past year has by now ended up at one of: @PrimeIntellect @arcee_ai @hud_evals

8

2

293

hud

@hud_evals

1 month

For more details on SheetBench-50, and the challenges of evaluating how agentic workflows can build dashboards giving insight on critical business functions, check out our blog! https://t.co/OV27gI1cUn

hud.so

HUD and Sepal AI introduce SheetBench-50, a benchmark for financial workflows in Google Sheets that measures real-world agent performance.

0

3

hud

@hud_evals

1 month

Claude Sonnet 4.5 is currently the best Computer Use Agent on SheetBench-50. We're adding more real-world/enterprise use cases soon. If you'd like to evaluate AI agents on SheetBench-50 or other tasks, or improve models for your custom use case, DM us. Or email founders@hud.so (!)

1

0

5

hud

@hud_evals

1 month

SheetBench-50 tests agents' abilities on real workflows:
- Long-context: building financial models from scratch
- Error recovery: debugging circular references in budgets
- Multistep planning: compiling reports across business units
- UI navigation: creating dynamic dashboards

1

0

1

hud

@hud_evals

1 month

Existing evals have key gaps:
1. Testing only formula generation, not pivot tables, auditing, adjusting forecasts, or reasoning about models
2. Using single-table examples from Excel forums, not messy data like in real enterprise workflows
We worked with experts to make hard tasks.

1

0

1

hud

@hud_evals

1 month

The world runs on spreadsheets. Can AI agents handle real-world workflows? We've worked with @sepal_ai on SheetBench-50, a financial analyst benchmark verified by experts from PwC, Cisco, Charles Schwab, and Fannie Mae. For now, Sonnet 4.5 is the top scorer, beating OpenAI CUA👀

4

7

18

Cua

@trycua

2 months

Think your agent’s SOTA? Prove it. Join Track A (On-site at @HackTheNorth, Waterloo · Sept 12–14) - Best SOTA Computer-Use Agent: hit the highest score on OSWorld-Gold by @hud_evals using the Cua Agent framework (cloud or local models). 🏆 Prize: Guaranteed @ycombinator

1

10

parth

@parth220

2 months

When your startup office is Zoomer coded…

1

22

hud

@hud_evals

2 months

We're super grateful to the XLang team for this collaboration! If you're an agent provider and you want to evaluate your models or share your benchmark scores, DM us or check out the leaderboard at https://t.co/HDfnEiMTRe

2

0

20

hud

@hud_evals

2 months

Everyone claims SOTA for Computer Use Agents (CUAs), but there's no way to ensure reproducible results. We're publicly releasing our OSWorld Verified leaderboard, starting with CUA models from OpenAI and Anthropic. We will include more evals and models soon.

7

18

171

parth

@parth220

2 months

uvx hud-python quickstart :D

7

8

114

will brown

@willccbb

2 months

@DarwinianVyas shoutout @hud_evals fr https://t.co/ZFp39vstao

app.primeintellect.ai

Browser-based 2048 game for evaluating agents using visual observations and keyboard actions

1

5

18

Jay

@jayendra_ram

2 months

Since everyone is talking about RL Environments and GRPO now, but no one knows how they work, we thought it would be cool to make an explainer video + code you can run. This is an example of using GRPO to train Qwen 2.5 to play 2048 (code in thread) 🧵:

26

161

2K

Cua

@trycua

2 months

Day 1 of 5 Days of Cua. We’re starting big: Cua is an official sponsor of Hack the North 2025, Sept 12–14, at UWaterloo. We're running the first Computer-Use Agents SOTA Challenge on-site in Waterloo - and alongside it, a separate global online challenge Cua x Ollama. 1/5

3

21

98

Jay

@jayendra_ram

3 months

I usually don’t talk about RL envs on the tl out of respect for our customers but this take won’t age well. Making problems that provide signal for models to get better is pretty hard, and is only going to get harder every year as models improve. The notion that you can vibe

Rohan Pandey

@khoomeik

3 months

the only RL envs frontier labs will continue buying in the medium term are the idiosyncratic high value ones and if you’re building a high value env, why not just do the RL in house, deploy vertically, and capture orders of magnitude more value?

6

218

hud

@hud_evals

3 months

btw, if you're interested in agentic benchmarks, we're releasing our financial analyst benchmark soon. DM us if you'd like to benchmark your agent and be on the leaderboard👀

0

8