Alex Shan Profile
Alex Shan

@alexshander03

Followers: 77 · Following: 32 · Media: 2 · Statuses: 28

Agent Behavior Monitoring (ABM) Co-founder, CEO of @JudgmentLabs

California, USA
Joined July 2025
@JamesAlcorn94
James Alcorn
10 days
@alexshander03 tabling his AI evals doctrine in DC this week, flanked by @JudgmentLabs' varsity cheer team - ie @carloagostinel2 & myself. despite best efforts we never made it past the fence
1
2
18
@AndrewYNg
Andrew Ng
2 months
Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error
deeplearning.ai
DeepLearning.AI | Andrew Ng | Join over 7 million people learning how to use and build AI through our online courses. Earn certifications, level up your skills, and stay ahead of the industry.
84
291
2K
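The "disciplined process for evals" the tweet describes boils down to a loop: run the system on a fixed set of cases, score each output, and aggregate into one metric you track over time. A minimal sketch of that loop, where the system, scorer, and cases are all hypothetical stand-ins:

```python
# Minimal eval-harness sketch: score a system on fixed cases and
# aggregate into a single tracked metric. All names are illustrative.

def system_under_test(prompt: str) -> str:
    # Stand-in for the AI system being evaluated.
    return prompt.upper()

def exact_match(output: str, expected: str) -> float:
    # Simplest possible scorer: 1.0 on exact match, else 0.0.
    return 1.0 if output == expected else 0.0

def run_eval(cases) -> float:
    # Run every case through the system and average the scores.
    scores = [exact_match(system_under_test(p), exp) for p, exp in cases]
    return sum(scores) / len(scores)

cases = [("abc", "ABC"), ("hi", "HI"), ("x", "y")]
print(run_eval(cases))  # two of three cases pass
```

In practice the scorer is where the difficulty lives (LLM judges, rubric checks, reference-free metrics), but the harness shape stays this simple.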
@carloagostinel2
carlo agostinelli
2 months
@alexshander03 and the entire @JudgmentLabs team have been quietly pushing the limits these past few months. Thrilled for what’s ahead for this exceptional group—big things are coming. Read more below: https://t.co/7jsqtjBfUh
judgmentlabs.ai
We cannot improve on what we cannot measure. Most teams aren’t measuring what matters.
0
2
5
@alexshander03
Alex Shan
2 months
At @JudgmentLabs we've had the opportunity to work with countless AI agent teams building fantastic products. Measuring and understanding agent behavior has become a bottleneck to agent improvement and everyone knows it. However, few get this process right and most teams fall
2
5
15
@supabase
Supabase
3 months
chat is this real?
12
2
95
@alexshander03
Alex Shan
3 months
@atelicinvest
Unemployed Capital Allocator
3 months
Coding: 4.2%. We live in a bubble
0
0
0
@alexshander03
Alex Shan
3 months
Brendan is right that methods for evaluations will push the bounds of what agents can learn. I think there's something to be said here about how we know what to evaluate and how we go about doing that. Sometimes we can rely on human experts to craft criteria, but this can break
@BrendanFoody
Brendan (can/do)
3 months
0
0
1
@mattyp
matt palmer
3 months
There is insane demand for people who can understand and explain technology in a compelling way.
1K
1K
17K
@rox_ai
Rox
3 months
6 months, 25 million revenue agents & 3 trillion tokens later... Rox is now globally available 🌎 Just as coding agents 10x’d engineering, revenue agents 10x customer work. With Rox, humans are evolving to orchestrators while agents manage the end-to-end customer lifecycle.
94
88
648
@alexshander03
Alex Shan
3 months
This is insane - and foreshadows a future that will come fast. Cursor just handed us the first production-ready demonstration of how strong online RL can be!! The secret here to generalizing is figuring out how different apps, each with their own interface, can collect
@cursor_ai
Cursor
3 months
We've trained a new Tab model that is now the default in Cursor. This model makes 21% fewer suggestions than the previous model while having a 28% higher accept rate for the suggestions it makes. Learn more about how we improved Tab with online RL.
0
0
0
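The mechanism behind the quoted Tab result — using live accept/reject signals to improve when a suggestion is shown — can be illustrated with a toy one-feature logistic policy updated online per interaction. This is an illustrative sketch, not Cursor's actual training setup; the feature, feedback stream, and learning rate are all invented:

```python
# Toy sketch of learning from live accept/reject feedback: each shown
# suggestion yields a binary signal, and the policy is nudged toward
# showing suggestions like the ones users accepted.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feedback stream: (model confidence x, 1 if accepted).
feedback = [(0.9, 1), (0.8, 1), (0.7, 1), (0.3, 0), (0.2, 0), (0.1, 0)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(200):                    # replaying the stream stands in for live traffic
    for x, accepted in feedback:
        p = sigmoid(w * x + b)          # current probability of showing the suggestion
        w += lr * (accepted - p) * x    # online gradient step toward observed behavior
        b += lr * (accepted - p)

# After many updates the policy prefers high-confidence suggestions,
# i.e. it learns to suggest less but get accepted more.
print(sigmoid(0.9 * w + b) > sigmoid(0.1 * w + b))  # True
```

The "fewer suggestions, higher accept rate" trade-off in the quoted numbers falls out naturally: the learned threshold suppresses low-confidence suggestions that were mostly being rejected.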
@khoomeik
Rohan Pandey
3 months
we’re approaching the end of 2025 and there’s still no plug-n-play RL lib in the interim:
- i built a shitty version of this (llamagym)
- RL started working (o1)
- oss found out how it worked (r1)
- “RL env” became the new buzzword
- oss RL envs unified around `verifiers`
@khoomeik
Rohan Pandey
2 years
how is it 2024 and there are still no simple open-source frameworks for fine-tuning an LLM agent in an RL setup? i should be able to take an old openai gym env and drop in llama for fine-tuning. who's building this?
38
32
497
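The "RL env" pattern the thread converges on is a small contract: the env serves a task prompt on reset, and step() scores a model's completion with a programmatic verifier. A toy sketch of that contract — the interface is illustrative, not the actual `verifiers` library API:

```python
# Minimal "RL env with verifiable reward" sketch: reset() yields a task
# prompt, step() grades an answer against ground truth. Illustrative only.

class ArithmeticEnv:
    """Toy env where the task is evaluating a small arithmetic expression."""

    def __init__(self, problems):
        self.problems = problems   # list of (expression, ground-truth value)
        self.idx = -1

    def reset(self) -> str:
        # Advance to the next problem and return its prompt.
        self.idx = (self.idx + 1) % len(self.problems)
        expr, _ = self.problems[self.idx]
        return f"Compute: {expr}"

    def step(self, answer: str):
        # Verifiable reward: exact match against the ground-truth value.
        _, truth = self.problems[self.idx]
        reward = 1.0 if answer.strip() == str(truth) else 0.0
        return reward, True        # single-turn task, so always done

env = ArithmeticEnv([("2+2", 4), ("3*5", 15)])
prompt = env.reset()               # "Compute: 2+2"
reward, done = env.step("4")       # a correct completion earns reward 1.0
print(prompt, reward)
```

An old-style gym loop drops in directly: sample a completion from the model for each prompt, feed the reward into whatever policy-gradient update you use.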
@alexshander03
Alex Shan
3 months
“Evals” are becoming an ever-growing umbrella of terminology that describes any measure of quality across an AI app. As a result, conversations and discourse are getting lost in definitions and semantics... here's an example. Frontier labs use evals (reward models, human
1
1
8
@_jasonwei
Jason Wei
5 months
New blog post about asymmetry of verification and "verifier's law": https://t.co/bvS8HrX1jP Asymmetry of verification–the idea that some tasks are much easier to verify than to solve–is becoming an important idea as we have RL that finally works generally. Great examples of
54
247
2K
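Asymmetry of verification is easy to make concrete with a classic example: for subset-sum, checking a proposed subset is linear in the input, while finding one by brute force searches exponentially many candidates. A small self-contained illustration:

```python
# Concrete instance of asymmetry of verification: for subset-sum,
# verifying a candidate is O(n), while naive solving searches 2^n subsets.
from itertools import combinations

def verify(nums, subset, target) -> bool:
    # Cheap check: every element is drawn from nums, and the sum matches.
    remaining = list(nums)
    for x in subset:
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(subset) == target

def solve(nums, target):
    # Expensive search: try every subset, smallest first.
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
sol = solve(nums, 9)
print(sol, verify(nums, sol, 9))
```

This gap is exactly what makes such tasks attractive for RL: the verifier doubles as a cheap, reliable reward function even when solving is hard.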
@JamesAlcorn94
James Alcorn
7 months
reward is eval, eval is reward, reward is enough
1
3
30