Alex Shan
@alexshander03
77 Followers · 32 Following · 2 Media · 28 Statuses
Agent Behavior Monitoring (ABM) · Co-founder, CEO of @JudgmentLabs
California, USA
Joined July 2025
@alexshander03 tabling his AI evals doctrine in DC this week, flanked by @JudgmentLabs' varsity cheer team - ie @carloagostinel2 & myself. despite best efforts we never made it past the fence
1 reply · 2 reposts · 18 likes
Readers responded with both surprise and agreement last week when I wrote that the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error analysis…
deeplearning.ai · DeepLearning.AI | Andrew Ng
84 replies · 291 reposts · 2K likes
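As a minimal sketch of what such a disciplined evals-plus-error-analysis loop can look like in practice (the agent stub, test cases, and failure tags below are hypothetical placeholders, not taken from the newsletter):

```python
# Illustrative sketch only: a tiny eval harness that measures performance and
# buckets failures for error analysis. The agent, test cases, and tags are
# hypothetical placeholders.
from collections import Counter

def run_agent(question: str) -> str:
    # Stand-in for the real agent under test.
    return "placeholder answer"

EVAL_SET = [
    {"question": "What is 2 + 2?", "expected": "4", "tag": "arithmetic"},
    {"question": "Capital of France?", "expected": "Paris", "tag": "retrieval"},
]

def evaluate(eval_set):
    failures = Counter()
    passed = 0
    for case in eval_set:
        answer = run_agent(case["question"])
        if case["expected"].lower() in answer.lower():
            passed += 1
        else:
            failures[case["tag"]] += 1          # error analysis: tag each miss
    return passed / len(eval_set), failures

if __name__ == "__main__":
    rate, failures = evaluate(EVAL_SET)
    print(f"pass rate: {rate:.0%}")
    print("failures by tag:", dict(failures))   # prioritize the biggest bucket
```

The point of the loop is less the pass rate itself than the failure buckets: they tell the team which error class to attack next.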
@alexshander03 and the entire @JudgmentLabs team have been quietly pushing the limits these past few months. Thrilled for what’s ahead for this exceptional group—big things are coming. Read more below: https://t.co/7jsqtjBfUh
judgmentlabs.ai: We cannot improve on what we cannot measure. Most teams aren’t measuring what matters.
0 replies · 2 reposts · 5 likes
At @JudgmentLabs we've had the opportunity to work with countless AI agent teams building fantastic products. Measuring and understanding agent behavior has become a bottleneck to agent improvement, and everyone knows it. However, few get this process right, and most teams fall…
2 replies · 5 reposts · 15 likes
There is insane demand for people who can understand and explain technology in a compelling way.
1K replies · 1K reposts · 17K likes
6 months, 25 million revenue agents & 3 trillion tokens later... Rox is now globally available 🌎 Just as coding agents 10x’d engineering, revenue agents 10x customer work. With Rox, humans are evolving into orchestrators while agents manage the end-to-end customer lifecycle.
94 replies · 88 reposts · 648 likes
This is insane - and foreshadows a future that will come fast. Cursor just handed us the first production-ready demonstration of how strong online RL can be!! The secret to generalizing is figuring out how different apps, each with their own interface, can collect…
We've trained a new Tab model that is now the default in Cursor. This model makes 21% fewer suggestions than the previous model while having a 28% higher accept rate for the suggestions it makes. Learn more about how we improved Tab with online RL.
0 replies · 0 reposts · 0 likes
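The thread doesn't include training code, so the following is only a toy sketch of the general idea behind improving a suggestion model with online RL: treat each shown suggestion's accept or reject as a reward and nudge the policy as feedback streams in. The contextual-bandit policy and REINFORCE-style update below are assumptions, not Cursor's method.

```python
# Illustrative sketch only: online RL from accept/reject feedback on suggestions.
import math
import random

class SuggestionPolicy:
    """Toy contextual-bandit policy: per-context probability of showing a suggestion."""
    def __init__(self, lr=0.05):
        self.show_logit = {}   # context -> logit for the "show a suggestion" action
        self.lr = lr

    def p_show(self, ctx):
        return 1.0 / (1.0 + math.exp(-self.show_logit.get(ctx, 0.0)))

    def act(self, ctx):
        # Sample whether to surface a suggestion in this context.
        return random.random() < self.p_show(ctx)

    def update(self, ctx, shown, accepted):
        # REINFORCE-style online update: +1 reward for an accepted suggestion,
        # -1 for a rejected one; no update when nothing was shown.
        if not shown:
            return
        reward = 1.0 if accepted else -1.0
        p = self.p_show(ctx)
        # gradient of log pi(show | ctx) with respect to the logit is (1 - p)
        self.show_logit[ctx] = self.show_logit.get(ctx, 0.0) + self.lr * reward * (1 - p)

# Simulated feedback stream: accepts are common in "python" contexts and rare in
# "prose" contexts, so the policy learns to suggest less where accepts are rare.
policy = SuggestionPolicy()
accept_rate = {"python": 0.7, "prose": 0.1}
for _ in range(5000):
    ctx = random.choice(["python", "prose"])
    shown = policy.act(ctx)
    accepted = shown and (random.random() < accept_rate[ctx])
    policy.update(ctx, shown, accepted)

print({c: round(policy.p_show(c), 2) for c in accept_rate})
```

Pushing the policy toward contexts where accepts are likely is what produces the "fewer suggestions, higher accept rate" pattern described in the announcement.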
we’re approaching the end of 2025 and there’s still no plug-n-play RL lib. in the interim:
- i built a shitty version of this (llamagym)
- RL started working (o1)
- oss found out how it worked (r1)
- “RL env” became the new buzzword
- oss RL envs unified around `verifiers`
how is it 2024 and there are still no simple open-source frameworks for fine-tuning an LLM agent in an RL setup? i should be able to take an old openai gym env and drop in llama for fine-tuning. who's building this?
38 replies · 32 reposts · 497 likes
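A rough sketch of the plug-n-play interface being asked for, i.e. dropping an LLM policy into a gym-style env and collecting reward-labeled rollouts for fine-tuning. The toy env, the llm_policy stub, and the update step are hypothetical placeholders, not any existing library's API (including llamagym or verifiers).

```python
# Illustrative sketch only: "plug an LLM into a gym-style env" as the tweet asks for.
import random

class ToyTextEnv:
    """Gym-style env with text observations: guess a hidden digit in {0..9}."""
    def reset(self):
        self.target = random.randint(0, 9)
        return "Guess a digit between 0 and 9.", {}

    def step(self, action: str):
        reward = 1.0 if action.strip() == str(self.target) else 0.0
        obs = f"You guessed {action.strip()}."
        return obs, reward, True, False, {}   # obs, reward, terminated, truncated, info

def llm_policy(prompt: str) -> str:
    # Stand-in for a model call (e.g., a llama generate()); random for the sketch.
    return str(random.randint(0, 9))

def collect_rollouts(env, policy, n_episodes=100):
    rollouts = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        action = policy(obs)
        _, reward, *_ = env.step(action)
        rollouts.append({"prompt": obs, "completion": action, "reward": reward})
    return rollouts

if __name__ == "__main__":
    data = collect_rollouts(ToyTextEnv(), llm_policy)
    # A real framework would now run a policy-gradient-style fine-tuning update
    # on `data`; here we only report the average reward.
    print("mean reward:", sum(r["reward"] for r in data) / len(data))
```

The appeal of the interface is that the env contract (reset/step, text in, reward out) stays fixed while the policy and the update rule are swapped underneath.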
“Evals” is becoming an ever-growing umbrella term describing any measure of quality across an AI app. As a result, conversations and discourse are getting lost in definitions and semantics... here's an example. Frontier labs use evals (reward models, human…
1 reply · 1 repost · 8 likes
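One way to see why the umbrella blurs things: a binary assertion check and a scalar preference score are both routinely called "evals" yet measure very different things. A toy illustration, with a stub standing in for a reward model or LLM judge rather than any lab's actual implementation:

```python
# Illustrative sketch only: two very different measurements that both get called "evals".

def assertion_eval(output: str, expected: str) -> bool:
    """Binary pass/fail check, the kind used in regression-style test suites."""
    return expected.lower() in output.lower()

def preference_score(output: str) -> float:
    """Scalar quality score, the kind a reward model or LLM judge produces."""
    # Stub heuristic standing in for a learned scorer (placeholder logic).
    score = min(len(output) / 200.0, 1.0)
    if "sorry" in output.lower():
        score *= 0.8
    return score

answer = "Paris is the capital of France."
print("passes assertion:", assertion_eval(answer, "Paris"))     # True / False
print("preference score:", round(preference_score(answer), 2))  # 0.0 - 1.0
```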
New blog post about asymmetry of verification and "verifier's law": https://t.co/bvS8HrX1jP Asymmetry of verification, the idea that some tasks are much easier to verify than to solve, is becoming an important idea now that we have RL that finally works generally. Great examples of…
54 replies · 247 reposts · 2K likes
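A standard illustration of the asymmetry (not taken from the linked post): for subset-sum, checking a proposed subset is linear-time while finding one by brute force is exponential, which is exactly the shape of task where a cheap verifier can double as an RL reward signal.

```python
# Asymmetry of verification: verifying a subset-sum certificate is O(n),
# finding one by brute force is O(2^n).
from itertools import combinations

def verify(nums, target, candidate) -> bool:
    """Cheap verifier: does the proposed subset hit the target sum?"""
    return all(x in nums for x in candidate) and sum(candidate) == target

def solve(nums, target):
    """Expensive solver: brute-force search over all 2^n subsets."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums, target = [3, 9, 8, 4, 5, 7], 15
certificate = solve(nums, target)                       # the hard direction
print(certificate, verify(nums, target, certificate))   # the easy direction
```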