Labelbox
@labelbox
Followers
3K
Following
589
Media
28
Statuses
240
High-quality frontier data for leading AI teams.
San Francisco, CA
Joined January 2018
Essential weekend reading. The Scaling Era: an oral history of AI by @dwarkesh_sp and thank you @stripepress!
this is great
The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self
Highly recommend tuning into @dwarkesh_sp's episode today with @karpathy. They dive deep into why RL is so information-sparse and what that means for realizing the decade of agents. A few highlights that stood out:
- “RL is terrible; it’s just that everything else is much
Get in touch to learn more about our latest work in RL here:
labelbox.com
Discover how we partner with researchers to fuel the next wave of AI advancements, powered by experts in post-training and model evaluation.
Thrilled to be featured in Dwarkesh’s latest episode with Richard Sutton, widely regarded as the father of reinforcement learning and 2024 Turing Award winner. As Richard explains, we’re entering the Era of Experience, where training AI means creating environments that capture
See more of Dwarkesh’s visit and get in touch to learn how Labelbox delivers large-scale, high-fidelity data collection to advance next-gen robotics.
labelbox.com
Discover how we partner with researchers to fuel the next wave of AI advancements, powered by experts in post-training and model evaluation.
As his latest guest, @svlevine (co-founder of @physical_int) predicts, robots could be running households entirely autonomously by 2030.
We recently invited @dwarkesh_sp to stop by our SF robotics lab. World-class podcaster, rookie robotics intern.
.@svlevine is one of the world's leading robotics researchers (and co-founder of @physical_int). He thinks fully autonomous robots are much closer than people realize - when I pushed him on a prediction, he said 5 years to robots that can autonomously run your household. The
If you’re a Dwarkesh fan, check out the landing page and follow along. This is just the beginning of something special.
We’ve always admired how @dwarkesh_sp sparks conversations with top thinkers in AI, academia, and tech. Now we’re teaming up to connect with a community that shares our mission of pushing the limits of what’s possible in AI. The first episode together with one of his most
We’ll continue evaluating frontier models on more constraint-heavy domains and reporting results as the gap between leading models' capabilities closes. Check out our blog post for more info!
labelbox.com
Lessons learned: Constraint interactions, not just individual rules, limit performance, and success on synthetic tasks doesn’t always transfer to real-world cases. We observe that high constraint densities also tend to expose weaknesses, and analyzing failures helps guide targeted
Our initial findings show that no current model maintains consistent feasibility under real-world, high-complexity scenarios. On synthetic stress tests, o3 demonstrates the highest feasibility, closely followed by GPT-5. In a domain-grounded data center migration benchmark, GPT-5
We tested whether leading models could generate schedules on RCPSP (resource-constrained project scheduling problems) that meet all constraints and remain consistent as complexity increases. To do this, we varied task difficulty across hundreds of levels and applied realistic
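As a rough illustration of what "meets all constraints" means for an RCPSP schedule, the sketch below checks a candidate schedule for precedence and resource-capacity violations. It is not the ConstraintBench harness: the `Task` type, the `find_violations` function, the `crew` resource, and the toy instance are all hypothetical, and integer time steps are assumed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """Hypothetical RCPSP task: a duration, per-resource demand, and predecessors."""
    duration: int
    demand: dict = field(default_factory=dict)  # resource name -> units needed
    predecessors: tuple = ()


def find_violations(tasks, capacity, start):
    """Return a list of human-readable constraint violations (empty if feasible).

    tasks:    dict of task name -> Task
    capacity: dict of resource name -> units available at every time step
    start:    dict of task name -> integer start time (the candidate schedule)
    """
    missing = [name for name in tasks if name not in start]
    if missing:
        return [f"{name}: not scheduled" for name in missing]

    violations = []

    # Precedence: a task may not start before each predecessor has finished.
    for name, task in tasks.items():
        if start[name] < 0:
            violations.append(f"{name}: negative start time")
        for pred in task.predecessors:
            pred_end = start[pred] + tasks[pred].duration
            if start[name] < pred_end:
                violations.append(
                    f"{name}: starts at {start[name]} before {pred} ends at {pred_end}"
                )

    # Resources: at every time step, total demand of running tasks must fit capacity.
    horizon = max((start[n] + t.duration for n, t in tasks.items()), default=0)
    for step in range(horizon):
        running = [t for n, t in tasks.items() if start[n] <= step < start[n] + t.duration]
        for resource, cap in capacity.items():
            used = sum(t.demand.get(resource, 0) for t in running)
            if used > cap:
                violations.append(f"t={step}: {resource} uses {used} of {cap}")

    return violations


# Toy instance (made up for illustration): three tasks sharing a crew of 3.
tasks = {
    "A": Task(duration=2, demand={"crew": 1}),
    "B": Task(duration=3, demand={"crew": 2}, predecessors=("A",)),
    "C": Task(duration=2, demand={"crew": 2}),
}
capacity = {"crew": 3}

feasible = find_violations(tasks, capacity, {"A": 0, "C": 0, "B": 2})
infeasible = find_violations(tasks, capacity, {"A": 0, "B": 1, "C": 1})
```

Here `feasible` comes back empty, while `infeasible` reports one precedence violation (B starts before A ends) and two overloaded time steps. A check like this only decides feasibility; it says nothing about how short or good the schedule is.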
Introducing ConstraintBench: a new benchmark for evaluating LLM reasoning on realistic resource-constrained project scheduling problems (RCPSP), a well-known NP-complete challenge. It tackles some of the toughest planning challenges (such as project management, construction,
As AI advances, so do the human skills required to shape and align it. Full report: https://t.co/pmiGSriNU6
Grok-4 just landed on our Complex Reasoning leaderboard, and it’s impressive 💥
- Math: 81.8%
- Pure Math: 84.8%
- Applied Math: 79.9%
- CS: 75.4%
- Reasoning: 77.8%
- Aggregate: 80.7%
See how it stacks up:
labelbox.com
The Labelbox complex reasoning leaderboard rigorously assesses top AI models against some of the most demanding tasks available today.