Theta
@trytheta
Followers: 758, Following: 64, Media: 2, Statuses: 16
Specialized AI for Every Job
Joined May 2025
Introducing CUB: Humanity's Last Exam for Computer and Browser Use Agents
Why Now? (4/4) AI-first browsers are poised to disrupt the massive web browser market, with highly anticipated releases like Comet from @perplexity_ai on the way. It's yet to be seen how Google integrates Project Mariner and other AI tools within Chrome.
Why Now? (3/4) Open source frameworks like @browser_use and @Stagehanddev have become some of the most popular repos on GitHub, with tens of thousands of stars.
Why Now? (2/4) Computer/browser use has become one of the most important frontiers for model capabilities, with @OpenAI, @AnthropicAI, and @GoogleDeepMind each dedicating teams to Operator, Claude Computer Use, and Project Mariner, respectively.
Why Now? (1/4) We're seeing new companies launch in the space every week, for both consumer and enterprise use cases. @ManusAI_HQ is one of the most popular generalist consumer agents, and @AthenaIntell is already being used by companies like Anheuser-Busch.
Browser agents use computers the same way humans do, unlocking powerful use cases for personal assistants, browsers, and enterprise workflows. After talking to 20+ founders in the space, we're excited to put out the definitive market map for browser agents.
The Theta team started CUB as an internal evalset, but it quickly grew into a full-fledged benchmark over the past month. We're excited to test even more models and frameworks. For more on the benchmark, including examples and a full paper, check out our blog:
Computer/browser use agents still have a long way to go for more complex, end-to-end workflows. Actual task completion is far below our reported numbers: we gave credit for partially correct solutions and reaching key checkpoints. In total, there were fewer than 10 instances
We worked with domain experts (accountants, investment bankers, doctors, etc.) to create representative tasks of real-world workflows and software tools. We've heard from so many companies in the CUA/browser agent space who are already tackling these workflows, but existing
@browser_use took a big hit at 3.78% because it struggled with spreadsheets, but we're confident it would do much better with some improvement in that area. Despite @GoogleAI Gemini 2.5 Pro's strong multimodal performance on other benchmarks, it completely failed at computer use
Among the agents we tested, @ManusAI_HQ came out on top at 9.23%, followed by @OpenAI Operator at 7.28% and @AnthropicAI Claude 3.7 Computer Use at 6.01%. We found that Manus' proactive planning and orchestration helped it come out on top.
we've been misled to believe that manual prompt hacking is the solution to teaching LLMs how to approach complex problems. why write a "magic prompt" to pattern match for every type of problem you might care about, when LLMs have already shown extraordinary ability to self-review
We're missing (at least one) major paradigm for LLM learning. Not sure what to call it, possibly it has a name - system prompt learning? Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters but a lot of human
Theta (@trytheta) allows AI agents to learn from their mistakes in real-time. Their memory layer has already improved the accuracy of OpenAI Operator by 43% with 7x fewer steps taken. https://t.co/9uI9vbSYLs Congrats on the launch, @RayanGarg, @tsha444, and @_gurvir_!