Theta @trytheta X Profile

Theta

@trytheta

Followers

758

Following

64

Media

2

Statuses

16

Specialized AI for Every Job

https://t.co/36ZVWp12Gi

Joined May 2025

Theta

@trytheta

8 months

Introducing CUB: Humanity's Last Exam for Computer and Browser Use Agents

32

41

248

Theta

@trytheta

8 months

For a deeper dive, check out our blog with @SeanZCai:

1

15

Theta

@trytheta

8 months

Why Now? (4/4) AI-first browsers are poised to disrupt the massive web browser market, with highly anticipated releases like Comet from @perplexity_ai on the way. It's yet to be seen how Google integrates Project Mariner and other AI tools within Chrome.

1

16

Theta

@trytheta

8 months

Why Now? (3/4) Open source frameworks like @browser_use and @Stagehanddev have become some of the most popular repos on GitHub, with tens of thousands of stars.

1

0

13

Theta

@trytheta

8 months

Why Now? (2/4) Computer/browser use has become one of the most important frontiers for model capabilities, with @OpenAI, @AnthropicAI, and @GoogleDeepMind dedicating teams to Operator, Claude Computer Use, and Project Mariner, respectively.

1

0

12

Theta

@trytheta

8 months

Why Now? (1/4) We're seeing new companies launch in the space every week, for both consumer and enterprise use cases. @ManusAI_HQ is one of the most popular generalist consumer agents, and @AthenaIntell is already being used by companies like Anheuser-Busch.

2

1

12

Theta

@trytheta

8 months

Browser agents use computers the same way humans do, unlocking powerful use cases for personal assistants, browsers, and enterprise workflows. After talking to 20+ founders in the space, we're excited to put out the definitive market map for browser agents.

28

87

588

Garry Tan

@garrytan

8 months

The AI labs need better evals and one of my favorite current YC batch companies just released one with a *lot* of headroom

Theta

@trytheta

8 months

Introducing CUB: Humanity's Last Exam for Computer and Browser Use Agents

20

27

374

Theta

@trytheta

8 months

The Theta team started CUB as an internal evalset, but it quickly grew into a full-fledged benchmark over the past month. We're excited to test even more models and frameworks. For more on the benchmark, including examples and a full paper, check out our blog:

1

0

19

Theta

@trytheta

8 months

Computer/browser use agents still have a long way to go for more complex, end-to-end workflows. Actual task completion is far below our reported numbers: we gave credit for partially correct solutions and reaching key checkpoints. In total, there were fewer than 10 instances

1

0

18

Theta

@trytheta

8 months

We worked with domain experts (accountants, investment bankers, doctors, etc.) to create representative tasks of real-world workflows and software tools. We've heard from so many companies in the CUA/browser agent space who are already tackling these workflows, but existing

1

0

18

Theta

@trytheta

8 months

@browser_use took a big hit at 3.78% because it struggled with spreadsheets, but we're confident it would do much better with some improvement in that area. Despite @GoogleAI Gemini 2.5 Pro's strong multimodal performance on other benchmarks, it completely failed at computer use

2

1

22

Theta

@trytheta

8 months

Among the agents we tested, @ManusAI_HQ came out on top at 9.23%, followed by @OpenAI Operator at 7.28% and @AnthropicAI Claude 3.7 Computer Use at 6.01%. We found that Manus' proactive planning and orchestration gave it the edge.

1

24

Gurvir Singh

@_gurvir_

8 months

we've been misled to believe that manual prompt hacking is the solution to teaching LLMs how to approach complex problems. why write a "magic prompt" to pattern match for every type of problem you might care about, when LLMs have already shown extraordinary ability to self-review

Andrej Karpathy

@karpathy

8 months

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it, possibly it has a name - system prompt learning? Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters but a lot of human

3

5

28

Y Combinator

@ycombinator

8 months

Theta (@trytheta) allows AI agents to learn from their mistakes in real time. Their memory layer has already improved the accuracy of OpenAI Operator by 43% with 7x fewer steps taken. https://t.co/9uI9vbSYLs Congrats on the launch, @RayanGarg, @tsha444, and @_gurvir_!

21

44

382