
Anaïs Howland
@AnaisHowland18
Followers
81
Following
765
Media
3
Statuses
196
Founder @ParadigmShiftAI - evals for computer-use agents | ex-Product @Google | ex-PM (CFA)
Joined January 2025
Love seeing a fully open-source & reproducible CUA with data, model, tools, and eval. This is how CUAs get better. Kudos! 👏
We are super excited to release OpenCUA — the first from 0 to 1 computer-use agent foundation model framework and open-source SOTA model OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] 📌
0
0
4
This is the kind of benchmark we need. Games give models a rule-bound + dynamic playground to plan, adapt and strategize against opponents. Great step toward evaluating real-world reasoning.
We built an open source game arena (RL environments) to put frontier models against each other head to head. Games are an area I’m super excited to see Gemini shine more!
0
0
2
WebGuard is a big step for web-agent safety: ~5k human-tagged actions across 193 sites. Frontier LLMs hit <60% on high-risk detection, and even after fine-tuning a 7B model, high-risk recall tops out at 76%. Lots of room and urgency for better guardrails before we put agents in.
Safety is one of the biggest blockers for computer use agents: how can I trust an agent won’t accidentally do something consequential without my permission? We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best
0
3
17
We just made it way easier to debug & benchmark web agents. Heatmaps + deep insights now live on @ParadigmShiftAI. If you care about why your web agent fails (not just if), read on 👇
Paradigm Shift AI just supercharged web-agent evals 🚀 We revamped our analytics with deeper agent insights, success heatmaps, variance scores, human baselines, full replay & crash logs and more. See where your agent shines or stumbles all in one place. Want access to the
0
2
8
Ran @browser_use on @ParadigmShiftAI to pit Claude 4 Sonnet vs Gemini 2.5 Pro on 10x10 WebVoyager vision tasks. Claude: 99% accuracy & 3× faster ⚡️. Gemini: 75% accuracy 😬. @GoogleDeepMind why the lag? #AI #VisionAI
0
3
12
Big few weeks for web & computer-use agents: Comet from @perplexity_ai and ChatGPT Agent from @OpenAI just dropped. Exciting time to be building in that space 🔥
0
0
3
Pretty interesting paper on agents learning new skills on their own: an RL loop lets web agents self-train with no human intervention. Sees a 10% jump in success on benchmarks. Cool stuff!
arxiv.org
The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the...
0
0
2
If you’re building a web/desktop agent, this is a great hub for the latest research in the space:
webagentlab.notion.site
WebAgentLab is building an open-source community focused on Web Agent. Our community aims to create value through open collaboration, knowledge sharing, and technological innovation, and to collect...
0
0
4
Ran a web-agent evaluation on 5k+ tasks in one pass with @ParadigmShiftAI, our biggest batch yet! Planning to 2x capacity each week and aiming for a 100K-task eval in a few weeks. Stay tuned, more insights coming! 🔥
0
2
5
Totally agree, great analysis. That’s why @ParadigmShiftAI delivers richer metrics, deeper failure-trace analytics, and a bigger task bank (proprietary + public) to really stress-test web agents.
Existing AI Agent benchmarks are broken 🤖💔 Great work by @maxYuxuanZhu and @daniel_d_kang identify + fix issues, and establish rigorous best practices for Agentic AI benchmarks! Check out the blog:
0
1
3
RT @ParadigmShiftAI: Thrilled to announce we've been accepted into the @UofBeta Pre-Acceleration Program Cohort 10! Looking forward to conn…
0
2
0
Is a Soham application the new @ycombinator stamp of approval for startups? I guess it’s time to open up applications for our team too @ParadigmShiftAI 😅.
Looks like Soham applied to Bespoke as well (via a google form we had -- and his CV was uploaded). This is the new badge to carry: if Soham didn't apply you are not a serious startup. :D
2
0
1
Building a browser agent? Put it to the test with our NeuroSim eval platform! Check out our new blog post that shows live demos of our analytics and platform 👇 Want to try it? DM me for an invite and full access.
Introducing NeuroSim, our browser agent evaluation platform! Run real-world evaluations for browser agents + models, see gap-to-human scores, share team leaderboards—free while we iterate with you. Read more 👉 DM or email info@paradigm-shift.ai for
0
0
4
RT @ParadigmShiftAI: o3 just got 80% cheaper (thanks @OpenAI), so we added it. NeuroSim supports o3 + o4-mini, run your browser-use agent e…
0
1
0
Check out our agent marketplace AgentHub and publish your agent today!
🚀 Agent Hub v1 is live! The “App Store” for AI agents. Built an agent? Publish one Agent Card today: ✅ appear in a public directory ✅ give devs a ready endpoint + JSON spec ✅ push updates with version tags. Read more → #AIagents #GenerativeAI
0
0
3
Congrats @hcompany_ai on Holo-1 + Surfer H! 92% on WebVoyager and low cost, perfect for stress-testing our computer-use evals. Excited to see companies pushing the bounds of computer-use! 👏 #Agents #ComputerUse
hcompany.ai
The Cost Efficient Web Agent With Open Weights
0
0
2
Attending the AI Engineer World’s Fair in SF this week! Excited for the packed lineup of speakers. Let me know if you’re around and want to connect! #AIEWF #AIEngineer
0
1
4