Anaïs Howland @AnaisHowland18 X Profile

Anaïs Howland

@AnaisHowland18

Followers

81

Following

765

Media

3

Statuses

196

Founder @ParadigmShiftAI - evals for computer-use agents | ex-Product @Google | ex-PM (CFA)

Joined January 2025

Anaïs Howland

@AnaisHowland18

17 hours

Unsolicited thoughts on GPT-5 after 10 days: I used to run o3 for almost everything in the consumer app, no other model came close for reasoning imo. So gpt-5 didn’t feel like the same kind of leap o3 was over gpt-4o. But it's definitely better: much stronger at coding (a big.

0

2

Anaïs Howland

@AnaisHowland18

5 days

Love seeing a fully open-source & reproducible CUA with data, model, tools, and eval. This is how CUAs get better. Kudos! 👏

Xinyuan Wang

@xywang626

6 days

We are super excited to release OpenCUA — the first from 0 to 1 computer-use agent foundation model framework and open-source SOTA model OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] .📌

0

4

Anaïs Howland

@AnaisHowland18

16 days

this is the kind of benchmark we need. Games give models a rule-bound + dynamic playground to plan, adapt and strategize against opponents. Great step toward evaluating real-world reasoning.

Logan Kilpatrick

@OfficialLoganK

17 days

We built an open source game arena (RL environments) to put frontier models against each other head to head. Games are an area I’m super excited to see Gemini shine more!

0

2

Anaïs Howland

@AnaisHowland18

24 days

WebGuard is a big step for web-agent safety: ~5k human-tagged actions across 193 sites. Frontier LLMs hit <60% on high-risk detection and even after fine-tuning a 7B model, high-risk recall tops out at 76%. Lots of room and urgency for better guardrails before we put agents in.

Yu Su (hiring postdoc)

@ysu_nlp

24 days

Safety is one of the biggest blockers for computer use agents: how can I trust an agent won’t accidentally do something consequential without my permission? We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best

0

3

17

Anaïs Howland

@AnaisHowland18

28 days

We just made it way easier to debug & benchmark web agents. Heatmaps + deep insights now live on @ParadigmShiftAI. If you care about why your web agent fails (not just if), read on 👇

Paradigm Shift AI

@ParadigmShiftAI

28 days

Paradigm Shift AI just supercharged web-agent evals 🚀. We revamped our analytics with deeper agent insights, success heatmaps, variance scores, human baselines, full replay & crash logs and more. See where your agent shines or stumbles all in one place. Want access to the

0

2

8

Anaïs Howland

@AnaisHowland18

29 days

Ran @browser_use on @ParadigmShiftAI to pit Claude 4 Sonnet vs Gemini 2.5 Pro on 10x10 WebVoyager vision tasks. Claude: 99% accuracy & 3× faster ⚡️ Gemini: 75% accuracy 😬 @GoogleDeepMind why the lag? #AI #VisionAI

0

3

12

Anaïs Howland

@AnaisHowland18

1 month

Big few weeks for web & computer-use agents: Comet from @perplexity_ai and ChatGPT Agent from @OpenAI just dropped. Exciting time to be building in that space 🔥

0

3

Anaïs Howland

@AnaisHowland18

1 month

Pretty interesting paper on agents learning new skills on their own: RL loop lets web agents self-train with no human intervention. Sees 10% jump in success on benchmarks. Cool stuff!

arxiv.org

The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the...

0

2

Anaïs Howland

@AnaisHowland18

1 month

Big eval in progress 👀

Paradigm Shift AI

@ParadigmShiftAI

1 month

Track browser-eval progress in real time, episode by episode and right from your dashboard! No more hunting through live logs (unless you still get a kick out of it 😅)

0

1

2

Anaïs Howland

@AnaisHowland18

1 month

If you’re building a web/desktop agent, this is a great hub for the latest research in the space:

webagentlab.notion.site

WebAgentLab is building an open-source community focused on Web Agent. Our community aims to create value through open collaboration, knowledge sharing, and technological innovation, and to collect...

0

4

Anaïs Howland

@AnaisHowland18

1 month

Ran a web-agent evaluation on 5k+ tasks in one pass with @ParadigmShiftAI, our biggest batch yet! Planning to 2x capacity each week and aiming for a 100K-task eval in a few weeks. Stay tuned, more insights coming! 🔥

0

2

5

Anaïs Howland

@AnaisHowland18

1 month

Totally agree, great analysis. That’s why @ParadigmShiftAI delivers richer metrics, deeper failure-trace analytics, and a bigger task bank (proprietary + public) to really stress-test web agents.

Shayne Longpre

@ShayneRedford

1 month

Existing AI Agent benchmarks are broken 🤖💔 Great work by @maxYuxuanZhu and @daniel_d_kang identify + fix issues, and establish rigorous best practices for Agentic AI benchmarks! Check out the blog:

0

1

3

Anaïs Howland

@AnaisHowland18

1 month

RT @ParadigmShiftAI: Thrilled to announce we've been accepted into the @UofBeta Pre-Acceleration Program Cohort 10! Looking forward to conn…

0

2

0

Anaïs Howland

@AnaisHowland18

2 months

Is a Soham application the new @ycombinator stamp of approval for startups? I guess it’s time to open up applications for our team too @ParadigmShiftAI 😅

Mahesh Sathiamoorthy

@madiator

2 months

Looks like Soham applied to Bespoke as well (via a google form we had -- and his CV was uploaded). This is the new badge to carry: if Soham didn't apply you are not a serious startup. :D

2

0

1

Anaïs Howland

@AnaisHowland18

2 months

Building a browser agent? Put it to the test with our NeuroSim eval platform! Check out our new blog post that shows live demos of our analytics and platform. 👇 Want to try it? DM me for an invite and full access.

Paradigm Shift AI

@ParadigmShiftAI

2 months

Introducing NeuroSim, our browser agent evaluation platform! Run real-world evaluations for browser agents + models, see gap-to-human scores, share team leaderboards—free while we iterate with you. Read more 👉 DM or email info@paradigm-shift.ai for

0

4

Anaïs Howland

@AnaisHowland18

2 months

RT @ParadigmShiftAI: o3 just got 80% cheaper (thanks @OpenAI), so we added it. NeuroSim supports o3 + o4-mini, run your browser-use agent e…

0

1

0

Anaïs Howland

@AnaisHowland18

2 months

Check out our agent marketplace AgentHub and publish your agent today!

Paradigm Shift AI

@ParadigmShiftAI

2 months

🚀 Agent Hub v1 is live! The “App Store” for AI agents. Built an agent? Publish one Agent Card today:
✅ appear in a public directory
✅ give devs a ready endpoint + JSON spec
✅ push updates with version tags
Read more → #AIagents #GenerativeAI

0

3

Anaïs Howland

@AnaisHowland18

3 months

Congrats @hcompany_ai on Holo-1 + Surfer H! 92% on WebVoyager and low cost, perfect for stress-testing our computer-use evals. Excited to see companies pushing the bounds of computer-use! 👏 #Agents #ComputerUse

hcompany.ai

The Cost Efficient Web Agent With Open Weights

0

2

Anaïs Howland

@AnaisHowland18

3 months

Attending the AI Engineer World’s Fair in SF this week! Excited for the packed lineup of speakers. Let me know if you’re around and want to connect! #AIEWF #AIEngineer

0

1

4