Anaïs Howland Profile
Anaïs Howland

@AnaisHowland18

Followers: 81
Following: 765
Media: 3
Statuses: 196

Founder @ParadigmShiftAI - evals for computer-use agents | ex-Product @Google | ex-PM (CFA)

Joined January 2025
@AnaisHowland18
Anaïs Howland
17 hours
Unsolicited thoughts on GPT-5 after 10 days: I used to run o3 for almost everything in the consumer app; no other model came close for reasoning imo. So GPT-5 didn’t feel like the same kind of leap o3 was over GPT-4o. But it's definitely better: much stronger at coding (a big.
0
0
2
@AnaisHowland18
Anaïs Howland
5 days
Love seeing a fully open-source & reproducible CUA with data, model, tools, and eval. This is how CUAs get better. Kudos! 👏
@xywang626
Xinyuan Wang
6 days
We are super excited to release OpenCUA, the first 0-to-1 computer-use agent foundation model framework, and the open-source SOTA model OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] 📌
0
0
4
@AnaisHowland18
Anaïs Howland
16 days
This is the kind of benchmark we need. Games give models a rule-bound + dynamic playground to plan, adapt, and strategize against opponents. Great step toward evaluating real-world reasoning.
@OfficialLoganK
Logan Kilpatrick
17 days
We built an open source game arena (RL environments) to put frontier models against each other head to head. Games are an area I’m super excited to see Gemini shine more!
0
0
2
@AnaisHowland18
Anaïs Howland
24 days
WebGuard is a big step for web-agent safety: ~5k human-tagged actions across 193 sites. Frontier LLMs hit <60% on high-risk detection, and even after fine-tuning a 7B model, high-risk recall tops out at 76%. Lots of room and urgency for better guardrails before we put agents in.
@ysu_nlp
Yu Su (hiring postdoc)
24 days
Safety is one of the biggest blockers for computer use agents: how can I trust an agent won’t accidentally do something consequential without my permission? We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best
0
3
17
@AnaisHowland18
Anaïs Howland
28 days
We just made it way easier to debug & benchmark web agents. Heatmaps + deep insights now live on @ParadigmShiftAI. If you care about why your web agent fails (not just if), read on 👇
@ParadigmShiftAI
Paradigm Shift AI
28 days
Paradigm Shift AI just supercharged web-agent evals 🚀. We revamped our analytics with deeper agent insights, success heatmaps, variance scores, human baselines, full replay & crash logs and more. See where your agent shines or stumbles all in one place. Want access to the
0
2
8
@AnaisHowland18
Anaïs Howland
29 days
Ran @browser_use on @ParadigmShiftAI to pit Claude 4 Sonnet vs Gemini 2.5 Pro on 10x10 WebVoyager vision tasks. Claude: 99% accuracy & 3× faster ⚡️ Gemini: 75% accuracy 😬 @GoogleDeepMind why the lag? #AI #VisionAI
0
3
12
@AnaisHowland18
Anaïs Howland
1 month
Big few weeks for web & computer-use agents: Comet from @perplexity_ai and ChatGPT Agent from @OpenAI just dropped. Exciting time to be building in that space 🔥
0
0
3
@AnaisHowland18
Anaïs Howland
1 month
Pretty interesting paper on agents learning new skills on their own: an RL loop lets web agents self-train with no human intervention. It sees a 10% jump in success on benchmarks. Cool stuff!
arxiv.org
The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the...
0
0
2
@AnaisHowland18
Anaïs Howland
1 month
Big eval in progress 👀
@ParadigmShiftAI
Paradigm Shift AI
1 month
Track browser-eval progress in real time, episode by episode and right from your dashboard! No more hunting through live logs (unless you still get a kick out of it 😅)
0
1
2
@AnaisHowland18
Anaïs Howland
1 month
Ran a web-agent evaluation on 5k+ tasks in one pass with @ParadigmShiftAI, our biggest batch yet! Planning to 2x capacity each week and aiming for a 100K-task eval in a few weeks. Stay tuned, more insights coming! 🔥
0
2
5
@AnaisHowland18
Anaïs Howland
1 month
Totally agree, great analysis. That’s why @ParadigmShiftAI delivers richer metrics, deeper failure-trace analytics, and a bigger task bank (proprietary + public) to really stress-test web agents.
@ShayneRedford
Shayne Longpre
1 month
Existing AI Agent benchmarks are broken 🤖💔 Great work by @maxYuxuanZhu and @daniel_d_kang to identify + fix issues and establish rigorous best practices for agentic AI benchmarks! Check out the blog:
0
1
3
@AnaisHowland18
Anaïs Howland
1 month
RT @ParadigmShiftAI: Thrilled to announce we've been accepted into the @UofBeta Pre-Acceleration Program Cohort 10! Looking forward to conn…
0
2
0
@AnaisHowland18
Anaïs Howland
2 months
Is a Soham application the new @ycombinator stamp of approval for startups? I guess it’s time to open up applications for our team too @ParadigmShiftAI 😅
@madiator
Mahesh Sathiamoorthy
2 months
Looks like Soham applied to Bespoke as well (via a google form we had -- and his CV was uploaded). This is the new badge to carry: if Soham didn't apply you are not a serious startup. :D
2
0
1
@AnaisHowland18
Anaïs Howland
2 months
Building a browser agent? Put it to the test with our NeuroSim eval platform! Check out our new blog post, which shows live demos of our analytics and platform 👇 Want to try it? DM me for an invite and full access.
@ParadigmShiftAI
Paradigm Shift AI
2 months
Introducing NeuroSim, our browser agent evaluation platform! Run real-world evaluations for browser agents + models, see gap-to-human scores, share team leaderboards, all free while we iterate with you. Read more 👉 DM or email info@paradigm-shift.ai for
0
0
4
@AnaisHowland18
Anaïs Howland
2 months
RT @ParadigmShiftAI: o3 just got 80% cheaper (thanks @OpenAI), so we added it. NeuroSim supports o3 + o4-mini, run your browser-use agent e…
0
1
0
@AnaisHowland18
Anaïs Howland
2 months
Check out our agent marketplace AgentHub and publish your agent today!
@ParadigmShiftAI
Paradigm Shift AI
2 months
🚀 Agent Hub v1 is live! The “App Store” for AI agents. Built an agent? Publish one Agent Card today: ✅ appear in a public directory ✅ give devs a ready endpoint + JSON spec ✅ push updates with version tags. Read more → #AIagents #GenerativeAI
0
0
3
@AnaisHowland18
Anaïs Howland
3 months
Congrats @hcompany_ai on Holo-1 + Surfer H! 92% on WebVoyager at low cost, perfect for stress-testing our computer-use evals. Excited to see companies pushing the bounds of computer use! 👏 #Agents #ComputerUse
hcompany.ai
The Cost Efficient Web Agent With Open Weights
0
0
2
@AnaisHowland18
Anaïs Howland
3 months
Attending the AI Engineer World’s Fair in SF this week! Excited for the packed lineup of speakers. Let me know if you’re around and want to connect! #AIEWF #AIEngineer
0
1
4