Charly Wargnier
@DataChaz
Followers: 144K · Following: 91K · Media: 6K · Statuses: 26K
Ex @Streamlit @Snowflake Maestro 🪄 • X about AI agents, LLMs, web apps, Python & SEO • My ❤️ is open source • DM for collabs 📩
London 🇬🇧 ⇆ 🇫🇷 Pyrenees
Joined January 2009
Gemini 3 just launched, and @Browserbase's already run full computer-use evaluations to see how well it handles a real browser. Clicking, searching, filling forms: they tested it with real browsing tasks 🤘 Here’s how Gemini 3 stacks up against Claude, GPT-5, and others 🧵↓
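For context on what "real browsing tasks" means here: Stagehand drives an actual browser with natural-language steps. Below is a minimal sketch of such a task, assuming the Stagehand TypeScript API as documented at stagehand.dev; the model id, URL, and instructions are illustrative and not taken from Browserbase's eval suite.

```typescript
// Minimal sketch of a Stagehand-style browser task (model id and URL are illustrative).
import { Stagehand } from "@browserbasehq/stagehand";

async function runTask() {
  // env: "BROWSERBASE" runs the browser in Browserbase's cloud; "LOCAL" uses a local browser.
  const stagehand = new Stagehand({
    env: "BROWSERBASE",
    modelName: "google/gemini-2.0-flash", // placeholder model id, not from the thread
  });
  await stagehand.init();

  const page = stagehand.page;
  await page.goto("https://www.google.com");

  // Natural-language actions: the model decides which element to click or type into.
  await page.act("search for 'best hiking trails in the Pyrenees'");
  await page.act("click the first search result");

  // Pull structured information back out of the resulting page.
  const summary = await page.extract("summarize the page title in one sentence");
  console.log(summary);

  await stagehand.close();
}

runTask().catch(console.error);
```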
If you found this useful, a like or RT goes a long way! 🦾 Follow me → @datachaz for insights on LLMs, AI agents, and data science!
5/ That's a wrap! You can check the full list of results here: →
stagehand.dev
Compare accuracy, costs, and speed for Computer Use Models on Web Voyager and Online Mind2Web benchmarks.
4/ So... Gemini takes 1st place across all 3 fronts: → accuracy, cost per task, and speed. Claude Sonnet 4 comes 2nd with solid results, and Claude 4.5 follows close behind. A clean sweep for Gemini!! 🏆
3/ SPEED (lower is better) Gemini doesn’t just win on accuracy and cost. It’s also the fastest model to complete real browser tasks. Browserbase’s benchmarks show an average of ~223s per task, well ahead of Claude 4, Claude 4.5, and GPT-5 ↓
2/ COST Gemini is also the most cost-efficient model in Browserbase’s Stagehand evals. Around $0.18 per task, far below Claude 4, Claude 4.5, and the OpenAI model 💰💰💰
1/ ACCURACY In Browserbase’s @Stagehanddev tests, Gemini tops the accuracy charts at ~66%. It outperforms Claude 4, Claude 4.5, and the OpenAI model evaluated.
But first, I just wanted to say these benchmarks are on another level: → ~4,000 browser hours (!!) → 200+ runs → All parallelized in Browserbase! I tend to be skeptical of leaderboards, but this one is grounded in data. More on their methodology: → https://t.co/7DG2F8kN5H
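To give a feel for what parallelizing 200+ runs looks like, here's a hypothetical sketch of fanning tasks out across concurrent Browserbase sessions with Stagehand. The runSingleTask/runBenchmark helpers, start URL, and concurrency value are made up for illustration; this is not Browserbase's actual eval harness or methodology.

```typescript
// Hypothetical sketch: run many browser tasks concurrently, each in its own cloud session.
import { Stagehand } from "@browserbasehq/stagehand";

type TaskResult = { task: string; success: boolean; seconds: number };

async function runSingleTask(task: string): Promise<TaskResult> {
  const stagehand = new Stagehand({ env: "BROWSERBASE" }); // one Browserbase session per task
  await stagehand.init();
  const start = Date.now();
  try {
    await stagehand.page.goto("https://example.com"); // illustrative start page
    await stagehand.page.act(task);
    return { task, success: true, seconds: (Date.now() - start) / 1000 };
  } catch {
    return { task, success: false, seconds: (Date.now() - start) / 1000 };
  } finally {
    await stagehand.close();
  }
}

async function runBenchmark(tasks: string[], concurrency = 10) {
  const results: TaskResult[] = [];
  // Simple batching: launch `concurrency` sessions at a time and wait for each batch.
  for (let i = 0; i < tasks.length; i += concurrency) {
    const batch = tasks.slice(i, i + concurrency).map(runSingleTask);
    results.push(...(await Promise.all(batch)));
  }
  const accuracy = results.filter((r) => r.success).length / results.length;
  const avgSeconds = results.reduce((sum, r) => sum + r.seconds, 0) / results.length;
  console.log({ accuracy, avgSeconds });
}

// Example usage: runBenchmark(["search for X and open the first result", ...], 25);
```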
If you found this useful, a like or RT goes a long way! Follow me → @datachaz for daily insights on LLMs, AI agents, and data science
♻️ If this sparked an idea, hit repost so others can catch it too! Follow me → @datachaz for daily drops on LLMs, agents, and data workflows! 🦾
This one’s a gem: a free 80-page prompt engineering guide that's surprisingly deep, covering: → CoT → Eval methods → RAG → Agents → Prompt hacking → Multimodal prompts ... and more! Link to the guide in 🧵 ↓
I'm cool with that, as long as we don’t end up in another round of OpenAI-style naming madness 😅
My Italian friends say that we gotta start adding "al dente" in our prompts dang 🤌