Jonathan Roberts

@JRobertsAI

Followers
554
Following
205
Media
19
Statuses
85

PhD Student, Applied Machine Learning, University of Cambridge

Cambridge
Joined December 2022
@JRobertsAI
Jonathan Roberts
6 months
Is computer vision “solved”? Not yet. Current models score 0% on ZeroBench. 🧵1/6
58
256
3K
@JRobertsAI
Jonathan Roberts
13 days
🏅 New GAMEBoT evaluation results: GPT-5 vs Gemini 2.5 Pro on Connect 4 & Checkers. Leaderboard and battle replay visualisations are on the project page 👇
@kaihan_x
Kai Han
14 days
🏆🏆🏆 Clash of the Titans (GPT 5 vs. Gemini 2.5 Pro) on GAMEBoT:
Connect4 --> 11:8
Checkers --> 20:0
#GPT5, #Gemini
1
1
4
@JRobertsAI
Jonathan Roberts
22 days
Benchmark details and full leaderboard 👇
zerobench.github.io
An Impossible Visual Benchmark for Contemporary Large Multimodal Models
0
0
3
@JRobertsAI
Jonathan Roberts
22 days
📢 GPT-5 on ZeroBench 📢
GPT-5 (medium reasoning)
pass@1: 1%
pass@5: 7%
5/5: 0%
sub-q pass@1: 26.2%
GPT-5-mini (high)
pass@1: 4%
pass@5: 9%
5/5: 3% 🥇
sub-q pass@1: 27.8%
GPT-5-nano (high)
pass@1: 2%
pass@5: 3%
5/5: 0%
sub-q pass@1: 21.7%
🔥 gpt-5-mini scores new 5/5 SOTA.
2
3
10
@JRobertsAI
Jonathan Roberts
24 days
More details and updated leaderboard 👇
0
0
1
@JRobertsAI
Jonathan Roberts
24 days
Some Claude Opus 4 ZeroBench improvements:
Claude Opus 4 → 4.1:
pass@1: 1% → 1%
pass@5: 4% → 4%
all@5: 0% → 1% ⬆️
Claude Opus 4 → 4.1 (Thinking):
pass@1: 4% → 5% ⬆️🏆
pass@5: 5% → 8% ⬆️
all@5: 1% → 1%
Opus 4.1 (Thinking) sets pass@1 SOTA ahead of the GPT-5 release 👀
1
0
9
@JRobertsAI
Jonathan Roberts
25 days
RT @elliottszwu: New opening for Assistant Professor in Machine Learning @Cambridge_Eng closing on 22 Sept 2025: ht…
0
16
0
@JRobertsAI
Jonathan Roberts
29 days
RT @SamuelAlbanie: We just shipped Gemini 2.5 Deep Think. It doesn't just recall research papers - it fuses ideas across papers in ways I h…
0
155
0
@JRobertsAI
Jonathan Roberts
1 month
RT @kaihan_x: #ACL2025NLP Introducing GAMEBoT, a competitive battle arena for LLM reasoning! We pit 17 top LLMs against each other in 8 str…
0
1
0
@JRobertsAI
Jonathan Roberts
1 month
🔍 Dive deeper: leaderboard, sample questions, eval protocol, and more on the project site 👉
0
0
1
@JRobertsAI
Jonathan Roberts
1 month
🚀 ZeroBench update:
Grok 4
pass@1: 1%
pass@5: 4%
5/5 reliability: 0%
Sub-Q pass@1: 21.6%
📊 A solid showing, but still trailing today’s SOTA:
pass@1: 4% – Claude Opus 4
pass@5: 10% – o4-mini
5/5 reliability: 1% – several models
2
0
15
@JRobertsAI
Jonathan Roberts
2 months
Thanks to all those who contributed to ZeroBench!
0
0
2
@JRobertsAI
Jonathan Roberts
2 months
📄 You can read the full Gemini report here ⬇️
1
0
2
@JRobertsAI
Jonathan Roberts
2 months
🎉 Thrilled @GoogleDeepMind included ZeroBench in the Gemini 2.5 technical report as a benchmark for image understanding. Gemini has made impressive gains—it’s great to see our benchmark is still challenging for frontier models!
3
5
22
@JRobertsAI
Jonathan Roberts
3 months
📢📢 More progress on ZeroBench!
With the release of Claude 4 from @AnthropicAI the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%
1
2
15
@JRobertsAI
Jonathan Roberts
4 months
🇸🇬 Excited to present our work later today at #ICLR2025!
Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?
📍: Hall 3 + Hall 2B #314
📅: Thu 24 Apr 3-5:30 pm
1
0
2
@JRobertsAI
Jonathan Roberts
5 months
🔥Leaderboard:
0
0
1
@JRobertsAI
Jonathan Roberts
5 months
👏 Some recent ZeroBench pass@1 results:
o3: 3%
Gemini 2.5 Pro: 3%
o4-mini: 2%
Llama 4 Maverick: 0%
GPT-4.1: 0%
4
6
43