Jonathan Roberts

@JRobertsAI

Followers: 549
Following: 212
Media: 19
Statuses: 85

PhD Student, Applied Machine Learning, University of Cambridge

Cambridge
Joined December 2022
@JRobertsAI
Jonathan Roberts
8 months
Is computer vision “solved”? Not yet. Current models score 0% on ZeroBench 🧵1/6
58
254
3K
@JRobertsAI
Jonathan Roberts
2 months
🏅 New GAMEBoT evaluation results: GPT-5 vs Gemini 2.5 Pro on Connect 4 & Checkers. Leaderboard and battle replay visualisations on the project page 👇
@kaihan_x
Kai Han
2 months
🏆🏆🏆 Clash of the Titans (GPT 5 vs. Gemini 2.5 Pro) on GAMEBoT:
Connect4 --> 11:8
Checkers --> 20:0
#GPT5 #Gemini
1
0
3
@JRobertsAI
Jonathan Roberts
2 months
📢 GPT-5 on ZeroBench 📢
GPT-5 (medium reasoning): pass@1: 1%, pass@5: 7%, 5/5: 0%, sub-q pass@1: 26.2%
GPT-5-mini (high): pass@1: 4%, pass@5: 9%, 5/5: 3% 🥇, sub-q pass@1: 27.8%
GPT-5-nano (high): pass@1: 2%, pass@5: 3%, 5/5: 0%, sub-q pass@1: 21.7%
🔥 GPT-5-mini scores new 5/5 SOTA
2
3
10
@JRobertsAI
Jonathan Roberts
2 months
More details and updated leaderboard 👇 https://t.co/E4noN7yDDM
0
0
1
@JRobertsAI
Jonathan Roberts
2 months
Some Claude Opus 4 ZeroBench improvements:
Claude Opus 4 → 4.1: pass@1: 1% → 1%, pass@5: 4% → 4%, all@5: 0% → 1% ⬆️
Claude Opus 4 → 4.1 (Thinking): pass@1: 4% → 5% ⬆️🏆, pass@5: 5% → 8% ⬆️, all@5: 1% → 1%
Opus 4.1 (Thinking) sets pass@1 SOTA ahead of the GPT-5 release 👀
1
0
9
@elliottszwu
Elliott / Shangzhe Wu
2 months
New opening for Assistant Professor in Machine Learning @Cambridge_Eng closing on 22 Sept 2025: https://t.co/7mNgww7Vq3
3
16
115
@SamuelAlbanie
Samuel Albanie 🇬🇧
3 months
We just shipped Gemini 2.5 Deep Think. It doesn't just recall research papers - it fuses ideas across papers in ways I haven't seen before. This level of capability demands careful evaluation. Model card below 👇
38
151
2K
@kaihan_x
Kai Han
3 months
#ACL2025NLP Introducing GAMEBoT—a competitive battle arena for LLM reasoning! We pit 17 top LLMs against each other in 8 strategic games. Who will outsmart whom? 🧠⚔️ We break down their reasoning into clear, verifiable steps. No black boxes—just transparent evaluation.
1
1
7
@JRobertsAI
Jonathan Roberts
3 months
🔍 Dive deeper—leaderboard, sample questions, eval protocol, and more on the project site: 👉
0
0
1
@JRobertsAI
Jonathan Roberts
3 months
🚀 ZeroBench update: Grok 4
pass@1: 1%
pass@5: 4%
5/5 reliability: 0%
Sub-Q pass@1: 21.6%
📊 A solid showing, but still trailing today’s SOTA:
pass@1: 4% – Claude Opus 4
pass@5: 10% – o4-mini
5/5 reliability: 1% – several models
2
0
15
@JRobertsAI
Jonathan Roberts
4 months
Thanks to all those who contributed to ZeroBench! https://t.co/E4noN7yDDM
0
0
2
@JRobertsAI
Jonathan Roberts
4 months
📄 You can read the full Gemini report here ⬇️ https://t.co/i4GVKyr3RZ
1
0
2
@JRobertsAI
Jonathan Roberts
4 months
🎉 Thrilled @GoogleDeepMind included ZeroBench in the Gemini 2.5 technical report as a benchmark for image understanding. Gemini has made impressive gains—it’s great to see our benchmark is still challenging for frontier models!
3
5
22
@JRobertsAI
Jonathan Roberts
5 months
📢📢 More progress on ZeroBench! With the release of Claude 4 from @AnthropicAI, the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%
1
2
15
@JRobertsAI
Jonathan Roberts
6 months
🇸🇬 Excited to present our work later today at #ICLR2025!
Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?
📍: Hall 3 + Hall 2B #314
📅: Thu 24 Apr 3-5:30 pm
1
0
2
@JRobertsAI
Jonathan Roberts
6 months
🔥Leaderboard:
0
0
1
@JRobertsAI
Jonathan Roberts
6 months
👏 Some recent ZeroBench pass@1 results:
o3: 3%
Gemini 2.5 Pro: 3%
o4-mini: 2%
Llama 4 Maverick: 0%
GPT-4.1: 0%
4
6
42