Jonathan Roberts @JRobertsAI X Profile

Jonathan Roberts

@JRobertsAI

Followers: 549
Following: 212
Media: 19
Statuses: 85

PhD Student, Applied Machine Learning, University of Cambridge

https://t.co/ve77f4c1Ko

Cambridge

Joined December 2022


Jonathan Roberts

@JRobertsAI

8 months

Is computer vision “solved”? Not yet.
Current models score 0% on ZeroBench 🧵1/6

58

254

3K

Jonathan Roberts

@JRobertsAI

2 months

https://t.co/dWYSl5YaaZ

0

Jonathan Roberts

@JRobertsAI

2 months

🏅 New GAMEBoT GPT-5 vs Gemini 2.5 Pro evaluation results on Connect 4 & Checkers
Leaderboard and battle replay visualisations on the project page 👇

Kai Han

@kaihan_x

2 months

🏆🏆🏆 Clash of the Titans (GPT 5 vs. Gemini 2.5 Pro) on GAMEBoT:
Connect4 --> 11:8
Checkers --> 20:0
#GPT5, #Gemini

1

0

3

Jonathan Roberts

@JRobertsAI

2 months

Benchmark details and full leaderboard 👇 https://t.co/NouEsFxJEM

zerobench.github.io

An Impossible Visual Benchmark for Contemporary Large Multimodal Models

0

3

Jonathan Roberts

@JRobertsAI

2 months

📢 GPT-5 on ZeroBench 📢
GPT-5 (medium reasoning)
pass@1: 1%
pass@5: 7%
5/5: 0%
sub-q pass@1: 26.2%
GPT-5-mini (high)
pass@1: 4%
pass@5: 9%
5/5: 3% 🥇
sub-q pass@1: 27.8%
GPT-5-nano (high)
pass@1: 2%
pass@5: 3%
5/5: 0%
sub-q pass@1: 21.7%
🔥 gpt-5-mini scores new 5/5 SOTA

2

3

10

Jonathan Roberts

@JRobertsAI

2 months

More details and updated leaderboard 👇 https://t.co/E4noN7yDDM

0

1

Jonathan Roberts

@JRobertsAI

2 months

Some Claude Opus 4 ZeroBench improvements:
Claude Opus 4 → 4.1:
pass@1: 1% → 1%
pass@5: 4% → 4%
all@5: 0% → 1% ⬆️
Claude Opus 4 → 4.1 (Thinking):
pass@1: 4% → 5% ⬆️🏆
pass@5: 5% → 8% ⬆️
all@5: 1% → 1%
Opus 4.1 (Thinking) sets pass@1 SOTA ahead of the GPT-5 release 👀

1

0

9

Elliott / Shangzhe Wu

@elliottszwu

2 months

New opening for Assistant Professor in Machine Learning @Cambridge_Eng closing on 22 Sept 2025: https://t.co/7mNgww7Vq3

3

16

115

Samuel Albanie 🇬🇧

@SamuelAlbanie

3 months

We just shipped Gemini 2.5 Deep Think
it doesn't just recall research papers - it fuses ideas across papers in ways I haven't seen before
this level of capability demands careful evaluation
model card below 👇

38

151

2K

Kai Han

@kaihan_x

3 months

#ACL2025NLP Introducing GAMEBoT—a competitive battle arena for LLM reasoning! We pit 17 top LLMs against each other in 8 strategic games. Who will outsmart whom? 🧠⚔️ We break down their reasoning into clear, verifiable steps. No black boxes—just transparent evaluation.

1

7

Jonathan Roberts

@JRobertsAI

3 months

🔍 Dive deeper—leaderboard, sample questions, eval protocol, and more on the project site: 👉

0

1

Jonathan Roberts

@JRobertsAI

3 months

🚀 ZeroBench update: Grok 4
pass@1: 1%
pass@5: 4%
5/5 reliability: 0%
Sub-Q pass@1: 21.6%
📊 A solid showing, but still trailing today’s SOTA:
pass@1: 4% – Claude Opus 4
pass@5: 10% – o4-mini
5/5 reliability: 1% – several models

2

0

15
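
The metric names in these updates (pass@1, pass@5, 5/5 reliability, also written all@5) can be illustrated with a minimal sketch. It assumes that pass@k counts a question as solved when at least one of k sampled responses is correct, and that k/k counts it only when all k are correct; the posts do not spell out the exact protocol, so the project site remains the reference. The function names and the toy outcomes below are hypothetical and are not ZeroBench data.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Percentage of questions with at least one correct response
    among the first k attempts (assumed reading of pass@k)."""
    solved = sum(any(attempts[:k]) for attempts in results.values())
    return 100.0 * solved / len(results)


def all_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Percentage of questions answered correctly on every one of the
    first k attempts (assumed reading of k/k reliability, or all@k)."""
    solved = sum(
        len(attempts) >= k and all(attempts[:k])
        for attempts in results.values()
    )
    return 100.0 * solved / len(results)


# Made-up per-question outcomes for five sampled responses each.
toy_results = {
    "q1": [False, False, True, False, False],
    "q2": [True, True, True, True, True],
    "q3": [False, False, False, False, False],
    "q4": [False, True, False, False, False],
}

print(f"pass@1: {pass_at_k(toy_results, 1):.0f}%")
print(f"pass@5: {pass_at_k(toy_results, 5):.0f}%")
print(f"5/5:    {all_at_k(toy_results, 5):.0f}%")
```

In this sketch pass@1 is simply the first of the sampled attempts; a benchmark may instead score pass@1 from a separate single-response run, which would give a different number.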

Jonathan Roberts

@JRobertsAI

4 months

Thanks to all those who contributed to ZeroBench! https://t.co/E4noN7yDDM

0

2

Jonathan Roberts

@JRobertsAI

4 months

📄 You can read the full Gemini report here ⬇️ https://t.co/i4GVKyr3RZ

1

0

2

Jonathan Roberts

@JRobertsAI

4 months

🎉 Thrilled @GoogleDeepMind included ZeroBench in the Gemini 2.5 technical report as a benchmark for image understanding. Gemini has made impressive gains—it’s great to see our benchmark is still challenging for frontier models!

3

5

22

Jonathan Roberts

@JRobertsAI

5 months

https://t.co/E4noN7yDDM

0

3

Jonathan Roberts

@JRobertsAI

5 months

📢📢 More progress on ZeroBench! With the release of Claude 4 from @AnthropicAI the SOTA pass@1 is now 4% 🔥
Claude Sonnet 3.7: 1%
Claude Sonnet 3.7 (Thinking): 3%
Claude Sonnet 4: 2%
Claude Sonnet 4 (Thinking): 3%
Claude Opus 4: 1%
Claude Opus 4 (Thinking): 4%

1

2

15

Jonathan Roberts

@JRobertsAI

6 months

https://t.co/N9RXkLMnkU

0

Jonathan Roberts

@JRobertsAI

6 months

🇸🇬 Excited to present our work later today at #ICLR2025!
Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?
📍: Hall 3 + Hall 2B #314
📅: Thu 24 Apr 3-5:30 pm

1

0

2

Jonathan Roberts

@JRobertsAI

6 months

🔥Leaderboard:

0

1

Jonathan Roberts

@JRobertsAI

6 months

👏 Some recent ZeroBench pass@1 results:
o3: 3%
Gemini 2.5 Pro: 3%
o4-mini: 2%
Llama 4 Maverick: 0%
GPT-4.1: 0%

4

6

42