Chase Brower
@ChaseBrowe32432
software dev, working on AI stuff
Joined June 2023
Gemini 3 Pro (preview) scores 91% on VPCT (spatial reasoning) Uhhhh jesus christ
Example problem; this benchmark tests basic visual physics reasoning. Gemini 3 has ~solved this while Anthropic is still not yet in the game lol
Claude 4.5 Opus scores 40% on VPCT (visual physics) 🗿
You can see the redux problems (format-compatible with the existing vpct-runner) here: https://t.co/yg3n4Pgi3I
huggingface.co
Next up, I have been working on, and will soon release, VPCT-2. VPCT-2 will be accompanied by better tooling, better metrics, and much more difficult/diverse problems (whose "time horizons" will be quantified!)
This is an impressive result from Google. Gemini 3 Pro comes close to entirely solving this benchmark. Importantly, these problems are still very, very easy for a human. I would estimate a "time horizon" of about 3 seconds based on my sampled participants.
Gemini 3's reasoning is generally sensible and relevant to the problem. However, it incorrectly assesses those lines on the right as extending far enough left to cause the ball to land in the 2nd bucket. Even in this failing case, the model is close to solving the problem.
Here, the model reasons:
Based on a step-by-step analysis of the physics simulation in the image:

1. **Initial Drop:** The ball starts at the top center of the simulation. Gravity will pull it straight down.
2. **First Obstacle:** Directly below the ball is a long, slanted
This is a great result! There was no overfit, intentional or otherwise (e.g. through leaking into the general internet pretraining set). Gemini 3 Pro is indeed the strong visual reasoning model that it appears to be.
I tested several top models from OpenAI and Google (avg@5), as well as a baseline GPT-4o Mini, and observed no statistically significant difference in any model's performance on the redux relative to the original.
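A minimal sketch of how an avg@5 score and a significance check of this kind could be computed. This is not the vpct-runner code: the function names, the choice of a two-proportion z-test, and the sample numbers are all assumptions for illustration only. Pooling the five runs as independent trials also ignores per-problem correlation, so treat the p-value as rough.

```python
import math


def avg_at_k(runs):
    """Mean accuracy over k runs; each run is a list of 0/1 per-problem results."""
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-sided two-proportion z-test. Returns (z, p_value)."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts, NOT real benchmark results: correct answers out of
    # 5 runs x 100 problems on the original set vs. the redux set.
    z, p = two_proportion_z(455, 500, 448, 500)
    print(f"z={z:.2f}, p={p:.3f}")
```

A p-value above the chosen threshold (commonly 0.05) would be read as "no statistically significant difference" between the two problem sets for that model.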
First, I produced a new set of 100 problems, VPCT-Redux, which looks slightly different from the original. Background color + horizontal position of the ball are randomized, and the buckets are now labeled. The overall problem difficulty remains unchanged.
VPCT-1 post-mortem! I examine the original benchmark, Gemini 3 Pro's recent score, and what this means for vision tasks. TL;DR: I observe no signs of overfit (very good!).
Some people are unhappy with the AI 2027 title and our AI timelines. Let me quickly clarify: We’re not confident that:
1. AGI will happen in exactly 2027 (2027 is one of the most likely specific years though!)
2. It will take <1 yr to get from AGI to ASI
3. AGIs will definitely
never mind, "if anyone builds it, everyone dies" is a good title
@DKokotajlo most people who hear about your idea will never read the website, never watch an interview. they will assume you are predicting AGI in 2027
And if your default assumption is "people are so retarded that they will never actually read the blogpost"... I'd rather they misunderstand that AGI is coming in 2027 than misunderstand that it's definitely not. https://t.co/LLaYYOqtH1
@DKokotajlo most people who hear about your idea will never read the website, never watch an interview. they will assume you are predicting AGI in 2027
This sort of pikachu-facing over Daniel K's comment is embarrassing and obtuse. The idea that the original claim of AI 2027 was "AGI is going to happen definitely in 2027 and we'll all die if we don't do xyz" cannot possibly come from a sane reading of the blogpost. I have a
@DKokotajlo “Our timelines were longer than 2027 when we published ai 2027” bro what
GPT-5.1-Codex-Max (xhigh) scores 77.9%
not to be confused with:
-GPT-5.1-Codex-Max (high)
-GPT-5.1-Codex (high)
-GPT-5.1 (high)
-GPT-5-Codex (high)
Will do a post-mortem + redux + new version of benchmark soon