George Tsoukalas Profile
George Tsoukalas

@gtsoukal

Followers: 256
Following: 457
Media: 0
Statuses: 53

PhD student at UT Austin interested in automatic theorem proving.

Joined September 2022
@gtsoukal
George Tsoukalas
11 days
I found that the proofs it generates are much cleaner and more readable than those written by DeepSeekProver, probably because the method is RL-free. I was surprised that it did not solve more problems!
0
0
4
@gtsoukal
George Tsoukalas
11 days
A new second-place method on the PutnamBench leaderboard: DSP+ solves 23/658 problems (pass@128), second only to DeepSeekProverV2, which solves 47 (pass@1024); the pass@k metric is recapped below. You can find DSP+'s writeup here:
2
3
14
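For readers unfamiliar with the pass@k numbers quoted above: they are typically computed with the standard unbiased estimator, where $n$ proof attempts are sampled per problem and $c$ of them are verified correct (the symbols $n$ and $c$ here are mine, not from the post):

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]$$

In the special case $n = k$, this reduces to the fraction of problems with at least one verified proof among the $k$ samples.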
@gtsoukal
George Tsoukalas
1 month
I’m in Cambridge this week for Big Proof 2025 - DM me if you’d like to chat about AI for math!
0
0
16
@gtsoukal
George Tsoukalas
1 month
A technical note: the preprint reports 49 problems solved, but upon review of the proofs we found inaccuracies in the *statements* of two PutnamBench problems. These have now been updated to correctly reflect the (much harder) informal statements.
0
0
2
@gtsoukal
George Tsoukalas
1 month
The model is also open-sourced, with a preprint available here:
1
1
3
@gtsoukal
George Tsoukalas
1 month
DeepSeekProverV2 solves 47/657 problems on PutnamBench! The model represents a substantial advance in theorem proving. The previous best model solved only 10 problems! I'm excited to see DeepSeek's performance on IMO 2025 :)
2
3
28
@gtsoukal
George Tsoukalas
1 month
RT @AmitayushThakur: 1/🧵Excited to share CLEVER — a new benchmark for end-to-end verified code generation in Lean. Can we go from natural l….
0
12
0
@gtsoukal
George Tsoukalas
2 months
Interestingly, I noticed the model frequently includes many Lean `sorry` terms for lower-level subgoals (a sketch of the pattern follows below); I wonder if this is an artifact of alignment. Usually subgoals require more technically rigorous proofs, which Claude may not have felt confident in providing.
0
1
3
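As a concrete sketch of that pattern (illustrative only, not actual Claude output), a Lean 4 proof might lay out the high-level argument but leave each lower-level subgoal as `sorry`:

```lean
import Mathlib

-- Illustrative sketch, not model output: the top-level structure is present,
-- but the lower-level subgoals are discharged with `sorry`.
example (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by
    sorry  -- lower-level subgoal left unproved
  have h2 : 0 ≤ b ^ 2 := by
    sorry  -- lower-level subgoal left unproved
  exact add_nonneg h1 h2
```

Lean accepts such a proof only with warnings; the remaining `sorry`s mean the theorem is not actually established.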
@gtsoukal
George Tsoukalas
2 months
Ran Claude 4 Sonnet on PutnamBench (pass@1, Lean) and it only solved 1 of 657 problems! Does not appear to be an improvement over 3.7 for Lean (in a one-shot setting, at least).
3
0
8
@gtsoukal
George Tsoukalas
3 months
To my OpenAI friends: I'll gladly take API credits so I can run more intensive evals with o4-mini and o3 :)
1
0
8
@gtsoukal
George Tsoukalas
3 months
o4-mini-high solves 2 of 657 problems on PutnamBench (Lean). Also, Kimina-7B achieves the top of our leaderboard with 10 problems solved!
2
16
78
@gtsoukal
George Tsoukalas
3 months
Ran Grok-3-mini, recently released via API, and it solves 0 of 657 problems. Happy to run more intensive evals if folks can spare some compute credits!
0
0
1
@gtsoukal
George Tsoukalas
3 months
A small bug in the code written atop the Lean interaction tool caused a small number of correct proofs to be reported as incorrect. Updated numbers: Gemini 2.0 Flash Thinking & DeepSeek R1: 1/657; Claude 3.7 & o3-mini: 0/657.
0
1
3
@gtsoukal
George Tsoukalas
3 months
Ran DeepSeek V3-0324, and it solves 0 problems.
2
0
2
@gtsoukal
George Tsoukalas
3 months
We ran Gemini 2.5 Pro Experimental and it solved 3 of 657 problems with the same evaluation setup!
3
1
7
@gtsoukal
George Tsoukalas
4 months
Our code for running these experiments is open-source! Installation is super easy, and you can plug in your own models for evaluation. Our evaluation cost roughly $45, split mostly evenly between o3-mini and Sonnet 3.7.
1
0
8
@gtsoukal
George Tsoukalas
4 months
Even the leading Lean-specific models still only solve a handful of problems, demonstrating the difficulty of the benchmark.
1
0
4
@gtsoukal
George Tsoukalas
4 months
Producing a correct Lean proof requires producing a perfect reasoning chain for the problem. Even on the easier problems, we found that the models often confuse Lean syntax, hallucinate lemma names (illustrated below), and can get the entire proof idea wrong!
1
0
9
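To make the hallucinated-lemma failure mode concrete, here is a hypothetical proof attempt (not actual model output); `Real.magic_sq_nonneg` is an invented name, so Lean rejects the proof with an unknown-identifier error, whereas the genuine Mathlib lemma `sq_nonneg` would close the goal:

```lean
import Mathlib

-- Illustrative sketch, not model output: the cited lemma name is invented,
-- so Lean reports "unknown identifier 'Real.magic_sq_nonneg'".
-- The real Mathlib lemma that closes this goal is `sq_nonneg x`.
example (x : ℝ) : 0 ≤ x ^ 2 := by
  exact Real.magic_sq_nonneg x
```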
@gtsoukal
George Tsoukalas
4 months
We evaluated o3-mini, Gemini 2 Flash Thinking, Sonnet 3.7, and DeepSeek R1 on the Lean 4 version of PutnamBench. No model could solve even a single one of the 657 problems!
1
1
14
@gtsoukal
George Tsoukalas
4 months
PutnamBench: A math benchmark where no reasoning model can solve even a single problem! We evaluated leading LRMs on the Lean 4 version 🧵
@gtsoukal
George Tsoukalas
1 year
Announcing PutnamBench: an evaluation benchmark for formal mathematical reasoning in Lean 4, Isabelle, and Coq! PutnamBench consists of problems from the William Lowell Putnam Mathematical Competition, the premier collegiate mathematics exam in the US & Canada (an example statement is sketched below). 🧵
8
8
75
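For readers unfamiliar with the benchmark's format, here is a hypothetical entry written in the style of PutnamBench's Lean 4 statements (not an actual benchmark problem): the competition problem is formalized as a theorem, and the proof is left as `sorry` for a model to complete.

```lean
import Mathlib

-- Hypothetical, PutnamBench-style Lean 4 statement (not a real entry):
-- the formal statement is given; a model must replace `sorry` with a proof.
theorem putnam_example (n : ℕ) :
    (∑ k ∈ Finset.range n, (2 * k + 1)) = n ^ 2 := by
  sorry
```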