George Tsoukalas Profile
George Tsoukalas

@gtsoukal

Followers: 256
Following: 457
Media: 0
Statuses: 53

PhD student at UT Austin interested in automatic theorem proving.

Joined September 2022
@gtsoukal
George Tsoukalas
11 days
I found that the proofs it generates are much cleaner and more readable than those written by DeepSeekProver, probably because the method is RL-free. I was surprised that it did not solve more problems!
0
0
4
@gtsoukal
George Tsoukalas
11 days
A new second-place method on the PutnamBench leaderboard: DSP+ solves 23/658 problems (pass@128), second only to DeepSeekProverV2, which solves 47 (pass@1024); the pass@k metric is recapped below. You can find DSP+'s writeup here:
2
3
14
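For readers unfamiliar with the pass@k numbers quoted above: they are typically computed with the standard unbiased estimator, where $n$ proof attempts are sampled per problem and $c$ of them are verified correct (the symbols $n$ and $c$ here are mine, not from the post):

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]$$

In the special case $n = k$, this reduces to the fraction of problems with at least one verified proof among the $k$ samples.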
@gtsoukal
George Tsoukalas
1 month
I’m in Cambridge this week for Big Proof 2025 - DM me if you’d like to chat about AI for math!
0
0
16
@gtsoukal
George Tsoukalas
1 month
A technical note: the preprint reports 49 problems solved, but upon review of the proofs we found inaccuracies in the *statements* of two PutnamBench problems. These have now been updated to correctly reflect the (much harder) informal statements.
0
0
2
@gtsoukal
George Tsoukalas
1 month
The model is also open-sourced, with a preprint available here:
1
1
3
@gtsoukal
George Tsoukalas
1 month
DeepSeekProverV2 solves 47/657 problems on PutnamBench! The model represents a substantial advance in theorem proving. The previous best model solved only 10 problems! I'm excited to see DeepSeek's performance on IMO 2025 :)
2
3
28
@gtsoukal
George Tsoukalas
1 month
RT @AmitayushThakur: 1/🧵Excited to share CLEVER — a new benchmark for end-to-end verified code generation in Lean. Can we go from natural l….
0
12
0
@gtsoukal
George Tsoukalas
2 months
Interestingly, I noticed the model frequently includes many Lean `sorry` terms for lower-level subgoals (a sketch of the pattern follows below); I wonder if this is an artifact of alignment. Usually subgoals require more technically rigorous proofs, which Claude may not have felt confident in providing.
0
1
3
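As a concrete sketch of that pattern (illustrative only, not actual Claude output), a Lean 4 proof might lay out the high-level argument but leave each lower-level subgoal as `sorry`:

```lean
import Mathlib

-- Illustrative sketch, not model output: the top-level structure is present,
-- but the lower-level subgoals are discharged with `sorry`.
example (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by
    sorry  -- lower-level subgoal left unproved
  have h2 : 0 ≤ b ^ 2 := by
    sorry  -- lower-level subgoal left unproved
  exact add_nonneg h1 h2
```

Lean accepts such a proof only with warnings; the remaining `sorry`s mean the theorem is not actually established.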
@gtsoukal
George Tsoukalas
2 months
Ran Claude 4 Sonnet on PutnamBench (pass@1, Lean) and it only solved 1 of 657 problems! Does not appear to be an improvement over 3.7 for Lean (in a one-shot setting, at least).
3
0
8
@gtsoukal
George Tsoukalas
3 months
To my OpenAI friends: I'll gladly take API credits so I can run more intensive evals with o4-mini and o3 :)
1
0
8
@gtsoukal
George Tsoukalas
3 months
o4-mini-high solves 2 of 657 problems on PutnamBench (Lean). Also, Kimina-7B achieves the top of our leaderboard with 10 problems solved!
2
16
78
@gtsoukal
George Tsoukalas
3 months
Ran Grok-3-mini, recently released via API, and it solves 0 of 657 problems. Happy to run more intensive evals if folks can spare some compute credits!
0
0
1
@gtsoukal
George Tsoukalas
3 months
A small bug in the code written atop the Lean interaction tool caused a small number of correct proofs to be reported as incorrect. Updated numbers: Gemini 2.0 Flash Thinking & DeepSeek R1: 1/657; Claude 3.7 & o3-mini: 0/657.
0
1
3
@gtsoukal
George Tsoukalas
3 months
Ran DeepSeek V3-0324, and it solves 0 problems.
2
0
2
@gtsoukal
George Tsoukalas
3 months
We ran Gemini 2.5 Pro Experimental and it solved 3 of 657 problems with the same evaluation setup!
3
1
7
@gtsoukal
George Tsoukalas
4 months
Our code for running these experiments is open-source! Installation is super easy, and you can plug in your own models for evaluation. Our evaluation cost roughly $45, split mostly evenly between o3-mini and Sonnet 3.7.
1
0
8
@gtsoukal
George Tsoukalas
4 months
Even the leading Lean-specific models still only solve a handful of problems, demonstrating the difficulty of the benchmark.
1
0
4
@gtsoukal
George Tsoukalas
4 months
Producing a correct Lean proof requires producing a perfect reasoning chain for the problem. Even on the easier problems, we found that the models often confuse Lean syntax, hallucinate lemma names (illustrated below), and can get the entire proof idea wrong!
1
0
9
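To make the hallucinated-lemma failure mode concrete, here is a hypothetical proof attempt (not actual model output); `Real.magic_sq_nonneg` is an invented name, so Lean rejects the proof with an unknown-identifier error, whereas the genuine Mathlib lemma `sq_nonneg` would close the goal:

```lean
import Mathlib

-- Illustrative sketch, not model output: the cited lemma name is invented,
-- so Lean reports "unknown identifier 'Real.magic_sq_nonneg'".
-- The real Mathlib lemma that closes this goal is `sq_nonneg x`.
example (x : ℝ) : 0 ≤ x ^ 2 := by
  exact Real.magic_sq_nonneg x
```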
@gtsoukal
George Tsoukalas
4 months
We evaluated o3-mini, Gemini 2 Flash Thinking, Sonnet 3.7, and DeepSeek R1 on the Lean 4 version of PutnamBench. No model could solve even a single one of the 657 problems!
1
1
14
@gtsoukal
George Tsoukalas
4 months
PutnamBench: A math benchmark where no reasoning model can solve even a single problem! We evaluated leading LRMs on the Lean 4 version 🧵
@gtsoukal
George Tsoukalas
1 year
Announcing PutnamBench: an evaluation benchmark for formal mathematical reasoning in Lean 4, Isabelle, and Coq! PutnamBench consists of problems from the William Lowell Putnam Mathematical Competition, the premier collegiate mathematics exam in the US & Canada (an example statement is sketched below). 🧵
8
8
75
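For readers unfamiliar with the benchmark's format, here is a hypothetical entry written in the style of PutnamBench's Lean 4 statements (not an actual benchmark problem): the competition problem is formalized as a theorem, and the proof is left as `sorry` for a model to complete.

```lean
import Mathlib

-- Hypothetical, PutnamBench-style Lean 4 statement (not a real entry):
-- the formal statement is given; a model must replace `sorry` with a proof.
theorem putnam_example (n : ℕ) :
    (∑ k ∈ Finset.range n, (2 * k + 1)) = n ^ 2 := by
  sorry
```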