George Tsoukalas Profile
George Tsoukalas

@gtsoukal

Followers: 517 · Following: 554 · Media: 3 · Statuses: 89

AlphaEvolve @ DeepMind, AI for Math @ UT Austin.

Joined September 2022
@gtsoukal
George Tsoukalas
10 days
Big thanks to @axiommathai for kindly contributing the formalized Putnam 2025 problem statements to PutnamBench! All 12 statements are now available publicly!
1
3
51
@gtsoukal
George Tsoukalas
15 days
The 2025 Putnam Competition is on Saturday! Excited to see how the new models from AI4Math companies fare on these new, uncontaminated problems! We will be sure to add them to PutnamBench!
0
1
24
@gtsoukal
George Tsoukalas
16 days
Come to our poster on a new benchmark for formally verified code generation, from now until 7:30, at #1411 in Exhibit Hall CDE!
2
8
50
@gtsoukal
George Tsoukalas
16 days
Come to our poster in Exhibit Hall CDE from 11 to 2 today! Poster #1506; the whole team is ready to talk about our work!
0
0
8
@AmitayushThakur
Amitayush Thakur @ NeurIPS 2025
17 days
Will be presenting CLEVER at #NeurIPS2025 (San Diego) on December 3rd at 4:30 PM in Exhibit Hall C D E, poster location 1411. If you are interested in verified code generation, please visit our CLEVER poster.
@AmitayushThakur
Amitayush Thakur @ NeurIPS 2025
7 months
1/🧵Excited to share CLEVER — a new benchmark for end-to-end verified code generation in Lean. Can we go from natural language to a formally verified Lean program? CLEVER puts this to the test. 📄 https://t.co/oXa2iNFJE0 💻 https://t.co/YhW8GDKlZG
1
3
8
@gtsoukal
George Tsoukalas
18 days
New leader on the PutnamBench leaderboard! Getting close to saturation now, next big target will be optimizing cost for the same proving performance! Congrats to the Logical Intelligence team!
@logic_int
Logical Intelligence
18 days
Our Aleph prover agent just hit #1 on PutnamBench, a benchmark built from Putnam problems - one of the hardest college-level math olympiads - fully formalized with machine-checked proofs and no human involvement. Putnam problems are often considered harder than IMO problems and span
0
5
30
@gtsoukal
George Tsoukalas
21 days
In San Diego for #NeurIPS2025 from Dec. 1 to 8. Reach out if you want to chat about AI for math and science!
1
0
10
@gtsoukal
George Tsoukalas
1 month
We'll be presenting it at NeurIPS 2025 in San Diego next month, where it was awarded a spotlight presentation! Happy to chat more about it, please reach out if interested!
1
0
5
@gtsoukal
George Tsoukalas
1 month
Our paper is available at https://t.co/yRV3vlSJCA (and hopefully arxiv soon!). This work was done with my fantastic collaborators Rahul Saha @rah4927, Amitayush Thakur @AmitayushThakur, Sabrina Reguyal, and Swarat Chaudhuri @swarat. It wouldn't have been possible without them!
2
0
7
@gtsoukal
George Tsoukalas
1 month
We carry out all our experiments in Fermat, an open-sourced environment for mathematical theory exploration available here: https://t.co/BrOxfDMAN8.
Link: trishullab/Fermat on github.com
1
1
4
@gtsoukal
George Tsoukalas
1 month
We do our experiments in elementary number theory & finite fields, and find that we can produce interestingness functions better than the base heuristics available in HR. Our approach can recover some interesting concepts in number theory, like primality, but can't yet find FLT.
1
0
2
@gtsoukal
George Tsoukalas
1 month
The numerical reward is attached to the interestingness function and used inside the FunSearch-like loop.
1
0
2
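For readers curious how that numerical reward plugs into the loop, here is a minimal Python sketch of a FunSearch-style search over interestingness programs. The `mutate` and `reward` callables, the tournament selection, and the population size are illustrative assumptions, not the paper's actual algorithm or the Fermat codebase.

```python
import random

def funsearch_like_loop(initial_programs, mutate, reward,
                        iterations=1000, population_size=20):
    """Bare-bones sketch of a FunSearch-style loop (hypothetical):
    keep a small population of candidate interestingness programs,
    mutate promising ones (e.g. via an LLM), and retain the candidates
    with the highest numerical reward."""
    population = [(reward(p), p) for p in initial_programs]
    for _ in range(iterations):
        # Tournament selection: sample a few candidates, pick the best parent.
        sampled = random.sample(population, k=min(3, len(population)))
        parent = max(sampled, key=lambda rp: rp[0])[1]
        child = mutate(parent)                 # e.g. an LLM-proposed code edit
        population.append((reward(child), child))
        # Keep only the highest-reward programs.
        population.sort(key=lambda rp: rp[0], reverse=True)
        population = population[:population_size]
    return population[0][1]                    # best program found
```

Here `reward` would be the trajectory-based evaluation described earlier in the thread: it scores a candidate program by how much of the human-made ground truth the induced policy recovers.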
@gtsoukal
George Tsoukalas
1 month
The idea is that if a system can find many concepts that humans find interesting, it may also be able to find concepts that we haven't considered but that could be very valuable.
1
0
2
@gtsoukal
George Tsoukalas
1 month
We measure the value of an interestingness function by sampling trajectories with the policy, and checking the resulting generated theories against a ground truth set of human-made concepts that forms the interestingness signal.
1
0
2
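A rough sketch of what that evaluation could look like, assuming a Gym-style environment interface; `env.reset`, `env.step`, and the recall-style score are hypothetical simplifications, not the paper's exact metric.

```python
def evaluate_interestingness(interestingness, env, policy, ground_truth,
                             num_trajectories=16, horizon=50):
    """Hypothetical sketch: roll out the policy induced by `interestingness`,
    collect the concepts it generates, and score the function by how many
    ground-truth human-made concepts it recovers (a recall-style reward)."""
    recovered = set()
    for _ in range(num_trajectories):
        state = env.reset()
        for _ in range(horizon):
            action = policy(state, interestingness)
            state, new_concepts = env.step(action)   # assumed interface
            recovered.update(c for c in new_concepts if c in ground_truth)
    return len(recovered) / len(ground_truth)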
@gtsoukal
George Tsoukalas
1 month
Given an interestingness function, we use it to form a policy which explores mathematical theory-space by selecting actions that produce concepts & conjectures, inside an RL environment we call Fermat, which we open-source!
1
0
2
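One way such a policy could be built, sketched with hypothetical helpers (`legal_actions`, `preview`) that are not Fermat's actual API:

```python
def greedy_policy(state, interestingness, legal_actions, preview):
    """Hypothetical greedy policy: among the actions available in the
    current theory state, pick the one whose resulting concept or
    conjecture scores highest under the interestingness function."""
    return max(
        legal_actions(state),
        key=lambda action: interestingness(preview(state, action)),
    )
```

A real exploration policy would likely add stochasticity (e.g. sampling actions in proportion to their scores) rather than acting purely greedily; the point here is just how the learned function steers action selection.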
@gtsoukal
George Tsoukalas
1 month
In our work, we focus on learning how to generate a function which measures the interestingness of mathematical objects. In particular, we view this function as living in the space of programs, and design a FunSearch variant to optimize it.
1
0
2
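To illustrate the "function living in the space of programs" view, here is a toy candidate in Python; the `Concept` record and the rarity-style heuristic are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    """Toy stand-in for a generated concept: a name plus the set of
    integers (below some bound) satisfying its definition."""
    name: str
    examples: frozenset

def interestingness(concept: Concept, universe_size: int = 100) -> float:
    """One candidate 'interestingness program' that a FunSearch-style
    search could propose and later mutate. This toy heuristic rewards
    concepts that are neither empty nor trivially true of everything."""
    frequency = len(concept.examples) / universe_size
    # Score peaks for concepts covering a moderate slice of the universe
    # and vanishes for empty or all-covering concepts.
    return 4.0 * frequency * (1.0 - frequency)
```

Because a candidate like this is just source code, a FunSearch-style search can mutate its body (for example via LLM-proposed edits) and keep the variants that earn a higher downstream reward.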
@gtsoukal
George Tsoukalas
1 month
For example, making a system that can prove Fermat's Last Theorem seems difficult, but what about a system that can *find* its statement? Which tells you that it is interesting? How could that be done?
1
1
3
@gtsoukal
George Tsoukalas
1 month
Systems like HR (2000), AM (1976), and Graffiti (1988) synthesize new concepts & conjectures. A central challenge for them was coming up with concepts that captured what humans found interesting.
1
0
3