Ori Press
@ori_press
Followers 449 · Following 109K · Media 23 · Statuses 133
PhD from @uni_tue. I'm on the industry job market, feel free to reach out! I yearn to deep learn
Joined December 2018
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption, and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here! 🧵⬇️
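A minimal sketch of the scoring idea behind an AlgoTune-style task, with hypothetical stand-ins (the insertion-sort baseline, the `speedup` harness, and the task itself are illustrative, not from the benchmark): a candidate solver must reproduce the reference's output exactly and is scored by how much faster it runs.

```python
import timeit

# Hypothetical stand-ins for one AlgoTune-style task: the reference
# implementation defines the correct output, and a candidate is scored
# by how much faster it produces the *same* output.

def reference_solve(xs):
    # Deliberately slow quadratic insertion sort as the baseline.
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] < x:
            i += 1
        out.insert(i, x)
    return out

def candidate_solve(xs):
    # Candidate submission: use the built-in sort instead.
    return sorted(xs)

def speedup(problem, number=5):
    # Correctness gate first, then a best-of-repeats timing ratio.
    assert candidate_solve(problem) == reference_solve(problem)
    t_ref = min(timeit.repeat(lambda: reference_solve(problem), number=number, repeat=3))
    t_cand = min(timeit.repeat(lambda: candidate_solve(problem), number=number, repeat=3))
    return t_ref / t_cand

print(f"speedup: {speedup(list(range(2000, 0, -1))):.0f}x")
```

The correctness check matters as much as the timing: a faster candidate that produces a different output scores nothing.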
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test." But we code to achieve *goals*: maximize revenue, cut costs, win users. Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
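The multi-round setup described above can be sketched roughly as follows; everything here (`play_match`, the random metric, the agent names) is a hypothetical placeholder rather than CodeClash's actual API:

```python
import itertools
import random

# Hypothetical sketch of a round-robin, multi-round tournament. `play_match`
# is a placeholder for actually running two agents' codebases against each
# other and reading off a goal metric (revenue, users, ...).

def play_match(agent_a, agent_b, rng):
    # Placeholder outcome: each side's goal metric for one round.
    return rng.random(), rng.random()

def tournament(agents, rounds=3, seed=0):
    rng = random.Random(seed)
    scores = {a: 0.0 for a in agents}
    for _ in range(rounds):
        # Every pair of agents meets once per round.
        for a, b in itertools.combinations(agents, 2):
            sa, sb = play_match(a, b, rng)
            scores[a] += sa
            scores[b] += sb
    return scores

print(tournament(["lm_a", "lm_b", "lm_c"]))
```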
Check out the full trajectories here: algotune.io (Can Language Models Speed Up General-Purpose Numerical Programs?)
Claude Sonnet 4.5 manages to score 1.52x on AlgoTune, coming in just ahead of GLM 4.5. 3+ months in, newer models don't seem to be able to make meaningful gains on AlgoTune. Excited to see how this evolves!
This has been a long time coming. One avenue for progress is to have LMs learn in virtual gym environments such as in SWE-gym, SWE-smith, or our new AlgoTune environments. These can be generated autonomously or crafted manually. Lots more to do here!
OpenAI's models are getting too smart for human contractors to teach them new things in certain domains like linguistics. One contractor I spoke with said they're struggling to come up with new tasks GPT-5 can't do. https://t.co/BWBrigqm1V
Amazing work by @AndyLin2001! Check it out:
Excited to announce that I've adapted AlgoTune for both OpenHands and Terminal Bench! It's a fast, unbounded benchmark perfect for evaluating AI agents, offering a great alternative to slower suites like SWE/Kaggle tasks. Check it out: https://t.co/bmwOQUoTxa
#Agent #Benchmark
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
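A sketch of what per-turn switching might look like in a simplified agent loop; `query_model` is a placeholder for a real API call, and the loop structure is illustrative rather than mini-SWE-agent's actual implementation (only the model names come from the tweet):

```python
import random

# Hypothetical sketch: an agent loop that draws a fresh model on every turn.

MODELS = ["gpt-5", "claude-sonnet-4"]

def query_model(model, messages):
    # Placeholder: a real agent would call the provider's API here.
    return f"<{model} reply after {len(messages)} messages>"

def run_episode(task, max_turns=4, seed=0):
    rng = random.Random(seed)
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        model = rng.choice(MODELS)  # fresh draw every turn
        reply = query_model(model, messages)
        messages.append({"role": "assistant", "content": reply})
        # ...execute the proposed action, append the observation, check for exit...
    return messages

print(len(run_episode("fix the failing test")))
```

The interesting bit is that the conversation history is shared while the responder changes, so each model sees (and can correct) the other's moves.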
Introducing OpenEvolve x AlgoTune! Now you can run and benchmark evolutionary coding agents on 100+ algorithm optimization tasks from https://t.co/sPqLhaZyGj
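The core loop of an evolutionary coding agent can be sketched with a toy fitness function standing in for real code mutation plus benchmarking; nothing here is OpenEvolve's actual API.

```python
import random

# Toy (1+1) evolutionary loop of the kind evolutionary coding agents run.
# `fitness` stands in for "benchmark the mutated program"; lower is better.

def fitness(candidate):
    # Toy landscape with its optimum at 42.
    return abs(candidate - 42)

def evolve(generations=5000, seed=1):
    rng = random.Random(seed)
    parent = rng.randint(0, 1000)
    for _ in range(generations):
        child = parent + rng.choice([-5, -1, 1, 5])  # small random mutation
        if fitness(child) <= fitness(parent):        # keep the non-worse variant
            parent = child
    return parent

print(fitness(evolve()))
```

Real systems mutate code (often with an LM as the mutation operator) and use measured runtime as fitness, but the propose-then-keep-the-non-worse-variant skeleton is the same.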
The complete logs for every model are viewable here: algotune.io
Just added Claude Opus 4.1 and gpt-oss-120b to the AlgoTune leaderboard. Excited to see if GPT-5 can break the 2x barrier!
We know that a bunch of teams are working on applying AlphaEvolve to AlgoTune, super excited to see some initial results! This is going to get really interesting.
We just benchmarked Qwen 3 Coder and GLM 4.5 on AlgoTune, and they manage to beat Claude Opus 4! We're excited to see if the models released this week manage to make progress. Also: I just defended my PhD and I'm on the industry job market; my DMs are open :)
Excited to release AlgoTune!! It's a benchmark and coding agent for optimizing the runtime of numerical code: https://t.co/bdR630y0dL https://t.co/vSnV3eUgVs https://t.co/krJ7XDrJFA with @OfirPress @ori_press @PatrickKidger @b_stellato @ArmanZharmagam1 & many others 🧵
AlgoTune is extremely tough, with agents not finding substantial speedups on most tasks. But sometimes these agents do really cool things: here, the agent realized that it could solve this convex optimization problem with a scipy function, leading to an 81x speedup.
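An illustrative analogue of that kind of rewrite, on a toy convex problem rather than the actual AlgoTune task: a hand-rolled gradient-descent loop versus a single library call that lands on the same optimum.

```python
import random
import statistics

# Toy convex problem: min_x sum_i (x - a_i)^2. A hand-rolled iterative solver
# versus one library call, since the closed-form optimum is just the mean.

rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(10_000)]

def solve_slow(data, steps=200, lr=0.25):
    x = 0.0
    n = len(data)
    for _ in range(steps):
        grad = 2.0 * sum(x - a for a in data) / n  # d/dx of the mean squared objective
        x -= lr * grad
    return x

def solve_fast(data):
    return statistics.fmean(data)  # closed-form optimum in one call

print(abs(solve_slow(data) - solve_fast(data)) < 1e-9)
```

The agent's trick in the tweet was the same move one level up: recognize the problem class and hand it to an existing optimized routine (there, from scipy) instead of iterating.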
Thanks to all contributors who submitted tasks, as well as @OfirPress for advising! Read the paper: https://t.co/utmxpu2t2J Check out the code: https://t.co/s1ttWnfo2m (6/6)
github.com
AlgoTune is a NeurIPS 2025 benchmark made up of 154 math, physics, and computer science problems. The goal is to write code that solves each problem and is faster than existing implementations. - ori...
Check out our website, https://t.co/JqD76Du6lp, for agent traces and the code they ended up with for each algo. Our framework lets anyone easily submit tasks they think would be interesting to optimize. (5/6)
The current best overall AlgoTune score is 1.76x, achieved by o4-mini. We think that a score of 100x is possible, as progress should be possible from many angles: rewriting existing Python code in Numba or Cython, implementing existing faster algos, or discovering new ones. (4/6)
We release an agent, AlgoTuner, that enables LMs to optimize code. Using our system, LMs can get feedback on how fast their code is, profile its runtime, and compare their code to the reference implementation. (3/6)
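The timing-and-profiling feedback loop can be sketched like this; `candidate` is a toy stand-in for model-written code, and the harness structure is illustrative, not AlgoTuner's actual implementation:

```python
import cProfile
import io
import pstats
import timeit

# Hypothetical sketch of the feedback such a harness can return to an LM:
# wall-clock timing plus a profile report of where the time goes.

def candidate(n):
    # Toy stand-in for model-written code under optimization.
    return sum(i * i for i in range(n))

def timing_feedback(fn, arg, number=20):
    # Best-of-3 wall-clock time for `number` calls.
    return min(timeit.repeat(lambda: fn(arg), number=number, repeat=3))

def profile_feedback(fn, arg):
    # cProfile report, top 5 entries by cumulative time, as text for the LM.
    prof = cProfile.Profile()
    prof.runcall(fn, arg)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()

print(f"best of 3: {timing_feedback(candidate, 100_000):.4f}s")
print("function calls" in profile_feedback(candidate, 100_000))
```

Feeding this text back to the model each turn is what turns "write fast code" into an iterative optimization loop.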
For each algo, we give Gemini, Claude, o4-mini, and R1 a budget of 1 dollar and have them iteratively develop code. Results are at: https://t.co/aRrbkD3FHi Models sometimes successfully optimize code, but are not currently able to come up with novel algos. (2/6)