Kilian Lieret Profile
Kilian Lieret

@KLieret

Followers 876 · Following 37 · Media 27 · Statuses 105

Research Software Engineer at Princeton University. AI agents & benchmarks for software engineering.

Princeton
Joined May 2021
@KLieret
Kilian Lieret
1 month
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and it gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
12
73
791
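The "100 lines, bash only" agent loop described above can be sketched roughly like this. This is a hedged illustration, not the actual mini-swe-agent code: `minimal_agent`, the `DONE` stop token, and the `model(messages) -> str` interface are all hypothetical names for the sketch.

```python
import subprocess

def minimal_agent(model, task, max_steps=50):
    """Bare-bones agent loop: the LM sees the task plus prior shell output
    and replies with one bash command per turn. `model` is any callable
    messages -> str. All names here are hypothetical, not the real API."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        command = model(messages)
        if command.strip() == "DONE":  # model signals it is finished
            break
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user",
                         "content": result.stdout + result.stderr})
    return messages

# Stub model so the sketch runs without an API key: one command, then stop.
def stub_model(messages):
    return "echo hello" if len(messages) == 1 else "DONE"

transcript = minimal_agent(stub_model, "Say hello via the shell.")
```

The point of the design is that the whole "tool set" is the shell itself, so any LM that can emit bash commands is compatible with the loop.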
@KLieret
Kilian Lieret
13 days
You can find lots of other models evaluated under the same settings on the bash-only leaderboard, and our agent implementation at
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores 68% on SWE-bench verified! - SWE-agent/mini-swe-agent
0
0
7
@KLieret
Kilian Lieret
13 days
The effective cost per instance comes somewhat close to gpt-5-mini. Will have a more thorough comparison soon.
1
0
7
@KLieret
Kilian Lieret
13 days
Evaluating on the 500 SWE-bench verified instances cost around $18. In terms of the steps taken to solve a problem, DeepSeek v3.1 chat maxes out later than other models.
1
1
7
@KLieret
Kilian Lieret
13 days
This is evaluated with mini-swe-agent (common-sense prompts, no tools other than bash, some 100 lines of code for the agent class). We're still working on evaluating some other open source models (including GLM).
github.com/SWE-agent/mini-swe-agent
1
1
10
@KLieret
Kilian Lieret
13 days
DeepSeek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (flattens out after some 125 steps). As a result, the effective cost is somewhere near GPT-5 mini. Details in 🧵
8
21
157
@KLieret
Kilian Lieret
15 days
Small correction: GPT-5 bar chart should read 65.0%, not 65.2%, sorry (fixed in the blog). So the improvement is actually ever so slightly bigger.
0
1
10
@KLieret
Kilian Lieret
15 days
SWE-bench blog:
0
0
6
@KLieret
Kilian Lieret
15 days
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
19
20
270
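The per-turn switching above amounts to drawing a fresh model for every agent step. A minimal sketch, assuming a `model(messages) -> str` callable interface (hypothetical, not the actual mini-SWE-agent API):

```python
import random

def random_switch_model(models, seed=None):
    """Return a 'model' that, on each call, picks one of the underlying
    models uniformly at random: a sketch of per-turn LM switching.
    `models` is a list of callables messages -> str (hypothetical)."""
    rng = random.Random(seed)
    def model(messages):
        return rng.choice(models)(messages)
    return model

# Two stub "LMs" standing in for GPT-5 and Sonnet 4.
gpt5 = lambda messages: "gpt-5: ls"
sonnet4 = lambda messages: "sonnet-4: ls"
mixed = random_switch_model([gpt5, sonnet4], seed=0)
replies = [mixed([]) for _ in range(20)]
```

Because the switch happens inside the model callable, the agent loop itself needs no changes, which is one reason this experiment is cheap to run.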
@KLieret
Kilian Lieret
22 days
RT @richardcsuwandi: Introducing OpenEvolve x AlgoTune! Now you can run and benchmark evolutionary coding agents on 100+ algorithm optim….
0
20
0
@KLieret
Kilian Lieret
22 days
RT @_carlosejimenez: Recent open model scores on SWE-bench Bash Only:
🥇 Qwen3-Coder 480B/A35B Instruct - 55.40%
🥈 Kimi-K2-Instruct - 43.80%
🥉 …
0
27
0
@KLieret
Kilian Lieret
22 days
RT @SemiAnalysis_: At the end of the day, the SWE-bench leaderboard on swebench dot com is probably the most clear description of current m….
0
15
0
@KLieret
Kilian Lieret
26 days
We also added a blog post with the exact command to reproduce these numbers:
0
0
2
@KLieret
Kilian Lieret
27 days
Evaluated with our open-source minimal agent that tests LMs in a bare-bones shell environment. The agent is implemented in just some 100 lines! We'll add the results to our SWE-bench (bash-only) leaderboard shortly:
github.com/SWE-agent/mini-swe-agent
0
0
2
@KLieret
Kilian Lieret
27 days
GPT-5-* is also much faster at getting to its peak, so definitely don't let it run longer than 50 steps for cost efficiency.
2
0
5
@KLieret
Kilian Lieret
27 days
Agents succeed fast but fail slowly, so the average cost per instance depends on the step limit. But one thing is clear: GPT-5 is cheaper than Sonnet 4, and GPT-5 mini is incredibly cost-efficient!
1
1
7
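"Succeed fast, fail slowly" has a direct arithmetic consequence: solved instances stop early, while failed runs burn the full step budget, so raising the cap mostly adds failure cost. A toy sketch with entirely made-up numbers (not the actual SWE-bench data):

```python
def avg_cost_per_instance(success_steps, n_failures, step_cap, cost_per_step):
    """Average cost when successes stop at their solve step and failures
    always run to the step cap. All numbers below are invented for
    illustration; they are not the measured SWE-bench costs."""
    capped = [min(s, step_cap) for s in success_steps]
    total_steps = sum(capped) + n_failures * step_cap
    return total_steps * cost_per_step / (len(success_steps) + n_failures)

# Toy run: 6 solved instances, 4 failures, $0.002 per step.
solves = [10, 12, 15, 20, 25, 30]
low_cap = avg_cost_per_instance(solves, 4, 50, 0.002)    # cap at 50 steps
high_cap = avg_cost_per_instance(solves, 4, 250, 0.002)  # cap at 250 steps
```

With the higher cap the successes cost exactly the same, but the four failures alone add 800 extra steps, which is why a tight step limit can cut the bill without touching the solve rate.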
@KLieret
Kilian Lieret
27 days
We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4.1 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵
5
6
33
@KLieret
Kilian Lieret
27 days
gpt-5-mini delivers software engineering for very cheap! We're seeing 60% on SWE-bench verified for just $18 total using our bare-bones 100-line agent. That's 299/500 GitHub issues solved! Very fast, too (1.5h total with 10 workers).
1
2
11
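The per-instance numbers implied above follow from simple division over the figures stated in the tweet ($18 total, 500 instances, 299 solved):

```python
total_cost = 18.0    # USD for the whole run, as stated above
n_instances = 500    # SWE-bench verified instances
n_solved = 299       # resolved GitHub issues (60%)

cost_per_instance = total_cost / n_instances  # $0.036 per attempted instance
cost_per_solved = total_cost / n_solved       # about $0.06 per solved issue
```

So even counted only against the issues it actually resolved, the run comes out to roughly six cents per fixed GitHub issue.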