Kilian Lieret
@KLieret
1K Followers · 52 Following · 32 Media · 132 Statuses
AI agents & benchmarks for software engineering @Princeton
Princeton
Joined May 2021
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
12 replies · 76 reposts · 791 likes
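A minimal agent of this kind boils down to a short loop: ask the LM for one bash command, run it, and feed the output back as the next observation. A hedged sketch of that loop (hypothetical helper names and message format — not the actual mini-swe-agent code):

```python
import subprocess

def query_model(messages):
    # Placeholder for the LM call (e.g. via any chat-completions client).
    # Expected to return a reply containing one bash command in a ```bash fence.
    raise NotImplementedError

def extract_command(reply):
    # Pull the bash command out of the reply's ```bash fence (assumed format).
    start = reply.index("```bash") + len("```bash")
    end = reply.index("```", start)
    return reply[start:end].strip()

def run_agent(task, max_steps=50):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:  # model signals completion (assumed convention)
            break
        cmd = extract_command(reply)
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Feed the command's output back as the next observation
        messages.append({"role": "user", "content": result.stdout + result.stderr})
    return messages
```

Because the only "tool" is the shell, the same loop works with any LM that can emit bash commands, which is what makes the 100-line design model-agnostic.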
Happy to see an evaluation of model abilities on higher-level objectives (extra points for the arena format)
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
1 reply · 2 reposts · 8 likes
Let's stop babysitting LMs in our SWE benchmarks! In CodeClash, we're evaluating them like senior devs: By goals achieved, not by tickets closed! Our new eval is much more free-form than SWE-bench and models are still terrible at it!
0 replies · 1 repost · 12 likes
SWE-bench turns 2 years old today! Led by @_carlosejimenez @jyangballin @KLieret and me, we've continued maintaining and expanding the SWE-bench universe: SWE-agent, SWE-bench Multimodal, SWE-bench Multilingual, and SWE-smith. Looking forward to the next 2 years!
0 replies · 1 repost · 22 likes
Tomorrow is the 2nd birthday of SWE-bench! I will be giving a talk at MIT about SWE-bench & SWE-agent: an overview, how we developed those ideas and what’s next. Link ⬇️
6 replies · 6 reposts · 207 likes
For reference, this is the prompt we used for mini-swe-agent: https://t.co/eGFCR4NauS (it's the same for all models)
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0 replies · 0 reposts · 1 like
We ran without this part of the prompt and still saw that Sonnet 4.5 takes more steps toward solutions. That really speaks to the ability of newer models to stay on track during long iterative processes. The plot below shows that more than 20% of successes come after step 60.
1 reply · 0 reposts · 1 like
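The headline statistic can be computed directly from per-instance runs; a sketch with toy data (the field names `resolved` and `n_steps` are assumptions, not the actual trajectory schema):

```python
def late_success_fraction(trajectories, step_cutoff=60):
    """Fraction of successful runs that needed more than `step_cutoff` steps."""
    successes = [t for t in trajectories if t["resolved"]]
    if not successes:
        return 0.0
    late = [t for t in successes if t["n_steps"] > step_cutoff]
    return len(late) / len(successes)

# Toy data: 5 solved instances, one of which needed more than 60 steps
runs = [
    {"resolved": True, "n_steps": 12},
    {"resolved": True, "n_steps": 35},
    {"resolved": False, "n_steps": 100},
    {"resolved": True, "n_steps": 48},
    {"resolved": True, "n_steps": 72},
    {"resolved": True, "n_steps": 20},
]
print(late_success_fraction(runs))  # → 0.2
```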
"You should use tools as much as possible, ideally more than 100 times." interesting part of the prompt that Anthropic used with Sonnet 4.5 on SWE-bench, showing an intentional effort to get the model to iterate longer on hard problems. 🧵
1 reply · 1 repost · 2 likes
This analysis was conducted with mini-swe-agent. It's open source and the documentation tells you exactly how to reproduce our numbers.
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0 replies · 2 reposts · 7 likes
This run was performed with the new SWE-bench docker images provided by @_carlosejimenez, which fix the recently discovered bug where models cheat using the git history. All other numbers quoted here also include fixes for this issue.
1 reply · 0 reposts · 6 likes
By varying the agent step limit, you can get some control over cost, giving you a curve of average cost vs. SWE-bench score. But clearly it's quite expensive even with conservative limits.
1 reply · 2 reposts · 13 likes
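One way to trace such a curve offline is to replay finished runs under different step caps. A sketch with assumed fields (`cost`, `n_steps`, `resolved`) and the simplifying assumption that cost scales linearly with steps — in reality later steps cost more because the context keeps growing, so this underestimates late-step cost:

```python
def cost_score_curve(trajs, limits):
    """For each step limit: average cost and solve rate when runs are
    truncated at that limit. Cost is prorated linearly by steps taken."""
    curve = []
    for limit in limits:
        cost = sum(t["cost"] * min(t["n_steps"], limit) / t["n_steps"] for t in trajs)
        solved = sum(1 for t in trajs if t["resolved"] and t["n_steps"] <= limit)
        curve.append((limit, cost / len(trajs), solved / len(trajs)))
    return curve

# Toy trajectories (assumed fields, not the actual output schema)
trajs = [
    {"cost": 1.0, "n_steps": 10, "resolved": True},
    {"cost": 4.0, "n_steps": 40, "resolved": True},
    {"cost": 2.0, "n_steps": 80, "resolved": False},
]
for limit, avg_cost, score in cost_score_curve(trajs, [20, 100]):
    print(limit, round(avg_cost, 2), round(score, 2))
```

Sweeping `limits` over a range of step caps yields exactly the cost-vs-score trade-off curve described above.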
Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice.
2 replies · 2 reposts · 30 likes
We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench verified: 70.6%! Same price/token as Sonnet 4, but takes more steps, ending up being more expensive. Cost analysis details & link to full trajectories in 🧵
4 replies · 14 reposts · 85 likes
To set up with ranger, create ~/.config/ranger/rifle.conf and put in something like:

ext json = jless "$1"
ext jsonl = jless "$1"
ext yaml = jless "$1"
ext yml = jless "$1"
else = "$EDITOR" -- "$@"

etc.
0 replies · 0 reposts · 0 likes
https://t.co/P8yZW0rh6Y Somewhat hard to find is the most useful keybinding: ps prints the current value with line breaks, which makes jless a lot more useful than most other tools, because configs/trajectories often contain these very long prompts.
jless.io
jless is a command-line JSON viewer designed for reading, exploring, and searching through JSON data.
1 reply · 0 reposts · 0 likes
jless is my new favorite command line tool. Super efficient for browsing yaml/json files with vim keybindings. Great for looking at agent configs and trajectories because long lines are collapsed by default. Also works amazingly with ranger for command line navigation! 🧵
1 reply · 0 reposts · 1 like
Really cool to see some of the methods from SWE-smith for synthetic training data generation being used here. And congrats to the @AIatMeta team on those amazing SWE-bench numbers with such a small model!
0 replies · 1 repost · 5 likes
Super excited to have @anyscalecompute use mini-swe-agent for their large scale runs! w/ @KLieret @_carlosejimenez @jyangballin
1 reply · 4 reposts · 16 likes
Do you find it challenging to run RL / agent simulations at a large scale (e.g. dealing with docker and remote execution)? Check out our blog post https://t.co/iNPivIzbc2 where we show how to do it with Ray and mini-swe-agent (kudos to @KLieret)
anyscale.com
Powered by Ray, Anyscale empowers AI builders to run and scale all ML and AI workloads on any cloud and on-prem.
0 replies · 7 reposts · 17 likes
You can find lots of other models evaluated under the same settings at https://t.co/sONyar3MGL (bash-only leaderboard). You can find our agent implementation at
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0 replies · 0 reposts · 9 likes