Kilian Lieret Profile
Kilian Lieret

@KLieret

Followers
1K
Following
52
Media
32
Statuses
132

AI agents & benchmarks for software engineering @Princeton

Princeton
Joined May 2021
@KLieret
Kilian Lieret
4 months
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
12
76
791
@andykonwinski
Andy Konwinski
9 days
happy to see an evaluation of model abilities at higher level objectives (extra points for the arena format)
@jyangballin
John Yang
9 days
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
1
2
8
@KLieret
Kilian Lieret
9 days
Let's stop babysitting LMs in our SWE benchmarks! In CodeClash, we're evaluating them like senior devs: By goals achieved, not by tickets closed! Our new eval is much more free-form than SWE-bench and models are still terrible at it!
@jyangballin
John Yang
9 days
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
0
1
12
@OfirPress
Ofir Press
1 month
SWE-bench turns 2 years old today! Led by @_carlosejimenez @jyangballin @KLieret and me, we've continued maintaining and expanding the SWE-bench universe: SWE-agent, SWE-bench Multimodal, SWE-bench Multilingual and SWE-smith. Looking forward to the next 2 years!
0
1
22
@OfirPress
Ofir Press
1 month
Tomorrow is the 2nd birthday of SWE-bench! I will be giving a talk at MIT about SWE-bench & SWE-agent: an overview, how we developed those ideas and what’s next. Link ⬇️
6
6
207
@KLieret
Kilian Lieret
1 month
We ran without this part in the prompt; still saw that Sonnet 4.5 takes more steps towards solutions. Really speaks to the ability of newer models to stay on track during long iterative processes. Plot below shows that more than 20% of the successes come after 60 steps.
1
0
1
@KLieret
Kilian Lieret
1 month
"You should use tools as much as possible, ideally more than 100 times." interesting part of the prompt that Anthropic used with Sonnet 4.5 on SWE-bench, showing an intentional effort to get the model to iterate longer on hard problems. 🧵
1
1
2
@KLieret
Kilian Lieret
2 months
This analysis was conducted with mini-swe-agent. It's open source and the documentation tells you exactly how to reproduce our numbers.
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0
2
7
@KLieret
Kilian Lieret
2 months
This run was performed with the new SWE-bench docker images provided by @_carlosejimenez that fix the recently discovered bug where models cheat with the git history. All other numbers quoted here also include fixes for this issue.
1
0
6
@KLieret
Kilian Lieret
2 months
You can find all of the trajectories here:
1
1
6
@KLieret
Kilian Lieret
2 months
By varying the agent step limit, you can get some control over the cost, giving you a curve of average cost vs SWE-bench score. But clearly it's quite expensive even with conservative limits.
1
2
13
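The cost-versus-score trade-off described in this thread can be illustrated with a minimal sketch. It is not the analysis code behind the plot: the `Run` record, its fields, and the numbers are hypothetical, and it assumes that cutting an unrestricted run off at the step limit gives the same outcome as rerunning with that limit, and that cost per step is constant (in practice, later steps tend to cost more as the context grows).

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One hypothetical agent run on a single benchmark instance."""
    steps_to_finish: int  # steps the agent took with no step limit
    cost_per_step: float  # assumed constant cost per step, in USD
    solved: bool          # whether the unrestricted run solved the instance


def cost_score_curve(runs, step_limits):
    """Return one (limit, average cost, score) tuple per step limit."""
    curve = []
    for limit in step_limits:
        total_cost = 0.0
        solved = 0
        for run in runs:
            # A run is cut off once it reaches the limit.
            steps_used = min(run.steps_to_finish, limit)
            total_cost += steps_used * run.cost_per_step
            # Only runs that finished within the limit can count as solved.
            if run.solved and run.steps_to_finish <= limit:
                solved += 1
        curve.append((limit, total_cost / len(runs), solved / len(runs)))
    return curve


# Made-up data purely for illustration.
runs = [
    Run(steps_to_finish=20, cost_per_step=0.01, solved=True),
    Run(steps_to_finish=50, cost_per_step=0.01, solved=True),
    Run(steps_to_finish=80, cost_per_step=0.01, solved=True),
    Run(steps_to_finish=120, cost_per_step=0.01, solved=False),
]
curve = cost_score_curve(runs, [25, 60, 150])
for limit, avg_cost, score in curve:
    print(f"limit={limit:>3}  avg_cost=${avg_cost:.4f}  score={score:.0%}")
```

In this toy data, raising the limit from 60 to 150 steps adds one more solved instance but also pays for the long failed run, which is the kind of effect that makes high step limits expensive.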
@KLieret
Kilian Lieret
2 months
Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice
2
2
30
@KLieret
Kilian Lieret
2 months
We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench verified: 70.6%! Same price/token as Sonnet 4, but takes more steps, ending up being more expensive. Cost analysis details & link to full trajectories in 🧵
4
14
85
@KLieret
Kilian Lieret
2 months
To set up with ranger, you want to create ~/.config/ranger/rifle.conf and put something like:
ext json = jless "$1"
ext jsonl = jless "$1"
ext yaml = jless "$1"
ext yml = jless "$1"
else = "$EDITOR" -- "$@"
etc.
0
0
0
@KLieret
Kilian Lieret
2 months
https://t.co/P8yZW0rh6Y Somewhat hard to find is the most useful keybinding: ps will print the current value with linebreaks (which makes it a lot more useful than most other tools because you often have these long prompts in the configs/trajs).
jless.io
jless is a command-line JSON viewer designed for reading, exploring, and searching through JSON data.
1
0
0
@KLieret
Kilian Lieret
2 months
jless is my new favorite command line tool. Super efficient for browsing yaml/json files with vim keybindings. Great for looking at agent configs and trajectories because long lines are collapsed by default. Also works amazing with ranger for command line navigation! 🧵
1
0
1
@KLieret
Kilian Lieret
2 months
Really cool to see some of the methods from SWE-smith for synthetic training data generation be used here. And congrats to the @AIatMeta team on those amazing SWE-bench numbers with such a small model!
@jyangballin
John Yang
2 months
Incredibly excited by this work, congrats @syhw + @AIatMeta codegen! 32b model that hits 65.8% on SWE-bench w/ TTS is incredible. A year ago that would've been unimaginable to me. Section 2 is a great read - resonates so much w/ what SWE-smith is trying to achieve in the open.
0
1
5
@OfirPress
Ofir Press
2 months
Super excited to have @anyscalecompute use mini-swe-agent for their large scale runs! w/ @KLieret @_carlosejimenez @jyangballin
1
4
16
@pcmoritz
Philipp Moritz
2 months
Do you find it challenging to run RL / agent simulations at a large scale (e.g. dealing with docker and remote execution)? Check out our blog post https://t.co/iNPivIzbc2 where we show how to do it with Ray and mini-swe-agent (kudos to @KLieret)
anyscale.com
Powered by Ray, Anyscale empowers AI builders to run and scale all ML and AI workloads on any cloud and on-prem.
0
7
17
@KLieret
Kilian Lieret
3 months
You can find lots of other models evaluated under the same settings at https://t.co/sONyar3MGL (bash-only leaderboard). You can find our agent implementation at
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0
0
9