Kilian Lieret Profile
Kilian Lieret

@KLieret

Followers
522
Following
21
Media
14
Statuses
61

Research Software Engineer at Princeton University. AI agents & benchmarks for software engineering.

Princeton
Joined May 2021
@KLieret
Kilian Lieret
3 days
RT @ori_press: Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize…
@KLieret
Kilian Lieret
10 days
RT @SWEbench: We just updated the SWE-bench Multimodal leaderboard with new systems from @refact_ai, @allhands_ai and @TU_Muenchen. Congrat…
@KLieret
Kilian Lieret
1 month
RT @a1zhang: Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS…
@KLieret
Kilian Lieret
1 month
Sonnet 3.7 was plateauing at $1.50 (the bump at the end is an artifact of correct solutions from terminated runs), so increasing the cost limit to $3 would barely have changed this comparison. Sonnet 4 at $3 still had 10 instances terminated for cost.
@KLieret
Kilian Lieret
1 month
"Correctness" corresponds to the resolution rate among all runs that terminated without being killed due to cost etc. "Incorr. localization" means that some files from the gold patch weren't edited. "Incorr. edit" are all failed submissions that aren't in "incorr. localization".
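The three buckets above amount to a small decision rule. A minimal sketch in Python (function and argument names are hypothetical, not SWE-agent's actual API):

```python
def classify_run(resolved, edited_files, gold_files):
    """Bucket a run that terminated normally into the three categories.

    Illustrative only; none of these names come from the SWE-agent codebase.
    """
    if resolved:
        return "correct"
    # incorrect localization: some files from the gold patch weren't edited
    if not set(gold_files).issubset(edited_files):
        return "incorr. localization"
    # every other failed submission is an incorrect edit
    return "incorr. edit"
```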
@KLieret
Kilian Lieret
1 month
Both agent runs compared here use near-identical configurations.
@KLieret
Kilian Lieret
1 month
Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
@KLieret
Kilian Lieret
2 months
RT @jyangballin: 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synt…
@KLieret
Kilian Lieret
2 months
RT @plodq: Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages…
@KLieret
Kilian Lieret
2 months
Maps. Diagrams. UI glitches. SWE-bench Multimodal benchmarks AI agents on real-world frontend issues, and they struggle. Poster at #ICLR25 today. Multiple submissions to the leaderboard already.
@KLieret
Kilian Lieret
3 months
Had a great time talking about building agents, SWE-agent, SWE-bench, and more.
@databrew_db
Data Brew by Databricks
3 months
๐Ÿ“ฃ ๐—ก๐—˜๐—ช #๐——๐—ฎ๐˜๐—ฎ๐—•๐—ฟ๐—ฒ๐˜„ ๐—˜๐—ฝ๐—ถ๐˜€๐—ผ๐—ฑ๐—ฒ!. In this episode, Kilian Lieret (Research Software Engineer) & Carlos Jimenez (Computer Science PhD Candidate) at @Princeton dive into SWE-bench & SWE-agent, two cutting-edge tools for evaluating & enhancing AI in software engineering.
@KLieret
Kilian Lieret
3 months
Find it at The main code boils down to a few hundred lines, making it super easy to adapt. If you write an agent that requires code execution in (possibly multiple) shell sessions, this takes all the dirty work with managing environments off your hands.
@KLieret
Kilian Lieret
3 months
Evaluating SWE-agent on SWE-bench Lite was once an overnight job. With SWE-ReX parallelizing our execution, it now takes half an hour! SWE-ReX spins up Docker containers with a @FastAPI server that uses pexpect to interface with shell sessions. MIT-licensed, lightweight & hackable.
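The core idea here, keeping a long-lived shell process alive and reading its output up to a known sentinel, can be sketched with the Python standard library. SWE-ReX itself uses pexpect behind a FastAPI server; this stdlib version is an illustrative stand-in, not its actual code, and `ShellSession` and `SENTINEL` are made-up names:

```python
import subprocess

SENTINEL = "__DONE__"  # marker echoed after each command so we know where output ends

class ShellSession:
    """A minimal persistent shell session (illustrative sketch only)."""

    def __init__(self):
        # One long-lived bash process; state (cwd, variables) persists across run() calls.
        self.proc = subprocess.Popen(
            ["bash", "--norc"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        )

    def run(self, command):
        # Append a sentinel echo so we can tell when the command's output is complete.
        self.proc.stdin.write(f"{command}; echo {SENTINEL}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line or line.strip() == SENTINEL:
                break
            lines.append(line)
        return "".join(lines)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

Because the process survives between calls, shell state carries over, which is what an agent needs when it runs many commands in one environment.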
@KLieret
Kilian Lieret
3 months
RT @OfirPress: The creators of LiveCodeBench just released a new, private, SWE-bench-like benchmark in Java, C++, Python, JavaScript, TypeS…
@KLieret
Kilian Lieret
3 months
Join @_carlosejimenez and me today at GenAI Collective NYC as we break down SWE-Bench, SWE-agent, and the future of AI-driven software engineering. What works? What's next? What does this mean for developers? Let's discuss! 📍 Today 1–4pm, Brooklyn Navy Yard.
@_ai_collective
The AI Collective
3 months
AI coding tools are moving from autocomplete to autonomy 🤖, with big implications for developers, users, and businesses 💼. Join GenAI Collective NYC this Thursday, April 3 at Brooklyn Navy Yard Bldg 303 for a panel + fireside chat featuring: 🧠 Carlos Jimenez & Kilian Lieret
@KLieret
Kilian Lieret
3 months
RT @OfirPress: We just updated the SWE-bench Multimodal leaderboard. Congrats to Globant, Zencoder, and the Agentless team from UIUC for th…
@KLieret
Kilian Lieret
3 months
RT @daytonaio: Watch Princeton's SWE-agent @KLieret reveal research on autonomous coding agents at Daytona AI Builders @github HQ! From ben…
@KLieret
Kilian Lieret
4 months
SWE-agent 1.0 is so much more flexible than before. It has never been easier to set it up with various tool bundles or multiple LMs. And you can combine them all in a multi-attempt scheme!
@_carlosejimenez
carlos 🇺🇸
4 months
SWE-agent 1.0 lets you run multiple attempts with different models or tools on the same task. Use a review agent to select the best from these diverse runs to improve overall performance!
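The multi-attempt scheme described above boils down to a best-of-n loop: run several attempts with different configurations, then let a review step pick the strongest candidate. A hypothetical sketch (none of these names are SWE-agent's real API):

```python
def best_of_n(task, attempt_configs, run_agent, review):
    """Run one attempt per configuration and return the candidate patch
    that the review function scores highest. Illustrative sketch only."""
    candidates = [run_agent(task, cfg) for cfg in attempt_configs]
    scores = [review(task, patch) for patch in candidates]
    return candidates[scores.index(max(scores))]
```

In the real system the reviewer is itself an agent judging diverse runs; here it is just any scoring callable, which is enough to show the selection logic.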
@KLieret
Kilian Lieret
4 months
Join at