nilenso (@nilenso)
Employee-owned programmer cooperative in Bangalore.
Joined May 2013
2K Followers · 320 Following · 248 Media · 2K Statuses
In 2013, a group of makers got together to find new ways to work together. A lot has happened since. We recently celebrated our 10th birthday :) Over the years, we've had the privilege of working with some exceptional organizations and doing work we're proud of. 1/2
I let Codex CLI rip over the @nilenso website code to optimise performance. It scripted a benchmark, applied some changes and reran the bench to confirm that its changes sped things up by ~5x. Our website sends ~10x less data as well. We had been putting off the website
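Not the actual script the agent wrote, but a minimal sketch of that kind of before/after benchmark, assuming hypothetical page URLs and a repeat count of five: fetch each page a few times, record wall-clock time and bytes transferred, and compare the numbers before and after the changes.

```python
# Minimal page benchmark sketch (hypothetical URLs; not the agent's script).
import statistics
import time
import urllib.request

PAGES = ["https://www.nilenso.com/", "https://www.nilenso.com/blog"]  # assumed paths

def bench(url: str, runs: int = 5) -> tuple[float, int]:
    """Return median load time (seconds) and median transfer size (bytes)."""
    times, sizes = [], []
    for _ in range(runs):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        times.append(time.perf_counter() - start)
        sizes.append(len(body))
    return statistics.median(times), int(statistics.median(sizes))

for page in PAGES:
    t, size = bench(page)
    print(f"{page}: median {t * 1000:.0f} ms, {size / 1024:.0f} KiB")
```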
another win for bitter lesson driven development: specialised tool interfaces -> code execution https://t.co/UgHWFIIEM3
New on the Anthropic Engineering blog: tips on how to build more efficient agents that handle more tools while using fewer tokens. Code execution with the Model Context Protocol (MCP):
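The gist of that pattern, sketched here with entirely hypothetical module and function names: instead of loading one JSON schema per tool into the model's context and routing every intermediate result back through it, expose a single code-execution tool and let the model compose ordinary function calls in a sandbox, so intermediate data stays in variables rather than in the context window.

```python
# Hypothetical sketch only: "sandbox_tools", "drive" and "salesforce" are made-up
# wrappers standing in for tool servers exposed to the sandbox as importable code.
code_written_by_model = """
from sandbox_tools import drive, salesforce

# Intermediate results live in sandbox variables, not in the model's context.
transcript = drive.get_document("meeting-notes")
lead = salesforce.find_lead(email="jane@example.com")
salesforce.attach_note(lead["id"], transcript[:2000])
print("note attached")
"""
# The agent sends this one program to a single execute_code tool instead of
# making three separate tool calls whose outputs all flow through the context.
```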
Sometimes I just want to give a GitHub URL and a prompt, and have the repo semantically searched. Similar to web search tools, but for GitHub / GitLab. I made a tool that does this, following @thorstenball's "How to Build an Agent" and @nickbaumann_'s "What Makes a Coding Agent?" blog posts.
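Not the author's tool, but a minimal sketch in the spirit of "How to Build an Agent": one loop, one repo-search tool. The tool name and schema, the model id, and the grep-based search (a crude stand-in for real semantic search) are all assumptions.

```python
# Sketch of an agent with a single repo-search tool (Anthropic Python SDK).
import subprocess
import tempfile

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

SEARCH_TOOL = {
    "name": "search_repo",
    "description": "Shallow-clone a public git repo and search its files for a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "repo_url": {"type": "string"},
            "query": {"type": "string"},
        },
        "required": ["repo_url", "query"],
    },
}

def search_repo(repo_url: str, query: str) -> str:
    # grep is a stand-in here; a real version would embed and rank code chunks.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(["git", "clone", "--depth", "1", repo_url, tmp], check=True)
        hits = subprocess.run(["grep", "-rni", query, tmp],
                              capture_output=True, text=True)
        return hits.stdout[:4000] or "no matches"

def run(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",  # substitute your preferred model
            max_tokens=1024,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": search_repo(**block.input)}
            for block in resp.content if block.type == "tool_use"
        ]
        if not results:  # no tool calls left: return the model's final answer
            return "".join(b.text for b in resp.content if b.type == "text")
        messages.append({"role": "user", "content": results})

print(run("In https://github.com/pallets/flask, where is request routing implemented?"))
```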
@dbreunig My blog post has a lot more detail about all this. Check out this section for details on why existing observability tools don't cut it:
You know how your LLM context is a giant wall of text, and mostly an opaque box that you don't open? I built a tool to open it up, and pull it apart for you so that you can actually do "context engineering". You can just drag-drop your conversation.json into it, and it will
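The real tool is interactive, but here's a rough sketch of the underlying idea, assuming the export is a list of {role, content} messages and using a crude 4-characters-per-token estimate: parse conversation.json and show which roles are eating the context budget.

```python
# Rough context-budget breakdown for an exported conversation.json.
# Assumptions: the file is a list of {"role": ..., "content": ...} messages
# (or an object with a "messages" list), and ~4 characters per token.
import json
import sys
from collections import Counter

def load_messages(path: str) -> list:
    with open(path) as f:
        data = json.load(f)
    return data if isinstance(data, list) else data.get("messages", [])

def main(path: str) -> None:
    by_role = Counter()
    for msg in load_messages(path):
        content = msg.get("content", "")
        if not isinstance(content, str):   # tool calls, multi-part content, etc.
            content = json.dumps(content)
        by_role[msg.get("role", "unknown")] += len(content)
    total = sum(by_role.values()) or 1
    for role, chars in by_role.most_common():
        print(f"{role:>10}: ~{chars // 4:>7} tokens  ({100 * chars // total}%)")

if __name__ == "__main__":
    main(sys.argv[1])
```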
Wrote a new post about a trend I'm seeing. Designing a good AI-integrated application means trading off tricks that improve performance *today* against preparing for the bitter lesson that will strike in the future.
There seems to be a tradeoff between making a benchmark easy to run and making it more representative. It's hard to scale benchmarks. The more sophisticated benchmarks (TerminalBench, SWE-Lancer, GDPVal) have fewer data points, which makes statistical inference trickier.
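A back-of-the-envelope way to see the "fewer data points" problem: the 95% confidence interval on a measured pass rate shrinks only with the square root of the number of tasks, so small benchmarks leave wide error bars around any score. The task counts below are illustrative, not the benchmarks' actual sizes.

```python
# Half-width of a 95% CI (normal approximation) on a benchmark pass rate,
# at an assumed 50% pass rate, for a few illustrative benchmark sizes.
import math

def ci_half_width(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)

for n in (50, 250, 500, 2000):
    hw = ci_half_width(0.5, n)
    print(f"{n:>4} tasks: measured 50% could plausibly be 50% +/- {100 * hw:.1f} points")
```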
4) LiveCodeBench This is a test for Python competitive programming. Not SWE, and uncontaminated. There are some tasks other than writing code thrown into the mix, such as predicting the output of a function without actually running it. Pass criteria: get the hidden tests to pass.
3) Aider Polyglot It captures a broader set of languages than the SWE-Bench family, even if the problems don't look quite like the ones a software engineer would realistically encounter. Pass criteria: unit tests associated with the Exercism exercises.
(interlude) The astute software engineer will note that passing the unit tests does not mean that the underlying issue is resolved or the feature is built correctly. This is known: Yu et al. found issues of this kind with SWE-Bench.
2) SWE Bench Pro Also like SWE Bench Verified, but more recent and across a wider variety of repositories, both open source and private. Pass criteria: get the unit tests to pass.
What popular SWE/coding benchmarks are measuring. 1) SWE Bench Verified Patches that close GitHub issues. Mostly for Python library repositories, and within those, mostly Django. Pass criteria: get the unit tests to pass.
"Breaking down your task into “right-sized” units of work, which describe just the right amount of detail is perhaps the most powerful lever to improve your context window, and thus the correctness and quality of the generated code." https://t.co/GCkhhlQWlS < great POV from
The more turns your AI takes to get the job done, the more slop you will find in the answer. Thanks to @nilenso for this interesting article
More to come: we'll continue to experiment with good units of work for AI agents over the next few months and share what we find. We are picking good ol' User Stories as a starting point, as they are small units of work that end with a clear business outcome.
Atharva notes the gap between the results on the widely shared @METR_Evals chart on AI's ability to perform long tasks, and coding agent performance in the messy real world, further strengthening the need for a system that manages the units of work that are fed to a coding agent.