nilenso

@nilenso

Followers: 2K · Following: 320 · Media: 248 · Statuses: 2K

Employee-owned programmer cooperative in Bangalore.

Joined May 2013
@nilenso
nilenso
2 years
In 2013, a group of makers got together to find new ways to work together. A lot has happened since. We recently celebrated our 10th birthday :) Over the years, we've had the privilege of working with some exceptional organizations and doing work we're proud of. 1/2
2
4
43
@AtharvaRaykar
atharva
9 days
I let Codex CLI rip over the @nilenso website code to optimise performance. It scripted a benchmark, applied some changes and reran the bench to confirm that its changes sped things up by ~5x. Our website sends ~10x less data as well. We had been putting off the website
0
1
1
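A hedged sketch of the kind of benchmark an agent might script for a job like this: fetch a handful of pages, record wall-clock time and response bytes, and compare runs before and after a change. The URLs and page list below are placeholders, not the actual benchmark Codex wrote.

```python
# Hypothetical before/after page benchmark; run it once on the old build and once
# on the optimised build, then compare the two printouts.
import time
import requests

PAGES = ["https://nilenso.com/", "https://nilenso.com/blog"]  # assumed targets

def bench(pages, runs=5):
    timings, payload = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        for url in pages:
            resp = requests.get(url, timeout=30)
            payload += len(resp.content)          # decoded response body size
        timings.append(time.perf_counter() - start)
    return min(timings), payload / runs

if __name__ == "__main__":
    best, bytes_per_run = bench(PAGES)
    print(f"best run: {best:.2f}s, ~{bytes_per_run / 1024:.0f} KiB of response body per run")
```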
@AtharvaRaykar
atharva
11 days
another win for bitter lesson driven development: specialised tool interfaces -> code execution https://t.co/UgHWFIIEM3
@AnthropicAI
Anthropic
12 days
New on the Anthropic Engineering blog: tips on how to build more efficient agents that handle more tools while using fewer tokens. Code execution with the Model Context Protocol (MCP):
0
2
3
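In miniature, the pattern the post describes (an illustrative sketch, not the MCP SDK or Anthropic's code): rather than the model issuing one tool call per item and paying context for every intermediate result, it writes a small script against a tool API, the harness executes it, and only the final summary re-enters the context window. `list_documents` and `read_document` are stand-in tool wrappers.

```python
# Stand-in tool wrappers; in the real pattern these would call MCP server tools.
def list_documents() -> list[str]:
    return ["a.txt", "b.txt", "c.txt"]

def read_document(name: str) -> str:
    return f"contents of {name}"

# Code the model might emit: iterate, filter, and aggregate locally,
# returning one small result instead of N tool-call transcripts.
MODEL_GENERATED = """
matches = []
for name in list_documents():
    if "needle" in read_document(name):
        matches.append(name)
result = {"matching_documents": matches, "checked": len(list_documents())}
"""

def execute(code: str) -> dict:
    # A real harness would sandbox this; plain exec() keeps the sketch short.
    scope = {"list_documents": list_documents, "read_document": read_document}
    exec(code, scope)
    return scope["result"]

print(execute(MODEL_GENERATED))   # only this summary goes back into the model's context
```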
@SrihariSriraman
Srihari Sriraman
12 days
Sometimes I just want to give a github url, and a prompt to semantically search. Similar to web search tools, but for Github / Gitlab. I made a tool that does this, following @thorstenball 's "How to Build an Agent", and @nickbaumann_ 's "What Makes a Coding Agent?" blog posts.
1
4
18
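A rough sketch of the tool shape described here, not Srihari's implementation: take a repository URL and a query, fetch the code, and return the most relevant lines. A plain keyword scan stands in for the semantic ranking, and the surrounding agent loop from "How to Build an Agent" is omitted.

```python
# Hypothetical repo-search tool: shallow-clone the repo, scan files for the query,
# and return file:line hits the agent can cite.
import subprocess
import tempfile
from pathlib import Path

def search_repo(repo_url: str, query: str, max_hits: int = 5) -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(["git", "clone", "--depth", "1", repo_url, tmp], check=True)
        hits = []
        for path in Path(tmp).rglob("*"):
            if not path.is_file() or ".git" in path.parts:
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            for i, line in enumerate(text.splitlines(), start=1):
                if query.lower() in line.lower():
                    hits.append(f"{path.relative_to(tmp)}:{i}: {line.strip()}")
                    if len(hits) >= max_hits:
                        return hits
        return hits

# e.g. search_repo("https://github.com/nilenso/some-repo", "retry")  # hypothetical repo
```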
@SrihariSriraman
Srihari Sriraman
17 days
@dbreunig My blog post has a lot more detail about all this. Check out this section for details on why existing observability tools don't cut it:
1
1
1
@SrihariSriraman
Srihari Sriraman
17 days
You know how your LLM context is a giant wall of text, and mostly an opaque box that you don't open? I built a tool to open it up, and pull it apart for you so that you can actually do "context engineering". You can just drag-drop your conversation.json into it, and it will
1
1
7
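A back-of-the-envelope sketch of what pulling a context apart can look like, assuming conversation.json is a list of {role, content} messages (which may not match the tool's actual schema): report how much of the context each message and each role consumes.

```python
# Summarise a conversation dump: per-message size, then per-role share of the context.
import json
from collections import Counter

def summarise(path: str) -> None:
    messages = json.load(open(path))
    by_role = Counter()
    for i, msg in enumerate(messages):
        content = msg["content"] if isinstance(msg["content"], str) else json.dumps(msg["content"])
        by_role[msg["role"]] += len(content)
        print(f"[{i:>3}] {msg['role']:<10} {len(content):>7} chars")
    total = sum(by_role.values())
    for role, chars in by_role.most_common():
        print(f"{role:<10} {chars:>8} chars ({chars / total:.0%} of context)")

# summarise("conversation.json")
```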
@AtharvaRaykar
atharva
1 month
Wrote a new post about a trend I'm seeing. Designing a good AI-integrated application means trading off tricks that improve performance *today* against preparing for the bitter lesson that will strike in the future.
1
1
2
@nilenso
nilenso
2 months
Read the full post here: https://t.co/FArnJQCR0q
0
0
1
@nilenso
nilenso
2 months
Throwing out some ideas for improving benchmarks.
1
0
1
@nilenso
nilenso
2 months
There seems to be a tradeoff between making a benchmark easy to run and making it more representative. It's hard to scale benchmarks. The more sophisticated benchmarks (TerminalBench, SWE-Lancer, GDPVal) have fewer data points, which makes statistical inference trickier.
1
0
1
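A quick illustration of why fewer data points make inference trickier: under a normal approximation, the 95% confidence interval on a measured pass rate shrinks only with the square root of the number of tasks.

```python
# Half-width of a 95% CI on a benchmark pass rate, normal approximation.
from math import sqrt

def ci_halfwidth(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    return z * sqrt(pass_rate * (1 - pass_rate) / n_tasks)

for n in (50, 250, 500):
    print(f"n={n}: 60% pass rate is really 60% ± {100 * ci_halfwidth(0.6, n):.1f} points")
# A ~13-point interval at n=50 can swallow the gap between two frontier models.
```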
@nilenso
nilenso
2 months
There's a lot more out there.
1
0
1
@nilenso
nilenso
2 months
4) LiveCodeBench: a test of Python competitive programming. Not SWE, and uncontaminated. There are some tasks other than writing code thrown into the mix, such as predicting the output of a function without actually running it. Pass criteria: get the hidden tests to pass.
1
0
1
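An illustrative output-prediction item of the kind mentioned (made up here, not drawn from LiveCodeBench): the model sees a function and an input and must state the return value without executing anything.

```python
# The model is shown this function and the call below, and must answer from reading alone.
def f(xs):
    return sorted(set(xs))[-2]

# Question posed to the model: what does f([3, 1, 4, 1, 5, 9, 2, 6]) return?
# Expected answer: 6  (the second-largest distinct element)
```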
@nilenso
nilenso
2 months
3) Aider Polyglot: it covers a broader set of languages than the SWE-Bench family, even if the problems don't look quite like the ones a software engineer would realistically encounter. Pass criteria: the unit tests associated with the Exercism exercises.
1
0
1
@nilenso
nilenso
2 months
(Interlude) The astute software engineer will note that passing the unit tests does not mean the underlying issue is resolved or the feature is built correctly. This is known: Yu et al. found issues of this kind with SWE-Bench.
1
0
2
@nilenso
nilenso
2 months
2) SWE-Bench Pro: also like SWE-Bench Verified, but more recent and spanning a wider variety of repositories, both open source and private. Pass criteria: get the unit tests to pass.
1
0
2
@nilenso
nilenso
2 months
What popular SWE/coding benchmarks are measuring. 1) SWE-Bench Verified: patches that close GitHub issues, mostly in Python library repositories, and within those mostly Django. Pass criteria: get the unit tests to pass.
1
1
10
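Not the actual SWE-Bench harness, but the shape of that pass criterion: apply the model's patch, then require the issue's previously failing tests (FAIL_TO_PASS) and an existing regression set (PASS_TO_PASS) to all pass. The paths and test ids below are placeholders.

```python
# Minimal pass-criterion sketch: patch the repo, run the selected tests, pass iff all pass.
import subprocess

def evaluate(repo_dir: str, patch_file: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass, *pass_to_pass],
                            cwd=repo_dir)
    return result.returncode == 0   # "resolved" means every selected test passed

# evaluate("django/", "model_patch.diff",
#          ["tests/queries/test_bulk_update.py::BulkUpdateTests::test_example"],
#          ["tests/queries"])
```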
@rseroter
Richard Seroter
2 months
"Breaking down your task into “right-sized” units of work, which describe just the right amount of detail is perhaps the most powerful lever to improve your context window, and thus the correctness and quality of the generated code." https://t.co/GCkhhlQWlS < great POV from
0
1
7
@mfranz_on
Marco Franzon
2 months
The more turns your AI takes to get the job done, the more slop you will find in the answer. Thanks to @nilenso for this interesting article
2
1
5
@nilenso
nilenso
2 months
Read his post here: https://t.co/SPaVlOQH61
0
0
1
@nilenso
nilenso
2 months
More to come: we'll continue to experiment with good units of work for AI agents over the next few months and share what we find. We're picking good ol' User Stories as a starting point, as they are small units of work that end with a clear business outcome.
1
0
0
@nilenso
nilenso
2 months
Atharva notes the gap between the results on the widely shared @METR_Evals chart of AI's ability to perform long tasks and coding agent performance in the messy real world, which further strengthens the case for a system that manages the units of work fed to a coding agent.
1
0
0