nilenso (@nilenso)
Employee-owned programmer cooperative in Bangalore.
Joined May 2013
2K Followers · 320 Following · 248 Media · 2K Statuses
In 2013, a group of makers got together to find new ways to work together. A lot has happened since. We recently celebrated our 10th birthday :) Over the years, we've had the privilege of working with some exceptional organizations and doing work we're proud of. 1/2
I let Codex CLI rip over the @nilenso website code to optimise performance. It scripted a benchmark, applied some changes and reran the bench to confirm that its changes sped things up by ~5x. Our website sends ~10x less data as well. We had been putting off the website
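Not the actual script the agent wrote, but a minimal sketch of that kind of before/after benchmark, assuming hypothetical page URLs and a repeat count of five: fetch each page a few times, record wall-clock time and bytes transferred, and compare the numbers before and after the changes.

```python
# Minimal page benchmark sketch (hypothetical URLs; not the agent's script).
import statistics
import time
import urllib.request

PAGES = ["https://www.nilenso.com/", "https://www.nilenso.com/blog"]  # assumed paths

def bench(url: str, runs: int = 5) -> tuple[float, int]:
    """Return median load time (seconds) and median transfer size (bytes)."""
    times, sizes = [], []
    for _ in range(runs):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        times.append(time.perf_counter() - start)
        sizes.append(len(body))
    return statistics.median(times), int(statistics.median(sizes))

for page in PAGES:
    t, size = bench(page)
    print(f"{page}: median {t * 1000:.0f} ms, {size / 1024:.0f} KiB")
```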
another win for bitter lesson driven development: specialised tool interfaces -> code execution https://t.co/UgHWFIIEM3
New on the Anthropic Engineering blog: tips on how to build more efficient agents that handle more tools while using fewer tokens. Code execution with the Model Context Protocol (MCP):
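The gist of that pattern, sketched here with entirely hypothetical module and function names: instead of loading one JSON schema per tool into the model's context and routing every intermediate result back through it, expose a single code-execution tool and let the model compose ordinary function calls in a sandbox, so intermediate data stays in variables rather than in the context window.

```python
# Hypothetical sketch only: "sandbox_tools", "drive" and "salesforce" are made-up
# wrappers standing in for tool servers exposed to the sandbox as importable code.
code_written_by_model = """
from sandbox_tools import drive, salesforce

# Intermediate results live in sandbox variables, not in the model's context.
transcript = drive.get_document("meeting-notes")
lead = salesforce.find_lead(email="jane@example.com")
salesforce.attach_note(lead["id"], transcript[:2000])
print("note attached")
"""
# The agent sends this one program to a single execute_code tool instead of
# making three separate tool calls whose outputs all flow through the context.
```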
Sometimes I just want to give a GitHub URL and a prompt, and have the repo semantically searched. Similar to web search tools, but for GitHub / GitLab. I made a tool that does this, following @thorstenball's "How to Build an Agent" and @nickbaumann_'s "What Makes a Coding Agent?" blog posts.
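Not the author's tool, but a minimal sketch in the spirit of "How to Build an Agent": one loop, one repo-search tool. The tool name and schema, the model id, and the grep-based search (a crude stand-in for real semantic search) are all assumptions.

```python
# Sketch of an agent with a single repo-search tool (Anthropic Python SDK).
import subprocess
import tempfile

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

SEARCH_TOOL = {
    "name": "search_repo",
    "description": "Shallow-clone a public git repo and search its files for a query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "repo_url": {"type": "string"},
            "query": {"type": "string"},
        },
        "required": ["repo_url", "query"],
    },
}

def search_repo(repo_url: str, query: str) -> str:
    # grep is a stand-in here; a real version would embed and rank code chunks.
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(["git", "clone", "--depth", "1", repo_url, tmp], check=True)
        hits = subprocess.run(["grep", "-rni", query, tmp],
                              capture_output=True, text=True)
        return hits.stdout[:4000] or "no matches"

def run(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",  # substitute your preferred model
            max_tokens=1024,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": search_repo(**block.input)}
            for block in resp.content if block.type == "tool_use"
        ]
        if not results:  # no tool calls left: return the model's final answer
            return "".join(b.text for b in resp.content if b.type == "text")
        messages.append({"role": "user", "content": results})

print(run("In https://github.com/pallets/flask, where is request routing implemented?"))
```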
@dbreunig My blog post has a lot more detail about all this. Check out this section for details on why existing observability tools don't cut it:
You know how your LLM context is a giant wall of text, and mostly an opaque box that you don't open? I built a tool to open it up, and pull it apart for you so that you can actually do "context engineering". You can just drag-drop your conversation.json into it, and it will
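The real tool is interactive, but here's a rough sketch of the underlying idea, assuming the export is a list of {role, content} messages and using a crude 4-characters-per-token estimate: parse conversation.json and show which roles are eating the context budget.

```python
# Rough context-budget breakdown for an exported conversation.json.
# Assumptions: the file is a list of {"role": ..., "content": ...} messages
# (or an object with a "messages" list), and ~4 characters per token.
import json
import sys
from collections import Counter

def load_messages(path: str) -> list:
    with open(path) as f:
        data = json.load(f)
    return data if isinstance(data, list) else data.get("messages", [])

def main(path: str) -> None:
    by_role = Counter()
    for msg in load_messages(path):
        content = msg.get("content", "")
        if not isinstance(content, str):   # tool calls, multi-part content, etc.
            content = json.dumps(content)
        by_role[msg.get("role", "unknown")] += len(content)
    total = sum(by_role.values()) or 1
    for role, chars in by_role.most_common():
        print(f"{role:>10}: ~{chars // 4:>7} tokens  ({100 * chars // total}%)")

if __name__ == "__main__":
    main(sys.argv[1])
```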
Wrote a new post about a trend I'm seeing. Designing a good AI-integrated application means trading off tricks that improve performance *today* against preparing for the bitter lesson that will strike in the future.
There seems to be a tradeoff between making a benchmark easy to run and making it more representative. It's hard to scale benchmarks. The more sophisticated benchmarks (TerminalBench, SWE-Lancer, GDPVal) have fewer data points, which makes statistical inference trickier.
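A back-of-the-envelope way to see the "fewer data points" problem: the 95% confidence interval on a measured pass rate shrinks only with the square root of the number of tasks, so small benchmarks leave wide error bars around any score. The task counts below are illustrative, not the benchmarks' actual sizes.

```python
# Half-width of a 95% CI (normal approximation) on a benchmark pass rate,
# at an assumed 50% pass rate, for a few illustrative benchmark sizes.
import math

def ci_half_width(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)

for n in (50, 250, 500, 2000):
    hw = ci_half_width(0.5, n)
    print(f"{n:>4} tasks: measured 50% could plausibly be 50% +/- {100 * hw:.1f} points")
```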
4) LiveCodeBench This is a test for Python competitive programming. Not SWE, and uncontaminated. There are some tasks other than writing code thrown into the mix, such as predicting the output of a function without actually running it. Pass criteria: get the hidden tests to pass.
3) Aider Polyglot It captures a broader set of languages than the SWE-Bench family, even if the problems don't look quite like the ones a software engineer would realistically encounter. Pass criteria: unit tests associated with the Exercism exercises.
(interlude) The astute software engineer will note that passing the unit tests does not mean that the underlying issue is resolved or the feature is built correctly. This is known: Yu et al. found issues of this kind with SWE-Bench.
2) SWE Bench Pro Also like SWE Bench Verified, but more recent and across a wider variety of repositories, both open source and private. Pass criteria: get the unit tests to pass.
What popular SWE/coding benchmarks are measuring. 1) SWE Bench Verified Patches that close GitHub issues. Mostly for Python library repositories, and within those, mostly Django. Pass criteria: get the unit tests to pass.
"Breaking down your task into “right-sized” units of work, which describe just the right amount of detail is perhaps the most powerful lever to improve your context window, and thus the correctness and quality of the generated code." https://t.co/GCkhhlQWlS < great POV from
The more turns your AI takes to get the job done, the more slop you will find in the answer. Thanks to @nilenso for this interesting article
More to come: we'll continue to experiment with good units of work for AI agents over the next few months and share what we find. We are picking good ol' User Stories as a starting point, as they are small units of work that end with a clear business outcome.
Atharva notes the gap between the results on the widely shared @METR_Evals chart on AI's ability to perform long tasks, and coding agent performance in the messy real world, further strengthening the need for a system that manages the units of work that are fed to a coding agent.