Kabir (@plodq)
Followers: 30 · Following: 20 · Media: 12 · Statuses: 42
Although London's rent is the highest in the UK, its rent *inflation* is slower than in 4 other regions.
0
0
1
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!
0
10
41
Every cardinal has a coat of arms and a Latin motto. Pope Leo's motto translates to "In the One, we are one"
1
0
2
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
24
142
658
More details on my blog at https://t.co/9GRSHVvaYP. Thanks to @jyangballin, @KLieret, and @OfirPress for their help! (5/5)
kabirk.com
Introducing a new dataset in the SWE-bench family with 300 curated tasks in 9 programming languages to evaluate LLMs on software engineering tasks.
0
0
8
Observations: • Smaller patches are correlated with higher success • Action mix in successful vs failed runs is nearly identical. Suggests that model capability, not agent design, is the bottleneck • Most tasks < 10 LOC changed, so still far from real‑world complexity (4/5)
1
0
4
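For context on the first observation, here is a rough way such a check could be run over evaluation logs (a sketch only: the runs list and its field names are hypothetical, not the actual analysis code):

from statistics import mean

# Hypothetical per-task records; real values would come from evaluation logs.
runs = [
    {"patch_loc": 4, "resolved": True},
    {"patch_loc": 35, "resolved": False},
    # ... one record per evaluated task
]

resolved_loc = [r["patch_loc"] for r in runs if r["resolved"]]
failed_loc = [r["patch_loc"] for r in runs if not r["resolved"]]
print("mean patch LOC, resolved tasks:", mean(resolved_loc))
print("mean patch LOC, failed tasks:  ", mean(failed_loc))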
Why is Multilingual useful? Python-centric evals let agent frameworks use strategies like AST inspection. On the model side, benchmarks evaluate models only on their ability to write Python, missing out on different languages, tools, and software domains. (3/5)
1
0
6
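To illustrate the kind of Python-only shortcut this refers to (a generic sketch, not any specific framework's code): Python's built-in ast module lets a harness statically index a repo's functions and classes without running anything, a trick that isn't available in the same form for the other 8 languages.

import ast
import pathlib

def index_definitions(repo_root):
    """Map each .py file under repo_root to the functions/classes defined in it."""
    index = {}
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse cleanly
        index[str(path)] = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
    return index

# e.g. index_definitions("path/to/repo") might return
# {"pkg/utils.py": ["load_config", "ConfigError"], ...}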
Multilingual consists of 300 curated tasks from 42 repositories in 9 languages: C, C++, Go, Java, JS, TS, PHP, Ruby, Rust. The code is integrated into the SWE-bench repo and everything follows the same format, so it's drop-in compatible with existing tools. (2/5)
1
0
6
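Because the tasks follow the standard SWE-bench instance format, consuming them should look the same as for Verified. A minimal sketch, assuming the dataset is published on Hugging Face (the dataset id below is a guess; the field names are the usual SWE-bench schema):

from datasets import load_dataset

# Dataset id is an assumption; substitute the actual published id.
tasks = load_dataset("swe-bench/SWE-bench_Multilingual", split="test")

for task in tasks:
    # Standard SWE-bench fields: instance_id, repo, base_commit,
    # problem_statement, patch, test_patch, FAIL_TO_PASS, PASS_TO_PASS, ...
    print(task["instance_id"], task["repo"])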
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!🧵
2
16
66
For a period, some frontier LLMs were being released in 3 sizes, e.g. Gemini Flash, Pro, Ultra; Claude Haiku, Sonnet, Opus. It looks like they're trending towards 2 sizes now - no more Ultra or Opus; o1/o1-mini.
0
0
2
It feels like the 2 main product improvements Cursor brings are 1/ automatically pulling context into chat and 2/ good autocomplete. This feels like it should be easy to implement in other domains. But products like Microsoft's Copilot haven't panned out, so I must be missing something
0
0
0
I like how @AnthropicAI's Claude gracefully degrades when there are a lot of users. First, it switches to concise mode for shorter responses. If there's no capacity at all, Claude will return an error - but it saves your prompt so you don't need to type it in again.
0
0
0
This isn't intended to be a rigorous benchmark, but it does seem like LLMs can't really "see" as well as they "read". Read the full write-up on my blog at https://t.co/Z7yKCZ8dG6!
kabirk.com
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
0
0
2
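One plausible way a benchmark like this scores a predicted box is intersection-over-union against a ground-truth box. A sketch, assuming pixel-coordinate (xmin, ymin, xmax, ymax) boxes and a conventional 0.5 threshold rather than the write-up's exact setup:

def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_hit(pred, truth, threshold=0.5):
    return iou(pred, truth) >= threshold

# e.g. is_hit((120, 80, 200, 160), (130, 90, 210, 170)) -> True (IoU ~0.62)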
Gemini probably scored well because of special post-training sauce. From their docs: “For object detection, the Gemini model has been trained to provide these coordinates as relative widths or heights…” https://t.co/SKzOn8ewtM (8/9)
ai.google.dev
Get started building with Gemini's multimodal capabilities in the Gemini API
1
0
0
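Those relative coordinates have to be mapped back to pixels before they can be drawn or scored. A sketch assuming the [ymin, xmin, ymax, xmax] order normalized to 0-1000 that Google's object-detection docs describe (verify against the current API reference):

def to_pixels(box, image_width, image_height):
    """Convert a 0-1000 normalized [ymin, xmin, ymax, xmax] box to pixel (xmin, ymin, xmax, ymax)."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * image_width),
        int(ymin / 1000 * image_height),
        int(xmax / 1000 * image_width),
        int(ymax / 1000 * image_height),
    )

# e.g. to_pixels([250, 100, 750, 900], 1920, 1080) -> (192, 270, 1728, 810)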
I also asked the LLMs to find objects that didn't exist to see how much they hallucinate. GPT-4o found a non-existent telescope in 96% (!) of images. Gemini Pro was a much better 12%. (7/9)
1
0
0
Size mattered... sometimes. Bigger objects got more accurate bounding boxes, but size didn't affect description quality much - models could describe small objects yet miss large ones. (6/9)
1
0
0