Kabir (@plodq) · Profile
Joined March 2020
Followers: 26 · Following: 11 · Media: 8 · Statuses: 35
Kabir (@plodq) · 23 days
Every cardinal has a coat of arms and a Latin motto. Pope Leo's motto translates to "In the One, we are one."
[image]
1 reply · 0 retweets · 0 likes
Kabir (@plodq) · 2 months
RT @jyangballin: 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synt…
0 replies · 131 retweets · 0 likes
Kabir (@plodq) · 2 months
More details on my blog at . Thanks to @jyangballin, @KLieret, and @OfirPress for their help! (5/5)
0 replies · 0 retweets · 8 likes
Kabir (@plodq) · 2 months
Observations:
• Smaller patches are correlated with higher success.
• The action mix in successful vs. failed runs is nearly identical, suggesting that model capability, not agent design, is the bottleneck.
• Most tasks change < 10 LOC, so we're still far from real-world complexity.
(4/5)
1 reply · 0 retweets · 4 likes
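A quick way to sanity-check the first observation is to correlate patch size with the resolved flag across runs. This is only a sketch with made-up records; the real numbers would come from the evaluation logs.

```python
# Illustrative check of "smaller patches correlate with higher success".
# The run records below are made up; real ones would come from eval logs.
from statistics import correlation  # Pearson's r; with a 0/1 outcome this is point-biserial

runs = [
    {"lines_changed": 3,  "resolved": 1},
    {"lines_changed": 5,  "resolved": 1},
    {"lines_changed": 7,  "resolved": 1},
    {"lines_changed": 12, "resolved": 0},
    {"lines_changed": 25, "resolved": 0},
    {"lines_changed": 40, "resolved": 0},
]

r = correlation([run["lines_changed"] for run in runs],
                [run["resolved"] for run in runs])
print(f"patch size vs. success: r = {r:.2f}")  # negative r = smaller patches resolve more often
```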
Kabir (@plodq) · 2 months
Why is Multilingual useful? Python-centric evals let agent frameworks use strategies like AST inspection. On the model side, benchmarks evaluate models on their ability to write Python, missing out on different languages, tools, and software domains. (3/5)
1 reply · 0 retweets · 6 likes
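To make the AST point concrete, here is a minimal sketch (not taken from any particular agent framework) of the shortcut Python's built-in ast module enables: enumerating a file's function definitions to localize a fix. The other eight languages have no equivalent in the Python standard library, so an agent has to fall back on grep-style search.

```python
# Minimal sketch of Python-only AST inspection an agent could lean on:
# list every function defined in a source file to narrow down where a fix goes.
import ast

source = """
def parse_config(path):
    return path

class Loader:
    def load(self, path):
        return parse_config(path)
"""

for node in ast.walk(ast.parse(source)):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        print(f"function {node.name!r} defined at line {node.lineno}")
```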
Kabir (@plodq) · 2 months
Multilingual consists of 300 curated tasks from 42 repositories in 9 languages: C, C++, Go, Java, JS, TS, PHP, Ruby, Rust. The code is integrated into the SWE-bench repo and everything follows the same format, so it's drop-in compatible with existing tools. (2/5)
1 reply · 0 retweets · 6 likes
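As a sketch of what drop-in compatibility looks like in practice, each instance should expose the standard SWE-bench fields; the Hugging Face dataset ID below is a guess, not a confirmed path.

```python
# Sketch of loading the benchmark with the standard SWE-bench schema.
# The dataset ID and split are assumptions; the field names are the usual
# SWE-bench instance format the tweet says Multilingual reuses.
from datasets import load_dataset

ds = load_dataset("SWE-bench/SWE-bench_Multilingual", split="test")  # ID assumed

inst = ds[0]
print(inst["instance_id"])        # unique task identifier
print(inst["repo"])               # source repository, e.g. a Go or Rust project
print(inst["problem_statement"])  # the issue text given to the agent
print(inst["FAIL_TO_PASS"])       # tests that must flip from failing to passing
```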
Kabir (@plodq) · 2 months
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs. 63% on SB Verified, a 20 pt drop! 🧵
[image]
2 replies · 16 retweets · 67 likes
Kabir (@plodq) · 5 months
For a period, some frontier LLMs were being released in 3 sizes, e.g. Gemini Flash, Pro, Ultra; Claude Haiku, Sonnet, Opus. It looks like they're trending towards 2 sizes now: no more Ultra or Opus; o1/mini.
0 replies · 0 retweets · 2 likes
Kabir (@plodq) · 6 months
It feels like the 2 main product improvements Cursor brings are 1/ automatically pulling context into chat and 2/ good autocomplete. Feels like this should be easy to implement in other domains. But products like Microsoft's Copilot haven't panned out, so I must be missing something.
0 replies · 0 retweets · 0 likes
Kabir (@plodq) · 7 months
I like how @AnthropicAI's Claude gracefully degrades when there are a lot of users. First, it switches to concise mode for shorter responses. If there's no capacity at all, Claude will return an error, but it saves your prompt so you don't need to type it in again.
0 replies · 0 retweets · 0 likes
Kabir (@plodq) · 8 months
This isn't intended to be a rigorous benchmark, but it does seem like LLMs can't really "see" as well as they "read". Read the full write-up on my blog at .
0 replies · 0 retweets · 2 likes
Kabir (@plodq) · 8 months
Gemini probably scored well because of special post-training sauce. From their docs: “For object detection, the Gemini model has been trained to provide these coordinates as relative widths or heights…” (8/9)
1 reply · 0 retweets · 0 likes
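For readers wondering what "relative" coordinates mean downstream: the box has to be rescaled to pixels before it can be drawn or scored. The sketch below assumes the [ymin, xmin, ymax, xmax] layout normalized to a 0-1000 range described in Gemini's docs; treat the exact convention as an assumption.

```python
# Convert a Gemini-style relative bounding box to pixel coordinates,
# assuming [ymin, xmin, ymax, xmax] normalized to a 0-1000 range.
def to_pixels(box, img_width, img_height):
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * img_width),
        int(ymin / 1000 * img_height),
        int(xmax / 1000 * img_width),
        int(ymax / 1000 * img_height),
    )

# e.g. a box roughly in the centre of a 1920x1080 image
print(to_pixels([400, 450, 600, 550], 1920, 1080))  # (864, 432, 1056, 648)
```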
Kabir (@plodq) · 8 months
I also asked the LLMs to find objects that didn't exist to see how much they hallucinate. GPT-4o found a non-existent telescope in 96% (!) of images. Gemini Pro was much better at 12%. (7/9)
[image]
1 reply · 0 retweets · 0 likes
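A rough sketch of how that test can be scored: ask about an object that isn't in the image and count any response that claims to find it. The keyword-based refusal check below is an illustrative stand-in, not necessarily the write-up's actual grading.

```python
# Score the hallucination test: responses that don't clearly say the object
# is absent count as hallucinations. The refusal phrases are illustrative.
def is_hallucination(response: str) -> bool:
    refusals = ("not present", "no telescope", "cannot find", "can't find", "does not appear")
    return not any(phrase in response.lower() for phrase in refusals)

responses = [
    "The telescope sits on a tripod by the window.",      # hallucinated
    "I can't find a telescope anywhere in this image.",   # correct refusal
    "There does not appear to be a telescope here.",      # correct refusal
]

rate = sum(is_hallucination(r) for r in responses) / len(responses)
print(f"hallucination rate: {rate:.0%}")
```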
Kabir (@plodq) · 8 months
Size mattered, sometimes. Bigger objects got more accurate bounding boxes, but size didn't affect description quality much: models could describe small objects yet miss large ones. (6/9)
[image]
1 reply · 0 retweets · 0 likes
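The thread doesn't name the exact box metric, so assume something like IoU; the sketch below scores one made-up prediction and buckets the ground-truth box by its share of the image, which is enough to reproduce a size breakdown like this one.

```python
# Assumed scoring: IoU between predicted and ground-truth boxes (x1, y1, x2, y2),
# bucketed by how much of the image the ground-truth object covers.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def size_bucket(gt, img_area):
    frac = (gt[2] - gt[0]) * (gt[3] - gt[1]) / img_area
    return "small" if frac < 0.02 else "medium" if frac < 0.10 else "large"

# made-up prediction / ground truth on a 1000x1000 image
pred, gt = (100, 120, 220, 260), (110, 130, 230, 250)
print(size_bucket(gt, 1000 * 1000), f"IoU = {iou(pred, gt):.2f}")  # small IoU = 0.73
```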
Kabir (@plodq) · 8 months
Headline results: even the best model I tested (Gemini, ahead of Claude & GPT-4o) only accurately described objects 31% of the time. Including "mostly correct" descriptions bumps this to 73%. (5/9)
[image]
1 reply · 0 retweets · 0 likes
Kabir (@plodq) · 8 months
Wimmelbench is the image analogue: a model is asked to describe a small object (the needle) in a complex scene (the haystack) and draw a bounding box around it. (4/9)
[image]
1 reply · 0 retweets · 0 likes
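To make the task format concrete, a single Wimmelbench-style query might look like the sketch below; the prompt wording and JSON response schema are assumptions for illustration, not the benchmark's actual format.

```python
# Hypothetical shape of one Wimmelbench-style item: prompt the model for a
# description plus a bounding box, then parse the structured reply for grading.
import json

def build_query(object_name: str) -> str:
    return (
        f"Find the {object_name} in the attached image. "
        "Describe it in one or two sentences, then give a bounding box as JSON: "
        '{"description": str, "box": [x1, y1, x2, y2]} in pixel coordinates.'
    )

print(build_query("cartoon tiger"))

# A model reply would then be graded on description quality plus box overlap
# with the hand-labelled ground truth.
reply = '{"description": "A small cartoon tiger next to the fountain.", "box": [412, 580, 455, 630]}'
parsed = json.loads(reply)
print(parsed["description"], parsed["box"])
```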
Kabir (@plodq) · 8 months
Wimmelbench takes inspiration from needle in a haystack. In that benchmark, a random fact (the needle) is inserted into the middle of a large piece of text (the haystack), and an LLM is asked to retrieve the fact. Most LLMs these days score close to 100% on this task. (3/9)
1 reply · 0 retweets · 0 likes
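For reference, constructing a needle-in-a-haystack item takes only a few lines: plant a made-up fact at some depth in filler text and ask for it back. Everything below (the needle, the filler, the depth) is illustrative.

```python
# Build one needle-in-a-haystack item: insert a made-up fact partway into
# filler text, then ask the model to retrieve it.
needle = "The secret ingredient in Marta's stew is star anise."
haystack = "Long filler text goes here, e.g. concatenated essays or articles. " * 2000

depth = 0.5  # plant the needle halfway through the context
pos = int(len(haystack) * depth)
context = haystack[:pos] + needle + " " + haystack[pos:]

prompt = (
    context
    + "\n\nQuestion: What is the secret ingredient in Marta's stew? "
    "Answer in a single phrase."
)
print(f"{len(context):,} characters of context, needle at ~{depth:.0%} depth")
```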
Kabir (@plodq) · 8 months
For example, can an LLM find the cartoon tiger in this picture? (2/9)
[image]
1 reply · 0 retweets · 0 likes
Kabir (@plodq) · 8 months
How well can LLMs see? While I'm convinced that language models can really "read" language, I'm less sure that they "see" images. To measure sight quantitatively, I created a small benchmark that I'm calling Wimmelbench. 🧵 (1/9)
1 reply · 0 retweets · 4 likes