Kabir (@plodq)
Followers: 30 · Following: 20 · Media: 12 · Statuses: 42
Although London's rent is the highest in the UK, its rent *inflation* is slower than in 4 other regions.
0
0
1
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!
0
10
41
Every cardinal has a coat of arms and a Latin motto. Pope Leo's motto translates to "In the One, we are one"
1
0
2
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
24
142
658
More details on my blog at https://t.co/9GRSHVvaYP. Thanks to @jyangballin, @KLieret, and @OfirPress for their help! (5/5)
kabirk.com
Introducing a new dataset in the SWE-bench family with 300 curated tasks in 9 programming languages to evaluate LLMs on software engineering tasks.
0
0
8
Observations: • Smaller patches are correlated with higher success • Action mix in successful vs failed runs is nearly identical. Suggests that model capability, not agent design, is the bottleneck • Most tasks < 10 LOC changed, so still far from real‑world complexity (4/5)
1
0
4
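For context on the first observation, here is a rough way such a check could be run over evaluation logs (a sketch only: the runs list and its field names are hypothetical, not the actual analysis code):

from statistics import mean

# Hypothetical per-task records; real values would come from evaluation logs.
runs = [
    {"patch_loc": 4, "resolved": True},
    {"patch_loc": 35, "resolved": False},
    # ... one record per evaluated task
]

resolved_loc = [r["patch_loc"] for r in runs if r["resolved"]]
failed_loc = [r["patch_loc"] for r in runs if not r["resolved"]]
print("mean patch LOC, resolved tasks:", mean(resolved_loc))
print("mean patch LOC, failed tasks:  ", mean(failed_loc))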
Why is Multilingual useful? Python-centric evals let agent frameworks use strategies like AST inspection. On the model side, benchmarks evaluate models only on their ability to write Python, missing out on different languages, tools, and software domains. (3/5)
1
0
6
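To illustrate the kind of Python-only shortcut this refers to (a generic sketch, not any specific framework's code): Python's built-in ast module lets a harness statically index a repo's functions and classes without running anything, a trick that isn't available in the same form for the other 8 languages.

import ast
import pathlib

def index_definitions(repo_root):
    """Map each .py file under repo_root to the functions/classes defined in it."""
    index = {}
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse cleanly
        index[str(path)] = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
    return index

# e.g. index_definitions("path/to/repo") might return
# {"pkg/utils.py": ["load_config", "ConfigError"], ...}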
Multilingual consists of 300 curated tasks from 42 repositories in 9 languages: C, C++, Go, Java, JS, TS, PHP, Ruby, Rust. The code is integrated into the SWE-bench repo and everything follows the same format, so it's drop-in compatible with existing tools. (2/5)
1
0
6
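Because the tasks follow the standard SWE-bench instance format, consuming them should look the same as for Verified. A minimal sketch, assuming the dataset is published on Hugging Face (the dataset id below is a guess; the field names are the usual SWE-bench schema):

from datasets import load_dataset

# Dataset id is an assumption; substitute the actual published id.
tasks = load_dataset("swe-bench/SWE-bench_Multilingual", split="test")

for task in tasks:
    # Standard SWE-bench fields: instance_id, repo, base_commit,
    # problem_statement, patch, test_patch, FAIL_TO_PASS, PASS_TO_PASS, ...
    print(task["instance_id"], task["repo"])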
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!🧵
2
16
66
For a period, some frontier LLMs were being released in 3 sizes, e.g. Gemini Flash, Pro, Ultra; Claude Haiku, Sonnet, Opus. It looks like they're trending towards 2 sizes now - no more Ultra or Opus; o1/o1-mini.
0
0
2
It feels like the 2 main product improvements Cursor brings are 1/ automatically pulling context into chat and 2/ good autocomplete. This feels like it should be easy to implement in other domains. But products like Microsoft's Copilot haven't panned out, so I must be missing something
0
0
0
I like how @AnthropicAI's Claude gracefully degrades when there are a lot of users. First, it switches to concise mode for shorter responses. If there's no capacity at all, Claude will return an error - but it saves your prompt so you don't need to type it in again.
0
0
0
This isn't intended to be a rigorous benchmark, but it does seem like LLMs can't really "see" as well as they "read". Read the full write-up on my blog at https://t.co/Z7yKCZ8dG6!
kabirk.com
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
0
0
2
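One plausible way a benchmark like this scores a predicted box is intersection-over-union against a ground-truth box. A sketch, assuming pixel-coordinate (xmin, ymin, xmax, ymax) boxes and a conventional 0.5 threshold rather than the write-up's exact setup:

def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_hit(pred, truth, threshold=0.5):
    return iou(pred, truth) >= threshold

# e.g. is_hit((120, 80, 200, 160), (130, 90, 210, 170)) -> True (IoU ~0.62)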
Gemini probably scored well because of special post-training sauce. From their docs: “For object detection, the Gemini model has been trained to provide these coordinates as relative widths or heights…” https://t.co/SKzOn8ewtM (8/9)
ai.google.dev
Get started building with Gemini's multimodal capabilities in the Gemini API
1
0
0
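Those relative coordinates have to be mapped back to pixels before they can be drawn or scored. A sketch assuming the [ymin, xmin, ymax, xmax] order normalized to 0-1000 that Google's object-detection docs describe (verify against the current API reference):

def to_pixels(box, image_width, image_height):
    """Convert a 0-1000 normalized [ymin, xmin, ymax, xmax] box to pixel (xmin, ymin, xmax, ymax)."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * image_width),
        int(ymin / 1000 * image_height),
        int(xmax / 1000 * image_width),
        int(ymax / 1000 * image_height),
    )

# e.g. to_pixels([250, 100, 750, 900], 1920, 1080) -> (192, 270, 1728, 810)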
I also asked the LLMs to find objects that didn't exist to see how much they hallucinate. GPT-4o found a non-existent telescope in 96% (!) of images. Gemini Pro was a much better 12%. (7/9)
1
0
0
Size mattered... sometimes. Bigger objects got more accurate bounding boxes, but size didn't affect description quality much - models could describe small objects yet miss large ones. (6/9)
1
0
0