Kabir Profile
Kabir

@plodq

Followers
30
Following
20
Media
12
Statuses
42

London
Joined March 2020
@lukechampine
Luke Champine
7 days
all this factoring and 2-factoring
0
17
254
@plodq
Kabir
3 days
Always interesting to see which songs I binge listen to
0
0
0
@plodq
Kabir
19 days
The 18 elemental workplace motions, or "therbligs" https://t.co/ToBRh575pq
0
1
1
@plodq
Kabir
4 months
Although London's rent is the highest in the UK, its rent *inflation* is slower than in 4 other regions.
0
0
1
@OfirPress
Ofir Press
5 months
Congrats to the Kimi team on the super strong SWE-bench Verified and SWE-bench Multilingual numbers!!
0
10
41
@plodq
Kabir
6 months
Every cardinal has a coat of arms and a Latin motto. Pope Leo's motto translates to "In the One, we are one"
1
0
2
@jyangballin
John Yang ✈️ NeurIPS
7 months
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
24
142
658
@plodq
Kabir
7 months
Observations:
• Smaller patches are correlated with higher success
• Action mix in successful vs failed runs is nearly identical. Suggests that model capability, not agent design, is the bottleneck
• Most tasks < 10 LOC changed, so still far from real-world complexity (4/5)
1
0
4
@plodq
Kabir
7 months
Why is Multilingual useful? Python-centric evals allow agent frameworks to use strategies like AST inspection. On the model side, benchmarks evaluate models on their ability to write Python, missing out on different languages, tools, and software domains. (3/5)
1
0
6
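To make the point about Python-only tricks concrete, here is a minimal sketch of the kind of AST inspection a Python-centric agent framework could lean on; the sample source and the function name being localized are hypothetical, and nothing comparable comes for free once the repo is Go, Rust, or PHP.

```python
import ast

SOURCE = """
def parse_config(path):
    return {}

async def reload():
    ...
"""  # stand-in for a repo file the agent is inspecting

def list_functions(source: str) -> list[tuple[str, int]]:
    """Return (name, line number) for every function defined in some Python source."""
    tree = ast.parse(source)
    return [
        (node.name, node.lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

# An agent can localize a bug report that mentions parse_config
# without reading or editing the rest of the file.
print(list_functions(SOURCE))  # [('parse_config', 2), ('reload', 5)]
```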
@plodq
Kabir
7 months
Multilingual consists of 300 curated tasks from 42 repositories in 9 languages: C, C++, Go, Java, JS, TS, PHP, Ruby, Rust. Code is integrated into the SWE-bench repo and everything follows the same format, so it's drop-in compatible with existing tools. (2/5)
1
0
6
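A hedged sketch of what "drop-in compatible" could look like in practice, assuming the tasks are published as a Hugging Face dataset with the usual SWE-bench fields; the dataset ID below is an assumption, not something stated in the thread.

```python
from datasets import load_dataset

# Assumed dataset ID; check the SWE-bench repo for the published name.
ds = load_dataset("swe-bench/SWE-bench_Multilingual", split="test")

# Same schema as the other SWE-bench splits, so an existing harness can iterate unchanged.
for task in ds:
    print(task["instance_id"], task["repo"])  # e.g. a Go or Rust repository
    # task["problem_statement"], task["patch"], task["test_patch"], ...
```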
@plodq
Kabir
7 months
Introducing SWE-bench Multilingual: a new eval in the SWE-bench family to test LLM coding abilities in *9* programming languages, fully integrated with SB so it can plug into existing workflows. Claude 3.7 gets 43% on SB Multilingual vs 63% on SB Verified, a 20 pt drop!🧵
2
16
66
@plodq
Kabir
10 months
For a period, some frontier LLMs were being released in 3 sizes, e.g. Gemini Flash, Pro, Ultra; Claude Haiku, Sonnet, Opus. It looks like they're trending towards 2 sizes now - no more Ultra or Opus; o1/mini.
0
0
2
@plodq
Kabir
11 months
It feels like the 2 main product improvements Cursor brings are 1/ automatically pulling context into chat and 2/ good autocomplete. Feels like this should be easy to implement in other domains. But products like Microsoft's Copilot haven't panned out, so I must be missing something.
0
0
0
@plodq
Kabir
1 year
I like how @AnthropicAI's Claude gracefully degrades when there are a lot of users. First, it switches to concise mode for shorter responses. If there's no capacity at all, Claude will return an error - but it saves your prompt so you don't need to type it in again.
0
0
0
@plodq
Kabir
1 year
This isn't intended to be a rigorous benchmark, but it does seem like LLMs can't really "see" as well as they "read". Read the full write-up on my blog at https://t.co/Z7yKCZ8dG6!
kabirk.com
Creating a small benchmark to test how well multimodal language models can find specific objects in complex Where's Waldo-style illustrations.
0
0
2
@plodq
Kabir
1 year
Gemini probably scored well because of special post-training sauce. From their docs: “For object detection, the Gemini model has been trained to provide these coordinates as relative widths or heights…” https://t.co/SKzOn8ewtM (8/9)
ai.google.dev
Get started building with Gemini's multimodal capabilities in the Gemini API
1
0
0
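As an illustration of what "relative" coordinates mean here: a small sketch converting a box expressed as fractions of the image size into pixel coordinates. The (xmin, ymin, xmax, ymax) ordering and the [0, 1] scale are assumptions for the example; the Gemini docs define the actual format.

```python
def relative_to_pixels(box, img_w, img_h):
    """Convert a bounding box given as fractions of image size into pixel coordinates.

    Assumes (xmin, ymin, xmax, ymax) in [0, 1]; treat this as an illustrative
    sketch, not the exact convention the model uses.
    """
    xmin, ymin, xmax, ymax = box
    return (
        round(xmin * img_w),
        round(ymin * img_h),
        round(xmax * img_w),
        round(ymax * img_h),
    )

# e.g. a reported box covering the middle of a 1024x768 illustration
print(relative_to_pixels((0.25, 0.25, 0.75, 0.75), 1024, 768))  # (256, 192, 768, 576)
```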
@plodq
Kabir
1 year
I also asked the LLMs to find objects that didn't exist to see how much they hallucinate. GPT-4o found a non-existent telescope in 96% (!) of images. Gemini Pro was a much better 12%. (7/9)
1
0
0
@plodq
Kabir
1 year
Size mattered... sometimes. Bigger objects got more accurate bounding boxes, but size didn't affect description quality much - models could describe small objects yet miss large ones. (6/9)
1
0
0