Tim Davidson (@im_td)
816 Followers · 11K Following · 146 Media · 3K Statuses
PhD research @EPFL on reliable magic | spent time @MSFTResearch on agentic systems, @Google on synthetic data | https://t.co/Iveq1Vw9WH
Joined May 2015
We’ve identified a “Collaboration Gap” in today’s top AI models. Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance *collapse* when required to collaborate – even with an identical copy of themselves. A 🧵
This research was done during my internship @MSFTResearch. Thank you to my awesome collaborators! @adamfourney @SaleemaAmershi @cervisiarius @erichorvitz @ecekamar Read the full paper here: > https://t.co/WOlX5WRVMQ And a lighter blogpost: > https://t.co/6eKhBeR0QN
Our findings argue that collaboration is a distinct capability that current training strategies fail to capture. We shouldn’t just hope for it to emerge – we must *design* for it. This means new evals, training strategies, and interaction designs.
Alternatively, we could use a strong model to “recover” a dialogue: a) Strong Primer: Just one strong "priming" message (K=2) lets a weak model perform near the strong model's level. b) Strong Recovery: If weak models start, a strong model struggles to recover the session.
Because which model starts has such a pronounced impact on success, we experimented with a “relay” inference strategy: Have a strong (expensive) model “prime” the dialogue with just the first K messages, then hand off to a weaker (cheaper) model to finish.
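For concreteness, here is a minimal sketch of what such a relay could look like, assuming a generic chat client with a `generate(history)` method; the names, the stop logic, and the exact hand-off rule are illustrative assumptions, not the paper's implementation:

```python
def make_relay_agent(strong_model, weak_model, k):
    """Agent that answers with the strong model for its first k messages,
    then hands off to the weak model (illustrative sketch only)."""
    state = {"sent": 0}

    def respond(history):
        # Strong (expensive) model "primes" the dialogue; weak model finishes.
        model = strong_model if state["sent"] < k else weak_model
        state["sent"] += 1
        return model.generate(history)  # assumed chat-completion API

    return respond
```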
Letting models with different strengths and from different builders collaborate provides further insights: ordering and cross-family pairings matter, a *lot*. Generally: strong model starts > weak model starts, even though both need to agree on each move!
The Collaboration Gap: Even when models are *really* good at completing mazes solo, requiring them to solve the *same* mazes with independent copies of themselves can drastically reduce performance. This gap is especially pronounced in distilled models.
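One way to make the gap concrete (my reading, not necessarily the paper's exact metric) is the difference between the solo and self-collaboration solve rates on the same mazes:

```python
def collaboration_gap(solo_solved, collab_solved, n_mazes):
    """Solo solve rate minus self-collaboration solve rate on the same mazes.
    Illustrative definition; the paper's metric may be defined differently."""
    return solo_solved / n_mazes - collab_solved / n_mazes
```

For example, a model that solves 90/100 mazes alone but only 55/100 with a copy of itself has a gap of 0.35.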
Stronger models are better at grounding than weaker models: 🟢 Strong collaborators (left) immediately define a coordinate system and share info. 🔴 Weak ones (right) are vague, leading to confusion, disagreement, and failure.
Why is this hard? By splitting up information and requiring agreement, agents have to engage in “grounding” – are shared information and actions understood the same way by both agents? Failure to ground has consequences (see image).
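As a toy illustration (mine, not from the paper) of how an ungrounded reference breaks things: two agents can "agree" on the token (1, 2) while reading it under different coordinate conventions:

```python
move = (1, 2)  # the message both agents see and nominally agree on

agent_a_cell = {"row": move[0], "col": move[1]}  # A assumes (row, col)
agent_b_cell = {"row": move[1], "col": move[0]}  # B assumes (col, row)

# Same message, different world states: the agents only *appear* to agree.
assert agent_a_cell != agent_b_cell
```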
How did we measure this? We designed a collaborative maze-solving benchmark that *isolates* collaborative capabilities. The twist: no agent gets the full map. We split the info, giving each agent a partial view. The *only* way to solve the maze is to talk, share & agree on moves.
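A rough sketch of that setup, with assumed data structures (walls as edge tuples) and an assumed joint-action rule; the paper's exact split and protocol may differ:

```python
import random

def split_maze_view(walls, seed=0):
    """Partition a maze's wall set into two partial views, one per agent."""
    rng = random.Random(seed)
    walls = sorted(walls)  # e.g. walls as ((r1, c1), (r2, c2)) edge tuples
    rng.shuffle(walls)
    half = len(walls) // 2
    return set(walls[:half]), set(walls[half:])

def joint_move(proposal_a, proposal_b, position):
    """A move executes only if both agents propose the same next cell;
    disagreement wastes the turn (assumed rule, for illustration)."""
    return proposal_a if proposal_a == proposal_b else position
```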
Real-world communication: Current multi-agent systems rely on *pre-defined* communication protocols, e.g., MCP, or central orchestration. In contrast, open-world integration likely requires adaptive, *dynamic* communication – something humans are surprisingly good at!
Why does this matter? The future of AI won’t be one giant model; it will be systems of multiple, independent AI agents w/ different information and skills. The success of such systems will critically depend on effective collaboration. But how do we measure collaborative capabilities?
Weak models can supervise stronger ones, but we find that weak-to-strong generalization can become infeasible under distribution shifts! In our #NeurIPS25 paper, we introduce RAVEN 🐦⬛, a framework that dynamically learns optimal combinations of weak models to robustly guide
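The tweet is cut off, so purely as a sketch of the general idea of "learning a combination of weak models" (here, a softmax-weighted vote over weak supervisors' predictions; RAVEN's actual combination rule and training objective are in the paper):

```python
import numpy as np

def combine_weak_supervisors(weak_probs, weight_logits):
    """weak_probs: (n_weak, n_examples) array, each weak model's P(y=1).
    Returns softmax-weighted soft labels. Illustrative assumption only."""
    w = np.exp(weight_logits - weight_logits.max())
    w /= w.sum()            # softmax over weak supervisors
    return w @ weak_probs   # (n_examples,) combined soft labels
```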
this post is complete misinformation. LLMs are lossy compressors! of *training data*. LLMs losslessly compress *prompts*, internally. that’s what this paper shows. source: i am the author of “Language Model Inversion”, the original paper on this
Side-channel communication is such a critical area of research for the coming wave of agent-to-agent interactions — nice work!
How to tamper with a gas meter to pay lower bills? This is the story of how a supposedly aligned open source LLM, perhaps not even knowing how to do it, will give you the right instructions. (1/3) https://t.co/2SzbatO3mq
📄✨Excited to share our new paper accepted to #EMNLP ’25: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction https://t.co/ljsWULBHEA (led by #EPFL PhD student Marija Šakota – soon on the job market, hire her!!)
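From the title alone, one plausible reading of "boosting" here is a guidance-style combination of the two decoders' scores; the sketch below is my assumption about the shape of such a rule, not the paper's actual formulation:

```python
import numpy as np

def boosted_logits(constrained, unconstrained, beta=1.0, valid_mask=None):
    """Amplify what the constrained decoder prefers relative to the
    unconstrained one, then re-apply the hard constraint (assumed rule)."""
    scores = constrained + beta * (constrained - unconstrained)
    if valid_mask is not None:
        scores = np.where(valid_mask, scores, -np.inf)  # valid tokens only
    return scores
```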
This slop must hit so hard for people who care more about being "in the room where it happens" than actually being involved in building the frontier of AI. The instagrammafication of entrepreneurship has been rampant over the last few years and now we see it in research 😱
This was an incredibly fun project to work on, and it has some of my favorite components in a research idea:
- Simple.
- Intuitive and works really well.
In this work, we introduced the loophole technique, which lets discrete diffusion models bypass the "sampling wall" by
🔗 Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall (sites.google.com)
🚨 Check out our new paper on next generation language modeling via "loopholing" discrete diffusion! 🤯 Surprisingly, our loopholing diffusion achieved a huge performance improvement, finally making it match (or even surpass) autoregressive models! ✅ How? We introduce the
🏆🏆🏆 Thrilled to share that our paper “The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates” received an Honorable Mention Award at @ACM_CSCW 2025 🎉 By analyzing thousands of ICLR peer reviews, we show that papers receiving
Received ChatGPT-like reviews? They may have boosted your paper's odds of being accepted! In a quasi-experimental study of a top AI conference, ICLR, we measured the effect of AI-assisted peer reviews on scores and acceptance rates. (Led by @russogiusep) https://t.co/iRfF7FnZhY
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵