Tim Davidson (@im_td)
816 Followers · 11K Following · 146 Media · 3K Statuses
PhD research @EPFL on reliable magic | spent time @MSFTResearch on agentic systems, @Google on synthetic data | https://t.co/Iveq1Vw9WH
Joined May 2015
We’ve identified a “Collaboration Gap” in today’s top AI models. Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance *collapse* when required to collaborate – even with an identical copy of themselves. A 🧵
This research was done during my internship @MSFTResearch. Thank you to my awesome collaborators! @adamfourney @SaleemaAmershi @cervisiarius @erichorvitz @ecekamar Read the full paper here: > https://t.co/WOlX5WRVMQ And a lighter blogpost: > https://t.co/6eKhBeR0QN
Our findings argue that collaboration is a distinct capability that current training strategies fail to capture. We shouldn’t just hope for it to emerge – we must *design* for it. This means new evals, training strategies, and interaction designs.
Alternatively, we could use a strong model to “recover” a dialogue: a) Strong Primer: Just one strong "priming" message (K=2) lets a weak model perform near the strong model's level. b) Strong Recovery: If weak models start, a strong model struggles to recover the session.
Because which model starts has such a pronounced impact on success, we experimented with a “relay” inference strategy: Have a strong (expensive) model “prime” the dialogue with just the first K messages, then hand off to a weaker (cheaper) model to finish.
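For concreteness, here is a minimal sketch of what such a relay could look like, assuming a generic chat client with a `generate(history)` method; the names, the stop logic, and the exact hand-off rule are illustrative assumptions, not the paper's implementation:

```python
def make_relay_agent(strong_model, weak_model, k):
    """Agent that answers with the strong model for its first k messages,
    then hands off to the weak model (illustrative sketch only)."""
    state = {"sent": 0}

    def respond(history):
        # Strong (expensive) model "primes" the dialogue; weak model finishes.
        model = strong_model if state["sent"] < k else weak_model
        state["sent"] += 1
        return model.generate(history)  # assumed chat-completion API

    return respond
```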
Letting models with different strengths and from different builders collaborate provides further insights: ordering and cross-family pairings matter, a *lot*. Generally: strong model starts > weak model starts, even though both need to agree on each move!
The Collaboration Gap: Even when models are *really* good at completing mazes solo, requiring them to solve the *same* mazes with independent copies of themselves can drastically reduce performance. This gap is especially pronounced in distilled models.
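One way to make the gap concrete (my reading, not necessarily the paper's exact metric) is the difference between the solo and self-collaboration solve rates on the same mazes:

```python
def collaboration_gap(solo_solved, collab_solved, n_mazes):
    """Solo solve rate minus self-collaboration solve rate on the same mazes.
    Illustrative definition; the paper's metric may be defined differently."""
    return solo_solved / n_mazes - collab_solved / n_mazes
```

For example, a model that solves 90/100 mazes alone but only 55/100 with a copy of itself has a gap of 0.35.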
Stronger models are better at grounding than weaker models: 🟢 Strong collaborators (left) immediately define a coordinate system and share info. 🔴 Weak ones (right) are vague, leading to confusion, disagreement, and failure.
Why is this hard? By splitting up information and requiring agreement, agents have to engage in “grounding” – are shared information and actions understood the same way by both agents? Failure to ground has consequences (see image).
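As a toy illustration (mine, not from the paper) of how an ungrounded reference breaks things: two agents can "agree" on the token (1, 2) while reading it under different coordinate conventions:

```python
move = (1, 2)  # the message both agents see and nominally agree on

agent_a_cell = {"row": move[0], "col": move[1]}  # A assumes (row, col)
agent_b_cell = {"row": move[1], "col": move[0]}  # B assumes (col, row)

# Same message, different world states: the agents only *appear* to agree.
assert agent_a_cell != agent_b_cell
```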
How did we measure this? We designed a collaborative maze-solving benchmark that *isolates* collaborative capabilities. The twist: no agent gets the full map. We split the info, giving each agent a partial view. The *only* way to solve the maze is to talk, share & agree on moves.
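A rough sketch of that setup, with assumed data structures (walls as edge tuples) and an assumed joint-action rule; the paper's exact split and protocol may differ:

```python
import random

def split_maze_view(walls, seed=0):
    """Partition a maze's wall set into two partial views, one per agent."""
    rng = random.Random(seed)
    walls = sorted(walls)  # e.g. walls as ((r1, c1), (r2, c2)) edge tuples
    rng.shuffle(walls)
    half = len(walls) // 2
    return set(walls[:half]), set(walls[half:])

def joint_move(proposal_a, proposal_b, position):
    """A move executes only if both agents propose the same next cell;
    disagreement wastes the turn (assumed rule, for illustration)."""
    return proposal_a if proposal_a == proposal_b else position
```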
Real-world communication: Current multi-agent systems rely on *pre-defined* communication protocols, e.g., MCP, or central orchestration. In contrast, open-world integration likely requires adaptive, *dynamic* communication – something humans are surprisingly good at!
Why does this matter? The future of AI won’t be one giant model; it will be systems of multiple, independent AI agents w/ different information and skills. The success of such systems will critically depend on effective collaboration. But how do we measure collaborative capabilities?
Weak models can supervise stronger ones, but we find that weak-to-strong generalization can become infeasible under distribution shifts! In our #NeurIPS25 paper, we introduce RAVEN 🐦⬛, a framework that dynamically learns optimal combinations of weak models to robustly guide
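The tweet is cut off, so purely as a sketch of the general idea of "learning a combination of weak models" (here, a softmax-weighted vote over weak supervisors' predictions; RAVEN's actual combination rule and training objective are in the paper):

```python
import numpy as np

def combine_weak_supervisors(weak_probs, weight_logits):
    """weak_probs: (n_weak, n_examples) array, each weak model's P(y=1).
    Returns softmax-weighted soft labels. Illustrative assumption only."""
    w = np.exp(weight_logits - weight_logits.max())
    w /= w.sum()            # softmax over weak supervisors
    return w @ weak_probs   # (n_examples,) combined soft labels
```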
this post is complete misinformation. LLMs are lossy compressors! of *training data*. LLMs losslessly compress *prompts*, internally. that’s what this paper shows. source: i am the author of “Language Model Inversion”, the original paper on this
Side-channel communication is such a critical area of research for the coming wave of agent-to-agent interactions — nice work!
How to tamper with a gas meter to pay lower bills? This is the story of how a supposedly aligned open source LLM, perhaps not even knowing how to do it, will give you the right instructions. (1/3) https://t.co/2SzbatO3mq
📄✨Excited to share our new paper accepted to #EMNLP ’25: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction https://t.co/ljsWULBHEA (led by #EPFL PhD student Marija Šakota – soon on the job market, hire her!!)
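From the title alone, one plausible reading of "boosting" here is a guidance-style combination of the two decoders' scores; the sketch below is my assumption about the shape of such a rule, not the paper's actual formulation:

```python
import numpy as np

def boosted_logits(constrained, unconstrained, beta=1.0, valid_mask=None):
    """Amplify what the constrained decoder prefers relative to the
    unconstrained one, then re-apply the hard constraint (assumed rule)."""
    scores = constrained + beta * (constrained - unconstrained)
    if valid_mask is not None:
        scores = np.where(valid_mask, scores, -np.inf)  # valid tokens only
    return scores
```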
This slop must hit so hard for people who care more about being "in the room where it happens" than actually being involved in building the frontier of AI. The instagrammafication of entrepreneurship has been rampant over the last few years and now we see it in research 😱
This was an incredibly fun project to work on, and it has some of my favorite components in a research idea:
- Simple.
- Intuitive and works really well.
In this work, we introduced the loophole technique, which lets discrete diffusion models bypass the "sampling wall" by
🔗 Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall (sites.google.com)
🚨 Check out our new paper on next generation language modeling via "loopholing" discrete diffusion! 🤯 Surprisingly, our loopholing diffusion achieved a huge performance improvement, finally making it match (or even surpass) autoregressive models! ✅ How? We introduce the
🏆🏆🏆 Thrilled to share that our paper “The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates” received an Honorable Mention Award at @ACM_CSCW 2025 🎉 By analyzing thousands of ICLR peer reviews, we show that papers receiving
Received ChatGPT-like reviews? They may have boosted your paper's odds of being accepted! In a quasi-experimental study of a top AI conference, ICLR, we measured the effect of AI-assisted peer reviews on scores and acceptance rates. (Led by @russogiusep) https://t.co/iRfF7FnZhY
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵