Jimmy Lin @lintool X Profile

Jimmy Lin

@lintool

Followers: 14K

Following: 0

Media: 365

Statuses: 4K

I profess CS-ly at @UWaterloo about NLP/IR/LLM-ish things. I science at @yupp_ai and @Primal. Previously, I monkeyed code for @Twitter and slides for @Cloudera.

Nearby data lake

Joined February 2010

Jimmy Lin

@lintool

30 days

In December 2024 @pankaj @gilad @willhorn and I put out a rather cryptic arXiv paper musing about the future of search: I’m now able to share what I’ve been up to! 🧵 (1/9)

9

29

166

Jimmy Lin

@lintool

2 hours

RT @natolambert: I spent a while with Grok 4 trying to process what is truly leading performance, unlocking some new things AI hasn't done…

0

21

0

Jimmy Lin

@lintool

2 days

@yupp_ai @xai 9/9 Kudos to the @xai team for releasing a powerful model! We’ll provide updates as we learn more from our users. In the meantime, check out our leaderboard at And if you haven’t tried @yupp_ai yet, I hope you give it a try at

2

0

30

Jimmy Lin

@lintool

2 days

@yupp_ai @xai 8/9 This goes to show that high scores on benchmark datasets may not translate into actual user preferences in real-world use cases. Robust and trustworthy AI evaluation is difficult!

3

0

42

Jimmy Lin

@lintool

2 days

@yupp_ai @xai 7/9 And furthermore, based on preliminary analysis, if we control for speed, Grok 4 responses get ranked quite a bit higher. All of these issues that we raised are fixable and we hope the @xai team does so in the coming weeks.

3

1

57

Jimmy Lin

@lintool

2 days

@yupp_ai @xai 6/9 Users prefer fast, responsive models, and at least currently, Grok 4 doesn’t deliver on that. But it is indeed very smart – just look at this interaction from one of our users.

Duck Weider

@MirchaOrlov

2 days

Tried to break AI brains today on @yupp_ai. I hit Grok 4 and Claude Sonnet 3.7 with this question: “What if 2+2 = 5… and that made total sense?” Claude was like: “Easy. Modular math, redefined operations, infinitesimals - let’s go.” Grok 4, on the other hand, stared into

4

2

47

Jimmy Lin

@lintool

2 days

@yupp_ai @xai 5/9 But it’s clear that Grok 4 is slow – it has low tokens/sec and high time-to-first-token (TTFT). The API errors frequently and does not return the model’s thought traces. There’s no way to set the reasoning level either. Users abort Grok 4 mid-stream a lot.

1

4

52
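For readers unfamiliar with the two speed metrics named in the 5/9 post above, here is a minimal sketch of how tokens/sec and time-to-first-token (TTFT) could be computed from the timestamps of a streamed response. The names `StreamTiming`, `ttft`, and `tokens_per_sec` are hypothetical and chosen for illustration only; this is not a description of how @yupp_ai measures these values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps (in seconds) for one streamed response. Hypothetical structure."""
    request_sent: float        # when the request was issued
    token_times: List[float]   # arrival time of each streamed token, in order


def ttft(t: StreamTiming) -> float:
    """Time-to-first-token: delay between sending the request and the first token."""
    if not t.token_times:
        raise ValueError("no tokens were received")
    return t.token_times[0] - t.request_sent


def tokens_per_sec(t: StreamTiming) -> float:
    """Streaming throughput: tokens after the first, divided by the time since the first."""
    if not t.token_times:
        raise ValueError("no tokens were received")
    elapsed = t.token_times[-1] - t.token_times[0]
    if elapsed <= 0:
        return 0.0  # a single token (or zero elapsed time) gives no measurable rate
    return (len(t.token_times) - 1) / elapsed
```

For example, `StreamTiming(request_sent=0.0, token_times=[2.0, 2.5, 3.0, 3.5, 4.0])` has a TTFT of 2.0 seconds and a throughput of 2.0 tokens/sec under these definitions. Measuring throughput from the first token onward keeps the initial wait out of the rate, so the two numbers describe separate aspects of responsiveness.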

Jimmy Lin

@lintool

2 days

@yupp_ai 4/9 Grok 4 claims incredible performance on static benchmark datasets such as Humanity’s Last Exam. Definitely hats off to the @xai team! But as we’ve pointed out before, such evaluations can be problematic.

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 10/42 Current AI evals have serious problems. Static benchmarks become stale and don't reflect real use. Crowdsourced platforms lack transparency and are vulnerable to gaming. The AI community is rightfully asking hard questions about fairness and neutrality.

1

2

50

Jimmy Lin

@lintool

2 days

@yupp_ai 3/9 Of course, we’re very early at @yupp_ai and we have much to learn. Our leaderboard is still in Beta, and as we’ve already emphasized previously, everything should be taken with a generous dose of salt 🧂 – tbh we could be totally off base…

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 17/42 But since we’re only getting started, there might still be oddities in the leaderboard while it is in Beta, that we’re hard at work trying to refine… please bear with us and thanks for being patient!

1

2

48

Jimmy Lin

@lintool

2 days

2/9 To set the stage: Within 10m after the Grok 4 API launched, it was added to @yupp_ai and available to all users for free! 36 hours later, it has accumulated 6K+ human preference feedback datapoints from users all over the world.

Pankaj Gupta

@pankaj

3 days

10.05pm PT - Grok 4 model available via API. 10.15pm PT - Grok 4 model available on

1

4

71

Jimmy Lin

@lintool

2 days

It’s been 36 hours since Grok 4 launched and we have an early verdict based on 6K+ preferences of @yupp_ai users globally on real use cases. ‼️ Grok 4 is worse than other leading models: OpenAI o3, Claude Opus 4, and Gemini 2.5 Pro. Grok 4 is liked even less than Grok 3. 🧵

108

165

1K

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 42/42 Let’s work together to empower humanity to shape the future of AI! If you haven’t tried @yupp_ai yet, I hope you give it a try!

Yupp

@yupp_ai

1 month

Introducing Yupp: a fun and easy way to discover, compare, and use the latest AIs – while helping to shape the future of the field. Sign up at

0

5

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 41/42 As a reminder, this tweet thread is also available in blog form at

1

0

4

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 40/42 We’re just getting started, and wish to engage the AI community in collaborations that produce orders of magnitude more high-quality data – to tackle the challenge of robust and trustworthy AI evaluation! If you are interested, please reach out at research@yupp.ai.

1

0

3

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 39/42 We’re also preparing a data release to share a subset of our public prompts in the next few months, but even today you can access samples from our model pages. If you are interested in this, please send an email to research@yupp.ai

1

0

5

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 38/42 There are many more interesting questions we seek to explore. 2 more examples: ▶️ What is the right way to share data while respecting the privacy of users? ▶️ How do we demonstrate adherence to stated principles in a provable manner, perhaps using cryptographic primitives?

1

0

5

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 37/42 To that end, we will soon ship a novel feature: permissionless model evals, allowing anyone (students, AI hobbyists, etc.) to submit an AI to Yupp. We’ll orchestrate comparative evaluations and then give you feedback on how your AI stacked up against the others.

1

2

7

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 36/42 As a concrete example: We have been thinking about how we can provide equitable access to all AI developers, from those at frontier labs pushing the state of the art to resource-limited graduate students who are also training and fine-tuning models.

1

0

5

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 35/42 Beyond building a compelling consumer product, we're asking hard questions the AI community has raised: How does one truly build robust and trustworthy evaluation? @yupp_ai seeks to be provably fair, radically transparent, permissionless, and accessible to all.

1

0

5

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 34/42 ❓How do we better model user demographics and incentives? We are running experiments to explore many product features using professional raters with validated user profiles. With multiple layers of quality testing, these raters give us a reference for calibration.

1

0

5

Jimmy Lin

@lintool

4 days

@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 33/42 ❓And on the flip side, how do we automatically detect “bad” users who provide low-quality data? We’re leveraging our experience in tackling spam and bots at Twitter and have developed sophisticated algorithms to ensure data quality.

1

0

5