Jimmy Lin

@lintool

Followers
14K
Following
0
Media
365
Statuses
4K

I profess CS-ly at @UWaterloo about NLP/IR/LLM-ish things. I science at @yupp_ai and @Primal. Previously, I monkeyed code for @Twitter and slides for @Cloudera.

Nearby data lake
Joined February 2010
@lintool
Jimmy Lin
30 days
In December 2024 @pankaj @gilad @willhorn and I put out a rather cryptic arXiv paper musing about the future of search: I’m now able to share what I’ve been up to! 🧵 (1/9)
9
29
166
@lintool
Jimmy Lin
2 hours
RT @natolambert: I spent a while with Grok 4 trying to process what is truly leading performance, unlocking some new things AI hasn't done…
0
21
0
@lintool
Jimmy Lin
2 days
@yupp_ai @xai 9/9 Kudos to the @xai team for releasing a powerful model! We’ll provide updates as we learn more from our users. In the meantime, check out our leaderboard at And if you haven’t tried @yupp_ai yet, I hope you give it a try at
2
0
30
@lintool
Jimmy Lin
2 days
@yupp_ai @xai 8/9 This goes to show that high scores on benchmark datasets may not translate into actual user preferences in real-world use cases. Robust and trustworthy AI evaluation is difficult!
3
0
42
@lintool
Jimmy Lin
2 days
@yupp_ai @xai 7/9 And furthermore, based on preliminary analysis, if we control for speed, Grok 4 responses get ranked quite a bit higher. All of these issues that we raised are fixable and we hope the @xai team does so in the coming weeks.
3
1
57
@lintool
Jimmy Lin
2 days
@yupp_ai @xai 6/9 Users prefer fast, responsive models, and at least currently, Grok 4 doesn’t deliver on that. But it is indeed very smart – just look at this interaction from one of our users.
@MirchaOrlov
Duck Weider
2 days
Tried to break AI brains today on @yupp_ai. I hit Grok 4 and Claude Sonnet 3.7 with this question: “What if 2+2 = 5… and that made total sense?” Claude was like: “Easy. Modular math, redefined operations, infinitesimals - let’s go.” Grok 4, on the other hand, stared into
4
2
47
@lintool
Jimmy Lin
2 days
@yupp_ai @xai 5/9 But it’s clear that Grok 4 is slow – it has low tokens/sec and high time-to-first-token (TTFT). The API errors frequently and does not return the model’s thought traces. There’s no way to set the reasoning level either. Users abort Grok 4 mid-stream a lot.
1
4
52
@lintool
Jimmy Lin
2 days
@yupp_ai 4/9 Grok 4 claims incredible performance on static benchmark datasets such as Humanity’s Last Exam. Definitely hats off to the @xai team! But as we’ve pointed out before, such evaluations can be problematic.
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 10/42 Current AI evals have serious problems. Static benchmarks become stale and don't reflect real use. Crowdsourced platforms lack transparency and are vulnerable to gaming. The AI community is rightfully asking hard questions about fairness and neutrality.
1
2
50
@lintool
Jimmy Lin
2 days
@yupp_ai 3/9 Of course, we’re very early at @yupp_ai and we have much to learn. Our leaderboard is still in Beta, and as we’ve already emphasized previously, everything should be taken with a generous dose of salt 🧂 – tbh we could be totally off base…
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 17/42 But since we’re only getting started, there might still be oddities in the leaderboard while it is in Beta, that we’re hard at work trying to refine… please bear with us and thanks for being patient!
1
2
48
@lintool
Jimmy Lin
2 days
2/9 To set the stage: Within 10m after the Grok 4 API launched, it was added to @yupp_ai and available to all users for free! 36 hours later, it has accumulated 6K+ human preference feedback datapoints from users all over the world.
@pankaj
Pankaj Gupta
3 days
10.05pm PT - Grok 4 model available via API. 10.15pm PT - Grok 4 model available on
1
4
71
@lintool
Jimmy Lin
2 days
It’s been 36 hours since Grok 4 launched and we have an early verdict based on 6K+ preferences of @yupp_ai users globally on real use cases. ‼️ Grok 4 is worse than other leading models: OpenAI o3, Claude Opus 4, and Gemini 2.5 Pro. Grok 4 is liked even less than Grok 3. 🧵
108
165
1K
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 42/42 Let’s work together to empower humanity to shape the future of AI! If you haven’t tried @yupp_ai yet, I hope you give it a try!
@yupp_ai
Yupp
1 month
Introducing Yupp: a fun and easy way to discover, compare, and use the latest AIs – while helping to shape the future of the field. Sign up at
0
0
5
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 41/42 As a reminder, this tweet thread is also available in blog form at
1
0
4
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 40/42 We’re just getting started, and wish to engage the AI community in collaborations that produce orders of magnitude more high-quality data – to tackle the challenge of robust and trustworthy AI evaluation! If you are interested, please reach out at research@yupp.ai.
1
0
3
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 39/42 We’re also preparing a data release to share a subset of our public prompts in the next few months, but even today you can access samples from our model pages. If you are interested in this, please send an email to research@yupp.ai
1
0
5
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 38/42 There are many more interesting questions we seek to explore. 2 more examples: ▶️ What is the right way to share data while respecting the privacy of users? ▶️ How do we demonstrate adherence to stated principles in a provable manner, perhaps using cryptographic primitives?
1
0
5
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 37/42 To that end, we will soon ship a novel feature: permissionless model evals, allowing anyone (students, AI hobbyists, etc.) to submit an AI to Yupp. We’ll orchestrate comparative evaluations and then give you feedback on how your AI stacked up against the others.
1
2
7
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 36/42 As a concrete example: We have been thinking about how we can provide equitable access to all AI developers, from those at frontier labs pushing the state of the art to resource-limited graduate students who are also training and fine-tuning models.
1
0
5
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 35/42 Beyond building a compelling consumer product, we're asking hard questions the AI community has raised: How does one truly build robust and trustworthy evaluation? @yupp_ai seeks to be provably fair, radically transparent, permissionless, and accessible to all.
1
0
5
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 34/42 ❓How do we better model user demographics and incentives? We are running experiments to explore many product features using professional raters with validated user profiles. With multiple layers of quality testing, these raters give us a reference for calibration.
1
0
5
@lintool
Jimmy Lin
4 days
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 33/42 ❓And on the flip side, how do we automatically detect “bad” users who provide low-quality data? We’re leveraging our experience in tackling spam and bots at Twitter and have developed sophisticated algorithms to ensure data quality.
1
0
5