
Jimmy Lin
@lintool
Followers: 14K
Following: 0
Media: 365
Statuses: 4K
I profess CS-ly at @UWaterloo about NLP/IR/LLM-ish things. I science at @yupp_ai and @Primal. Previously, I monkeyed code for @Twitter and slides for @Cloudera.
Nearby data lake
Joined February 2010
RT @natolambert: I spent a while with Grok 4 trying to process what is truly leading performance, unlocking some new things AI hasn't done…
0
21
0
@yupp_ai @xai 6/9 Users prefer fast, responsive models, and at least currently, Grok 4 doesn’t deliver on that. But it is indeed very smart – just look at this interaction from one of our users.
Tried to break AI brains today on @yupp_ai. I hit Grok 4 and Claude Sonnet 3.7 with this question: “What if 2+2 = 5… and that made total sense?” Claude was like: “Easy. Modular math, redefined operations, infinitesimals - let’s go.” Grok 4, on the other hand, stared into
4
2
47
@yupp_ai 4/9 Grok 4 claims incredible performance on static benchmark datasets such as Humanity’s Last Exam. Definitely hats off to the @xai team! But as we’ve pointed out before, such evaluations can be problematic.
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 10/42 Current AI evals have serious problems. Static benchmarks become stale and don't reflect real use. Crowdsourced platforms lack transparency and are vulnerable to gaming. The AI community is rightfully asking hard questions about fairness and neutrality.
1
2
50
@yupp_ai 3/9 Of course, we’re very early at @yupp_ai and we have much to learn. Our leaderboard is still in Beta, and as we’ve already emphasized previously, everything should be taken with a generous dose of salt 🧂 – tbh we could be totally off base…
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 17/42 But since we’re only getting started, there might still be oddities in the leaderboard while it is in Beta that we’re hard at work trying to refine… please bear with us and thanks for being patient!
1
2
48
2/9 To set the stage: Within 10m after the Grok 4 API launched, it was added to @yupp_ai and available to all users for free! 36 hours later, it has accumulated 6K+ human preference feedback datapoints from users all over the world.
1
4
71
It’s been 36 hours since Grok 4 launched and we have an early verdict based on 6K+ preferences of @yupp_ai users globally on real use cases. ‼️ Grok 4 is worse than other leading models: OpenAI o3, Claude Opus 4, and Gemini 2.5 Pro. Grok 4 is liked even less than Grok 3. 🧵
108
165
1K
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 42/42 Let’s work together to empower humanity to shape the future of AI! If you haven’t tried @yupp_ai yet, I hope you give it a try!
Introducing Yupp: a fun and easy way to discover, compare, and use the latest AIs – while helping to shape the future of the field. Sign up at
0
0
5
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 41/42 As a reminder, this tweet thread is also available in blog form at
1
0
4
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 40/42 We’re just getting started, and wish to engage the AI community in collaborations that produce orders of magnitude more high-quality data – to tackle the challenge of robust and trustworthy AI evaluation! If you are interested, please reach out at research@yupp.ai.
1
0
3
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 39/42 We’re also preparing a data release to share a subset of our public prompts in the next few months, but even today you can access samples from our model pages. If you are interested in this, please send an email to research@yupp.ai.
1
0
5
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 38/42 There are many more interesting questions we seek to explore. Two more examples: ▶️ What is the right way to share data while respecting the privacy of users? ▶️ How do we demonstrate adherence to stated principles in a provable manner, perhaps using cryptographic primitives?
1
0
5
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 37/42 To that end, we will soon ship a novel feature: permissionless model evals, allowing anyone (students, AI hobbyists, etc.) to submit an AI to Yupp. We’ll orchestrate comparative evaluations and then give you feedback on how your AI stacked up against the others.
1
2
7
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 36/42 As a concrete example: We have been thinking about how we can provide equitable access to all AI developers, from those at frontier labs pushing the state of the art to resource-limited graduate students who are also training and fine-tuning models.
1
0
5
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 35/42 Beyond building a compelling consumer product, we're asking hard questions the AI community has raised: How does one truly build robust and trustworthy evaluation? @yupp_ai seeks to be provably fair, radically transparent, permissionless, and accessible to all.
1
0
5
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 34/42 ❓How do we better model user demographics and incentives? We are running experiments to explore many product features using professional raters with validated user profiles. With multiple layers of quality testing, these raters give us a reference for calibration.
1
0
5
@UWCheritonCS @UWaterloo @pankaj @gilad @yupp_ai 33/42 ❓And on the flip side, how do we automatically detect “bad” users who provide low-quality data? We’re leveraging our experience in tackling spam and bots at Twitter and have developed sophisticated algorithms to ensure data quality.
1
0
5