
Saurabh Shah
@saurabh_shah2
Followers: 2K · Following: 10K · Media: 102 · Statuses: 1K
training olmos @allen_ai prev @Apple @Penn 🎤dabbler of things🎸 🐈⬛enjoyer of cats 🐈 and mountains🏔️ he/him
Seattle, WA
Joined December 2022
The marks of a good benchmark (IMO):
- measures something that makes a meaningful difference in user experience of the model
- you'd expect frontier models to nail it
- frontier models don't nail it

IFBench hits all 3!! Great work by @valentina__py and some of my teammates :)
Introducing IFBench, a benchmark to measure how well AI models follow new, challenging, and diverse verifiable instructions. Top models like Gemini 2.5 Pro or Claude 4 Sonnet are only able to score up to 50%, presenting an open frontier for post-training. 🧵
oh, also shoutout Rohan for heavily inspiring this post. When I was down in the bay I asked him what he thought was the next big thing to scale and he just said "simulation" without hesitating. Idk if you all know this but this guy is legit.
bro got simulationpilled. what's your excuse for not yet taking the simulationpill, anon? if you want to understand what kinds of problems will be solved by RL in the next 2 years, check out his essay.
New post! Reinforcement learning, deterministic chaos, and scaling simulation. I think this is my favorite post so far, do check it out! Shoutout @robertghrist for teaching an awesome course on dynamical systems. I took it 4 years ago but I think about these ideas all the time
something for everyone in the new office - come hang.
A new open pre-trainer joins the fight! Closed labs fight against each other. Open labs fight together. Very cool release from Arcee!!
Our customers needed a better base model <10B parameters. We spent the last 5 months building one. I'm delighted to share a preview of our first Arcee Foundation Model: AFM-4.5B-Preview.
super cool work from the goat @michaelryan207 on personalization for LMs. Personalization is a (maybe even 'the') core problem in how humans will interact with language models going forward. Michael is a (maybe even 'the') goat researcher + presenter. This is very cool work!
New #ACL2025NLP Paper! 🎉 Curious what AI thinks about YOU? We interact with AI every day, offering all kinds of feedback, both implicit ✏️ and explicit 👍. What if we used this feedback to personalize your AI assistant to you? Introducing SynthesizeMe! An approach for
RT @sama: also, here is one part that people not interested in the rest of the post might still be interested in:
So @finbarrtimbers wrote this simple script to give Cursor logs of any experiment and now I'm officially a middle manager 🤠
Holy shit @dwarkesh_sp is only 24?? Absolute goat. (Just listened to the Tyler Cowen interview, I particularly liked this one.) I turn 24 in a month. Who wants to start a pod?
Prediction: there will be a paper (or OAI blog post) with a graph showing that as you scale simulation compute + reduce the sim2real gap, you get scaling laws similar to the ones we're seeing now for pre-training or inference-time compute. (This is not a new idea.) Snippet from @natolambert