
JunShern
@junshernchan
Followers
393
Following
6K
Media
18
Statuses
185
Trying to make AI go well @AnthropicAI. Previously @OpenAI, @CHAI_Berkeley, @nyuniversity, autonomous vehicles @motionaldrive. 🇲🇾
Joined March 2018
Wake up babe, new agent benchmark dropped! MLE-bench - or, as I like to say it, EMILY BENCH 🙆🏻♀️. Main summary on Neil’s thread, but adding some highlights of my own:
Proud to introduce MLE-bench: A benchmark of 75 real-life Kaggle competitions to test AI agents on ML engineering! When will we have our first AI Kaggle Grandmaster? 🥇🥈🥉.
1
14
103
RT @FakePsyho: Humanity has prevailed (for now!). I'm completely exhausted. I figured, I had 10h of sleep in the last 3 days and I'm barely….
0
947
0
RT @balesni: A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) tra….
0
92
0
We are at ICLR, come chat!.
Our MLE-bench poster #367 is up till 12:30pm in Hall 3, and our oral presentation is at 3:30pm today in Garnet 213-215. Come say hi!
0
0
10
RT @TransluceAI: To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken….
0
65
0
RT @ilumine_ai: This quick experiment I just did made my jaw drop. You can literally create and play any game by iterating over images w….
0
352
0
RT @tyler_m_john: It's hard to live like short ASI timelines are true, even if you're deeply intellectually convinced. It means going again….
0
66
0
RT @DimitrisPapail: AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination. AIME 2025 part I was conducted yesterday,….
0
44
0
RT @aleks_madry: We find that across the board, today’s top LLMs still make genuine mistakes on these benchmarks. At the same time, on the….
0
9
0
RT @MaxNadeau_: 🧵 Announcing @open_phil's Technical AI Safety RFP!. We're seeking proposals across 21 research areas to help make AI system….
0
83
0
RT @joannejang: i find mishearing 'agents' as 'asians' way more entertaining. "a swarm of asians working 24/7 while you sleep". "the future….
0
179
0
RT @OwainEvans_UK: New paper:.We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *d….
0
153
0
Neat! I haven't looked too closely but this seems thoughtfully done and probably very useful to many people. :). (Similar in spirit to our release of MLE-bench Lite yesterday too -- tis the season for minifying!).
I made SWEBench-verified-mini. SWEBench-verified-mini is a subset of SWEBench-verified that uses 50 instead of 500 datapoints, requires 5GB instead of 130GB of storage, and has approximately the same distribution of performance, test pass rates, and difficulty as the original.
0
0
5
RT @zhengyaojiang: Cool stuff! The full MLE-Bench was nearly impossible for anyone outside the big players to run. This will benefit the co….
0
3
0
Finally, here's @ChowdhuryNeil's original thread on MLE-bench, in case you missed it:
Proud to introduce MLE-bench: A benchmark of 75 real-life Kaggle competitions to test AI agents on ML engineering! When will we have our first AI Kaggle Grandmaster? 🥇🥈🥉.
0
0
3