JunShern Profile
JunShern

@junshernchan

Followers
393
Following
6K
Media
18
Statuses
185

Trying to make AI go well @AnthropicAI. Previously @OpenAI, @CHAI_Berkeley, @nyuniversity, autonomous vehicles @motionaldrive. 🇲🇾

Joined March 2018
Don't wanna be here? Send us removal request.
@junshernchan
JunShern
9 months
Wake up babe, new agent benchmark dropped! MLE-bench - or, as I like to say it, EMILY BENCH 🙆🏻‍♀️. Main summary on Neil’s thread, but adding some highlights of my own:
Tweet media one
@ChowdhuryNeil
Neil Chowdhury @ ICML
9 months
Proud to introduce MLE-bench: A benchmark of 75 real-life Kaggle competitions to test AI agents on ML engineering! When will we have our first AI Kaggle Grandmaster? 🥇🥈🥉.
1
14
103
@junshernchan
JunShern
2 days
RT @FakePsyho: Humanity has prevailed (for now!). I'm completely exhausted. I figured, I had 10h of sleep in the last 3 days and I'm barely….
0
947
0
@junshernchan
JunShern
3 days
RT @balesni: A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) tra….
0
92
0
@junshernchan
JunShern
3 months
We are at ICLR, come chat!.
@ChowdhuryNeil
Neil Chowdhury @ ICML
3 months
Our MLE-bench poster #367 is up till 12:30pm in Hall 3, and our oral presentation is at 3:30pm today in Garnet 213-215. Come say hi!
Tweet media one
0
0
10
@junshernchan
JunShern
4 months
RT @TransluceAI: To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken….
0
65
0
@junshernchan
JunShern
4 months
RT @ilumine_ai: This quick experiment I just did made my jaw drop. You can literally create and play any game by iterating over images w….
0
352
0
@junshernchan
JunShern
4 months
RT @tyler_m_john: It's hard to live like short ASI timelines are true, even if you're deeply intellectually convinced. It means going again….
0
66
0
@junshernchan
JunShern
5 months
RT @elvin_not_11: Offline features like battery level
Tweet media one
0
561
0
@junshernchan
JunShern
5 months
RT @ChowdhuryNeil: Accepted to ICLR :).
0
7
0
@junshernchan
JunShern
5 months
RT @DimitrisPapail: AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination. AIME 2025 part I was conducted yesterday,….
0
44
0
@junshernchan
JunShern
5 months
RT @aleks_madry: We find that across the board, today’s top LLMs still make genuine mistakes on these benchmarks. At the same time, on the….
0
9
0
@junshernchan
JunShern
5 months
RT @MaxNadeau_: 🧵 Announcing @open_phil's Technical AI Safety RFP!. We're seeking proposals across 21 research areas to help make AI system….
0
83
0
@junshernchan
JunShern
6 months
RT @joannejang: i find mishearing 'agents' as 'asians' way more entertaining. "a swarm of asians working 24/7 while you sleep". "the future….
0
179
0
@junshernchan
JunShern
6 months
RT @OwainEvans_UK: New paper:.We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *d….
0
153
0
@junshernchan
JunShern
6 months
Neat! I haven't looked too closely but this seems thoughtfully done and probably very useful to many people. :). (Similar in spirit to our release of MLE-bench Lite yesterday too -- tis the season for minifying!).
@MariusHobbhahn
Marius Hobbhahn
6 months
I made SWEBench-verified-mini. SWEBench-verified-mini is a subset of SWEBench-verified that uses 50 instead of 500 datapoints, requires 5GB instead of 130GB of storage, and has approximately the same distribution of performance, test pass rates, and difficulty as the original.
0
0
5
@junshernchan
JunShern
6 months
RT @zhengyaojiang: Cool stuff! The full MLE-Bench was nearly impossible for anyone outside the big players to run. This will benefit the co….
0
3
0
@junshernchan
JunShern
6 months
Finally, here's @ChowdhuryNeil's original thread on MLE-bench, in case you missed it:
@ChowdhuryNeil
Neil Chowdhury @ ICML
9 months
Proud to introduce MLE-bench: A benchmark of 75 real-life Kaggle competitions to test AI agents on ML engineering! When will we have our first AI Kaggle Grandmaster? 🥇🥈🥉.
0
0
3
@junshernchan
JunShern
6 months
We’ve also updated the README on our repo ( to explain how the new “Lite Evaluation” works, along with other guidelines to standardize MLE-bench benchmarking. Let us know if you have any questions!.
1
0
4
@junshernchan
JunShern
6 months
We’ve updated our paper to report our main results broken down by split for easy comparison. (btw that's where the results in this thread come from - no new results, just re-reporting the paper's results with the split breakdown).
1
0
5
@junshernchan
JunShern
6 months
By nature of using lower-complexity tasks, this “Lite” version is easier than the full benchmark. We expect this to be helpful for the community to focus on while we’re still in an era of agents that haven’t saturated the Lite set.
Tweet media one
1
0
4
@junshernchan
JunShern
6 months
The “Low” complexity split of MLE-bench, already annotated in our original release, provides a focused set of simpler competitions (22/75) which are also less resource-intensive: Full MLE-bench which has a total dataset size of 3.3TB, the total size of Lite is just 158GB.
1
0
5