JunShern @junshernchan X Profile

JunShern

@junshernchan

Followers

393

Following

6K

Media

18

Statuses

185

Trying to make AI go well @AnthropicAI. Previously @OpenAI, @CHAI_Berkeley, @nyuniversity, autonomous vehicles @motionaldrive. 🇲🇾

Joined March 2018

Don't wanna be here? Send us removal request.

JunShern

@junshernchan

9 months

Wake up babe, new agent benchmark dropped! MLE-bench - or, as I like to say it, EMILY BENCH 🙆🏻‍♀️. Main summary on Neil’s thread, but adding some highlights of my own:

Neil Chowdhury @ ICML

@ChowdhuryNeil

9 months

Proud to introduce MLE-bench: A benchmark of 75 real-life Kaggle competitions to test AI agents on ML engineering! When will we have our first AI Kaggle Grandmaster? 🥇🥈🥉.

1

14

103

JunShern

@junshernchan

2 days

RT @FakePsyho: Humanity has prevailed (for now!). I'm completely exhausted. I figured, I had 10h of sleep in the last 3 days and I'm barely….

0

947

0

JunShern

@junshernchan

3 days

RT @balesni: A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) tra….

0

92

0

JunShern

@junshernchan

3 months

We are at ICLR, come chat!.

Neil Chowdhury @ ICML

@ChowdhuryNeil

3 months

Our MLE-bench poster #367 is up till 12:30pm in Hall 3, and our oral presentation is at 3:30pm today in Garnet 213-215. Come say hi!

0

10

JunShern

@junshernchan

4 months

RT @TransluceAI: To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken….

0

65

0

JunShern

@junshernchan

4 months

RT @ilumine_ai: This quick experiment I just did made my jaw drop. You can literally create and play any game by iterating over images w….

0

352

0

JunShern

@junshernchan

4 months

RT @tyler_m_john: It's hard to live like short ASI timelines are true, even if you're deeply intellectually convinced. It means going again….

0

66

0

JunShern

@junshernchan

5 months

RT @elvin_not_11: Offline features like battery level

0

561

0

JunShern

@junshernchan

5 months

RT @ChowdhuryNeil: Accepted to ICLR :).

0

7

0

JunShern

@junshernchan

5 months

RT @DimitrisPapail: AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination. AIME 2025 part I was conducted yesterday,….

0

44

0

JunShern

@junshernchan

5 months

RT @aleks_madry: We find that across the board, today’s top LLMs still make genuine mistakes on these benchmarks. At the same time, on the….

0

9

0

JunShern

@junshernchan

5 months

RT @MaxNadeau_: 🧵 Announcing @open_phil's Technical AI Safety RFP!. We're seeking proposals across 21 research areas to help make AI system….

0

83

0

JunShern

@junshernchan

6 months

RT @joannejang: i find mishearing 'agents' as 'asians' way more entertaining. "a swarm of asians working 24/7 while you sleep". "the future….

0

179

0

JunShern

@junshernchan

6 months

RT @OwainEvans_UK: New paper:.We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *d….

0

153

0

JunShern

@junshernchan

6 months

Neat! I haven't looked too closely but this seems thoughtfully done and probably very useful to many people. :). (Similar in spirit to our release of MLE-bench Lite yesterday too -- tis the season for minifying!).

Marius Hobbhahn

@MariusHobbhahn

6 months

I made SWEBench-verified-mini. SWEBench-verified-mini is a subset of SWEBench-verified that uses 50 instead of 500 datapoints, requires 5GB instead of 130GB of storage, and has approximately the same distribution of performance, test pass rates, and difficulty as the original.

0

5

JunShern

@junshernchan

6 months

RT @zhengyaojiang: Cool stuff! The full MLE-Bench was nearly impossible for anyone outside the big players to run. This will benefit the co….

0

3

0

JunShern

@junshernchan

6 months

Finally, here's @ChowdhuryNeil's original thread on MLE-bench, in case you missed it:

Neil Chowdhury @ ICML

@ChowdhuryNeil

9 months

Proud to introduce MLE-bench: A benchmark of 75 real-life Kaggle competitions to test AI agents on ML engineering! When will we have our first AI Kaggle Grandmaster? 🥇🥈🥉.

0

3

JunShern

@junshernchan

6 months

We’ve also updated the README on our repo ( to explain how the new “Lite Evaluation” works, along with other guidelines to standardize MLE-bench benchmarking. Let us know if you have any questions!.

1

0

4

JunShern

@junshernchan

6 months

We’ve updated our paper to report our main results broken down by split for easy comparison. (btw that's where the results in this thread come from - no new results, just re-reporting the paper's results with the split breakdown).

1

0

5

JunShern

@junshernchan

6 months

By nature of using lower-complexity tasks, this “Lite” version is easier than the full benchmark. We expect this to be helpful for the community to focus on while we’re still in an era of agents that haven’t saturated the Lite set.

1

0

4

JunShern

@junshernchan

6 months

The “Low” complexity split of MLE-bench, already annotated in our original release, provides a focused set of simpler competitions (22/75) which are also less resource-intensive: Full MLE-bench which has a total dataset size of 3.3TB, the total size of Lite is just 158GB.

1

0

5