Clémentine Fourrier 🍊

@clefourrier

Followers
3,720
Following
328
Media
119
Statuses
2,065

Leaderboards & evals research @HuggingFace 🐍✨ "The future is already here, it’s just not very evenly distributed" (Gibson)

Joined October 2019
Pinned Tweet
@clefourrier
Clémentine Fourrier 🍊
6 months
Hi! Wanted to clarify something regarding the Open LLM Leaderboard team. The maintainers of the leaderboard (since May) have been one engineer ( @nathanhabib1011 ) and one researcher (me) full time on leaderboards/evals. So, what do we do? 🧵
2
38
143
@clefourrier
Clémentine Fourrier 🍊
1 year
For the last months @huggingface , I worked on transformers and... graphs! So here is a small blog, if you wonder what one could use graphs for, or how to machine learn on them 🔎 (Spoiler: they are everywhere 🧬🚗✍️)
27
433
2K
@clefourrier
Clémentine Fourrier 🍊
21 days
I discovered at ICLR 2024 that a lot of what I take for granted about LLM evaluation is actually not that widely known... So I made a blog! - how do we currently do LLM evaluation? ⚖️ - most importantly, what is it actually useful for? 🤔
10
77
369
@clefourrier
Clémentine Fourrier 🍊
1 year
🔔 New modality on @huggingface 's hub: Graphs! 🎇 If you want to experiment, we already have 25 datasets and a model... and we're looking forward to seeing what else the community will add! Ping me if you need a hand to upload your artifacts 🤗
Tweet media one
5
63
327
@clefourrier
Clémentine Fourrier 🍊
7 months
Ready for the *biggest update* of the Open LLM Leaderboard yet? We just spent A YEAR of GPU time to make it more interesting and fairer! 🤯 How? With @nathanhabib1011 , we added 3 new evals from the great EleutherAI harness 💥 and re-ran 2000+ models! 🚀 So, what changes? 🧵
10
68
245
@clefourrier
Clémentine Fourrier 🍊
1 year
Do you want to use graph transformers in 🤗 Transformers ? We made it possible! This blog will walk you through graph classification with @huggingface and the Graphormer model. ✨🧬
4
51
224
@clefourrier
Clémentine Fourrier 🍊
2 years
Ph.D. successfully defended! ✨🎇 Thanks to my supervisors and jury, extremely nice teammates, all the cool people who collaborated with me over the last 3 years, friends and family, + those present today for this big moment! Next stop: research scientist @huggingface 🤗
Tweet media one
30
9
209
@clefourrier
Clémentine Fourrier 🍊
6 months
2023 has been incredible for open releases, so I made a ✨year review in Open LLMs ✨ It was lots of fun coming back through all that came out, and it's insane how much the field soared thanks to the community & openness! Summary of each section: 🧵
7
48
184
@clefourrier
Clémentine Fourrier 🍊
6 months
⚠️ We are removing DROP from the Open LLM Leaderboard! With leaderboard evaluation data openly shared on 2000+ models, we did a deep dive with our friends @AiEleuther and @try_zeno , & found out that its original implementation is unfair to many models 😱
8
31
154
@clefourrier
Clémentine Fourrier 🍊
2 months
Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing ♻️the order in which the few shot examples are added to the prompt ♻️ you get a difference of up to 3 points in evaluation score?
Tweet media one
13
31
149
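To make the few-shot ordering effect concrete, here is a minimal sketch of how one could measure it: build one prompt per ordering of the same fixed examples, score each, and report the spread. The `toy_score` function is a hypothetical stand-in for a real model evaluation (it is not the leaderboard's code), and the "Q:/A:" prompt template is an assumption chosen for illustration.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Shot = Tuple[str, str]  # (question, answer)


def build_prompt(shots: Sequence[Shot], question: str) -> str:
    """Concatenate few-shot examples in the given order, then the target question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def order_sensitivity(
    shots: Sequence[Shot], question: str, score_fn: Callable[[str], float]
) -> Tuple[float, float, float]:
    """Score every ordering of the same shots; return (min, max, spread)."""
    scores = [
        score_fn(build_prompt(order, question)) for order in permutations(shots)
    ]
    return min(scores), max(scores), max(scores) - min(scores)


def toy_score(prompt: str) -> float:
    # Hypothetical placeholder: a real setup would run the model on a full
    # benchmark with this prompt layout. Here the "score" only depends on
    # which example comes first, to mimic order sensitivity.
    return float(len(prompt.split("\n\n")[0]))


SHOTS = [("2+2?", "4"), ("Capital of France?", "Paris"), ("Opposite of hot?", "cold")]
low, high, spread = order_sensitivity(SHOTS, "3+5?", toy_score)
print(f"min={low} max={high} spread={spread}")
```

With n examples there are n! orderings, so in practice one would likely sample a handful of orderings rather than enumerate them all.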
@clefourrier
Clémentine Fourrier 🍊
9 months
Super cool thing: There is virtually ✨no performance diff ✨ between the 4-bit Falcon-180B and the bfloat-16 one! 🤯 = you can actually use a Falcon-180B model at 1/4th the memory cost, with almost no difference in inference quality!
Tweet media one
5
28
140
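The "1/4th the memory cost" figure follows from simple arithmetic on weight storage. A rough sketch, assuming exactly 180 billion parameters and counting only the weights (it ignores quantization metadata, activations, and the KV cache, so real usage will be somewhat higher):

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 180_000_000_000  # assumed parameter count for a 180B model

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # bfloat16 uses 16 bits per weight
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantization uses 4 bits per weight

print(f"bfloat16: {bf16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, ratio: {int4_gb / bf16_gb:.2f}")
```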
@clefourrier
Clémentine Fourrier 🍊
2 months
New: Open Medical LLM Leaderboard! 🩺 In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸 It's therefore vital to benchmark/follow advances in medical LLMs before thinking about deployment. Blog:
5
32
128
@clefourrier
Clémentine Fourrier 🍊
2 years
First day as a research intern @huggingface ! 🥳 Everybody has been presenting their work on the science side, there is so much interesting stuff going on ✨
4
6
125
@clefourrier
Clémentine Fourrier 🍊
7 months
🆕 The current best pretrained model on the Open LLM Leaderboard, beating Falcon 180B and Llama-2 70B, is a 34B model 🤯 We're really entering the era of small models! Congrats to the researchers of @01AI_Yi for their new Yi base models!
Tweet media one
5
12
123
@clefourrier
Clémentine Fourrier 🍊
5 months
The Open LLM Leaderboard is taking a stronger stance on metadata, via 2 things. 1) If your model has no model card or license tag, it is now in the "deleted" category & it won't appear in the main view. A model with no explanation or license is not useful to the community.
5
8
116
@clefourrier
Clémentine Fourrier 🍊
2 years
Amazing night out with colleagues @huggingface , it was so much fun 😁✨🍷 (Food and stories were shared, and let me tell you, these women have incredible anecdotes to recount!)
Tweet media one
2
5
106
@clefourrier
Clémentine Fourrier 🍊
19 days
And it's out! :D A good read if you want to think about doing robust evaluation, going in depth into the nits of it.
Tweet media one
Tweet media two
@clefourrier
Clémentine Fourrier 🍊
20 days
If you liked my blog post about LLM evaluation, one of the coolest papers on the topic will appear tonight on arxiv! (not from 🤗) My favorite points from it: - the differences between initial benchmark design and actual use - the Appendix, delving into the maths of eval ❤️
1
5
69
3
26
108
@clefourrier
Clémentine Fourrier 🍊
3 months
This flag name should become a standard
Tweet media one
3
15
101
@clefourrier
Clémentine Fourrier 🍊
11 months
🔥 Big update of the Open LLM leaderboard! 🥳 So, what's new? • ⚖️ updated MMLU results! • ⚡ a backend at least an order of magnitude faster! • 📰 more info to improve reproducibility! What does this mean, in more detail? 🧵
4
26
95
@clefourrier
Clémentine Fourrier 🍊
2 years
#acl2022nlp What happens inside a multilingual neural cognate prediction model? We show that predicting cognates between current Romance languages latently teaches the model about their proto-forms, allowing reconstruction without fine-tuning encoders on the task!🧵
Tweet media one
1
18
87
@clefourrier
Clémentine Fourrier 🍊
1 month
What if you could make model evaluation less prompt sensitive? With our friends @dottxtai , we wrote a blog on how structured generation seems to reduce model score variance considerably. Tell us what you think!
4
17
82
@clefourrier
Clémentine Fourrier 🍊
8 months
Do you often wonder what the current best model is on the Open LLM Leaderboard, for a given weight/type category? 👀⚖️ I made an automatically updated collection to make it easier to follow daily! 📊 Tell me what you think! 🤗
3
22
78
@clefourrier
Clémentine Fourrier 🍊
4 months
Mini OSS release today: lighteval 🌤️ (with @nathanhabib1011 and @Thom_Wolf ) It's a small LLM eval suite, to: - iterate on new tasks easily (prompt/templates variations, custom tasks...) 🧪 - evaluate HF/nanotron compatible models as fast as possible with DP/PP on GPUs ⚡️
10
19
74
@clefourrier
Clémentine Fourrier 🍊
5 months
Another day, another leaderboard on the hub! ✨ @PMinervini and his team used 14 different datasets to create a leaderboard measuring hallucinations in LLMs 🤯 Read their cool intro blog with an in-depth analysis of the results:
0
16
71
@clefourrier
Clémentine Fourrier 🍊
29 days
New on the hub: Arabic LLM Leaderboard! Arabic has at least 380M speakers & is one of the most spoken languages... but how good are LLMs at it? @alielfilali01 contacted @TIIuae and @huggingface to know, and collaborate around a new leaderboard!
2
17
68
@clefourrier
Clémentine Fourrier 🍊
2 years
Morning #acl2022nlp "Next big ideas" talk in one sentence each: - Heng Ji: Structured information carries interesting complexity we should study. - Mirella Lapata: Stories are cool because both NLP-complex and at the heart of culture. (1/3)
1
15
67
@clefourrier
Clémentine Fourrier 🍊
2 years
Tell me you're a tech company without telling me you're a tech company @huggingface
Tweet media one
2
2
62
@clefourrier
Clémentine Fourrier 🍊
4 months
New leaderboard: NPHardEval! It uses logical questions of diff. complexities as a proxy for reasoning abilities 💪 Since the questions can be generated automatically, it's going to be dynamic, updated monthly! 🚀 Congrats to @HuaWenyue31539 @LeegeoF !
4
12
60
@clefourrier
Clémentine Fourrier 🍊
19 days
Personal opinion: the @cohere folks are a joy to have drinks with 🍷
@julien_c
Julien Chaumond
19 days
Personal opinion: the @cohere docs really are a joy to read
Tweet media one
6
16
159
4
6
64
@clefourrier
Clémentine Fourrier 🍊
5 months
New leaderboard powered by Decoding Trust (outstanding paper at NeurIPS!), to evaluate LLM safety, such as bias and toxicity, PII, and robustness 🚀 You can find it here: And the intro blog is here: Congrats to @uiuc_aisecure !
2
18
59
@clefourrier
Clémentine Fourrier 🍊
4 months
Is this a 7B 💎 I spy on the Open LLM Leaderboard? A new open source player appears!
Tweet media one
5
8
56
@clefourrier
Clémentine Fourrier 🍊
2 months
Follow up "eval is fun" tweet: how much do scores change depending on prompt format choice? The score range for a given model is 10 points! :D Prompt format on the x axis, all these evals look at the logprob of either "choice A/choice B..." or "A/B...".
Tweet media one
3
10
55
@clefourrier
Clémentine Fourrier 🍊
1 year
Bonus resource: For predictions of what's plausibly to come in graph ML in 2023, take a look at (by @michael_galkin , @ren_hongyu , @zhu_zhaocheng @chrsmrrs , @jo_brandstetter )
3
4
51
@clefourrier
Clémentine Fourrier 🍊
11 months
Curious to know how Llama-2 models compare to other LLMs? 🦙 👀⚖️ Check it out on the Open LLM Leaderboard! We re-ran the 7B, 7B-chat, 13B, 13B-chat, 70B-chat, and the 70B will be there soon ✨ (I can already spy with my little eye a new top model 🚀)
Tweet media one
5
17
51
@clefourrier
Clémentine Fourrier 🍊
10 months
We just reached 1000 models evaluated on the Open LLM Leaderboard! 🚀 GPUs going brrr! > 500 models was only 3 weeks ago! 🤯
Tweet media one
0
11
50
@clefourrier
Clémentine Fourrier 🍊
6 days
New text to image model arena! 🎨 - compares open and closed source models (Elo ranking) - insight: OSS models are catching up! - ‼️allows you to get your own personal preference leaderboard - which image gen model do you tend to like most? Read more:
2
13
49
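The arena above ranks models with Elo. As a minimal sketch of how a single pairwise vote could update two ratings, assuming the standard Elo formula with a K-factor of 32 and a 400-point scale (the arena's actual rating method and constants are not stated here):

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Return the new (A, B) ratings after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; a voter prefers model A's image.
new_a, new_b = update_elo(1000.0, 1000.0, a_won=True)
print(new_a, new_b)
```

A personal preference leaderboard could, in principle, apply the same update using only one user's votes.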
@clefourrier
Clémentine Fourrier 🍊
6 months
New top 7B pretrained model on the Open LLM Leaderboard today! 🚀 The releases just keep on coming! Congrats to @deci_ai
Tweet media one
2
13
48
@clefourrier
Clémentine Fourrier 🍊
5 months
We are starting a leaderboard/evals blog series on the @huggingface hub: 🏅Leaderboards on the hub! Our first collaborative blog is with @vectara , about their brand new hallucination leaderboard 🚀
1
20
47
@clefourrier
Clémentine Fourrier 🍊
6 months
Did you know that 1) Mixtral models are on the OpenLLMLeaderboard? 2) You can submit your Mixtral fine-tunes starting today? Many thanks to the transformers team for the quick and smooth integration, and of course congrats to @MistralAI for the release!
Tweet media one
1
8
46
@clefourrier
Clémentine Fourrier 🍊
5 months
We have a new top model on the GAIA leaderboard, which evaluates next-gen tool-augmented LLMs! ✨ Called FRIDAY, it's got impressive results both on the public val and on the private test sets! Congrats to its authors!
3
5
44
@clefourrier
Clémentine Fourrier 🍊
5 months
Open LLM Leaderboard update: we changed submission model types! 1) 💬 RLHF/DPO/IFT... are grouped (under chat models) 2) 🤝 Users who just enjoy M(o)Erging models (with no extra fine-tuning) can now select "merge & moerges" Bonus: How do you pronounce moerge? 😁
Tweet media one
7
6
47
@clefourrier
Clémentine Fourrier 🍊
2 years
Dear twitter, the Paris office 🤗 unicorn doesn't have a name yet, any cool ideas? (Pictured here deep in thought) @huggingface @mervenoyann
Tweet media one
8
3
46
@clefourrier
Clémentine Fourrier 🍊
2 years
PhD manuscript sent! (😱🥳😌😴) If my reviewers like it, I'll defend in September! (Lots of love to the amazing @FourrierMarine for her superb proofreading 😍)
6
0
46
@clefourrier
Clémentine Fourrier 🍊
2 months
New leaderboard: LiveCodeBench! 💻 Complete code evaluations, with a great feature: problem selection by publication date 📅 This means getting model scores only on new problems out of the training data = contamination free code evals! 🚀 Blog:
1
13
45
@clefourrier
Clémentine Fourrier 🍊
4 months
Models are being deployed in real life situations... but are they safe? The Red-Teaming Resistance leaderboard tests if models resist harmful instructions (fake news creation, malware diffusion, harassment, etc) 🔥 Super useful, congrats to @haizelabs !
0
11
44
@clefourrier
Clémentine Fourrier 🍊
8 months
🆕To favor shareability and reproducibility, new submissions to the Open LLM Leaderboard now MUST have a model card and a license 📄 This will hopefully allow everyone to know how these models were created (training data, params), and how they can be used downstream 🤗
3
12
45
@clefourrier
Clémentine Fourrier 🍊
3 months
New multimodal leaderboard on the hub 🚀 Many situations require models to parse images containing text: maps, web pages, real world pictures, memes, ... 🖼️ & the ConTextual team introduced a brand new dataset to evaluate how good models are on this!
2
15
44
@clefourrier
Clémentine Fourrier 🍊
14 days
Scale AI are introducing high quality arenas, with... - private datasets (=can't be gamed) - paid annotators for the rankings (=fairer and higher quality annotations)! It's a super exciting direction for the evaluation field! 🚀 Good job to them!
@summeryue0
Summer Yue
14 days
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! Check out our leaderboards at ! Which evals should we build next?
Tweet media one
10
32
191
2
7
42
@clefourrier
Clémentine Fourrier 🍊
1 month
I'm going to punch someone if I see another poster trying to evaluate an LLM's "intent to do" things.
4
2
41
@clefourrier
Clémentine Fourrier 🍊
2 months
🆕 Open RL Leaderboard 🏆 It evaluates submitted RL agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)! Also displays videos of the best model's run, which is super fun to watch! ✨ Kudos to @qgallouedec ! 🚀
2
8
40
@clefourrier
Clémentine Fourrier 🍊
3 months
New arena on @huggingface : Chatbot Guardrails, by @lighthouzai ! Goal: Try to make models reveal private customer information they have access to 😬 Let's see together which models are actually the safest! Try it: Learn more:
2
11
37
@clefourrier
Clémentine Fourrier 🍊
7 months
New leaderboard on the hub by @vectara ! How much do models "hallucinate"? 😵‍💫 Or in other words, how strong is their tendency to veer away from facts when generating text?
0
12
39
@clefourrier
Clémentine Fourrier 🍊
4 months
Did you know we have a text-to-speech arena on the hub? 🤯 It allows anyone to compare state of the art models, and vote on what sounds better! 🔥 Go try it out! Congrats to @realmrfakename for the idea/implem, and to @reach_vb for the support! 🤗
1
4
37
@clefourrier
Clémentine Fourrier 🍊
6 months
To understand model performance better, you can now display if models on the Open LLM Leaderboard are merges or not! (As long as model creators used the `merge` or `mergedlm` tags in their model cards)
Tweet media one
4
5
35
@clefourrier
Clémentine Fourrier 🍊
3 months
Dear intern applicants, Please stop telling me about your leadership skills, and talk about your quality teamwork instead. Thanks
2
2
35
@clefourrier
Clémentine Fourrier 🍊
2 months
⚠️We've decided to pause the Open LLM Leaderboard temporarily (hopefully till the end of day) to prevent evaluation failures due to network problems on the hub. If your model failed this morning, tell us, we'll relaunch once everything's good. Infra/hub teams are on it! 💪
2
5
35
@clefourrier
Clémentine Fourrier 🍊
4 months
The Open LLM Leaderboard team @huggingface is looking for an intern 👀 If you want to work with @nathanhabib1011 and me on upgrading the leaderboard, enjoy interacting with the community, and think a lot about LLM evaluation, come work with us 🤗
3
10
35
@clefourrier
Clémentine Fourrier 🍊
6 months
Really grateful to get feedback like this from time to time ❤️ & Overall thanks to the community for their support of and interest in the Open LLM Leaderboard 🤗
Tweet media one
1
4
33
@clefourrier
Clémentine Fourrier 🍊
3 months
Are you looking for the perfect leaderboard/arena for your use case? 👀 New tool! Select your modality, language, task of interest, and more... then search! 🔍 - list is built from space metadata - feel free to open PRs to make it prettier 😅
2
11
32
@clefourrier
Clémentine Fourrier 🍊
7 months
If they open the Grok weights, I'll definitely run them in priority on the @huggingface leaderboard 🤗 It would be very useful for the community to know how well the model does there 🔥
@kaifulee
Kai-Fu Lee
7 months
I applaud the original openness of OpenAI, and agree with @elonmusk 's lament. But @elonmusk : When will Grok become open source like @01AI_Yi ? Please come join the @huggingface leaderboard.
49
108
812
2
5
32
@clefourrier
Clémentine Fourrier 🍊
3 months
Amazing new leaderboard on @huggingface : WildBench! Cool because it: - uses real life chat inputs (selected for quality) 💬 - compares a range of metrics (LLM-as-a-judge, Human Elo ranking, ...) - is going to be dynamic 🔥 Many congrats to @billyuchenlin and his team 🚀
@billyuchenlin
Bill Yuchen Lin 🤖
3 months
Introducing AI2 𝕎𝕚𝕝𝕕𝔹𝕖𝕟𝕔𝕙 ! We aim to benchmark LLMs with challenging tasks from real users in the wild. 🤗 Link: 🤩 What great features does it offer? 🌟x9 ⬇️ 🌟1. 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 & 𝐑𝐞𝐚𝐥: We carefully curate a collection of 1024 hard
Tweet media one
20
111
540
0
8
32
@clefourrier
Clémentine Fourrier 🍊
1 month
New leaderboard: 🏅vs⏲vs️💸 Sometimes you just want to know which model has the best performance to inference speed to cost ratio for your use case... There's now a leaderboard for this, comparing models across API providers 8 times per day!
6
9
29
@clefourrier
Clémentine Fourrier 🍊
7 days
Very tempted to just discard all math eval questions which are not in the metric system 😈
Tweet media one
7
0
30
@clefourrier
Clémentine Fourrier 🍊
16 days
Heeey hello @max_nlp 😀 We're officially at 10K for the Open LLM Leaderboard!!! To celebrate, expect cool stuff in the following weeks! :)
Tweet media one
@clefourrier
Clémentine Fourrier 🍊
16 days
Who wants to be ♥️ 10 000 on the Open LLM Leaderboard? 😁
Tweet media one
3
1
13
3
2
26
@clefourrier
Clémentine Fourrier 🍊
8 months
We actually passed 2000 models evaluated on the Open LLM Leaderboard! 🤯 Congrats to @nathanhabib1011 , @Thom_Wolf , and all the community and team members who gave a hand along the way! 🤗
@nathanbenaich
Nathan Benaich
8 months
Judging by the leaderboards over at @HuggingFace , open source is more vibrant than ever, with downloads and model submissions rocketing to record highs. Remarkably, in the last 30 days Llama models have been downloaded more than 32M times on Hugging Face 🚀
Tweet media one
3
23
139
1
6
28
@clefourrier
Clémentine Fourrier 🍊
1 month
It's now become very, very easy to create a leaderboard... Just use the integrated @Gradio space template on the @huggingface hub! 🚀 Thanks to @pngwn and @kramp for their help 🤗
Tweet media one
1
6
29
@clefourrier
Clémentine Fourrier 🍊
11 months
👀 Simplified Open LLM Leaderboard's interface! ⚖️ You can now select columns (scores, license, # of parameters, model likes) from the main view, to compare models on what's most interesting to you Have fun exploring the 464 models we evaluated 🤗
Tweet media one
2
6
28
@clefourrier
Clémentine Fourrier 🍊
5 days
The Open LLM Leaderboard passed 2 million unique visitors since last October! Thanks a lot to the community coming to take a look :)
3
5
31
@clefourrier
Clémentine Fourrier 🍊
3 months
The Berkeley Function Calling Leaderboard tests if models can write valid API calls/functions💻 Did you know that there's also a viewer to compare model outputs? 👀 Super useful to see, in practice, which model is better for your augmented LLM use-case!
Tweet media one
0
6
25
@clefourrier
Clémentine Fourrier 🍊
1 month
Follow up: at ICLR drinks, this was my straight-faced answer to "So, why are you here?" Vibes check of companies: people @cohere & @Google loved it ❤️ @MistralAI told me where the usb with mistral-large weights is 😆 but @OpenAI , never have I felt so unfunny in my life 😅
@clefourrier
Clémentine Fourrier 🍊
1 month
I've been asked by PhD students what my strategy when attending big ML conferences is, so in its entirety: Hang at your social event to steal & free the weights 🤗
0
0
16
0
1
25
@clefourrier
Clémentine Fourrier 🍊
6 months
@RealJosephus Hi! Most of these models have now been flagged (thanks to @nathanhabib1011 's work over the weekend) but we're working on adding a better system for contamination detection! :)
1
1
25
@clefourrier
Clémentine Fourrier 🍊
6 months
Dear twitter, I'll be hibernating and eating delicious food with my loved ones till beginning of January! So don't forget to follow the cool @nathanhabib1011 for Open LLM Leaderboard updates if you need a fix while I'm off ^^ Happy EOY to all 🤗
3
3
26
@clefourrier
Clémentine Fourrier 🍊
6 months
That's it folks, thanks for coming to my TED talk, I hope you get a better view of what we do! & again, thanks a lot for your interest, and thanks to community members giving us a hand, such as @Weyaxi , who's coding an insanely cool tool for the lb I'll talk about this week ❤️
3
1
26
@clefourrier
Clémentine Fourrier 🍊
4 months
In case you missed it, chat templates support is coming to the @AiEleuther Harness! 🚀 (Which also means that we'll be able to add it to the Open LLM Leaderboard once it's merged and stable 🤗) Issue: Blog:
1
7
25
@clefourrier
Clémentine Fourrier 🍊
2 months
@victormustar Search by model size
2
1
25
@clefourrier
Clémentine Fourrier 🍊
12 days
New leaderboard: "Occiglot Euro LLM Leaderboard"! It evaluates the performance of LLMs on the following languages: 🇬🇧🇮🇹🇫🇷🇪🇸🇩🇪 It complements more specialised leaderboards well, congrats to the authors :)
1
5
25
@clefourrier
Clémentine Fourrier 🍊
5 months
Just added a "Show merges" to filter out merged models on the Open LLM Leaderboard! ✨ However, a lot of models are not filtered because their metadata are incomplete > if you have a minute, feel free to open PRs/issues to help model creators (and us) 🤗
Tweet media one
@stanfordnlp
Stanford NLP Group
5 months
True and mergekit is great for building good customized LLMs. But the @huggingface Open LLM Leaderboard is much less useful if the top is filled from an exponential space of model merges. Maybe a “Show merges” checkbox, unchecked by default?
2
1
35
1
5
23
@clefourrier
Clémentine Fourrier 🍊
2 months
Did you know that Command R+ is on the Open LLM Leaderboard? It's notably got very good scores on MMLU and GSM8K! 💪 Congrats @CohereForAI for the cool model! ❤️
Tweet media one
2
6
23
@clefourrier
Clémentine Fourrier 🍊
6 months
Disclaimer: We updated GSM8K scores for ~150 models over the weekend, as we found out we had accidentally run their evaluations with a test environment, and not our usual pipeline. (That's why you should never mess with prod 😓) We're very sorry, all should be fixed! 🙇
1
1
23
@clefourrier
Clémentine Fourrier 🍊
4 months
Just stumbled on a new Open LLM Leaderboard for... Portuguese! 🚀 Many congrats to the author, it looks very neat, and it's great to see more diversity in languages covered! 🔥
4
6
20
@clefourrier
Clémentine Fourrier 🍊
2 years
My personal "best #acl2022nlp promotional object award" (so far) goes to @Bosch_AI for this very sensible pair of socks: simple, made locally, actually useful, super colorful.
Tweet media one
1
1
22
@clefourrier
Clémentine Fourrier 🍊
16 days
Cool blog on good eval design! ✨ Should be: simple to understand & run, meaningful, with fair scoring, high data quality & enough samples. Irl however, for ex, MMLU is noisy/hard, BBH ungrounded: popular tasks rarely fit all points. But it's great to have a gold target! 🚀
@_jasonwei
Jason Wei
19 days
New blog post where I discuss what makes a language model evaluation successful, and the "seven sins" that hinder an eval from gaining traction in the community: Had fun presenting this at Stanford's NLP Seminar yesterday!
Tweet media one
13
81
531
1
4
22
@clefourrier
Clémentine Fourrier 🍊
4 months
New leaderboard: Enterprise Scenarios 🚀 Why is it cool? 👀 1) Real world use cases (Finance, Legal confidentiality, Customer support, ...) = it's interesting for companies 🏢 2) Test set is private = it's hard to game 🔥 Many congrats to @PatronusAI !
@PatronusAI
PatronusAI
4 months
Today, we’re excited to announce the Enterprise Scenarios Leaderboard on Hugging Face, the first LLM leaderboard for real world use cases! 🏆
4
19
76
1
4
22
@clefourrier
Clémentine Fourrier 🍊
2 years
Some poems by #bloom 🌸 imitating French poets @huggingface
Tweet media one
Tweet media two
4
2
21
@clefourrier
Clémentine Fourrier 🍊
11 months
@SashaMTL Did you check the guy's wikipedia page? Also, this app name is wrong on so many levels
5
0
20
@clefourrier
Clémentine Fourrier 🍊
6 months
This really highlights how hard it is to do evaluation right. We'd love for this cool dataset to be more reliable, so feel free to join our work in progress on fixing it 🤗 In the meantime we’re putting it aside, to be as fair as possible to all.
2
3
21
@clefourrier
Clémentine Fourrier 🍊
2 months
Who wants to be ❤️ 9000 on the Open LLM Leaderboard?
Tweet media one
Tweet media two
1
5
21
@clefourrier
Clémentine Fourrier 🍊
9 months
I spy, with my eye, a new very, very big pretrained model on the Open LLM Leaderboard 👀
1
5
21
@clefourrier
Clémentine Fourrier 🍊
1 month
We need more benchmarks outside of English, to account for linguistic diversity and evaluate languages on their specificities! Hebrew, for ex, is morphologically rich. @DictaTools & @tal_geva collaborate to make a new Open LLM Leaderboard for Hebrew! 🔥
2
2
21
@clefourrier
Clémentine Fourrier 🍊
5 months
2) We now separate merges (via a checkbox); but as some users removed the merge tag from their metadata to be in the main view, we added a mechanism to automatically *flag* all the models identified as merges where the metadata is incorrect. Discussion:
1
0
20
@clefourrier
Clémentine Fourrier 🍊
1 year
In this post, you'll discover: - what graphs are, why they are used, how to represent them - how people learn on graphs, from pre-neural methods to Graph Neural Networks - the very recent world of Transformers for graphs
1
1
20
@clefourrier
Clémentine Fourrier 🍊
6 months
3. We then regularly analyze leaderboard results and write blog posts to explain our findings to the community 🔍, some examples on DROP and MMLU (a new blog coming soon on contamination, led by @nathanhabib1011 ✨ )
1
0
20
@clefourrier
Clémentine Fourrier 🍊
3 months
New top model on GAIA 🚀 (for next-gen augmented LLMs) It's by @MSFTResearch , and achieves 15% on the hardest test set (lvl 3). It solved 3 times more questions than the runner up 🤯 This "early experiment" has super impressive results! Congrats!!!
2
2
19
@clefourrier
Clémentine Fourrier 🍊
3 months
Our friends @dottxtai published a great blog post today about how to improve evaluation performance and relevance! ✨ Prompts contain structured information, but we expect our models to infer this structure. What if it was provided explicitly?
Tweet media one
Tweet media two
1
4
20
@clefourrier
Clémentine Fourrier 🍊
5 months
We hope this will - help everyone adopt better practices - improve model understanding and knowledge sharing - allow people to select relevant models for them more easily Thanks to all people who contributed their inputs on this!
2
1
19
@clefourrier
Clémentine Fourrier 🍊
1 month
Why does contamination happen? 🤔 Imo, mostly accidentally, as it's hard to check if datasets contain evaluations. This new database should make it way simpler to know! It stores evidence of model and dataset contamination & is open to submissions 🔥
1
5
19
@clefourrier
Clémentine Fourrier 🍊
6 months
Some important features that have been pushed back a bit due to a lack of time, but are coming early next year! 👀 - contamination detection - chat/system templates support - allowing more models & our work with partners will hopefully lead to more blackbox leaderboards!
3
1
19
@clefourrier
Clémentine Fourrier 🍊
8 months
A new best 7B model on the Open LLM Leaderboard dropped! 🚀 (and it's from our cool RLHF team héhé🤗) The time has come for small but mighty models! 💪
Tweet media one
@edwardbeeching
Edward Beeching
8 months
We will soon release the @huggingface LLM alignment handbook. Using these recipes you can build state of the art chatbots such as Zephyr-7b, released today. Register your interest by starring the github You can find out about Zephyr-7b in this thread:
Tweet media one
3
38
220
0
2
19
@clefourrier
Clémentine Fourrier 🍊
2 months
We discussed these numbers with the Meta team, and on GSM8K, we found we're not reporting the same thing: the number of few-shot examples is different, they are using other params, etc. From what I've seen, they've been doing their best to get reproducible results (using the harness, etc).
@bindureddy
Bindu Reddy
2 months
Meta's numbers for Llama-3 do not match up with the Hugging Face Numbers. The new Llama-3 models are up on hugging face, and the benchmark numbers don't quite match up. In fact, there is a significant discrepancy in both MMLU and GSM8K. Meta's numbers reported on their blog DO
Tweet media one
18
26
222
1
2
19