Dieuwke Hupkes Profile
Dieuwke Hupkes

@_dieuwke_

Followers: 2K
Following: 1K
Media: 85
Statuses: 463

Joined September 2017
@_dieuwke_
Dieuwke Hupkes
3 hours
Many thanks for this big honour! 🤩.
@IJCAIconf
IJCAIconf
4 hours
Congratulations to the winners of the 2025 IJCAI–JAIR Prize for their paper “Compositionality Decomposed: How Do Neural Networks Generalise?” – Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni! #IJCAI2025
2
0
25
@_dieuwke_
Dieuwke Hupkes
3 hours
RT @IJCAIconf: Congratulations to the winners of the 2025 IJCAI–JAIR Prize for their paper “Compositionality Decomposed: How Do Neural Netw…
0
4
0
@_dieuwke_
Dieuwke Hupkes
3 hours
RT @vernadankers: Proud to accept a 5y outstanding paper award @IJCAIconf 🏆 from JAIR for the impact Compositionality Decomposed has had,…
0
3
0
@_dieuwke_
Dieuwke Hupkes
27 days
RT @WiAIR_podcast: 🧠 What does it really mean for an LLM to generalize? And are we even measuring it right? In the latest #WiAIR episode, w…
0
1
0
@_dieuwke_
Dieuwke Hupkes
27 days
Could not be more thrilled about this partnership, allowing us to keep MultiLoKo's test set truly hidden and have experts at Kaggle independently run the leaderboard 😍🔥💪
@kaggle
Kaggle
28 days
Exciting collaboration! We've partnered with @AIatMeta to launch the MultiLoKo Benchmark, now live on our platform. Measure model performance across 31 languages with truly private holdout sets – just like Kaggle Competitions, ensuring accurate results. Explore MultiLoKo and
0
1
9
@_dieuwke_
Dieuwke Hupkes
27 days
Thrilled about the launch of this platform 🤩, the feature to host secret test sets is a game changer in the fight against contamination and a gift to both benchmark builders and modellers 🔥🔥 Excited to be one of the first to use it for @AIatMeta's MultiLoKo test set 💪!
@kaggle
Kaggle
28 days
🚀 Kaggle Benchmarks is here! Get competition-grade rigor for AI model evaluation. Let Kaggle handle infrastructure while you focus on AI breakthroughs. View model performance on 70+ leaderboards, including @AIatMeta's MultiLoKo. Dive in:
0
1
6
@_dieuwke_
Dieuwke Hupkes
27 days
RT @kaggle: 🚀 Kaggle Benchmarks is here! Get competition-grade rigor for AI model evaluation. Let Kaggle handle infrastructure while you f…
0
26
0
@_dieuwke_
Dieuwke Hupkes
1 month
RT @WiAIR_podcast: How do we know if a language model really generalizes - or is just repeating patterns it’s memorized? Let’s talk about c…
0
1
0
@_dieuwke_
Dieuwke Hupkes
4 months
Want to know more? Have a look at our paper or our GitHub repository. Don't forget to check the related work section for other awesome work on multilingual evaluation for LLMs. @metaai
github.com
A benchmark with locally sourced multilingual questions for 31 languages. - facebookresearch/multiloko
0
0
2
@_dieuwke_
Dieuwke Hupkes
4 months
But if you want to appear on our GitHub leaderboard, reach out via a GitHub issue on our repo.
1
0
2
@_dieuwke_
Dieuwke Hupkes
4 months
Lastly, MultiLoKo has an OOD test split! We split the data based on the frequency of the sourcing topic on Wikipedia, which makes the test split more difficult than dev. We keep the test split secret until further notice to prevent overfitting.
1
0
1
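As a rough illustration of the frequency-based OOD split described in the tweet above, a minimal Python sketch; the field names, the Wikipedia frequency counts, and the 50/50 cutoff are assumptions, not the actual MultiLoKo pipeline:

```python
# Hypothetical sketch, not the actual MultiLoKo code: put questions about
# frequent Wikipedia topics in dev and rare topics in the harder OOD test split.
from typing import Dict, List, Tuple

def ood_split(
    questions: List[dict],
    topic_frequency: Dict[str, int],
    dev_fraction: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Tuple[List[dict], List[dict]]:
    # Rank topics from most to least frequent on Wikipedia.
    ranked = sorted(topic_frequency, key=topic_frequency.get, reverse=True)
    dev_topics = set(ranked[: int(len(ranked) * dev_fraction)])
    dev = [q for q in questions if q["topic"] in dev_topics]
    test = [q for q in questions if q["topic"] not in dev_topics]
    return dev, test

questions = [
    {"topic": "Eurovision", "question": "..."},
    {"topic": "Sinterklaas", "question": "..."},
]
dev, test = ood_split(questions, {"Eurovision": 90_000, "Sinterklaas": 4_000})
```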
@_dieuwke_
Dieuwke Hupkes
4 months
As for the difference between human and machine translations: this, too, changes the language difficulty rankings, though not as drastically. It does cause a substantial average drop for most languages.
1
1
0
@_dieuwke_
Dieuwke Hupkes
4 months
We find that sourcing changes language prioritisation! Rank correlations between language difficulty are between 0.54 and 0.70 for the best models; the locality effect ranges from -13 to +17 for Llama 3.1 405B, from -21 to +15 for Gemini 2.0 Flash, and from -22 to +14 for GPT-4o.
1
0
0
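As a rough illustration of what such a rank correlation measures, a small sketch using SciPy's spearmanr; the two conditions and the per-language EM numbers are invented:

```python
# Illustrative only: comparing the language-difficulty ranking a model gets
# under two evaluation conditions. The per-language EM scores are invented.
from scipy.stats import spearmanr

em_condition_a = {"ar": 28.0, "de": 41.5, "hi": 22.3, "ja": 35.1, "sw": 18.7}
em_condition_b = {"ar": 33.2, "de": 40.8, "hi": 27.9, "ja": 31.4, "sw": 25.6}

languages = sorted(em_condition_a)
rho, p = spearmanr(
    [em_condition_a[lang] for lang in languages],
    [em_condition_b[lang] for lang in languages],
)
print(f"Spearman rank correlation between the two rankings: {rho:.2f}")
```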
@_dieuwke_
Dieuwke Hupkes
4 months
We also consider the impact of local sourcing vs translating English data (the norm for most benchmarks used by LLM releases), which we express in the Locality Effect metric (locally sourced EM - translated English data EM).
1
0
2
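A minimal sketch of the Locality Effect as defined in the tweet above; the per-language scores are placeholders:

```python
# Locality effect per language: EM on locally sourced questions minus EM on
# questions translated from English. Positive means local sourcing helps.
def locality_effect(em_local: dict, em_translated: dict) -> dict:
    return {
        lang: round(em_local[lang] - em_translated[lang], 1)
        for lang in em_local
    }

print(locality_effect({"hi": 22.3, "ja": 35.1}, {"hi": 27.9, "ja": 31.4}))
# {'hi': -5.6, 'ja': 3.7}
```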
@_dieuwke_
Dieuwke Hupkes
4 months
We also run some response-level consistency tests inspired by those previous works (in an adapted version), and find that even for the best-performing models, the overlap of questions correctly answered across languages is not even 50%.
1
0
1
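One plausible way to compute such a cross-lingual overlap; the exact consistency metric used in the paper may differ, and the question IDs below are made up:

```python
# One plausible reading of "overlap of questions correctly answered across
# languages": Jaccard overlap of the sets of correctly answered question IDs.
def correct_overlap(correct_a: set, correct_b: set) -> float:
    union = correct_a | correct_b
    return len(correct_a & correct_b) / len(union) if union else 1.0

en_correct = {"q1", "q2", "q5", "q8"}
hi_correct = {"q2", "q3", "q8"}
print(correct_overlap(en_correct, hi_correct))  # 0.4, i.e. below 50%
```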
@_dieuwke_
Dieuwke Hupkes
4 months
Next, we studied the effect of the question language and found that, generally, performance is higher when the question is asked in the 'native' language. In the plot, *mother tongue effect* = performance when the question is asked in the language to which it is relevant minus performance in English.
1
1
2
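The mother tongue effect from the tweet above, written out as code; the scores are placeholders:

```python
# Mother tongue effect: EM when the question is asked in the language it is
# relevant to, minus EM when the same questions are asked in English.
def mother_tongue_effect(em_native: float, em_english: float) -> float:
    return em_native - em_english

# e.g. Japan-related questions asked in Japanese vs. asked in English
print(mother_tongue_effect(em_native=35.0, em_english=30.5))  # 4.5
```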
@_dieuwke_
Dieuwke Hupkes
4 months
For multilingual benchmarks, the average can hide a lot. We therefore add a second metric to MultiLoKo: the gap between the best and worst language. Unsurprisingly, there is quite a lot of variance across languages, even when only the top-N languages apart from English are considered.
1
0
1
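The two summary numbers (average EM and best-worst gap) in a minimal sketch; the per-language scores are invented:

```python
# Per-model summary: average EM across languages plus the best-worst gap,
# which the average alone would hide. Per-language EMs are invented.
def summarize(per_language_em: dict) -> tuple[float, float]:
    scores = list(per_language_em.values())
    return sum(scores) / len(scores), max(scores) - min(scores)

em = {"en": 48.0, "de": 41.5, "ja": 35.0, "hi": 22.5, "sw": 18.0}
avg, gap = summarize(em)
print(f"average EM = {avg:.1f}, best-worst gap = {gap:.1f}")
# average EM = 33.0, best-worst gap = 30.0
```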
@_dieuwke_
Dieuwke Hupkes
4 months
Perhaps the most boring result: MultiLoKo is hard. The best average EM on the dev set is around 34.
1
0
1
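For reference, exact match (EM) in its plain form; the normalization MultiLoKo actually applies may differ from this sketch:

```python
# Exact match (EM) in its plain form: lowercase, strip, drop punctuation,
# then compare. MultiLoKo's actual normalization may differ.
import string

def exact_match(prediction: str, gold: str) -> bool:
    def normalize(text: str) -> str:
        text = text.lower().strip()
        return text.translate(str.maketrans("", "", string.punctuation))
    return normalize(prediction) == normalize(gold)

print(exact_match("The Hague!", "the hague"))  # True
```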