Dieuwke Hupkes @_dieuwke_ X Profile

Dieuwke Hupkes

@_dieuwke_

Followers

2K

Following

1K

Media

85

Statuses

463

Joined September 2017

Dieuwke Hupkes

@_dieuwke_

3 hours

Many thanks for this big honour! 🤩

IJCAIconf

@IJCAIconf

4 hours

Congratulations to the winners of the 2025 IJCAI–JAIR Prize for their paper “Compositionality Decomposed: How Do Neural Networks Generalise?”: Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni! #IJCAI2025

2

0

25

Dieuwke Hupkes

@_dieuwke_

3 hours

RT @IJCAIconf: Congratulations to the winners of the 2025 IJCAI–JAIR Prize for their paper “Compositionality Decomposed: How Do Neural Netw….

0

4

0

Dieuwke Hupkes

@_dieuwke_

3 hours

RT @vernadankers: Proud to accept a 5y outstanding paper award @IJCAIconf 🏆 from JAIR for the impact Compositionality Decomposed has had,….

0

3

0

Dieuwke Hupkes

@_dieuwke_

27 days

RT @WiAIR_podcast: 🧠 What does it really mean for an LLM to generalize? And are we even measuring it right? In the latest #WiAIR episode, w…

0

1

0

Dieuwke Hupkes

@_dieuwke_

27 days

Could not be more thrilled about this partnership, allowing us to keep MultiLoKo's test set truly hidden and have experts at Kaggle independently run the leaderboard 😍🔥💪

Kaggle

@kaggle

28 days

Exciting collaboration! We've partnered with @AIatMeta to launch the MultiLoKo Benchmark, now live on our platform. Measure model performance across 31 languages with truly private holdout sets – just like Kaggle Competitions, ensuring accurate results. Explore MultiLoKo and

0

1

9

Dieuwke Hupkes

@_dieuwke_

27 days

Thrilled about the launch of this platform 🤩! The feature to host secret test sets is a game changer in the fight against contamination and a gift to both benchmark builders and modellers 🔥🔥 Excited to be one of the first to use it for the test set of @AIatMeta's MultiLoKo 💪!

Kaggle

@kaggle

28 days

🚀 Kaggle Benchmarks is here! Get competition-grade rigor for AI model evaluation. Let Kaggle handle infrastructure while you focus on AI breakthroughs. View model performance on 70+ leaderboards, including @AIatMeta's MultiLoKo. Dive in:

0

1

6

Dieuwke Hupkes

@_dieuwke_

27 days

RT @kaggle: 🚀 Kaggle Benchmarks is here! Get competition-grade rigor for AI model evaluation. Let Kaggle handle infrastructure while you f….

0

26

0

Dieuwke Hupkes

@_dieuwke_

1 month

RT @WiAIR_podcast: How do we know if a language model really generalizes - or is just repeating patterns it’s memorized? Let’s talk about c…

0

1

0

Dieuwke Hupkes

@_dieuwke_

4 months

RT @ryan_nefdt: My new book on Linguistic Relativity (with Jeff Pelletier) just dropped with @OUPPhilosophy: We di….

academic.oup.com

Abstract. Does your language distinguish between dark and light blues? Do your verbs require a report on where and how you got your information? Can you ea

0

16

0

Dieuwke Hupkes

@_dieuwke_

4 months

Want to know more? Have a look at our paper or our GitHub repository. Don't forget to check the related work section for other awesome work on multilingual evaluation for LLMs. @metaai

github.com

A benchmark with locally sourced multilingual questions for 31 languages. - facebookresearch/multiloko

0

2

Dieuwke Hupkes

@_dieuwke_

4 months

But if you want to appear on our GitHub leaderboard, reach out via a GitHub issue on our repo.

1

0

2

Dieuwke Hupkes

@_dieuwke_

4 months

Lastly, MultiLoKo has an OOD test split! We split the data based on the frequency of the sourcing topic on Wikipedia, and the test split is more difficult than dev. We keep the test split secret until further notice to prevent overfitting.

1

0

1

Dieuwke Hupkes

@_dieuwke_

4 months

As for the difference between human and machine translations: this also changes language difficulty rankings, but not as drastically. It does create a substantial average drop for most languages.

1

0

Dieuwke Hupkes

@_dieuwke_

4 months

We find that sourcing changes language prioritisation! Rank correlations between language difficulty rankings are between 0.54 and 0.70 for the best models; the locality effect ranges from -13 to +17 for Llama 3.1 405B, from -21 to +15 for Gemini 2.0 Flash, and from -22 to +14 for GPT-4o.

1

0

Dieuwke Hupkes

@_dieuwke_

4 months

We also consider the impact of local sourcing vs translating English data (the norm for most benchmarks used by LLM releases), which we express in the Locality Effect metric (locally sourced EM - translated English data EM).

1

0

2
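The Locality Effect described above is a per-language difference of two exact-match (EM) scores. A minimal sketch of that arithmetic follows; the function name, the dictionary layout, and the scores are made up for illustration and are not taken from the MultiLoKo code or paper.

```python
def locality_effect(local_em, translated_em):
    """Locality effect per language: locally sourced EM minus translated-English EM.

    Both arguments are hypothetical dicts mapping a language code to an EM score.
    Only languages present in both dicts are compared.
    """
    shared = local_em.keys() & translated_em.keys()
    return {lang: local_em[lang] - translated_em[lang] for lang in sorted(shared)}


# Made-up EM scores, for illustration only.
local_em = {"nl": 40.0, "hi": 22.0, "sw": 18.0}
translated_em = {"nl": 35.0, "hi": 30.0, "sw": 18.0}

effects = locality_effect(local_em, translated_em)
for lang, effect in effects.items():
    print(f"{lang}: {effect:+.1f}")
```

A positive value means the model does better on locally sourced questions than on translated English ones; a negative value means the opposite.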

Dieuwke Hupkes

@_dieuwke_

4 months

We also run some response-level consistency tests inspired by those previous works (in an adapted version), and find that even for the best-performing models, the overlap of questions correctly answered across languages is not even 50%.

1

0

1
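One way to picture the overlap of correctly answered questions is as a set comparison between two languages. The sketch below assumes a Jaccard-style overlap (intersection over union of the correctly answered question IDs); the tweet does not spell out the exact definition, so the paper's measure may differ, and the question IDs are invented.

```python
def correct_overlap(correct_a, correct_b):
    """Share of questions answered correctly in both languages, out of those
    answered correctly in at least one (an assumed, Jaccard-style definition)."""
    union = correct_a | correct_b
    if not union:
        return 0.0  # nothing answered correctly in either language
    return len(correct_a & correct_b) / len(union)


# Invented question IDs answered correctly in two hypothetical languages.
correct_in_dutch = {1, 2, 3, 4}
correct_in_hindi = {3, 4, 5, 6}

overlap = correct_overlap(correct_in_dutch, correct_in_hindi)
print(f"overlap: {overlap:.0%}")
```

Under this assumed definition, two languages with the same accuracy can still have a low overlap if the model gets different questions right in each.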

Dieuwke Hupkes

@_dieuwke_

4 months

This pretty result confirms earlier results from, among others, Qi et al. ( @Jirui_Qi @AriannaBisazza @raquel_dmg ) and Ohmer et al. ( @xenia_ohmer @eliabruni ) that knowledge transfer between languages in LLMs is suboptimal.

direct.mit.edu

Abstract. The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks,...

1

0

1

Dieuwke Hupkes

@_dieuwke_

4 months

Next, we studied the effect of the question language and found that generally, performance is higher when asked in the 'native' language. In the plot, *mother tongue effect* = performance when the question is asked in the language to which it is relevant minus performance in English.

1

2

Dieuwke Hupkes

@_dieuwke_

4 months

For multilingual benchmarks, the average can hide a lot. We therefore add a second metric to MultiLoKo: the gap between the best and worst language. Unsurprisingly, there is quite a lot of variance across languages, even when only the top-N languages apart from English are considered.

1

0

1
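To see how an average can hide a lot, the sketch below reports the mean EM next to the gap between the best and worst language. The function name and the scores are made up for illustration; this is not the MultiLoKo evaluation code.

```python
def average_and_gap(em_by_language):
    """Return (mean EM, best-minus-worst EM) for a dict of per-language scores."""
    scores = list(em_by_language.values())
    if not scores:
        raise ValueError("need at least one language score")
    return sum(scores) / len(scores), max(scores) - min(scores)


# Made-up EM scores for three hypothetical languages.
em_by_language = {"es": 50.0, "nl": 40.0, "sw": 15.0}

average, gap = average_and_gap(em_by_language)
print(f"average EM: {average:.1f}, gap: {gap:.1f}")
```

Here the average looks moderate while the gap shows that one language lags far behind, which is the kind of spread a single mean conceals.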

Dieuwke Hupkes

@_dieuwke_

4 months

Perhaps the most boring result: MultiLoKo is hard. The best average EM on the dev set is around 34.

1

0

1