Workshop on Multilingual Data Quality Signals @wmdqs X Profile

Workshop on Multilingual Data Quality Signals

@wmdqs

Followers

13

Following

1

Media

14

Statuses

19

The first iteration of our workshop will be co-located with @COLM_conf 2025 in Montreal.

https://t.co/qGEgtoYHX6

Montréal

Joined July 2025

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

@davlanade @sebnagel @CommonCrawl If you were able to join us, let us know about your experience: https://t.co/f4REPlfIBs

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

@davlanade @sebnagel @CommonCrawl Thank you everyone for coming to WMDQS (pronounced "whim ducks")!

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

@davlanade @sebnagel @CommonCrawl Then we had our second poster session for our paper submissions. The full papers are available on our website!

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

@davlanade After lunch, @sebnagel gave a keynote about the data collected by @commoncrawl!

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

@davlanade gave a keynote about text quality for low-resource languages.

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

We had our first poster session, hearing from some of our shared task participants!

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

We presented the results of our shared task! We received annotations for over 30,000 documents representing over 60 languages. We also showed the results of our LangID dataset and system shared task tracks. Thank you everyone who participated!

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

We started with a keynote from Julia Kreutzer about multilingual fine-tuning data!

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

WMDQS is underway! Come join us in Room 520A at @COLM_conf !

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

See our updated website for more details:

wmdqs.org

A workshop addressing multilingual data quality. Held on the 10th October 2025 in Montréal.

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

We will also have a session on our shared task, which was about improving language identification models. Participants of the shared task contributed annotations to create a new LangID dataset and also submitted new LangID systems.

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

Our third and final keynote will be from @sebnagel about the data in Common Crawl.

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

Our second keynote will be by @davlanade about text quality for low-resource languages.

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

Our first keynote will be from Julia Kreutzer about data for multilingual fine-tuning.

Workshop on Multilingual Data Quality Signals

@wmdqs

2 months

In collaboration with @CommonCrawl @MLCommons @AiEleuther, the first edition of WMDQS at @COLM_conf starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.

Workshop on Multilingual Data Quality Signals

@wmdqs

4 months

Contribute annotations here:

Workshop on Multilingual Data Quality Signals

@wmdqs

4 months

For context:

Catherine Arnett @ NeurIPS (San Diego)

@linguist_cat

6 months

One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data!

Workshop on Multilingual Data Quality Signals

@wmdqs

4 months

We've added lots more documents/languages and extended the deadline for the first round of annotations until July 23rd. Check out the details below 👇

Catherine Arnett @ NeurIPS (San Diego)

@linguist_cat

6 months

One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data!
