OSCAR @oscarnlp X Profile

OSCAR

@oscarnlp

Followers

326

Following

48

Media

1

Statuses

46

The Open Super-large Crawled Aggregated coRpus

Joined April 2021

OSCAR

@oscarnlp

2 years

📣 The OSCAR Project and @DFKI are happy to announce the release of Colossal OSCAR 1.0 📚, which is now available on the @huggingface Hub 🤗. Colossal OSCAR 1.0 was put together by @pjox13 as part of the @OpenGPTX collaboration.

huggingface.co

2

14

53

OSCAR

@oscarnlp

2 months

RT @CommonCrawl:

commoncrawl.org

The first Workshop on Multilingual Data Quality Signals (WMDQS), hosted by Common Crawl with MLCommons, EleutherAI, and Johns Hopkins, will be held alongside COLM 2025 on 10 October 2025 in Montreal,...

0

1

0

OSCAR

@oscarnlp

1 year

Check out the new multimodal OSCAR by @FuteralMatthieu ! 🚀📚🖼️.

Matthieu Futeral-Peter

@FuteralMatthieu

1 year

Announcing mOSCAR, multilingual interleaved text-image corpus as part of @oscarnlp project. Paper: Dataset: Doc: 1/6

0

4

OSCAR

@oscarnlp

1 year

RT @FuteralMatthieu: Announcing mOSCAR, multilingual interleaved text-image corpus as part of @oscarnlp project. Paper: .

0

26

0

OSCAR

@oscarnlp

2 years

👀 We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community 💬 on Discord:

discord.com

Check out the OSCAR Project community on Discord - hang out with 492 other members and enjoy free voice and text chat.

0

OSCAR

@oscarnlp

2 years

✨ Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of @Inria, @inria_paris, @InriaParisNLP and @CommonCrawl. Special thanks for the contributions of @Uinelj, @imrua__, @sobamchan, @sebnagel and @bensagot.

1

0

OSCAR

@oscarnlp

2 years

As Colossal OSCAR 1.0 is based on @CommonCrawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use 📄.👉

commoncrawl.org

Explore Common Crawl's terms of use: understand our policies, guidelines, and your rights when accessing our web data.

1

0

OSCAR

@oscarnlp

2 years

Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 @CommonCrawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages. 🗣️.

1

0

OSCAR

@oscarnlp

2 years

Colossal OSCAR 1.0 is our largest release so far, being almost 10 times as big as previous releases. We're still working on statistics and documentation so please bear with us while we finish these for you in the coming days and weeks. 🤓🧑‍🔬📊.

1

0

1

OSCAR

@oscarnlp

2 years

RT @translation_eu: Everybody is talking about @OpenAI - we should talk more about cool projects like @silo_AI, @oscarnlp (for multilingual….

0

4

0

OSCAR

@oscarnlp

2 years

OSCAR 23.01 has been made possible thanks to @Uinelj @pjox13 @imrua__ @sobamchan @sebnagel and @bensagot.

1

0

3

OSCAR

@oscarnlp

2 years

🎉 OSCAR 23.01 is for now only available for researchers and academics, but will be available later on 🤗HuggingFace. To access the data, please follow the steps from our documentation:

1

0

2

OSCAR

@oscarnlp

2 years

📄 We also now have a more in-depth, technical documentation available that we will update with tutorials, how-tos, corpus documentation and info about the whole project.

1

0

1

OSCAR

@oscarnlp

2 years

🚨Other changes include metadata naming changes, language naming changes to better respect the BCP47 standard, and a compression change: OSCAR is now compressed using zstandard rather than gzip.

1

0

1

OSCAR

@oscarnlp

2 years

😮 Perplexity scores of the KenLM models are pre-computed, but it is up to the user to set a threshold for selecting the documents. ⚠️ Please use with caution, and do not hesitate to send feedback. Please refer to this pre-print for more information: 📝.

arxiv.org

As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous...

1

0

3
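
The post above leaves the perplexity threshold to the user. A minimal Python sketch of such a filter follows. The field names (`metadata`, `harmful_pp`), the threshold value, and the sample records are assumptions for illustration and should be checked against the OSCAR documentation. Since the scoring model is described as trained on harmful content, the sketch assumes that a low perplexity means a document resembles that content, so it keeps documents scoring at or above the threshold.

```python
from typing import Iterable, Iterator, Optional


def keep_document(doc: dict, min_perplexity: float, keep_unscored: bool = True) -> bool:
    """Decide whether a document passes the perplexity filter.

    Assumption: the score lives at doc["metadata"]["harmful_pp"] and may be
    missing or None (for example, for languages without a model).
    """
    score: Optional[float] = doc.get("metadata", {}).get("harmful_pp")
    if score is None:
        # No score available: fall back to the caller's policy.
        return keep_unscored
    # Low perplexity under a harmful-content model suggests harmful text,
    # so only documents at or above the threshold are kept.
    return score >= min_perplexity


def filter_documents(
    docs: Iterable[dict], min_perplexity: float, keep_unscored: bool = True
) -> Iterator[dict]:
    """Yield the documents that pass keep_document, in their original order."""
    for doc in docs:
        if keep_document(doc, min_perplexity, keep_unscored):
            yield doc


# Hypothetical records, not real OSCAR data.
sample = [
    {"content": "a", "metadata": {"harmful_pp": 25.0}},
    {"content": "b", "metadata": {"harmful_pp": 900.0}},
    {"content": "c", "metadata": {"harmful_pp": None}},
]
kept = [d["content"] for d in filter_documents(sample, min_perplexity=100.0)]
```

A sensible threshold likely differs per language, so it is worth inspecting the score distribution on a sample before fixing one.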

OSCAR

@oscarnlp

2 years

👀 KenLM-based adult content filtering: for a select group of 73 languages, scores are computed using a KenLM model trained on harmful content. While still experimental, this shows promising results in English.

1

0

1

OSCAR

@oscarnlp

2 years

📍Precomputed document-level Locality Sensitive Hashes! This will make both near and exact deduplication easier for you 😁.

1

0

1
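
As a rough illustration of how precomputed document-level hashes can support both kinds of deduplication, here is a Python sketch. The post does not name the hash scheme, so the field name `lsh`, the hex digest format, and the bitwise Hamming distance are stand-ins chosen for this example. A real locality sensitive hash defines its own distance function, which should be used instead.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_exact_duplicates(docs: Sequence[dict], key: str = "lsh") -> List[List[int]]:
    """Group document indices that share an identical hash value.

    Equal hashes are treated as candidate exact duplicates; comparing the
    texts themselves would confirm them.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, doc in enumerate(docs):
        digest = doc.get("metadata", {}).get(key)
        if digest is not None:
            groups[digest].append(index)
    return [ids for ids in groups.values() if len(ids) > 1]


def hamming_distance(a: str, b: str) -> int:
    """Count the differing bits between two hex digests (stand-in distance)."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def near_duplicate_pairs(hashes: Sequence[str], max_distance: int) -> List[Tuple[int, int]]:
    """Return index pairs whose hashes differ by at most max_distance bits.

    This is a quadratic all-pairs scan, fine for a sketch but too slow for a
    full corpus, where bucketing or indexing would be needed.
    """
    pairs = []
    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            if hamming_distance(hashes[i], hashes[j]) <= max_distance:
                pairs.append((i, j))
    return pairs


# Hypothetical records, not real OSCAR data.
docs = [
    {"content": "a", "metadata": {"lsh": "ff00"}},
    {"content": "a", "metadata": {"lsh": "ff00"}},
    {"content": "b", "metadata": {"lsh": "0000"}},
]
exact_groups = group_exact_duplicates(docs)
near_pairs = near_duplicate_pairs(["ff00", "ff01", "0000"], max_distance=2)
```

Because the hashes ship with the corpus, both passes work on short strings rather than full document text.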

OSCAR

@oscarnlp

2 years

💬 OSCAR 23.01 is also the first version ever to introduce a language-specific feature: A new blocklist specifically made for Japanese 🇯🇵. With the help of our community, we hope this will be the first of many language-specific features to come 🌐.

1

0

3

OSCAR

@oscarnlp

2 years

📚 Categories! OSCAR 22.01 leveraged the UT1 Blocklists project to attempt to classify some adult content present in OSCAR. OSCAR 23.01 iterates on this to include all of the categories provided by UT1: blogs, press, etc. Full list (in French) here:

1

0

4
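
Category annotations like the ones described above could be used to select or exclude documents. The Python sketch below assumes each record carries a list of category labels at `metadata.categories`. That field name and the labels `adult`, `blog` and `press` are illustrative assumptions, so the actual names should be taken from the UT1 list and the OSCAR documentation.

```python
from typing import Iterable, List, Set


def has_blocked_category(doc: dict, blocked: Set[str]) -> bool:
    """Return True if any of the document's categories is in the blocked set."""
    # Assumption: the list may be missing or None for uncategorized documents.
    categories = doc.get("metadata", {}).get("categories") or []
    return not blocked.isdisjoint(categories)


def drop_blocked(docs: Iterable[dict], blocked: Set[str]) -> List[dict]:
    """Keep only the documents that have no blocked category."""
    return [doc for doc in docs if not has_blocked_category(doc, blocked)]


# Hypothetical records, not real OSCAR data.
records = [
    {"content": "a", "metadata": {"categories": ["blog"]}},
    {"content": "b", "metadata": {"categories": ["adult", "blog"]}},
    {"content": "c", "metadata": {"categories": None}},
]
remaining = [d["content"] for d in drop_blocked(records, blocked={"adult"})]
```

The same helper can be inverted to build a subset, for example keeping only documents labeled as press.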