OSCAR

@oscarnlp

Followers: 326
Following: 48
Media: 1
Statuses: 46

The Open Super-large Crawled Aggregated coRpus

Joined April 2021
@oscarnlp
OSCAR
2 years
📣 The OSCAR Project and @DFKI are happy to announce the release of Colossal OSCAR 1.0 📚, which is now available on the @huggingface Hub 🤗. Colossal OSCAR 1.0 was put together by @pjox13 as part of the @OpenGPTX collaboration.
huggingface.co
@oscarnlp
OSCAR
1 year
Check out the new multimodal OSCAR by @FuteralMatthieu! 🚀📚🖼️
@FuteralMatthieu
Matthieu Futeral-Peter
1 year
Announcing mOSCAR, a multilingual interleaved text-image corpus, as part of the @oscarnlp project. 1/6
@oscarnlp
OSCAR
2 years
👀 We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community 💬 on Discord:
discord.com
Check out the OSCAR Project community on Discord - hang out with 492 other members and enjoy free voice and text chat.
@oscarnlp
OSCAR
2 years
✨ Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of @Inria, @inria_paris, @InriaParisNLP and @CommonCrawl. Special thanks for the contributions of @Uinelj, @imrua__, @sobamchan, @sebnagel and @bensagot.
@oscarnlp
OSCAR
2 years
As Colossal OSCAR 1.0 is based on @CommonCrawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use 📄 👉
commoncrawl.org
Explore Common Crawl's terms of use: understand our policies, guidelines, and your rights when accessing our web data.
@oscarnlp
OSCAR
2 years
Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 @CommonCrawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages. 🗣️
@oscarnlp
OSCAR
2 years
Colossal OSCAR 1.0 is our largest release so far, almost 10 times as big as previous releases. We're still working on statistics and documentation, so please bear with us while we finish these for you in the coming days and weeks. 🤓🧑‍🔬📊
@oscarnlp
OSCAR
2 years
RT @translation_eu: Everybody is talking about @OpenAI - we should talk more about cool projects like @silo_AI, @oscarnlp (for multilingual…
@oscarnlp
OSCAR
2 years
OSCAR 23.01 has been made possible thanks to @Uinelj @pjox13 @imrua__ @sobamchan @sebnagel and @bensagot.
@oscarnlp
OSCAR
2 years
🎉 OSCAR 23.01 is for now only available to researchers and academics, but will be available later on 🤗 HuggingFace. To access the data, please follow the steps in our documentation.
@oscarnlp
OSCAR
2 years
📄 We also now have more in-depth technical documentation available, which we will update with tutorials, how-tos, corpus documentation and info about the whole project.
@oscarnlp
OSCAR
2 years
🚨 Other changes include metadata naming changes, language naming changes to better respect the BCP 47 standard, and a compression change: OSCAR is now compressed using Zstandard rather than gzip.
@oscarnlp
OSCAR
2 years
😮 Perplexity scores from the KenLM models are precomputed, but it is up to the user to set a threshold for selecting documents. ⚠️ Please use with caution, and do not hesitate to send feedback. Please refer to this pre-print for more information:
arxiv.org
As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous...
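A minimal sketch of what setting such a threshold could look like, assuming each document is a JSON line carrying a precomputed perplexity score. The field name `harmful_pp`, the sample values and the threshold are assumptions made for illustration; they are not taken from the announcement. The sketch also assumes that a lower perplexity means the text looks more like the data the filter model was trained on, so low-scoring documents are the ones dropped.

```python
import json

# Hypothetical records: the field name "harmful_pp" and the values are
# assumptions for illustration, not taken from the announcement.
SAMPLE_LINES = [
    '{"content": "doc a", "metadata": {"harmful_pp": 120.5}}',
    '{"content": "doc b", "metadata": {"harmful_pp": 2400.0}}',
    '{"content": "doc c", "metadata": {}}',
]


def keep_document(record, min_perplexity):
    """Keep a document unless its score falls below the chosen threshold.

    Assumes a lower perplexity means the text looks more like the data the
    filter model was trained on. Documents without a score are kept.
    """
    score = record.get("metadata", {}).get("harmful_pp")
    return score is None or score >= min_perplexity


def filter_lines(lines, min_perplexity):
    """Parse JSON lines and return the contents of documents that pass."""
    records = (json.loads(line) for line in lines)
    return [r["content"] for r in records if keep_document(r, min_perplexity)]


kept = filter_lines(SAMPLE_LINES, min_perplexity=500.0)
print(kept)
```

With the sample data, only "doc a" falls below the threshold and is dropped. Keeping unscored documents is a design choice of this sketch; a stricter pipeline might discard them instead.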
@oscarnlp
OSCAR
2 years
👀 KenLM-based adult content filtering, computed for a select group of 73 languages using a KenLM model trained on harmful content. While still experimental, it shows promising results in English.
@oscarnlp
OSCAR
2 years
๐Ÿ“Precomputed document-level Locality Sensitive Hashes! This will make both near and exact deduplication easier for you ๐Ÿ˜.
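As a rough illustration of how a precomputed hash can simplify deduplication, the sketch below keeps the first document seen for each hash value. The `hash` key, the sample values and the helper name are hypothetical. Treating identical hash strings as duplicates is a simplification; near-deduplication would additionally need a distance function between hashes from a hashing library, which is not shown here.

```python
# Hypothetical documents: the "hash" key and its values are placeholders
# for whatever precomputed hash string each document carries.
DOCS = [
    {"id": 1, "hash": "A1B2", "content": "same text"},
    {"id": 2, "hash": "C3D4", "content": "other text"},
    {"id": 3, "hash": "A1B2", "content": "same text"},
    {"id": 4, "hash": None, "content": "too short to hash"},
]


def drop_exact_duplicates(docs):
    """Keep the first document seen for each hash value.

    Documents without a hash are always kept, since there is nothing to
    compare them on.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = doc.get("hash")
        if key is None:
            unique.append(doc)
        elif key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


unique_docs = drop_exact_duplicates(DOCS)
print([doc["id"] for doc in unique_docs])
```

Because the hashes already exist, this is a single pass over the documents with a set lookup per document, and no text has to be re-read or re-hashed.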
@oscarnlp
OSCAR
2 years
💬 OSCAR 23.01 is also the first version ever to introduce a language-specific feature: a new blocklist made specifically for Japanese 🇯🇵. With the help of our community, we hope this will be the first of many language-specific features to come.
@oscarnlp
OSCAR
2 years
📚 Categories! OSCAR 22.01 leveraged the UT1 Blocklists project to attempt to classify some adult content present in OSCAR. OSCAR 23.01 iterates on this to include all of the categories provided by UT1: blogs, press, etc. Full list (in French) here: