
OSCAR
@oscarnlp
Followers
326
Following
48
Media
1
Statuses
46
The Open Super-large Crawled Aggregated coRpus
Joined April 2021
The OSCAR Project and @DFKI are happy to announce the release of Colossal OSCAR 1.0, which is now available on the @huggingface Hub. Colossal OSCAR 1.0 was put together by @pjox13 as part of the @OpenGPTX collaboration.
huggingface.co
Check out the new multimodal OSCAR by @FuteralMatthieu!
Announcing mOSCAR, a multilingual interleaved text-image corpus, as part of the @oscarnlp project. Paper: Dataset: Doc: 1/6
We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community on Discord:
discord.com
Check out the OSCAR Project community on Discord - hang out with 492 other members and enjoy free voice and text chat.
Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of @Inria, @inria_paris, @InriaParisNLP and @CommonCrawl. Special thanks for the contributions of @Uinelj, @imrua__, @sobamchan, @sebnagel and @bensagot.
As Colossal OSCAR 1.0 is based on @CommonCrawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use.
commoncrawl.org
Explore Common Crawl's terms of use: understand our policies, guidelines, and your rights when accessing our web data.
Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 @CommonCrawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages.
RT @translation_eu: Everybody is talking about @OpenAI - we should talk more about cool projects like @silo_AI, @oscarnlp (for multilingual…
Perplexity scores from the KenLM models are pre-computed, but it is up to the user to set a threshold for selecting documents. Please use them with caution, and do not hesitate to send feedback. Please refer to this pre-print for more information:
arxiv.org
As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous...
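For anyone wondering what "setting a threshold" could look like in practice, here is a minimal sketch, not the project's own tooling. It assumes each document is a plain dict carrying a pre-computed score under a field we call `perplexity`; that field name, the sample records, and the cut-off value are all hypothetical. Whether to keep low or high scores depends on what the KenLM model was trained on, so both bounds are left optional.

```python
from typing import Iterable, Iterator, Optional


def select_by_perplexity(
    docs: Iterable[dict],
    low: Optional[float] = None,
    high: Optional[float] = None,
    key: str = "perplexity",  # hypothetical field name, not the dataset's schema
) -> Iterator[dict]:
    """Yield documents whose pre-computed score lies within [low, high]."""
    for doc in docs:
        score = doc.get(key)
        if score is None:
            continue  # no score available: skip rather than guess
        if low is not None and score < low:
            continue
        if high is not None and score > high:
            continue
        yield doc


# Made-up records for illustration only.
sample = [
    {"id": "a", "perplexity": 120.5},
    {"id": "b", "perplexity": 980.0},
    {"id": "c", "perplexity": None},
]

# Keep documents scoring at or below an arbitrary, user-chosen cut-off.
kept_ids = [d["id"] for d in select_by_perplexity(sample, high=500.0)]
print(kept_ids)
```

Passing `low=` instead of `high=` flips the selection, which is the safer default to revisit once you have checked the score distribution for your language of interest.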