Common Crawl Foundation

@CommonCrawl

Followers
8K
Following
573
Media
38
Statuses
1K

Common Crawl is a non-profit foundation dedicated to the Open Web.

San Francisco, CA
Joined February 2010
@fishon_amos
Fishon 💡
22 hours
My latest article "Train-Set SEO: Why Embedding Your Brand in AI’s DNA is the Future of Search Optimization" just got published on @hackernoon 💚💚 https://t.co/2tWokLM80B
hackernoon.com
Train-Set SEO is a new approach to search engine optimisation. The goal is to make your content surface as the source of a generated answer, not just retrieved.
0
8
15
@CommonCrawl
Common Crawl Foundation
16 days
Thank you, Leo, Jeff and Paris, for the opportunity to talk about AI and Common Crawl on TWiT! I really enjoyed the conversation! @leolaporte @jeffjarvis @parismartineau https://t.co/k7mJgzZXlx
twit.tv
AI data wars push Reddit to block the Wayback Machine; China Launches Three-Day Robot Olympics Featuring Football and Table Tennis; US government agency drops Grok after
1
2
4
@ClementDelangue
clem 🤗
19 days
We don’t give nearly enough credit to the people and organizations who build and share open AI datasets. In fact, I’d argue they matter even more than open models:
- they’re foundational, enabling hundreds of different models
- they remove not only technical but also legal
@gui_penedo
Guilherme Penedo
19 days
> SmolLM3
> GLM-4.5
> NVIDIA-Nemotron-Nano
These are just some of the recent OS releases relying on 🥂 FineWeb2 for their multilingual data. Proud that the community trusts us for their data supply 🫡
31
57
495
@CommonCrawl
Common Crawl Foundation
19 days
Common Crawl Foundation wants to expand its language diversity. We're currently 43% English. Pedro Ortiz Suarez from our team published a paper related to this. We are excited to push this forward! https://t.co/5ScqinTnw9
arxiv.org
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more...
0
0
5
@CommonCrawl
Common Crawl Foundation
2 months
https://t.co/3vgRWbeabQ "MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common
blogs.microsoft.com
Microsoft launches 2 initiatives to open Europe’s languages and culture, building on AI, cloud, and digital sovereignty commitments.
4
9
49
@linguist_cat
Catherine Arnett
3 months
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data!
1
2
6