
Common Crawl Foundation
@CommonCrawl
Followers: 8K
Following: 573
Media: 38
Statuses: 1K
Common Crawl is a non-profit foundation dedicated to the Open Web.
San Francisco, CA
Joined February 2010
My latest article "Train-Set SEO: Why Embedding Your Brand in AI’s DNA is the Future of Search Optimization" just got published on @hackernoon 💚💚 https://t.co/2tWokLM80B
hackernoon.com
Train-Set SEO is a new approach to search engine optimisation. The goal is to make your content surface as the source of a generated answer, not just retrieved.
Thank you, Leo, Jeff and Paris, for the opportunity to talk about AI and Common Crawl on TWiT! I really enjoyed the conversation! @leolaporte @jeffjarvis @parismartineau
https://t.co/k7mJgzZXlx
twit.tv
AI data wars push Reddit to block the Wayback Machine; China Launches Three-Day Robot Olympics Featuring Football and Table Tennis; US government agency drops Grok after
We don’t give nearly enough credit to the people and organizations who build and share open AI datasets. In fact, I’d argue they matter even more than open models:
- they’re foundational, enabling hundreds of different models
- they remove not only technical but also legal
> SmolLM3
> GLM-4.5
> NVIDIA-Nemotron-Nano
These are just some of the recent OS releases relying on 🥂 FineWeb2 for their multilingual data. Proud that the community trusts us for their data supply 🫡
Common Crawl Foundation wants to expand its language diversity. We're currently 43% English. Pedro Ortiz Suarez from our team published a paper related to this. We are excited to push this forward! https://t.co/5ScqinTnw9
arxiv.org
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more...
https://t.co/3vgRWbeabQ "MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common
blogs.microsoft.com
Microsoft launches 2 initiatives to open Europe’s languages and culture, building on AI, cloud, and digital sovereignty commitments.
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data!