Common Crawl Foundation @CommonCrawl X Profile

Common Crawl Foundation

@CommonCrawl

Followers

8K

Following

573

Media

38

Statuses

1K

Common Crawl is a non-profit foundation dedicated to the Open Web.

https://t.co/8b3gzn7cbe

San Francisco, CA

Joined February 2010

Fishon 💡

@fishon_amos

22 hours

My latest article "Train-Set SEO: Why Embedding Your Brand in AI’s DNA is the Future of Search Optimization" just got published on @hackernoon 💚💚 https://t.co/2tWokLM80B

hackernoon.com

Train-Set SEO is a new approach to search engine optimisation. The goal is to make your content surface as the source of a generated answer, not just retrieved.

0

8

15

Common Crawl Foundation

@CommonCrawl

1 day

https://t.co/RfOlRTfXf8

techdirt.com

A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard—until I realized what h…

0

4

7

Common Crawl Foundation

@CommonCrawl

1 day

https://t.co/XgqAi2jGAy

commoncrawl.org

Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI. On 22 October 2025, their seminar will address privacy, safety, and security while showcasing new...

0

1

Common Crawl Foundation

@CommonCrawl

15 days

https://t.co/JNhMuZx8ep

commoncrawl.org

We are pleased to release our newsletter for July and August 2025, with updates on our team's activities.

0

2

Common Crawl Foundation

@CommonCrawl

16 days

Thank you, Leo, Jeff and Paris, for the opportunity to talk about AI and Common Crawl on TWiT! I really enjoyed the conversation! @leolaporte @jeffjarvis @parismartineau https://t.co/k7mJgzZXlx

twit.tv

AI data wars push Reddit to block the Wayback Machine | China Launches Three-Day Robot Olympics Featuring Football and Table Tennis | US government agency drops Grok after

1

2

4

clem 🤗

@ClementDelangue

19 days

We don’t give nearly enough credit to the people and organizations who build and share open AI datasets. In fact, I’d argue they matter even more than open models:
- they’re foundational, enabling hundreds of different models
- they remove not only technical but also legal

Guilherme Penedo

@gui_penedo

19 days

> SmolLM3 > GLM-4.5 > NVIDIA-Nemotron-Nano These are just some of the recent OS releases relying on 🥂 FineWeb2 for their multilingual data Proud that the community trusts us for their data supply 🫡

31

57

495

Common Crawl Foundation

@CommonCrawl

19 days

Common Crawl Foundation wants to expand its language diversity. We're currently 43% English. Pedro Ortiz Suarez from our team published a paper related to this. We are excited to push this forward! https://t.co/5ScqinTnw9

arxiv.org

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more...

0

5

Common Crawl Foundation

@CommonCrawl

19 days

https://t.co/uAssacDrZl

commoncrawl.org

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. The host-level graph consists of 691.1 million nodes and 5.0...

0

1

Common Crawl Foundation

@CommonCrawl

23 days

https://t.co/5m5qWiV22F

commoncrawl.org

We are pleased to announce the release of our August 2025 crawl, containing 2.44 billion web pages (or 424 TiB of uncompressed content).

0

1

3

Common Crawl Foundation

@CommonCrawl

30 days

https://t.co/8yInNak5gG

commoncrawl.org

Publishers and brands are shifting from SEO to AIO. Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can...

2

4

17

Common Crawl Foundation

@CommonCrawl

1 month

https://t.co/ymvXQpWni9

digitalmedusa.org

Cloudflare recently proposed a system where AI companies and crawlers would pay websites for the right to crawl their content, a move framed as “content independence day”, a response to growing...

0

2

Common Crawl Foundation

@CommonCrawl

1 month

https://t.co/4QrCYI12Sa

commoncrawl.org

A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement.

0

1

5

Common Crawl Foundation

@CommonCrawl

1 month

https://t.co/Fk2oExvAT0

searchengineworld.com

What Is Common Crawl? It is one of the most influential data sources on the web, and the vast majority of site owners don't even realize their content is in it. So what is it? Common Crawl is a nonpro

0

1

2

Common Crawl Foundation

@CommonCrawl

1 month

https://t.co/7mZVwEgbXO

commoncrawl.org

Our Web Graph release for July 2025 is now available, consisting of 481.6 million nodes and 3.4 billion edges at the host level, and 209.5 million nodes and 2.6 billion edges at the domain level.

0

4

Common Crawl Foundation

@CommonCrawl

2 months

https://t.co/ILYWy3fwy5

commoncrawl.org

The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content.

0

1

3

Common Crawl Foundation

@CommonCrawl

2 months

https://t.co/89gadm9nSR

commoncrawl.org

The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language...

0

3

Common Crawl Foundation

@CommonCrawl

2 months

https://t.co/3vgRWbeabQ "MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common

blogs.microsoft.com

Microsoft launches 2 initiatives to open Europe’s languages and culture, building on AI, cloud, and digital sovereignty commitments.

4

9

49

Catherine Arnett

@linguist_cat

3 months

One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data!

1

2

6

Common Crawl Foundation

@CommonCrawl

2 months

https://t.co/AZXtXMZXZi

commoncrawl.org

In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotat...

0

1

3