Dr. Stefan Baack | @[email protected]
@tweetbaack
Followers 774 · Following 28 · Media 0 · Statuses 18
This account is inactive. Please follow me on Mastodon at @[email protected] or Bluesky at @sbaack.com
Joined June 2009
Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development https://t.co/mHO37yPT7l (1/10)
mozillafoundation.org
Mozilla research finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models.
2 replies · 36 retweets · 124 likes
Excellent report from @tweetbaack @mozilla on Common Crawl, used to train many LLMs. Throwaway line for news publishers to ponder: "We will focus on the main crawl because the news crawl is rarely used by AI builders to train their LLMs (only once in our sample of 47 [models])."
0 replies · 2 retweets · 7 likes
Really useful paper describing the use, effects and limitations of Common Crawl as a building block for LLMs
in-depth dive into the Common Crawl, the massive data dump where training data for current SOTA generative models is fetched from. report includes background history, interviews with the "curators," & critical examination of underlying values & assumptions
0 replies · 1 retweet · 3 likes
Common Crawl data is likely used in most large language models (AI), as far as we know. This is *crucial* work.
1 reply · 14 retweets · 100 likes
Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways (10/10)
0 replies · 2 retweets · 3 likes
A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. We need dedicated intermediaries that filter Common Crawl in transparent and accountable ways and keep those filtered versions continuously updated (9/10)
1 reply · 2 retweets · 1 like
AI builders should put more effort into filtering Common Crawl and establish industry standards and best practices for end-user products to reduce potential harms when using Common Crawl or similar sources as training data (8/10)
1 reply · 1 retweet · 2 likes
Both Common Crawl and AI builders can help make generative AI less harmful. Common Crawl should highlight the limitations and biases of its data, be more transparent and inclusive about its governance, and enforce transparency by requiring attribution whenever its data is used (7/10)
1 reply · 2 retweets · 1 like
Due to Common Crawl’s deliberate lack of curation, AI builders need to filter it with care, but such care is often lacking. Popular filtered versions like C4 are especially problematic as their filtering techniques are simplistic and leave lots of harmful content untouched (6/10)
1 reply · 1 retweet · 5 likes
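To make the criticism in the tweet above concrete, here is a minimal Python sketch of the blocklist-style filtering that C4 relies on. The list entries and function name are hypothetical placeholders; C4's actual "bad words" list is far longer, but the mechanism (drop any page containing a listed word) is the same.

```python
# A minimal sketch of C4-style blocklist filtering. BLOCKLIST and keep_page
# are hypothetical placeholders; the real list is much longer, but the
# mechanism is identical.
BLOCKLIST = {"badword1", "badword2"}  # placeholder entries

def keep_page(text: str) -> bool:
    """Keep a page only if none of its words appear on the blocklist."""
    words = set(text.lower().split())
    return not (words & BLOCKLIST)

# Weakness 1: harmful content phrased without any listed word passes through.
# Weakness 2: benign pages (e.g., medical or community content) that happen
# to mention a listed word are silently dropped, skewing the corpus.
print(keep_page("an innocuous sentence"))           # True
print(keep_page("a sentence containing badword1"))  # False
```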
In addition, relevant domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages. These blocks are increasing, creating new biases in the crawled data https://t.co/ZpDhkurpVR (5/10)
wired.com
Nearly 90 percent of top news outlets like 'The New York Times' now block AI data collection bots from OpenAI and others. Leading right-wing outlets like NewsMax and Breitbart mostly permit them.
1 reply · 2 retweets · 4 likes
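These blocks are typically implemented via robots.txt rules targeting Common Crawl's crawler, whose user agent is CCBot. As a rough sketch, Python's standard urllib.robotparser can check whether a given site currently blocks it (the URL below is only an example):

```python
from urllib.robotparser import RobotFileParser

# Check whether a site's robots.txt blocks Common Crawl's crawler (CCBot).
# The target site is an example; substitute any domain you want to inspect.
rp = RobotFileParser("https://www.nytimes.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt
print(rp.can_fetch("CCBot", "https://www.nytimes.com/"))  # False means blocked
```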
Common Crawl's archive is massive, but far from being a “copy of the internet.” Its crawls are automated to prioritize pages on domains that are frequently linked to, making digitally marginalized communities less likely to be included. Most captured content is English (4/10)
1 reply · 1 retweet · 4 likes
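Common Crawl has published domain rankings based on harmonic centrality in its web graph, one link-based measure behind this prioritization: domains nobody links to score low and are unlikely to be crawled. A toy illustration with networkx follows; the graph and domain names are invented:

```python
import networkx as nx

# Toy web graph: an edge points from a linking domain to the linked domain.
# Domains and links are invented for illustration only.
G = nx.DiGraph([
    ("blog-a.example", "bigsite.example"),
    ("blog-b.example", "bigsite.example"),
    ("niche-community.example", "bigsite.example"),
    ("bigsite.example", "niche-community.example"),
])

# Harmonic centrality rewards domains reachable from many others via short
# link paths; the two blogs, which nothing links to, score 0.
for domain, score in sorted(nx.harmonic_centrality(G).items(),
                            key=lambda kv: -kv[1]):
    print(f"{domain}: {score:.2f}")
```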
Using Common Crawl's data does not easily align with trustworthy and responsible AI development because Common Crawl deliberately does not curate its data. It doesn't remove hate speech, for example, because it wants its data to be useful for studying hate speech (3/10)
1 reply · 1 retweet · 2 likes
Common Crawl is produced by a nonprofit of the same name founded in 2007. Its mission is to level the playing field for technology development by giving free access to data that only companies like Google used to have. Providing data for AI training was never a primary goal (2/10)
1 reply · 1 retweet · 2 likes