Dr. Stefan Baack | @[email protected]
@tweetbaack
Followers 774 · Following 28 · Media 0 · Statuses 18
This account is inactive. Please follow me on Mastodon at @[email protected] or Bluesky at @sbaack.com
Joined June 2009
Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development https://t.co/mHO37yPT7l (1/10)
mozillafoundation.org
Mozilla research finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models.
2 replies · 36 retweets · 124 likes
Excellent report from @tweetbaack @mozilla on Common Crawl, used to train many LLMs. Throwaway line for news publishers to ponder: "We will focus on the main crawl because the news crawl is rarely used by AI builders to train their LLMs (only once in our sample of 47 [models])."
0 replies · 2 retweets · 7 likes
Really useful paper describing the use, effects and limitations of Common Crawl as a building block for LLMs
in-depth dive into the Common Crawl, the massive data dump where training data for current SOTA generative models is fetched from. report includes background history, interviews with the "curators," & critical examination of underlying values & assumptions
0 replies · 1 retweet · 3 likes
Common Crawl data is likely used in most large language models (AI), as far as we know. This is *crucial* work.
1 reply · 14 retweets · 100 likes
Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways (10/10)
0 replies · 2 retweets · 3 likes
A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. We need dedicated intermediaries that filter Common Crawl in transparent and accountable ways and keep those filtered versions continuously updated (9/10)
1 reply · 2 retweets · 1 like
AI builders should put more effort into filtering Common Crawl and establish industry standards and best practices for end-user products to reduce potential harms when using Common Crawl or similar sources as training data (8/10)
1 reply · 1 retweet · 2 likes
Both Common Crawl and AI builders can help make generative AI less harmful. Common Crawl should highlight the limitations and biases of its data, be more transparent and inclusive about its governance, and enforce transparency by requiring attribution whenever its data is used (7/10)
1 reply · 2 retweets · 1 like
Due to Common Crawl’s deliberate lack of curation, AI builders need to filter it with care, but such care is often lacking. Popular filtered versions like C4 are especially problematic as their filtering techniques are simplistic and leave lots of harmful content untouched (6/10)
1 reply · 1 retweet · 5 likes
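To make the criticism in the tweet above concrete, here is a minimal Python sketch of the blocklist-style filtering that C4 relies on. The list entries and function name are hypothetical placeholders; C4's actual "bad words" list is far longer, but the mechanism (drop any page containing a listed word) is the same.

```python
# A minimal sketch of C4-style blocklist filtering. BLOCKLIST and keep_page
# are hypothetical placeholders; the real list is much longer, but the
# mechanism is identical.
BLOCKLIST = {"badword1", "badword2"}  # placeholder entries

def keep_page(text: str) -> bool:
    """Keep a page only if none of its words appear on the blocklist."""
    words = set(text.lower().split())
    return not (words & BLOCKLIST)

# Weakness 1: harmful content phrased without any listed word passes through.
# Weakness 2: benign pages (e.g., medical or community content) that happen
# to mention a listed word are silently dropped, skewing the corpus.
print(keep_page("an innocuous sentence"))           # True
print(keep_page("a sentence containing badword1"))  # False
```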
In addition, relevant domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages. These blocks are increasing, creating new biases in the crawled data https://t.co/ZpDhkurpVR (5/10)
wired.com
Nearly 90 percent of top news outlets like 'The New York Times' now block AI data collection bots from OpenAI and others. Leading right-wing outlets like NewsMax and Breitbart mostly permit them.
1 reply · 2 retweets · 4 likes
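These blocks are typically implemented via robots.txt rules targeting Common Crawl's crawler, whose user agent is CCBot. As a rough sketch, Python's standard urllib.robotparser can check whether a given site currently blocks it (the URL below is only an example):

```python
from urllib.robotparser import RobotFileParser

# Check whether a site's robots.txt blocks Common Crawl's crawler (CCBot).
# The target site is an example; substitute any domain you want to inspect.
rp = RobotFileParser("https://www.nytimes.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt
print(rp.can_fetch("CCBot", "https://www.nytimes.com/"))  # False means blocked
```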
Common Crawl's archive is massive, but far from being a “copy of the internet.” Its crawls are automated to prioritize pages on domains that are frequently linked to, making digitally marginalized communities less likely to be included. Most captured content is English (4/10)
1 reply · 1 retweet · 4 likes
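Common Crawl has published domain rankings based on harmonic centrality in its web graph, one link-based measure behind this prioritization: domains nobody links to score low and are unlikely to be crawled. A toy illustration with networkx follows; the graph and domain names are invented:

```python
import networkx as nx

# Toy web graph: an edge points from a linking domain to the linked domain.
# Domains and links are invented for illustration only.
G = nx.DiGraph([
    ("blog-a.example", "bigsite.example"),
    ("blog-b.example", "bigsite.example"),
    ("niche-community.example", "bigsite.example"),
    ("bigsite.example", "niche-community.example"),
])

# Harmonic centrality rewards domains reachable from many others via short
# link paths; the two blogs, which nothing links to, score 0.
for domain, score in sorted(nx.harmonic_centrality(G).items(),
                            key=lambda kv: -kv[1]):
    print(f"{domain}: {score:.2f}")
```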
Using Common Crawl's data does not easily align with trustworthy and responsible AI development because Common Crawl deliberately does not curate its data. It doesn't remove hate speech, for example, because it wants its data to be useful for studying hate speech (3/10)
1 reply · 1 retweet · 2 likes
Common Crawl is produced by a nonprofit of the same name founded in 2007. Its mission is to level the playing field for technology development by giving free access to data that only companies like Google used to have. Providing data for AI training was never a primary goal (2/10)
1 reply · 1 retweet · 2 likes