Dr. Stefan Baack | @tootbaack@infosec.exchange

@tweetbaack

Followers: 774 | Following: 28 | Media: 0 | Statuses: 18

This account is inactive. Please follow me on Mastodon at @tootbaack@infosec.exchange or Bluesky at @sbaack.com

Joined June 2009
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development https://t.co/mHO37yPT7l (1/10)
mozillafoundation.org
Mozilla research finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models.
2
36
124
@rasmus_kleis
Rasmus Kleis Nielsen
2 years
Excellent report from @tweetbaack @mozilla on Common Crawl, used to train many LLMs. Throwaway line for news publishers to ponder: "We will focus on the main crawl because the news crawl is rarely used by AI builders to train their LLMs (only once in our sample of 47 [models])."
0
2
7
@emilybell
emily bell
2 years
Really useful paper describing the use, effects and limitations of Common Crawl as a building block for LLMs
@Abebab
Abeba Birhane
2 years
in-depth dive into Common Crawl, the massive data dump from which training data for current state-of-the-art generative models is fetched. report includes background history, interviews with the "curators," & critical examination of underlying values & assumptions
0
1
3
@mmitchell_ai
MMitchell
2 years
Common Crawl data is likely used in most large language models (AI), as far as we know. This is *crucial* work.
1
14
100
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Long term, there should be less reliance on sources like Common Crawl and a bigger emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways (10/10)
0
2
3
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
A key issue is that filtered Common Crawl versions are not updated after their original publication to take feedback and criticism into account. We need dedicated intermediaries that filter Common Crawl in transparent and accountable ways and continuously update their filtered versions (9/10)
1
2
1
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
AI builders should put more effort into filtering Common Crawl and establish industry standards and best practices for end-user products to reduce potential harms when using Common Crawl or similar sources for training data (8/10)
1
1
2
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Both Common Crawl and AI builders can help make generative AI less harmful. Common Crawl should highlight the limitations and biases of its data, be more transparent and inclusive about its governance, and enforce transparency by requiring AI builders to attribute their use of Common Crawl (7/10)
1
2
1
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Due to Common Crawl’s deliberate lack of curation, AI builders need to filter it with care, but such care is often lacking. Popular filtered versions like C4 are especially problematic as their filtering techniques are simplistic and leave lots of harmful content untouched (6/10)
1
1
5
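
A minimal Python sketch of the kind of simplistic heuristics this refers to, loosely modeled on C4's published cleaning rules (line-level punctuation and length checks, then dropping any page that contains a blocklisted word); BLOCKLIST is a hypothetical stand-in for C4's actual word list:

    import re
    from typing import Optional

    # Hypothetical stand-in for C4's word blocklist.
    BLOCKLIST = {"badword1", "badword2"}

    def c4_style_filter(page_text: str) -> Optional[str]:
        kept = []
        for line in page_text.splitlines():
            line = line.strip()
            # Keep only lines ending in terminal punctuation...
            if not line.endswith((".", "!", "?", '"')):
                continue
            # ...and drop very short lines.
            if len(line.split()) < 5:
                continue
            kept.append(line)
        # Require a minimum number of surviving lines per page.
        if len(kept) < 3:
            return None
        text = "\n".join(kept)
        # Drop the ENTIRE page if any blocklisted word appears,
        # regardless of context.
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & BLOCKLIST:
            return None
        return text

The page-level blocklist check shows why such filtering is blunt: it misses harmful text that avoids the listed words while discarding benign pages that merely mention them.
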
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
In addition, relevant domains like Facebook and the New York Times block Common Crawl from crawling most (or all) of their pages. These blocks are increasing, creating new biases in the crawled data https://t.co/ZpDhkurpVR (5/10)
wired.com
Nearly 90 percent of top news outlets like 'The New York Times' now block AI data collection bots from OpenAI and others. Leading right-wing outlets like NewsMax and Breitbart mostly permit them.
1
2
4
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Common Crawl's archive is massive, but far from being a “copy of the internet.” Its crawls are automated to prioritize pages on domains that are frequently linked to, making digitally marginalized communities less likely to be included. Most captured content is English (4/10)
1
1
4
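
Concretely, Common Crawl derives domain ranks from the link structure of its web graph (it publishes harmonic-centrality-based rankings) and uses them to prioritize what gets crawled, so domains that few others link to may never be captured. A toy sketch of that ranking idea using networkx, with hypothetical domain names:

    import networkx as nx

    # Toy web graph: nodes are domains, directed edges are hyperlinks.
    G = nx.DiGraph([
        ("blog.example", "bigsite.example"),
        ("forum.example", "bigsite.example"),
        ("smallsite.example", "blog.example"),
        ("bigsite.example", "news.example"),
    ])

    # Harmonic centrality sums 1/distance over paths *to* each node, so
    # frequently linked-to domains score high and rarely linked domains
    # score low, making them correspondingly less likely to be crawled.
    for domain, score in sorted(nx.harmonic_centrality(G).items(),
                                key=lambda kv: -kv[1]):
        print(f"{score:5.2f}  {domain}")
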
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Using Common Crawl's data does not easily align with trustworthy and responsible AI development because Common Crawl deliberately does not curate its data. It doesn't remove hate speech, for example, because it wants its data to be useful for studying hate speech (3/10)
1
1
2
@tweetbaack
Dr. Stefan Baack | @tootbaack@infosec.exchange
2 years
Common Crawl is created by a nonprofit of the same name founded in 2007. Its mission is to level the playing field for technology development by giving free access to data that only companies like Google used to have. Providing data for AI training was never a primary goal (2/10)
1
1
2