Guilherme Penedo Profile
Guilherme Penedo

@gui_penedo

Followers: 4K · Following: 27K · Media: 54 · Statuses: 976

Pre-training data @huggingface 🤗. Lisboeta 🇵🇹

Paris 🇫🇷
Joined April 2012
@gui_penedo
Guilherme Penedo
2 months
New dataset release: 🌐FineWiki This is an updated and better extracted version of Wikipedia, covering 325+ languages. Unlike the old dataset from 2023, we kept all the math content, tables, properly rendered templates, and extracted key facts. Examples and highlights below.
17
76
554
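A minimal sketch of streaming the release with the `datasets` library; the repo id `HuggingFaceFW/finewiki` and the `pt` language config are assumptions here, so check the dataset card for the actual identifiers and schema.

```python
# Hedged sketch: stream a few FineWiki articles and inspect the fields.
# Repo id and config name are assumptions, not confirmed by the announcement.
from datasets import load_dataset

wiki = load_dataset("HuggingFaceFW/finewiki", "pt", split="train", streaming=True)

for article in wiki.take(3):
    print(sorted(article.keys()))  # inspect the available text/metadata fields
```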
@clefourrier
Clémentine Fourrier 🍊 is off till Dec 2026 hiking
9 days
Hey twitter! I'm releasing the LLM Evaluation Guidebook v2! Updated, nicer to read, interactive graphics, etc! https://t.co/xG4VQOj2wN After this, I'm off: I'm taking a sabbatical to go hike with my dogs :D (back @huggingface in Dec *2026*) See you all next year!
22
163
983
@gui_penedo
Guilherme Penedo
11 days
Remi is super knowledgeable and did great things at 🤗 HF, can't wait to see what cool things he'll do at UMA!
@RemiCadene
Remi Cadene
11 days
Humanity is at a turning point. I am launching UMA to build general-purpose mobile and humanoid robots from Europe. Proud to start with people I admired for years, and grateful for all your support! Reach out to us @UMA_Robots ❤️
1
1
14
@gui_penedo
Guilherme Penedo
1 month
We have just released 📄FinePDFs-Edu, a version of FinePDFs filtered with the FineWeb-Edu approach using ModernBERT and mmBERT. 350B+ tokens of top-tier mid-training data in multiple languages. You can also download the classifiers (all 69 of them!)
@HKydlicek
Hynek Kydlíček
1 month
We're releasing a large update to 📄FinePDFs!
- 350B+ highly educational tokens in 69 languages, with incredible perf 🚀
- 69 edu classifiers, powered by ModernBERT and mmBERT
- 300k+ EDU annotations for each of the 69 languages, from Qwen3-235B
0
3
34
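A hedged sketch of scoring documents with one of the per-language edu classifiers via `transformers`; the model repo id below is hypothetical, and the exact label/score format depends on how each classifier head was exported, so treat this as illustrative only.

```python
# Hedged sketch: run one of the (hypothetical) per-language edu classifiers.
from transformers import pipeline

edu_scorer = pipeline(
    "text-classification",
    model="HuggingFaceFW/finepdfs-edu-classifier-eng",  # hypothetical repo id
)

docs = [
    "An introduction to linear algebra for engineering students.",
    "Buy cheap watches online, free shipping worldwide!!!",
]
for doc, pred in zip(docs, edu_scorer(docs, truncation=True)):
    print(f"{pred['label']} ({pred['score']:.3f}): {doc[:40]}")
```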
@LoubnaBenAllal1
Loubna Ben Allal
1 month
After ~4 years building SOTA models & datasets, we're sharing everything we learned in ⚡The Smol Training Playbook. We cover the full LLM cycle: designing ablations, choosing an architecture, curating data, post-training, and building solid infrastructure. We'll help you
36
159
1K
@gui_penedo
Guilherme Penedo
2 months
The dataset itself: https://t.co/8VwFzul4fo
1
4
26
@gui_penedo
Guilherme Penedo
2 months
We made a small space where you can explore the dataset: https://t.co/Cot9f4BLDV
1
1
16
@gui_penedo
Guilherme Penedo
2 months
Filtering
Wikipedias for low-resource languages contain a lot of content in other languages, particularly English, so we apply language- and script-aware checks to each wiki. This can be useful if you want to train a language classifier on Wikipedia, for example.
1
0
13
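A minimal sketch of the kind of script-aware check described above: keep a paragraph only if most of its letters fall in the wiki's expected Unicode script. The `unicodedata`-based heuristic and the 0.8 threshold are illustrative assumptions, not the released pipeline.

```python
# Hedged sketch: fraction of alphabetic characters in the expected script.
import unicodedata

def script_ratio(text: str, script_prefix: str) -> float:
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names start with the script, e.g. "GREEK SMALL LETTER ALPHA"
    in_script = sum(unicodedata.name(c, "").startswith(script_prefix) for c in letters)
    return in_script / len(letters)

paragraph = "Γειά σου κόσμε. Hello world."
print(round(script_ratio(paragraph, "GREEK"), 2))   # mixed Greek/Latin paragraph
keep = script_ratio(paragraph, "GREEK") >= 0.8      # illustrative threshold
```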
@gui_penedo
Guilherme Penedo
2 months
Infoboxes
We don't keep infoboxes in the main text, but we extracted them as key-value pairs in the metadata so they can serve as an info summary for each article. You can use these for quick facts/question answering.
1
0
9
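A toy sketch of using the key-value infobox metadata as quick facts; the field name `infobox` and its exact structure are assumptions, so consult the dataset card for the real schema.

```python
# Hedged sketch: turn key-value infobox metadata into quick-fact strings.
article = {
    "title": "Lisbon",  # illustrative example record, not the real schema
    "infobox": {"Country": "Portugal", "Population": "548,703", "Founded": "c. 1200 BC"},
}

facts = [f"{article['title']} - {key}: {value}" for key, value in article["infobox"].items()]
print("\n".join(facts))
# Lisbon - Country: Portugal
# Lisbon - Population: 548,703
# Lisbon - Founded: c. 1200 BC
```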
@gui_penedo
Guilherme Penedo
2 months
Content filtering
Sections with no actual article content (e.g., "References", "Notes", "External links", localized per language) are excluded. Visual/navigation boilerplate (ToC, navboxes, message boxes, authority control) is also removed.
1
0
11
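An illustrative sketch of the section filter described above: drop sections whose localized heading matches a blocklist. The heading sets here are a tiny made-up subset, not the actual per-language lists.

```python
# Hedged sketch: exclude non-article sections by localized heading.
EXCLUDED_HEADINGS = {
    "en": {"references", "notes", "external links", "see also"},
    "fr": {"références", "notes", "liens externes", "voir aussi"},
}

def keep_section(heading: str, lang: str) -> bool:
    return heading.strip().lower() not in EXCLUDED_HEADINGS.get(lang, set())

sections = [("History", "..."), ("External links", "...")]
kept = [(h, body) for h, body in sections if keep_section(h, "en")]
print([h for h, _ in kept])  # ['History']
```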
@gui_penedo
Guilherme Penedo
2 months
Tables
A lot of Wikipedia content is locked inside tables, so we took extra care to extract them into well-formatted markdown. While tables may not help on natural language understanding benchmarks, they contain most of the relevant facts in many articles.
1
0
14
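A toy illustration of the extraction target: a small table rendered as GitHub-style markdown. This is only a sketch of the output format, not the released extraction code.

```python
# Hedged sketch: render a small table as markdown, the target format described above.
rows = [
    ["Year", "Population"],
    ["2001", "564,657"],
    ["2011", "547,733"],
]

header, *body = rows
md = ["| " + " | ".join(header) + " |",
      "| " + " | ".join(["---"] * len(header)) + " |"]
md += ["| " + " | ".join(row) + " |" for row in body]
print("\n".join(md))
```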
@gui_penedo
Guilherme Penedo
2 months
Math
We carefully extracted math and LaTeX content. Previously, you could train a model on the Wikipedia page of some famous equation and the equation itself would not actually be in the text. A metadata flag `has_math` also makes it easy to filter for math content.
1
0
14
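A minimal sketch of filtering on the `has_math` flag with `datasets`; the repo id, config name, and the flag's exact place in the schema (top level vs. nested metadata) are assumptions.

```python
# Hedged sketch: keep only math-heavy articles via the `has_math` metadata flag.
from datasets import load_dataset

wiki = load_dataset("HuggingFaceFW/finewiki", "en", split="train", streaming=True)
math_articles = wiki.filter(lambda ex: ex.get("has_math", False))

for article in math_articles.take(2):
    print(article.get("title", "<no title field>"))  # field names are assumptions
```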
@gui_penedo
Guilherme Penedo
2 months
Templates
Wikipedia markup has a lot of "function calls" (templates) that render specific pieces of text, e.g. {{s-|XV}} -> "XVe siècle" (French for "15th century"). Most pipelines drop them. We processed HTML dumps directly, where these are already rendered.
1
0
16
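A toy illustration of the difference: raw wikitext keeps the unrendered template call, while the HTML dump already contains the rendered text. `BeautifulSoup` is used here only for the illustration, not as the released tooling.

```python
# Hedged sketch: the same sentence as raw wikitext vs. as an already-rendered HTML dump.
from bs4 import BeautifulSoup

wikitext = "Château construit au {{s-|XV}}."                      # template call, unrendered
html_dump = "<p>Château construit au XV<sup>e</sup> siècle.</p>"  # HTML dump: already rendered

print(BeautifulSoup(html_dump, "html.parser").get_text())
# -> Château construit au XVe siècle.
```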
@gui_penedo
Guilherme Penedo
2 months
There's a lot of hype around OCR models right now, but how do you actually use them to make a useful dataset? Today, we're releasing our codebase and additional data for 📄FinePDFs!
@HKydlicek
Hynek Kydlíček
2 months
We’re releasing the full FinePdfs source code — plus new datasets and models! 🚀
📚 Datasets:
• OCR-Annotations — 1.6k PDFs labeled for OCR need
• Gemma-LID-Annotation — 20k samples per language (annotated with Gemma3-27B)
🤖 Models:
• XGB-OCR — OCR classifier for PDFs
0
2
45
@HKydlicek
Hynek Kydlíček
2 months
Did I hear correctly that @gui_penedo announced that we will have free HF merch at our COLM poster session 🤔
0
2
9
@gui_penedo
Guilherme Penedo
2 months
We're in 🇨🇦 for @COLM_conf
Come talk to me and @HKydlicek on Wednesday at the FineWeb2 poster session (Session 3, Poster #58)
@LoubnaBenAllal1 will be on Session 5, Poster #23 (SmolLM2)
1
6
62
@gui_penedo
Guilherme Penedo
3 months
Incredible to see FineWeb2 used in this amazing model that will power many many many use cases
@ruyimarone
Marc Marone
3 months
XLM-R is an incredible model but we’ve learned so much since it came out - architecture, data, domains, and broader language coverage. This is our take and we can’t wait to see what folks build with it!
0
2
31
@lvwerra
Leandro von Werra
3 months
The FinePDFs pipeline really shows the massive scale needed for LLM data processing:
Raw data: 1.35B PDFs (1.2 PB) → $50k/mo storage
Extraction:
> rolmOCR: 368M docs → 250k GPUh, $750k
> docling: 918M docs → 2.4M vCPUh, $35k
Total liberation cost (incl. ablations): ~$1M
14
11
161
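A back-of-the-envelope check of the quoted figures, purely as illustrative arithmetic: the unit costs implied by the totals above.

```python
# Hedged sketch: implied per-hour costs from the totals quoted in the tweet.
rolmocr_gpu_hours, rolmocr_cost = 250_000, 750_000
docling_cpu_hours, docling_cost = 2_400_000, 35_000

print(f"~${rolmocr_cost / rolmocr_gpu_hours:.2f} per GPU-hour (rolmOCR)")    # ~$3.00
print(f"~${docling_cost / docling_cpu_hours:.3f} per vCPU-hour (docling)")   # ~$0.015
```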
@HKydlicek
Hynek Kydlíček
3 months
We are releasing 📄FinePDFs: the largest PDF dataset, spanning over half a billion documents!
- Long context: documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science
- Heavily improves over SoTA when mixed with FW-EDU & DCLM web corpora
24
116
715