William J.B. Mattingly

@wjb_mattingly

Followers 4K · Following 5K · Media 684 · Statuses 4K

Digital Nomad · Historian · Data Scientist · NLP · Machine Learning · Cultural Heritage Data Scientist @Yale · Former @SIDataScience · @huggingface Fellow 🤗

Fort Myers, FL
Joined May 2020
@wjb_mattingly
William J.B. Mattingly
3 months
Want to do a full-finetune of Dots.OCR? I've got a fork working! It handles the conversion of data from PageXML (Transkribus) to Dots.OCR format for you! (Link down below). The first models are already on @huggingface and working as expected. Still training them.
5
5
61
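The fork's actual conversion code isn't shown in the post, but the PageXML-parsing half can be sketched. Below is a minimal, hedged illustration: the namespace and element names follow the common PRImA PAGE schema that Transkribus exports, while the output record shape is an assumption for illustration, not the real Dots.OCR training format.

```python
# Hypothetical sketch: pull line-level text and bounding boxes out of a
# Transkribus PAGE-XML export. Element/namespace names follow the common
# PRImA PAGE schema; the output dict shape is an assumption, not the
# actual Dots.OCR format used by the fork.
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def parse_page_xml(xml_text):
    """Return one record per TextLine: its bbox and its transcription."""
    root = ET.fromstring(xml_text)
    records = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)
        # Coords points look like "x1,y1 x2,y2 ..."; reduce to a bbox.
        pts = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
        xs, ys = zip(*pts)
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = ""
        if unicode_el is not None and unicode_el.text:
            text = unicode_el.text
        records.append({"bbox": [min(xs), min(ys), max(xs), max(ys)],
                        "text": text})
    return records
```

From here the records would still need to be serialized into whatever prompt/label layout the target model expects.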
@googledevs
Google for Developers
4 days
Google Colab is officially coming to @code! ⚡️ You can now connect VS Code notebooks directly to @GoogleColab runtimes. Get the best of both worlds: the editor you love, powered by the compute (GPUs/TPUs) you need. → https://t.co/prgImNfEd2
114
763
5K
@wjb_mattingly
William J.B. Mattingly
13 hours
In Warsaw for the week. Old Town is beautiful!
0
0
4
@wjb_mattingly
William J.B. Mattingly
6 days
Woah! Omnilingual ASR from Meta! As someone who works in low-resource language ASR, this looks incredible. Will certainly be testing this week!!
2
2
15
@wjb_mattingly
William J.B. Mattingly
11 days
Model:
huggingface.co
0
0
0
@wjb_mattingly
William J.B. Mattingly
11 days
You can now run OCR over Yiddish archival pages with Qwen 3 VL! **Take the CER and WER with a grain of salt. The test data mimics the training data very closely. This should work very well for clean docs, but the big benefit of a VLM is that its knowledge will transfer to messy …
1
1
7
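For readers new to the acronyms: CER and WER are the standard edit-distance error rates over characters and words. A minimal self-contained sketch of the textbook definition (not the evaluation script actually used for these models):

```python
# Character/word error rate via Levenshtein distance.
# Textbook definition, sketched for illustration only.

def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits / reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: same recurrence over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)
```

The "grain of salt" caveat above is about the test split, not the metric: when test pages mimic training pages, both numbers understate error on genuinely out-of-domain material.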
@vanstriendaniel
Daniel van Strien
13 days
Finetuning @LightOnIO OCR model on HF Jobs for better OCR on historic books. Think (as usual) the main gap to better open OCR models is more open OCR datasets for lots of domains and task types.
2
4
50
@wjb_mattingly
William J.B. Mattingly
13 days
Vibe coding small Flask apps is my new favorite thing. I do it weekly. I can have Claude 4.5 designing a user-friendly way to interact with and modify the data in a database while my pipeline is processing the data. I only need one or two outputs to begin designing the actual …
1
1
22
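The pattern described above can be sketched in a few lines. This is a hedged illustration only, not the actual app: the table and column names ("records", "status") and the database path are invented, and a real viewer would render templates rather than inline HTML.

```python
# Hypothetical sketch: a throwaway Flask viewer over a SQLite table
# that a pipeline is filling in. Table/column names are invented.
import sqlite3
from flask import Flask

def create_app(db_path):
    app = Flask(__name__)

    def query(sql, args=()):
        # One short-lived connection per call; fine for a throwaway tool.
        with sqlite3.connect(db_path) as con:
            con.row_factory = sqlite3.Row
            return con.execute(sql, args).fetchall()

    @app.route("/")
    def index():
        rows = query("SELECT id, status FROM records ORDER BY id")
        items = "".join(f"<li>#{r['id']}: {r['status']}</li>" for r in rows)
        return f"<ul>{items}</ul>"

    @app.route("/flag/<int:rec_id>", methods=["POST"])
    def flag(rec_id):
        # Modify the data in place: mark a record for review.
        query("UPDATE records SET status = 'flagged' WHERE id = ?", (rec_id,))
        return ("", 204)

    return app

if __name__ == "__main__":
    create_app("pipeline.db").run(debug=True)  # assumed path
```

Because the app only reads and writes through SQLite, it can run alongside the pipeline process without sharing any Python state.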
@wjb_mattingly
William J.B. Mattingly
18 days
If anyone wants to test it out: https://t.co/OUuKiKf6OE -- Let me know how it works. I'm also curious how it works on handwriting at this stage, though I don't expect it to be good.
0
0
1
@wjb_mattingly
William J.B. Mattingly
18 days
Qwen 3 VL-8B is still training, but working on page-level Hebrew with archival documents! The goal is to use this model to finetune further on handwriting. CER and WER on the test set are quite good. Archival documents that are very different from the training data are also showing …
1
0
7
@wjb_mattingly
William J.B. Mattingly
19 days
Full-page Qwen 3 VL finetunes for Hebrew are coming soon! Still training, but 2B is looking pretty good on the held-out data with a CER of 1.7% and a WER of 6.4%. Models will be available on @huggingface @Alibaba_Qwen
2
1
15
@wjb_mattingly
William J.B. Mattingly
20 days
Does anyone know of a font for German shorthand? Or a way to programmatically go from German to German shorthand characters?
0
0
0
@wjb_mattingly
William J.B. Mattingly
21 days
Does anyone have a dataset of 1,000+ pages of handwritten text on Transkribus that they want to use for finetuning a VLM? If so, please let me know. This would be for any language and any script.
2
3
3
@wjb_mattingly
William J.B. Mattingly
21 days
Training some Qwen 3-VL finetunes that will work better on English archival documents, including handwriting from the 18th century to the present. I've seen a few spots already where this model finds errors in the ground truth.
2
1
17
@wjb_mattingly
William J.B. Mattingly
23 days
Just got access to ~600,000 archival documents that have been manually corrected. These are handwritten, cursive, and typed. Get ready for some serious GLAM Qwen 3 VL finetunes! =)
5
0
32
@wjb_mattingly
William J.B. Mattingly
24 days
[Three media-only posts; no text content recovered.]
@wjb_mattingly
William J.B. Mattingly
24 days
Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @huggingface. The first versions of the models are now available on the Small Models for GLAM organization with @vanstriendaniel! (Link below). These are designed to work …
4
13
105