William J.B. Mattingly (@wjb_mattingly)
4K followers · 5K following · 684 media posts · 4K statuses
Digital Nomad · Historian · Data Scientist · NLP · Machine Learning · Cultural Heritage Data Scientist @Yale · Former @SIDataScience · @huggingface Fellow 🤗
Fort Myers, FL · Joined May 2020
Want to do a full-finetune of Dots.OCR? I've got a fork working! It handles the conversion of data from PageXML (Transkribus) to Dots.OCR format for you! (Link down below). The first models are already on @huggingface and working as expected. Still training them.
[5 replies · 5 reposts · 61 likes]
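For context on the conversion step: Transkribus PageXML nests ground-truth transcriptions under TextRegion → TextLine → TextEquiv → Unicode elements. A minimal sketch of pulling line text out of such a file, assuming the 2013-07-15 PAGE namespace; the snippet and function are illustrative, not the fork's actual converter:

```python
import xml.etree.ElementTree as ET

# Tiny illustrative PageXML snippet (Transkribus-style export).
PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="1200" imageHeight="1800">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>First line of text</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Second line of text</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def pagexml_to_lines(xml_string):
    """Return the ground-truth transcription of each TextLine, in document order."""
    root = ET.fromstring(xml_string)
    lines = []
    for text_line in root.iterfind(".//pc:TextLine", NS):
        unicode_el = text_line.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines

print(pagexml_to_lines(PAGE_XML))  # → ['First line of text', 'Second line of text']
```

From here, pairing each extracted line (or full page) with its image crop gives the image/text pairs a Dots.OCR-style training format needs.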
Google Colab is officially coming to @code! ⚡️ You can now connect VS Code notebooks directly to @GoogleColab runtimes. Get the best of both worlds: the editor you love, powered by the compute (GPUs/TPUs) you need. → https://t.co/prgImNfEd2
[114 replies · 763 reposts · 5K likes]
Woah! Omnilingual ASR from Meta! As someone who works in low-resource language ASR, this looks incredible. Will certainly be testing this week!!
[2 replies · 2 reposts · 15 likes]
You can now run OCR over Yiddish archival pages with Qwen 3 VL! **Take the CER and WER with a grain of salt. The test data mimics the training data very closely. This should work very well for clean docs, but the big benefit of a VLM is that this knowledge will transfer to messy …
[1 reply · 1 repost · 7 likes]
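Since CER and WER come up repeatedly in this thread: both are Levenshtein edit distance divided by reference length, computed over characters and words respectively. A minimal sketch, not the actual evaluation code behind these numbers:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length in characters."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: same idea over whitespace-tokenized words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars → 0.5
```

This is also why a test set that mimics the training data flatters both metrics: the edits being counted are exactly the ones the model saw at training time.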
Finetuning the @LightOnIO OCR model on HF Jobs for better OCR on historic books. I think (as usual) the main gap to better open OCR models is more open OCR datasets for lots of domains and task types.
[2 replies · 4 reposts · 50 likes]
Vibe coding small Flask apps is my new favorite thing. I do it weekly. I can have Claude 4.5 designing a user-friendly way to interact with and modify the data in a database while my pipeline is processing the data. I only need one or two outputs to begin designing the actual …
[1 reply · 1 repost · 22 likes]
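A sketch of that pattern: a tiny Flask front-end over the SQLite database the pipeline writes into. The `records` table, its columns, and the routes here are hypothetical placeholders, not code from any of these apps:

```python
from flask import Flask, jsonify, request
import sqlite3

# Hypothetical database the pipeline writes into; table/columns are placeholders.
DB_PATH = "pipeline.db"

app = Flask(__name__)

def get_db():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row  # rows behave like dicts
    return conn

@app.route("/records")
def list_records():
    # Browse everything the pipeline has produced so far.
    with get_db() as conn:
        rows = conn.execute("SELECT id, text, status FROM records").fetchall()
    return jsonify([dict(r) for r in rows])

@app.route("/records/<int:record_id>", methods=["POST"])
def update_record(record_id):
    # Hand-correct a single row from a simple HTML form.
    with get_db() as conn:
        conn.execute("UPDATE records SET text = ? WHERE id = ?",
                     (request.form["text"], record_id))
    return jsonify({"ok": True})

# app.run(debug=True)  # start the dev server when running interactively
```

The nice property is that the app only reads and updates rows, so it can run alongside the pipeline without the two stepping on each other.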
If anyone wants to test it out: https://t.co/OUuKiKf6OE -- Let me know how it works. I'm also curious how it works on handwriting at this stage, though I don't expect it to be good.
[0 replies · 0 reposts · 1 like]
Qwen 3 VL-8B is still training, but working on page-level Hebrew with archival documents! The goal is to use this model to finetune further on handwriting. CER and WER on the test set are quite good. Archival documents that are very different from the training data are also showing …
[1 reply · 0 reposts · 7 likes]
Full-page Qwen 3 VL finetunes for Hebrew are coming soon! Still training, but 2B is looking pretty good on the held-out data with a CER of 1.7% and a WER of 6.4%. Models will be available on @huggingface @Alibaba_Qwen
[2 replies · 1 repost · 15 likes]
Does anyone know of a font for German shorthand? Or a way to programmatically go from German to German shorthand characters?
[0 replies · 0 reposts · 0 likes]
Does anyone have a dataset of 1,000+ pages of handwritten text on Transkribus that they want to use for finetuning a VLM? If so, please let me know. This would be for any language and any script.
[2 replies · 3 reposts · 3 likes]
Training some Qwen 3-VL finetunes that will work better on English archival documents, including handwriting from the 18th century to the present. I've seen a few spots already where this model finds errors in the ground truth.
[2 replies · 1 repost · 17 likes]
Just got access to ~600,000 archival documents that have been manually corrected. These are handwritten, cursive, and typed. Get ready for some serious GLAM Qwen 3 VL finetunes! =)
[5 replies · 0 reposts · 32 likes]
Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @huggingface. The first versions of the models are now available on the Small Models for GLAM organization with @vanstriendaniel! (Link below.) These are designed to work …
[4 replies · 13 reposts · 105 likes]