William J.B. Mattingly (@wjb_mattingly)
4K followers · 5K following · 684 media posts · 4K statuses
Digital Nomad · Historian · Data Scientist · NLP · Machine Learning · Cultural Heritage Data Scientist @Yale · Former @SIDataScience · @huggingface Fellow 🤗
Fort Myers, FL · Joined May 2020
Want to do a full-finetune of Dots.OCR? I've got a fork working! It handles the conversion of data from PageXML (Transkribus) to Dots.OCR format for you! (Link down below). The first models are already on @huggingface and working as expected. Still training them.
[5 replies · 5 reposts · 61 likes]
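For context on the conversion step: Transkribus PageXML nests ground-truth transcriptions under TextRegion → TextLine → TextEquiv → Unicode elements. A minimal sketch of pulling line text out of such a file, assuming the 2013-07-15 PAGE namespace; the snippet and function are illustrative, not the fork's actual converter:

```python
import xml.etree.ElementTree as ET

# Tiny illustrative PageXML snippet (Transkribus-style export).
PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="1200" imageHeight="1800">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>First line of text</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Second line of text</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def pagexml_to_lines(xml_string):
    """Return the ground-truth transcription of each TextLine, in document order."""
    root = ET.fromstring(xml_string)
    lines = []
    for text_line in root.iterfind(".//pc:TextLine", NS):
        unicode_el = text_line.find("pc:TextEquiv/pc:Unicode", NS)
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines

print(pagexml_to_lines(PAGE_XML))  # → ['First line of text', 'Second line of text']
```

From here, pairing each extracted line (or full page) with its image crop gives the image/text pairs a Dots.OCR-style training format needs.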
Google Colab is officially coming to @code! ⚡️ You can now connect VS Code notebooks directly to @GoogleColab runtimes. Get the best of both worlds: the editor you love, powered by the compute (GPUs/TPUs) you need. → https://t.co/prgImNfEd2
[114 replies · 763 reposts · 5K likes]
Woah! Omnilingual ASR from Meta! As someone who works in low-resource language ASR, this looks incredible. Will certainly be testing this week!!
[2 replies · 2 reposts · 15 likes]
You can now run OCR over Yiddish archival pages with Qwen 3 VL! **Take the CER and WER with a grain of salt. The test data mimics the training data very closely. This should work very well for clean docs, but the big benefit of a VLM is that this knowledge will transfer to messy …
[1 reply · 1 repost · 7 likes]
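Since CER and WER come up repeatedly in this thread: both are Levenshtein edit distance divided by reference length, computed over characters and words respectively. A minimal sketch, not the actual evaluation code behind these numbers:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length in characters."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: same idea over whitespace-tokenized words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars → 0.5
```

This is also why a test set that mimics the training data flatters both metrics: the edits being counted are exactly the ones the model saw at training time.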
Finetuning the @LightOnIO OCR model on HF Jobs for better OCR on historic books. I think (as usual) the main gap to better open OCR models is more open OCR datasets for lots of domains and task types.
[2 replies · 4 reposts · 50 likes]
Vibe coding small Flask apps is my new favorite thing. I do it weekly. I can have Claude 4.5 designing a user-friendly way to interact with and modify the data in a database while my pipeline is processing the data. I only need one or two outputs to begin designing the actual …
[1 reply · 1 repost · 22 likes]
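A sketch of that pattern: a tiny Flask front-end over the SQLite database the pipeline writes into. The `records` table, its columns, and the routes here are hypothetical placeholders, not code from any of these apps:

```python
from flask import Flask, jsonify, request
import sqlite3

# Hypothetical database the pipeline writes into; table/columns are placeholders.
DB_PATH = "pipeline.db"

app = Flask(__name__)

def get_db():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row  # rows behave like dicts
    return conn

@app.route("/records")
def list_records():
    # Browse everything the pipeline has produced so far.
    with get_db() as conn:
        rows = conn.execute("SELECT id, text, status FROM records").fetchall()
    return jsonify([dict(r) for r in rows])

@app.route("/records/<int:record_id>", methods=["POST"])
def update_record(record_id):
    # Hand-correct a single row from a simple HTML form.
    with get_db() as conn:
        conn.execute("UPDATE records SET text = ? WHERE id = ?",
                     (request.form["text"], record_id))
    return jsonify({"ok": True})

# app.run(debug=True)  # start the dev server when running interactively
```

The nice property is that the app only reads and updates rows, so it can run alongside the pipeline without the two stepping on each other.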
If anyone wants to test it out: https://t.co/OUuKiKf6OE -- Let me know how it works. I'm also curious how it works on handwriting at this stage, though I don't expect it to be good.
[0 replies · 0 reposts · 1 like]
Qwen 3 VL-8B is still training, but working on page-level Hebrew with archival documents! The goal is to use this model to finetune further on handwriting. CER and WER on the test set are quite good. Archival documents that are very different from the training data are also showing …
[1 reply · 0 reposts · 7 likes]
Full-page Qwen 3 VL finetunes for Hebrew are coming soon! Still training, but 2B is looking pretty good on the held-out data with a CER of 1.7% and a WER of 6.4%. Models will be available on @huggingface @Alibaba_Qwen
[2 replies · 1 repost · 15 likes]
Does anyone know of a font for German shorthand? Or a way to programmatically go from German to German shorthand characters?
[0 replies · 0 reposts · 0 likes]
Does anyone have a dataset of 1,000+ pages of handwritten text on Transkribus that they want to use for finetuning a VLM? If so, please let me know. This would be for any language and any script.
[2 replies · 3 reposts · 3 likes]
Training some Qwen 3-VL finetunes that will work better on English archival documents, including handwriting from the 18th century to the present. I've seen a few spots already where this model finds errors in the ground truth.
[2 replies · 1 repost · 17 likes]
Just got access to ~600,000 archival documents that have been manually corrected. These are handwritten, cursive, and typed. Get ready for some serious GLAM Qwen 3 VL finetunes! =)
[5 replies · 0 reposts · 32 likes]
Over the last 24 hours, I have finetuned three Qwen3-VL models (2B, 4B, and 8B) on the CATmuS dataset on @huggingface. The first versions of the models are now available on the Small Models for GLAM organization with @vanstriendaniel! (Link below.) These are designed to work …
[4 replies · 13 reposts · 105 likes]