tomaarsen

@tomaarsen

Followers: 4K · Following: 4K · Media: 311 · Statuses: 1K

Sentence Transformers, SetFit & NLTK maintainer. Machine Learning Engineer at 🤗 Hugging Face.

Netherlands
Joined December 2023
@tomaarsen
tomaarsen
1 month
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode method improvements, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details in 🧵
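A minimal sketch of what the new sparse models look like in use, assuming the v5 `SparseEncoder` API described in the release; the model name is illustrative:

```python
from sentence_transformers import SparseEncoder

# Load a SPLADE-style sparse embedding model (name is illustrative)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

queries = ["what is hybrid search?"]
documents = ["Hybrid search combines sparse (lexical) and dense (semantic) retrieval."]

# encode_query/encode_document route inputs through the correct side
# of an asymmetric (Router-based) model; plain encode() works for symmetric ones
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)

# Sparse embeddings are mostly zeros; scoring is still a dot product
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
```

The sparse scores can then be fused with a dense model's scores (e.g. by weighted sum or reciprocal rank fusion) for hybrid search.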
@tomaarsen
tomaarsen
4 days
Big thanks to all of the contributors for helping with the release; many of the features in this release were proposed by others. I have a big list of potential future features that I'd love to add, but I'm unsure what to prioritize now. Exciting times!
@tomaarsen
tomaarsen
4 days
Plus many more smaller features & fixes (crash fixes, compatibility with datasets v4, FIPS compatibility, etc.). 🧵
@tomaarsen
tomaarsen
4 days
We've added some documentation on evaluating SentenceTransformer models properly with MTEB. It's rudimentary as the documentation on the MTEB side is already great, but it should get you started. 🧵
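As a rough sketch of that evaluation flow (task and model names are illustrative, assuming the current `mteb` Python API):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Any SentenceTransformer model works; the name is illustrative
model = SentenceTransformer("all-MiniLM-L6-v2")

# Start small: pick one or two tasks rather than the full benchmark
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Results are written as JSON files under output_folder
results = evaluation.run(model, output_folder="results")
```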
@tomaarsen
tomaarsen
4 days
If you also upgrade `transformers`, and you install `trackio` with `pip install trackio`, then your experiments will also automatically be tracked locally with trackio. Just open up localhost and have a look at your losses/evals, no logins, no metric uploading. 🧵
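In other words, the setup is roughly this (package names from the tweet; the exact dashboard address is whatever trackio prints on startup):

```shell
# Upgrade transformers and install trackio alongside sentence-transformers
pip install -U sentence-transformers transformers trackio

# Then train as usual: runs are logged locally, and opening the localhost
# URL that trackio prints shows losses/evals -- no logins, no uploading
```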
@tomaarsen
tomaarsen
4 days
When doing multi-GPU training using a loss that has in-batch negatives (e.g. MultipleNegativesRankingLoss), you can now use `gather_across_devices=True` to also use in-batch negatives gathered from the other devices! Essentially a free lunch, with pretty big impact potential in my evals. 🧵
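A sketch of how that option is wired into a loss, assuming the `gather_across_devices` keyword named in the tweet (model name illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")

# With e.g. 4 GPUs at a per-device batch size of 32, each anchor now sees
# 127 in-batch negatives (gathered from all devices) instead of 31
loss = MultipleNegativesRankingLoss(model, gather_across_devices=True)
```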
@tomaarsen
tomaarsen
4 days
There's a new `n-tuple-scores` output format for `mine_hard_negatives`. This output format is immediately compatible with MarginMSELoss and SparseMarginMSELoss for training SentenceTransformer, CrossEncoder, and SparseEncoder models. 🧵
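Roughly, the mining step would look like this (dataset and parameters are illustrative; the `output_format` value is taken from the tweet):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")
dataset = load_dataset("sentence-transformers/natural-questions", split="train")

# Keep teacher similarity scores next to each (query, positive, negatives...)
# tuple, ready for MarginMSELoss / SparseMarginMSELoss distillation
dataset = mine_hard_negatives(
    dataset,
    model,
    num_negatives=5,
    output_format="n-tuple-scores",
)
```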
@tomaarsen
tomaarsen
4 days
Plus I ran benchmarks for CPUs (see first picture of the thread) and GPUs, averaged across a couple of datasets and batch sizes. 🧵
@tomaarsen
tomaarsen
4 days
I added faster ONNX and OpenVINO backends for SparseEncoder models. The usage is as simple as `backend="onnx"` or `backend="openvino"` when initializing a SparseEncoder to get started, but I also included utility functions for optimization, dynamic & static quantization. 🧵
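Usage presumably mirrors the existing dense-model backends; a hedged sketch (model name illustrative):

```python
from sentence_transformers import SparseEncoder

# backend="onnx" (or "openvino") loads an accelerated export of the model
model = SparseEncoder(
    "naver/splade-cocondenser-ensembledistil",
    backend="onnx",
)
embeddings = model.encode(["Sparse encoders can run on ONNX Runtime too."])
```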
@tomaarsen
tomaarsen
4 days
😎 I just published Sentence Transformers v5.1.0, and it's a big one: 2x-3x speedups of SparseEncoder models via ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more! See 🧵 for the deets:
@tomaarsen
tomaarsen
5 days
RT @dylan_ebert_: OpenAI just released GPT-OSS: An Open Source Language Model on Hugging Face. Open source meaning: 💸 Free. 🔒 Private. 🔧 Cust…
@tomaarsen
tomaarsen
5 days
OpenAI is back with open releases on Hugging Face. Check out their latest here:
@OpenAI
OpenAI
5 days
Our open models are here. Both of them.
@tomaarsen
tomaarsen
5 days
And inference is extraordinarily simple. 🧵
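A minimal sketch of such inference with transformers (the model id is from the gpt-oss release; running it requires substantial memory):

```python
from transformers import pipeline

# gpt-oss-20b is the smaller of the two released models
generator = pipeline("text-generation", model="openai/gpt-oss-20b")
output = generator("Explain hybrid search in one sentence.", max_new_tokens=64)
print(output[0]["generated_text"])
```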
@tomaarsen
tomaarsen
5 days
SetFit is still my go-to for classifying anything: it's so much faster and cheaper than LLM-based solutions. I've used it to classify 3k+ texts per second: it's so quick it's even viable on CPUs. 🧵
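A small end-to-end SetFit sketch (dataset and base model are illustrative; real use would have more examples per class):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Tiny illustrative dataset; SetFit is designed for ~8-64 examples per class
train_dataset = Dataset.from_dict({
    "text": ["great product", "terrible support", "love it", "waste of money"],
    "label": [1, 0, 1, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L3-v2")
trainer = Trainer(
    model=model,
    args=TrainingArguments(num_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()

# Fast batch prediction; viable on CPU
predictions = model.predict(["amazing value", "never again"])
```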
@tomaarsen
tomaarsen
5 days
I've just updated SetFit to v1.1.3, bringing compatibility with the recent datasets v4.0+ and Sentence Transformers v5.0+. You'll again be able to train tiny classifiers using very little training data! 🧵
@tomaarsen
tomaarsen
6 days
P.S. I haven't tested this on non-reasoning tasks, so I'm not sure how well it holds up on more "standard" retrieval tasks. It looks to be mostly evaluated on BRIGHT (reasoning-intensive retrieval). It's also evaluated on NanoBEIR, but I'm not sure how other models do there.
@tomaarsen
tomaarsen
6 days
There's a new, strong multilingual ColBERT model! Trained for English, German, Spanish, French, Italian, Dutch, and Portuguese. I think this'll be my new recommendation for a multilingual Late Interaction/ColBERT model.
@gm8xx8
𝚐𝔪𝟾𝚡𝚡𝟾
7 days
SauerkrautLM-Multi-Reason-ModernColBERT: a multilingual, reasoning-capable late interaction retriever family.
- First ColBERT-style retriever to apply LaserRMT for low-rank approximation
- Distilled from Qwen/Qwen3-32B-AWQ using 200K synthetic query-document pairs, scored by a
@tomaarsen
tomaarsen
9 days
RT @lvwerra: Excited to share the preview of the ultra-scale book! The past few months we worked with a graphic designer to bring the blo…
@tomaarsen
tomaarsen
10 days
Huge, nicely done @Cohere 👏.
@nickfrosst
Nick Frosst
10 days
cohere vision model :) weights on huggingface.