Thomas Sounack Profile
Thomas Sounack

@tsounack

Followers: 93 · Following: 38 · Media: 1 · Statuses: 32

AI/ML Engineer @ Dana-Farber Cancer Institute | Stanford alum

Joined May 2024
@tsounack
Thomas Sounack
5 months
Very excited to share the release of BioClinical ModernBERT! Highlights:
- biggest and most diverse biomedical and clinical dataset for an encoder
- 8192 context
- fastest throughput with a variety of inputs
- SOTA results across several tasks
- base and large sizes
(1/8)
4
14
67
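For readers who want to try it out, here is a minimal sketch of loading the model with Hugging Face transformers. The repo id below is an assumption, not confirmed in this thread; check the official HF collection (linked later on this page) for the exact names.

```python
# Minimal sketch: loading BioClinical ModernBERT via transformers.
# The repo id is an assumption; verify against the official HF collection.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "thomas-sounack/BioClinical-ModernBERT-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# The announcement claims an 8192-token context window.
print(model.config.max_position_embeddings)
```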
@tsounack
Thomas Sounack
2 months
If you would like to see more details about a certain aspect of the guide, please don't hesitate to reach out! Your contributions are welcome and will be acknowledged. Link to our HF collection: https://t.co/6zXS33GrY9 Link to our paper: https://t.co/UzEtzltiRr 5/5
0
1
2
@tsounack
Thomas Sounack
2 months
If you are working with a lot of biomedical and/or clinical text, consider continuing MLM training of BioClinical ModernBERT on your own data! The resulting encoder will be much easier to fine-tune on your various downstream tasks (embedding model for RAG, classifier...) 4/5
1
0
1
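A hedged sketch of the downstream step this tweet describes: fine-tuning a domain-adapted encoder as a classifier. The checkpoint path and the data file (assumed to have "text" and "label" columns) are hypothetical placeholders, not part of the thread.

```python
# Hedged sketch: fine-tune a continued-pretraining checkpoint as a classifier.
# "./my-adapted-encoder" and "notes_labeled.csv" are hypothetical placeholders;
# the CSV is assumed to have "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "./my-adapted-encoder"  # output of your continued MLM training
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

ds = load_dataset("csv", data_files={"train": "notes_labeled.csv"})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf-out", num_train_epochs=3),
    train_dataset=ds["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```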
@tsounack
Thomas Sounack
2 months
You can use this same setup to continue training BioClinical ModernBERT itself. Generalization is one of the biggest strengths of our model - after being trained on a very diverse biomedical and clinical dataset, we found that it performed great on our DFCI oncology notes. 3/5
1
0
1
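The released guide has its own configs and tooling, so the following is only a rough illustration of the shape of continued MLM training with the plain transformers Trainer; file names, the repo id, and hyperparameters are placeholders, not the authors' recipe.

```python
# Rough illustration of continued MLM training; not the authors' exact setup.
# "my_notes.txt" is a hypothetical file of in-house clinical text.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "thomas-sounack/BioClinical-ModernBERT-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

ds = load_dataset("text", data_files="my_notes.txt")["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=8192),
            batched=True, remove_columns=["text"])

# 15% masking is the transformers default; the guide may use other settings.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-mlm", num_train_epochs=1),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
trainer.save_model("my-adapted-encoder")
```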
@tsounack
Thomas Sounack
2 months
Link: https://t.co/R87dUvCyAN The guide goes over setting up your training environment, pre-tokenizing your dataset, configuring the Masked Language Modeling training, and domain adaptation considerations. 2/5
1
0
3
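On the pre-tokenization step the tweet lists: a common pattern is to tokenize the corpus once with datasets.map and save the result so training runs skip that cost. A sketch under assumed file names; the guide's exact recipe may differ.

```python
# Sketch of pre-tokenizing a corpus up front; file names are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "thomas-sounack/BioClinical-ModernBERT-base")  # assumed repo id

raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = raw.map(tokenize, batched=True, num_proc=4,
                    remove_columns=["text"])
tokenized.save_to_disk("pretokenized")  # later: datasets.load_from_disk(...)
```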
@tsounack
Thomas Sounack
2 months
Want to continue training an encoder on your own data, but not sure where to start? Our step-by-step guide for reproducing the BioClinical ModernBERT training was just released! 1/5
2
3
13
@tsounack
Thomas Sounack
4 months
Exciting work from @neumll!
@neumll
NeuML
4 months
🧬🔬⚕️ Building on the popularity of our PubMedBERT Embeddings model, we're excited to release a long context medical embeddings model! It's built on the great work below from @tsounack Model: https://t.co/AFF9CKa8Tb Paper: https://t.co/JJH6Tx30GJ https://t.co/pSXJg2nBBa
0
0
4
@tsounack
Thomas Sounack
5 months
Exciting to see BioClinical ModernBERT (base) ranked #2 among trending fill-mask models - right after BERT! The large version is currently at #4. Grateful for the interest, and can’t wait to see what projects people apply it to!
0
7
12
@tsounack
Thomas Sounack
5 months
BioClinical ModernBERT GitHub repo is online! It contains:
- Our continued pretraining config files
- Performance eval code
- Inference speed eval code
Step-by-step guide on how to continue ModernBERT or BioClinical ModernBERT pretraining coming in the next few days!
1
3
18
@introsp3ctor
Mike Dupont
5 months
https://t.co/xGJeik3UZb https://t.co/2vHAxRfLX2 next demo visualizing BioClinical-ModernBERT-base embeddings on a sphere
3
1
7
@gm8xx8
𝚐𝔪𝟾𝚡𝚡𝟾
5 months
BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
→ Built on ModernBERT with 8K context, RoPE, and fast unpadded inference
Trained via two-phase continued pretraining:
- Phase 1: 160.5B tokens (PubMed + PMC + 20 diverse clinical …
0
3
16
@josephpollack
Joseph Pollack #Ï 🎗️
5 months
we are so back "Mitochondria is the powerhouse of the [MASK]."
@tsounack
Thomas Sounack
5 months
[quoted tweet: the BioClinical ModernBERT release announcement (1/8), shown in full at the top of this page]
5
2
9
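The [MASK] test in that joke is exactly what the fill-mask pipeline runs; a one-line sanity check, with the repo id assumed as elsewhere on this page:

```python
# Fill-mask sanity check; the repo id is an assumption.
from transformers import pipeline

fill = pipeline("fill-mask", model="thomas-sounack/BioClinical-ModernBERT-base")
for pred in fill("Mitochondria is the powerhouse of the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```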
@joshp_davis
Josh Davis
5 months
BioClinical ModernBERT is out! Built on the largest, most diverse biomedical/clinical dataset to date ‼️ Delivers SOTA across the board. Thrilled to be part of this effort led by @tsounack
@tsounack
Thomas Sounack
5 months
[quoted tweet: the BioClinical ModernBERT release announcement (1/8), shown in full above]
0
2
5
@jeremyphoward
Jeremy Howard
5 months
Your daily reminder that fine tuning is just continued pretraining. Super cool results from @antoine_chaffin who is putting this knowledge into practice to improve medical AI:
@antoine_chaffin
Antoine Chaffin
5 months
[quoted tweet: @antoine_chaffin's BioClinical ModernBERT announcement, shown in full below]
7
55
418
@antoine_chaffin
Antoine Chaffin
5 months
You can just continue pre-train things ✨ Happy to announce the release of BioClinical ModernBERT, a ModernBERT model whose pre-training has been continued on medical data The result: SOTA performance on various medical tasks with long context support and ModernBERT efficiency
@tsounack
Thomas Sounack
5 months
[quoted tweet: the BioClinical ModernBERT release announcement (1/8), shown in full above]
4
33
214
@bclavie
Ben Clavié
5 months
Clinical encoders are joining the ModernBERT family ☺️
@tsounack
Thomas Sounack
5 months
[quoted tweet: the BioClinical ModernBERT release announcement (1/8), shown in full above]
0
7
29
@LightOnIO
LightOn
5 months
🚀 Announcing BioClinical ModernBERT, a SOTA encoder for healthcare AI, developed by Thomas Sounack @tsounack for Dana-Farber Cancer Institute in collaboration with @Harvard, @LightOnIO, @MIT, @mcgillu, @AlbanyMed, @MSFTResearch. Seamless continued pre-training enables SOTA …
1
8
22
@tsounack
Thomas Sounack
5 months
During benchmarking, we also observed substantially faster fine-tuning and inference with BioClinical ModernBERT. Combined with its long context support, enabling full clinical note processing in a single pass, it offers strong scaling potential for clinical NLP. (7/8)
1
0
6
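To make the single-pass point concrete: with an 8192-token window, a typical full-length note fits in one forward pass instead of being chunked. A sketch with a hypothetical note file and the assumed repo id:

```python
# Encode a full clinical note in one pass; no chunking needed up to 8192 tokens.
# "discharge_summary.txt" is a hypothetical note; the repo id is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "thomas-sounack/BioClinical-ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

note = open("discharge_summary.txt").read()
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
print(hidden.shape)
```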
@tsounack
Thomas Sounack
5 months
Excited to see how it performs on your data! In our internal evaluations, BioClinical ModernBERT significantly outperformed existing encoders - thanks to its training on diverse clinical data spanning multiple institutions, specialties, and countries. (6/8)
1
0
7