Thomas Sounack
@tsounack
93 Followers · 38 Following · 1 Media · 32 Statuses
AI/ML Engineer @ Dana-Farber Cancer Institute | Stanford alum
Joined May 2024
Very excited to share the release of BioClinical ModernBERT! Highlights:
- biggest and most diverse biomedical and clinical dataset for an encoder
- 8192 context
- fastest throughput with a variety of inputs
- SOTA results across several tasks
- base and large sizes
(1/8)
If you would like more detail on any particular aspect of the guide, please don't hesitate to reach out! Your contributions are welcome and will be acknowledged.
Link to our HF collection: https://t.co/6zXS33GrY9
Link to our paper: https://t.co/UzEtzltiRr
5/5
If you are working with a lot of biomedical and/or clinical text, consider continuing MLM training of BioClinical ModernBERT on your own data! The resulting encoder will be much easier to fine-tune on your various downstream tasks (embedding model for RAG, classifier...) 4/5
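A minimal sketch of what that continued MLM training could look like with the Hugging Face Trainer; the Hub id, dataset path, and hyperparameters below are illustrative assumptions rather than the exact recipe from the guide.

```python
# Sketch: continue masked language modeling (MLM) training of BioClinical ModernBERT
# on your own clinical text with the Hugging Face Trainer.
# Assumptions: the Hub id and the local JSONL path are placeholders, use your own.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "thomas-sounack/BioClinical-ModernBERT-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# One note/document per row in a "text" column (placeholder file).
raw = load_dataset("json", data_files="my_clinical_notes.jsonl", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=8192),
    batched=True,
    remove_columns=raw.column_names,
)

# Dynamic masking; 30% follows the ModernBERT pretraining recipe, tune as needed.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.3)

args = TrainingArguments(
    output_dir="bioclinical-modernbert-continued",  # illustrative hyperparameters
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    num_train_epochs=1,
    bf16=True,  # if your hardware supports it
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

From there, the resulting checkpoint can be fine-tuned like any other encoder, for example with AutoModelForSequenceClassification for a classifier or as the base of an embedding model.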
You can use this same setup to continue training BioClinical ModernBERT itself. Generalization is one of the biggest strengths of our model - after being trained on a very diverse biomedical and clinical dataset, we found that it performed great on our DFCI oncology notes. 3/5
Link: https://t.co/R87dUvCyAN
The guide goes over setting up your training environment, pre-tokenizing your dataset, configuring the masked language modeling (MLM) training, and domain adaptation considerations. 2/5
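For the pre-tokenization step in particular, one common pattern (sketched here with an assumed Hub id and placeholder paths, not necessarily the guide's exact setup) is to tokenize once with datasets.map and cache the result on disk so training runs can reuse it.

```python
# Sketch: pre-tokenize a raw-text corpus once and cache it on disk so that MLM
# training can reuse it. The Hub id and file paths are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-base")

raw = load_dataset("text", data_files={"train": "notes/*.txt"})

def tokenize(batch):
    # Truncate to the model's 8192-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = raw.map(tokenize, batched=True, num_proc=8, remove_columns=["text"])
tokenized.save_to_disk("notes_pretokenized")  # later: datasets.load_from_disk("notes_pretokenized")
```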
Want to continue training an encoder on your own data, but not sure where to start? Our step-by-step guide for reproducing the BioClinical ModernBERT training was just released! 1/5
Exciting work from @neumll!
🧬🔬⚕️ Building on the popularity of our PubMedBERT Embeddings model, we're excited to release a long context medical embeddings model! It's built on the great work below from @tsounack
Model: https://t.co/AFF9CKa8Tb
Paper: https://t.co/JJH6Tx30GJ
BioClinical ModernBERT GitHub repo is online! It contains:
- Our continued pretraining config files
- Performance eval code
- Inference speed eval code
Step-by-step guide on how to continue ModernBERT or BioClinical ModernBERT pretraining coming in the next few days!
https://t.co/xGJeik3UZb
Next demo: visualizing BioClinical-ModernBERT-base embeddings on a sphere https://t.co/2vHAxRfLX2
BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
→ Built on ModernBERT with 8K context, RoPE, and fast unpadded inference
Trained via two-phase continued pretraining:
- Phase 1: 160.5B tokens (PubMed + PMC + 20 diverse clinical
we are so back "Mitochondria is the powerhouse of the [MASK]."
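If you want to try that [MASK] example yourself, here is a quick sketch using the transformers fill-mask pipeline; the Hub id is assumed from the release, and the top predictions will depend on the checkpoint you load.

```python
# Sketch: query the released encoder with the fill-mask pipeline.
# The Hub id is an assumption; point it at the checkpoint you actually use.
from transformers import pipeline

fill = pipeline("fill-mask", model="thomas-sounack/BioClinical-ModernBERT-base")
for pred in fill("Mitochondria is the powerhouse of the [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```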
BioClinical ModernBERT is out! Built on the largest, most diverse biomedical/clinical dataset to date
‼️ Delivers SOTA across the board
Thrilled to be part of this effort led by @tsounack
Your daily reminder that fine-tuning is just continued pretraining. Super cool results from @antoine_chaffin, who is putting this knowledge into practice to improve medical AI:
You can just continue pre-training things ✨ Happy to announce the release of BioClinical ModernBERT, a ModernBERT model whose pre-training has been continued on medical data. The result: SOTA performance on various medical tasks with long context support and ModernBERT efficiency
Clinical encoders are joining the ModernBERT family ☺️
🚀Announcing BioClinical ModernBERT, a SOTA encoder for healthcare AI, developed by Thomas Sounack @tsounack for Dana-Farber Cancer Institute in collaboration with @Harvard, @LightOnIO, @MIT, @mcgillu, @AlbanyMed, @MSFTResearch. Seamless continued pre-training enables SOTA
Link to the models:
- https://t.co/NHGBZMS3bT
- https://t.co/qCw3rCrbJg
- https://t.co/mCD7GLzjDd
(8/8)
During benchmarking, we also observed substantially faster fine-tuning and inference with BioClinical ModernBERT. Combined with its long-context support, which enables processing a full clinical note in a single pass, this gives it strong scaling potential for clinical NLP. (7/8)
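To illustrate the single-pass point, here is a small sketch of embedding a long note without chunking; mean pooling is one common choice rather than anything prescribed by the release, and the Hub id is assumed.

```python
# Sketch: embed a full clinical note in a single forward pass using the
# 8192-token context window, instead of chunking the note.
# Mean pooling over non-padding tokens is one common pooling choice; the Hub id is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "thomas-sounack/BioClinical-ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

long_note = "Patient presents with ..."  # e.g. a full discharge summary

inputs = tokenizer(long_note, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # (1, seq_len, hidden_dim)

mask = inputs["attention_mask"].unsqueeze(-1).float()     # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
print(embedding.shape)
```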
Excited to see how it performs on your data! In our internal evaluations, BioClinical ModernBERT significantly outperformed existing encoders - thanks to its training on diverse clinical data spanning multiple institutions, specialties, and countries. (6/8)