Duarte Alves
@DuarteMRAlves
Followers: 56 · Following: 10 · Media: 2 · Statuses: 34
Joined November 2023
I'm heading to Montreal soon for @COLM_conf! Our lab is presenting the following 5 papers: 🧵
5) EuroBERT: Scaling Multilingual Encoders for European Languages w/ @N1colAIs @gisship @DuarteMRAlves
@AyoubHammal @UndefBehavior @Fannyjrd_ @ManuelFaysse @peyrardMax @psanfernandes
@RicardoRei7 @PierreColombo6
@tomaarsen - Poster session 5, Thu Oct 9, 11:00 AM – 1:00 PM
EuroBERT is going to @COLM_conf 2025! Can't wait to be in Montreal with @gisship and @DuarteMRAlves to see all the great research everyone's bringing!
🚨 Should you only pretrain encoder models with Masked Language Modeling (MLM)? Spoiler: definitely not! Let's revisit a foundational NLP question: Is MLM still the best way to pretrain encoder models for text representations? https://t.co/kaPLch1o3V x @gisship 1/7 🧵
arxiv.org
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence...
🚨 New paper drop: Should We Still Pretrain Encoders with Masked Language Modeling? We revisit a foundational question in NLP: Is masked language modeling (MLM) still the best way to pretrain encoder models for text representations? https://t.co/W1p5mjTTf2 (1/8)
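To make the question these two announcements revisit concrete, here is a minimal sketch of what the classic MLM objective computes; the mBERT checkpoint is purely illustrative, not the paper's actual setup:

```python
# Minimal sketch of the Masked Language Modeling objective; the mBERT
# checkpoint is illustrative only, not the setup studied in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

text = "Learning high-quality text representations is fundamental to NLP."
inputs = tokenizer(text, return_tensors="pt")
labels = inputs.input_ids.clone()

# Mask ~15% of non-special tokens (the full recipe also swaps in random or
# unchanged tokens; this sketch keeps only the [MASK] replacement).
prob = torch.full(labels.shape, 0.15)
special = tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
prob[0, torch.tensor(special, dtype=torch.bool)] = 0.0
masked = torch.bernoulli(prob).bool()
labels[~masked] = -100                              # loss only on masked positions
inputs.input_ids[masked] = tokenizer.mask_token_id

loss = model(**inputs, labels=labels).loss
print(f"MLM loss: {loss.item():.3f}")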
Proud moment! Prof. @andre_t_martins represented @UTTERProject & #EuroLLM at #GTCParis + #VivaTech2025, showcasing their role in Europe's sovereign AI future. And the highlight? Both projects were featured in Jensen Huang's keynote! #EU #NVIDIA #LLMs #AIResearch
The EuroBERT training library is live! Additionally, as weekends are perfect for experimentation, we've released a tutorial on continuous pre-training to add languages to EuroBERT. Tutorial: https://t.co/nMleTzF7A7 GitHub:
github.com
Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including CPU, AMD, and NVIDIA GPUs. - Nicolas-BZRD/EuroBERT
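The tutorial itself is at the links above; as a rough, hedged illustration of what continued MLM pretraining to add a language can look like with the standard Hugging Face Trainer (the checkpoint id, corpus file, and hyperparameters here are assumptions, not the official tutorial recipe):

```python
# Rough illustration of continued MLM pretraining to add a language.
# The checkpoint id, dataset file, and hyperparameters are assumptions,
# not the official EuroBERT tutorial recipe.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "EuroBERT/EuroBERT-210m"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Any monolingual corpus in the new target language would go here.
dataset = load_dataset("text", data_files={"train": "new_language_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# The collator applies the random masking on the fly for each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eurobert-continued",
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```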
An assembly of 18 European companies, labs, and universities has banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc. Details in 🧵
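As a hedged sketch of the retrieval-style usage that announcement describes (mean-pooled embeddings plus cosine similarity; the checkpoint id and the need for trust_remote_code are assumptions):

```python
# Hedged sketch: mean-pooled sentence embeddings from an encoder for
# retrieval-style similarity. The checkpoint id is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

sentences = ["Wie ist das Wetter heute?", "What is the weather like today?"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq, dim)

# Mean-pool over real (non-padding) tokens, then compare the two vectors.
mask = batch.attention_mask.unsqueeze(-1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)
similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```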
🇪🇺 One month after the AI Action Summit 2025 in Paris, I am thrilled to announce EuroBERT, a family of multilingual encoders exhibiting the strongest multilingual performance on tasks such as retrieval, classification, and regression across 15 languages, mathematics, and code. ⬇️ 1/6
🧵 (7/7) Check out our blog post for more insights: https://t.co/7oe2ZPdQtB Read more in our paper:
arxiv.org
General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide...
🧵 (6/7) Huge thanks also to all our collaborators: @CentraleSupelec @Diabolocom @artefact @sardine_lab_it @istecnico @itnewspt @Lisbon_ELLIS @Unbabel @AMD @CINESFrance
🧵 (5/7) @N1colAIs @gisship @andre_t_martins @AyoubHammal @UndefBehavior Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf @Fannyjrd_ Gabriel Hautreux @joao97_alves Kevin El-Haddad @ManuelFaysse @peyrardMax Nuno M. Guerreiro @psanfernandes @RicardoRei7 @PierreColombo6
🧵 (4/7) This work is the result of an incredible joint effort by a talented team from multiple institutions, props to everyone!
🧵 (3/7) EuroBERT is open-source: models (210M, 610M, and 2.1B params), training snapshots, and the full training framework. Explore here: https://t.co/SZHKDordRg Code coming soon! https://t.co/7o8CpqOfRV
🧵 (2/7) EuroBERT shines across benchmarks: ✔️ Retrieval (MIRACL, MLDR) ✔️ Classification (XNLI, PAWS-X) ✔️ Regression (SeaHorse) ✔️ Strong in code/math understanding (CodeSearchNet)
🧵 (1/7) Why EuroBERT?
✅ Extensive multilingual coverage
✅ Longer context handling (up to 8,192 tokens)
✅ Improved architecture
✅ Specialized for math and coding
Ideal for retrieval, classification, and regression tasks!
Excited to announce EuroBERT: a new multilingual encoder model family for European & global languages! EuroBERT is trained on a massive 5-trillion-token dataset across 15 languages and includes recent architecture advances such as GQA, RoPE & RMSNorm.
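For readers who want to see one of those components concretely, here is a generic RMSNorm sketch (a textbook implementation, not EuroBERT's actual code):

```python
# Minimal generic RMSNorm, one of the architecture components mentioned
# above (not EuroBERT's exact implementation).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the features; unlike
        # LayerNorm, there is no mean subtraction and no bias term.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 5, 16)
print(RMSNorm(16)(x).shape)  # torch.Size([2, 5, 16])
```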
Good to see @EU_Commission promoting OS LLMs in Europe. However, (1) "OpenEuroLLM" is appropriating a name (#EuroLLM) which already exists, and (2) it is certainly *not* the "first family of open-source LLMs covering all EU languages" 🧵
AI made in 🇪🇺: OpenEuroLLM, the first family of open-source Large Language Models covering all EU languages, has earned the first STEP Seal for its excellence. It brings together EU startups, research labs, and supercomputing hosts to train AI on European supercomputers…
What an incredible year for the team @ManuelFaysse @nunonmg @gisship @N1colAIs @DuarteMRAlves @andre_t_martins @UndefBehavior! The retrospective from @ManuelFaysse captures some of it. Plus, there's plenty of exciting news from @equallai; so much to celebrate and be proud of!
2024 was a super active year where I had the chance to explore many things: document embeddings, LLM pretraining, VLMs, ML privacy... It's also the year of my first citation, and soon my 100th?! A thread where I quickly go over some of my work from the year (1/N) 🧵