Zihao Li
@realzihaolee
Followers
71
Following
4K
Media
50
Statuses
626
Doctoral Researcher @HelsinkiNLP | MSc @UnivHelsinkiCS | Multilingual NLP
Helsinki, Finland
Joined June 2019
Wrapping up the @emnlpmeeting main conference. #HPLT, funded by the EU and UKRI, supported it as a silver sponsor, disseminating HPLT results from our booth and through several papers. We'll keep shaping the future of multilingual datasets and models here and in @OpenEuroLLM. Stay tuned!
0
3
12
Describing HPLT datasets in depth is an essential part of our commitment as data curators: 📄 HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models: https://t.co/uN2zoSF251 We are on 🔥 at #HPLT
arxiv.org
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...
0
6
8
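The 30-trillion-token release above doesn't have to be downloaded to be inspected; a minimal sketch of streaming one language slice with Hugging Face `datasets`, assuming Hub identifiers like the ones below (the actual 3.0 repo names may differ):

```python
# Minimal sketch: stream a single-language slice of an HPLT release with
# Hugging Face `datasets` instead of downloading 30T tokens. The repo id,
# config name, and `text` field are assumptions; check the HPLT release
# page for the real identifiers.
from datasets import load_dataset

ds = load_dataset(
    "HPLT/HPLT2.0_cleaned",  # assumed repo id; the 3.0 release may differ
    "fin_Latn",              # assumed per-language config (ISO 639-3 + script)
    split="train",
    streaming=True,
)

for i, doc in enumerate(ds):
    print(doc["text"][:200])  # assumed field name
    if i == 2:
        break
```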
Test-Time Scaling of Reasoning Models for Machine Translation https://t.co/He5Cr1ZhbF
#LLM
arxiv.org
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This...
0
1
0
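The abstract leaves the recipe open, so here is one common test-time-scaling baseline for MT, best-of-N sampling with minimum-Bayes-risk selection, as a hedged sketch; the model id is an assumption and the paper's reasoning-model setup may differ:

```python
# Generic best-of-N test-time scaling for MT: sample several candidate
# translations, then pick one by minimum Bayes risk (MBR) with sentence
# BLEU as the utility. Model id is an assumption.
import sacrebleu
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def translate_tts(src: str, n: int = 8) -> str:
    prompt = f"Translate the following sentence into English.\n{src}\nTranslation:"
    outs = generator(prompt, do_sample=True, temperature=0.8,
                     num_return_sequences=n, max_new_tokens=128,
                     return_full_text=False)
    cands = [o["generated_text"].strip() for o in outs]

    def utility(i: int) -> float:
        # Average agreement with the other samples, measured by sentence BLEU.
        others = [cands[j] for j in range(n) if j != i]
        return sum(sacrebleu.sentence_bleu(cands[i], [o]).score for o in others) / len(others)

    return cands[max(range(n), key=utility)]

print(translate_tts("Hyvää huomenta kaikille."))
```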
I couldn't make it to Montreal because my visa has been "under review" since forever. Shoutout to IRCC for pushing the limits of bureaucracy! @CitImmCanada
0
0
0
🧠 We explore how monolingual, bilingual, and code-augmented data shape multilingual continual pretraining across high- to low-resource languages. Big thanks to my supervisor @TiedemannJoerg, who will present our poster. Come chat with him!
1
0
0
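For a concrete picture of the data mixing being studied, a toy sampler that interleaves monolingual, bilingual, and code streams under tunable weights; the weights and stream contents are illustrative, not the paper's actual ratios:

```python
# Toy data-mixing loop for multilingual continual pretraining: pick a source
# stream per step according to sampling weights. Real loaders would cycle,
# reshard, and tokenize; this only shows the mixing logic.
import random

def mix_streams(streams, weights, steps, seed=0):
    """Yield (source, example), choosing a stream per step by weight."""
    rng = random.Random(seed)
    names = list(streams)
    probs = [weights[n] for n in names]
    iters = {n: iter(streams[n]) for n in names}
    for _ in range(steps):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(iters[name])

data = {
    "mono": ["<Finnish doc>", "<Swahili doc>"] * 50,
    "para": ["<eng ||| fin sentence pair>"] * 50,
    "code": ["def hello(): return 'hi'"] * 50,
}
for src, ex in mix_streams(data, {"mono": 0.6, "para": 0.3, "code": 0.1}, steps=5):
    print(src, ex)
```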
🎉 Excited to share our #COLM2025 work: Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources. 📍 Poster Session 3, Wednesday 11–13
1
0
3
Agree
@zephyr_z9 My favourite European AI labs tier list:
S: Mistral
A:
B:
C: Stability AI
Left the race: Aleph Alpha
0
0
1
Camera-ready version of our work: Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources https://t.co/XpfzScKnJi
COLM 2025 accepted submissions are now public: https://t.co/yWL007rSU7 Congratulations to all the authors, and see you all in Montreal!
0
0
4
Accepted by #COLM2025 🎉
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources https://t.co/gUpyQ36rvX
#LLMs
0
0
4
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data https://t.co/qP4JofNDNG
#LLM
arxiv.org
This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of...
0
0
6
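A minimal sketch of the simplest way parallel data can enter continual pre-training, templating sentence pairs into plain text; the template itself is an assumption, since the paper studies exactly such design decisions:

```python
# Illustrative template for injecting bilingual translation pairs into a
# pretraining stream as plain text. The exact format is an assumption.
def format_parallel(src_text: str, tgt_text: str,
                    src_lang: str = "English", tgt_lang: str = "Finnish") -> str:
    return f"{src_lang}: {src_text}\n{tgt_lang}: {tgt_text}"

print(format_parallel("Good morning.", "Hyvää huomenta."))
```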
Scaling Low-Resource MT via Synthetic Data Generation with LLMs https://t.co/HFD9FoIM3h
#LLMs
arxiv.org
We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level...
0
0
4
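The core step can be sketched in a few lines: translate monolingual documents with a strong LLM to manufacture parallel training pairs. This uses the OpenAI Python SDK; the model id and prompt are assumptions, and the paper's document-level pipeline for its seven target languages is more involved:

```python
# Sketch: build synthetic parallel data by translating monolingual documents
# with a strong LLM. Model id and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_pair(doc: str, tgt_lang: str = "Swahili") -> tuple[str, str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model id
        messages=[{"role": "user",
                   "content": f"Translate the following document into {tgt_lang}. "
                              f"Return only the translation.\n\n{doc}"}],
    )
    return doc, resp.choices[0].message.content.strip()
```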
Improvements in multilingual translation capabilities are noticeable. FLORES-200 X-Eng 3-shot BLEU scores 📈
Qwen3 models support 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models.
0
0
1
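A minimal harness for the kind of number quoted above, 3-shot X-Eng BLEU on FLORES-200 with sacrebleu; the dataset and model identifiers and field names are assumptions, and the script-based dataset may require trust_remote_code=True:

```python
# Sketch of 3-shot fin->eng BLEU evaluation on FLORES-200: three demo pairs
# in the prompt, then sacrebleu over devtest. Ids and fields are assumptions.
import sacrebleu
from datasets import load_dataset
from transformers import pipeline

flores = load_dataset("facebook/flores", "fin_Latn-eng_Latn")  # assumed config
dev, devtest = flores["dev"], flores["devtest"]

shots = "\n\n".join(
    f"Finnish: {ex['sentence_fin_Latn']}\nEnglish: {ex['sentence_eng_Latn']}"
    for ex in dev.select(range(3))
)

lm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed model

def translate(src: str) -> str:
    prompt = f"{shots}\n\nFinnish: {src}\nEnglish:"
    out = lm(prompt, max_new_tokens=64, return_full_text=False)[0]["generated_text"]
    return out.strip().split("\n")[0]  # keep only the first generated line

hyps = [translate(ex["sentence_fin_Latn"]) for ex in devtest]
refs = [[ex["sentence_eng_Latn"] for ex in devtest]]
print("3-shot fin->eng BLEU:", sacrebleu.corpus_bleu(hyps, refs).score)
```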
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models https://t.co/H9VA0d4U9A
#LLMs
arxiv.org
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models...
0
0
1
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources https://t.co/gUpyQ36rvX
#LLMs
arxiv.org
Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual...
0
0
4
AI made in 🇪🇺 OpenEuroLLM, the first family of open source Large Language Models covering all EU languages, has earned the first STEP Seal for its excellence. It brings together EU startups, research labs and supercomputing hosts to train AI on European supercomputers …
1K
877
6K