Zihao Li
@realzihaolee
Followers
71
Following
4K
Media
50
Statuses
626
Doctoral Researcher @HelsinkiNLP | MSc @UnivHelsinkiCS | Multilingual NLP
Helsinki, Finland
Joined June 2019
Wrapping up the @emnlpmeeting main conference. #HPLT, funded by the EU and UKRI, supported it as a silver sponsor, disseminating HPLT results from our booth and through several papers. We'll keep shaping the future of multilingual datasets and models here and in @OpenEuroLLM. Stay tuned!
0
3
12
Describing HPLT datasets in depth is an essential part of our commitment as data curators: 📄 HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models: https://t.co/uN2zoSF251 We are on 🔥 at #HPLT
arxiv.org
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...
0
6
8
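The 30-trillion-token release above doesn't have to be downloaded to be inspected; a minimal sketch of streaming one language slice with Hugging Face `datasets`, assuming Hub identifiers like the ones below (the actual 3.0 repo names may differ):

```python
# Minimal sketch: stream a single-language slice of an HPLT release with
# Hugging Face `datasets` instead of downloading 30T tokens. The repo id,
# config name, and `text` field are assumptions; check the HPLT release
# page for the real identifiers.
from datasets import load_dataset

ds = load_dataset(
    "HPLT/HPLT2.0_cleaned",  # assumed repo id; the 3.0 release may differ
    "fin_Latn",              # assumed per-language config (ISO 639-3 + script)
    split="train",
    streaming=True,
)

for i, doc in enumerate(ds):
    print(doc["text"][:200])  # assumed field name
    if i == 2:
        break
```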
Test-Time Scaling of Reasoning Models for Machine Translation https://t.co/He5Cr1ZhbF
#LLM
arxiv.org
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This...
0
1
0
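The abstract leaves the recipe open, so here is one common test-time-scaling baseline for MT, best-of-N sampling with minimum-Bayes-risk selection, as a hedged sketch; the model id is an assumption and the paper's reasoning-model setup may differ:

```python
# Generic best-of-N test-time scaling for MT: sample several candidate
# translations, then pick one by minimum Bayes risk (MBR) with sentence
# BLEU as the utility. Model id is an assumption.
import sacrebleu
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def translate_tts(src: str, n: int = 8) -> str:
    prompt = f"Translate the following sentence into English.\n{src}\nTranslation:"
    outs = generator(prompt, do_sample=True, temperature=0.8,
                     num_return_sequences=n, max_new_tokens=128,
                     return_full_text=False)
    cands = [o["generated_text"].strip() for o in outs]

    def utility(i: int) -> float:
        # Average agreement with the other samples, measured by sentence BLEU.
        others = [cands[j] for j in range(n) if j != i]
        return sum(sacrebleu.sentence_bleu(cands[i], [o]).score for o in others) / len(others)

    return cands[max(range(n), key=utility)]

print(translate_tts("Hyvää huomenta kaikille."))
```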
I couldn't make it to Montreal because my visa has been "under review" since forever. Shoutout to IRCC for pushing the limits of bureaucracy! @CitImmCanada
0
0
0
🧠 We explore how monolingual, bilingual, and code-augmented data shape multilingual continual pretraining across high- to low-resource languages. Big thanks to my supervisor @TiedemannJoerg, who will present our poster. Come chat with him!
1
0
0
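For a concrete picture of the data mixing being studied, a toy sampler that interleaves monolingual, bilingual, and code streams under tunable weights; the weights and stream contents are illustrative, not the paper's actual ratios:

```python
# Toy data-mixing loop for multilingual continual pretraining: pick a source
# stream per step according to sampling weights. Real loaders would cycle,
# reshard, and tokenize; this only shows the mixing logic.
import random

def mix_streams(streams, weights, steps, seed=0):
    """Yield (source, example), choosing a stream per step by weight."""
    rng = random.Random(seed)
    names = list(streams)
    probs = [weights[n] for n in names]
    iters = {n: iter(streams[n]) for n in names}
    for _ in range(steps):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(iters[name])

data = {
    "mono": ["<Finnish doc>", "<Swahili doc>"] * 50,
    "para": ["<eng ||| fin sentence pair>"] * 50,
    "code": ["def hello(): return 'hi'"] * 50,
}
for src, ex in mix_streams(data, {"mono": 0.6, "para": 0.3, "code": 0.1}, steps=5):
    print(src, ex)
```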
🎉 Excited to share our #COLM2025 work: Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources. 📍 Poster Session 3, Wednesday 11–13
1
0
3
Agree
@zephyr_z9 My favourite European AI labs tier list:
S: Mistral
A:
B:
C: Stability AI
Left the race: Aleph Alpha
0
0
1
Camera-ready version of our work: Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources https://t.co/XpfzScKnJi
COLM 2025 accepted submissions are now public: https://t.co/yWL007rSU7 Congratulations to all the authors, and see you all in Montreal!
0
0
4
Accepted by #COLM2025 🎉
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources https://t.co/gUpyQ36rvX
#LLMs
0
0
4
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data https://t.co/qP4JofNDNG
#LLM
arxiv.org
This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of...
0
0
6
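A minimal sketch of the simplest way parallel data can enter continual pre-training, templating sentence pairs into plain text; the template itself is an assumption, since the paper studies exactly such design decisions:

```python
# Illustrative template for injecting bilingual translation pairs into a
# pretraining stream as plain text. The exact format is an assumption.
def format_parallel(src_text: str, tgt_text: str,
                    src_lang: str = "English", tgt_lang: str = "Finnish") -> str:
    return f"{src_lang}: {src_text}\n{tgt_lang}: {tgt_text}"

print(format_parallel("Good morning.", "Hyvää huomenta."))
```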
Scaling Low-Resource MT via Synthetic Data Generation with LLMs https://t.co/HFD9FoIM3h
#LLMs
arxiv.org
We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level...
0
0
4
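The core step can be sketched in a few lines: translate monolingual documents with a strong LLM to manufacture parallel training pairs. This uses the OpenAI Python SDK; the model id and prompt are assumptions, and the paper's document-level pipeline for its seven target languages is more involved:

```python
# Sketch: build synthetic parallel data by translating monolingual documents
# with a strong LLM. Model id and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_pair(doc: str, tgt_lang: str = "Swahili") -> tuple[str, str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model id
        messages=[{"role": "user",
                   "content": f"Translate the following document into {tgt_lang}. "
                              f"Return only the translation.\n\n{doc}"}],
    )
    return doc, resp.choices[0].message.content.strip()
```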
Improvements in multilingual translation capabilities are noticeable. FLORES-200 X-Eng 3-shot BLEU scores 📈
Qwen3 models support 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models.
0
0
1
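A minimal harness for the kind of number quoted above, 3-shot X-Eng BLEU on FLORES-200 with sacrebleu; the dataset and model identifiers and field names are assumptions, and the script-based dataset may require trust_remote_code=True:

```python
# Sketch of 3-shot fin->eng BLEU evaluation on FLORES-200: three demo pairs
# in the prompt, then sacrebleu over devtest. Ids and fields are assumptions.
import sacrebleu
from datasets import load_dataset
from transformers import pipeline

flores = load_dataset("facebook/flores", "fin_Latn-eng_Latn")  # assumed config
dev, devtest = flores["dev"], flores["devtest"]

shots = "\n\n".join(
    f"Finnish: {ex['sentence_fin_Latn']}\nEnglish: {ex['sentence_eng_Latn']}"
    for ex in dev.select(range(3))
)

lm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # assumed model

def translate(src: str) -> str:
    prompt = f"{shots}\n\nFinnish: {src}\nEnglish:"
    out = lm(prompt, max_new_tokens=64, return_full_text=False)[0]["generated_text"]
    return out.strip().split("\n")[0]  # keep only the first generated line

hyps = [translate(ex["sentence_fin_Latn"]) for ex in devtest]
refs = [[ex["sentence_eng_Latn"] for ex in devtest]]
print("3-shot fin->eng BLEU:", sacrebleu.corpus_bleu(hyps, refs).score)
```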
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models https://t.co/H9VA0d4U9A
#LLMs
arxiv.org
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models...
0
0
1
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources https://t.co/gUpyQ36rvX
#LLMs
arxiv.org
Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual...
0
0
4
AI made in 🇪🇺 OpenEuroLLM, the first family of open source Large Language Models covering all EU languages, has earned the first STEP Seal for its excellence. It brings together EU startups, research labs and supercomputing hosts to train AI on European supercomputers …
1K
877
6K