Jan Hajic @HajicJan X Profile

Jan Hajic

@HajicJan

Followers

273

Following

521

Media

16

Statuses

369

Researcher in AI / Natural Language Processing

https://t.co/Nz9LKFcpof

Prague, Czech Republic

Joined March 2012

Jan Hajic

@HajicJan

8 days

Proud to present the @OpenEuroLLM project and its results so far, with Sampo Pyysalo (@UniTurku) at the 1st Workshop on Open Source Sovereign LLMs in Berlin https://t.co/gMalOuRGm3 Great opportunity to talk to many OS LLM developers! @CharlesUniPRG @hplt_eu

0

2

6

HPLT

@hplt_eu

18 days

Wrapping @emnlpmeeting main conf. #HPLT, funded by the EU and UKRI, has supported it as a silver sponsor, disseminating HPLT results from our booth and through several papers. We'll keep shaping the future of multilingual datasets and models here and in @OpenEuroLLM. Stay tuned!

0

3

12

OpenEuroLLM

@OpenEuroLLM

18 days

We strongly agree! Let's make it happen! Thanks @EU_Budget & STEP for the support.

EU Budget

@EU_Budget

19 days

Future-proof AI in all EU languages isn’t a dream, it’s OpenEuroLLM 🗣️💬 9 countries, the EU budget & STEP join forces to build transparent, AI Act-compliant tech for Europe’s innovators. Find out how we will turn ambition into action for 2028-2034: https://t.co/ZekDsBaz6M

0

2

4

Jan Hajic

@HajicJan

19 days

Excellent keynote by @HannaHajishirzi at @emnlpmeeting about pre- and post-training of fully open source LLMs! A must-watch for anyone working on such models. Also very good questions in the discussion about possible misuse of open models, multilinguality, and other issues.

0

1

HPLT

@hplt_eu

19 days

Describing HPLT datasets in depth is an essential part of our commitment as data curators: 🆕HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models: https://t.co/uN2zoSF251 We are on🔥at #HPLT

arxiv.org

We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...

0

6

8

OpenEuroLLM

@OpenEuroLLM

1 month

OpenEuroLLM completing 2 days of sharing progress and next steps, pursuing the goal of developing strong multilingual foundation models aligned with the European strategic vision & standards. Gathering at BSC near the MareNostrum 5 supercomputer made us feel at home. #Barcelona #NLProc

0

1

4

HPLT

@hplt_eu

2 months

#HPLT v3.0 Dataset is OUT! 🚀 A massive leap in multilingual data quality & scale with: 📈 73% unique segments (up from 52%) 🌐 Better text extraction and langID 🔄 Global deduplication, cleaner corpus Ideal for training better multilingual models 🔗 https://t.co/DzDKYtfWpz

1

7

14

HPLT

@hplt_eu

4 months

It's happening now. Our HPLT v2 dataset language coverage is awesome, provides competitive and stable results and complements other data beautifully. We are at @aclmeeting, come and say hi! #hplt #datasets

0

5

10

OpenEuroLLM

@OpenEuroLLM

4 months

📢 First release: 38 monolingual reference LLMs (2.15B params) via @hplt_eu + #OpenEuroLLM ⚙️Trained on 100B tokens from HPLT v2 dataset 🌍 Cover EU langs + others ⚙️ Based on LLaMA, trained on #LUMI 📈 Useful for evaluation Downloads + more info at https://t.co/vp1RwD9YFy

1

10

21

HPLT

@hplt_eu

4 months

Great use of HPLT v2 datasets! Eager to hear more about #HPLT? Join us at @aclmeeting: - BoF "Multilingualism: from data crawling to evaluation" on July 29, 16:00 - Poster "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies" on July 30, 11:00

OpenEuroLLM

@OpenEuroLLM

4 months

📢 First release: 38 monolingual reference LLMs (2.15B params) via @hplt_eu + #OpenEuroLLM ⚙️Trained on 100B tokens from HPLT v2 dataset 🌍 Cover EU langs + others ⚙️ Based on LLaMA, trained on #LUMI 📈 Useful for evaluation Downloads + more info at https://t.co/vp1RwD9YFy

0

4

8

HPLT

@hplt_eu

5 months

HPLT stopped by @MTSummit2025 last week. We exchanged info with participants at a crowded poster session about HPLT v2 datasets while v3 is still in the oven. Next stop, @aclmeeting!

0

2

10

Jan Hajic

@HajicJan

5 months

https://t.co/ow87rxjbIR is underway, Day 1. Highlight: Michiel Leenaars on the necessary changes in the EU and the importance of worldwide cooperation on open source #ngi #opensource. I will present @OpenEuroLLM tomorrow alongside the @OpenWebSearchEU folks! @LindatClariahCZ

0

1

3

ACL 2025

@aclmeeting

9 months

📢#ACL2025NLP This year we received 8276 submissions 👏 which is the highest number in the history of ACL conferences 🙌 If you are not yet involved as a reviewer, AC or SAC, we would encourage you to volunteer as an (emergency) AC or reviewer https://t.co/UhPTpK7hq6 🙏

docs.google.com

Use this form to volunteer to join the ACL 2025 program committee as an (emergency) reviewer or area chair (AC). The reviewers need to be available in March and early April 2025. ACs need to be...

6

42

155

HPLT

@hplt_eu

9 months

We are happy to announce the second release of HPLT bilingual datasets: - 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩 - 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮 Available at the HPLT dataset catalogue and OPUS.

0

13

16

Institute of Formal and Applied Linguistics

@ufal_cuni

9 months

The 6th International Workshop on Designing Meaning Representations (#DMR2025) is in Prague, Aug 4-5, right after #ACL2025 in Vienna! Submit your work on meaning representations: annotation, parsing, multilinguality, neuro-symbolic methods & more. Details:

dmr2025.github.io

The 6th International Workshop on Designing Meaning Representations

0

3

6

Institute of Formal and Applied Linguistics

@ufal_cuni

10 months

The OpenEuroLLM 🇪🇺 project brings together 20 European research institutions, companies, and computing centers to develop open large language models (LLMs). And we at @ufal_cuni @matfyz @CharlesUniPRG 🇨🇿 are proud to be the main coordinator 🏛️. #AI #LLM 🧵 1/4

2

10

24

Jan Hajic

@HajicJan

10 months

Great keynote by @partha_p_t of @GoogleDeepMind at @Coling2025 about cultural aspects and low-resource languages, especially Indian. Topped with interesting suggestions about LLM composition (CALM) vs. finetuning, including for Machine Translation.

0

1

19

Jan Hajic

@HajicJan

10 months

Very interesting keynote by Katrin Erk (@UTAustin) at @coling2025 on Word Meaning including discussion of very old ideas (like Fodor and Katz's features), contrasting them with possible information coming from LLMs.

0

1

3

Institute of Formal and Applied Linguistics

@ufal_cuni

10 months

@ufal_cuni is represented by @MarieMikulova (paper/poster at main @coling2025), @hajicjan as ICCL member, and three former students: Shantipriya Parida (https://t.co/PTBiUT5LJy), @adamnohejl (@NAIST_NLP) and Christian Khairallah (Aralect). @matfyz @LindatClariahCZ @UniKarlova

1

4

Institute of Formal and Applied Linguistics

@ufal_cuni

10 months

@coling2025 is underway in Abu Dhabi, UAE, hosted by @mbzuai. Just opened by General Co-chairs Leo Wanner and @RambowOwen! All papers incl. Workshops now available at https://t.co/hPF6FJnNoU. @LindatClariahCZ @matfyz

1

3

6