Jan Hajic
@HajicJan
Followers
273
Following
521
Media
16
Statuses
369
Researcher in AI / Natural Language Processing
Prague, Czech Republic
Joined March 2012
Proud to present the @OpenEuroLLM project and its results so far, with Sampo Pyysalo (@UniTurku) at the 1st Workshop on Open Source Sovereign LLMs in Berlin https://t.co/gMalOuRGm3 Great opportunity to talk to many OS LLM developers! @CharlesUniPRG @hplt_eu
0
2
6
Wrapping @emnlpmeeting main conf. #HPLT, funded by the EU and UKRI, has supported it as a silver sponsor, disseminating HPLT results from our booth and through several papers. We'll keep shaping the future of multilingual datasets and models here and in @OpenEuroLLM. Stay tuned!
0
3
12
We strongly agree! Let's make it happen! Thanks @EU_Budget & STEP for the support.
Future-proof AI in all EU languages isn't a dream, it's OpenEuroLLM 🗣️💬 9 countries, the EU budget & STEP join forces to build transparent, AI Act-compliant tech for Europe's innovators. Find out how we will turn ambition into action for 2028-2034: https://t.co/ZekDsBaz6M
0
2
4
Excellent keynote by @HannaHajishirzi at @emnlpmeeting about pre- and post-training of fully open source LLMs! A must-watch for anyone working on such models. Also very good questions in the discussion about possible misuse of open models, multilinguality, and other issues.
0
0
1
Describing HPLT datasets in depth is an essential part of our commitment as data curators: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models: https://t.co/uN2zoSF251 We are on 🔥 at #HPLT
arxiv.org
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...
0
6
8
OpenEuroLLM is completing 2 days of sharing progress and next steps, pursuing the goal of developing strong multilingual foundation models aligned with the European strategic vision & standards. Gathering at BSC, near the MareNostrum 5 supercomputer, made us feel at home. #Barcelona #NLProc
0
1
4
#HPLT v3.0 Dataset is OUT! A massive leap in multilingual data quality & scale with: - 73% unique segments (up from 52%) - Better text extraction and langID - Global deduplication, cleaner corpus. Ideal for training better multilingual models: https://t.co/DzDKYtfWpz
1
7
14
It's happening now. Our HPLT v2 dataset language coverage is awesome, provides competitive and stable results and complements other data beautifully. We are at @aclmeeting, come and say hi! #hplt #datasets
0
5
10
📢 First release: 38 monolingual reference LLMs (2.15B params) via @hplt_eu + #OpenEuroLLM - Trained on 100B tokens from HPLT v2 dataset - Cover EU langs + others - Based on LLaMA, trained on #LUMI - Useful for evaluation. Downloads + more info at https://t.co/vp1RwD9YFy
1
10
21
Great use of HPLT v2 datasets! Eager to hear more about #HPLT? Join us at @aclmeeting: - BoF "Multilingualism: from data crawling to evaluation" on July 29, 16:00 - Poster "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies" on July 30, 11:00
📢 First release: 38 monolingual reference LLMs (2.15B params) via @hplt_eu + #OpenEuroLLM - Trained on 100B tokens from HPLT v2 dataset - Cover EU langs + others - Based on LLaMA, trained on #LUMI - Useful for evaluation. Downloads + more info at https://t.co/vp1RwD9YFy
0
4
8
HPLT stopped by @MTSummit2025 last week. We exchanged info with participants at a crowded poster session about HPLT v2 datasets while v3 is still in the oven. Next stop, @aclmeeting!
0
2
10
https://t.co/ow87rxjbIR is underway, Day 1. Highlight: Michiel Leenaars on the necessary changes in the EU and the importance of worldwide cooperation on open source #ngi #opensource. I will present @OpenEuroLLM tomorrow alongside the @OpenWebSearchEU folks! @LindatClariahCZ
0
1
3
📢 #ACL2025NLP This year we received 8276 submissions, which is the highest number in the history of ACL conferences. If you are not yet involved as a reviewer, AC or SAC, we would encourage you to volunteer as an (emergency) AC or reviewer: https://t.co/UhPTpK7hq6
docs.google.com
Use this form to volunteer to join the ACL 2025 program committee as an (emergency) reviewer or area chair (AC). The reviewers need to be available in March and early April 2025. ACs need to be...
6
42
155
We are happy to announce the second release of HPLT bilingual datasets: - 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩 - 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮 Available at the HPLT dataset catalogue and OPUS.
0
13
16
The 6th International Workshop on Designing Meaning Representations (#DMR2025) is in Prague, Aug 4-5, right after #ACL2025 in Vienna! Submit your work on meaning representations: annotation, parsing, multilinguality, neuro-symbolic methods & more. Details:
dmr2025.github.io
The 6th International Workshop on Designing Meaning Representations
0
3
6
The OpenEuroLLM 🇪🇺 project brings together 20 European research institutions, companies, and computing centers to develop open large language models (LLMs). And we at @ufal_cuni @matfyz @CharlesUniPRG 🇨🇿 are proud to be the main coordinator. #AI #LLM 🧵 1/4
2
10
24
Great keynote by @partha_p_t of @GoogleDeepMind at @Coling2025 about cultural aspects and low-resource languages, especially Indian ones. Topped off with interesting suggestions about LLM composition (CALM) vs. finetuning, including for Machine Translation.
0
1
19
Very interesting keynote by Katrin Erk (@UTAustin) at @coling2025 on Word Meaning, including a discussion of very old ideas (like Fodor and Katz's features), contrasting them with possible information coming from LLMs.
0
1
3
@ufal_cuni is represented by @MarieMikulova (paper/poster at main @coling2025), @hajicjan as ICCL member, and three former students: Shantipriya Parida (https://t.co/PTBiUT5LJy), @adamnohejl (@NAIST_NLP) and Christian Khairallah (Aralect). @matfyz @LindatClariahCZ @UniKarlova
1
1
4
@coling2025 is underway in Abu Dhabi, UAE, hosted by @mbzuai. Just opened by General Co-chairs Leo Wanner and @RambowOwen! All papers incl. Workshops now available at https://t.co/hPF6FJnNoU.
@LindatClariahCZ @matfyz
1
3
6