Jan Hajic Profile
Jan Hajic

@HajicJan

Followers: 273 · Following: 521 · Media: 16 · Statuses: 369

Researcher in AI / Natural Language Processing

Prague, Czech Republic
Joined March 2012
@HajicJan
Jan Hajic
8 days
Proud to present the @OpenEuroLLM project and its results so far, with Sampo Pyysalo (@UniTurku), at the 1st Workshop on Open Source Sovereign LLMs in Berlin: https://t.co/gMalOuRGm3 A great opportunity to talk to many OS LLM developers! @CharlesUniPRG @hplt_eu
0 · 2 · 6
@hplt_eu
HPLT
18 days
Wrapping up the @emnlpmeeting main conference. #HPLT, funded by the EU and UKRI, supported it as a silver sponsor, disseminating HPLT results from our booth and through several papers. We'll keep shaping the future of multilingual datasets and models here and in @OpenEuroLLM. Stay tuned!
0 · 3 · 12
@OpenEuroLLM
OpenEuroLLM
18 days
We strongly agree! Let's make it happen! Thanks @EU_Budget & STEP for the support.
@EU_Budget
EU Budget
19 days
Future-proof AI in all EU languages isn't a dream, it's OpenEuroLLM. 9 countries, the EU budget & STEP join forces to build transparent, AI Act-compliant tech for Europe's innovators. Find out how we will turn ambition into action for 2028-2034: https://t.co/ZekDsBaz6M
0 · 2 · 4
@HajicJan
Jan Hajic
19 days
Excellent keynote by @HannaHajishirzi at @emnlpmeeting about pre- and post-training of fully open source LLMs! A must-watch for anyone working on such models. Also very good questions in the discussion about possible misuse of open models, multilinguality, and other issues.
0 · 0 · 1
@hplt_eu
HPLT
19 days
Describing HPLT datasets in depth is an essential part of our commitment as data curators. New: HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models: https://t.co/uN2zoSF251 We are on fire at #HPLT
arxiv.org
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest...
0 · 6 · 8
@OpenEuroLLM
OpenEuroLLM
1 month
OpenEuroLLM is completing 2 days of sharing progress and next steps in pursuing the goal of developing strong multilingual foundation models aligned with European strategic vision & standards. Gathering at BSC, next to the MareNostrum 5 supercomputer, made us feel at home. #Barcelona #NLProc
0 · 1 · 4
@hplt_eu
HPLT
2 months
#HPLT v3.0 Dataset is OUT! A massive leap in multilingual data quality & scale with:
- 73% unique segments (up from 52%)
- better text extraction and langID
- global deduplication, cleaner corpus
Ideal for training better multilingual models: https://t.co/DzDKYtfWpz
1 · 7 · 14
@hplt_eu
HPLT
4 months
It's happening now. Our HPLT v2 dataset's language coverage is awesome; it provides competitive and stable results and complements other data beautifully. We are at @aclmeeting, come and say hi! #hplt #datasets
0 · 5 · 10
@OpenEuroLLM
OpenEuroLLM
4 months
First release: 38 monolingual reference LLMs (2.15B params) via @hplt_eu + #OpenEuroLLM
- trained on 100B tokens from the HPLT v2 dataset
- cover EU languages + others
- based on LLaMA, trained on #LUMI
- useful for evaluation
Downloads + more info at https://t.co/vp1RwD9YFy
1 · 10 · 21
@hplt_eu
HPLT
4 months
Great use of HPLT v2 datasets! Eager to hear more about #HPLT? Join us at @aclmeeting:
- BoF "Multilingualism: from data crawling to evaluation" on July 29, 16:00
- Poster "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies" on July 30, 11:00
0 · 4 · 8
@hplt_eu
HPLT
5 months
HPLT stopped by @MTSummit2025 last week. At a crowded poster session we exchanged info with participants about the HPLT v2 datasets, while v3 is still in the oven. Next stop, @aclmeeting!
0 · 2 · 10
@HajicJan
Jan Hajic
5 months
https://t.co/ow87rxjbIR is underway, Day 1. Highlight: Michiel Leenaars on the necessary changes in the EU and the importance of worldwide cooperation on open source. #ngi #opensource I will present @OpenEuroLLM tomorrow alongside the @OpenWebSearchEU folks! @LindatClariahCZ
0 · 1 · 3
@aclmeeting
ACL 2025
9 months
#ACL2025NLP This year we received 8276 submissions, which is the highest number in the history of ACL conferences! If you are not yet involved as a reviewer, AC, or SAC, we would encourage you to volunteer as an (emergency) AC or reviewer: https://t.co/UhPTpK7hq6
docs.google.com
Use this form to volunteer to join the ACL 2025 program committee as an (emergency) reviewer or area chair (AC). The reviewers need to be available in March and early April 2025. ACs need to be...
6 · 42 · 155
@hplt_eu
HPLT
9 months
We are happy to announce the second release of HPLT bilingual datasets:
- 50 English-centric language pairs = 380M parallel sentences (HPLT)
- 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT)
Available at the HPLT dataset catalogue and OPUS.
0 · 13 · 16
@ufal_cuni
Institute of Formal and Applied Linguistics
9 months
The 6th International Workshop on Designing Meaning Representations (#DMR2025) is in Prague, Aug 4-5, right after #ACL2025 in Vienna! Submit your work on meaning representations: annotation, parsing, multilinguality, neuro-symbolic methods & more. Details:
dmr2025.github.io
The 6th International Workshop on Designing Meaning Representations
0 · 3 · 6
@ufal_cuni
Institute of Formal and Applied Linguistics
10 months
The OpenEuroLLM project brings together 20 European research institutions, companies, and computing centers to develop open large language models (LLMs). And we at @ufal_cuni @matfyz @CharlesUniPRG are proud to be the main coordinator. #AI #LLM 1/4
2 · 10 · 24
@HajicJan
Jan Hajic
10 months
Great keynote by @partha_p_t of @GoogleDeepMind at @Coling2025 about cultural aspects and low-resource languages, especially Indian ones. Topped with interesting suggestions about LLM composition (CALM) vs. fine-tuning, including for machine translation.
0 · 1 · 19
@HajicJan
Jan Hajic
10 months
Very interesting keynote by Katrin Erk (@UTAustin) at @coling2025 on word meaning, including a discussion of very old ideas (like Fodor and Katz's features) and contrasting them with possible information coming from LLMs.
0 · 1 · 3
@ufal_cuni
Institute of Formal and Applied Linguistics
10 months
@ufal_cuni is represented by @MarieMikulova (paper/poster at the main @coling2025 conference), @hajicjan as an ICCL member, and three former students: Shantipriya Parida (https://t.co/PTBiUT5LJy), @adamnohejl (@NAIST_NLP), and Christian Khairallah (Aralect). @matfyz @LindatClariahCZ @UniKarlova
1 · 1 · 4
@ufal_cuni
Institute of Formal and Applied Linguistics
10 months
@coling2025 is underway in Abu Dhabi, UAE, hosted by @mbzuai. Just opened by General Co-chairs Leo Wanner and @RambowOwen! All papers, incl. workshops, are now available at https://t.co/hPF6FJnNoU. @LindatClariahCZ @matfyz
1
3
6