Jindřich Libovický Profile
Jindřich Libovický

@jlibovicky

Followers
953
Following
780
Media
38
Statuses
433

🇨🇿 🇪🇺 Researcher at @ufal_cuni. Working on multilingual NLP and neural machine translation. Views my own. He/him

Prague, Czech Republic
Joined July 2011
@jlibovicky
Jindřich Libovický
11 months
Join Mu-SHROOM 🍄, a SemEval 2025 shared task on detecting hallucination spans in multilingual LLM outputs! 🌍 Includes Czech with regional Czech questions 🇨🇿. Do you think you can spot when something isn’t true? 🤔 Try it out! 👉 https://t.co/SOU1YTtq2g #SemEval2025 #NLProc
helsinki-nlp.github.io
Hallucinations and Related Observable Overgenerations
0
2
7
@jlibovicky
Jindřich Libovický
11 months
Happy holidays! 🎄🎅🤩🎁
0
0
7
@jlibovicky
Jindřich Libovický
1 year
Highlights from multilingual #NLProc and machine translation papers I found on arXiv in November are now on my blog:
jlibovicky.github.io
Mitigating Metric Bias in Minimum Bayes Risk Decoding
2
0
2
@jlibovicky
Jindřich Libovický
1 year
This is going to be fun! 🤓 We have three years to spend 6.5M CZK on improving multilingual tokenization. The goal is to make subwords more alignable across languages and help languages that suffer from over-segmentation with current models.
@ufal_cuni
Institute of Formal and Applied Linguistics
1 year
Good news! 🥳 GAČR will fund two of our projects: 👉 @jlibovicky proposes to better tokenization for #LLMs and machine translation 👉 Veronika Kolářová will study syntactic features of Czech non-verbal predicates ➕ Dominik Macháček receives Postdoc Individual Fellowship! 💪
0
0
13
@jlibovicky
Jindřich Libovický
1 year
Find me on 🦋 (and the rest of #NLProc folks too).
0
0
1
@jlibovicky
Jindřich Libovický
1 year
There's no clear winner in this year's MRL shared task, but we ended up in the cluster of top-3 teams. I'm so proud of you, folks ☺️
@ufal_cuni
Institute of Formal and Applied Linguistics
1 year
Finally, @kat_haem and Gianluca Vico presented one of the three prize-winning 🏆🤑 submissions for the shared task on multilingual named entity recognition and question answering! w/ @AndreiM85400815, @jindra_helcl and @jlibovicky. Congrats! https://t.co/kZxNr3tKpY
0
0
12
@jlibovicky
Jindřich Libovický
1 year
Thanks to everyone who stopped by the poster ☺️
@ufal_cuni
Institute of Formal and Applied Linguistics
1 year
#EMNLP2024 starts today and @ufal_cuni is here! We start with @jlibovicky presenting work with @jindra_helcl: Lexically Grounded Subword Segmentation https://t.co/R5FNXF31MA
0
0
5
@jlibovicky
Jindřich Libovický
1 year
This week I am at #EMNLP2024 in Miami 🌴🇺🇸. Find me 🕵️ or message 💌 me if you want to chat about multilinguality or tokenization, and stop by our poster on Tuesday at 2 p.m., where I'll present our paper on Lexically Grounded Subword Segmentation https://t.co/R7W28p5BeZ
0
1
9
@jlibovicky
Jindřich Libovický
1 year
Summaries of #multilingual #LLM and machine translation papers I liked in October are now on my blog https://t.co/Pg6mMtNe9J and also on Medium https://t.co/6clmWCKbLq
medium.com
Here are summaries of a few preprints that I noticed on arXiv during October.
0
0
7
@jindra_helcl
Jindra Helcl
1 year
... starring @jlibovicky and me as young and promising scientists with their impeccable movie editing skills
0
1
5
@jlibovicky
Jindřich Libovický
1 year
If you liked the video, read our paper https://t.co/nAw9SCuGNh or check our code https://t.co/sGnpr1Uane
github.com
Neural extension to the SentencePiece algorithm. Contribute to ufal/legros-paper development by creating an account on GitHub.
@jlibovicky
Jindřich Libovický
1 year
In our #EMNLP2024 paper with @jindra_helcl, we present a new subword tokenization method that is more morphologically plausible but maintains the nice properties of existing tokenizers. Pre-print: https://t.co/Dqx0N6k7kr Code: https://t.co/s3pztuSk8N 👇🧵1/4
0
0
0
@jlibovicky
Jindřich Libovický
1 year
In a week, @jindra_helcl and I will present our paper Lexically Grounded Subword Segmentation at #EMNLP2024 in Miami 🌴🇺🇸. You can already watch our video 🎥 https://t.co/g88FRIeVoo or stop by our poster 👋 next Tuesday at 2 p.m...
2
1
13
@jlibovicky
Jindřich Libovický
1 year
Summaries of a few papers that I noticed on arXiv during summer are now on my blog: https://t.co/kjHP1Bhj9y and on Medium https://t.co/t94pmSq7Af.
medium.com
Here are summaries of a few papers that I liked during the (long academic) summer.
0
1
9
@jlibovicky
Jindřich Libovický
1 year
👍 It works great for preserving morpheme boundaries. 👍 Does a good job in POS tagging. 👎 No improvement in machine translation. And bad news, @zouharvi, our downstream performance does not correlate with Rényi efficiency. 🤷‍♂️ 🧵4/4
1
0
4
@jlibovicky
Jindřich Libovický
1 year
Then, we find segmentations whose subword embeddings are closest to the word embedding. We collect bigram stats from those and use them in a bigram-LM-based segmenter (a generalization of SentencePiece). And we also do some experiments... 🧵3/4
1
0
3
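To make the "bigram-LM-based segmenter" idea from 🧵3/4 concrete, here is a minimal sketch in Python. It is not the authors' implementation (their code is in the linked repository). It only illustrates the general technique of picking the segmentation of a word with the highest score under a subword bigram model, using Viterbi search. The function name `segment`, the `unk_logprob` fallback, and the toy log-probability table are all hypothetical and chosen for illustration.

```python
import math

BOS, EOS = "<s>", "</s>"  # hypothetical boundary symbols


def segment(word, bigram_logprob, unk_logprob=-20.0):
    """Return the best-scoring segmentation of `word` under a subword bigram LM.

    `bigram_logprob` maps (previous_subword, subword) to a log-probability.
    Unseen bigrams fall back to `unk_logprob` (an assumption of this sketch).
    """
    # Viterbi state: (end position in word, last subword) -> (score, backpointer)
    best = {(0, BOS): (0.0, None)}
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            # Extend every partial segmentation that ends exactly at `start`.
            for (pos, prev), (score, _) in list(best.items()):
                if pos != start:
                    continue
                cand = score + bigram_logprob.get((prev, piece), unk_logprob)
                if cand > best.get((end, piece), (-math.inf, None))[0]:
                    best[(end, piece)] = (cand, (pos, prev))

    # Close each complete segmentation with the end-of-word transition.
    final_state, final_score = None, -math.inf
    for (pos, prev), (score, _) in best.items():
        if pos == len(word):
            total = score + bigram_logprob.get((prev, EOS), unk_logprob)
            if total > final_score:
                final_state, final_score = (pos, prev), total

    # Follow backpointers from the best final state to recover the subwords.
    pieces = []
    state = final_state
    while state is not None and state[1] != BOS:
        pieces.append(state[1])
        state = best[state][1]
    return pieces[::-1]


# Toy bigram table (made-up numbers, not statistics from the paper).
TOY_BIGRAMS = {
    (BOS, "cat"): -1.0,
    ("cat", "s"): -0.5,
    ("s", EOS): -0.2,
    (BOS, "cats"): -3.0,
    ("cats", EOS): -0.5,
}

print(segment("cats", TOY_BIGRAMS))
```

With this toy table, splitting into "cat" + "s" scores higher than keeping "cats" whole. With an empty table every transition costs the same, so the segmentation with the fewest subwords wins.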
@jlibovicky
Jindřich Libovický
1 year
We introduce three innovations: 1⃣ Morfessor in pre-tokenization. 2⃣ Novel embedding-based tokenizer. 3⃣ Distillation into a bigram model to get rid of slow Morfessor. For embedding-based tokenization, we have a closed-form solution for extending W2V to subwords 🤓 🧵2/4
1
0
2
@jlibovicky
Jindřich Libovický
1 year
In our #EMNLP2024 paper with @jindra_helcl, we present a new subword tokenization method that is more morphologically plausible but maintains the nice properties of existing tokenizers. Pre-print: https://t.co/Dqx0N6k7kr Code: https://t.co/s3pztuSk8N 👇🧵1/4
4
2
26
@jlibovicky
Jindřich Libovický
1 year
In the paper introducing the dataset https://t.co/cj7OrNW5mF, we also present a method based on hard-negative sampling on the text side of the model that significantly improves the model's ability to distinguish details.
0
0
0
@jlibovicky
Jindřich Libovický
1 year
It consists of minimal pairs of images and captions derived from the MS COCO test set. Annotators used object detection and Stable Diffusion Inpainting 👨‍🎨👩‍🎨 to get images with either different objects or objects of different colors and sizes. Everything's 100% human-supervised. 💪
1
0
0