Vilém Zouhar #EMNLP
@zouharvi
Followers: 3K · Following: 18K · Media: 463 · Statuses: 2K
PhD student @ ETH Zürich | all aspects of #NLProc but mostly HCI, evaluation and MT | go #vegan
Zürich, Switzerland
Joined June 2014
Eval is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting my hair done without knowing Chinese. Yes, you got 67 BLEU points, but is the result slaying? 💇 See the result on one datapoint (my head) at EMNLP
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? https://t.co/WCpqDT0hAt
- Estimating Machine Translation Difficulty https://t.co/nsHyi8Hmh9
- COMET-poly: Machine Translation Metric Grounded in Other Candidates
...research problems I was passionate about and in planning my research future. You should apply to these fellowships, even if only for the exercise of periodically refining your research statement.
Grateful to receive the Google PhD Fellowship! 🙂 I am not secretive about having applied to 4 similar fellowships earlier in my PhD without success. Still, refining my research statement (part of the application) helped me tremendously in figuring out the really interesting..
🎉 We're excited to announce the 2025 Google PhD Fellows! @GoogleOrg is providing over $10 million to support 255 PhD students across 35 countries, fostering the next generation of research talent to strengthen the global scientific landscape. Read more: https://t.co/0Pvuv6hsgP
📢 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME), co-located with #EACL2026 🇲🇦
📅 Mar 24–29, 2026 | Rabat, Morocco
MME focuses on resources, metrics & methodologies for evaluating multilingual systems! https://t.co/60yCZUjbzH
🗓️ Submit by
📢Shared task deadline extended: You now have a whole week to go (until August 6 AoE) to register and send us your submissions!!
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework. The following tasks are now open (deadline July 31st but participation has never been easier 🙂)
Organizers are happy to help with any questions. 🙂 Website with all details and contacts:
📐 Task 3: Quality-informed segment-level error correction
Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.
Description: https://t.co/844QeBTI9A
Submission platform:
📐 Task 2: Span-level error detection
Identify and locate translation errors within each segment (start/end indices) and classify their severity.
Description: https://t.co/baKvWUuPGq
Submission platform:
📐 Task 1: Segment-level quality score prediction
Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.
Description: https://t.co/M9oEULegNk
Submission platform:
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework. The following tasks are now open (deadline July 31st but participation has never been easier 🙂)
Thank you to everyone who helped. 😊 Special thanks to @mrinmayasachan and Peng Cui from @CSatETH, and to all the friends I bugged with proofreading. 😁
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿 📃 Paper (with nuances and caveats): https://t.co/oBP8qb5Bs0 📦 Package: https://t.co/OdPeoycIHa Feedback welcome!
github.com · zouharvi/subset2evaluate: Find informative examples to efficiently (human-)evaluate NLG models.
Recommendation based on translation and summarization:
1️⃣ if you have a good automatic metric, use variance/consistency
2️⃣ if not, use model output diversity
3️⃣ if outputs are not available, use an artificial crowd / distilled predictors
4️⃣ if those are not available, use source diversity
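A minimal sketch of this decision order, with hypothetical helper names and availability flags (this is not the subset2evaluate API, just an illustration):

```python
def pick_utility(has_good_metric: bool, has_outputs: bool, has_artificial_crowd: bool) -> str:
    """Choose a datapoint-utility signal based on what is available.

    Mirrors the recommendation in the tweet above; all names are illustrative.
    """
    if has_good_metric:
        return "metric_variance"    # 1️⃣ variance / consistency of metric scores
    if has_outputs:
        return "output_diversity"   # 2️⃣ diversity of model outputs
    if has_artificial_crowd:
        return "artificial_crowd"   # 3️⃣ simulated outputs / distilled predictors
    return "source_diversity"       # 4️⃣ fall back to diversity of the source texts
```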
We frame this as a 0/1 knapsack problem: find a subset Y ⊆ X with maximum utility while staying under budget B. 🤓
maximize: ∑ zₓ · Utility(x)
subject to: ∑ zₓ · Cost(x) ≤ B, zₓ ∈ {0, 1}
The Utility(x) can be metric average, variance, diversity, etc.
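To make the objective concrete, here is a minimal greedy utility-per-cost heuristic for this 0/1 knapsack selection; an exact solver (DP or ILP) can be swapped in, and the paper's actual implementation may differ:

```python
def select_subset(items, budget):
    """Greedy 0/1 knapsack heuristic: pick datapoints by utility per unit cost.

    items: iterable of (datapoint, utility, cost) tuples; budget: annotation budget B.
    Illustrative approximation of the objective above, not the paper's exact solver.
    """
    chosen, spent = [], 0.0
    for x, utility, cost in sorted(items, key=lambda t: t[1] / t[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(x)
            spent += cost
    return chosen

# Example: three candidate datapoints under a budget of 2 annotation hours.
print(select_subset([("a", 0.9, 1.0), ("b", 0.5, 1.5), ("c", 0.4, 0.5)], budget=2.0))
```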
This works even if you don't have the model outputs yet:
1️⃣ "artificial crowd": simulate what the model outputs would look like, then apply the previous methods.
2️⃣ "utility predictors": estimate usefulness directly from the source text.
3️⃣ "source-based diversity": remove similar inputs.
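A rough sketch of the source-based diversity option (3️⃣ above): greedy farthest-point selection over pre-computed source-text embeddings. The embeddings and the exact selection rule are assumptions for illustration, not the paper's procedure:

```python
import numpy as np

def diverse_sources(embeddings: np.ndarray, k: int) -> list[int]:
    """Pick k mutually dissimilar source texts via greedy farthest-point selection.

    embeddings: (n, d) array of pre-computed source representations.
    Near-duplicates are avoided because each new pick is the point least similar
    to anything already chosen.
    """
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]  # seed with an arbitrary first point
    while len(selected) < k:
        sims = norm @ norm[selected].T            # cosine similarity to selected set
        farthest = int(np.argmin(sims.max(axis=1)))
        selected.append(farthest)
    return selected
```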
So what works? Selecting inputs that expose model differences:
1️⃣ high variance in metric scores
2️⃣ diversity in model outputs
3️⃣ high metric consistency with the rest of the dataset
We now need almost 30% fewer annotated examples to get the same model ranking.
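For signal 1️⃣ (variance in metric scores), a minimal sketch of how such a utility could be computed from a matrix of automatic-metric scores; the score matrix and the top-k rule are assumptions for illustration:

```python
import numpy as np

def top_variance_items(scores: np.ndarray, k: int) -> np.ndarray:
    """Rank datapoints by how much automatic-metric scores disagree across systems.

    scores: (n_items, n_systems) matrix, e.g. one metric score per item and system.
    Returns indices of the k items with the highest cross-system variance, i.e. the
    items most likely to expose differences between the evaluated models.
    """
    utility = scores.var(axis=1)      # variance over systems, per item
    return np.argsort(-utility)[:k]   # indices of the k highest-utility items
```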