Vilém Zouhar #EMNLP
@zouharvi
Followers: 3K · Following: 18K · Media: 463 · Statuses: 2K
PhD student @ ETH Zürich | all aspects of #NLProc but mostly HCI, evaluation and MT | go #vegan
Zürich, Switzerland
Joined June 2014
Eval is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting my hair done without knowing Chinese. Yes, you got 67 BLEU points, but is the result slaying? 💇 See the result on one datapoint (my head) at EMNLP
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? https://t.co/WCpqDT0hAt
- Estimating Machine Translation Difficulty https://t.co/nsHyi8Hmh9
- COMET-poly: Machine Translation Metric Grounded in Other Candidates
...research problems I was passionate about and in planning my research future. You should apply to these fellowships, even if only for the exercise of periodically refining your research statement.
Grateful to receive the Google PhD Fellowship! 🙂 I am not secretive about having applied to 4 similar fellowships earlier in my PhD without success. Still, refining my research statement (part of the application) helped me tremendously in figuring out the really interesting..
🎉 We're excited to announce the 2025 Google PhD Fellows! @GoogleOrg is providing over $10 million to support 255 PhD students across 35 countries, fostering the next generation of research talent to strengthen the global scientific landscape. Read more: https://t.co/0Pvuv6hsgP
📢 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME), co-located with #EACL2026 🇲🇦
📅 Mar 24–29, 2026 | Rabat, Morocco
MME focuses on resources, metrics & methodologies for evaluating multilingual systems! https://t.co/60yCZUjbzH
🗓️ Submit by
📢Shared task deadline extended: You now have a whole week to go (until August 6 AoE) to register and send us your submissions!!
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework. The following tasks are now open (deadline July 31st but participation has never been easier 🙂)
Organizers are happy to help with any questions. 🙂 Website with all details and contacts:
📐 Task 3: Quality-informed segment-level error correction
Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.
Description: https://t.co/844QeBTI9A
Submission platform:
📐 Task 2: Span-level error detection
Identify and locate translation errors within each segment (start/end indices) and classify their severity.
Description: https://t.co/baKvWUuPGq
Submission platform:
📐 Task 1: Segment-level quality score prediction
Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.
Description: https://t.co/M9oEULegNk
Submission platform:
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework. The following tasks are now open (deadline July 31st but participation has never been easier 🙂)
Thank you to everyone who helped. 😊 Special thanks to @mrinmayasachan and Peng Cui from @CSatETH, and to all the friends I bugged with proofreading. 😁
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿 📃 Paper (with nuances and caveats): https://t.co/oBP8qb5Bs0 📦 Package: https://t.co/OdPeoycIHa Feedback welcome!
github.com · zouharvi/subset2evaluate: Find informative examples to efficiently (human-)evaluate NLG models.
Recommendation based on translation and summarization:
1️⃣ if you have a good automatic metric, use variance/consistency
2️⃣ if not, use model output diversity
3️⃣ if outputs are not available, use an artificial crowd / distilled predictors
4️⃣ if those are not available, use source diversity
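A minimal sketch of this decision order, with hypothetical helper names and availability flags (this is not the subset2evaluate API, just an illustration):

```python
def pick_utility(has_good_metric: bool, has_outputs: bool, has_artificial_crowd: bool) -> str:
    """Choose a datapoint-utility signal based on what is available.

    Mirrors the recommendation in the tweet above; all names are illustrative.
    """
    if has_good_metric:
        return "metric_variance"    # 1️⃣ variance / consistency of metric scores
    if has_outputs:
        return "output_diversity"   # 2️⃣ diversity of model outputs
    if has_artificial_crowd:
        return "artificial_crowd"   # 3️⃣ simulated outputs / distilled predictors
    return "source_diversity"       # 4️⃣ fall back to diversity of the source texts
```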
We frame this as a 0/1 knapsack problem: find a subset Y ⊆ X with maximum utility while staying under budget B. 🤓
maximize: ∑ zₓ · Utility(x)
subject to: ∑ zₓ · Cost(x) ≤ B, zₓ ∈ {0, 1}
The Utility(x) can be metric average, variance, diversity, etc.
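To make the objective concrete, here is a minimal greedy utility-per-cost heuristic for this 0/1 knapsack selection; an exact solver (DP or ILP) can be swapped in, and the paper's actual implementation may differ:

```python
def select_subset(items, budget):
    """Greedy 0/1 knapsack heuristic: pick datapoints by utility per unit cost.

    items: iterable of (datapoint, utility, cost) tuples; budget: annotation budget B.
    Illustrative approximation of the objective above, not the paper's exact solver.
    """
    chosen, spent = [], 0.0
    for x, utility, cost in sorted(items, key=lambda t: t[1] / t[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(x)
            spent += cost
    return chosen

# Example: three candidate datapoints under a budget of 2 annotation hours.
print(select_subset([("a", 0.9, 1.0), ("b", 0.5, 1.5), ("c", 0.4, 0.5)], budget=2.0))
```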
This works even if you don't have the model outputs yet:
1️⃣ "artificial crowd": simulate what the model outputs would look like, then apply the previous methods.
2️⃣ "utility predictors": estimate usefulness directly from the source text.
3️⃣ "source-based diversity": remove similar inputs.
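A rough sketch of the source-based diversity option (3️⃣ above): greedy farthest-point selection over pre-computed source-text embeddings. The embeddings and the exact selection rule are assumptions for illustration, not the paper's procedure:

```python
import numpy as np

def diverse_sources(embeddings: np.ndarray, k: int) -> list[int]:
    """Pick k mutually dissimilar source texts via greedy farthest-point selection.

    embeddings: (n, d) array of pre-computed source representations.
    Near-duplicates are avoided because each new pick is the point least similar
    to anything already chosen.
    """
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]  # seed with an arbitrary first point
    while len(selected) < k:
        sims = norm @ norm[selected].T            # cosine similarity to selected set
        farthest = int(np.argmin(sims.max(axis=1)))
        selected.append(farthest)
    return selected
```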
So what works? Selecting inputs that expose model differences:
1️⃣ high variance in metric scores
2️⃣ diversity in model outputs
3️⃣ high metric consistency with the rest of the dataset
We now need almost 30% fewer annotated examples to get the same model ranking.
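For signal 1️⃣ (variance in metric scores), a minimal sketch of how such a utility could be computed from a matrix of automatic-metric scores; the score matrix and the top-k rule are assumptions for illustration:

```python
import numpy as np

def top_variance_items(scores: np.ndarray, k: int) -> np.ndarray:
    """Rank datapoints by how much automatic-metric scores disagree across systems.

    scores: (n_items, n_systems) matrix, e.g. one metric score per item and system.
    Returns indices of the k items with the highest cross-system variance, i.e. the
    items most likely to expose differences between the evaluated models.
    """
    utility = scores.var(axis=1)      # variance over systems, per item
    return np.argsort(-utility)[:k]   # indices of the k highest-utility items
```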