Vilém Zouhar

@zouharvi

Followers
3K
Following
18K
Media
460
Statuses
2K

PhD student @ ETH Zürich | all aspects of #NLProc but mostly HCI, evaluation and MT | go #vegan

Zürich, Switzerland
Joined June 2014
@zouharvi
Vilém Zouhar
27 days
RT @swetaagrawal20: 📢 Shared task deadline extended: You now have a whole week to go (until August 6 AoE) to register and send us your submi…
0
3
0
@zouharvi
Vilém Zouhar
1 month
RT @diptesh: 📢 Test Set RELEASED! 🚀 The test set for the #WMT25 Shared Task on QE-informed Segment-level Error Correction is now LIVE! It'…
0
5
0
@zouharvi
Vilém Zouhar
1 month
Organizers are happy to help with any questions. 🙂 Website with all details and contacts:
0
1
1
@zouharvi
Vilém Zouhar
1 month
📐 Task 3: Quality-informed segment-level error correction. Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.
Description:
Submission platform:
1
1
1
@zouharvi
Vilém Zouhar
1 month
📐 Task 2: Span-level error detection. Identify and locate translation errors within each segment (start/end indices) and classify their severity.
Description:
Submission platform:
1
1
1
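Purely as an illustration of the start/end-index format the task asks for, here is a hypothetical span annotation; the field names and structure are invented for this sketch, not the official WMT25 submission format:

```python
# Hypothetical span-level error annotation (invented field names,
# not the official WMT25 submission format).
annotation = {
    "source": "Der schnelle Fuchs sprang über den faulen Hund.",
    "translation": "The quick fox jumped over the lazy cat.",
    "errors": [
        # start/end are character offsets into the translation
        {"start": 35, "end": 38, "severity": "major", "note": "'cat' mistranslates 'Hund'"},
    ],
}
print(annotation["translation"][35:38])  # -> cat
```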
@zouharvi
Vilém Zouhar
1 month
📐 Task 1: Segment-level quality score prediction. Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.
Description:
Submission platform:
1
1
1
@zouharvi
Vilém Zouhar
1 month
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework. The following tasks are now open (deadline July 31st, but participation has never been easier 🙂).
1
6
12
@zouharvi
Vilém Zouhar
1 month
Thank you to everyone who helped. 😊 Special thanks to @mrinmayasachan and Peng Cui from @CSatETH, and to all my friends I bugged with proofreading. 😁
0
0
2
@zouharvi
Vilém Zouhar
1 month
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿. 📃 Paper (with nuances and caveats): 📦 Package: Feedback welcome!.
github.com
Find informative examples to efficiently (human)-evaluate NLG models. - zouharvi/subset2evaluate
1
0
5
@zouharvi
Vilém Zouhar
1 month
Recommendation based on translation and summarization:
1️⃣ if you have a good automatic metric, use variance/consistency
2️⃣ if not, use model output diversity
3️⃣ if outputs are not available, use an artificial crowd/distilled predictors
4️⃣ if those are not available, use source diversity
1
0
2
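The fallback order above, spelled out as a tiny decision function; the returned labels are descriptive shorthand for the criteria in the tweet, not names from the subset2evaluate API:

```python
# Fallback order for choosing a selection criterion (descriptive labels only).
def pick_utility(has_good_metric: bool, has_outputs: bool, has_artificial_crowd: bool) -> str:
    if has_good_metric:
        return "metric variance / consistency"
    if has_outputs:
        return "model output diversity"
    if has_artificial_crowd:
        return "artificial crowd / distilled utility predictors"
    return "source diversity"

print(pick_utility(False, True, False))  # -> model output diversity
```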
@zouharvi
Vilém Zouhar
1 month
We frame this as a 0/1 Knapsack problem: find a subset Y ⊆ X with maximum utility while staying under budget B. 🤓
maximize: ∑ₓ zₓ · Utility(x)
subject to: ∑ₓ zₓ · Cost(x) ≤ B, with zₓ ∈ {0, 1}
The Utility(x) can be metric average, variance, diversity, etc.
1
0
3
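A minimal sketch of that selection, using a greedy utility-per-cost heuristic as a stand-in for an exact knapsack solver; `utility` and `cost` here are illustrative, not the paper's exact definitions:

```python
# Greedy approximation to the 0/1 knapsack subset selection:
# take items with the best utility-per-cost ratio until budget B is spent.
def select_subset(items, utility, cost, budget):
    ranked = sorted(items, key=lambda x: utility(x) / cost(x), reverse=True)
    chosen, spent = [], 0.0
    for x in ranked:
        if spent + cost(x) <= budget:
            chosen.append(x)
            spent += cost(x)
    return chosen

# toy usage: utility = metric variance, cost = source length in words
items = [{"id": 1, "var": 0.9, "words": 12},
         {"id": 2, "var": 0.4, "words": 30},
         {"id": 3, "var": 0.7, "words": 8}]
subset = select_subset(items, lambda x: x["var"], lambda x: x["words"], budget=25)
print([x["id"] for x in subset])  # -> [3, 1]
```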
@zouharvi
Vilém Zouhar
1 month
This works even if you don't have the model outputs yet.
1️⃣ an "artificial crowd" simulates what model outputs would look like; apply the previous methods
2️⃣ "utility predictors" estimate usefulness from the source text alone
3️⃣ "source-based diversity" removes similar inputs
1
0
2
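A minimal sketch of 3️⃣, greedily keeping the sources least similar to those already selected; plain word-overlap (Jaccard) similarity is an illustrative stand-in for whatever similarity measure is actually used:

```python
# Greedy max-min diversity selection over source texts.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def diverse_subset(sources, k):
    chosen = [sources[0]]
    while len(chosen) < k:
        # pick the source farthest from its nearest already-chosen neighbour
        best = max((s for s in sources if s not in chosen),
                   key=lambda s: min(1 - jaccard(s, c) for c in chosen))
        chosen.append(best)
    return chosen

srcs = ["the cat sat on the mat",
        "the cat sat on a mat",  # near-duplicate, picked last
        "stock markets fell sharply today"]
print(diverse_subset(srcs, 2))  # keeps the two dissimilar sources
```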
@zouharvi
Vilém Zouhar
1 month
So what works? Selecting inputs that expose model differences:
1️⃣ high variance in metric scores
2️⃣ diversity in model outputs
3️⃣ high metric consistency with the rest of the dataset
We now need almost 30% fewer annotated examples to get the same model ranking.
1
0
4
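A minimal sketch of 1️⃣, assuming per-system metric scores are already available; the numbers are toy values:

```python
# Select the inputs where systems disagree most: highest variance
# of automatic metric scores across systems.
from statistics import variance

# scores[i][m] = metric score of system m on input i (toy values)
scores = [
    [0.90, 0.88, 0.91],  # systems agree    -> uninformative
    [0.95, 0.40, 0.70],  # systems disagree -> informative
    [0.60, 0.62, 0.58],
]

k = 1  # annotation budget in examples
by_variance = sorted(range(len(scores)), key=lambda i: variance(scores[i]), reverse=True)
print(by_variance[:k])  # -> [1]
```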
@zouharvi
Vilém Zouhar
1 month
We frame this as finding the smallest subset of data (Y ⊆ X) that gives the same model ranking as on the full dataset. Simply picking the hardest examples (lowest average metric score) is a step up but can backfire by selecting the most expensive items to annotate.
1
0
3
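"Gives the same model ranking" can be quantified as pairwise ranking accuracy (a Kendall-tau-style agreement), sketched below with toy scores rather than results from the paper:

```python
# Fraction of model pairs ordered identically by full-data and subset scores.
from itertools import combinations

def pairwise_agreement(full_scores, subset_scores):
    pairs = list(combinations(range(len(full_scores)), 2))
    same = sum((full_scores[a] > full_scores[b]) == (subset_scores[a] > subset_scores[b])
               for a, b in pairs)
    return same / len(pairs)

# toy numbers: average scores of four systems on all data vs. on a subset
full = [0.71, 0.64, 0.80, 0.55]
subset = [0.66, 0.69, 0.78, 0.52]  # one pair flipped
print(pairwise_agreement(full, subset))  # -> 0.8333...
```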
@zouharvi
Vilém Zouhar
1 month
You have a budget to human-evaluate 100 inputs to your models, but your dataset has 10,000 inputs. Do not just pick 100 randomly! 🙅 We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how. 🕵️ (Random is still a devilishly good baseline.)
2
14
73
@zouharvi
Vilém Zouhar
2 months
TIL that since Python 3.4 there's a built-in `statistics` module with things like mean, mode, quantiles, variance, covariance, correlation, zscore, and more! No more needless numpy imports!
1
0
8
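For illustration, the module in action. One caveat: the module itself landed in Python 3.4, but some helpers are newer (`quantiles` in 3.8, `covariance`/`correlation` in 3.10, `NormalDist.zscore` in 3.9):

```python
import statistics as st

data = [2.5, 3.1, 2.9, 4.0, 3.6, 2.8]
other = [1.0, 1.4, 1.2, 2.1, 1.9, 1.1]

print(st.mean(data))               # arithmetic mean
print(st.median(data))             # middle value
print(st.quantiles(data, n=4))     # quartile cut points (3.8+)
print(st.variance(data))           # sample variance
print(st.covariance(data, other))  # sample covariance (3.10+)
print(st.correlation(data, other)) # Pearson's r (3.10+)
print(st.NormalDist(3.0, 0.5).zscore(4.0))  # z-score under N(3, 0.5) -> 2.0
```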
@zouharvi
Vilém Zouhar
2 months
Thank you for your response. I will keep my score.
3
0
36
@zouharvi
Vilém Zouhar
2 months
For a long time I've been using Google Translate as a gateway to explain machine translation concepts to people, as an easily recognizable tool that everyone knows. Now I get to contribute over the summer. 🌞 If you're near Mountain View, let's talk evaluation. 📏
3
0
75
@zouharvi
Vilém Zouhar
2 months
RT @NC_Renic: I am once again pitching my romantic comedy:
- two academics start dating
- discover they are each other's terrible reviewer…
0
742
0