José Maria Pombal

@zmprcp

Followers: 88 · Following: 185 · Media: 9 · Statuses: 58

Senior Research Scientist @swordhealth, PhD student @istecnico.

Lisbon, Portugal
Joined March 2023
@zmprcp
José Maria Pombal
11 days
Last week was my final one at @Unbabel. I'm incredibly proud of our work (e.g., Tower, MINT, M-Prometheus, ZSB). Now, alongside my PhD studies at @istecnico, I'm joining @swordhealth as Senior Research Scientist under @RicardoRei7. Super confident in the team we're assembling.
@zmprcp
José Maria Pombal
1 month
RT @ManosZaranis: 🚨 Meet MF²: Movie Facts & Fibs, a new benchmark for long-movie understanding! 🤔 Do you think your model understands movies?…
@zmprcp
José Maria Pombal
1 month
Check out the latest iteration of Tower models, Tower+. Ideal for translation tasks and beyond, and available at three different scales: 2B, 9B, 72B. All available on huggingface: Kudos to everyone involved!
huggingface.co
@RicardoRei7
Ricardo Rei
1 month
🚀 Tower+: our latest model in the Tower family — sets a new standard for open-weight multilingual models! We show how to go beyond sentence-level translation, striking a balance between translation quality and general multilingual capabilities. 1/5
@zmprcp
José Maria Pombal
2 months
RT @dongkeun_yoon: 🙁 LLMs are overconfident even when they are dead wrong. 🧐 What about reasoning models? Can they actually tell us “My an…
@zmprcp
José Maria Pombal
2 months
RT @psanfernandes: MT metrics excel at evaluating sentence translations, but struggle with complex texts. We introduce *TREQA*, a framework…
@zmprcp
José Maria Pombal
4 months
RT @dongkeun_yoon: Introducing M-Prometheus — the latest iteration of the open LLM judge, Prometheus! Specially trained for multilingual ev…
@zmprcp
José Maria Pombal
4 months
RT @seungonekim: Here's our new paper on m-Prometheus, a series of multilingual judges! 1/ Effective at safety & translation eval. 2/ Also…
@zmprcp
José Maria Pombal
4 months
Models and training data: Paper:
@zmprcp
José Maria Pombal
4 months
There were a lot of open questions on what strategies work for building multilingual LLM judges. We perform ablations on our training recipe that highlight the importance of backbone model choice and of using natively multilingual—instead of translated—training data.
@zmprcp
José Maria Pombal
4 months
We fine-tune Qwen2.5 models with a recipe inspired by Prometheus 2. We release two multilingual datasets: M-Feedback-Collection and M-Preference-Collection. They contain DA and PWC data for 5 languages, and MT eval data for 8 language pairs. Our models perform well on unseen languages.
@zmprcp
José Maria Pombal
4 months
For their size, M-Prometheus models achieve SotA performance on multilingual reward benchmarks and literary MT evaluation. They can also be used to significantly improve multilingual LLM outputs via best-of-n decoding (quality-aware decoding, QAD)! Very useful for refining synthetic data, for example.
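The best-of-n reranking idea above can be sketched in a few lines. This is a minimal illustration, not the released pipeline: `generate` and `judge_score` are hypothetical placeholders standing in for a multilingual LLM's sampling call and an M-Prometheus-style quality score.

```python
# Hedged sketch of best-of-n (quality-aware) decoding with an LLM judge.
# Both `generate` and `judge_score` are invented stand-ins for real model calls.
import random


def generate(prompt: str, n: int) -> list[str]:
    # Placeholder: sample n candidate outputs from a multilingual LLM.
    return [f"candidate {i} for: {prompt}" for i in range(n)]


def judge_score(prompt: str, candidate: str) -> float:
    # Placeholder: a judge model returns a quality score (e.g., a 1-5 DA scale).
    # Seeded so the toy score is stable for a given (prompt, candidate) pair.
    rng = random.Random(hash((prompt, candidate)) % (2**32))
    return rng.uniform(1.0, 5.0)


def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n candidates, keep the one the judge scores highest.
    candidates = generate(prompt, n)
    return max(candidates, key=lambda c: judge_score(prompt, c))


print(best_of_n("Translate to Portuguese: 'Good morning!'"))
```

The same selection loop works for refining synthetic data: generate several drafts per example and keep only the judge's top pick.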
@zmprcp
José Maria Pombal
4 months
We just released M-Prometheus, a suite of strong open multilingual LLM judges at 3B, 7B, and 14B parameters! Check out the models and training data on Huggingface: and our paper:
@zmprcp
José Maria Pombal
4 months
Massive kudos to collaborators @nunonmg, @RicardoRei7, and @andre_t_martins. To use our benchmarks and create your own, check out our repository:
github.com/deep-spin/zsb
@zmprcp
José Maria Pombal
4 months
We also perform a series of ablations, finding that dataset variety (not size!) and judge model size are crucial factors driving benchmark performance.
@zmprcp
José Maria Pombal
4 months
We meta-evaluate ZSB across LLM general capabilities in 4 languages (English, Chinese, French, and Korean), translation, and VLM general capabilities in English. ZSB consistently outperforms widely used standard benchmarks (MMLU, GPQA, MMBench, etc.) at ranking models.
@zmprcp
José Maria Pombal
4 months
We simplify previous approaches: ZSB only requires the creation of a prompt for data generation and one for evaluation, and the rest is handled by a language model, which functions as a data generator and a judge. This approach is scalable to various tasks and modalities.
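As a rough illustration of that two-prompt design, the loop below shows one template driving data generation and one driving judging. Everything here is an assumption for illustration: `call_llm` is a hypothetical stand-in for any chat-completion API, and the templates are invented, not the ones from the ZSB repository.

```python
# Hedged sketch of a Zero-shot Benchmarking (ZSB)-style loop: one prompt to
# generate test items, one to judge answers; a single LLM plays both roles.

GEN_TEMPLATE = "Write one challenging {task} test question."
JUDGE_TEMPLATE = "Question: {q}\nAnswer: {a}\nScore the answer from 1 to 5."


def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; always answers "3" in this toy.
    return "3"


def build_benchmark(task: str, size: int) -> list[str]:
    # The LLM acts as the data generator.
    return [call_llm(GEN_TEMPLATE.format(task=task)) for _ in range(size)]


def evaluate(system, questions: list[str]) -> float:
    # The LLM acts as the judge; return the system's mean score.
    scores = []
    for q in questions:
        answer = system(q)
        verdict = call_llm(JUDGE_TEMPLATE.format(q=q, a=answer))
        scores.append(float(verdict))
    return sum(scores) / len(scores)
```

Swapping in a new task or modality only means writing two new prompt templates, which is what makes the approach scalable.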
@zmprcp
José Maria Pombal
4 months
New paper out 🚀 Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models: We present a framework and release a repository for creating reliable benchmarks for (V)LM tasks quickly and fully automatically.
@zmprcp
José Maria Pombal
4 months
RT @slatornews: .@Unbabel exposes 🔎 how using the same metrics for both training and evaluation can create misleading ⚠️ #machinetranslatio…
slator.com
Unbabel proposes a method to improve the accuracy of machine translation evaluation.
@zmprcp
José Maria Pombal
4 months
RT @fbk_mt: Our pick of the week by @apierg: "Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation" by José Pomb…