José Maria Pombal
@zmprcp
Followers: 94 · Following: 198 · Media: 9 · Statuses: 64
Senior Research Scientist @swordhealth, PhD student @istecnico.
Lisbon, Portugal
Joined March 2023
I'll be at COLM today presenting M-Prometheus (morning, Poster 40) and Zero-shot Benchmarking (afternoon, Poster 9). Come check it out!
Don't miss our lab's presentations today at @COLM_conf!! 🔥 We will have two presentations. 1/3
1) Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models w/ @zmprcp @nunonmg @RicardoRei7 - Poster session 2, Tue Oct 7, 4:30 PM – 6:30 PM
2) M-Prometheus: A Suite of Open Multilingual LLM Judges w/ @zmprcp @dongkeun_yoon @psanfernandes @ianwu97 @seungonekim @RicardoRei7 @gneubig - (Poster session 1, Tue Oct 7, 11:00 AM – 1:00 PM)
I'll be at ACL presenting our work, A Context-aware Framework for Translation-mediated Conversations (https://t.co/3Y3IM2n3HU), in the Machine Translation session, 28 Jul, 14:00-15:30, room 1.85. Come check it out if you're interested in bilingual chat MT!
Last week was my final one at @Unbabel. I'm incredibly proud of our work (e.g., Tower, MINT, M-Prometheus, ZSB). Now, alongside my PhD studies at @istecnico, I'm joining @swordhealth as Senior Research Scientist under @RicardoRei7. Super confident in the team we're assembling.
🚨Meet MF² (Movie Facts & Fibs): a new benchmark for long-movie understanding! 🤔Do you think your model understands movies? Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗, things humans recall easily, but even top models like…
Check out the latest iteration of Tower models, Tower+. Ideal for translation tasks and beyond, and available at three scales: 2B, 9B, and 72B, all on Hugging Face: https://t.co/XWJqTeht7R Kudos to everyone involved!
🚀 Tower+: our latest model in the Tower family — sets a new standard for open-weight multilingual models! We show how to go beyond sentence-level translation, striking a balance between translation quality and general multilingual capabilities. 1/5 https://t.co/WKQapk31c0
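For anyone who wants to try the checkpoints, here is a minimal sketch of querying a Tower+ model with Hugging Face transformers. The repo id Unbabel/Tower-Plus-9B and the chat prompt are assumptions on my part; the model card behind the links above has the real identifiers and template.

```python
# Minimal sketch: translating with a Tower+ checkpoint via Hugging Face transformers.
# The repo id "Unbabel/Tower-Plus-9B" and the prompt wording are assumptions;
# consult the model card for the exact identifiers and template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/Tower-Plus-9B"  # assumed name; 2B and 72B variants were also announced
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": "Translate the following text from English into Portuguese.\n"
               "English: The new models come in three sizes.\nPortuguese:",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```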
🙁 LLMs are overconfident even when they are dead wrong. 🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”? ❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
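To make "verbalized confidence" concrete, here is a hedged sketch (not the paper's code) of eliciting a stated confidence and checking calibration with a Brier score. The prompt wording, the parsing regex, and the commented-out query_model and grade helpers are all hypothetical.

```python
# Hypothetical sketch of measuring verbalized confidence calibration.
import re

PROMPT = (
    "{question}\n\nAfter your final answer, state your confidence that it is "
    "correct as a percentage, e.g. 'Confidence: 60%'."
)

def parse_confidence(text: str) -> float:
    """Extract 'Confidence: NN%' from a response; fall back to 0.5 if absent."""
    m = re.search(r"Confidence:\s*(\d{1,3})\s*%", text)
    return min(int(m.group(1)), 100) / 100 if m else 0.5

def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared gap between stated confidence and actual correctness."""
    return sum((c - float(k)) ** 2 for c, k in zip(confidences, correct)) / len(correct)

# Hypothetical usage; query_model and grade are placeholders for your stack:
# responses = [query_model(PROMPT.format(question=q)) for q in questions]
# confs = [parse_confidence(r) for r in responses]
# print(brier_score(confs, [grade(r, a) for r, a in zip(responses, gold_answers)]))
```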
MT metrics excel at evaluating sentence translations, but struggle with complex texts. We introduce *TREQA*, a framework to assess how well translations preserve key information by using LLMs to generate & answer questions about them: https://t.co/aHUScXzoBM (co-lead @swetaagrawal20) 1/15
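In spirit, the pipeline reads like the sketch below: one LLM writes questions probing key information in the source, answers are produced from both the source and the candidate translation, and agreement becomes the score. Everything here is schematic; ask_llm is a placeholder and the real framework is in the linked paper.

```python
# Schematic of a QA-based translation evaluation loop in the spirit of TREQA.
# `ask_llm` stands in for any chat-completion call; this is not the authors' code.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def evaluate_translation(source: str, translation: str, n_questions: int = 5) -> float:
    # 1) Generate questions probing key information in the source passage.
    questions = ask_llm(
        f"Write {n_questions} short factual questions (one per line) that test "
        f"understanding of this passage:\n{source}"
    ).splitlines()

    # 2) Answer each question from the source and from the translation.
    hits = 0
    for q in questions:
        gold = ask_llm(f"Passage:\n{source}\n\nAnswer briefly: {q}")
        hyp = ask_llm(f"Passage:\n{translation}\n\nAnswer briefly: {q}")
        # 3) Count a hit when the answers agree (the paper uses more robust
        #    comparisons; exact match keeps the sketch simple).
        hits += int(gold.strip().lower() == hyp.strip().lower())
    return hits / max(len(questions), 1)
```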
Introducing M-Prometheus — the latest iteration of the open LLM judge, Prometheus! Specially trained for multilingual evaluation. Excels across diverse settings, including the challenging task of literary translation assessment.
We just released M-Prometheus, a suite of strong open multilingual LLM judges at 3B, 7B, and 14B parameters! Check out the models and training data on Huggingface: https://t.co/nqixsAQtQ0 and our paper: https://t.co/c93J4YGXZH
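As a hedged illustration of what judging with one of these checkpoints might look like: the repo id Unbabel/M-Prometheus-7B and the rubric prompt below follow the Prometheus 2 style but are assumptions; verify both against the Hugging Face model card linked above.

```python
# Sketch: direct-assessment judging with an M-Prometheus checkpoint.
# Repo id and prompt format are assumptions based on the Prometheus 2 recipe.
from transformers import pipeline

judge = pipeline("text-generation", model="Unbabel/M-Prometheus-7B", device_map="auto")

prompt = """###Task Description:
Given an instruction, a response, and a 1-5 scoring rubric, write brief feedback
and end with: "[RESULT] <score>".

###Instruction: Translate to Portuguese: "The weather is lovely today."
###Response: "O tempo está ótimo hoje."
###Rubric: Is the translation accurate and fluent?

###Feedback:"""

out = judge(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"]
print(out.rsplit("[RESULT]", 1)[-1].strip())  # the 1-5 score, if the judge complied
```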
Here's our new paper on M-Prometheus, a series of multilingual judges! 1/ Effective at safety & translation eval 2/ Also stands out as a good reward model in BoN 3/ Backbone model selection & training on natively multilingual data is important. Check out @zmprcp's post!
Massive kudos to co-authors @dongkeun_yoon @psanfernandes @ianwu97 @seungonekim @RicardoRei7 @gneubig @andre_t_martins
There were a lot of open questions on what strategies work for building multilingual LLM judges. We perform ablations on our training recipe that highlight the importance of backbone model choice and of using natively multilingual—instead of translated—training data.
We fine-tune Qwen2.5 models with a recipe inspired by Prometheus 2. We release two multilingual datasets: M-Feedback-Collection and M-Preference-Collection. They contain direct assessment (DA) and pairwise comparison (PWC) data for 5 languages, and MT evaluation data for 8 language pairs. Our models perform well on unseen languages.
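A quick, hedged sketch of pulling the released data with the Hugging Face datasets library; the dataset ids below are guesses from the announcement, so confirm the exact repo names on the Hub.

```python
# Sketch: inspecting the released training data with Hugging Face `datasets`.
# The dataset ids are assumed from the announcement; confirm them on the Hub.
from datasets import load_dataset

feedback = load_dataset("Unbabel/M-Feedback-Collection", split="train")      # DA-style data
preference = load_dataset("Unbabel/M-Preference-Collection", split="train")  # PWC-style data

print(feedback)       # row counts and column names
print(feedback[0])    # one direct-assessment example
print(preference[0])  # one pairwise-comparison example
```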
For their size, M-Prometheus models achieve SotA performance on multilingual reward benchmarks and literary MT evaluation. They can also be used to significantly improve multilingual LLM outputs via best-of-n decoding (QAD)! Very useful for refining synthetic data, for example.
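Best-of-n with a judge as the reward model is simple enough to sketch generically; generate and judge_score below are placeholders for your sampler and an M-Prometheus-style rating call, not the paper's implementation.

```python
# Generic best-of-n decoding with an LLM judge as the reward model.
# `generate` and `judge_score` are placeholders, not the paper's code.
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],            # samples one candidate per call
    judge_score: Callable[[str, str], float],  # e.g., an M-Prometheus 1-5 rating
    n: int = 8,
) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    scores = [judge_score(prompt, c) for c in candidates]
    # Keep the candidate the judge rates highest; useful for refining
    # synthetic data, as the post above suggests.
    return candidates[max(range(n), key=scores.__getitem__)]
```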
We just released M-Prometheus, a suite of strong open multilingual LLM judges at 3B, 7B, and 14B parameters! Check out the models and training data on Huggingface: https://t.co/nqixsAQtQ0 and our paper: https://t.co/c93J4YGXZH
Massive kudos to collaborators @nunonmg @RicardoRei7 and @andre_t_martins. To use our benchmarks and create your own, check out our repository:
github.com/deep-spin/zsb
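The repository documents the actual interface; purely as a schematic of the ZSB idea (one LLM synthesizes test instances, the candidate model answers, a judge LLM grades), here is a hedged sketch in which every callable is a placeholder.

```python
# Schematic of the zero-shot benchmarking loop: synthesize test items with one
# LLM, collect a candidate model's answers, and grade them with a judge LLM.
# All callables are placeholders; deep-spin/zsb defines the real interface.
from typing import Callable

LLM = Callable[[str], str]  # any prompt-in, text-out model call

def zsb_score(task: str, generator: LLM, candidate: LLM, judge: LLM,
              n_items: int = 50) -> float:
    """Generate test items for `task`, answer them, and average judge grades."""
    items = [generator(f"Write one test instance for this task:\n{task}")
             for _ in range(n_items)]
    grades = []
    for item in items:
        answer = candidate(item)
        verdict = judge(
            f"Task: {task}\nInstance: {item}\nAnswer: {answer}\n"
            "Rate the answer from 1 to 5. Reply with the number only."
        )
        grades.append(float(verdict.strip()))
    return sum(grades) / len(grades)
```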