José Maria Pombal
@zmprcp
Followers: 94 · Following: 198 · Media: 9 · Statuses: 64
Senior Research Scientist @swordhealth, PhD student @istecnico.
Lisbon, Portugal
Joined March 2023
I'll be at COLM today presenting M-Prometheus (morning, Poster 40) and Zero-shot Benchmarking (afternoon, Poster 9). Come check it out!
Don't miss our lab's presentations today at @COLM_conf!! 🔥 We will have two presentations. 1/3
1) Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models w/ @zmprcp @nunonmg @RicardoRei7 - Poster session 2, Tue Oct 7, 4:30 PM – 6:30 PM
2) M-Prometheus: A Suite of Open Multilingual LLM Judges w/ @zmprcp @dongkeun_yoon @psanfernandes @ianwu97 @seungonekim @RicardoRei7 @gneubig - (Poster session 1, Tue Oct 7, 11:00 AM – 1:00 PM)
I'll be at ACL presenting our work, A Context-aware Framework for Translation-mediated Conversations (https://t.co/3Y3IM2n3HU), in the Machine Translation session, 28 Jul, 14:00-15:30, room 1.85. Come check it out if you're interested in bilingual chat MT!
Last week was my final one at @Unbabel. I'm incredibly proud of our work (e.g., Tower, MINT, M-Prometheus, ZSB). Now, alongside my PhD studies at @istecnico, I'm joining @swordhealth as Senior Research Scientist under @RicardoRei7. Super confident in the team we're assembling.
🚨Meet MF² (Movie Facts & Fibs): a new benchmark for long-movie understanding! 🤔Do you think your model understands movies? Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗, things humans recall easily, but even top models like…
Check out the latest iteration of Tower models, Tower+. Ideal for translation tasks and beyond, and available at three scales: 2B, 9B, and 72B, all on Hugging Face: https://t.co/XWJqTeht7R Kudos to everyone involved!
🚀 Tower+: our latest model in the Tower family — sets a new standard for open-weight multilingual models! We show how to go beyond sentence-level translation, striking a balance between translation quality and general multilingual capabilities. 1/5 https://t.co/WKQapk31c0
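For anyone who wants to try the checkpoints, here is a minimal sketch of querying a Tower+ model with Hugging Face transformers. The repo id Unbabel/Tower-Plus-9B and the chat prompt are assumptions on my part; the model card behind the links above has the real identifiers and template.

```python
# Minimal sketch: translating with a Tower+ checkpoint via Hugging Face transformers.
# The repo id "Unbabel/Tower-Plus-9B" and the prompt wording are assumptions;
# consult the model card for the exact identifiers and template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/Tower-Plus-9B"  # assumed name; 2B and 72B variants were also announced
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": "Translate the following text from English into Portuguese.\n"
               "English: The new models come in three sizes.\nPortuguese:",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```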
🙁 LLMs are overconfident even when they are dead wrong. 🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”? ❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
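To make "verbalized confidence" concrete, here is a hedged sketch (not the paper's code) of eliciting a stated confidence and checking calibration with a Brier score. The prompt wording, the parsing regex, and the commented-out query_model and grade helpers are all hypothetical.

```python
# Hypothetical sketch of measuring verbalized confidence calibration.
import re

PROMPT = (
    "{question}\n\nAfter your final answer, state your confidence that it is "
    "correct as a percentage, e.g. 'Confidence: 60%'."
)

def parse_confidence(text: str) -> float:
    """Extract 'Confidence: NN%' from a response; fall back to 0.5 if absent."""
    m = re.search(r"Confidence:\s*(\d{1,3})\s*%", text)
    return min(int(m.group(1)), 100) / 100 if m else 0.5

def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared gap between stated confidence and actual correctness."""
    return sum((c - float(k)) ** 2 for c, k in zip(confidences, correct)) / len(correct)

# Hypothetical usage; query_model and grade are placeholders for your stack:
# responses = [query_model(PROMPT.format(question=q)) for q in questions]
# confs = [parse_confidence(r) for r in responses]
# print(brier_score(confs, [grade(r, a) for r, a in zip(responses, gold_answers)]))
```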
MT metrics excel at evaluating sentence translations, but struggle with complex texts. We introduce *TREQA*, a framework to assess how well translations preserve key information by using LLMs to generate & answer questions about them: https://t.co/aHUScXzoBM (co-lead @swetaagrawal20) 1/15
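In spirit, the pipeline reads like the sketch below: one LLM writes questions probing key information in the source, answers are produced from both the source and the candidate translation, and agreement becomes the score. Everything here is schematic; ask_llm is a placeholder and the real framework is in the linked paper.

```python
# Schematic of a QA-based translation evaluation loop in the spirit of TREQA.
# `ask_llm` stands in for any chat-completion call; this is not the authors' code.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def evaluate_translation(source: str, translation: str, n_questions: int = 5) -> float:
    # 1) Generate questions probing key information in the source passage.
    questions = ask_llm(
        f"Write {n_questions} short factual questions (one per line) that test "
        f"understanding of this passage:\n{source}"
    ).splitlines()

    # 2) Answer each question from the source and from the translation.
    hits = 0
    for q in questions:
        gold = ask_llm(f"Passage:\n{source}\n\nAnswer briefly: {q}")
        hyp = ask_llm(f"Passage:\n{translation}\n\nAnswer briefly: {q}")
        # 3) Count a hit when the answers agree (the paper uses more robust
        #    comparisons; exact match keeps the sketch simple).
        hits += int(gold.strip().lower() == hyp.strip().lower())
    return hits / max(len(questions), 1)
```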
Introducing M-Prometheus — the latest iteration of the open LLM judge, Prometheus! Specially trained for multilingual evaluation. Excels across diverse settings, including the challenging task of literary translation assessment.
We just released M-Prometheus, a suite of strong open multilingual LLM judges at 3B, 7B, and 14B parameters! Check out the models and training data on Huggingface: https://t.co/nqixsAQtQ0 and our paper: https://t.co/c93J4YGXZH
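As a hedged illustration of what judging with one of these checkpoints might look like: the repo id Unbabel/M-Prometheus-7B and the rubric prompt below follow the Prometheus 2 style but are assumptions; verify both against the Hugging Face model card linked above.

```python
# Sketch: direct-assessment judging with an M-Prometheus checkpoint.
# Repo id and prompt format are assumptions based on the Prometheus 2 recipe.
from transformers import pipeline

judge = pipeline("text-generation", model="Unbabel/M-Prometheus-7B", device_map="auto")

prompt = """###Task Description:
Given an instruction, a response, and a 1-5 scoring rubric, write brief feedback
and end with: "[RESULT] <score>".

###Instruction: Translate to Portuguese: "The weather is lovely today."
###Response: "O tempo está ótimo hoje."
###Rubric: Is the translation accurate and fluent?

###Feedback:"""

out = judge(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"]
print(out.rsplit("[RESULT]", 1)[-1].strip())  # the 1-5 score, if the judge complied
```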
Here's our new paper on M-Prometheus, a series of multilingual judges! 1/ Effective at safety & translation eval 2/ Also stands out as a good reward model in BoN 3/ Backbone model selection & training on natively multilingual data is important. Check out @zmprcp's post!
Massive kudos to co-authors @dongkeun_yoon @psanfernandes @ianwu97 @seungonekim @RicardoRei7 @gneubig @andre_t_martins
There were a lot of open questions on what strategies work for building multilingual LLM judges. We perform ablations on our training recipe that highlight the importance of backbone model choice and of using natively multilingual—instead of translated—training data.
We fine-tune Qwen2.5 models with a recipe inspired by Prometheus 2. We release two multilingual datasets: M-Feedback-Collection and M-Preference-Collection. They contain direct assessment (DA) and pairwise comparison (PWC) data for 5 languages, and MT evaluation data for 8 language pairs. Our models perform well on unseen languages.
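A quick, hedged sketch of pulling the released data with the Hugging Face datasets library; the dataset ids below are guesses from the announcement, so confirm the exact repo names on the Hub.

```python
# Sketch: inspecting the released training data with Hugging Face `datasets`.
# The dataset ids are assumed from the announcement; confirm them on the Hub.
from datasets import load_dataset

feedback = load_dataset("Unbabel/M-Feedback-Collection", split="train")      # DA-style data
preference = load_dataset("Unbabel/M-Preference-Collection", split="train")  # PWC-style data

print(feedback)       # row counts and column names
print(feedback[0])    # one direct-assessment example
print(preference[0])  # one pairwise-comparison example
```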
For their size, M-Prometheus models achieve SotA performance on multilingual reward benchmarks and literary MT evaluation. They can also be used to significantly improve multilingual LLM outputs via best-of-n decoding (QAD)! Very useful for refining synthetic data, for example.
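Best-of-n with a judge as the reward model is simple enough to sketch generically; generate and judge_score below are placeholders for your sampler and an M-Prometheus-style rating call, not the paper's implementation.

```python
# Generic best-of-n decoding with an LLM judge as the reward model.
# `generate` and `judge_score` are placeholders, not the paper's code.
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],            # samples one candidate per call
    judge_score: Callable[[str, str], float],  # e.g., an M-Prometheus 1-5 rating
    n: int = 8,
) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    scores = [judge_score(prompt, c) for c in candidates]
    # Keep the candidate the judge rates highest; useful for refining
    # synthetic data, as the post above suggests.
    return candidates[max(range(n), key=scores.__getitem__)]
```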
We just released M-Prometheus, a suite of strong open multilingual LLM judges at 3B, 7B, and 14B parameters! Check out the models and training data on Huggingface: https://t.co/nqixsAQtQ0 and our paper: https://t.co/c93J4YGXZH
Massive kudos to collaborators @nunonmg @RicardoRei7 and @andre_t_martins. To use our benchmarks and create your own, check out our repository:
github.com/deep-spin/zsb
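The repository documents the actual interface; purely as a schematic of the ZSB idea (one LLM synthesizes test instances, the candidate model answers, a judge LLM grades), here is a hedged sketch in which every callable is a placeholder.

```python
# Schematic of the zero-shot benchmarking loop: synthesize test items with one
# LLM, collect a candidate model's answers, and grade them with a judge LLM.
# All callables are placeholders; deep-spin/zsb defines the real interface.
from typing import Callable

LLM = Callable[[str], str]  # any prompt-in, text-out model call

def zsb_score(task: str, generator: LLM, candidate: LLM, judge: LLM,
              n_items: int = 50) -> float:
    """Generate test items for `task`, answer them, and average judge grades."""
    items = [generator(f"Write one test instance for this task:\n{task}")
             for _ in range(n_items)]
    grades = []
    for item in items:
        answer = candidate(item)
        verdict = judge(
            f"Task: {task}\nInstance: {item}\nAnswer: {answer}\n"
            "Rate the answer from 1 to 5. Reply with the number only."
        )
        grades.append(float(verdict.strip()))
    return sum(grades) / len(grades)
```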