Alberto Testoni @alberto_testoni X Profile

Followers: 241

Following: 285

Media: 16

Statuses: 70

PostDoc @amsterdamumc / NLP4Health. Prev. @UvA_Amsterdam/@illc_amsterdam. PhD @UniTrento_DISI - @AmazonScience. MSc @cimec_unitrento, BSc @Unibo.

Amsterdam

Joined January 2012

Alberto Testoni

@alberto_testoni

29 days

RT @Cohere_Labs: Today, our team will share “From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions” at @aclmeeting!…

0

4

0

Alberto Testoni

@alberto_testoni

3 months

RT @ale_suglia: Excited to present PLAYPEN, an environment for learning through dialogue game self-play. Are you interested in LLM post-tra…

0

9

0

Alberto Testoni

@alberto_testoni

6 months

RT @CohereForAI: Can LLMs collaborate effectively over long-term interactions, like a human teammate, especially in coding tasks? 🤔 We int…

0

7

0

Alberto Testoni

@alberto_testoni

6 months

RT @mziizm: Excited to share insights from our new paper on evaluating LLMs in multi-session coding interactions! 📚📚📚 We introduce MEMORYC…

0

9

0

Alberto Testoni

@alberto_testoni

8 months

4/4 Our results reveal significant limitations and problems of overconfidence of state-of-the-art large V&L models. For more analyses on the role of the saliency features that guide the model selection and on CoT prompting, check out our paper! 🏓

arxiv.org

Ambiguity resolution is key to effective communication. While humans effortlessly address ambiguity through conversational grounding strategies, the extent to which current language models can...

0

2

Alberto Testoni

@alberto_testoni

8 months

3/4 We find significant limitations of all models in responding to these questions. But what can go wrong when ambiguity is not recognized? In RAcQUEt-Bias, we analyze a critical yet underexplored problem: failing to address ambiguity can lead to stereotypical responses.

1

0

1

Alberto Testoni

@alberto_testoni

8 months

2/4 We examine referential ambiguity in image-based question answering by introducing a manually curated dataset, RAcQUEt. We categorize a range of human responses into distinct classes to gauge the way they respond to ambiguity and use these for evaluating model outputs.

1

0

Alberto Testoni

@alberto_testoni

8 months

1/4 Excited to share our latest paper “🏓 RAcQUEt: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs”. Joint work with @barbara_plank and @raquel_dmg. #NLProc 🧵

1

5

Alberto Testoni

@alberto_testoni

9 months

RT @cimec_unitrento: 🔍 Papers being presented: 1️⃣ Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and…

0

2

0

Alberto Testoni

@alberto_testoni

10 months

RT @dmazzaccara: Flying to Miami! I will present “Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Ex…

0

5

0

Alberto Testoni

@alberto_testoni

11 months

RT @barbara_plank: PhD opportunities in Munich 🥳 - consider applying to MCML and reach out if you are interested in @MaiNLPlab research the…

0

13

0

Alberto Testoni

@alberto_testoni

1 year

2) "Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models" work led by @anna_bavaresco_ with @raquel_dmg (poster 12/8 at 14:00 + oral 13/8 11:45)

arxiv.org

Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported...

0

7

Alberto Testoni

@alberto_testoni

1 year

I am attending #ACL2024 in Bangkok with 2 papers on multimodal #NLProc 🇹🇭 🧵 1) "Naming, Describing, and Quantifying Visual Objects in Humans and LLMs" with @sandropezzelle and J. Sprott (poster 12/8 at 14:00 - with a fun game for attendees)

arxiv.org

While human speakers use a variety of different expressions when describing the same object in an image, giving rise to a distribution of plausible labels driven by pragmatic constraints, the...

2

19

Alberto Testoni

@alberto_testoni

1 year

5/5 ⚠️ We conclude that LLMs are not yet ready to systematically replace human judges in NLP, and caution against using LLMs for this purpose. JUDGE-BENCH is intended as a living benchmark, and you are welcome to contribute:

github.com

Contribute to dmg-illc/JUDGE-BENCH development by creating an account on GitHub.

0

2

12

Alberto Testoni

@alberto_testoni

1 year

4/5 📊 The gap between open and closed models is narrowing, indicating promising prospects for reproducibility. When evaluating different linguistic dimensions, GPT-4o and Gemini-1.5 perform best in acceptability and verbosity, while Mixtral leads in coherence and consistency.

1

9

Alberto Testoni

@alberto_testoni

1 year

3/5 📊 We find that LLMs exhibit a large variance across datasets in their correlation to human judgments. While some LLMs correlate well with human judgments on some datasets, each tested LLM performs poorly on some others and exhibits significant variance across datasets.

1

0

8

Alberto Testoni

@alberto_testoni

1 year

2/5 🔍 Our evaluation goes beyond existing work by including a wide variety of datasets that differ in the type of task, the property being judged, the type of judgments, and the expertise of human annotators. We evaluate 11 open-weight and proprietary LLMs of different sizes.

1

0

7

Alberto Testoni

@alberto_testoni

1 year

1/5 📣 Excited to share “LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks”! 🚀 We introduce JUDGE-BENCH, a benchmark to investigate to what extent LLM-generated judgements align with human evaluations. #NLProc

4

24

97

Alberto Testoni

@alberto_testoni

1 year

RT @raquel_dmg: I'm looking for a last-minute emergency reviewer for a COLM submission related to generation with LLMs. Reviews need to be…

0

4

0

Alberto Testoni

@alberto_testoni

1 year

RT @ELLISforEurope: 22 researchers from 12 European institutions discussed future directions in open #LLMs and multimodal language technolo…

0

7

0