Gabriel Stanovsky
@GabiStanovsky
839 Followers · 1K Following · 12 Media · 282 Statuses
Assistant Professor at @CseHuji
Joined August 2012
🧩 PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation — a modular framework for easier, reproducible multi-prompt evaluation. 📍 Poster - Nov 6, 16:30 @ System Demos, with @Dahan_Noam, @GiliLior, @GabiStanovsky Website & paper: https://t.co/oH1V6SJYlc
eliyahabba.github.io
A flexible framework for automatic generation of prompt variations for robust LLM evaluation.
Happening now! 🌈 PromptSuite @ #EMNLP2025 System Demos (16:30) Come chat about prompt robustness, evaluation, and LLM brittleness 💬
Excited to present our paper "ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments" at #EMNLP2025 🎉 Everyone knows LLMs are prompt-sensitive, yet we still report single-prompt scores. Our work suggests a method to make evaluation statistically reliable!
Going to @emnlpmeeting!! ✈️ On November 6th, @Itay_Itzhak_, @FazlBarez, and I will present our work "Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer" at the Findings 2 poster session at 12:30, w/ @GabiStanovsky and @boknilev. https://t.co/HQRacQZwhP
I can't make it to #EMNLP2025, but @EliyaHabba and @GiliLior will present our PromptSuite🌈(demo!): a framework tackling prompt sensitivity by generating benchmark variations for any task. Try it with just a few lines of code or the web interface! https://t.co/C4VwIvAhvv
LLMs can hallucinate for different reasons: ❌ They don't know (lack of knowledge) ❌ They "know" but are uncertain ❌ They "know" and are certain A new extended version of our paper, combining our understanding of hallucination along the knowledge and certainty axes, is out 🧵
Heading to #EMNLP2025! 🎉 Two of our papers will be there — come say hi 👋 🖼️ Image Captioning Evaluation — Nov 5, 17:45 📄 https://t.co/TdMVA2iWSD 🕵️ Deceptive LLM Agents (Mafia Game) — Nov 5, 13:00 📄
How can we help LLMs move beyond the obvious toward generating more creative and diverse ideas? In our new TACL paper, we propose a novel approach to enhance LLM creative generation! https://t.co/AFCpQddN6j
@ChenShani2 @GabiStanovsky @jurafsky @HyadataLab @stanfordnlp @nlphuji
Our 🌈 PromptSuite paper has been accepted to #EMNLP2025 🇨🇳 (System Demonstrations)! 🎉 🌈 PromptSuite is a flexible framework for generating thousands of prompt variations per instance - enabling robust, task-agnostic evaluation of LLMs. @Dahan_Noam, @GiliLior, @GabiStanovsky
Had a blast at CoLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
🚨Spotlight update🚨 Our paper on bias origins in LLMs is a *spotlight* paper with oral presentation at CoLM 2025!✨ Honored to be among just 24 selected and super excited to present and discuss biases and finetuning limits. Who’s joining in Montreal Tuesday morning? 👀
🚨New paper alert🚨 🧠 Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing? Excited to share our new paper, accepted to CoLM 2025🎉! See thread below 👇 #BiasInAI #LLMs #MachineLearning #NLProc
What if LLMs can forecast their own scores on unseen benchmarks from just a task description? We are the first to study text description→performance prediction, giving practitioners an early read on outcomes so they can plan what to build—before paying full price 💸
Happy to share that our image captioning evaluation survey was accepted to TACL! I will be presenting the paper at @emnlpmeeting
1/ Into image captioning? Don't miss this! Struggling to keep up with the influx of new metrics but still see the same 5 (BLEU, METEOR, ROUGE, CIDEr, SPICE) leading? Read our recent captioning evaluation survey! https://t.co/TdMVA2ip35 w/ @GabiStanovsky, @AbendOmri, @leafrermann
Old news: Single-prompt eval is unreliable🤯 New news: PromptSuite🌈 - an easy way to augment your benchmark with thousands of paraphrases ➡️ robust eval, zero sweat! - Works on any dataset! - Python API + web UI @EliyaHabba, @GiliLior, @GabiStanovsky
https://t.co/C4VwIvzJFX
Very pleased that "Trust Me, I'm Wrong" was accepted to @emnlpmeeting Findings! "Trust Me, I'm Wrong" shows that LLMs can hallucinate with high certainty even when they know the correct answer! Check out our latest work with @Itay_Itzhak_, @FazlBarez, @GabiStanovsky, and @boknilev.
How can we evaluate the real-world impact of generative AI? Great panel at the GEM2 workshop, #ACL2025NLP 🇦🇹