Sima Noorani ✈️ NeurIPS
@NooraniSimaa
122 Followers · 119 Following · 10 Media · 27 Statuses
PhD candidate @Penn
Philadelphia, PA
Joined March 2024
We’re presenting our work today at Poster Session 2 (4:30 PM, #3303). Come check it out and chat with us! :) This is joint work with @ShayanKiyani1 @HamedSHassani @pappasg69.
How can we quantify uncertainty in LLMs from only a few sampled outputs? The key lies in the classical problem of missing mass—the probability of unseen outputs. This perspective offers a principled foundation for conformal prediction in query-only settings like LLMs.
Paper: https://t.co/4z1zUyn0c2 Joint work with the amazing @ShayanKiyani1, @pappasg69, @HamedSHassani.
Empirically, the resulting collaborative sets improve over both human-only and AI-only sets in both marginal coverage and average set size.
We provide finite-sample algorithms in both offline (calibration-based) and online settings, with guarantees that hold under arbitrary distribution shifts (including the natural case of human adaptation to the AI over time).
We show that the optimal solution to this problem admits a simple two-threshold structure over a single nonconformity score: a pruning threshold that decides which labels in H(x) to remove, and an augmentation threshold that decides which new AI-suggested labels to add.
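A minimal sketch of how such a two-threshold rule could look in code, assuming a generic nonconformity score `score(x, y)` where lower means more plausible; the function and threshold names (`tau_prune`, `tau_aug`) are illustrative, not from the paper:

```python
# Hedged sketch of a two-threshold collaborative set map (names are illustrative).
def collaborative_set(x, human_set, label_space, score, tau_prune, tau_aug):
    # Pruning: keep only the human-proposed labels whose nonconformity score
    # falls below the pruning threshold.
    kept = {y for y in human_set if score(x, y) <= tau_prune}
    # Augmentation: add labels outside H(x) whose score falls below the
    # (typically stricter) augmentation threshold.
    added = {y for y in label_space if y not in human_set and score(x, y) <= tau_aug}
    return kept | added
```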
These two principles lead to an explicit optimization problem over collaborative sets C: a trade-off between avoiding counterfactual harm, promoting complementarity, and keeping prediction sets informative (not too large).
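One plausible way to read this trade-off as a constrained program (a sketch under assumed notation, not necessarily the paper's exact formulation): minimize the expected set size subject to a harm tolerance α and a complementarity target β, where the two conditional probabilities are the quantities formalized in the next tweet.

```latex
\min_{C}\ \mathbb{E}\,|C(X)|
\quad \text{s.t.} \quad
\mathbb{P}\big(Y \notin C(X) \mid Y \in H(X)\big) \le \alpha,
\qquad
\mathbb{P}\big(Y \in C(X) \mid Y \notin H(X)\big) \ge \beta.
```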
We formalize the two guiding principles as follows. Counterfactual harm: collaboration should not make the human worse, i.e., keep P(Y ∉ C(X) | Y ∈ H(X)) small. Complementarity: the AI should recover labels the human misses, i.e., make P(Y ∈ C(X) | Y ∉ H(X)) large.
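As a sketch, both quantities can be estimated on held-out calibration data; the helper below is hypothetical and simply counts the two conditional frequencies:

```python
def harm_and_complementarity(calib_pairs, human_set, collab_set):
    """Empirical estimates on calibration pairs (x, y):
    harm            ~ P(Y not in C(X) | Y in H(X))
    complementarity ~ P(Y in C(X)     | Y not in H(X))
    `human_set` and `collab_set` map x to the sets H(x) and C(x)."""
    in_h  = [(x, y) for x, y in calib_pairs if y in human_set(x)]
    out_h = [(x, y) for x, y in calib_pairs if y not in human_set(x)]
    harm = sum(y not in collab_set(x) for x, y in in_h) / max(len(in_h), 1)
    comp = sum(y in collab_set(x) for x, y in out_h) / max(len(out_h), 1)
    return harm, comp
```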
We formalize a simple collaborative prediction setting. Let (X, Y) ∼ P(X, Y), with features X and label Y. A human expert first proposes a set H(x) ⊆ 𝒴 of plausible outcomes, where 𝒴 is the label space. The AI system then refines this proposal by outputting a collaborative set C(x) ⊆ 𝒴.
When humans and AI collaborate, what should uncertainty quantification look like? Our new paper proposes two principles---no counterfactual harm and complementarity---and gives distribution-free guarantees without assumptions on the task, AI model, or human behavior.
How should you use forecasts f:X->R^d to make decisions? It depends on what properties they have. If they are fully calibrated (E[y | f(x) = p] = p), then you should be maximally aggressive and act as if they are correct --- i.e. play argmax_a E_{o ~ f(x)}[u(a,o)]. On the other hand
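A toy numeric sketch of the "act as if the forecast is correct" rule, with an illustrative utility matrix u[a, o] and a made-up forecast p over outcomes:

```python
import numpy as np

u = np.array([[1.0, -0.5],   # utility of action 0 under outcomes 0 and 1
              [0.2,  0.4]])  # utility of action 1 under outcomes 0 and 1
p = np.array([0.3, 0.7])     # forecast f(x): probabilities of the two outcomes

expected_utility = u @ p                    # E_{o ~ f(x)}[u(a, o)] for each action a
best_action = int(np.argmax(expected_utility))
print(best_action, expected_utility)        # -> 1 [-0.05  0.34]
```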
We push conformal prediction and its trade-offs beyond regression & classification — into query-based generative models. Surprisingly (or not?), missing mass & Good-Turing estimators emerge as key tools once again. Very excited about this one!
How can we quantify uncertainty in LLMs from only a few sampled outputs? The key lies in the classical problem of missing mass—the probability of unseen outputs. This perspective offers a principled foundation for conformal prediction in query-only settings like LLMs.
Paper: https://t.co/cv0lD18MNK GitHub: https://t.co/63bQfcBKPw Many, many thanks to my amazing collaborators @ShayanKiyani1 @pappasg69 @HamedSHassani.
github.com · Official repository for Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models (nooranisima/CPQ-missing-mass)
We show that meaningful conformal prediction in query-only settings—like LLMs—arises from deep connections to the classical missing mass problem. Our principled framework balances key trade-offs: coverage, query cost, and informativeness in building prediction sets.
Across all datasets, CPQ achieves: ✔️ tighter coverage ✔️ far lower EE usage
How does CPQ compare to state-of-the-art? We compare CPQ to CLM and SCOPE-Gen, two leading conformal methods for LLMs. But unlike CPQ, they: 📍 Don’t account for missing mass (i.e., unseen outputs) 📍 Don’t control the query budget explicitly
We evaluate CPQ on 3 LLM tasks with fixed query budgets and varying coverage. We show that: ✅ Adding each component improves performance ✅ Full CPQ has valid coverage with minimal EE inclusion ✅ CPQ adapts the prediction set size in a principled way to achieve compact, informative sets
Set map: After querying, CPQ builds optimal prediction sets relying on an estimate of the missing mass itself—using the classical Good-Turing estimator.
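For reference, a minimal sketch of the classical Good-Turing missing-mass estimate that the set map relies on (the estimator itself is standard; exactly how CPQ uses it is detailed in the paper):

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the missing mass:
    (# of distinct outputs seen exactly once) / (total # of samples)."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples) if samples else 1.0

print(good_turing_missing_mass(["a", "b", "a", "c"]))  # 2 singletons / 4 samples = 0.5
```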
Query policy: Querying reduces uncertainty—but when is it enough? 🧐 The optimal strategy is to stop when the missing mass stops decreasing meaningfully! In finite samples, this relies on our novel estimator for the missing mass derivative!
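A hedged sketch of the stopping idea only: the paper uses a dedicated estimator of the missing-mass derivative, while here the change in the plain Good-Turing estimate stands in; `sample`, `tol`, and `budget` are illustrative:

```python
from collections import Counter

def query_until_flat(sample, tol=0.01, budget=50):
    """Illustrative stopping rule (not the paper's estimator): keep querying
    while the Good-Turing missing-mass estimate still drops by more than `tol`."""
    def missing_mass(outputs):
        counts = Counter(outputs)
        return sum(1 for c in counts.values() if c == 1) / len(outputs)

    outputs = [sample()]
    prev = missing_mass(outputs)
    while len(outputs) < budget:
        outputs.append(sample())
        curr = missing_mass(outputs)
        if prev - curr < tol:  # missing mass no longer decreasing meaningfully
            break
        prev = curr
    return outputs
```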
CPQ is built on two core components: 📍 A querying policy (how long to sample) 📍 A set map (how to turn samples into a valid, informative set) And the optimal solution for each module is rooted in the classical problem of missing mass in statistics.
There’s a key trade-off in the query-only setting: More queries improve coverage and reduce reliance on EE—but cost more compute. Fewer queries save resources but increase the need for EE. Our framework CPQ balances coverage, query cost, and informativeness under a fixed budget