Mingxuan (Aldous) Li Profile
Mingxuan (Aldous) Li

@itea1001

Followers: 6
Following: 17
Media: 3
Statuses: 18

Student at the University of Chicago

Joined January 2024
@itea1001
Mingxuan (Aldous) Li
2 months
RT @Elenal3ai: 🚨 New paper alert 🚨 Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? W…
0
10
0
@itea1001
Mingxuan (Aldous) Li
2 months
HypoEval evaluators are now incorporated into judges from @QuotientAI; check it out!
0
4
4
@itea1001
Mingxuan (Aldous) Li
2 months
12/n Acknowledgments: Many thanks to my wonderful collaborators @lihanc02 and my advisor @ChenhaoTan! Check out the full paper!
0
0
3
@itea1001
Mingxuan (Aldous) Li
2 months
11/n Closing thoughts: This is a sample-efficient method for LLM-as-a-judge, grounded in human judgments and paving the way for personalized evaluators and alignment!
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
10/n Code: We have released two repositories for HypoEval: one for replicating results and building on the method, and one with off-the-shelf 0-shot evaluators for summaries and stories 🚀.
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
9/n Why HypoEval matters: We push forward LLM-as-a-judge research by showing you can get sample efficiency, interpretable automated evaluation, and strong human alignment … without massive fine-tuning.
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
8/n 🔬 Ablation insights: Dropping hypothesis generation → performance drops ~7%. Combining all hypotheses into one criterion → performance drops ~8%. (Better to let LLMs rate one sub-dimension at a time!)
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
7/n 💪 What's robust? ✅ Works across out-of-distribution (OOD) tasks. ✅ Generated hypotheses can be transferred to different LLMs (e.g., GPT-4o-mini ↔ LLAMA-3.3-70B). ✅ Reduces sensitivity to prompt variations compared to direct scoring.
[image attached]
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
6/n 🏆 Where did we test it? Across summarization (SummEval, NewsRoom) and story generation (HANNA, WritingPrompt). We show state-of-the-art correlations with human judgments, for both rankings (Spearman correlation) and scores (Pearson correlation)! 📈
1
0
2
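The tweet above measures human alignment with Spearman (ranking) and Pearson (score) correlations. As a minimal sketch of how those two numbers are computed, assuming scipy is available, the snippet below correlates made-up placeholder scores; they are not results from the paper.

```python
# Minimal sketch: correlating an automated evaluator's scores with human judgments.
# The score lists are illustrative placeholders, not data from HypoEval.
from scipy.stats import pearsonr, spearmanr

human_scores = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]      # human ratings of six outputs
evaluator_scores = [3.8, 2.0, 3.2, 4.9, 1.0, 4.2]  # scores from an LLM-based evaluator

spearman_rho, _ = spearmanr(human_scores, evaluator_scores)  # agreement on rankings
pearson_r, _ = pearsonr(human_scores, evaluator_scores)      # agreement on raw scores

print(f"Spearman rho = {spearman_rho:.3f}, Pearson r = {pearson_r:.3f}")
```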
@itea1001
Mingxuan (Aldous) Li
2 months
5/n Why is this better? By combining small-scale human data + literature + non-binary checklists, HypoEval: 🔹 outperforms G-Eval by ~12%, 🔹 beats fine-tuned models that use 3x more human labels, and 🔹 adds interpretable evaluation.
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
4/n These hypotheses break down a complex evaluation rubric (e.g., "Is this summary comprehensive?") into sub-dimensions an LLM can score clearly. ✅✅✅
[image attached]
1
0
3
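The decomposition described in the tweet above, a broad rubric broken into sub-dimensions that an LLM rates one at a time, could look roughly like the sketch below. This is a hypothetical illustration, not the authors' implementation: the example hypotheses, the prompt wording, the injected `call_llm` client, and the simple averaging are all assumptions.

```python
# Hypothetical sketch of decomposed-rubric scoring: rate each sub-dimension
# ("hypothesis") separately with an LLM, then combine the sub-scores.
# The hypotheses and prompt below are illustrative, not taken from the paper.
from statistics import mean
from typing import Callable

EXAMPLE_HYPOTHESES = [
    "The summary covers every main event of the source article.",
    "The summary does not omit named entities central to the story.",
    "The summary preserves causal relationships stated in the source.",
]

def score_summary(source: str, summary: str, call_llm: Callable[[str], str]) -> float:
    """Score one summary by averaging per-hypothesis ratings from an LLM client."""
    sub_scores = []
    for hypothesis in EXAMPLE_HYPOTHESES:
        prompt = (
            f"Source article:\n{source}\n\nSummary:\n{summary}\n\n"
            "On a scale of 1-5, how well does the summary satisfy this criterion?\n"
            f"Criterion: {hypothesis}\nAnswer with a single number."
        )
        sub_scores.append(float(call_llm(prompt).strip()))  # one rating per sub-dimension
    return mean(sub_scores)  # simple average; the paper may combine sub-scores differently
```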
@itea1001
Mingxuan (Aldous) Li
2 months
3/n 🌟 Our solution: HypoEval. Building upon SOTA hypothesis generation methods, we generate hypotheses, i.e., decomposed rubrics (similar to checklists, but more systematic and explainable), from existing literature and just 30 human annotations (scores) of texts.
[image attached]
1
0
2
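For the hypothesis-generation step in the tweet above (decomposed rubrics built from the literature plus roughly 30 human-scored texts), here is a rough sketch under the same assumptions as the scoring snippet earlier: the prompt wording, the injected `call_llm` client, and the output parsing are illustrative guesses, not the released HypoEval code.

```python
# Hypothetical sketch of generating rubric "hypotheses" from a small set of
# human-scored examples plus notes from the literature. Prompt wording and
# parsing are illustrative; they do not mirror the HypoEval repository.
from typing import Callable

def generate_hypotheses(examples: list[tuple[str, float]],
                        literature_notes: str,
                        call_llm: Callable[[str], str]) -> list[str]:
    """examples: (text, human_score) pairs; returns proposed rubric sub-dimensions."""
    shown = "\n\n".join(f"Text: {text}\nHuman score: {score}" for text, score in examples)
    prompt = (
        "You are designing an evaluation rubric for text quality.\n"
        f"Criteria discussed in the literature:\n{literature_notes}\n\n"
        f"Human-scored examples:\n{shown}\n\n"
        "Propose a numbered list of specific, checkable sub-dimensions (hypotheses) "
        "that explain why some texts scored higher than others."
    )
    reply = call_llm(prompt)
    # Keep lines that look like numbered items, e.g. "1. The text ...".
    return [line.split(".", 1)[1].strip()
            for line in reply.splitlines()
            if line[:1].isdigit() and "." in line]
```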
@itea1001
Mingxuan (Aldous) Li
2 months
2/n What's the problem? Most LLM-as-a-judge studies either ❌ achieve lower alignment with humans, ⚙️ require extensive fine-tuning (expensive data and compute), or ❓ lack interpretability.
2
0
3
@itea1001
Mingxuan (Aldous) Li
2 months
1/n 🚀🚀🚀 Thrilled to share our latest work 🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊 There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability; let's dive in! 🌊
1
3
9
@itea1001
Mingxuan (Aldous) Li
2 months
RT @mouradheddaya: 🧑‍⚖️ How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate? Excited to be in Albuquerque…
0
9
0
@itea1001
Mingxuan (Aldous) Li
3 months
RT @HaokunLiu5280: 🚀🚀🚀 Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation me…
0
9
0
@itea1001
Mingxuan (Aldous) Li
3 months
RT @divingwithorcas: 1/n You may know that large language models (LLMs) can be biased in their decision-making, but ever wondered how thos…
0
8
0
@itea1001
Mingxuan (Aldous) Li
8 months
RT @HaokunLiu5280: 1/ 🚀 New Paper Alert! Excited to share: Literature Meets Data: A Synergistic Approach to Hypothesis Generation 📚📊! We pr…
0
6
0