Mingxuan (Aldous) Li Profile
Mingxuan (Aldous) Li

@itea1001

Followers: 6
Following: 17
Media: 3
Statuses: 18

Student at the University of Chicago

Joined January 2024
@itea1001
Mingxuan (Aldous) Li
2 months
RT @Elenal3ai: 🚨 New paper alert 🚨 Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? W…
0
10
0
@itea1001
Mingxuan (Aldous) Li
2 months
HypoEval evaluators are now incorporated into judges from @QuotientAI; check it out!
0
4
4
@itea1001
Mingxuan (Aldous) Li
2 months
12/n Acknowledgments: Many thanks to my wonderful collaborators @lihanc02 and my advisor @ChenhaoTan! Check out the full paper!
0
0
3
@itea1001
Mingxuan (Aldous) Li
2 months
11/n Closing thoughts: This is a sample-efficient method for LLM-as-a-judge, grounded in human judgments and paving the way for personalized evaluators and alignment!
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
10/n Code: We have released two repositories for HypoEval: one for replicating results and building on the method, and one with off-the-shelf 0-shot evaluators for summaries and stories 🚀.
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
9/n Why HypoEval matters: We push forward LLM-as-a-judge research by showing you can get sample efficiency, interpretable automated evaluation, and strong human alignment … without massive fine-tuning.
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
8/n 🔬 Ablation insights: Dropping hypothesis generation → performance drops ~7%. Combining all hypotheses into one criterion → performance drops ~8%. (Better to let LLMs rate one sub-dimension at a time!)
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
7/n 💪 What's robust? ✅ Works across out-of-distribution (OOD) tasks. ✅ Generated hypotheses can be transferred to different LLMs (e.g., GPT-4o-mini ↔ LLAMA-3.3-70B). ✅ Reduces sensitivity to prompt variations compared to direct scoring.
[image attached]
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
6/n 🏆 Where did we test it? Across summarization (SummEval, NewsRoom) and story generation (HANNA, WritingPrompt). We show state-of-the-art correlations with human judgments, for both rankings (Spearman correlation) and scores (Pearson correlation)! 📈
1
0
2
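The tweet above measures human alignment with Spearman (ranking) and Pearson (score) correlations. As a minimal sketch of how those two numbers are computed, assuming scipy is available, the snippet below correlates made-up placeholder scores; they are not results from the paper.

```python
# Minimal sketch: correlating an automated evaluator's scores with human judgments.
# The score lists are illustrative placeholders, not data from HypoEval.
from scipy.stats import pearsonr, spearmanr

human_scores = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]      # human ratings of six outputs
evaluator_scores = [3.8, 2.0, 3.2, 4.9, 1.0, 4.2]  # scores from an LLM-based evaluator

spearman_rho, _ = spearmanr(human_scores, evaluator_scores)  # agreement on rankings
pearson_r, _ = pearsonr(human_scores, evaluator_scores)      # agreement on raw scores

print(f"Spearman rho = {spearman_rho:.3f}, Pearson r = {pearson_r:.3f}")
```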
@itea1001
Mingxuan (Aldous) Li
2 months
5/n Why is this better? By combining small-scale human data + literature + non-binary checklists, HypoEval: 🔹 outperforms G-Eval by ~12%, 🔹 beats fine-tuned models that use 3x more human labels, and 🔹 adds interpretable evaluation.
1
0
2
@itea1001
Mingxuan (Aldous) Li
2 months
4/n These hypotheses break down a complex evaluation rubric (e.g., "Is this summary comprehensive?") into sub-dimensions an LLM can score clearly. ✅✅✅
[image attached]
1
0
3
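The decomposition described in the tweet above, a broad rubric broken into sub-dimensions that an LLM rates one at a time, could look roughly like the sketch below. This is a hypothetical illustration, not the authors' implementation: the example hypotheses, the prompt wording, the injected `call_llm` client, and the simple averaging are all assumptions.

```python
# Hypothetical sketch of decomposed-rubric scoring: rate each sub-dimension
# ("hypothesis") separately with an LLM, then combine the sub-scores.
# The hypotheses and prompt below are illustrative, not taken from the paper.
from statistics import mean
from typing import Callable

EXAMPLE_HYPOTHESES = [
    "The summary covers every main event of the source article.",
    "The summary does not omit named entities central to the story.",
    "The summary preserves causal relationships stated in the source.",
]

def score_summary(source: str, summary: str, call_llm: Callable[[str], str]) -> float:
    """Score one summary by averaging per-hypothesis ratings from an LLM client."""
    sub_scores = []
    for hypothesis in EXAMPLE_HYPOTHESES:
        prompt = (
            f"Source article:\n{source}\n\nSummary:\n{summary}\n\n"
            "On a scale of 1-5, how well does the summary satisfy this criterion?\n"
            f"Criterion: {hypothesis}\nAnswer with a single number."
        )
        sub_scores.append(float(call_llm(prompt).strip()))  # one rating per sub-dimension
    return mean(sub_scores)  # simple average; the paper may combine sub-scores differently
```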
@itea1001
Mingxuan (Aldous) Li
2 months
3/n 🌟 Our solution: HypoEval. Building upon SOTA hypothesis generation methods, we generate hypotheses, i.e., decomposed rubrics (similar to checklists, but more systematic and explainable), from existing literature and just 30 human annotations (scores) of texts.
[image attached]
1
0
2
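For the hypothesis-generation step in the tweet above (decomposed rubrics built from the literature plus roughly 30 human-scored texts), here is a rough sketch under the same assumptions as the scoring snippet earlier: the prompt wording, the injected `call_llm` client, and the output parsing are illustrative guesses, not the released HypoEval code.

```python
# Hypothetical sketch of generating rubric "hypotheses" from a small set of
# human-scored examples plus notes from the literature. Prompt wording and
# parsing are illustrative; they do not mirror the HypoEval repository.
from typing import Callable

def generate_hypotheses(examples: list[tuple[str, float]],
                        literature_notes: str,
                        call_llm: Callable[[str], str]) -> list[str]:
    """examples: (text, human_score) pairs; returns proposed rubric sub-dimensions."""
    shown = "\n\n".join(f"Text: {text}\nHuman score: {score}" for text, score in examples)
    prompt = (
        "You are designing an evaluation rubric for text quality.\n"
        f"Criteria discussed in the literature:\n{literature_notes}\n\n"
        f"Human-scored examples:\n{shown}\n\n"
        "Propose a numbered list of specific, checkable sub-dimensions (hypotheses) "
        "that explain why some texts scored higher than others."
    )
    reply = call_llm(prompt)
    # Keep lines that look like numbered items, e.g. "1. The text ...".
    return [line.split(".", 1)[1].strip()
            for line in reply.splitlines()
            if line[:1].isdigit() and "." in line]
```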
@itea1001
Mingxuan (Aldous) Li
2 months
2/n What's the problem? Most LLM-as-a-judge studies either ❌ achieve lower alignment with humans, ⚙️ require extensive fine-tuning (expensive data and compute), or ❓ lack interpretability.
2
0
3
@itea1001
Mingxuan (Aldous) Li
2 months
1/n 🚀🚀🚀 Thrilled to share our latest work 🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊 There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability; let's dive in! 🌊
1
3
9
@itea1001
Mingxuan (Aldous) Li
2 months
RT @mouradheddaya: 🧑‍⚖️ How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate? Excited to be in Albuquerque…
0
9
0
@itea1001
Mingxuan (Aldous) Li
3 months
RT @HaokunLiu5280: 🚀🚀🚀 Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation me…
0
9
0
@itea1001
Mingxuan (Aldous) Li
3 months
RT @divingwithorcas: 1/n You may know that large language models (LLMs) can be biased in their decision-making, but ever wondered how thos…
0
8
0
@itea1001
Mingxuan (Aldous) Li
8 months
RT @HaokunLiu5280: 1/ 🚀 New Paper Alert! Excited to share: Literature Meets Data: A Synergistic Approach to Hypothesis Generation 📚📊! We pr…
0
6
0