Alex Gill
@alex_gill_nlp
Joined June 2025
I'm recruiting PhD students at MPI!! Topics include: 1) LLM factuality, reliable info synthesis and reasoning, personalization + applications in real-world settings incl. education, science; 2) data-centric interpretability; 3) creativity in AI, esp. scientific applications. 1/2
I'll be in Suzhou at #EMNLP this week presenting "What has been Lost with Synthetic Evaluation?", done with @anmarasovic & @lasha_nlp!
Findings Session 1 - Hall C
Wed, November 5, 13:00-14:00 https://t.co/35AshcUfT2
More results and analysis can be found in the paper. We welcome any discussion. Thanks for reading!! (Full link: https://t.co/35AshcUfT2)
We hope that our work will inspire future research into:
- Can further prompt review improve the difficulty of synthetic data?
- What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?
Key takeaways:
- While LLM-generated evals may be valid, as a whole they lose crucial aspects of complexity.
- LLMs are promising where complexity is less critical, but human annotators are vital for benchmarks assessing real-world generalization & nuanced scenarios.
But are these instances similarly difficult? We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models. We find that performance is consistently higher on generated versions of the datasets.
We perform a human study and even find that LLM-generated data is preferred! We ask NLP researchers to act as dataset creators and gather preferences between synthetic and human-authored data.
We examine both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP. We find that validity is not an issue: we are able to get LLMs to generate instances that are highly valid.
We are increasingly seeing LLMs used to create challenging benchmarks that are then used to evaluate LLMs. Is this a valid approach to evaluation construction? Do we lose anything in this process?
What has been Lost with Synthetic Evaluation? I'm happy to announce that the preprint release of my first project is online! Developed with the amazing support of @lasha_nlp and @anmarasovic (full link below).