Alex Gill Profile
Alex Gill

@alex_gill_nlp

Followers 25 · Following 12 · Media 5 · Statuses 10

Joined June 2025
@lasha_nlp
Abhilasha Ravichander
20 days
📢 I'm recruiting PhD students at MPI!! Topics include: 1️⃣ LLM factuality, reliable info synthesis and reasoning, personalization + applications in real-world settings incl. education, science 2️⃣ Data-centric interpretability 3️⃣ Creativity in AI, esp. scientific applications 🧵 1/2
9 · 106 · 441
@alex_gill_nlp
Alex Gill
29 days
I'll be in Suzhou 🇨🇳 at #EMNLP this week presenting "What has been Lost with Synthetic Evaluation?" done with @anmarasovic & @lasha_nlp! 🎉 📍 Findings Session 1 - Hall C 📅 Wed, November 5, 13:00 - 14:00 https://t.co/35AshcUfT2
0 · 2 · 16
@alex_gill_nlp
Alex Gill
6 months
We hope that our work will inspire future research into: - Can further prompt review improve the difficulty of synthetic data? - What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?
1 · 0 · 2
@alex_gill_nlp
Alex Gill
6 months
Key takeaways: - While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial aspects of complexity. - LLMs are promising where complexity is less critical, but human annotators are vital for benchmarks assessing real-world generalization & nuanced scenarios.
1 · 0 · 1
@alex_gill_nlp
Alex Gill
6 months
But are these instances similarly difficult? We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models. We find that performance is consistently higher on generated versions of the datasets.
1 · 0 · 1
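(As a rough illustration of the comparison described in the tweet above: a hypothetical Python sketch, not the paper's actual evaluation code. The predict callable, the (question, gold answer) data format, and the exact-match metric are all assumptions for the sake of the example.)

# Hypothetical sketch: compare a model's accuracy on human-written vs.
# LLM-generated versions of the same benchmark.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer) -- assumed format

def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the prediction exactly matches the gold answer."""
    correct = sum(predict(question) == gold for question, gold in examples)
    return correct / len(examples)

def compare_difficulty(predict: Callable[[str], str],
                       human_written: List[Example],
                       llm_generated: List[Example]) -> float:
    """Return the accuracy gap (synthetic minus human-written).
    A positive gap suggests the synthetic instances are easier for this model,
    which is the pattern the thread reports across a suite of models."""
    human_acc = accuracy(predict, human_written)
    synth_acc = accuracy(predict, llm_generated)
    print(f"human-written: {human_acc:.2%}  llm-generated: {synth_acc:.2%}")
    return synth_acc - human_acc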
@alex_gill_nlp
Alex Gill
6 months
We perform a human study and even find that LLM-generated data is preferred! We ask NLP researchers to act as dataset creators and gather preferences between synthetic and human-authored data.
1 · 0 · 1
@alex_gill_nlp
Alex Gill
6 months
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP. We find that validity is not an issue. We are able to get LLMs to generate instances that are highly valid.
1 · 0 · 1
@alex_gill_nlp
Alex Gill
6 months
We are increasingly seeing LLMs being used to create challenging benchmarks that are then used for evaluating LLMs. Is this a valid approach to evaluation construction? Do we lose anything in this process?
1 · 0 · 2
@alex_gill_nlp
Alex Gill
6 months
๐–๐ก๐š๐ญ ๐‡๐š๐ฌ ๐๐ž๐ž๐ง ๐‹๐จ๐ฌ๐ญ ๐–๐ข๐ญ๐ก ๐’๐ฒ๐ง๐ญ๐ก๐ž๐ญ๐ข๐œ ๐„๐ฏ๐š๐ฅ๐ฎ๐š๐ญ๐ข๐จ๐ง? I'm happy to announce that the preprint release of my first project is online! Developed with the amazing support of @lasha_nlp and @anmarasovic (Full link below ๐Ÿ‘‡)
1 · 20 · 76