@alex_gill_nlp
Alex Gill
6 months
𝐖𝐡𝐚𝐭 𝐇𝐚𝐬 𝐁𝐞𝐞𝐧 𝐋𝐨𝐬𝐭 𝐖𝐢𝐭𝐡 𝐒𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧? I'm happy to announce that the preprint release of my first project is online! Developed with the amazing support of @lasha_nlp and @anmarasovic (Full link below 👇)

Replies

@alex_gill_nlp
Alex Gill
6 months
We are increasingly seeing LLMs used to create challenging benchmarks that are then used to evaluate LLMs. Is this a valid approach to evaluation construction? Do we lose anything in the process?
@alex_gill_nlp
Alex Gill
6 months
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP. We find that validity is not an issue: we are able to get LLMs to generate instances that are highly valid.
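For a rough sense of what this kind of pipeline looks like, here is a minimal sketch of generating a synthetic reading-comprehension instance in the spirit of CondaQA/DROP. The prompt wording and the call_llm helper are hypothetical stand-ins, not the prompts used in the paper.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat/completion API call (hosted or local model)."""
    raise NotImplementedError

def generate_synthetic_instance(passage: str) -> dict:
    # Ask the model to author a new question/answer pair grounded in the passage,
    # mimicking the style of a reading-comprehension dataset such as DROP.
    prompt = (
        "Read the passage below and write one challenging question that requires "
        "reasoning over the passage, plus its answer.\n\n"
        f"Passage: {passage}\n\n"
        'Respond as JSON: {"question": ..., "answer": ...}'
    )
    return json.loads(call_llm(prompt))
```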
@alex_gill_nlp
Alex Gill
6 months
We perform a human study, asking NLP researchers to act as dataset creators and gathering their preferences between synthetic and human-authored data. We even find that the LLM-generated data is preferred!
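As a toy illustration of how pairwise preferences like these might be aggregated (not the paper's actual analysis), one can compute the rate at which the synthetic instance wins and test it against a 50/50 null:

```python
from scipy.stats import binomtest

# Hypothetical pairwise judgments: True = annotator preferred the synthetic instance.
judgments = [True, True, False, True, True, False, True, True, True, False]

wins = sum(judgments)
rate = wins / len(judgments)
result = binomtest(wins, len(judgments), p=0.5)  # null: no preference either way
print(f"synthetic preferred in {rate:.0%} of pairs (p = {result.pvalue:.3f})")
```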
@alex_gill_nlp
Alex Gill
6 months
But are these instances similarly difficult? We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models. We find that performance is consistently higher on generated versions of the datasets.
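A back-of-the-envelope version of this comparison (with made-up scores, purely to illustrate the setup) measures each model on both versions of a dataset and looks at the gap:

```python
# Illustrative accuracies only; see the paper for the real numbers.
scores = {
    "model-a": {"human": 0.62, "synthetic": 0.81},
    "model-b": {"human": 0.70, "synthetic": 0.85},
    "model-c": {"human": 0.55, "synthetic": 0.74},
}

for model, s in scores.items():
    gap = s["synthetic"] - s["human"]
    print(f"{model}: human={s['human']:.2f} synthetic={s['synthetic']:.2f} gap={gap:+.2f}")

# A consistently positive gap across models suggests the synthetic version is easier,
# i.e. some of the original difficulty has been lost.
```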
@alex_gill_nlp
Alex Gill
6 months
Key takeaways:
- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial aspects of complexity.
- LLMs are promising where complexity is less critical, but human annotators remain vital for benchmarks assessing real-world generalization & nuanced scenarios.
@alex_gill_nlp
Alex Gill
6 months
We hope that our work will inspire future research into:
- Can further prompt refinement improve the difficulty of synthetic data?
- What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?
@alex_gill_nlp
Alex Gill
6 months
More results and analysis can be found in the paper. We welcome any discussion, thanks for reading!! (Full Link: https://t.co/35AshcUfT2)