From the papers that I've read on LLMs in the past 6 months, one thing is clear: higher data quality will be key to pushing progress further.
Many companies and researchers keep innovating and implementing ways to improve data quality, in areas ranging from finetuning LLMs…
@omarsar0
Agreed, *but* people don't want to work on data quality. Per a Google Research paper from 2 years ago, even in incredibly high-stakes use cases, most teams aren't cleaning up their data.
“Everyone wants to do the model work, not the data work”: new Google research finds that under-appreciation of data quality, including in high-stakes AI, results in 92% of AI projects experiencing data cascades, i.e. compounding, negative, downstream events.
100% agree. I've spent a lot of time cleaning up corpora in my own language, and I've seen data cleaning make the corpus both better and worse. What would you say is the most important criterion for a good/clean corpus?
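To make the "better and worse" point concrete, here is a minimal sketch of the kind of cleaning pipeline being discussed: whitespace normalization, a length filter, and exact deduplication. The function name, thresholds, and filters are all illustrative assumptions, not anything from the thread, and the inline comments note where each step can also *hurt* the corpus.

```python
import re

def clean_corpus(lines):
    """Illustrative corpus cleaning: normalize whitespace, drop very
    short fragments, and deduplicate exact repeats.
    (Hypothetical thresholds -- real pipelines tune these per language.)"""
    seen = set()
    cleaned = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if len(text) < 10:   # drop tiny fragments (risk: loses real short sentences)
            continue
        if text in seen:     # exact dedup (risk: drops legitimately repeated text)
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

corpus = [
    "The  quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate after normalization
    "ok",                                            # filtered as too short
    "Data cleaning can both help and hurt a corpus.",
]
print(clean_corpus(corpus))
```

Each filter trades recall for precision: the same length cutoff that strips boilerplate can also delete valid short utterances, which is one way cleaning makes a corpus worse.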