@omarsar0
elvis
10 months
From the papers that I've read on LLMs in the past 6 months, one thing is clear: higher data quality will be key to keep pushing progress. Lots of companies and researchers keep innovating and implementing ways to improve data quality in all areas ranging from finetuning LLMs…
12
35
230

Replies

@AITimetoImpact
AI Time to Impact
10 months
@omarsar0 Agreed, *but* people don't want to work on data quality. From a Google Research paper 2 years ago, even in incredibly high stakes use cases, most teams aren't cleaning up their data.
@marshallk
Marshall Kirkpatrick
3 years
“Everyone wants to do the model work, not the data work” new Google research finds that under-appreciation of data quality, including in high-stakes AI, results in 92% of AI projects experiencing data cascades: compounding, negative, downstream events
5
139
471
0
4
11
@aiguy_arjun
Arjun
10 months
@omarsar0 Yes! As shown by the Open-Platypus dataset and many others in recent times.
0
0
1
@SeguraAndres7
Andres Segura-Tinoco
10 months
@omarsar0 No doubt about it. It is a premise from more than 10 years ago, and I believe it will remain valid for a long time.
0
0
1
@buzz1light1year
Space Ranger
10 months
@omarsar0 garbage in, garbage out
0
0
2
@mhz758
M.H
10 months
@omarsar0 How can I learn about LLMs? Please recommend courses or tutorials.
0
0
0
@whiletruelearn
Krishna Sangeeth
10 months
@omarsar0 Yeah, beyond a doubt. ‘Textbooks Are All You Need’ proved that high-quality data can make a big impact.
0
1
2
@junzhao333
Jun
10 months
@omarsar0 Hope you share how to gather high-quality data.
0
0
0
@taehallm
Taeha 🏓
10 months
@omarsar0 Thank you for the insight!
0
0
0
@JulezArdilla
Julez
10 months
@omarsar0 The same applies to the human educational system.
0
0
0
@peregil
peregil
10 months
@omarsar0 100% agree. I've spent a lot of time cleaning up corpora in my own language. I've also seen that data cleaning can make a corpus both better and worse. What would you say, @omarsar0, are the most important criteria for a good/clean corpus?
0
0
0