From the LLM papers I've read over the past 6 months, one thing is clear: higher data quality will be key to continued progress.
Lots of companies and researchers keep finding and implementing ways to improve data quality, in areas ranging from finetuning LLMs…
@leavittron @MosaicML @jefrankle
Gathering and filtering data are key, but fundamentally we don't understand how each token affects downstream performance beyond trial and error (add this data, remove that data, see what happens).
A predictable function F(loss, token) -> Accuracy(new task) is still missing.
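The trial-and-error loop above can be sketched with a toy example. Everything here is hypothetical: the subset names, the unigram "model" (`toy_train`), and the downstream "accuracy" (`toy_eval`) just stand in for a real train/ablate/evaluate cycle.

```python
# A sketch of the trial-and-error loop: ablate one data subset at a
# time, retrain, and measure downstream accuracy. All names and data
# are made up for illustration.
from collections import Counter

def toy_train(corpus):
    """'Train' a unigram model: predict the single most frequent token."""
    counts = Counter(tok for seq in corpus for tok in seq)
    return counts.most_common(1)[0][0] if counts else None

def toy_eval(prediction, eval_targets):
    """Downstream 'accuracy': fraction of eval targets matching the prediction."""
    if not eval_targets:
        return 0.0
    return sum(1 for t in eval_targets if t == prediction) / len(eval_targets)

# Hypothetical data subsets and a downstream evaluation set.
subsets = {
    "web":  [["the", "cat", "the"], ["the", "dog", "the"]],
    "code": [["def", "def"], ["def", "return"]],
    "chat": [["hi", "hi"], ["hi", "ok"]],
}
eval_targets = ["the", "the", "def"]

baseline = toy_eval(
    toy_train([seq for sub in subsets.values() for seq in sub]), eval_targets
)
print(f"baseline accuracy: {baseline:.2f}")

# Trial and error: remove one subset, retrain, see what happens.
for name in subsets:
    corpus = [seq for k, sub in subsets.items() if k != name for seq in sub]
    acc = toy_eval(toy_train(corpus), eval_targets)
    print(f"without {name!r}: accuracy {acc:.2f} (delta {acc - baseline:+.2f})")
```

The point of the sketch: the only way to learn a subset's effect is to run the whole retrain-and-evaluate loop for it, which is exactly the missing F(loss, token) -> accuracy mapping.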