The next 10x in deep learning efficiency gains are going to come from intelligent intervention on training data. But tools for automated data curation at scale didn’t exist—until now. I’m so excited to announce that I’ve co-founded @DatologyAI with @arimorcos and @hurrycane.
There’s massive demand from companies to train their own models. And I’ve seen firsthand the training efficiency and model quality improvements that data curation can unlock. But expertise and tooling for data curation are lacking.
Data curation is a frontier research problem. There’s only a handful of scientists in the world with deep expertise. And let’s be real—most scientists can’t build a deployable product that scales effortlessly.
That’s why we founded @DatologyAI: the algorithms that power our tools are automatic, modality-agnostic, and don’t require labels, and the product scales seamlessly to the largest datasets. These are essential features for realizing the next generation of large deep learning models.
Solving data curation for large-scale model training requires groundbreaking science and engineering. It’s a hard problem with tremendous impact. That’s also what makes it fun. I would have been stupid NOT to co-found @datologyai. And there are a lot of smart people who agree:
And we’d like to grow that team: if you’re a deep learning scientist who’s passionate about data, or an engineer with data expertise, please get in touch! You can learn more about us at
Super cool! I think this is a great idea. We've seen that just deduplicating pretraining data can massively improve data/learning efficiency of LLMs. I can't even begin to think what's possible if you explore more sophisticated approaches for data intervention.
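The deduplication the reply mentions can be sketched in miniature. Below is a generic MinHash-style near-duplicate check, the common building block for deduplicating pretraining corpora at scale. All names and parameters here are illustrative assumptions, not anything from DatologyAI's actual pipeline:

```python
import hashlib

def shingles(text, k=5):
    """Break a document into overlapping k-word shingles (as a set)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """Summarize a shingle set by its minimum hash under several seeded hashes."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bend",  # near-dup
    "large language models benefit greatly from carefully curated data",
]
sigs = [minhash_signature(shingles(d)) for d in docs]
# Near-duplicates share most shingles, so their signatures mostly agree,
# while unrelated documents agree only by chance.
```

In a real pipeline the signatures would feed locality-sensitive hashing buckets so near-duplicates are found without comparing every pair of documents.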