The next 10x in deep learning efficiency gains are going to come from intelligent intervention on training data. But tools for automated data curation at scale didn’t exist—until now. I’m so excited to announce that I’ve co-founded @DatologyAI with @arimorcos and @hurrycane.
There’s massive demand from companies to train their own models. And I’ve seen firsthand the training efficiency and model quality improvements that data curation can unlock. But expertise and tooling for data curation are lacking.
Data curation is a frontier research problem. There’s only a handful of scientists in the world with deep expertise. And let’s be real—most scientists can’t build a deployable product that scales effortlessly.
That’s why we founded @DatologyAI: the algorithms that power our tools are automatic, modality-agnostic, and don’t require labels, and the product scales seamlessly to the largest datasets. These are essential features for realizing the next generation of large deep learning models.
Solving data curation for large-scale model training requires groundbreaking science and engineering. It’s a hard problem with tremendous impact. That’s also what makes it fun. I would have been stupid NOT to co-found @datologyai. And there are a lot of smart people who agree:
And we’d like to grow that team: if you’re a deep learning scientist who’s passionate about data, or an engineer with data expertise, please get in touch! You can learn more about us at
Super cool! I think this is a great idea. We've seen that just deduplicating pretraining data can massively improve data/learning efficiency of LLMs. I can't even begin to think what's possible if you explore more sophisticated approaches for data intervention.
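The deduplication the reply mentions can be sketched in miniature. Below is a generic MinHash-style near-duplicate check, the common building block for deduplicating pretraining corpora at scale. All names and parameters here are illustrative assumptions, not anything from DatologyAI's actual pipeline:

```python
import hashlib

def shingles(text, k=5):
    """Break a document into overlapping k-word shingles (as a set)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """Summarize a shingle set by its minimum hash under several seeded hashes."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bend",  # near-dup
    "large language models benefit greatly from carefully curated data",
]
sigs = [minhash_signature(shingles(d)) for d in docs]
# Near-duplicates share most shingles, so their signatures mostly agree,
# while unrelated documents agree only by chance.
```

In a real pipeline the signatures would feed locality-sensitive hashing buckets so near-duplicates are found without comparing every pair of documents.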