Pratyush Maini
@pratyushmaini
Followers: 2K | Following: 2K | Media: 113 | Statuses: 678
Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi
Joined November 2019
1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳.
- 3B LLMs beat 8B models🚀.
- Pareto frontier for performance
RT @amrokamal1997: After months of development, we finally share with the world some hard-earned science behind synthetic data. @datologyai….
RT @RicardoMonti9: it's a pleasure to show up to work every day and learn from synthetic data gurus @pratyushmaini, @VineethDorna and team -….
Finalizing the magic seed for synthetic data generation while the data dawg @RicardoMonti9 showers his blessings in the background.
RT @sarahcat21: For years, researchers have known that synthetic data is valuable; but not all synthetic data is created equally. Generatin….
RT @j_mcgraph: @pratyushmaini and @datologyai make synthetic data seem easy, but it's really just how good they are.
RT @RishabhAdiga01: Thrilled to see BeyondWeb launched 🚀 Phenomenal insights and a huge step forward for scaling high-quality synthetic dat….
RT @sjoshi804: As we hit the limits of real web-scale data, @datologyai's synthetic data shows how we can leverage the models we've already….
RT @VineethDorna: Big day for the @datologyai team! We introduce BeyondWeb, scaling synthetic data for trillion-scale pretraining! .✨ Colle….
RT @leavittron: Very excited to announce BeyondWeb, @datologyAI’s synthetic pretraining data generation paradigm. BeyondWeb is a rephrasing….
RT @arimorcos: Today, we introduce BeyondWeb, our synthetic data generation approach which significantly outperforms all open synthetic dat….
14/ Dive into the full post for recipes, benchmarks, and other cool experiments. Thanks to the @datologyai team, especially @leavittron and @VineethDorna, for their massive contributions that helped shape this up. Arxiv: Blog:
blog.datologyai.com
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. We introduce BeyondWeb, a synthe...
13/ Implications: @datologyai is here to democratize high-quality synthetic data for all. We show a pathway for generating synthetic data cheaply at scale. We helped curate 7T tokens for @arcee_ai's AFM-4.5B, which already shows our real-world wins!
Today, we’re officially releasing the weights for AFM-4.5B and AFM-4.5B-Base on HuggingFace. This is a major milestone for @arcee_ai. AFM is designed to be flexible and high-performing across a wide range of deployment environments.