Pratyush Maini

@pratyushmaini

Followers: 2K
Following: 2K
Media: 113
Statuses: 678

Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi

Joined November 2019
@pratyushmaini
Pratyush Maini
1 day
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳.
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
21
118
659
@pratyushmaini
Pratyush Maini
7 hours
RT @amrokamal1997: After months of development, we finally share with the world some hard-earned science behind synthetic data. @datologyai…
0
3
0
@pratyushmaini
Pratyush Maini
20 hours
RT @RicardoMonti9: its a pleasure to show up to work everyday and learn from synthetic data gurus @pratyushmaini , @VineethDorna and team -…
0
5
0
@pratyushmaini
Pratyush Maini
20 hours
Finalizing the magic seed for synthetic data generation while the data dawg @RicardoMonti9 showers his blessings in the background.
@RicardoMonti9
Ricardo Monti
23 hours
@j_mcgraph @pratyushmaini @datologyai .@pratyushmaini is truly the master of synthetic data
1
1
26
@pratyushmaini
Pratyush Maini
21 hours
Tweet media one
0
3
0
@pratyushmaini
Pratyush Maini
21 hours
RT @sarahcat21: For years, researchers have known that synthetic data is valuable; but not all synthetic data is created equally. Generatin…
0
4
0
@pratyushmaini
Pratyush Maini
23 hours
RT @j_mcgraph: @pratyushmaini and @datologyai make synthetic data seem easy, but it's really just how good they are.
0
2
0
@pratyushmaini
Pratyush Maini
1 day
RT @acrognali: Great work by the @datologyai team, enjoyed reading this.
0
3
0
@pratyushmaini
Pratyush Maini
1 day
RT @RishabhAdiga01: Thrilled to see BeyondWeb launched 🚀 Phenomenal insights and a huge step forward for scaling high-quality synthetic dat…
0
3
0
@pratyushmaini
Pratyush Maini
1 day
RT @sjoshi804: As we hit the limits of real web-scale data, @datologyai's synthetic data shows how we can leverage the models we've already…
0
4
0
@pratyushmaini
Pratyush Maini
1 day
RT @leavittron: two paths for synthetic data
0
8
0
@pratyushmaini
Pratyush Maini
1 day
RT @soldni: OLMo 2 is SOTA web rewriter??
0
40
0
@pratyushmaini
Pratyush Maini
1 day
RT @VineethDorna: Big day for the @datologyai team! We introduce BeyondWeb, scaling synthetic data for trillion-scale pretraining! ✨ Colle…
0
6
0
@pratyushmaini
Pratyush Maini
1 day
RT @leavittron: Very excited to announce BeyondWeb, @datologyAI’s synthetic pretraining data generation paradigm. BeyondWeb is a rephrasing…
0
40
0
@pratyushmaini
Pratyush Maini
1 day
RT @leavittron: The era of "The Era of Pretraining is Over" is over.
0
6
0
@pratyushmaini
Pratyush Maini
1 day
RT @code_star: Its here! Checkout our blog and paper on scaling synthetic data 1/n
0
3
0
@pratyushmaini
Pratyush Maini
1 day
RT @HaoliYin: Launch day! Take a look at the phenomenal insights into what production-grade synthetic data looks like - driven by @pratyush…
0
5
0
@pratyushmaini
Pratyush Maini
1 day
RT @arimorcos: Today, we introduce BeyondWeb, our synthetic data generation approach which significantly outperforms all open synthetic dat…
0
24
0
@pratyushmaini
Pratyush Maini
1 day
15/Needless to say, such a massive undertaking could not have been accomplished without a stellar engineering team that helped us scale our work to trillions of tokens. If you are excited about this, join us
2
0
18
@pratyushmaini
Pratyush Maini
1 day
14/ Dive into the full post for recipes, benchmarks, and other cool experiments. Thanks to the @datologyai team, especially @leavittron and @VineethDorna, for their massive contributions that helped shape this up. Arxiv: Blog:
blog.datologyai.com
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. We introduce BeyondWeb, a synthe...
1
0
20
@pratyushmaini
Pratyush Maini
1 day
13/ Implications: @datologyai is here to democratize high-quality synthetic data for all. We show a pathway for generating synthetic data cheaply at scale. We're part of curating 7T tokens for @arcee_ai 's AFM4.5B, which already shows our real-world wins!
@LucasAtkins7
Lucas Atkins
21 days
Today, we’re officially releasing the weights for AFM-4.5B and AFM-4.5B-Base on HuggingFace. This is a major milestone for @arcee_ai. AFM is designed to be flexible and high-performing across a wide range of deployment environments.
1
0
15