Pratyush Maini
@pratyushmaini
Followers: 3K · Following: 3K · Media: 120 · Statuses: 735
Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi
Joined November 2019
            
            
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
          
                
23 · 124 · 712
              
             Transformers are great for sequences, but most business-critical predictions (e.g. product sales, customer churn, ad CTR, in-hospital mortality) rely on highly-structured relational data where signal is scattered across rows, columns, linked tables and time. Excited to finally 
          
                
4 · 38 · 130
              
             🚀New Paper  https://t.co/KB2hZljDHu  We conduct a systematic data-centric study for speech-language pretraining, to improve end-to-end spoken-QA! 🎙️🤖 Using our data-centric insights, we pretrain a 3.8B SpeechLM (called SpeLangy) outperforming 3x larger models! 🧵👇 
          
                
3 · 36 · 118
              
Repeat after me: Very few researchers bring industrial impact at this scale.
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago
            
                
1 · 1 · 123
              
             9/I am extraordinarily fortunate. Very few papers achieve this level of industry impact. To everyone facing rejections: believe in your work. The right people will find it. Finally, thanks to Apple for a wonderful summer internship: Skyler, David, Richard, Yizhe, Navdeep 
          
            
            arxiv.org
              Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an...
            
                
0 · 0 · 36
              
8/And of course, I should mention WRAP has been core to the thesis of @datologyai, and has shown great success in the recent release of open models by @arcee_ai. We have shared all our learnings from scaling this to trillions of tokens, a challenge in itself.  https://t.co/cw5ysJbVUe
          
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
            
                
1 · 0 · 18
              
7/Kimi K2 (a frontier open model) uses extensive rephrasing in its training data (they built a really cool innovation on top of WRAP to enable long-context synthetic data!):  https://t.co/fwultHkPh6
          
           🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence 
            
                
1 · 0 · 18
              
6/Released today by @percyliang and others: Marin 32B, the best open-source base model, is trained on large volumes of rephrased data.  https://t.co/eWwq5zvEFM
          
           ⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks: 
            
                
1 · 1 · 19
              
5/The Phi-4 family of models also moved from generator-driven synthetic data generation (as in the Phi-1.5 family) to web-rephrased synthetic data to increase diversity.
          
                
1 · 0 · 18
              
             4/Grok-4 was supposedly trained by "rewriting the entire corpus of human knowledge."  https://t.co/Ifekn65Q3D 
          
           We will use Grok 3.5 (maybe we should call it 4), which has advanced reasoning, to rewrite the entire corpus of human knowledge, adding missing information and deleting errors. Then retrain on that. Far too much garbage in any foundation model trained on uncorrected data. 
          
                
1 · 0 · 17
              
             3/Most notable is Nemotron-CC (and its sequel), one of the biggest openly available synthetic datasets that took WRAP and scaled the recipe tremendously. This powers multiple open-source projects today.  https://t.co/Dpgv1GOLg1 
          
          
                
1 · 0 · 23
              
2/Rejection at a workshop (SynthData4ML), which usually has high acceptance rates, was certainly not the best feeling. I learned that conferences often miss transformative work. What matters is believing in your research. Sharing some industry adoptions 🧵  https://t.co/hNjrrNd4xj
          
1/7 Super excited about my Apple Internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web! 📝  https://t.co/1zoYmRIFhl
            
            
                
1 · 1 · 30
              
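As a rough illustration of the "synthetic rephrases of the web" recipe above: the sketch below assumes a generic OpenAI-compatible chat endpoint, and the prompt wording, style list, and model name are invented for illustration; it is not the actual WRAP/BeyondWeb pipeline, just the shape of the idea (rewrite noisy web text into cleaner styles and mix it with the raw data).

```python
# Illustrative sketch of WRAP-style web rephrasing (assumptions, not the real recipe):
# take a noisy web document, ask an LM to rewrite it in a cleaner target style,
# then keep both the original and the rephrased version in the pretraining mix.
from openai import OpenAI  # assumes any OpenAI-compatible chat endpoint

client = OpenAI()

# Hypothetical target styles; the actual style taxonomy may differ.
STYLES = {
    "wikipedia": "Rewrite the passage as a clear, factual encyclopedia entry.",
    "qa": "Rewrite the passage as a series of question-answer pairs.",
}

def rephrase(doc: str, style: str = "wikipedia", model: str = "gpt-4o-mini") -> str:
    """Return a synthetic rephrasing of `doc`; content is preserved, only form changes."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STYLES[style]},
            {"role": "user", "content": doc[:4000]},  # truncate long docs for the sketch
        ],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def build_mix(web_docs: list[str]) -> list[str]:
    """Naive 1:1 mix of raw and rephrased documents for pretraining."""
    mix = []
    for doc in web_docs:
        mix.append(doc)                         # keep the raw web text
        mix.append(rephrase(doc, "wikipedia"))  # add a cleaner synthetic counterpart
    return mix
```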
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago
          
                
7 · 12 · 235
              
             📢 Multi-token prediction has long struggled with defining the right “auxiliary target,” leading to tons of heuristics. We show a core limitation of these and propose a simple & sweet idea: future summary prediction. Introducing what I call 🚀TL;DR token pretraining🚀 
           [1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned 
            
                
3 · 43 · 245
              
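A minimal PyTorch sketch of what a future-summary auxiliary objective could look like. The names `hidden`, `token_emb`, and `horizon`, and the use of a mean of the next few token embeddings as the summary target, are assumptions for illustration; the paper's FSP predicts a learned summary, so treat this only as the general shape of the loss.

```python
# Minimal sketch of a future-summary auxiliary loss (illustrative assumptions only):
# alongside next-token prediction, the state at position t is regressed toward a summary
# embedding of the next `horizon` tokens. Here the "summary" is just a mean of future
# token embeddings; assumes T > horizon.
import torch
import torch.nn.functional as F

def fsp_aux_loss(hidden, token_emb, horizon=8):
    """
    hidden:    (B, T, D) transformer hidden states
    token_emb: (B, T, D) input token embeddings (used to build the future-summary target)
    """
    B, T, D = hidden.shape
    # Target at position t: average of the embeddings of tokens t+1 .. t+horizon.
    targets = []
    for t in range(T - horizon):
        targets.append(token_emb[:, t + 1 : t + 1 + horizon].mean(dim=1))
    targets = torch.stack(targets, dim=1)   # (B, T-horizon, D)
    preds = hidden[:, : T - horizon]        # predict the summary from the current state
    # Cosine-style regression toward the (detached) future summary.
    return 1.0 - F.cosine_similarity(preds, targets.detach(), dim=-1).mean()

# total_loss = next_token_ce + lambda_fsp * fsp_aux_loss(hidden, token_emb)
```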
             New research with @AdtRaghunathan, Nicholas Carlini and Anthropic! We built ImpossibleBench to measure reward hacking in LLM coding agents 🤖, by making benchmark tasks impossible and seeing whether models game tests or follow specs. (1/9) 
          
                
11 · 61 · 442
              
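A toy illustration of the "make the task impossible" idea: mutate the expected value in a unit test so that no spec-following solution can pass, then treat any submission that still passes as likely test gaming (e.g. hard-coding the test). The AST-based mutation below is my own simplification, not the benchmark's actual construction.

```python
# Toy version (not the actual ImpossibleBench construction): bump the expected value in
# assert comparisons so the test contradicts the spec. A correct implementation must now
# fail; a "passing" submission has gamed the test.
import ast

def make_test_impossible(test_src: str) -> str:
    """Bump integer constants on the right-hand side of assert comparisons."""
    tree = ast.parse(test_src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare):
            for comp in node.test.comparators:
                if isinstance(comp, ast.Constant) and isinstance(comp.value, int):
                    comp.value += 1  # no spec-following solution can satisfy this now
    return ast.unparse(tree)

original_test = "assert add(2, 2) == 4"
print(make_test_impossible(original_test))  # -> assert add(2, 2) == 5
```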
             Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵 
          
                
2 · 38 · 112
              
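A small sketch of what controlled insertion could look like in practice, under my own assumptions about the setup (not Hubble's actual pipeline): probe texts are injected into the pretraining stream at chosen frequencies and logged, so memorization can later be measured as a function of how often each probe was seen.

```python
# Sketch of controlled text insertion for memorization studies (assumed setup, not Hubble's
# pipeline): each probe text is injected a fixed number of times at random positions in the
# pretraining stream; the (text, frequency) pairs are kept for post-training probing.
import random

def insert_probes(corpus: list[str], probes: dict[str, int], seed: int = 0) -> list[str]:
    """
    corpus: list of pretraining documents
    probes: probe text (e.g. a book passage or a fake biography) -> number of insertions
    """
    rng = random.Random(seed)
    stream = list(corpus)
    for text, count in probes.items():
        for _ in range(count):
            stream.insert(rng.randrange(len(stream) + 1), text)
    return stream

probes = {
    "Jane Doe was born in 1987 in a small coastal town...": 1,     # seen exactly once
    "Stand-in passage emulating copyrighted book text...": 100,    # duplicated heavily
}
# training_stream = insert_probes(web_docs, probes)
```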
Dataloader bottlenecks were by far the biggest pain point for me when doing large-scale training at CMU. We are developing a new open standard for dataloaders with native support for all your wishlist items like curriculum, statefulness, profiling... what else? TELL US!
           1/ Really looking forward to #PytorchConf this week in SF-- I've spent the last couple of months at @datologyai immersed in the DataLoader ecosystem (especially for our VLM stack) and I have a few topics I would love to discuss with folks (DMs are open, say hi if you see me, etc. 
          
                
3 · 3 · 81
              
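A strawman of the kind of interface such a standard might expose, purely my own sketch of the wishlist items mentioned above (statefulness for exact resume, a curriculum hook for ordering shards); none of these names come from the actual proposal.

```python
# Hypothetical "stateful, curriculum-aware" dataloader interface (my own strawman):
# the loader can checkpoint exactly where it is in the stream and can reorder shards
# according to a curriculum score.
from dataclasses import dataclass, field

@dataclass
class StatefulLoader:
    shards: list[str]                                             # e.g. paths to data shards
    curriculum: dict[str, float] = field(default_factory=dict)    # shard -> difficulty score
    _shard_idx: int = 0
    _sample_idx: int = 0

    def __iter__(self):
        order = sorted(self.shards, key=lambda s: self.curriculum.get(s, 0.0))
        for i in range(self._shard_idx, len(order)):
            for j, sample in enumerate(self._read_shard(order[i])):
                if i == self._shard_idx and j < self._sample_idx:
                    continue          # skip samples already consumed before the resume point
                self._shard_idx, self._sample_idx = i, j + 1
                yield sample
            self._sample_idx = 0

    def state_dict(self):
        return {"shard_idx": self._shard_idx, "sample_idx": self._sample_idx}

    def load_state_dict(self, state):
        self._shard_idx, self._sample_idx = state["shard_idx"], state["sample_idx"]

    def _read_shard(self, path):
        # Placeholder reader; a real loader would stream and decode records here.
        yield from ({"shard": path, "row": k} for k in range(1000))
```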
             Introducing Pretraining with Hierarchical Memories: Separating Knowledge & Reasoning for On-Device LLM Deployment 💡We propose dividing LLM parameters into 1) anchor (always used, capturing commonsense) and 2) memory bank (selected per query, capturing world knowledge). [1/X]🧵 
          
                
11 · 116 · 641
              
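A rough PyTorch sketch of my reading of the anchor / memory-bank split described above: an always-on anchor network plus a large memory bank from which only a few rows are fetched per query. The shapes, the retrieval rule, and the mixing are invented for illustration and are not the paper's architecture.

```python
# Sketch of "anchor + per-query memory bank" (my own simplification, not the paper's model):
# the anchor runs on every token, while only top-k rows of a large memory bank are retrieved
# per query and mixed back into the hidden state.
import torch
import torch.nn as nn

class HierarchicalMemoryBlock(nn.Module):
    def __init__(self, d_model=512, n_memories=4096, top_k=4):
        super().__init__()
        self.anchor = nn.Sequential(                     # always-used "commonsense" parameters
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.memory_keys = nn.Parameter(torch.randn(n_memories, d_model))
        self.memory_values = nn.Parameter(torch.randn(n_memories, d_model))
        self.top_k = top_k

    def forward(self, h):                                # h: (B, T, D)
        query = h.mean(dim=1)                            # crude per-sequence query (assumption)
        scores = query @ self.memory_keys.T              # (B, n_memories)
        top = scores.topk(self.top_k, dim=-1)            # fetch only a few memories per query
        retrieved = self.memory_values[top.indices]      # (B, top_k, D)
        weights = torch.softmax(top.values, dim=-1).unsqueeze(-1)
        memory_out = (weights * retrieved).sum(dim=1, keepdim=True)   # (B, 1, D)
        return self.anchor(h) + memory_out               # anchor path + selected world knowledge
```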
             New paper 📢 Most powerful vision-language (VL) reasoning datasets remain proprietary 🔒, hindering efforts to study their principles and develop similarly effective datasets in the open 🔓. Thus, we introduce HoneyBee, a 2.5M-example dataset created through careful data 
          
                
5 · 38 · 197
              
             three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n) 
          
                
56 · 332 · 2K
              
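Loosely, the RAE idea as described in the tweet: freeze a pretrained representation encoder, train only a decoder back to pixels, and let the diffusion model operate in the frozen representation space instead of a VAE latent. The encoder below is a stand-in module and the decoder is a toy; this is a sketch of the concept, not the paper's implementation.

```python
# Loose sketch of a representation autoencoder: frozen pretrained encoder, trainable decoder.
# `encoder` here is a stand-in for a frozen pretrained vision encoder returning (B, latent_dim).
import torch
import torch.nn as nn

class RAE(nn.Module):
    def __init__(self, encoder: nn.Module, latent_dim=768, image_size=256):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # representations are fixed; only decoding is learned
        self.decoder = nn.Sequential(        # toy decoder; a real one would be far larger
            nn.Linear(latent_dim, 1024), nn.GELU(),
            nn.Linear(1024, 3 * image_size * image_size),
        )
        self.image_size = image_size

    def forward(self, images):
        with torch.no_grad():
            z = self.encoder(images)         # (B, latent_dim) frozen representation
        recon = self.decoder(z).view(-1, 3, self.image_size, self.image_size)
        return recon, z

# Only the decoder is trained, e.g. with F.mse_loss(recon, images); the diffusion backbone
# (e.g. a DiT) is then trained directly on the frozen representations z.
```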
             💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data 
          
                
2 · 37 · 138
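One simple way to see the "moments of synthetic data interacting with moments of real data" intuition, using a control-variate-style construction of my own (not the paper's estimator): a large synthetic sample carries most of the precision, while a small real sample corrects the synthetic generator's bias.

```python
# Toy numerical illustration (my own construction, not the paper's method): estimate a mean
# from a cheap, slightly biased synthetic generator plus a small real sample that corrects it.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0
y_real = rng.normal(true_mean, 1.0, size=50)                   # small, expensive real sample
y_syn_paired = y_real + rng.normal(0.3, 0.5, size=50)          # synthetic twin of each real unit (biased)
y_syn_big = rng.normal(true_mean + 0.3, 1.0, size=50_000)      # large synthetic sample, same bias

naive_real = y_real.mean()                                      # unbiased but high variance
naive_syn = y_syn_big.mean()                                    # low variance but biased
corrected = y_syn_big.mean() + (y_real - y_syn_paired).mean()   # bias-corrected, lower variance

print(f"real-only  {naive_real: .3f}")
print(f"syn-only   {naive_syn: .3f}")
print(f"corrected  {corrected: .3f}   (true {true_mean})")
```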