Pratyush Maini

@pratyushmaini

Followers: 3K · Following: 3K · Media: 120 · Statuses: 735

Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi

Joined November 2019
@pratyushmaini
Pratyush Maini
3 months
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
23
124
712
@_rishabhranjan_
rishabh ranjan
6 days
Transformers are great for sequences, but most business-critical predictions (e.g. product sales, customer churn, ad CTR, in-hospital mortality) rely on highly-structured relational data where signal is scattered across rows, columns, linked tables and time. Excited to finally
4
38
130
@vishaal_urao
Vishaal Udandarao
5 days
🚀New Paper https://t.co/KB2hZljDHu We conduct a systematic data-centric study for speech-language pretraining, to improve end-to-end spoken-QA! 🎙️🤖 Using our data-centric insights, we pretrain a 3.8B SpeechLM (called SpeLangy) outperforming 3x larger models! 🧵👇
3
36
118
@goyalsachin007
Sachin Goyal
6 days
Repeat after me: very few researchers bring industrial impact of this scale.
@pratyushmaini
Pratyush Maini
6 days
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago.
1
1
123
@pratyushmaini
Pratyush Maini
6 days
9/I am extraordinarily fortunate. Very few papers achieve this level of industry impact. To everyone facing rejections: believe in your work. The right people will find it. Finally, thanks to Apple for a wonderful summer internship: Skyler, David, Richard, Yizhe, Navdeep
arxiv.org
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an...
0
0
36
@pratyushmaini
Pratyush Maini
6 days
8/And of course, I should mention WRAP has been core to the thesis of @datologyai, and has shown great success in the recent release of open models by @arcee_ai. We have shared all our learnings from scaling this to trillions of tokens, a challenge in itself. https://t.co/cw5ysJbVUe
@pratyushmaini
Pratyush Maini
3 months
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
1
0
18
@pratyushmaini
Pratyush Maini
6 days
7/The Kimi K2 (frontier open model) uses extensive rephrasing in its training data (they did a really cool innovation on top of WRAP to enable long-context synthetic data!): https://t.co/fwultHkPh6
@Kimi_Moonshot
Kimi.ai
4 months
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
🐤 Multimodal & thought-mode not supported for now
With Kimi K2, advanced agentic intelligence
1
0
18
@pratyushmaini
Pratyush Maini
6 days
6/Released today by @percyliang and others: Marin 32B, the best open-source base model, is trained on large volumes of rephrased data. https://t.co/eWwq5zvEFM
@percyliang
Percy Liang
6 days
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
1
1
19
@pratyushmaini
Pratyush Maini
6 days
5/The Phi-4 family of models also moved from generator-driven synthetic data generation (as in the Phi-1.5 family) to web-rephrased synthetic data to increase diversity.
1
0
18
@pratyushmaini
Pratyush Maini
6 days
4/Grok-4 was supposedly trained by "rewriting the entire corpus of human knowledge." https://t.co/Ifekn65Q3D
@elonmusk
Elon Musk
5 months
We will use Grok 3.5 (maybe we should call it 4), which has advanced reasoning, to rewrite the entire corpus of human knowledge, adding missing information and deleting errors. Then retrain on that. Far too much garbage in any foundation model trained on uncorrected data.
1
0
17
@pratyushmaini
Pratyush Maini
6 days
3/Most notable is Nemotron-CC (and its sequel), one of the biggest openly available synthetic datasets that took WRAP and scaled the recipe tremendously. This powers multiple open-source projects today. https://t.co/Dpgv1GOLg1
1
0
23
@pratyushmaini
Pratyush Maini
6 days
2/Rejection from a workshop (SynthData4ML), which usually has high acceptance rates, was certainly not the best feeling. I learned that conferences often miss transformative work. What matters is believing in your research. Sharing some industry adoptions 🧵 https://t.co/hNjrrNd4xj
@pratyushmaini
Pratyush Maini
2 years
1/7 Super excited about my Apple internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. TLDR: You can train 3x faster and with up to 10x less data using just synthetic rephrases of the web! 📝 https://t.co/1zoYmRIFhl
1
1
30
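[Editor's note: for readers unfamiliar with the recipe described in the quoted tweet, below is a minimal, purely illustrative sketch of a WRAP-style "rephrase the web" step. The prompt wording and the `llm_generate` callable are hypothetical placeholders, not the paper's actual pipeline; the key idea from the tweets is that synthetic rephrases are mixed with the raw web text during pretraining.]

```python
# Minimal sketch of a WRAP-style rephrasing step (illustrative only).
# `llm_generate` stands in for any instruction-tuned LLM call; it is a
# hypothetical placeholder, not the pipeline used in the paper.
from typing import Callable, Iterable

REPHRASE_PROMPT = (
    "Rewrite the following web text in clear, high-quality prose, "
    "preserving all factual content:\n\n{document}"
)

def rephrase_corpus(
    documents: Iterable[str],
    llm_generate: Callable[[str], str],
    mix_original: bool = True,
) -> list[str]:
    """Produce synthetic rephrases and (optionally) mix them with the originals,
    mirroring the idea of training on both real and rephrased web text."""
    out: list[str] = []
    for doc in documents:
        rephrased = llm_generate(REPHRASE_PROMPT.format(document=doc))
        out.append(rephrased)
        if mix_original:
            out.append(doc)  # keep the raw web text alongside the rephrase
    return out
```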
@pratyushmaini
Pratyush Maini
6 days
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago.
7
12
235
@goyalsachin007
Sachin Goyal
6 days
📢 Multi-token prediction has long struggled with defining the right “auxiliary target,” leading to tons of heuristics. We show a core limitation of these and propose a simple & sweet idea: future summary prediction. Introducing what I call 🚀TL;DR token pretraining🚀
@divyat09
Divyat Mahajan
6 days
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned
3
43
245
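[Editor's note: as a rough illustration of the future-summary idea in the two tweets above (not the authors' implementation), an auxiliary head might predict an embedding summarizing the next few tokens alongside standard next-token prediction. All names, the horizon hyperparameter, and the mean-pooled summary target below are assumptions; the paper uses a learned summary.]

```python
# Illustrative sketch of a future-summary auxiliary objective (not the FSP paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureSummaryHead(nn.Module):
    def __init__(self, d_model: int, horizon: int = 8):
        super().__init__()
        self.horizon = horizon              # how many future tokens the summary covers
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
        """hidden: (B, T, D) LM hidden states; target_emb: (B, T, D) token embeddings
        from which future summaries are built. Assumes T > horizon."""
        B, T, D = hidden.shape
        preds, summaries = [], []
        for t in range(T - self.horizon):
            preds.append(self.proj(hidden[:, t]))
            # summary target = mean embedding of the next `horizon` positions
            summaries.append(target_emb[:, t + 1 : t + 1 + self.horizon].mean(dim=1))
        pred = torch.stack(preds, dim=1)
        target = torch.stack(summaries, dim=1).detach()
        return F.mse_loss(pred, target)

# The total loss would combine the usual next-token cross-entropy with this term:
# loss = ce_loss + aux_weight * future_summary_loss
```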
@fjzzq2002
Ziqian Zhong
11 days
New research with @AdtRaghunathan, Nicholas Carlini and Anthropic! We built ImpossibleBench to measure reward hacking in LLM coding agents 🤖, by making benchmark tasks impossible and seeing whether models game tests or follow specs. (1/9)
11
61
442
@johntzwei
Johnny Tian-Zheng Wei
11 days
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
2
38
112
@pratyushmaini
Pratyush Maini
14 days
Dataloader bottlenecks were by far the biggest pain point for me when doing large-scale training at CMU. We are developing a new open standard for dataloaders with native support for all your wishlist items like curriculum, statefulness, profiling... what else? TELL US!
@josh_wills
JosH100
14 days
1/ Really looking forward to #PytorchConf this week in SF-- I've spent the last couple of months at @datologyai immersed in the DataLoader ecosystem (especially for our VLM stack) and I have a few topics I would love to discuss with folks (DMs are open, say hi if you see me, etc.
3
3
81
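[Editor's note: to make the "statefulness" wishlist item concrete, here is a purely hypothetical sketch of a checkpointable dataloader interface. It is not the open standard referenced in the tweet, just an illustration of resuming iteration exactly where training stopped.]

```python
# Hypothetical sketch of a stateful, resumable dataloader (illustrative only).
from typing import Any, Iterator

class StatefulLoader:
    def __init__(self, shards: list[str], batch_size: int):
        self.shards = shards
        self.batch_size = batch_size
        self.shard_idx = 0      # which shard we are currently reading
        self.sample_idx = 0     # offset within the current shard

    def state_dict(self) -> dict[str, Any]:
        """Capture enough state to resume exactly where training stopped."""
        return {"shard_idx": self.shard_idx, "sample_idx": self.sample_idx}

    def load_state_dict(self, state: dict[str, Any]) -> None:
        self.shard_idx = state["shard_idx"]
        self.sample_idx = state["sample_idx"]

    def __iter__(self) -> Iterator[list[str]]:
        while self.shard_idx < len(self.shards):
            samples = self._read_shard(self.shards[self.shard_idx])
            while self.sample_idx < len(samples):
                batch = samples[self.sample_idx : self.sample_idx + self.batch_size]
                self.sample_idx += self.batch_size
                yield batch
            self.shard_idx += 1
            self.sample_idx = 0

    def _read_shard(self, path: str) -> list[str]:
        # Placeholder: real shards would be tokenized/packed; here, one text per line.
        with open(path) as f:
            return f.read().splitlines()
```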
@HPouransari
Hadi Pouransari
29 days
Introducing Pretraining with Hierarchical Memories: Separating Knowledge & Reasoning for On-Device LLM Deployment 💡We propose dividing LLM parameters into 1) anchor (always used, capturing commonsense) and 2) memory bank (selected per query, capturing world knowledge). [1/X]🧵
11
116
641
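[Editor's note: a very rough sketch of the anchor + memory-bank split described in the tweet above. The module names and the key-based retrieval rule are assumptions for illustration, not the paper's implementation.]

```python
# Illustrative sketch of an "anchor + memory bank" parameter split (not the paper's code).
import torch
import torch.nn as nn

class HierarchicalMemoryLM(nn.Module):
    def __init__(self, anchor: nn.Module, memories: nn.ModuleList, d_model: int):
        super().__init__()
        self.anchor = anchor                # always-on parameters (commonsense/reasoning)
        self.memories = memories            # bank of small modules holding world knowledge
        self.keys = nn.Parameter(torch.randn(len(memories), d_model))  # one key per memory

    def select_memories(self, query_emb: torch.Tensor, k: int = 1) -> list[int]:
        """Pick the k memory modules whose keys best match a (d_model,) query embedding."""
        scores = query_emb @ self.keys.T
        return scores.topk(k).indices.tolist()

    def forward(self, x: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        h = self.anchor(x)                          # base computation, always loaded
        for i in self.select_memories(query_emb):
            h = h + self.memories[i](h)             # inject knowledge from selected memories
        return h
```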
@hbXNov
Hritik Bansal
19 days
New paper 📢 Most powerful vision-language (VL) reasoning datasets remain proprietary 🔒, hindering efforts to study their principles and develop similarly effective datasets in the open 🔓. Thus, we introduce HoneyBee, a 2.5M-example dataset created through careful data
5
38
197
@sainingxie
Saining Xie
21 days
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
56
332
2K
@yewonbyun_
Emily Byun
26 days
💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data
2
37
138