Pratyush Maini
@pratyushmaini
Followers: 3K · Following: 3K · Media: 120 · Statuses: 735
Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi
Joined November 2019
            
            
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
          
                
23 · 124 · 712
              
             Transformers are great for sequences, but most business-critical predictions (e.g. product sales, customer churn, ad CTR, in-hospital mortality) rely on highly-structured relational data where signal is scattered across rows, columns, linked tables and time. Excited to finally 
          
                
4 · 38 · 130
              
             🚀New Paper  https://t.co/KB2hZljDHu  We conduct a systematic data-centric study for speech-language pretraining, to improve end-to-end spoken-QA! 🎙️🤖 Using our data-centric insights, we pretrain a 3.8B SpeechLM (called SpeLangy) outperforming 3x larger models! 🧵👇 
          
                
3 · 36 · 118
              
Repeat after me: Very few researchers bring industrial impact at this scale.
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago
            
                
1 · 1 · 123
              
             9/I am extraordinarily fortunate. Very few papers achieve this level of industry impact. To everyone facing rejections: believe in your work. The right people will find it. Finally, thanks to Apple for a wonderful summer internship: Skyler, David, Richard, Yizhe, Navdeep 
          
            
            arxiv.org
              Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an...
            
                
0 · 0 · 36
              
8/And of course, I should mention WRAP has been core to the thesis of @datologyai, and has shown great success in the recent release of open models by @arcee_ai. We have shared all our learnings from scaling this to trillions of tokens, a challenge in itself.  https://t.co/cw5ysJbVUe
          
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
            
                
1 · 0 · 18
              
7/Kimi K2 (a frontier open model) uses extensive rephrasing in its training data (they built a really cool innovation on top of WRAP to enable long-context synthetic data!):  https://t.co/fwultHkPh6
          
           🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence 
            
                
1 · 0 · 18
              
6/Released today by @percyliang and others: Marin 32B, the best open-source base model, is trained on large volumes of rephrased data.  https://t.co/eWwq5zvEFM
          
           ⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks: 
            
                
1 · 1 · 19
              
5/The Phi-4 family of models also moved from generator-driven synthetic data generation (as in the Phi-1.5 family) to web-rephrased synthetic data to increase diversity.
          
                
1 · 0 · 18
              
             4/Grok-4 was supposedly trained by "rewriting the entire corpus of human knowledge."  https://t.co/Ifekn65Q3D 
          
           We will use Grok 3.5 (maybe we should call it 4), which has advanced reasoning, to rewrite the entire corpus of human knowledge, adding missing information and deleting errors. Then retrain on that. Far too much garbage in any foundation model trained on uncorrected data. 
          
                
1 · 0 · 17
              
             3/Most notable is Nemotron-CC (and its sequel), one of the biggest openly available synthetic datasets that took WRAP and scaled the recipe tremendously. This powers multiple open-source projects today.  https://t.co/Dpgv1GOLg1 
          
          
                
1 · 0 · 23
              
2/Rejection at a workshop (SynthData4ML), which usually has high acceptance rates, was certainly not the best feeling. I learned that conferences often miss transformative work. What matters is believing in your research. Sharing some industry adoptions 🧵  https://t.co/hNjrrNd4xj
          
1/7 Super excited about my Apple Internship work finally coming out: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web! 📝  https://t.co/1zoYmRIFhl
            
            
                
1 · 1 · 30
              
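As a rough illustration of the "synthetic rephrases of the web" recipe above: the sketch below assumes a generic OpenAI-compatible chat endpoint, and the prompt wording, style list, and model name are invented for illustration; it is not the actual WRAP/BeyondWeb pipeline, just the shape of the idea (rewrite noisy web text into cleaner styles and mix it with the raw data).

```python
# Illustrative sketch of WRAP-style web rephrasing (assumptions, not the real recipe):
# take a noisy web document, ask an LM to rewrite it in a cleaner target style,
# then keep both the original and the rephrased version in the pretraining mix.
from openai import OpenAI  # assumes any OpenAI-compatible chat endpoint

client = OpenAI()

# Hypothetical target styles; the actual style taxonomy may differ.
STYLES = {
    "wikipedia": "Rewrite the passage as a clear, factual encyclopedia entry.",
    "qa": "Rewrite the passage as a series of question-answer pairs.",
}

def rephrase(doc: str, style: str = "wikipedia", model: str = "gpt-4o-mini") -> str:
    """Return a synthetic rephrasing of `doc`; content is preserved, only form changes."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STYLES[style]},
            {"role": "user", "content": doc[:4000]},  # truncate long docs for the sketch
        ],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def build_mix(web_docs: list[str]) -> list[str]:
    """Naive 1:1 mix of raw and rephrased documents for pretraining."""
    mix = []
    for doc in web_docs:
        mix.append(doc)                         # keep the raw web text
        mix.append(rephrase(doc, "wikipedia"))  # add a cleaner synthetic counterpart
    return mix
```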
1/It's not often that academic projects get industry-wide adoption. I've been fortunate to develop Rephrasing the Web, a synthetic pretraining pipeline that powers pretty much EVERY frontier model today. But probably no one knows: our paper was REJECTED from a workshop 2 yrs ago
          
                
7 · 12 · 235
              
             📢 Multi-token prediction has long struggled with defining the right “auxiliary target,” leading to tons of heuristics. We show a core limitation of these and propose a simple & sweet idea: future summary prediction. Introducing what I call 🚀TL;DR token pretraining🚀 
           [1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌Predict a learned 
            
                
3 · 43 · 245
              
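A minimal PyTorch sketch of what a future-summary auxiliary objective could look like. The names `hidden`, `token_emb`, and `horizon`, and the use of a mean of the next few token embeddings as the summary target, are assumptions for illustration; the paper's FSP predicts a learned summary, so treat this only as the general shape of the loss.

```python
# Minimal sketch of a future-summary auxiliary loss (illustrative assumptions only):
# alongside next-token prediction, the state at position t is regressed toward a summary
# embedding of the next `horizon` tokens. Here the "summary" is just a mean of future
# token embeddings; assumes T > horizon.
import torch
import torch.nn.functional as F

def fsp_aux_loss(hidden, token_emb, horizon=8):
    """
    hidden:    (B, T, D) transformer hidden states
    token_emb: (B, T, D) input token embeddings (used to build the future-summary target)
    """
    B, T, D = hidden.shape
    # Target at position t: average of the embeddings of tokens t+1 .. t+horizon.
    targets = []
    for t in range(T - horizon):
        targets.append(token_emb[:, t + 1 : t + 1 + horizon].mean(dim=1))
    targets = torch.stack(targets, dim=1)   # (B, T-horizon, D)
    preds = hidden[:, : T - horizon]        # predict the summary from the current state
    # Cosine-style regression toward the (detached) future summary.
    return 1.0 - F.cosine_similarity(preds, targets.detach(), dim=-1).mean()

# total_loss = next_token_ce + lambda_fsp * fsp_aux_loss(hidden, token_emb)
```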
             New research with @AdtRaghunathan, Nicholas Carlini and Anthropic! We built ImpossibleBench to measure reward hacking in LLM coding agents 🤖, by making benchmark tasks impossible and seeing whether models game tests or follow specs. (1/9) 
          
                
11 · 61 · 442
              
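A toy illustration of the "make the task impossible" idea: mutate the expected value in a unit test so that no spec-following solution can pass, then treat any submission that still passes as likely test gaming (e.g. hard-coding the test). The AST-based mutation below is my own simplification, not the benchmark's actual construction.

```python
# Toy version (not the actual ImpossibleBench construction): bump the expected value in
# assert comparisons so the test contradicts the spec. A correct implementation must now
# fail; a "passing" submission has gamed the test.
import ast

def make_test_impossible(test_src: str) -> str:
    """Bump integer constants on the right-hand side of assert comparisons."""
    tree = ast.parse(test_src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare):
            for comp in node.test.comparators:
                if isinstance(comp, ast.Constant) and isinstance(comp.value, int):
                    comp.value += 1  # no spec-following solution can satisfy this now
    return ast.unparse(tree)

original_test = "assert add(2, 2) == 4"
print(make_test_impossible(original_test))  # -> assert add(2, 2) == 5
```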
             Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵 
          
                
2 · 38 · 112
              
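A small sketch of what controlled insertion could look like in practice, under my own assumptions about the setup (not Hubble's actual pipeline): probe texts are injected into the pretraining stream at chosen frequencies and logged, so memorization can later be measured as a function of how often each probe was seen.

```python
# Sketch of controlled text insertion for memorization studies (assumed setup, not Hubble's
# pipeline): each probe text is injected a fixed number of times at random positions in the
# pretraining stream; the (text, frequency) pairs are kept for post-training probing.
import random

def insert_probes(corpus: list[str], probes: dict[str, int], seed: int = 0) -> list[str]:
    """
    corpus: list of pretraining documents
    probes: probe text (e.g. a book passage or a fake biography) -> number of insertions
    """
    rng = random.Random(seed)
    stream = list(corpus)
    for text, count in probes.items():
        for _ in range(count):
            stream.insert(rng.randrange(len(stream) + 1), text)
    return stream

probes = {
    "Jane Doe was born in 1987 in a small coastal town...": 1,     # seen exactly once
    "Stand-in passage emulating copyrighted book text...": 100,    # duplicated heavily
}
# training_stream = insert_probes(web_docs, probes)
```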
Dataloader bottlenecks were by far the biggest pain point for me when doing large-scale training at CMU. We are developing a new open standard for dataloaders with native support for all your wishlist items like curriculum, statefulness, profiling... what else? TELL US!
           1/ Really looking forward to #PytorchConf this week in SF-- I've spent the last couple of months at @datologyai immersed in the DataLoader ecosystem (especially for our VLM stack) and I have a few topics I would love to discuss with folks (DMs are open, say hi if you see me, etc. 
          
                
3 · 3 · 81
              
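A strawman of the kind of interface such a standard might expose, purely my own sketch of the wishlist items mentioned above (statefulness for exact resume, a curriculum hook for ordering shards); none of these names come from the actual proposal.

```python
# Hypothetical "stateful, curriculum-aware" dataloader interface (my own strawman):
# the loader can checkpoint exactly where it is in the stream and can reorder shards
# according to a curriculum score.
from dataclasses import dataclass, field

@dataclass
class StatefulLoader:
    shards: list[str]                                             # e.g. paths to data shards
    curriculum: dict[str, float] = field(default_factory=dict)    # shard -> difficulty score
    _shard_idx: int = 0
    _sample_idx: int = 0

    def __iter__(self):
        order = sorted(self.shards, key=lambda s: self.curriculum.get(s, 0.0))
        for i in range(self._shard_idx, len(order)):
            for j, sample in enumerate(self._read_shard(order[i])):
                if i == self._shard_idx and j < self._sample_idx:
                    continue          # skip samples already consumed before the resume point
                self._shard_idx, self._sample_idx = i, j + 1
                yield sample
            self._sample_idx = 0

    def state_dict(self):
        return {"shard_idx": self._shard_idx, "sample_idx": self._sample_idx}

    def load_state_dict(self, state):
        self._shard_idx, self._sample_idx = state["shard_idx"], state["sample_idx"]

    def _read_shard(self, path):
        # Placeholder reader; a real loader would stream and decode records here.
        yield from ({"shard": path, "row": k} for k in range(1000))
```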
             Introducing Pretraining with Hierarchical Memories: Separating Knowledge & Reasoning for On-Device LLM Deployment 💡We propose dividing LLM parameters into 1) anchor (always used, capturing commonsense) and 2) memory bank (selected per query, capturing world knowledge). [1/X]🧵 
          
                
11 · 116 · 641
              
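A rough PyTorch sketch of my reading of the anchor / memory-bank split described above: an always-on anchor network plus a large memory bank from which only a few rows are fetched per query. The shapes, the retrieval rule, and the mixing are invented for illustration and are not the paper's architecture.

```python
# Sketch of "anchor + per-query memory bank" (my own simplification, not the paper's model):
# the anchor runs on every token, while only top-k rows of a large memory bank are retrieved
# per query and mixed back into the hidden state.
import torch
import torch.nn as nn

class HierarchicalMemoryBlock(nn.Module):
    def __init__(self, d_model=512, n_memories=4096, top_k=4):
        super().__init__()
        self.anchor = nn.Sequential(                     # always-used "commonsense" parameters
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.memory_keys = nn.Parameter(torch.randn(n_memories, d_model))
        self.memory_values = nn.Parameter(torch.randn(n_memories, d_model))
        self.top_k = top_k

    def forward(self, h):                                # h: (B, T, D)
        query = h.mean(dim=1)                            # crude per-sequence query (assumption)
        scores = query @ self.memory_keys.T              # (B, n_memories)
        top = scores.topk(self.top_k, dim=-1)            # fetch only a few memories per query
        retrieved = self.memory_values[top.indices]      # (B, top_k, D)
        weights = torch.softmax(top.values, dim=-1).unsqueeze(-1)
        memory_out = (weights * retrieved).sum(dim=1, keepdim=True)   # (B, 1, D)
        return self.anchor(h) + memory_out               # anchor path + selected world knowledge
```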
             New paper 📢 Most powerful vision-language (VL) reasoning datasets remain proprietary 🔒, hindering efforts to study their principles and develop similarly effective datasets in the open 🔓. Thus, we introduce HoneyBee, a 2.5M-example dataset created through careful data 
          
                
5 · 38 · 197
              
             three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n) 
          
                
56 · 332 · 2K
              
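Loosely, the RAE idea as described in the tweet: freeze a pretrained representation encoder, train only a decoder back to pixels, and let the diffusion model operate in the frozen representation space instead of a VAE latent. The encoder below is a stand-in module and the decoder is a toy; this is a sketch of the concept, not the paper's implementation.

```python
# Loose sketch of a representation autoencoder: frozen pretrained encoder, trainable decoder.
# `encoder` here is a stand-in for a frozen pretrained vision encoder returning (B, latent_dim).
import torch
import torch.nn as nn

class RAE(nn.Module):
    def __init__(self, encoder: nn.Module, latent_dim=768, image_size=256):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # representations are fixed; only decoding is learned
        self.decoder = nn.Sequential(        # toy decoder; a real one would be far larger
            nn.Linear(latent_dim, 1024), nn.GELU(),
            nn.Linear(1024, 3 * image_size * image_size),
        )
        self.image_size = image_size

    def forward(self, images):
        with torch.no_grad():
            z = self.encoder(images)         # (B, latent_dim) frozen representation
        recon = self.decoder(z).view(-1, 3, self.image_size, self.image_size)
        return recon, z

# Only the decoder is trained, e.g. with F.mse_loss(recon, images); the diffusion backbone
# (e.g. a DiT) is then trained directly on the frozen representations z.
```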
             💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data 
          
                
2 · 37 · 138
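One simple way to see the "moments of synthetic data interacting with moments of real data" intuition, using a control-variate-style construction of my own (not the paper's estimator): a large synthetic sample carries most of the precision, while a small real sample corrects the synthetic generator's bias.

```python
# Toy numerical illustration (my own construction, not the paper's method): estimate a mean
# from a cheap, slightly biased synthetic generator plus a small real sample that corrects it.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0
y_real = rng.normal(true_mean, 1.0, size=50)                   # small, expensive real sample
y_syn_paired = y_real + rng.normal(0.3, 0.5, size=50)          # synthetic twin of each real unit (biased)
y_syn_big = rng.normal(true_mean + 0.3, 1.0, size=50_000)      # large synthetic sample, same bias

naive_real = y_real.mean()                                      # unbiased but high variance
naive_syn = y_syn_big.mean()                                    # low variance but biased
corrected = y_syn_big.mean() + (y_real - y_syn_paired).mean()   # bias-corrected, lower variance

print(f"real-only  {naive_real: .3f}")
print(f"syn-only   {naive_syn: .3f}")
print(f"corrected  {corrected: .3f}   (true {true_mean})")
```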