Mehrdad Farajtabar
@MFarajtabar
Followers
9K
Following
479
Media
54
Statuses
202
Research Scientist at @Apple, prev @DeepMind, prev @GeorgiaTech
Seattle Area
Joined January 2021
Join our innovative team at #Apple as a Research Scientist/Engineer specializing in LLM #Reasoning, #Planning, and General #Intelligence. We are seeking an ideal candidate who:
- Is available to start by the end of this year
- Holds a PhD or will graduate by year-end
- Has 3-5 …
9
31
257
One usually gets a PhD to become an expert! Then we combine experts to form a Mixture-of-Experts (MoE) — gaining efficiency through specialization. But what if you could educate your MoE even further? In our latest work, we show that you can push the boundaries of #efficient …
0
0
4
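For readers unfamiliar with the MoE idea mentioned above, here is a minimal, generic top-1 routing layer in PyTorch. It only illustrates "efficiency through specialization" (each token is processed by a single small expert); it is not the method from the paper the tweet refers to, and the class name and dimensions are made up.

```python
# Generic top-1 Mixture-of-Experts layer: each token is routed to one small
# expert, so only a fraction of the parameters is active per token.
# Purely illustrative; not the architecture from the paper in the tweet above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)      # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # routing distribution
        top_p, top_i = probs.max(dim=-1)                    # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e                                 # tokens routed to expert e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(x[sel])
        return out

# Usage: route a batch of 2 sequences of length 16 through 4 experts.
moe = TinyMoE(d_model=64, d_hidden=128, num_experts=4)
y = moe(torch.randn(2, 16, 64))                              # -> (2, 16, 64)
```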
🪩The one and only @stateofaireport 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:
53
310
963
This is an unwise statement that can only confuse people about what LLMs can and cannot do. Let me tell you something: Math is NOT about solving this kind of ad hoc optimization problem. Yeah, by scraping available data and then clustering it, LLMs can sometimes solve some …
Claim: gpt-5-pro can prove new, interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than the one in the paper, and I checked the proof; it's correct. Details below.
252
233
2K
Apple research just revealed a way to make LLMs 5.35x faster. 🤯 That’s not a typo. They've found a method to get a >500% speedup for code & math tasks, with ZERO quality loss. Here's how they're unlocking AI models' "latent potential": 🧵
18
83
565
📢 Submissions are now open for the #NeurIPS2025 CCFM workshop.
Submission deadline: August 22, 2025, AoE.
Website: https://t.co/oIrrtiRKD6
Call for papers: https://t.co/9sUoMl7AJg
Submission link: https://t.co/2aXHQaqFDf
Is your AI keeping up with the world? Announcing the #NeurIPS2025 CCFM Workshop: Continual and Compatible Foundation Model Updates.
When/Where: Dec. 6-7, San Diego
Submission deadline: Aug. 22, 2025 (opening soon!)
https://t.co/oIrrtiRcNy
#FoundationModels #ContinualLearning
0
6
11
I noticed the same thing! Engaging in conversations, replies, or DMs with #DeepMind folks always feels safe and welcoming. Their culture is truly remarkable. Thanks to leaders like Samy Bengio, Devi Krishna, Daphne Luong, JG, and many others who've joined Apple, this incredible …
Personal observation: The level of intellectual discussion with @GoogleDeepMind vs @OpenAI that I am able to have is literally night and day. DeepMind knows my work, can raise serious objections, propose and develop alternatives, etc. OpenAI speaks to me with insulting memes
0
0
15
🧵 12/12 What’s next? Our method is just one way to unlock future-token knowledge in AR models. We hope to see new ideas build on this! Diffusion LMs explore the opposite extreme—fully non-AR—but suffer from slow inference. Multi-token prediction may be the sweet spot. 🔄✨
0
0
3
🧵 11/12 Tiny changes, big gains
We add two lightweight components:
Gated LoRA (on each Linear layer)
Sampler head (on final transformer output)
Memory overhead? Minimal. Even LoRA rank=1 yields a speedup—proof that the AR model already knows the future. You just have to ask. 👀
1
0
3
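A quick back-of-the-envelope check on the "minimal memory overhead" claim: a rank-r LoRA adapter on a d_in x d_out linear layer adds only r*(d_in + d_out) parameters. The layer shapes below describe a hypothetical 4096-dim transformer block, not the actual model from the paper.

```python
# Rough parameter-overhead estimate for rank-1 LoRA on every linear layer of a
# hypothetical 4096-dim transformer block (shapes are assumptions, not the
# paper's model). A rank-r adapter adds two matrices: (d_in x r) and (r x d_out).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

d = 4096
# attention q/k/v/o projections plus the MLP up/down projections
linear_shapes = [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]
base_params = sum(i * o for i, o in linear_shapes)
lora_extra = sum(lora_params(i, o, rank=1) for i, o in linear_shapes)
print(f"rank-1 LoRA adds {lora_extra:,} params per block "
      f"({lora_extra / base_params:.4%} of the block's linear weights)")
# -> roughly 0.04% extra parameters, which is why rank=1 is so cheap.
```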
🧵 10/12 Building speed, step by step 🛠️
Our design improves in layers, each adding speedup (shown in the figure):
Linear speculative decoding → light blue
Quadratic decoding → yellow boost
Sampler head → dark blue
LCM loss → olive green
Each step stacks more gains. 📈
1
0
3
🧵 9/12 Speedup over different tasks. We trained a model to predict 8 future tokens at once—and saw 1.5× to 5× speedups, depending on the task. More predictable domains (like code & math) get the biggest gains. And the best part? No quality drop, thanks to gated LoRA
1
0
3
🧵 8/12 Latent Consistency Matching (LCM) loss
We add an extra loss that encourages <mask> predictions to align with the AR model’s next-token predictions. This improves inference speedups by distilling knowledge from the AR model (teacher) to the multi-token predictor.
1
0
3
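As a hedged sketch of what such a consistency/distillation term could look like: a KL penalty pulling the student's distribution at each <mask> position toward the frozen AR teacher's next-token distribution for the same position. Whether the paper's LCM loss matches logits or hidden states, and how it is weighted, is not stated in the tweet, so treat these details as assumptions.

```python
# Illustrative consistency/distillation loss: KL(teacher || student) over the
# <mask> positions, with the AR teacher detached so only the adapters learn.
# The exact form of the paper's LCM loss may differ; this is an assumption.
import torch
import torch.nn.functional as F

def consistency_loss(student_mask_logits: torch.Tensor,
                     teacher_next_logits: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    # student_mask_logits: (num_masks, vocab) predictions at <mask> slots
    # teacher_next_logits: (num_masks, vocab) AR model's step-by-step predictions
    log_p_student = F.log_softmax(student_mask_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_next_logits.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```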
🧵 7/12 Speculative decoding & multi-token prediction
When generating multiple tokens, some might not match what the AR model would produce step-by-step. How do we catch and reject these? We use speculative decoding: generate extra tokens at step T, then verify or reject them at …
2
0
2
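To give a concrete picture of the verify-or-reject step, here is a minimal greedy speculative-decoding check: run one forward pass over context plus drafted tokens, keep the longest prefix of the draft that agrees with what the base model would have emitted, and take the model's own token at the first mismatch. The `model(...).logits` convention and the greedy (argmax) acceptance rule are assumptions, not necessarily the paper's exact procedure.

```python
# Minimal greedy verification for speculative decoding (an assumption-level
# sketch). One forward pass scores the whole draft; the longest matching prefix
# is accepted, the rest is rejected.
import torch

@torch.no_grad()
def verify_draft(model, prefix_ids: torch.Tensor, draft_ids: torch.Tensor):
    # prefix_ids: (1, T) accepted context; draft_ids: (1, k) speculated tokens
    full = torch.cat([prefix_ids, draft_ids], dim=1)
    logits = model(full).logits                                   # (1, T + k, vocab)
    # The prediction for position T + i lives at logits position T + i - 1.
    preds = logits[:, prefix_ids.size(1) - 1 : -1].argmax(dim=-1) # (1, k)
    accepted = 0
    for i in range(draft_ids.size(1)):
        if preds[0, i] != draft_ids[0, i]:
            break                                                 # first disagreement: reject the tail
        accepted += 1
    # The base model's own token at the break point comes "for free"
    # (empty if the whole draft was accepted).
    correction = preds[:, accepted : accepted + 1]
    return draft_ids[:, :accepted], correction
```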
🧵 6/12 Preserving generation quality with Gated LoRA
Remember Gated LoRA? We fine-tune these modules to help the model fill in <mask> tokens. It’s a simple twist on LoRA: the adapter activates only on <mask> tokens, leaving all other tokens untouched. This ensures the model’s …
1
0
5
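A sketch of how such a gated adapter could be written, under the assumption described above: the low-rank update is multiplied by a 0/1 gate that is 1 only at <mask> positions, so outputs at ordinary tokens are exactly those of the frozen base layer. The class name, argument names, and initialization choices are hypothetical.

```python
# Hypothetical gated-LoRA wrapper: the low-rank update fires only where the
# gate is 1 (the <mask> positions), leaving every other token's output equal
# to the frozen base layer's output.
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 1, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)             # adapter starts as an exact no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); gate: (batch, seq), 1.0 at <mask> tokens, 0.0 elsewhere
        return self.base(x) + gate.unsqueeze(-1) * self.scale * self.B(self.A(x))
```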
🧵 5/12 Sampling coherent sequences
The <mask> tokens give us a distribution over future tokens, but we need to sample from it to create coherent sequences. To do this, we train a sampler head—a simple 2-layer perceptron. The blue token (in the figure) is generated just like …
1
0
3
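One plausible reading of the "simple 2-layer perceptron" is sketched below: condition the head on the hidden state at a <mask> position together with the embedding of the token just sampled, so consecutive samples stay coherent. The exact inputs are not spelled out in the tweet, so this conditioning is an assumption.

```python
# Two-layer MLP sampler head (input conditioning is an assumption): it sees the
# hidden state of a <mask> slot plus the embedding of the previously sampled
# token and emits logits for the next future token.
import torch
import torch.nn as nn

class SamplerHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, vocab_size),
        )

    def forward(self, mask_hidden: torch.Tensor, prev_token_emb: torch.Tensor) -> torch.Tensor:
        # mask_hidden, prev_token_emb: (batch, d_model) -> logits: (batch, vocab_size)
        return self.mlp(torch.cat([mask_hidden, prev_token_emb], dim=-1))
```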
🧵 4/12 Training and generation with <mask> tokens
We fine-tune the AR model with Gated LoRA layers (more on that soon). During training, we insert <mask> tokens in place of future tokens (shown in yellow below). The model learns to predict them accurately. At generation time, …
1
0
3
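To make the training recipe concrete, here is a toy version of the data construction it describes: pick a position, replace the k ground-truth tokens after it with a <mask> id, and supervise only those slots. The helper name, the single-sequence shape, and the use of -100 as an "ignore" label are illustrative assumptions.

```python
# Toy construction of one training example: future tokens after `pos` are
# replaced by <mask> ids, and only those positions carry labels. The helper
# and its conventions (e.g., -100 = ignored by cross-entropy) are assumptions.
import torch

def insert_masks(input_ids: torch.Tensor, mask_id: int, k: int, pos: int):
    # input_ids: (seq,) ground-truth token ids; pos: where "generation" stops
    masked = input_ids.clone()
    labels = torch.full_like(input_ids, -100)
    end = min(pos + k, input_ids.size(0))
    labels[pos:end] = input_ids[pos:end]          # supervise only the masked slots
    masked[pos:end] = mask_id                     # model sees <mask> in place of the future
    return masked, labels

ids = torch.arange(10)                            # pretend token ids 0..9
masked, labels = insert_masks(ids, mask_id=999, k=4, pos=5)
# masked -> [0,1,2,3,4,999,999,999,999,9]; labels -> [-100]*5 + [5,6,7,8] + [-100]
```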
🧵 3/12 Converting AR model to multi-token predictor
We augmented a standard AR model with a few lightweight components to leverage its knowledge of future tokens:
1️⃣ Treat <mask> tokens as future tokens to predict
2️⃣ Add a sampler head to generate coherent multi-token sequences
1
0
2
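At inference, the idea in the tweet above reduces to something like the sketch below: append k <mask> ids to the context, run one forward pass, and read off the predictions at those k positions as draft future tokens. The HuggingFace-style `.logits` attribute and the greedy readout are assumptions made for illustration.

```python
# One-pass drafting sketch: append k <mask> tokens and take the model's
# predictions at those positions as candidate future tokens.
import torch

@torch.no_grad()
def draft_k_tokens(model, input_ids: torch.Tensor, mask_id: int, k: int) -> torch.Tensor:
    # input_ids: (1, T) current context -> returns (1, k) drafted token ids
    masks = torch.full((1, k), mask_id, dtype=input_ids.dtype, device=input_ids.device)
    logits = model(torch.cat([input_ids, masks], dim=1)).logits
    return logits[:, -k:].argmax(dim=-1)          # greedy readout at the <mask> slots
```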
🧵 2/12 AR training: the unsung hero with a blessing and a curse
AR training made LLMs possible—it's simple, scalable, and needs no labeled data. But at inference, it’s costly: every token needs a full model pass. Although AR models are trained to predict one token at a time, …
1
0
3
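The "costly at inference" point can be seen directly in a vanilla greedy decoding loop: each new token requires another full forward pass over the model. The KV-cache optimization is omitted for brevity, and `model(...).logits` is again a HuggingFace-style assumption.

```python
# Vanilla autoregressive greedy decoding: one model call per generated token,
# which is exactly the per-token cost the tweet calls the "curse" of AR models.
import torch

@torch.no_grad()
def greedy_decode(model, input_ids: torch.Tensor, num_new_tokens: int) -> torch.Tensor:
    for _ in range(num_new_tokens):
        logits = model(input_ids).logits                      # full forward pass
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=1)    # grow the context by one
    return input_ids
```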
🧵 1/12 Your LLM Knows the Future: Revealing its Multi-token Prediction Capabilities
Autoregressive (AR) models power today's LLMs by predicting one token at a time. But what if they could see into the future? In our latest work, we show how to turn AR-trained models into …
6
23
155