Mike Lewis Profile
Mike Lewis

@ml_perception

Followers 8K · Following 806 · Media 11 · Statuses 276

Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, attention sinks, kNN-LM, top-k sampling & Deal Or No Deal.

Seattle
Joined September 2019
@Yen_Ju_Lu
Yen-Ju Lu
24 days
🚀 Introducing the Latent Speech-Text Transformer (LST) — a speech-text model that organizes speech tokens into latent patches for better text→speech transfer, enabling steeper scaling laws and more efficient multimodal training ⚡️ Paper 📄 https://t.co/4nUsbC1YKF
7
15
30
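The patching idea in the tweet above can be pictured with a toy length calculation: speech tokens are folded into latent patches while text tokens pass through one-to-one, so the speech side of an interleaved sequence gets much shorter. The patch size and helper function below are illustrative assumptions, not details from the paper.

```python
# Toy illustration of patching speech tokens into latent patches (patch_size is an
# arbitrary example value, not the paper's setting): text tokens stay 1:1, speech
# tokens are grouped, so the modeled sequence shrinks.

def interleaved_length(num_text_tokens: int, num_speech_tokens: int, patch_size: int = 4) -> int:
    speech_patches = -(-num_speech_tokens // patch_size)  # ceiling division
    return num_text_tokens + speech_patches

# 200 text tokens + 4,000 speech tokens -> 200 + 1,000 = 1,200 positions instead of 4,200.
print(interleaved_length(200, 4000))  # 1200
```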
@ml_perception
Mike Lewis
3 months
Love seeing these incredibly creative new evaluations! Optimizing benchmarks is easy, the real challenge is in generalizing to the tasks that don't exist yet
@arithmoquine
henry
3 months
new post. there's a lot in it. i suggest you check it out
3
0
49
@Guangxuan_Xiao
Guangxuan Xiao
3 months
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: https://t.co/0EAi2KQMMx
39
283
2K
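For readers who don't follow the link: the attention-sink observation behind this write-up is that streaming generation stays stable if the KV cache always keeps the first few "sink" tokens alongside a sliding window of recent tokens. A minimal sketch of that eviction rule, with illustrative sizes rather than the write-up's exact configuration:

```python
# Minimal sketch of the attention-sink eviction rule: always keep the first few
# "sink" tokens plus a sliding window of recent tokens in the KV cache, and drop
# the middle. Sizes here are illustrative, not the write-up's configuration.

def evict_kv_cache(cache, num_sinks=4, window=1024):
    """cache: list of per-token (key, value) entries, oldest first."""
    if len(cache) <= num_sinks + window:
        return cache
    return cache[:num_sinks] + cache[-window:]

# After generating 5,000 tokens, the cache holds 4 sink entries + 1,024 recent entries.
cache = [(f"k{i}", f"v{i}") for i in range(5000)]
print(len(evict_kv_cache(cache)))  # 1028
```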
@sharan0909
Sharan Narang
7 months
Don't miss this - I've worked with Mike (@ml_perception) very closely at Meta and his talks are super informative and fun.
@alan_ritter
Alan Ritter
7 months
Want to learn about Llama's pre-training? Mike Lewis will be giving a Keynote at NAACL 2025 in Albuquerque, NM on May 1. https://t.co/c9VdW0GYKM @naaclmeeting
0
1
26
@nick11roberts
Nicholas Roberts
8 months
📉📉NEW SCALING LAW PHENOMENON 📉📉 We find that knowledge and reasoning exhibit different scaling behaviors! Super excited to finally tell you all about our paper on the compute optimal scaling of skills: https://t.co/SH3YCMyIeG [1/n]
13
170
1K
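The headline finding, that knowledge and reasoning scale differently, can be read against the usual power-law loss form; the expression below is the generic form from the scaling-law literature with skill-specific exponents as placeholders, not the paper's fitted values.

```latex
% Generic compute-scaling form with skill-specific exponents (placeholders, not fitted values).
L_{\text{skill}}(C) \;\approx\; E_{\text{skill}} + \frac{A_{\text{skill}}}{C^{\alpha_{\text{skill}}}},
\qquad \alpha_{\text{knowledge}} \neq \alpha_{\text{reasoning}}
```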
@IreneZhang30
Qizhen (Irene) Zhang
9 months
✨New Preprint✨ We introduce Branch-Train-Stitch (BTS), an efficient & flexible method for stitching together independently pretrained LLM experts (i.e. code, math) into a single, capable generalist model. Key Takeaways: ✅ BTS achieves the best average
1
9
78
@ArtidoroPagnoni
Artidoro Pagnoni
11 months
🚀 Introducing the Byte Latent Transformer (BLT) – An LLM architecture that scales better than Llama 3 using byte-patches instead of tokens 🤯 Paper 📄 https://t.co/5QGrlJdK0y Code 🛠️ https://t.co/jCdDI5BXwe
17
142
728
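The byte-patch idea can be sketched as: embed raw UTF-8 bytes, group them into patches, and let the large "latent" transformer attend over patches rather than individual bytes or tokens. The fixed-size patching below is a simplification for illustration; BLT itself chooses patch boundaries dynamically, and the class, names, and dimensions here are assumptions rather than the paper's implementation.

```python
# Illustrative byte-to-patch embedding: raw bytes are embedded and pooled into
# fixed-size patches. BLT's actual patching is dynamic; this is a simplification.
import torch
import torch.nn as nn

class BytePatchEmbed(nn.Module):
    def __init__(self, d_model: int = 256, patch_size: int = 8):
        super().__init__()
        self.patch_size = patch_size
        self.byte_emb = nn.Embedding(256, d_model)        # one embedding per possible byte value
        self.pool = nn.Linear(patch_size * d_model, d_model)

    def forward(self, text: str) -> torch.Tensor:
        data = text.encode("utf-8")
        pad = (-len(data)) % self.patch_size              # zero-pad up to a full patch
        ids = torch.tensor(list(data) + [0] * pad)
        emb = self.byte_emb(ids)                          # (num_bytes, d_model)
        patches = emb.reshape(-1, self.patch_size * emb.shape[-1])
        return self.pool(patches)                         # (num_patches, d_model)

print(BytePatchEmbed()("Byte Latent Transformer").shape)  # torch.Size([3, 256])
```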
@liang_weixin
Weixin Liang
1 year
How can we reduce pretraining costs for multi-modal models without sacrificing quality? We study this Q in our new work: https://t.co/KQoZ3cunEf At @AIatMeta, we introduce Mixture-of-Transformers (MoT), a sparse architecture with modality-aware sparsity for every non-embedding
5
37
219
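The "modality-aware sparsity" in the tweet above means each token is processed by weights dedicated to its modality while the sequence itself stays interleaved. Below is a minimal sketch of that idea for a single feed-forward layer; which parameter groups MoT actually decouples is spelled out in the paper, and the class name, dimensions, and modalities here are illustrative assumptions.

```python
# Illustrative sketch of modality-aware sparsity for one feed-forward layer: every
# token is processed by the weights dedicated to its modality, while the sequence
# stays interleaved. Class name, dimensions, and modalities are assumptions.
import torch
import torch.nn as nn

class ModalityFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, modalities=("text", "image")):
        super().__init__()
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for m in modalities
        })

    def forward(self, x: torch.Tensor, modality_of_token: list) -> torch.Tensor:
        # x: (seq_len, d_model); modality_of_token[i] names token i's modality.
        out = torch.empty_like(x)
        for m, ffn in self.ffn.items():
            idx = [i for i, mod in enumerate(modality_of_token) if mod == m]
            if idx:
                out[idx] = ffn(x[idx])
        return out

layer = ModalityFFN()
x = torch.randn(6, 512)
print(layer(x, ["text", "text", "image", "image", "text", "image"]).shape)  # (6, 512)
```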
@VictoriaLinML
Victoria X Lin
1 year
1/n Introducing MoMa 🖼, our new sparse early-fusion architecture for mixed-modal language modeling that significantly boosts pre-training efficiency 🚀 (https://t.co/AmemA1SOM1). MoMa employs a mixture-of-expert (MoE) framework with modality-specific expert groups. Given any
8
51
306
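The "modality-specific expert groups" mentioned above suggest routing that is restricted by modality: a token only competes among the experts reserved for its own modality. A small sketch of that group-restricted routing step; the expert layout and top-1 choice are illustrative assumptions, not the paper's configuration.

```python
# Sketch of group-restricted routing: a token only competes among the experts
# reserved for its modality. Expert layout and top-1 routing are illustrative
# assumptions, not the paper's configuration.
import torch

EXPERT_GROUPS = {"text": range(0, 4), "image": range(4, 8)}  # hypothetical layout

def route_within_group(scores: torch.Tensor, modality: str) -> int:
    """scores: (num_experts,) router logits for one token; returns the chosen expert id."""
    allowed = torch.tensor(list(EXPERT_GROUPS[modality]))
    masked = torch.full_like(scores, float("-inf"))
    masked[allowed] = scores[allowed]  # experts outside the token's group can never win
    return int(torch.argmax(masked))

print(route_within_group(torch.randn(8), "image"))  # always an expert id in 4..7
```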
@ml_perception
Mike Lewis
1 year
tldr; you can go a long way in pre-training by (1) curating amazing data, (2) using a lot of FLOPs, and (3) otherwise not screwing up. All three are harder than they sound, so read the paper... That said, I'm amazed by our progress since Llama 3 - expect big things from Llama 4!
@ml_perception
Mike Lewis
1 year
So excited for the open release of Llama 3.1 405B - with MMLU > 87, it's a really strong model and I can't wait to see what you all build with it! https://t.co/9Bg6m3HOFQ Also check out the paper here, with lots of details on how this was made:
4
15
167
@ml_perception
Mike Lewis
1 year
So excited for the open release of Llama 3.1 405B - with MMLU > 87, it's a really strong model and I can't wait to see what you all build with it! https://t.co/9Bg6m3HOFQ Also check out the paper here, with lots of details on how this was made:
3
19
180
@ml_perception
Mike Lewis
1 year
Excited to see the open source release of FAIR's early fusion multimodal LLMs!
@AIatMeta
AI at Meta
1 year
Today is a good day for open science. As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we're announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and
0
5
46
@ruoxijia
Ruoxi Jia
1 year
Thrilled to be in Vienna for our ICLR workshop, Navigating and Addressing Data Problems for Foundation Models. Starting Saturday at 8:50 AM, our program features keynote talks, best paper presentations, a poster session, and a panel discussion. Explore the full schedule here!
2
16
58
@ZexuanZhong
Zexuan Zhong
1 year
Introducing Lory, a fully-differentiable MoE arch for decoder LM pre-training! Lory merges expert FFNs by computing a weighted average in the parameter space, and computes the output through the merged FFNs. But training naively is infeasible, how to make it work? Details in 🧵
4
44
229
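The merging step described in the tweet above is easy to write down: the router produces soft weights over experts, the expert FFN parameters are averaged with those weights, and the token is run through the single merged FFN. The per-token merging below is the naive version the tweet calls infeasible at scale; Lory's actual efficiency tricks are in the paper, and the shapes and names here are assumptions.

```python
# Naive per-token soft merging of expert FFNs: average the expert weights with the
# router's soft gates, then run the token through the single merged FFN.
import torch
import torch.nn.functional as F

def soft_merged_ffn(x, expert_w1, expert_w2, router_w):
    # x: (d_model,), expert_w1: (E, d_ff, d_model), expert_w2: (E, d_model, d_ff),
    # router_w: (E, d_model) router projection.
    gates = F.softmax(router_w @ x, dim=0)            # (E,) soft expert weights
    w1 = torch.einsum("e,eij->ij", gates, expert_w1)  # merged first FFN weight
    w2 = torch.einsum("e,eij->ij", gates, expert_w2)  # merged second FFN weight
    return w2 @ F.gelu(w1 @ x)

E, d_model, d_ff = 4, 64, 256
out = soft_merged_ffn(torch.randn(d_model),
                      torch.randn(E, d_ff, d_model),
                      torch.randn(E, d_model, d_ff),
                      torch.randn(E, d_model))
print(out.shape)  # torch.Size([64])
```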
@arena
lmarena.ai
2 years
Moreover, we observe even stronger performance in the English category, where Llama 3's ranking jumps to ~1st place with GPT-4-Turbo! It consistently performs strongly against top models (see win-rate matrix) by human preference. It's been optimized for dialogue scenario with large
11
39
378
@ml_perception
Mike Lewis
2 years
I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
@ml_perception
Mike Lewis
2 years
Yes, both the 8B and 70B are trained way more than is Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
6
14
173
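One way to read "improve logarithmically forever", purely as an illustration and not a formula from the thread: if a capability score grows roughly linearly in the log of the amount of good training data, it never stops improving, but each constant-sized gain requires a constant multiplicative increase in data.

```latex
% Illustration only (not from the thread): a score that is linear in \log D keeps
% improving for any D, but each fixed gain costs a fixed multiplicative increase in data.
\mathrm{score}(D) \;\approx\; a + b \log D
\quad\Longrightarrow\quad
\mathrm{score}(kD) - \mathrm{score}(D) \;=\; b \log k \quad \text{for every } D > 0 .
```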
@ml_perception
Mike Lewis
2 years
Yes, both the 8B and 70B are trained way more than is Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
@felix_red_panda
Felix
2 years
Llama3 8B is trained on almost 100 times the Chinchilla optimal number of tokens
14
36
491
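Felix's "almost 100 times" figure is straightforward to reproduce with the common rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal run, together with the ~15T-token figure quoted above; the 20-tokens-per-parameter rule is an approximation, not a number from this thread.

```python
# Back-of-the-envelope check of the "almost 100x Chinchilla" claim, using the common
# ~20 tokens-per-parameter rule of thumb and the ~15T-token figure quoted above.
params = 8e9                       # Llama 3 8B
chinchilla_tokens = 20 * params    # ~160B tokens would be roughly compute-optimal
actual_tokens = 15e12              # ~15T tokens, per the thread
print(actual_tokens / chinchilla_tokens)  # ~94x
```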
@sharan0909
Sharan Narang
2 years
Excited to share the Llama 3 models with everyone. This has been an INCREDIBLE team effort. The 8b and 70b models are available now. These are the best open source models.
6
5
59
@vedanujg
Vedanuj Goswami
2 years
Happy to be part of this incredible journey of Llama3 and to share the best open weight 8B and 70B models! Our largest 400B+ model is still cooking but we are providing a sneak peek into how it is trending! Check more details here https://t.co/tL12S6GymG
ai.meta.com
Today, we're introducing Meta Llama 3, the next generation of our state-of-the-art open source large language model. In the coming months, we expect to share new capabilities, additional model sizes,...
0
10
44