elie

@eliebakouch

Followers: 10K
Following: 15K
Media: 465
Statuses: 4K

Training LLMs (now: @huggingface) anon feedback: https://t.co/JmMh7Sfvxd

Joined January 2024
@eliebakouch
elie
1 month
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably https://t.co/iN2JtWhn23
123
897
6K
@katedeyneka
Kate Deyneka
6 hours
introducing X Wrapped 2025: visualize your posts as a GitHub contribution graph. roast your X persona. boast your stats! link below 👇🏻
45
13
145
@eliebakouch
elie
5 hours
rephrasing the web with every frontier open source model and training a 5T sparse MoE model on everything, the way to go
@AndrewCurran_
Andrew Curran
12 hours
Bloomberg is reporting that META's superintelligence lab is using Gemma, OpenAI's open source model, and Qwen to train their next large model, code named Avocado.
4
3
65
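For a sense of what "rephrasing the web" could mean in practice, here is a minimal sketch: paraphrase raw web documents with an open-weight model and keep both versions as pre-training text. The client assumes an OpenAI-compatible endpoint; the model id, prompt, and helper names are illustrative assumptions, not anyone's actual pipeline.

```python
# Hypothetical "rephrase the web" sketch: paraphrase web documents with an
# open-weight model served behind an OpenAI-compatible endpoint, then keep
# original + rephrased text as pre-training data. Model id is a placeholder.
from openai import OpenAI

client = OpenAI()  # assumes a locally hosted, OpenAI-compatible server

REPHRASE_PROMPT = (
    "Rewrite the following web page as clear, well-structured prose. "
    "Preserve all facts, numbers, and code. Do not add new information.\n\n{doc}"
)

def rephrase(doc: str, model: str = "open-frontier-model") -> str:
    """Return a paraphrased version of one document (model name is an assumption)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(doc=doc)}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def build_corpus(raw_docs):
    """Yield original and rephrased text, the mixture a sparse MoE run could train on."""
    for doc in raw_docs:
        yield doc
        yield rephrase(doc)
```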
@eliebakouch
elie
6 hours
(someone told me there might not be 40k GPUs available all the time, which makes sense)
0
0
2
@eliebakouch
elie
7 hours
just realizing that mistral's cluster will be ~2x smaller than poolside's cluster??? wtf
@AnjneyMidha
Anjney Midha
8 days
these were trained on ~3k h200s, a practice cluster, and yet, sota zone. mistral's 18k gb200 cluster comes online soon. today's releases are a warmup for the mistral 4 family. it will be an interesting few months for frontier open models
3
0
35
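As a quick sanity check, the "~2x smaller" figure follows from the numbers quoted in these posts (roughly 40k GPUs on the poolside side versus Mistral's 18k GB200s):

```python
# Back-of-the-envelope check of the "~2x smaller" claim, using the figures
# quoted in the posts above (40k GPUs vs 18k GB200s).
poolside_gpus = 40_000
mistral_gpus = 18_000
print(f"ratio: {poolside_gpus / mistral_gpus:.1f}x")  # -> ratio: 2.2x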
@karpathy
Andrej Karpathy
10 hours
nanoGPT - the first LLM to train and inference in space 🥹. It begins.
@AdiOltean
Adi Oltean
12 hours
We have just used the @Nvidia H100 onboard Starcloud-1 to train the first LLM in space! We trained the nano-GPT model from Andrej @Karpathy on the complete works of Shakespeare and successfully ran inference on it. We have also run inference on a preloaded Gemma model, and we
211
444
6K
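For anyone curious what "training nanoGPT on Shakespeare" boils down to, here is a minimal sketch in the spirit of nanoGPT's shakespeare_char example: a tiny character-level model trained and sampled with plain PyTorch. This is not the Starcloud-1 setup or nanoGPT itself; the file path and hyperparameters are assumptions.

```python
# Minimal sketch (not the actual nanoGPT or Starcloud setup): a tiny
# character-level language model trained on Shakespeare text, then sampled.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("shakespeare.txt").read()          # assumed local copy of the corpus
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class BigramLM(nn.Module):
    """Each token directly predicts logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx)                 # (B, T, vocab_size) logits

model = BigramLM(len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
block, batch = 64, 32

# Training: sample random windows and minimize next-character cross-entropy.
for step in range(1000):
    ix = torch.randint(len(data) - block - 1, (batch,))
    x = torch.stack([data[i:i + block] for i in ix])
    y = torch.stack([data[i + 1:i + block + 1] for i in ix])
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: sample a few hundred characters from the trained model.
itos = {i: c for c, i in stoi.items()}
idx = torch.zeros((1, 1), dtype=torch.long)
for _ in range(200):
    probs = F.softmax(model(idx)[:, -1, :], dim=-1)
    idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
print("".join(itos[int(i)] for i in idx[0]))
```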
@valent44355
Victor Renard
9 hours
DID YOU MISS IT AGAIN ???? COME ON
1
2
150
@ServiceNowRSRCH
ServiceNow AI Research
1 day
1/5 🚀 Apriel-1.6-15B-Thinker: a 15B multimodal reasoner scoring 57 on the Artificial Analysis Intelligence Index, approaching the performance of ~200B-scale frontier models while remaining an order of magnitude smaller.
🧠 Model weights: https://t.co/GE22SOIBfT
📄 Blog:
9
49
200
@eliebakouch
elie
13 hours
let's go! glad to see more labs releasing part of their training data
@OpenBMB
OpenBMB
13 hours
🔥 Ultra-FineWeb-en-v1.4 is coming! 2.2T tokens fully open-sourced! The core training fuel for MiniCPM4 / 4.1, fully updated based on FineWeb v1.4.0:
🆕 What's New
1️⃣ Fresher Data: Added CommonCrawl snapshots from Apr 2024 - Jun 2025 to capture the latest world knowledge.
2️⃣
0
1
36
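If you want to poke at a release like this without downloading trillions of tokens, a minimal sketch using the Hugging Face datasets library in streaming mode; the repo id and config name below are assumptions, check the link in the post for the exact v1.4 dataset name.

```python
# Minimal sketch: stream a few documents from a large pre-training corpus
# instead of downloading it. Repo id and config name are assumptions; see the
# announcement above for the exact Ultra-FineWeb-en-v1.4 dataset location.
from datasets import load_dataset

ds = load_dataset("openbmb/Ultra-FineWeb", name="en", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example.keys())        # inspect the schema (text, metadata, ...)
    if i >= 2:
        break
```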
@SeunghyunSEO7
Seunghyun Seo
19 hours
what a pace. can't wait to see their further scaled-up model using moe. https://t.co/IJng9yfZ3V
@ArtificialAnlys
Artificial Analysis
1 day
Motif Technologies, a 🇰🇷 Korean AI lab, has just launched Motif-2-12.7B-Reasoning, a 12.7B open weights reasoning model that scores 45 on the Artificial Analysis Intelligence Index and is now the leading model from Korea
Key benchmarking takeaways:
➤ Open weights:
1
1
28
@JustinWaugh
Justin Waugh
1 day
Directly comparing a benchmark of Devstral2-123B on my hardware to MiniMax-M2 (230B-A10B) shows the difference in performance MoE can give.
At 100 requests concurrently: MiniMax is 2x faster
At 2 requests concurrently: MiniMax is 3.5x faster
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
2
2
13
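The percentages in the reply follow directly from active parameter counts relative to Devstral2's 123B dense parameters (all of which are active per token). A quick check of the arithmetic, using only the sizes quoted in the post:

```python
# Reproducing the "X% fewer" figures from the post above: active parameters of
# each MoE relative to a 123B dense model, where every parameter is active.
dense_active = 123  # Devstral2-123B

moe_active = {"GLM 4.6": 32, "MiniMax M2": 10, "K2 thinking": 32}
for name, active in moe_active.items():
    print(f"{name}: {1 - active / dense_active:.0%} fewer active params")
# GLM 4.6: 74% fewer, MiniMax M2: 92% fewer, K2 thinking: 74% fewer
```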
@eliebakouch
elie
1 day
is there any paper or blog on synthetic data vs knowledge distillation, and whether using both together leads to diminishing returns? my understanding is that none of the open models (except gemma and lfm2) use knowledge distillation in pre-training, but most of them do use some kind
@JeffDean
Jeff Dean
1 day
Similar to how we use distillation ( https://t.co/KUjmWfHyxC) to create awesome Gemini Flash models that are high quality and very computationally efficient from larger-scale Pro models, Waymo similarly uses distillation from larger models to create computationally efficient
8
2
50
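For context on what "knowledge distillation in pre-training" refers to here, a minimal sketch of the standard soft-target loss (KL divergence between temperature-scaled teacher and student next-token distributions, mixed with the usual cross-entropy). The temperature and mixing weight are illustrative defaults, not any particular lab's recipe.

```python
# Minimal sketch of logit-based knowledge distillation for language modeling.
# Temperature and alpha are illustrative, not taken from any model above.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, tau=2.0, alpha=0.5):
    """student_logits, teacher_logits: (B, L, V); targets: (B, L) token ids."""
    V = student_logits.size(-1)
    # Hard-label loss on the actual next tokens.
    ce = F.cross_entropy(student_logits.view(-1, V), targets.view(-1))
    # Soft-label loss: KL between temperature-scaled teacher and student distributions.
    logp_s = F.log_softmax(student_logits / tau, dim=-1).view(-1, V)
    p_t = F.softmax(teacher_logits / tau, dim=-1).view(-1, V)
    kl = F.kl_div(logp_s, p_t, reduction="batchmean") * (tau * tau)
    return alpha * ce + (1 - alpha) * kl
```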
@BenTheEgg
Benjamin Lefaudeux 🇺🇦
1 day
I think that's a very fair point, but also swe-bench is not a great proxy anyway, so the whole graph is a bit meh? (notice the y axis on top) Worth a try is all I would say; really impressive model to my eyes in practical use, go check for yourself! https://t.co/HbEJqN6YB0
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
0
1
8
@qtnx_
Q
1 day
@eliebakouch @ADarmouni not really! a lot of orgs have only 4xh100 for their entire company, so this is where something dense shines
0
0
8
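The point is mostly about memory: on a small node, total parameters (everything that must sit in VRAM) matter more than active parameters (what runs per token). A rough weights-only estimate under assumed bf16 precision, ignoring KV cache, activations, and framework overhead, using the model sizes quoted in the thread:

```python
# Rough weights-only memory check for a 4x H100 (80 GB each) node, bf16 weights.
# Ignores KV cache, activations, and overhead; model sizes from the thread above.
node_vram_gb = 4 * 80

def weight_gb(params_billion, bytes_per_param=2):  # bf16 = 2 bytes/param
    return params_billion * bytes_per_param

for name, total_b in [("Devstral2-123B (dense)", 123),
                      ("MiniMax-M2 (230B total, 10B active)", 230)]:
    need = weight_gb(total_b)
    fits = "fits" if need <= node_vram_gb else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights -> {fits} in {node_vram_gb} GB")
# Devstral2-123B: ~246 GB (fits); MiniMax-M2: ~460 GB (needs offload/quantization)
```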
@eliebakouch
elie
1 day
i should clarify, no hate on mistral here, i'm really glad they released it in open weight and i'm sure it will be useful for researchers! It's the comparison on total parameters for dense vs moe that i find very misleading, but it's not something that they advertise heavily (it's
0
0
4
@eliebakouch
elie
1 day
i should clarify, no hate on mistral here, i'm really glad they released it in open weight and i'm sure it will be useful for researchers! It's the comparison on total parameters for dense vs moe that i find very misleading, but it's not something that they advertise heavily (it's
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
4
0
47
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
@MistralAI
Mistral AI
2 days
Introducing the Devstral 2 coding model family. Two sizes, both open source. Also, meet Mistral Vibe, a native CLI, enabling end-to-end automation. 🧵
18
17
289
@samsja19
samsja
2 days
We committed nodes of compute for running all of our RL examples/ablations every night to make sure we catch any regression in prime-rl
@mikasenghaas
Mika Senghaas
2 days
btw, prime-rl has had multi-hour training runs in CI running overnight every day since last week
6
6
117
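A minimal sketch of the kind of nightly regression gate such CI runs enable: after an overnight training run writes its metrics, compare the end-of-run number against a stored baseline and fail loudly on a regression. The metrics file, key, and thresholds are hypothetical, not prime-rl's actual setup.

```python
# Hypothetical nightly regression check (not prime-rl's actual CI): read the
# metrics an overnight run produced and fail if the final reward regressed.
import json
import sys

BASELINE_REWARD = 0.72     # illustrative baseline from a known-good run
TOLERANCE = 0.02           # allowed nightly noise

with open("run_metrics.json") as f:          # assumed output of the training run
    final_reward = json.load(f)["final_reward"]

if final_reward < BASELINE_REWARD - TOLERANCE:
    sys.exit(f"regression: final_reward={final_reward:.3f} < baseline {BASELINE_REWARD}")
print(f"ok: final_reward={final_reward:.3f}")
```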
@jietang
jietang
2 days
try GLM-4.6v
@testingcatalog
TestingCatalog News 🗞
3 days
Z AI released GLM-4.6V (106B) and GLM-4.6V-Flash (9B) open models! Both are available on Z AI Chat and APIs. Coding evals 👀
6
3
88
@GCGraffiti
Graffiti Clothes
16 hours
Hey @EUCouncil why don’t you guys go fine these police officers? Or are y’all too busy trying to steal more money from American companies and shareholders, since you all opened an investigation into @Google now as well?
5
6
56
@eliebakouch
elie
2 days
btw i didn't mean at all that glm are benchmaxing here lol, i was thinking that some of the benchmarks here can be saturated with distillation and some can't (if they are distilling)
0
0
0
@eliebakouch
elie
2 days
> Today, we’re building an infrastructure-first, deep-tech company with a simple and ambitious mission: "Make frontier-level AI infrastructure open and accessible to everyone." this is very very exciting 🥹
@ying11231
Ying Sheng
2 days
We've been running @radixark for a few months, started by many core developers in SGLang @lmsysorg and its extended ecosystem (slime @slime_framework , AReaL @jxwuyi). I left @xai in August — a place where I built deep emotions and countless beautiful memories. It was the best
0
1
36
@eliebakouch
elie
3 days
interesting how some benchmarks don't seem to get a huge boost between GLM-4.6V and the flash version (which is ONLY 9B dense compared to 106B A12B MoE)
@Zai_org
Z.ai
3 days
GLM-4.6V Series is here🚀
- GLM-4.6V (106B): flagship vision-language model with 128K context
- GLM-4.6V-Flash (9B): ultra-fast, lightweight version for local and low-latency workloads
First-ever native Function Calling in the GLM vision model family
Weights:
8
2
107