elie

@eliebakouch

Followers: 10K
Following: 15K
Media: 465
Statuses: 4K

Training LLMs (now: @huggingface) anon feedback: https://t.co/JmMh7Sfvxd

Joined January 2024
@eliebakouch
elie
1 month
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training and infra. 200+ pages of what worked, what didn’t, and how to make it run reliably https://t.co/iN2JtWhn23
123
897
6K
@katedeyneka
Kate Deyneka
6 hours
introducing X Wrapped 2025: visualize your posts as a GitHub contribution graph. roast your X persona. boast your stats! link below 👇🏻
45
13
145
@eliebakouch
elie
5 hours
rephrasing the web with every frontier open source model and training a 5T sparse MoE model on everything, the way to go
@AndrewCurran_
Andrew Curran
12 hours
Bloomberg is reporting that META's superintelligence lab is using Gemma, OpenAI's open source model, and Qwen to train their next large model, code named Avocado.
4
3
65
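For a sense of what "rephrasing the web" could mean in practice, here is a minimal sketch: paraphrase raw web documents with an open-weight model and keep both versions as pre-training text. The client assumes an OpenAI-compatible endpoint; the model id, prompt, and helper names are illustrative assumptions, not anyone's actual pipeline.

```python
# Hypothetical "rephrase the web" sketch: paraphrase web documents with an
# open-weight model served behind an OpenAI-compatible endpoint, then keep
# original + rephrased text as pre-training data. Model id is a placeholder.
from openai import OpenAI

client = OpenAI()  # assumes a locally hosted, OpenAI-compatible server

REPHRASE_PROMPT = (
    "Rewrite the following web page as clear, well-structured prose. "
    "Preserve all facts, numbers, and code. Do not add new information.\n\n{doc}"
)

def rephrase(doc: str, model: str = "open-frontier-model") -> str:
    """Return a paraphrased version of one document (model name is an assumption)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(doc=doc)}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def build_corpus(raw_docs):
    """Yield original and rephrased text, the mixture a sparse MoE run could train on."""
    for doc in raw_docs:
        yield doc
        yield rephrase(doc)
```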
@eliebakouch
elie
6 hours
(someone told me there might not be 40k GPUs available all the time, which makes sense)
0
0
2
@eliebakouch
elie
7 hours
just realizing that mistral's cluster will be ~2x smaller than poolside's cluster??? wtf
@AnjneyMidha
Anjney Midha
8 days
these were trained on ~3k h200s, a practice cluster, and yet, sota zone. mistral's 18k gb200 cluster comes online soon. today's releases are a warmup for the mistral 4 family. it will be an interesting few months for frontier open models
3
0
35
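As a quick sanity check, the "~2x smaller" figure follows from the numbers quoted in these posts (roughly 40k GPUs on the poolside side versus Mistral's 18k GB200s):

```python
# Back-of-the-envelope check of the "~2x smaller" claim, using the figures
# quoted in the posts above (40k GPUs vs 18k GB200s).
poolside_gpus = 40_000
mistral_gpus = 18_000
print(f"ratio: {poolside_gpus / mistral_gpus:.1f}x")  # -> ratio: 2.2x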
@karpathy
Andrej Karpathy
10 hours
nanoGPT - the first LLM to train and inference in space 🥹. It begins.
@AdiOltean
Adi Oltean
12 hours
We have just used the @Nvidia H100 onboard Starcloud-1 to train the first LLM in space! We trained the nano-GPT model from Andrej @Karpathy on the complete works of Shakespeare and successfully ran inference on it. We have also run inference on a preloaded Gemma model, and we
211
444
6K
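For anyone curious what "training nanoGPT on Shakespeare" boils down to, here is a minimal sketch in the spirit of nanoGPT's shakespeare_char example: a tiny character-level model trained and sampled with plain PyTorch. This is not the Starcloud-1 setup or nanoGPT itself; the file path and hyperparameters are assumptions.

```python
# Minimal sketch (not the actual nanoGPT or Starcloud setup): a tiny
# character-level language model trained on Shakespeare text, then sampled.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("shakespeare.txt").read()          # assumed local copy of the corpus
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class BigramLM(nn.Module):
    """Each token directly predicts logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx)                 # (B, T, vocab_size) logits

model = BigramLM(len(chars))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
block, batch = 64, 32

# Training: sample random windows and minimize next-character cross-entropy.
for step in range(1000):
    ix = torch.randint(len(data) - block - 1, (batch,))
    x = torch.stack([data[i:i + block] for i in ix])
    y = torch.stack([data[i + 1:i + block + 1] for i in ix])
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: sample a few hundred characters from the trained model.
itos = {i: c for c, i in stoi.items()}
idx = torch.zeros((1, 1), dtype=torch.long)
for _ in range(200):
    probs = F.softmax(model(idx)[:, -1, :], dim=-1)
    idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
print("".join(itos[int(i)] for i in idx[0]))
```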
@valent44355
Victor Renard
9 hours
DID YOU MISS IT AGAIN ???? COME ON
1
2
150
@ServiceNowRSRCH
ServiceNow AI Research
1 day
1/5 🚀 Apriel-1.6-15B-Thinker: a 15B multimodal reasoner scoring 57 on the Artificial Analysis Intelligence Index, approaching the performance of ~200B-scale frontier models while remaining an order of magnitude smaller.
🧠 Model weights: https://t.co/GE22SOIBfT
📄 Blog:
9
49
200
@eliebakouch
elie
13 hours
let's go! glad to see more labs releasing part of their training data
@OpenBMB
OpenBMB
13 hours
🔥 Ultra-FineWeb-en-v1.4 is coming! 2.2T tokens fully open-sourced! The core training fuel for MiniCPM4 / 4.1, fully updated based on FineWeb v1.4.0:
🆕 What's New
1️⃣ Fresher Data: Added CommonCrawl snapshots from Apr 2024 - Jun 2025 to capture the latest world knowledge.
2️⃣
0
1
36
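If you want to poke at a release like this without downloading trillions of tokens, a minimal sketch using the Hugging Face datasets library in streaming mode; the repo id and config name below are assumptions, check the link in the post for the exact v1.4 dataset name.

```python
# Minimal sketch: stream a few documents from a large pre-training corpus
# instead of downloading it. Repo id and config name are assumptions; see the
# announcement above for the exact Ultra-FineWeb-en-v1.4 dataset location.
from datasets import load_dataset

ds = load_dataset("openbmb/Ultra-FineWeb", name="en", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example.keys())        # inspect the schema (text, metadata, ...)
    if i >= 2:
        break
```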
@SeunghyunSEO7
Seunghyun Seo
19 hours
what a pace. can't wait to see their further scaled-up model using moe. https://t.co/IJng9yfZ3V
@ArtificialAnlys
Artificial Analysis
1 day
Motif Technologies, a 🇰🇷 Korean AI lab, has just launched Motif-2-12.7B-Reasoning, a 12.7B open weights reasoning model that scores 45 on the Artificial Analysis Intelligence Index and is now the leading model from Korea
Key benchmarking takeaways:
➤ Open weights:
1
1
28
@JustinWaugh
Justin Waugh
1 day
Directly comparing a benchmark of Devstral2-123B on my hardware to MiniMax-M2 (230B-A10B) shows the difference in performance MoE can give.
At 100 requests concurrently: MiniMax is 2x faster
At 2 requests concurrently: MiniMax is 3.5x faster
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
2
2
13
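The percentages in the reply follow directly from active parameter counts relative to Devstral2's 123B dense parameters (all of which are active per token). A quick check of the arithmetic, using only the sizes quoted in the post:

```python
# Reproducing the "X% fewer" figures from the post above: active parameters of
# each MoE relative to a 123B dense model, where every parameter is active.
dense_active = 123  # Devstral2-123B

moe_active = {"GLM 4.6": 32, "MiniMax M2": 10, "K2 thinking": 32}
for name, active in moe_active.items():
    print(f"{name}: {1 - active / dense_active:.0%} fewer active params")
# GLM 4.6: 74% fewer, MiniMax M2: 92% fewer, K2 thinking: 74% fewer
```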
@eliebakouch
elie
1 day
is there any paper or blog on synthetic data vs knowledge distillation, and whether using both together leads to diminishing returns? my understanding is that none of the open models (except gemma and lfm2) use knowledge distillation in pre-training, but most of them do use some kind
@JeffDean
Jeff Dean
1 day
Similar to how we use distillation ( https://t.co/KUjmWfHyxC) to create awesome Gemini Flash models that are high quality and very computationally efficient from larger-scale Pro models, Waymo similarly uses distillation from larger models to create computationally efficient
8
2
50
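For context on what "knowledge distillation in pre-training" refers to here, a minimal sketch of the standard soft-target loss (KL divergence between temperature-scaled teacher and student next-token distributions, mixed with the usual cross-entropy). The temperature and mixing weight are illustrative defaults, not any particular lab's recipe.

```python
# Minimal sketch of logit-based knowledge distillation for language modeling.
# Temperature and alpha are illustrative, not taken from any model above.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, tau=2.0, alpha=0.5):
    """student_logits, teacher_logits: (B, L, V); targets: (B, L) token ids."""
    V = student_logits.size(-1)
    # Hard-label loss on the actual next tokens.
    ce = F.cross_entropy(student_logits.view(-1, V), targets.view(-1))
    # Soft-label loss: KL between temperature-scaled teacher and student distributions.
    logp_s = F.log_softmax(student_logits / tau, dim=-1).view(-1, V)
    p_t = F.softmax(teacher_logits / tau, dim=-1).view(-1, V)
    kl = F.kl_div(logp_s, p_t, reduction="batchmean") * (tau * tau)
    return alpha * ce + (1 - alpha) * kl
```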
@BenTheEgg
Benjamin Lefaudeux 🇺🇦
1 day
I think that's a very fair point, but also swe-bench is not a great proxy anyway, so the whole graph is a bit meh? (notice the y axis on top) Worth a try is all I would say; really impressive model to my eyes in practical use, go check for yourself! https://t.co/HbEJqN6YB0
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
0
1
8
@qtnx_
Q
1 day
@eliebakouch @ADarmouni not really! a lot of orgs have only 4xh100 for their entire company, so this is where something dense shines
0
0
8
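The point is mostly about memory: on a small node, total parameters (everything that must sit in VRAM) matter more than active parameters (what runs per token). A rough weights-only estimate under assumed bf16 precision, ignoring KV cache, activations, and framework overhead, using the model sizes quoted in the thread:

```python
# Rough weights-only memory check for a 4x H100 (80 GB each) node, bf16 weights.
# Ignores KV cache, activations, and overhead; model sizes from the thread above.
node_vram_gb = 4 * 80

def weight_gb(params_billion, bytes_per_param=2):  # bf16 = 2 bytes/param
    return params_billion * bytes_per_param

for name, total_b in [("Devstral2-123B (dense)", 123),
                      ("MiniMax-M2 (230B total, 10B active)", 230)]:
    need = weight_gb(total_b)
    fits = "fits" if need <= node_vram_gb else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights -> {fits} in {node_vram_gb} GB")
# Devstral2-123B: ~246 GB (fits); MiniMax-M2: ~460 GB (needs offload/quantization)
```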
@eliebakouch
elie
1 day
i should clarify, no hate on mistral here, i'm really glad they released it in open weight and i'm sure it will be useful for researchers! It's the comparison on total parameters for dense vs moe that i find very misleading, but it's not something that they advertise heavily (it's
0
0
4
@eliebakouch
elie
1 day
i should clarify, no hate on mistral here, i'm really glad they released it in open weight and i'm sure it will be useful for researchers! It's the comparison on total parameters for dense vs moe that i find very misleading, but it's not something that they advertise heavily (it's
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
4
0
47
@eliebakouch
elie
1 day
nice that it's open weight, but comparing dense vs moe models and only looking at total params is pretty unfair, if you look at active params instead of total params it's a different story:
- GLM 4.6 (32B): 74% fewer
- Minimax M2 (10B): 92% fewer
- K2 thinking (32B): 74% fewer
-
@MistralAI
Mistral AI
2 days
Introducing the Devstral 2 coding model family. Two sizes, both open source. Also, meet Mistral Vibe, a native CLI, enabling end-to-end automation. 🧵
18
17
289
@samsja19
samsja
2 days
We committed nodes of compute for running all of our RL examples/ablations every night to make sure we catch any regression in prime-rl
@mikasenghaas
Mika Senghaas
2 days
btw, prime-rl has had multi-hour training runs in CI running overnight every day since last week
6
6
117
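A minimal sketch of the kind of nightly regression gate such CI runs enable: after an overnight training run writes its metrics, compare the end-of-run number against a stored baseline and fail loudly on a regression. The metrics file, key, and thresholds are hypothetical, not prime-rl's actual setup.

```python
# Hypothetical nightly regression check (not prime-rl's actual CI): read the
# metrics an overnight run produced and fail if the final reward regressed.
import json
import sys

BASELINE_REWARD = 0.72     # illustrative baseline from a known-good run
TOLERANCE = 0.02           # allowed nightly noise

with open("run_metrics.json") as f:          # assumed output of the training run
    final_reward = json.load(f)["final_reward"]

if final_reward < BASELINE_REWARD - TOLERANCE:
    sys.exit(f"regression: final_reward={final_reward:.3f} < baseline {BASELINE_REWARD}")
print(f"ok: final_reward={final_reward:.3f}")
```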
@jietang
jietang
2 days
try GLM-4.6v
@testingcatalog
TestingCatalog News 🗞
3 days
Z AI released GLM-4.6V (106B) and GLM-4.6V-Flash (9B) open models! Both are available on Z AI Chat and APIs. Coding evals 👀
6
3
88
@GCGraffiti
Graffiti Clothes
16 hours
Hey @EUCouncil why don’t you guys go fine these police officers? Or are y’all too busy trying to steal more money from American companies and shareholders, since you all opened an investigation into @Google now as well?
5
6
56
@eliebakouch
elie
2 days
btw i didn't mean at all that glm are benchmaxing here lol, i was thinking that some of the benchmarks here can be saturated with distillation and some can't (if they are distilling)
0
0
0
@eliebakouch
elie
2 days
> Today, we’re building an infrastructure-first, deep-tech company with a simple and ambitious mission: "Make frontier-level AI infrastructure open and accessible to everyone." this is very very exciting 🥹
@ying11231
Ying Sheng
2 days
We've been running @radixark for a few months, started by many core developers in SGLang @lmsysorg and its extended ecosystem (slime @slime_framework , AReaL @jxwuyi). I left @xai in August — a place where I built deep emotions and countless beautiful memories. It was the best
0
1
36
@eliebakouch
elie
3 days
interesting how some benchmarks don't seem to get a huge boost between GLM-4.6V and the flash version (which is ONLY 9B dense compared to 106B A12B MoE)
@Zai_org
Z.ai
3 days
GLM-4.6V Series is here🚀
- GLM-4.6V (106B): flagship vision-language model with 128K context
- GLM-4.6V-Flash (9B): ultra-fast, lightweight version for local and low-latency workloads
First-ever native Function Calling in the GLM vision model family
Weights:
8
2
107