Maxime Labonne
@maximelabonne
Followers
25K
Following
9K
Media
762
Statuses
3K
Head of Post-Training @liquidai 💻 GitHub: https://t.co/ElXDsjz8YP 🤗 HF: https://t.co/2ECS7GiJGD 📝 Blog: https://t.co/Gz5bhbXWT0
London, England
Joined October 2017
You always think you're safe until your job becomes a benchmark.
We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs. We expect this to be an important indicator for AI R&D automation as it unfolds over the next few years. https://t.co/dVSSHkpAE1 https://t.co/vqZNrQw66z 1/n
3
1
56
Beyond removing refusals, I hope these techniques can be used for "latent fine-tuning" to customize models at inference time. https://t.co/rqacwXnmzf
github.com
Fully automatic censorship removal for language models - p-e-w/heretic
0
2
8
Abliterate LLMs with Heretic 1.1. It's cool to see this project evolving into a solid open-source library. The new viz feature shows how the abliteration process gradually groups the residual vectors into two nice clusters.
2
3
39
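The core of the abliteration idea above fits in a few lines. This is not Heretic's actual API, just a minimal PyTorch sketch assuming residual-stream activations have already been collected at one layer for a "harmful" and a "harmless" prompt set (the two clusters mentioned in the tweet); tensor names and shapes are illustrative.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Unit vector between the two activation clusters.

    harmful_acts / harmless_acts: residual-stream activations collected at
    one layer, shape (n_prompts, d_model).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along the refusal direction."""
    return acts - (acts @ direction).unsqueeze(-1) * direction
```

In practice the same direction is typically also projected out of the weight matrices that write into the residual stream so the edit persists without runtime hooks; per the repo description, Heretic's selling point is making that whole process fully automatic.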
@asapzzhou More details in @asapzzhou's thread, please give him a like :)
(1/n) Tiny-A2D: An Open Recipe to Turn Any AR LM into a Diffusion LM
Code (dLLM): https://t.co/yYNBo4N99B
Checkpoints: https://t.co/fBG4MmoaTZ
With dLLM, you can turn ANY autoregressive LM into a diffusion LM (parallel generation + infilling) with minimal compute. Using this
0
2
13
@asapzzhou Here are checkpoints made with this recipe you can actually try in dLLM
huggingface.co
1
1
26
Open recipe to turn Qwen3 into a diffusion LLM
> Swap the causal mask for bidirectional attention
> Source model matters a lot for performance
> Block diffusion (BD3LM) >> masked diffusion (MDLM)
> Light SFT with masking
Great work from @asapzzhou with his dLLM library!
17
122
875
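A minimal sketch of the "light SFT with masking" step from the recipe above, assuming the source model has already been switched to bidirectional attention and returns raw logits; the mask-token id and the `model` interface are placeholders, and this shows the simpler masked-diffusion objective rather than the BD3LM block-diffusion variant the tweet says works better.

```python
import torch
import torch.nn.functional as F

MASK_ID = 151669  # placeholder id for a [MASK] token added to the tokenizer

def masked_sft_step(model, input_ids: torch.LongTensor, mask_prob: float = 0.15):
    """One masked-denoising SFT step.

    Assumes `model` uses bidirectional (non-causal) attention and returns
    raw logits of shape (batch, seq_len, vocab_size).
    """
    # Corrupt a random fraction of tokens with [MASK]
    is_masked = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    corrupted = torch.where(is_masked,
                            torch.full_like(input_ids, MASK_ID),
                            input_ids)

    logits = model(corrupted)  # every position attends to the full sequence

    # Cross-entropy only on masked positions: predict the original tokens
    return F.cross_entropy(logits[is_masked], input_ids[is_masked])
```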
The "commodity AI" thesis is wrong. The API market is splitting into two modalities: - Premium models (Claude) dominate programming and high-stakes work. Users pay $2+/M tokens because correct code > cheap code. - Cheap open models own roleplay and creative tasks. Volume is
11
8
73
Please release the forbidden training dataset.
BREAKING: @OpenAI must turn over 20 million+ chat logs to plaintiffs, Judge Ona Wang has ruled in a 9-page Order just issued:
3
5
49
Proudly powered by LFM2 for the language backbone. Compared to Qwen2.5-1.5B, it achieves 2.9x higher throughput and 4x larger context length on NPU hardware.
Today we're releasing AutoNeural-VL-1.5B, the world's first real-time multimodal model built for in-car AI. It runs fully local on the @Qualcomm SA8295P NPU with a software-hardware co-designed architecture, setting a new bar for speed and quality. AutoNeural redefines what AI
1
3
51
Today we introduce Liquid Labs, our advanced research unit, with the goal of understanding and building efficient and adaptive intelligence systems. Liquid Labs consolidates our existing research efforts at Liquid across architecture of foundation models, multimodality,
18
34
240
You don't understand evals. Everybody in AI should read this.
Hey twitter! I'm releasing the LLM Evaluation Guidebook v2! Updated, nicer to read, interactive graphics, etc! https://t.co/xG4VQOj2wN After this, I'm off: I'm taking a sabbatical to go hike with my dogs :D (back @huggingface in Dec *2026*) See you all next year!
4
47
586
LFM2 Technical Report dropped! 🥳 It provides details about the LFM2 architecture, pre-training, post-training, vision, audio, and ColBERT models. It's 51 pages long, have fun!
7
37
167
🚨 New Blog Alert: Is AdamW overkill for RLVR? We found that vanilla SGD is 1. As performant as AdamW, 2. 36x more parameter efficient naturally (much more than a rank-1 LoRA) 🤯 Looks like a "free lunch". Maybe it's time to rethink the optimizers for RLVR 🧵
16
57
476
Here's the calculation without FP32 gradient accumulation:
- Model parameters (FP32): 4 bytes per param
- Gradients (FP32): 4 bytes per param
- Adam's optimizer states (momentum + variance): 8 bytes per param
- SGD: 0!
(Still have to add activations on top.)
0
0
2
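The byte counts above translate directly into a back-of-the-envelope memory estimate; the sketch below just multiplies them out, using a 7B-parameter model as an illustrative size (not a figure from the thread).

```python
def train_memory_gb(n_params: float, optimizer: str = "adamw") -> float:
    """Weights + gradients + optimizer states in FP32, excluding activations."""
    bytes_per_param = 4 + 4                 # FP32 weights + FP32 gradients
    if optimizer == "adamw":
        bytes_per_param += 4 + 4            # Adam momentum + variance
    # vanilla SGD (no momentum) keeps no extra state
    return n_params * bytes_per_param / 1e9

print(train_memory_gb(7e9, "adamw"))  # 112.0 GB
print(train_memory_gb(7e9, "sgd"))    #  56.0 GB
```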
Blog post:
notion.so
Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tur, Hao Peng
1
1
12
Does SGD > AdamW for RLVR?
> RLVR updates very few parameters (sparse subnetwork)
> The "active" parameters may share similar properties
> Similar loss curvature → single learning rate sufficient
> SGD (uniform LR) ≈ AdamW (adaptive LR)
It means RLFT on potato GPUs!
2
4
31
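In code, the change the thread argues for is a one-line optimizer swap. The sketch below uses a toy linear "policy" and a dummy scalar loss as stand-ins for the real RLVR setup (none of it is the blog post's code); the point is that plain SGD carries no per-parameter state and applies a single uniform learning rate.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a tiny "policy" and a dummy loss. In a real RLVR run these
# would be the fine-tuned LLM and the verifiable-reward policy loss.
policy = nn.Linear(16, 4)

def rl_loss(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    return model(batch).logsumexp(dim=-1).mean()

# The only change vs. the usual recipe: plain SGD, no momentum, no
# per-parameter state, one learning rate for every weight.
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-5, momentum=0.0)
# was: torch.optim.AdamW(policy.parameters(), lr=1e-6)

for _ in range(3):                       # dummy "rollout" batches
    loss = rl_loss(policy, torch.randn(8, 16))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```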
@nvidia Paper: https://t.co/Ub8MkEdp8Z (Please ignore the benchmaxx, it doesn't matter in this case)
arxiv.org
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally...
0
7
29
ToolOrchestra is such cool work from @nvidia. Just an 8B model trained on calling tools and other LLMs to answer queries. It's a great demo of what frontier SLMs will be about in 2026.
15
79
488
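For a rough picture of what "an 8B model trained on calling tools and other LLMs" means in practice, here is a toy routing loop. Every component in it (the router heuristic, the calculator tool, the fallback model) is a hypothetical stand-in, not ToolOrchestra's code or training setup.

```python
# All components below are hypothetical stand-ins, not ToolOrchestra's code.

def small_model_route(query: str) -> dict:
    """Placeholder router: in the real system, a trained SLM makes this call."""
    if any(ch.isdigit() for ch in query):
        return {"action": "tool", "name": "calculator", "args": {"expr": query}}
    if len(query.split()) > 30:
        return {"action": "delegate"}      # hand hard queries to a bigger LLM
    return {"action": "answer"}

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))   # toy tool, demo only

def big_llm(query: str) -> str:
    return f"[large-model answer to: {query}]"

TOOLS = {"calculator": calculator}

def orchestrate(query: str) -> str:
    decision = small_model_route(query)
    if decision["action"] == "tool":
        return f"Tool result: {TOOLS[decision['name']](**decision['args'])}"
    if decision["action"] == "delegate":
        return big_llm(query)
    return f"[small-model answer to: {query}]"

print(orchestrate("2 + 2"))   # routes to the calculator tool
```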