Haojun Zhao
@Haojun_Zhao14
Followers
922
Following
870
Media
27
Statuses
178
Research engineer @huggingface prev @Polytechnique
Joined November 2021
Introducing Picotron: a minimal educational repository for 4D parallelism, inspired by @karpathy's NanoGPT. @FerdinandMom and I started this project to rebuild 4D parallelism with super easy-to-understand code for everyone learning about LLMs
8
36
188
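This isn't Picotron's actual code, but a minimal sketch of what the four axes behind "4D parallelism" (data, tensor, pipeline, context) can look like when set up as process groups with torch.distributed's DeviceMesh. The dimension names, sizes, and which axis maps to which style of communication are illustrative assumptions.

```python
# Minimal 4D-parallelism process-group setup sketch (illustrative, not Picotron's code).
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")

# 64 GPUs split as 2 (dp) x 2 (cp) x 4 (pp) x 4 (tp); the product must equal the world size.
mesh = init_device_mesh(
    "cuda",
    mesh_shape=(2, 2, 4, 4),
    mesh_dim_names=("dp", "cp", "pp", "tp"),
)

# Each 1-D sub-mesh exposes the process group a given parallelism style communicates over:
dp_group = mesh["dp"].get_group()  # gradient all-reduce across data-parallel replicas
cp_group = mesh["cp"].get_group()  # sequence sharding / ring attention (context parallel)
pp_group = mesh["pp"].get_group()  # point-to-point sends between pipeline stages
tp_group = mesh["tp"].get_group()  # all-reduce/all-gather inside tensor-parallel layers
```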
@eliebakouch Ours was forked from torchtitan to start but has since been heavily modified (although some of the core remains intact!). For example, every group of researchers has its own opinion on how configs should be done.
6
2
90
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves
356
2K
9K
Kimi K2 tech report is full of gems as always. Here are my notes on it: > MuonClip: Pretty crazy how after 70k the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale but with
7
51
337
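A rough sketch of the QK-clip idea as described above: after an optimizer step, if a head's max pre-softmax attention logit exceeds a threshold, its query/key projections are shrunk so the logit falls back under it (which is why it goes inactive once training stabilizes). This is not Moonshot's implementation; the threshold value and the even sqrt split of the correction between W_q and W_k are assumptions for illustration.

```python
# Hedged QK-clip sketch: rescale a head's q/k projections when its attention logits blow up.
import torch

@torch.no_grad()
def qk_clip(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float, tau: float = 100.0):
    """w_q, w_k: one head's projection weights; max_logit: largest q·k/sqrt(d) seen this step."""
    if max_logit > tau:
        # Split the correction as sqrt() on each factor so the q·k product lands exactly at tau.
        scale = (tau / max_logit) ** 0.5
        w_q.mul_(scale)
        w_k.mul_(scale)
    # If max_logit <= tau the weights are untouched, i.e. the clip is "inactive".
```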
500k samples of multilingual post-training data in 5 languages: French, Spanish, Italian, German and Portuguese. To address the lack of multilingual post-training datasets, we created these samples and found they improve performance on benchmarks like Global MMLU, Belebele, and
3
30
170
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹 Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence
279
1K
7K
New Course: Post-training of LLMs Learn to post-train and customize an LLM in this short course, taught by @BanghuaZ, Assistant Professor at the University of Washington @UW, and co-founder of @NexusflowX. Training an LLM to follow instructions or answer questions has two key
39
300
1K
A new iPhone in the robot world?
Thrilled to finally share what we've been working on for months at @huggingface 🤝@pollenrobotics Our first robot: Reachy Mini A dream come true: cute and low priced, hackable yet easy to use, powered by open-source and the infinite community. Tiny price, small size, huge
0
0
4
Super nice plot! Love it
Super excited to share SmolLM3, a new strong 3B model. SmolLM3 is fully open, we share the recipe, the dataset, the training codebase and much more! > Trained on 11T tokens on 384 H100s for 220k GPU hours > Supports long context up to 128k thanks to NoPE and intra-document masking >
0
0
3
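Intra-document masking, mentioned above, is a well-defined trick for packed sequences: when several documents share one training sequence, a token may only attend to earlier tokens from the same document. A minimal sketch (not SmolLM3's training code; the function name and example are mine):

```python
# Block-diagonal causal mask from per-token document ids in a packed sequence.
import torch

def intra_doc_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) integer id of the document each token belongs to.
    Returns a (seq_len, seq_len) boolean mask, True where attention is allowed."""
    same_doc = doc_ids[:, None] == doc_ids[None, :]                      # same document
    causal = torch.tril(torch.ones(len(doc_ids), len(doc_ids), dtype=torch.bool))  # no lookahead
    return same_doc & causal

# Example: two documents of lengths 3 and 2 packed into one sequence of 5 tokens.
mask = intra_doc_causal_mask(torch.tensor([0, 0, 0, 1, 1]))
```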
Congrats! This is impressive!
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) https://t.co/duszyObJsG
0
0
2
This blew my mind. One line of code: • Qwen2.5-0.5B training speed: 5% → 40% (MFU) • Qwen3-8B training speed: 34% → 54% (MFU) The culprit? A careless tensor transpose in the cross-entropy loss. Big thanks to @xingkaiyu for spotting it.
10
46
706
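The tweet doesn't show the offending line, so this is only a hedged illustration of the general failure mode it describes, not the actual patch: a needless transpose makes the logits non-contiguous, so flattening them for the cross-entropy call silently materializes a full copy of a (batch, seq, vocab)-sized tensor. The shapes and vocab size below are illustrative.

```python
# Same loss value, very different memory traffic.
import torch
import torch.nn.functional as F

batch, seq, vocab = 2, 2048, 151_936
logits = torch.randn(batch, seq, vocab, device="cuda", dtype=torch.bfloat16)
labels = torch.randint(vocab, (batch, seq), device="cuda")

# Slow path: transpose first -> the tensor is no longer contiguous, so .reshape() must copy
# the entire multi-GB logits tensor before the loss kernel even runs.
loss_slow = F.cross_entropy(
    logits.transpose(0, 1).reshape(-1, vocab).float(),
    labels.transpose(0, 1).reshape(-1),
)

# Fast path: flatten the contiguous tensor directly -> .view() is a zero-copy reshape.
loss_fast = F.cross_entropy(logits.view(-1, vocab).float(), labels.view(-1))
```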
Remarkable progress of the Hugging Face science team in 2025: Open-R1, smolagents, SmolVLM2, Ultra-Scale Playbook, OlympicCoder, Open Computer Agent, Reachy Mini, SmolVLA, LeRobot Hackathon and many more... A summary of the projects we released so far this year🧶
10
42
161
We are so excited to announce a new open-source challenge in collaboration with @proximafusion: unlocking fusion with AI. If you haven't been following, fusion is how the sun makes energy and is, in the long term, our best bet for clean, safe, and virtually limitless energy. In the
10
25
112
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
8
102
427
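The tweet names the idea but not the recipe, so the following is only a loose sketch of what deduplication-based upsampling ("rehydration") could look like: documents that had many near-duplicates are kept once but repeated in the training mix, treating wide duplication as a weak usefulness signal. The weighting function, cap, and function name are made up for illustration and are not the FineWeb2 paper's formula.

```python
# Hypothetical rehydration sketch: repeat each kept document according to how many
# duplicates of it were removed, with a log-shaped weight and a hard cap.
import math

def rehydrate(clusters, max_repeats: int = 8):
    """clusters: list of (kept_document, num_duplicates_removed) pairs."""
    upsampled = []
    for doc, n_dups in clusters:
        repeats = min(max_repeats, 1 + int(math.log2(1 + n_dups)))
        upsampled.extend([doc] * repeats)
    return upsampled
```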
MiniMax is the James Bond of AI agents. It uses the world's first open-weight hybrid-attention reasoning model (MiniMax-M1), and it squeezes every bit of power from it. The agent takes a prompt and does more than any other agent in the market right now: 1. It can do Deep Research 2. It can write code 3.
10
55
390
Day 1/5 of #MiniMaxWeek: We’re open-sourcing MiniMax-M1, our latest LLM — setting new standards in long-context reasoning. - World’s longest context window: 1M-token input, 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency:
85
305
1K
Interesting
Flash Linear Attention (https://t.co/k2UDXQAaKo) will no longer maintain support for the RWKV series (existing code will remain available). Here's why:
0
0
2
Wow! They make training billion-parameter models on low-end GPUs over 80 Mbps internet connections possible by compressing the activations, achieving convergence similar to that of centralized datacenter systems running model parallelism over 100 Gbps links
We've reached a major milestone in fully decentralized training: for the first time, we've demonstrated that a large language model can be split and trained across consumer devices connected over the internet - with no loss in speed or performance.
0
1
18
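The announcement doesn't say how the activations are compressed, so this is only a generic sketch of the simplest version of the idea, not the authors' method: quantize hidden states to int8 before sending them across the slow link between devices, and dequantize on the receiving side. Function names and the quantization scheme are assumptions.

```python
# Per-tensor symmetric int8 quantization of activations at a model-parallel boundary:
# roughly 4x fewer bytes on the wire than fp32.
import torch

def compress(activations: torch.Tensor):
    scale = activations.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((activations / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# sender:   q, s = compress(hidden); transmit q (int8) and s (one float) over the network
# receiver: hidden = decompress(q, s); continue the forward pass on the next device
```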
I just read this WSJ article on why Europe's tech scene is so much smaller than the US's and China's. I'm afraid that, like most articles on this topic, it largely misses the mark. Which in itself illustrates a key reason why Europe is lagging behind: when you fail to
692
1K
7K
[Past weekend's side project] A quick introduction to contextual speech models. I made some interactive visualisations to explain a bit better how recent contextual speech models and audio tokenization work (e.g. Sesame's demo). Added a small dataset and code repo as well though
10
34
151
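The visualisations themselves aren't reproduced here, but since the post mentions audio tokenization, here is a hedged sketch of residual vector quantization (RVQ), the mechanism behind many neural audio tokenizers: each stage quantizes whatever the previous stages failed to capture, so an audio frame becomes a short stack of discrete codes. This is not the linked project's code; the sizes and function name are illustrative.

```python
# Minimal RVQ encoder sketch: frames -> stacked discrete codes.
import torch

def rvq_encode(frames: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """frames: (n, d) continuous audio embeddings; codebooks: list of (k, d) tables.
    Returns (n, num_stages) integer codes."""
    residual, codes = frames.clone(), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per frame
        residual = residual - cb[idx]                    # quantize only what is left over
        codes.append(idx)
    return torch.stack(codes, dim=-1)

codebooks = [torch.randn(1024, 64) for _ in range(4)]   # 4 stages, 1024 entries each
codes = rvq_encode(torch.randn(100, 64), codebooks)     # -> (100, 4) discrete tokens
```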