Haojun Zhao Profile
Haojun Zhao

@Haojun_Zhao14

Followers: 922
Following: 870
Media: 27
Statuses: 178

Research engineer @huggingface, prev. @Polytechnique

Joined November 2021
@Haojun_Zhao14
Haojun Zhao
1 year
Introducing Picotron: a minimal educational repository for 4D parallelism, inspired by @karpathy's NanoGPT. @FerdinandMom and I started this project to rebuild 4D parallelism with super easy-to-understand code for everyone learning about LLMs.
8
36
188
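For readers new to 4D parallelism, here is a minimal sketch of how a flat process rank might decompose into data/pipeline/tensor/context coordinates. This is not Picotron's actual code; the (dp, pp, tp, cp) ordering and the helper name are illustrative assumptions.

```python
# Minimal sketch (not Picotron's actual code): decompose a global rank into
# 4D parallelism coordinates, assuming a row-major (dp, pp, tp, cp) layout.
# The dimension order and names here are illustrative assumptions.

def rank_to_4d_coords(global_rank: int, dp: int, pp: int, tp: int, cp: int):
    """Map a flat process rank to (data, pipeline, tensor, context) coordinates."""
    assert 0 <= global_rank < dp * pp * tp * cp
    cp_rank = global_rank % cp
    tp_rank = (global_rank // cp) % tp
    pp_rank = (global_rank // (cp * tp)) % pp
    dp_rank = global_rank // (cp * tp * pp)
    return dp_rank, pp_rank, tp_rank, cp_rank

# Example: 16 GPUs split as dp=2, pp=2, tp=2, cp=2.
for r in range(16):
    print(r, rank_to_4d_coords(r, dp=2, pp=2, tp=2, cp=2))
```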
@cHHillee
Horace He
5 months
@eliebakouch Ours was forked from torchtitan to start but has since been heavily modified (although some of the core remains intact!) For example, every group of researchers has their own opinion on how configs should be done.
6
2
90
@Alibaba_Qwen
Qwen
5 months
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves
356
2K
9K
@eliebakouch
elie
5 months
Kimi K2 tech report is full of gems as always. Here are my notes on it: > MuonClip: Pretty crazy how after 70k the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip which is not trivial at all (at small scale but with
7
51
337
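The QK-clip trick mentioned above caps attention logits by rescaling the query/key projections when their maximum exceeds a threshold. The sketch below is a rough paraphrase of that description, not Moonshot's released code; the threshold value, the square-root split between W_q and W_k, and the per-head granularity are assumptions.

```python
import torch

# Hedged sketch of a QK-clip-style safeguard (not Moonshot's code): after an
# optimizer step, if the largest pre-softmax attention logit observed for a
# head exceeds a threshold tau, scale that head's query/key projection weights
# down so future logits are pulled back under tau. The sqrt split between
# W_q and W_k and tau=100.0 are illustrative assumptions.

@torch.no_grad()
def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float, tau: float = 100.0):
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5  # split the correction across q and k
        w_q.mul_(scale)
        w_k.mul_(scale)

# Usage: track the max attention logit per head during the forward pass, then
# call qk_clip_ on that head's projection weights after optimizer.step().
```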
@LoubnaBenAllal1
Loubna Ben Allal
5 months
500k samples of multilingual post-training data in 5 languages: French, Spanish, Italian, German and Portuguese. To address the lack of multilingual post-training datasets, we created these samples and found they improve performance on benchmarks like Global MMLU, Belebele, and
3
30
170
@Kimi_Moonshot
Kimi.ai
5 months
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence
279
1K
7K
@AndrewYNg
Andrew Ng
5 months
New Course: Post-training of LLMs Learn to post-train and customize an LLM in this short course, taught by @BanghuaZ, Assistant Professor at the University of Washington @UW, and co-founder of @NexusflowX. Training an LLM to follow instructions or answer questions has two key
39
300
1K
@Haojun_Zhao14
Haojun Zhao
5 months
A new iPhone in the robot world?
@Thom_Wolf
Thomas Wolf
5 months
Thrilled to finally share what we've been working on for months at @huggingface 🤝@pollenrobotics Our first robot: Reachy Mini A dream come true: cute and low priced, hackable yet easy to use, powered by open-source and the infinite community. Tiny price, small size, huge
0
0
4
@Haojun_Zhao14
Haojun Zhao
5 months
Super nice plot! Love it
@eliebakouch
elie
5 months
Super excited to share SmolLM3, a new strong 3B model. SmolLM3 is fully open; we share the recipe, the dataset, the training codebase and much more! > Trained on 11T tokens on 384 H100s for 220k GPU hours > Supports long context up to 128k thanks to NoPE and intra-document masking >
0
0
3
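Intra-document masking, mentioned in the post above, restricts attention so that tokens from different documents packed into one sequence never attend to each other. A minimal sketch of building such a mask follows; it is not the SmolLM3 training code (which more likely feeds sequence lengths to a varlen attention kernel instead of materializing a dense mask), and the function name is hypothetical.

```python
import torch

# Hedged sketch of intra-document masking for packed sequences (not the
# SmolLM3 training code): a token may only attend to earlier tokens from the
# *same* document, so packed documents don't attend across boundaries.

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) document id per token. Returns a (seq_len, seq_len) bool mask."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc  # True where attention is allowed

# Example: two documents of lengths 3 and 2 packed into one 5-token sequence.
mask = intra_document_causal_mask(torch.tensor([0, 0, 0, 1, 1]))
print(mask.int())
```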
@Haojun_Zhao14
Haojun Zhao
5 months
Congrats! This is impressive!
@LoubnaBenAllal1
Loubna Ben Allal
5 months
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) https://t.co/duszyObJsG
0
0
2
@Haojun_Zhao14
Haojun Zhao
5 months
This blew my mind. One line of code: • Qwen2.5-0.5B training speed: 5% → 40% (MFU) • Qwen3-8B training speed: 34% → 54% (MFU) The culprit? A careless tensor transpose in the cross-entropy loss. Big thanks to @xingkaiyu for spotting it.
10
46
706
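The exact one-line fix referenced above isn't shown in the post, so the snippet below is a hypothetical illustration of the failure mode: with a large vocabulary the logits tensor is enormous, and a stray transpose followed by .contiguous() before the cross-entropy loss materializes a full extra copy of it on every step.

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration (not the actual fix referenced above): for large
# vocabularies the (batch, seq, vocab) logits tensor is huge, and a careless
# transpose + .contiguous() forces a full extra copy of it before the loss,
# burning memory bandwidth every training step.

def slow_loss(logits, labels):
    # Careless layout shuffle: transpose makes the tensor non-contiguous, and
    # .contiguous() then copies the whole logits tensor just to flatten it.
    logits_2d = logits.transpose(0, 1).contiguous().view(-1, logits.size(-1))
    labels_1d = labels.transpose(0, 1).contiguous().view(-1)
    return F.cross_entropy(logits_2d, labels_1d)

def fast_loss(logits, labels):
    # Same loss value, no extra copy: flatten the already-contiguous layout.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
assert torch.allclose(slow_loss(logits, labels), fast_loss(logits, labels))
```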
@lvwerra
Leandro von Werra
5 months
Remarkable progress of the Hugging Face science team in 2025: Open-R1, smolagents, SmolVLM2, Ultra-Scale Playbook, OlympicCoder, Open Computer Agent, Reachy Mini, SmolVLA, LeRobot Hackathon and many more... A summary of the projects we released so far this year🧶
10
42
161
@Thom_Wolf
Thomas Wolf
5 months
We are so excited to announce a new open-source challenge in collaboration with @proximafusion: unlocking fusion with AI. If you haven't followed, fusion is how the sun makes energy and is, in the long term, our best bet on clean, safe, and virtually limitless energy. In the
10
25
112
@gui_penedo
Guilherme Penedo
6 months
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
8
102
427
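The post above only names the idea, so the sketch below shows a generic form of deduplication-based upsampling: keep one representative per duplicate cluster but repeat it in proportion to a capped duplicate count. The cap and the weighting are assumptions, not the FineWeb2 paper's exact recipe.

```python
# Hedged sketch of deduplication-based upsampling ("rehydration" in the
# FineWeb2 post above) -- a generic illustration, not the paper's recipe.
# Idea: after dedup, keep a single copy per duplicate cluster, but repeat it
# according to how many duplicates it originally had (capped), so widely
# replicated documents keep extra weight in the training mix.

def rehydrate(clusters: list[tuple[str, int]], cap: int = 4) -> list[str]:
    """clusters: list of (representative_doc, duplicate_count). Returns upsampled docs."""
    upsampled = []
    for doc, dup_count in clusters:
        repeats = min(dup_count, cap)  # cap is an assumed hyperparameter
        upsampled.extend([doc] * repeats)
    return upsampled

# Example: a doc seen 10 times is kept 4 times; a unique doc is kept once.
print(len(rehydrate([("popular page", 10), ("unique page", 1)])))  # -> 5
```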
@svpino
Santiago
6 months
MiniMax is the James Bond of AI agents. It uses the world's first open-weight model (MiniMax-M1), and it squeezes every bit of power from it. The agent takes a prompt and does more than any other agent in the market right now: 1. It can do Deep Research 2. It can write code 3.
10
55
390
@MiniMax__AI
MiniMax (official)
6 months
Day 1/5 of #MiniMaxWeek: We’re open-sourcing MiniMax-M1, our latest LLM — setting new standards in long-context reasoning. - World’s longest context window: 1M-token input, 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency:
85
305
1K
@Haojun_Zhao14
Haojun Zhao
6 months
Interesting
@SonglinYang4
Songlin Yang
6 months
Flash Linear Attention (https://t.co/k2UDXQAaKo) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
0
0
2
@Haojun_Zhao14
Haojun Zhao
6 months
Wow! They make it possible to train billion-parameter models on low-end GPUs over 80 Mbps internet links by compressing the activations, reaching convergence comparable to centralized datacenter setups that use model parallelism over 100 Gbps connections.
@Pluralis
Pluralis Research
6 months
We've reached a major milestone in fully decentralized training: for the first time, we've demonstrated that a large language model can be split and trained across consumer devices connected over the internet - with no loss in speed or performance.
0
1
18
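The result quoted above hinges on shrinking the activations exchanged between pipeline stages so they fit through consumer internet links. Pluralis' actual compression scheme isn't described here, so the sketch below uses a generic per-tensor int8 quantize/dequantize round trip just to make the bandwidth saving concrete.

```python
import torch

# Generic sketch of activation compression between pipeline stages (not
# Pluralis' actual method): quantize fp32/fp16 activations to int8 with a
# per-tensor scale before sending over a slow link, then dequantize on the
# receiving stage. This symmetric per-tensor scheme is an assumption; it gives
# a 4x size reduction vs fp32 (2x vs fp16).

def compress(act: torch.Tensor):
    scale = act.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((act / scale).round(), -127, 127).to(torch.int8)
    return q, scale  # ship q (1 byte/element) plus one scalar over the network

def decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

x = torch.randn(4, 1024)
q, s = compress(x)
print((decompress(q, s) - x).abs().max())  # small quantization error
```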
@RnaudBertrand
Arnaud Bertrand
7 months
I just read this WSJ article on why Europe's tech scene is so much smaller than the US's and China's. I'm afraid that, like most articles on this topic, it largely misses the mark. Which in itself illustrates a key reason why Europe is lagging behind: when you fail to
692
1K
7K
@ClementDelangue
clem 🤗
7 months
Excited to expand our collaboration with @Azure, thanks for the shoutout @satyanadella!
11
28
411
@Thom_Wolf
Thomas Wolf
8 months
[Past weekend's side project] A quick introduction to contextual speech models. I made some interactive visualisations to explain a bit better how recent contextual speech models and audio tokenization work (e.g. Sesame's demo). Added a small dataset and code repo as well though
10
34
151