Haojun Zhao Profile
Haojun Zhao

@Haojun_Zhao14

Followers: 922
Following: 870
Media: 27
Statuses: 178

Research engineer @huggingface, prev. @Polytechnique

Joined November 2021
@Haojun_Zhao14
Haojun Zhao
1 year
Introducing Picotron: a minimal educational repository for 4D parallelism, inspired by @karpathy's NanoGPT. @FerdinandMom and I started this project to rebuild 4D parallelism with super easy-to-understand code for everyone learning about LLMs.
8
36
188
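For readers new to 4D parallelism, here is a minimal sketch of how a flat process rank might decompose into data/pipeline/tensor/context coordinates. This is not Picotron's actual code; the (dp, pp, tp, cp) ordering and the helper name are illustrative assumptions.

```python
# Minimal sketch (not Picotron's actual code): decompose a global rank into
# 4D parallelism coordinates, assuming a row-major (dp, pp, tp, cp) layout.
# The dimension order and names here are illustrative assumptions.

def rank_to_4d_coords(global_rank: int, dp: int, pp: int, tp: int, cp: int):
    """Map a flat process rank to (data, pipeline, tensor, context) coordinates."""
    assert 0 <= global_rank < dp * pp * tp * cp
    cp_rank = global_rank % cp
    tp_rank = (global_rank // cp) % tp
    pp_rank = (global_rank // (cp * tp)) % pp
    dp_rank = global_rank // (cp * tp * pp)
    return dp_rank, pp_rank, tp_rank, cp_rank

# Example: 16 GPUs split as dp=2, pp=2, tp=2, cp=2.
for r in range(16):
    print(r, rank_to_4d_coords(r, dp=2, pp=2, tp=2, cp=2))
```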
@cHHillee
Horace He
5 months
@eliebakouch Ours was forked from torchtitan to start but has since been heavily modified (although some of the core remains intact!) For example, every group of researchers has their own opinion on how configs should be done.
6
2
90
@Alibaba_Qwen
Qwen
5 months
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves
356
2K
9K
@eliebakouch
elie
5 months
Kimi K2 tech report is full of gems as always. Here are my notes on it: > MuonClip: Pretty crazy how after 70k the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip which is not trivial at all (at small scale but with
7
51
337
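The QK-clip trick mentioned above caps attention logits by rescaling the query/key projections when their maximum exceeds a threshold. The sketch below is a rough paraphrase of that description, not Moonshot's released code; the threshold value, the square-root split between W_q and W_k, and the per-head granularity are assumptions.

```python
import torch

# Hedged sketch of a QK-clip-style safeguard (not Moonshot's code): after an
# optimizer step, if the largest pre-softmax attention logit observed for a
# head exceeds a threshold tau, scale that head's query/key projection weights
# down so future logits are pulled back under tau. The sqrt split between
# W_q and W_k and tau=100.0 are illustrative assumptions.

@torch.no_grad()
def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float, tau: float = 100.0):
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5  # split the correction across q and k
        w_q.mul_(scale)
        w_k.mul_(scale)

# Usage: track the max attention logit per head during the forward pass, then
# call qk_clip_ on that head's projection weights after optimizer.step().
```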
@LoubnaBenAllal1
Loubna Ben Allal
5 months
500k samples of multilingual post-training data in 5 languages: French, Spanish, Italian, German and Portuguese. To address the lack of multilingual post-training datasets, we created these samples and found they improve performance on benchmarks like Global MMLU, Belebele, and
3
30
170
@Kimi_Moonshot
Kimi.ai
5 months
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models 🔹Strong in coding and agentic tasks 🐤 Multimodal & thought-mode not supported for now With Kimi K2, advanced agentic intelligence
279
1K
7K
@AndrewYNg
Andrew Ng
5 months
New Course: Post-training of LLMs Learn to post-train and customize an LLM in this short course, taught by @BanghuaZ, Assistant Professor at the University of Washington @UW, and co-founder of @NexusflowX. Training an LLM to follow instructions or answer questions has two key
39
300
1K
@Haojun_Zhao14
Haojun Zhao
5 months
A new iPhone in the robot world?
@Thom_Wolf
Thomas Wolf
5 months
Thrilled to finally share what we've been working on for months at @huggingface 🤝@pollenrobotics Our first robot: Reachy Mini A dream come true: cute and low priced, hackable yet easy to use, powered by open-source and the infinite community. Tiny price, small size, huge
0
0
4
@Haojun_Zhao14
Haojun Zhao
5 months
Super nice plot! Love it
@eliebakouch
elie
5 months
Super excited to share SmolLM3, a new strong 3B model. SmolLM3 is fully open; we share the recipe, the dataset, the training codebase and much more! > Trained on 11T tokens on 384 H100s for 220k GPU hours > Supports long context up to 128k thanks to NoPE and intra-document masking >
0
0
3
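Intra-document masking, mentioned in the post above, restricts attention so that tokens from different documents packed into one sequence never attend to each other. A minimal sketch of building such a mask follows; it is not the SmolLM3 training code (which more likely feeds sequence lengths to a varlen attention kernel instead of materializing a dense mask), and the function name is hypothetical.

```python
import torch

# Hedged sketch of intra-document masking for packed sequences (not the
# SmolLM3 training code): a token may only attend to earlier tokens from the
# *same* document, so packed documents don't attend across boundaries.

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) document id per token. Returns a (seq_len, seq_len) bool mask."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc  # True where attention is allowed

# Example: two documents of lengths 3 and 2 packed into one 5-token sequence.
mask = intra_document_causal_mask(torch.tensor([0, 0, 0, 1, 1]))
print(mask.int())
```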
@Haojun_Zhao14
Haojun Zhao
5 months
Congrats! This is impressive!
@LoubnaBenAllal1
Loubna Ben Allal
5 months
Introducing SmolLM3: a strong, smol reasoner! > SoTA 3B model > dual mode reasoning (think/no_think) > long context, up to 128k > multilingual: en, fr, es, de, it, pt > fully open source (data, code, recipes) https://t.co/duszyObJsG
0
0
2
@Haojun_Zhao14
Haojun Zhao
5 months
This blew my mind. One line of code: • Qwen2.5-0.5B training speed: 5% → 40% (MFU) • Qwen3-8B training speed: 34% → 54% (MFU) The culprit? A careless tensor transpose in the cross-entropy loss. Big thanks to @xingkaiyu for spotting it.
10
46
706
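The exact one-line fix referenced above isn't shown in the post, so the snippet below is a hypothetical illustration of the failure mode: with a large vocabulary the logits tensor is enormous, and a stray transpose followed by .contiguous() before the cross-entropy loss materializes a full extra copy of it on every step.

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration (not the actual fix referenced above): for large
# vocabularies the (batch, seq, vocab) logits tensor is huge, and a careless
# transpose + .contiguous() forces a full extra copy of it before the loss,
# burning memory bandwidth every training step.

def slow_loss(logits, labels):
    # Careless layout shuffle: transpose makes the tensor non-contiguous, and
    # .contiguous() then copies the whole logits tensor just to flatten it.
    logits_2d = logits.transpose(0, 1).contiguous().view(-1, logits.size(-1))
    labels_1d = labels.transpose(0, 1).contiguous().view(-1)
    return F.cross_entropy(logits_2d, labels_1d)

def fast_loss(logits, labels):
    # Same loss value, no extra copy: flatten the already-contiguous layout.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
assert torch.allclose(slow_loss(logits, labels), fast_loss(logits, labels))
```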
@lvwerra
Leandro von Werra
5 months
Remarkable progress of the Hugging Face science team in 2025: Open-R1, smolagents, SmolVLM2, Ultra-Scale Playbook, OlympicCoder, Open Computer Agent, Reachy Mini, SmolVLA, LeRobot Hackathon and many more... A summary of the projects we released so far this year🧶
10
42
161
@Thom_Wolf
Thomas Wolf
5 months
We are so excited to announce a new open-source challenge in collaboration with @proximafusion: unlocking fusion with AI. If you haven't followed, fusion is how the sun makes energy and is, in the long term, our best bet on clean, safe, and virtually limitless energy. In the
10
25
112
@gui_penedo
Guilherme Penedo
6 months
We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and exhaustive) multilingual work, we introduce a concept that can also improve English performance: deduplication-based upsampling, which we call rehydration.
8
102
427
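The post above only names the idea, so the sketch below shows a generic form of deduplication-based upsampling: keep one representative per duplicate cluster but repeat it in proportion to a capped duplicate count. The cap and the weighting are assumptions, not the FineWeb2 paper's exact recipe.

```python
# Hedged sketch of deduplication-based upsampling ("rehydration" in the
# FineWeb2 post above) -- a generic illustration, not the paper's recipe.
# Idea: after dedup, keep a single copy per duplicate cluster, but repeat it
# according to how many duplicates it originally had (capped), so widely
# replicated documents keep extra weight in the training mix.

def rehydrate(clusters: list[tuple[str, int]], cap: int = 4) -> list[str]:
    """clusters: list of (representative_doc, duplicate_count). Returns upsampled docs."""
    upsampled = []
    for doc, dup_count in clusters:
        repeats = min(dup_count, cap)  # cap is an assumed hyperparameter
        upsampled.extend([doc] * repeats)
    return upsampled

# Example: a doc seen 10 times is kept 4 times; a unique doc is kept once.
print(len(rehydrate([("popular page", 10), ("unique page", 1)])))  # -> 5
```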
@svpino
Santiago
6 months
MiniMax is the James Bond of AI agents. It uses the world's first open-weight model (MiniMax-M1), and it squeezes every bit of power from it. The agent takes a prompt and does more than any other agent in the market right now: 1. It can do Deep Research 2. It can write code 3.
10
55
390
@MiniMax__AI
MiniMax (official)
6 months
Day 1/5 of #MiniMaxWeek: We’re open-sourcing MiniMax-M1, our latest LLM — setting new standards in long-context reasoning. - World’s longest context window: 1M-token input, 80k-token output - State-of-the-art agentic use among open-source models - RL at unmatched efficiency:
85
305
1K
@Haojun_Zhao14
Haojun Zhao
6 months
Interesting
@SonglinYang4
Songlin Yang
6 months
Flash Linear Attention (https://t.co/k2UDXQAaKo) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:
0
0
2
@Haojun_Zhao14
Haojun Zhao
6 months
Wow! They make it possible to train billion-parameter models on low-end GPUs over 80 Mbps internet links by compressing the activations, reaching convergence comparable to centralized datacenter setups that use model parallelism over 100 Gbps connections.
@Pluralis
Pluralis Research
6 months
We've reached a major milestone in fully decentralized training: for the first time, we've demonstrated that a large language model can be split and trained across consumer devices connected over the internet - with no loss in speed or performance.
0
1
18
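The result quoted above hinges on shrinking the activations exchanged between pipeline stages so they fit through consumer internet links. Pluralis' actual compression scheme isn't described here, so the sketch below uses a generic per-tensor int8 quantize/dequantize round trip just to make the bandwidth saving concrete.

```python
import torch

# Generic sketch of activation compression between pipeline stages (not
# Pluralis' actual method): quantize fp32/fp16 activations to int8 with a
# per-tensor scale before sending over a slow link, then dequantize on the
# receiving stage. This symmetric per-tensor scheme is an assumption; it gives
# a 4x size reduction vs fp32 (2x vs fp16).

def compress(act: torch.Tensor):
    scale = act.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((act / scale).round(), -127, 127).to(torch.int8)
    return q, scale  # ship q (1 byte/element) plus one scalar over the network

def decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

x = torch.randn(4, 1024)
q, s = compress(x)
print((decompress(q, s) - x).abs().max())  # small quantization error
```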
@RnaudBertrand
Arnaud Bertrand
7 months
I just read this WSJ article on why Europe's tech scene is so much smaller than the US's and China's. I'm afraid that, like most articles on this topic, it largely misses the mark. Which in itself illustrates a key reason why Europe is lagging behind: when you fail to
692
1K
7K
@ClementDelangue
clem 🤗
7 months
Excited to expand our collaboration with @Azure, thanks for the shoutout @satyanadella!
11
28
411
@Thom_Wolf
Thomas Wolf
8 months
[Past weekend's side project] A quick introduction to contextual speech models. I made some interactive visualisations to explain a bit better how recent contextual speech models and audio tokenization work (e.g. Sesame's demo). Added a small dataset and code repo as well though
10
34
151