X. Dong Profile
X. Dong

@SimonXinDong

Followers 1K · Following 71 · Media 48 · Statuses 248

Research Scientist @NVIDIA. Making LLMs, e.g., Hymba and the Nemotron series. Ex @Harvard @Meta @Tencent | Views and opinions are my own

Joined April 2017
@SimonXinDong
X. Dong
7 days
We, at NVIDIA, present Length Penalty Done Right:
- Cut CoT length by 3/4 without sacrificing accuracy, using only RL.
- This makes DeepSeek-R1-7B run ~8 times faster on AIME-24 while maintaining the same accuracy.
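A minimal sketch of the truncation-style length penalty described above, assuming a reward of the form "correct and within budget"; the function names and exact rule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a truncation-style length penalty for RL post-training.
# Hypothetical names; the paper's exact reward rule may differ.

def truncated_reward(response_tokens: list[int],
                     is_correct: bool,
                     budget: int) -> float:
    """Rollouts longer than `budget` are cut off and scored as-is.
    A truncated chain of thought rarely reaches a verified answer,
    so RL alone pushes the policy toward shorter, complete reasoning."""
    if len(response_tokens) > budget:
        return 0.0          # truncated: no verified answer, no reward
    return 1.0 if is_correct else 0.0

# A correct 900-token rollout earns full reward under a 4096-token budget,
# while a 5000-token rollout earns nothing regardless of correctness.
print(truncated_reward(list(range(900)), True, 4096))   # 1.0
print(truncated_reward(list(range(5000)), True, 4096))  # 0.0
```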
@SimonXinDong
X. Dong
4 days
This is both highly useful and beautifully crafted.
@DAlistarh
Dan Alistarh
4 days
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy's nanochat
2. General QAT recipe with MXFP4 forward / MXFP8 backward GEMMs
3. Transformers/vLLM integrations
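For readers unfamiliar with QAT, a minimal PyTorch sketch of the general idea behind such recipes: quantize weights in the forward pass while gradients bypass the rounding via a straight-through estimator. This is not the QuTLASS API; a plain symmetric 4-bit integer grid stands in for MXFP4, and the unquantized backward stands in for the higher-precision MXFP8 GEMMs.

```python
# Generic QAT sketch with a straight-through estimator (STE).
import torch

class FakeQuant4Bit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        # Symmetric 4-bit integer grid as a simple stand-in for MXFP4.
        scale = w.abs().max().clamp(min=1e-8) / 7.0
        return torch.clamp(torch.round(w / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # STE: gradients pass straight through the rounding op.
        return grad_out

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        return torch.nn.functional.linear(
            x, FakeQuant4Bit.apply(self.weight), self.bias)

layer = QATLinear(16, 16)
layer(torch.randn(2, 16)).sum().backward()  # grads reach layer.weight via STE
```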
@SimonXinDong
X. Dong
7 days
Thanks to the team. @nbasyl_tw @GXiming @shizhediao Mingjie Liu @CMHungSteven @yin_hongxu Yu-Chiang Frank Wang Kwang-Ting (Tim) Cheng @YejinChoinka @jankautz @PavloMolchanov @NVIDIAAI @nvidianewsroom
@SimonXinDong
X. Dong
7 days
- “Don’t teach–Incentivize.” We show that concise reasoning can be encouraged through the simplest length-penalty design, truncation, with only RL: no handcrafted priors or specially annotated data.
- “Simplicity is great but not easy to get.” We analyze the emerging optimization …
@SimonXinDong
X. Dong
7 days
- A new test-time scaling paradigm: in an iso-thinking-time setup for a question, we can generate more, shorter answers. This leads to a 28% improvement in accuracy over a single long answer.
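A sketch of the iso-thinking-time comparison: spend one fixed token budget either on a single long answer or on k short answers combined by majority vote. `generate` is a hypothetical sampling function, not an API from the paper.

```python
from collections import Counter

def vote_short_answers(generate, question: str, budget: int, k: int) -> str:
    """Same total thinking time: k samples, each with budget/k tokens."""
    answers = [generate(question, max_tokens=budget // k) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Baseline, one long answer: generate(question, max_tokens=budget)
# Iso-time alternative:      vote_short_answers(generate, question, budget, k=8)
```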
@SimonXinDong
X. Dong
10 days
Finally, someone shared something different. DeepSeek chose text-to-image NOT because of the image itself; it was more of a coincidence. There are many ways to compress thousands of tokens into a few embeddings. Compressing tokens into a few OCR image embeddings has an …
@dileeplearning
Dileep George
12 days
I love Andrej...but this makes no sense to me. I don't see how converting text to image ('pixels') makes it any better for language modeling. What am I missing?
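To make the "many ways" point concrete, one trivial text-only example is chunked mean pooling of token embeddings into a handful of vectors; a toy illustration, not DeepSeek's method.

```python
import torch

def compress_tokens(token_embeddings: torch.Tensor, n_out: int) -> torch.Tensor:
    """(seq_len, dim) -> (n_out, dim) by mean-pooling contiguous chunks."""
    chunks = token_embeddings.chunk(n_out, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])

emb = torch.randn(4096, 1024)          # thousands of token embeddings
print(compress_tokens(emb, 16).shape)  # torch.Size([16, 1024]): a few embeddings
```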
@SimonXinDong
X. Dong
14 days
He drank bacteria to prove a theory. He got ulcers. He won the Nobel Prize. Barry Marshall—science’s bravest stomach. 🧪
@Dr_Singularity
Dr Singularity
15 days
Craziest robot video of the week https://t.co/K3q6ZFVPxQ
@SimonXinDong
X. Dong
15 days
@a1zhang
Alex L Zhang
17 days
What if scaling the context windows of frontier LLMs is much easier than it sounds? We’re excited to share our work on Recursive Language Models (RLMs). A new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length, …
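A heavily simplified sketch of the recursive idea: split an oversized prompt, handle the pieces, and recurse on the merged partial results. `llm` is a hypothetical single-call function, and the actual RLM inference strategy is more involved than this.

```python
def recursive_answer(llm, prompt: str, max_chars: int = 8000) -> str:
    """Assumes llm's outputs are much shorter than its inputs,
    so the recursion terminates."""
    if len(prompt) <= max_chars:
        return llm(prompt)                      # fits in one call
    mid = len(prompt) // 2
    left = recursive_answer(llm, prompt[:mid], max_chars)
    right = recursive_answer(llm, prompt[mid:], max_chars)
    # Recurse on the combined intermediate results until they fit one call.
    return recursive_answer(llm, left + "\n" + right, max_chars)
```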
@SimonXinDong
X. Dong
15 days
We should scale up context hard but scale up context length softly
@SimonXinDong
X. Dong
16 days
This limitation is real, but we’ve found solutions. Stay tuned.
@Kangwook_Lee
Kangwook Lee
16 days
DLLMs seem promising... but parallel generation is not always possible. Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one. This makes diffusion-based LLMs highly attractive when we need fast …
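A toy contrast between the two regimes: an autoregressive loop makes one model call per new token, while a diffusion-style decoder fills many masked positions per call over a few refinement rounds. `model` is a hypothetical predictor (with different contracts in the two functions), not any real DLLM's API.

```python
def autoregressive_decode(model, prompt: list, n: int) -> list:
    tokens = list(prompt)
    for _ in range(n):                   # n sequential calls, one per token
        tokens.append(model(tokens))     # model returns the next token
    return tokens

def parallel_decode(model, prompt: list, n: int, rounds: int = 4) -> list:
    tokens = list(prompt) + [None] * n   # None marks a masked position
    for _ in range(rounds):              # a few parallel refinement passes
        predictions = model(tokens)      # dict: position -> predicted token
        for pos, tok in predictions.items():
            if tokens[pos] is None:
                tokens[pos] = tok
    return tokens
```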
@SimonXinDong
X. Dong
17 days
Simplicity is the hardest, without a doubt.
@xwang_lk
Xin Eric Wang
19 days
Unpopular opinion: Finding a simple idea that actually works is way harder than publishing a fancy one that kinda works. You have to fight the urge to overcomplicate, give up many fancier ideas, fail and pivot again and again until you hit the first principle that truly holds.
@SimonXinDong
X. Dong
17 days
6️⃣Intra-layer hybridization works seamlessly with various model parallelism techniques in both training and inference, opening new opportunities for hardware and model architecture co-design.
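A minimal sketch of what intra-layer (head-wise) hybridization can look like: a softmax-attention branch and a linear-attention branch run in parallel inside one block and are fused, so each branch can be sharded independently. Illustrative only; Hymba/Nemotron-style models use more elaborate designs (e.g., Mamba2 branches), and the linear branch here is non-causal for brevity.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One layer with both a quadratic softmax-attention branch and an
    O(n) kernelized linear-attention branch, fused at the output."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.linear_qkv = nn.Linear(dim, 3 * dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def linear_attention(self, x):
        q, k, v = self.linear_qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        return q @ (k.transpose(1, 2) @ v)   # linear in sequence length

    def forward(self, x):
        a, _ = self.attn(x, x, x)            # softmax branch
        b = self.linear_attention(x)         # linear branch
        return x + self.fuse(torch.cat([a, b], dim=-1))

x = torch.randn(2, 128, 64)
print(HybridBlock(64, 4)(x).shape)           # torch.Size([2, 128, 64])
```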
@SimonXinDong
X. Dong
17 days
5️⃣Head-wise hybridization of positional embeddings (NaPE, from @NVIDIAAI @simeng_ssun) demonstrates the best long-context tracing and length generalization among 10 common PE variants.
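A generic illustration of head-wise positional-embedding hybridization: RoPE on half of the heads, no positional embedding (NoPE) on the rest. The tweet does not detail NaPE's recipe, so this is an assumed stand-in, not NaPE itself.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to a (seq, head_dim) tensor."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half) / half)           # (half,)
    angles = torch.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def hybrid_pe_heads(q: torch.Tensor) -> torch.Tensor:
    """(n_heads, seq, head_dim): RoPE on the first half of heads, NoPE on the rest."""
    n = q.shape[0] // 2
    roped = torch.stack([rope(h) for h in q[:n]])
    return torch.cat([roped, q[n:]], dim=0)

print(hybrid_pe_heads(torch.randn(8, 32, 16)).shape)  # torch.Size([8, 32, 16])
```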
@SimonXinDong
X. Dong
17 days
4️⃣In addition, they show that intra-layer hybridization is compatible with new, advanced linear attentions (Mamba2 @tri_dao @_albertgu, Gated DeltaNet @SonglinYang4 @ahatamiz1) and softmax attentions (Diff Attn @ytz2024 @donglixp @MSFTResearch).
@SimonXinDong
X. Dong
17 days
3️⃣Falcon-H1 (up to 34B params on 18T tokens, from @TIIuae @JingweiZuo) and Dragon (3B params on 3.5T tokens, with bigger models coming, from @DragonLLM @EuroHPC_JU @JG_Barthelemy @Dorialexander) further validate the scalability and long-context potential (long documents, files, …
@SimonXinDong
X. Dong
17 days
2️⃣A recent FAIR Meta study (@sangminbae @CarolejeanWu) also shows that "intra-layer hybridization shows the best Pareto frontier of model quality and efficiency" https://t.co/aM3UHwzEm7