X. Dong
@SimonXinDong
1K Followers · 71 Following · 48 Media · 248 Statuses
Research Scientist @NVIDIA. Making LLMs, e.g., Hymba and the Nemotron series. Ex @Harvard @Meta @Tencent | Views and opinions are my own
Joined April 2017
We, at NVIDIA, present Length Penalty Done Right:
- Cut CoT length by 3/4 without sacrificing accuracy, using only RL.
- This makes DeepSeek-R1-7B run ~8 times faster on AIME-24 while maintaining the same accuracy.
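To make the recipe concrete, here is a minimal sketch of what a truncation-style length penalty can look like inside an RL reward; the names (truncation_reward, is_correct, token_budget) are illustrative placeholders, not the exact formulation from our tech report.

```python
# Illustrative sketch only: a truncation-style length penalty for RL fine-tuning.
# `is_correct` and `token_budget` are hypothetical names, not the paper's API.

def truncation_reward(response_tokens, reference_answer, is_correct, token_budget=2048):
    """Reward a sampled CoT only if it reaches a correct answer within the budget.

    Responses are hard-truncated at `token_budget` tokens before being checked,
    so over-long reasoning that only reaches the answer past the cutoff earns
    nothing. Correctness stays the only signal; there is no per-token bonus.
    """
    truncated = response_tokens[:token_budget]
    return 1.0 if is_correct(truncated, reference_answer) else 0.0
```

Under a penalty like this, the policy is incentivized to finish its reasoning inside the budget rather than being taught a shorter style from curated data.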
This is both highly useful and beautifully crafted.
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy's nanochat
2. General QAT recipe with MXFP4 forward / MXFP8 backward GEMMs
3. Transformers/vLLM integrations
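For intuition, a generic fake-quantization (straight-through estimator) sketch of the QAT pattern is below; this is not the QuTLASS API, only the quantize-in-forward, full-precision-gradient-in-backward idea that dedicated MXFP4/MXFP8 GEMM kernels make fast.

```python
import torch

# Generic QAT sketch with a straight-through estimator (STE).
# Not QuTLASS code; bit-widths and scaling here are illustrative.

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, n_levels=16):               # 16 levels ~ 4-bit
        scale = x.abs().max().clamp(min=1e-8) / (n_levels / 2 - 1)
        return torch.round(x / scale) * scale       # quantize-dequantize
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                       # STE: pass gradients through

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight)          # weights see quantization noise
        return torch.nn.functional.linear(x, w_q, self.bias)
```

Swapping nn.Linear for a layer like QATLinear during training lets the model adapt to the quantization noise it will face at inference; the released kernels replace the quantize-dequantize emulation with native low-precision GEMMs.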
Thanks to the team: @nbasyl_tw @GXiming @shizhediao Mingjie Liu @CMHungSteven @yin_hongxu Yu-Chiang Frank Wang Kwang-Ting (Tim) Cheng @YejinChoinka @jankautz @PavloMolchanov @NVIDIAAI @nvidianewsroom
Project site: https://t.co/6YH5kGT5jG
Tech report: https://t.co/yVkvLdj4q9
Model: https://t.co/ujVMzp9m7a
Code: huggingface.co
- “Don’t teach–Incentivize.” We show that concise reasoning can be encouraged through the simplest length-penalty design: truncation, with only RL and no handcrafted priors or specially annotated data.
- “Simplicity is great but not easy to get.” We analyze the emerging optimization
- A new test-time scaling paradigm: in an iso-thinking-time setup for a question, we can generate more, shorter answers instead of one long one. This leads to a 28% accuracy improvement over a single long answer.
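A hedged sketch of the iso-thinking-time setup: spend the same total token budget on several short samples instead of one long one, then majority-vote the answers. `generate` and `extract_answer` are hypothetical helpers, not released code.

```python
from collections import Counter

def budget_matched_vote(prompt, generate, extract_answer, total_budget=8192, k=8):
    """Same total thinking budget, split across k shorter samples, then vote."""
    per_sample_budget = total_budget // k           # iso thinking time overall
    answers = [extract_answer(generate(prompt, max_tokens=per_sample_budget))
               for _ in range(k)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```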
Finally, someone shared something different. DeepSeek chose text-to-image NOT because of the image itself; it was more of a coincidence. There are many ways to compress thousands of tokens into a few embeddings. Compressing tokens into a few OCR image embeddings has an
I love Andrej...but this makes no sense to me. I don't see how converting text to image ('pixels') makes it any better for language modeling. What am I missing?
He drank bacteria to prove a theory. He got ulcers. He won the Nobel Prize. Barry Marshall—science’s bravest stomach. 🧪
We should scale up context hard but scale up context length softly
This limitation is real — but we’ve found solutions. Stay tuned.
DLLMs seem promising... but parallel generation is not always possible.
Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one. This makes diffusion-based LLMs highly attractive when we need fast
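To make the contrast concrete, here is a schematic pair of decoding loops (hypothetical model interface, not any specific DLLM's API): the autoregressive loop emits one token per forward pass, while the diffusion-style loop refines many masked positions per step.

```python
# Schematic only: hypothetical `model` interfaces used to contrast decoding loops.

def autoregressive_decode(model, prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):                 # one forward pass per new token
        ids.append(model.next_token(ids))
    return ids

def diffusion_decode(model, prompt_ids, n_new, n_steps=8):
    ids = list(prompt_ids) + [model.mask_id] * n_new
    for _ in range(n_steps):               # each step updates many positions at once
        ids = model.denoise_step(ids)
    return ids
```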
Simplicity is the hardest, without a doubt.
Unpopular opinion: Finding a simple idea that actually works is way harder than publishing a fancy one that kinda works. You have to fight the urge to overcomplicate, give up many fancier ideas, fail and pivot again and again until you hit the first principle that truly holds.
6️⃣Intra-layer hybridization works seamlessly with various model parallelism techniques in both training and inference, opening new opportunities for hardware and model architecture co-design.
5️⃣Head-wise hybridization of positional embeddings (NaPE from @NVIDIAAI @simeng_ssun) demonstrates the best long-context tracing and length generalization among 10 common PE variants.
4️⃣In addition, they show that intra-layer hybridization is compatible with new, advanced linear attentions (Mamba2 @tri_dao @_albertgu, Gated Delta Net @SonglinYang4 @ahatamiz1) and softmax attentions (Diff Attn @ytz2024 @donglixp @MSFTResearch); see the head-wise sketch after this thread.
3️⃣Falcon-H1 (up to 34B params on 18T tokens, from @TIIuae @JingweiZuo) and Dragon (3B params on 3.5T tokens, with bigger models coming, from @DragonLLM @EuroHPC_JU @JG_Barthelemy @Dorialexander) further validate the scalability and long-context potential (long documents, files,
2️⃣A recent FAIR Meta (@sangminbae @CarolejeanWu) study also shows that "intra-layer hybridization shows best pareto-frontier of model quality and efficiency" https://t.co/aM3UHwzEm7
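As a rough illustration of what intra-layer (head-wise) hybridization means, the sketch below splits one layer's heads between softmax attention and a placeholder linear mixer and concatenates their outputs; because the split runs along the head dimension, the same axis that tensor parallelism shards, each rank can hold a mix of both head types. This is schematic, not the actual Hymba, Falcon-H1, or Dragon code.

```python
import torch
import torch.nn as nn

# Schematic intra-layer (head-wise) hybrid block. The `linear_mixer` is only a
# stand-in for a real linear-attention branch such as Mamba2 or Gated Delta Net.

class HybridHeadsLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_softmax_heads=4):
        super().__init__()
        d_head = d_model // n_heads
        self.d_soft = n_softmax_heads * d_head          # channels for softmax heads
        d_lin = d_model - self.d_soft                   # channels for linear heads
        self.in_proj = nn.Linear(d_model, d_model)
        self.softmax_attn = nn.MultiheadAttention(self.d_soft, n_softmax_heads,
                                                  batch_first=True)
        self.linear_mixer = nn.Linear(d_lin, d_lin)     # placeholder branch
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, seq, d_model)
        h = self.in_proj(x)
        h_soft, h_lin = h.split([self.d_soft, h.size(-1) - self.d_soft], dim=-1)
        attn_out, _ = self.softmax_attn(h_soft, h_soft, h_soft)
        return self.out_proj(torch.cat([attn_out, self.linear_mixer(h_lin)], dim=-1))
```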