X. Dong
@SimonXinDong
1K Followers · 71 Following · 48 Media · 248 Statuses
Research Scientist @NVIDIA. Making LLMs, e.g., Hymba and the Nemotron series. Ex @Harvard @Meta @Tencent | Views and opinions are my own
Joined April 2017
We, at NVIDIA, present Length Penalty Done Right:
- Cut CoT length by 3/4 without sacrificing accuracy, using only RL.
- This makes DeepSeek-R1-7B run ~8 times faster on AIME-24 while maintaining the same accuracy.
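To make the recipe concrete, here is a minimal sketch of what a truncation-style length penalty can look like inside an RL reward; the names (truncation_reward, is_correct, token_budget) are illustrative placeholders, not the exact formulation from our tech report.

```python
# Illustrative sketch only: a truncation-style length penalty for RL fine-tuning.
# `is_correct` and `token_budget` are hypothetical names, not the paper's API.

def truncation_reward(response_tokens, reference_answer, is_correct, token_budget=2048):
    """Reward a sampled CoT only if it reaches a correct answer within the budget.

    Responses are hard-truncated at `token_budget` tokens before being checked,
    so over-long reasoning that only reaches the answer past the cutoff earns
    nothing. Correctness stays the only signal; there is no per-token bonus.
    """
    truncated = response_tokens[:token_budget]
    return 1.0 if is_correct(truncated, reference_answer) else 0.0
```

Under a penalty like this, the policy is incentivized to finish its reasoning inside the budget rather than being taught a shorter style from curated data.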
This is both highly useful and beautifully crafted.
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy's nanochat
2. General QAT recipe with MXFP4 forward / MXFP8 backward GEMMs
3. Transformers/vLLM integrations
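For intuition, a generic fake-quantization (straight-through estimator) sketch of the QAT pattern is below; this is not the QuTLASS API, only the quantize-in-forward, full-precision-gradient-in-backward idea that dedicated MXFP4/MXFP8 GEMM kernels make fast.

```python
import torch

# Generic QAT sketch with a straight-through estimator (STE).
# Not QuTLASS code; bit-widths and scaling here are illustrative.

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, n_levels=16):               # 16 levels ~ 4-bit
        scale = x.abs().max().clamp(min=1e-8) / (n_levels / 2 - 1)
        return torch.round(x / scale) * scale       # quantize-dequantize
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                       # STE: pass gradients through

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight)          # weights see quantization noise
        return torch.nn.functional.linear(x, w_q, self.bias)
```

Swapping nn.Linear for a layer like QATLinear during training lets the model adapt to the quantization noise it will face at inference; the released kernels replace the quantize-dequantize emulation with native low-precision GEMMs.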
Thanks to the team: @nbasyl_tw @GXiming @shizhediao Mingjie Liu @CMHungSteven @yin_hongxu Yu-Chiang Frank Wang Kwang-Ting (Tim) Cheng @YejinChoinka @jankautz @PavloMolchanov @NVIDIAAI @nvidianewsroom
Project site: https://t.co/6YH5kGT5jG
Tech report: https://t.co/yVkvLdj4q9
Model: https://t.co/ujVMzp9m7a
Code: huggingface.co
- “Don’t teach–Incentivize.” We show that concise reasoning can be encouraged through the simplest length-penalty design: truncation, with only RL and no handcrafted priors or specially annotated data.
- “Simplicity is great but not easy to get.” We analyze the emerging optimization
- A new test-time scaling paradigm: in an iso-thinking-time setup for a question, we can generate more, shorter answers instead of one long one. This leads to a 28% accuracy improvement over a single long answer.
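A hedged sketch of the iso-thinking-time setup: spend the same total token budget on several short samples instead of one long one, then majority-vote the answers. `generate` and `extract_answer` are hypothetical helpers, not released code.

```python
from collections import Counter

def budget_matched_vote(prompt, generate, extract_answer, total_budget=8192, k=8):
    """Same total thinking budget, split across k shorter samples, then vote."""
    per_sample_budget = total_budget // k           # iso thinking time overall
    answers = [extract_answer(generate(prompt, max_tokens=per_sample_budget))
               for _ in range(k)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```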
Finally, someone shared something different. DeepSeek chose text-to-image NOT because of the image itself; it was more of a coincidence. There are many ways to compress thousands of tokens into a few embeddings. Compressing tokens into a few OCR image embeddings has an
I love Andrej...but this makes no sense to me. I don't see how converting text to image ('pixels') makes it any better for language modeling. What am I missing?
He drank bacteria to prove a theory. He got ulcers. He won the Nobel Prize. Barry Marshall—science’s bravest stomach. 🧪
We should scale up context hard but scale up context length softly
This limitation is real — but we’ve found solutions. Stay tuned.
DLLMs seem promising... but parallel generation is not always possible.
Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one. This makes diffusion-based LLMs highly attractive when we need fast
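To make the contrast concrete, here is a schematic pair of decoding loops (hypothetical model interface, not any specific DLLM's API): the autoregressive loop emits one token per forward pass, while the diffusion-style loop refines many masked positions per step.

```python
# Schematic only: hypothetical `model` interfaces used to contrast decoding loops.

def autoregressive_decode(model, prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):                 # one forward pass per new token
        ids.append(model.next_token(ids))
    return ids

def diffusion_decode(model, prompt_ids, n_new, n_steps=8):
    ids = list(prompt_ids) + [model.mask_id] * n_new
    for _ in range(n_steps):               # each step updates many positions at once
        ids = model.denoise_step(ids)
    return ids
```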
Simplicity is the hardest, without a doubt.
Unpopular opinion: Finding a simple idea that actually works is way harder than publishing a fancy one that kinda works. You have to fight the urge to overcomplicate, give up many fancier ideas, fail and pivot again and again until you hit the first principle that truly holds.
6️⃣Intra-layer hybridization works seamlessly with various model parallelism techniques in both training and inference, opening new opportunities for hardware and model architecture co-design.
5️⃣Head-wise hybridization of positional embeddings (NaPE from @NVIDIAAI @simeng_ssun) demonstrates the best long-context tracing and length generalization among 10 common PE variants.
4️⃣In addition, they show that intra-layer hybridization is compatible with new, advanced linear attentions (Mamba2 @tri_dao @_albertgu, Gated Delta Net @SonglinYang4 @ahatamiz1) and softmax attentions (Diff Attn @ytz2024 @donglixp @MSFTResearch); see the head-wise sketch after this thread.
3️⃣Falcon-H1 (up to 34B params on 18T tokens, from @TIIuae @JingweiZuo) and Dragon (3B params on 3.5T tokens, with bigger models coming, from @DragonLLM @EuroHPC_JU @JG_Barthelemy @Dorialexander) further validate the scalability and long-context potential (long documents, files,
2️⃣A recent FAIR Meta (@sangminbae @CarolejeanWu) study also shows that "intra-layer hybridization shows best pareto-frontier of model quality and efficiency" https://t.co/aM3UHwzEm7
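As a rough illustration of what intra-layer (head-wise) hybridization means, the sketch below splits one layer's heads between softmax attention and a placeholder linear mixer and concatenates their outputs; because the split runs along the head dimension, the same axis that tensor parallelism shards, each rank can hold a mix of both head types. This is schematic, not the actual Hymba, Falcon-H1, or Dragon code.

```python
import torch
import torch.nn as nn

# Schematic intra-layer (head-wise) hybrid block. The `linear_mixer` is only a
# stand-in for a real linear-attention branch such as Mamba2 or Gated Delta Net.

class HybridHeadsLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_softmax_heads=4):
        super().__init__()
        d_head = d_model // n_heads
        self.d_soft = n_softmax_heads * d_head          # channels for softmax heads
        d_lin = d_model - self.d_soft                   # channels for linear heads
        self.in_proj = nn.Linear(d_model, d_model)
        self.softmax_attn = nn.MultiheadAttention(self.d_soft, n_softmax_heads,
                                                  batch_first=True)
        self.linear_mixer = nn.Linear(d_lin, d_lin)     # placeholder branch
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, seq, d_model)
        h = self.in_proj(x)
        h_soft, h_lin = h.split([self.d_soft, h.size(-1) - self.d_soft], dim=-1)
        attn_out, _ = self.softmax_attn(h_soft, h_soft, h_soft)
        return self.out_proj(torch.cat([attn_out, self.linear_mixer(h_lin)], dim=-1))
```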