insightLLM
@insightLLM
Followers: 11 · Following: 39 · Media: 5 · Statuses: 14
LLM/NLP researcher at Baichuan-Inc @BaichuanAI. Prev @AlibabaGroup; Research Intern at @MSFTResearch Asia. GitHub: https://t.co/kUbOq3uc75
Joined May 2025
🎉 Introducing our latest work! 🚀 We propose a label-free method that enables RL without ground-truth answers, yet achieves impressive performance on mathematical tasks: 40.0% accuracy on AIME2024 🎯 with a 7B base model. Paper:
huggingface.co
Replies: 4 · Reposts: 1 · Likes: 8
Harmony format is finally open-sourced. I still remember that 3 years ago (before the ChatGPT release) @shengjia_zhao, Daniel, and I were brainstorming about the right abstraction for RL training, and that was the starting point of the entire harmony library.
github.com
Renderer for the harmony response format to be used with gpt-oss - openai/harmony
Replies: 34 · Reposts: 155 · Likes: 2K
Why does GRPO make responses shorter as training goes on? 🤔 While reproducing DeepSeek-R1, we noticed a weird thing: the longer you train with GRPO, the shorter the responses get 📉. Totally opposite to what the paper shows (see Figure 1)! Some blame KL divergence, others say the
Replies: 0 · Reposts: 0 · Likes: 4
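For context on what GRPO is doing in these threads, here is a minimal sketch of the group-relative advantage computation the algorithm is named for; the function and variable names are illustrative, not taken from the tweet or any particular codebase.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO.

    For one prompt, sample a group of G responses, score each with a
    reward, and normalize within the group: A_i = (r_i - mean) / std.
    Responses better than their group average get positive advantage,
    worse ones get negative advantage.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to the same math problem, rewarded 1.0
# if the final answer is correct, else 0.0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1, -1, -1,  1]
```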
Meaningful! 💡 The reason behind this phenomenon: the base model (from pre-training) already possesses strong mathematical and logical reasoning abilities. Post-training merely serves to elicit these abilities, so the ground-truth answers themselves are actually not that important.
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: https://t.co/jBPlm7cyhr
Replies: 0 · Reposts: 0 · Likes: 1
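As a rough illustration of what "random", "incorrect", and "ground-truth" rewards could look like as drop-in reward functions for an RLVR-style trainer, here is a minimal sketch; the names and the answer-extraction helper are hypothetical stand-ins, not the authors' code.

```python
import random
import re

def extract_answer(response):
    """Hypothetical helper: pull the final \\boxed{} answer out of a response."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return m.group(1).strip() if m else None

def ground_truth_reward(response, answer):
    """Standard RLVR reward: 1 if the extracted answer matches ground truth."""
    return 1.0 if extract_answer(response) == answer else 0.0

def incorrect_reward(response, wrong_answer):
    """Spurious reward: reward matching a deliberately wrong label."""
    return 1.0 if extract_answer(response) == wrong_answer else 0.0

def random_reward(response, p=0.5):
    """Spurious reward: ignore the response entirely and flip a coin."""
    return 1.0 if random.random() < p else 0.0
```

Any of these can stand in for the per-response reward in the GRPO advantage sketch above; per the tweet, even the uninformative variants still improved Qwen2.5-Math-7B on MATH-500.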
🚀 Introducing ReCall, learning to Reason with Tool Call via RL.
- Multi-turn Reinforcement Learning
- No need for supervised data on tool use or reasoning steps
- Empowers LLMs to agentically use and combine arbitrary tools
Fully open-source! A work in progress and we are
Replies: 1 · Reposts: 52 · Likes: 207
📚Experiments show that structure constraints and length optimization alone can yield performance comparable to baseline methods that rely on ground-truth answers.
Replies: 0 · Reposts: 0 · Likes: 2
✨We find that constructing surrogate signals based on format correctness and response length, combined with the GRPO algorithm, enables effective training of large models to solve mathematical problems.
Replies: 0 · Reposts: 0 · Likes: 2
💡 The base model is like an excellent student who has already mastered mathematical reasoning skills but performs poorly on the test paper: it simply needs to develop good answering habits to achieve outstanding results in exams and unlock the capabilities it already possesses.
Replies: 0 · Reposts: 0 · Likes: 2