insightLLM
@insightLLM
Followers: 11 · Following: 39 · Media: 5 · Statuses: 14
LLM/NLP researcher at Baichuan-Inc @BaichuanAI. Prev @AlibabaGroup; Research Intern at @MSFTResearch Asia. GitHub: https://t.co/kUbOq3uc75
Joined May 2025
🎉 Introducing our latest work! 🚀 We propose a label-free method that enables RL without ground-truth answers, yet achieves impressive performance on mathematical tasks: 40.0% accuracy on AIME2024 🎯 with a 7B base model. Paper:
huggingface.co
Replies: 4 · Reposts: 1 · Likes: 8
Harmony format is finally open-sourced. I still remember that 3 years ago (before the ChatGPT release) @shengjia_zhao, Daniel, and I were brainstorming about the right abstraction for RL training, and that was the starting point of the entire harmony library.
github.com
Renderer for the harmony response format to be used with gpt-oss - openai/harmony
Replies: 34 · Reposts: 155 · Likes: 2K
Why does GRPO make responses shorter as training goes on? 🤔 While reproducing DeepSeek-R1, we noticed a weird thing: the longer you train with GRPO, the shorter the responses get 📉. Totally opposite to what the paper shows (see Figure 1)! Some blame KL divergence, others say the
Replies: 0 · Reposts: 0 · Likes: 4
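For context on what GRPO is doing in these threads, here is a minimal sketch of the group-relative advantage computation the algorithm is named for; the function and variable names are illustrative, not taken from the tweet or any particular codebase.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO.

    For one prompt, sample a group of G responses, score each with a
    reward, and normalize within the group: A_i = (r_i - mean) / std.
    Responses better than their group average get positive advantage,
    worse ones get negative advantage.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled responses to the same math problem, rewarded 1.0
# if the final answer is correct, else 0.0.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1, -1, -1,  1]
```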
Meaningful! 💡 The reason behind this phenomenon: the base model (from pre-training) already possesses strong mathematical and logical reasoning abilities. Post-training merely serves to elicit these abilities, so the ground-truth answers themselves are actually not that important.
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: https://t.co/jBPlm7cyhr
Replies: 0 · Reposts: 0 · Likes: 1
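As a rough illustration of what "random", "incorrect", and "ground-truth" rewards could look like as drop-in reward functions for an RLVR-style trainer, here is a minimal sketch; the names and the answer-extraction helper are hypothetical stand-ins, not the authors' code.

```python
import random
import re

def extract_answer(response):
    """Hypothetical helper: pull the final \\boxed{} answer out of a response."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return m.group(1).strip() if m else None

def ground_truth_reward(response, answer):
    """Standard RLVR reward: 1 if the extracted answer matches ground truth."""
    return 1.0 if extract_answer(response) == answer else 0.0

def incorrect_reward(response, wrong_answer):
    """Spurious reward: reward matching a deliberately wrong label."""
    return 1.0 if extract_answer(response) == wrong_answer else 0.0

def random_reward(response, p=0.5):
    """Spurious reward: ignore the response entirely and flip a coin."""
    return 1.0 if random.random() < p else 0.0
```

Any of these can stand in for the per-response reward in the GRPO advantage sketch above; per the tweet, even the uninformative variants still improved Qwen2.5-Math-7B on MATH-500.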
🚀 Introducing ReCall, learning to Reason with Tool Call via RL.
- Multi-turn Reinforcement Learning
- No need for supervised data on tool use or reasoning steps
- Empowers LLMs to agentically use and combine arbitrary tools
Fully open-source! A work in progress and we are
Replies: 1 · Reposts: 52 · Likes: 207
📚Experiments show that structure constraints and length optimization alone can yield performance comparable to baseline methods that rely on ground-truth answers.
Replies: 0 · Reposts: 0 · Likes: 2
✨We find that constructing surrogate signals based on format correctness and response length, combined with the GRPO algorithm, enables effective training of large models to solve mathematical problems.
Replies: 0 · Reposts: 0 · Likes: 2
💡 The base model is like an excellent student who has already mastered mathematical reasoning skills but performs poorly on the test paper: it simply needs to develop good answering habits to achieve outstanding results in exams and unlock the capabilities it already possesses.
Replies: 0 · Reposts: 0 · Likes: 2