insightLLM
@insightLLM
Followers: 11 · Following: 39 · Media: 5 · Statuses: 14

LLM/NLP researcher at Baichuan Inc. @BaichuanAI. Prev. @AlibabaGroup; research intern at @MSFTResearch Asia. GitHub: https://t.co/kUbOq3uc75

Joined May 2025
insightLLM (@insightLLM) · 6 months
🎉 Introducing our latest work! 🚀 We propose a label-free method that enables RL without ground-truth answers, yet achieves impressive performance on mathematical tasks: 40.0% accuracy on AIME2024 🎯 with a 7B base model. Paper: huggingface.co
Jiayi Weng (@Trinkle23897) · 3 months
Harmony format is finally open-sourced. I still remember 3 years ago (before the ChatGPT release) @shengjia_zhao, Daniel, and I were brainstorming about the right abstraction for RL training, and that was the starting point of the entire harmony library.
github.com · Renderer for the harmony response format to be used with gpt-oss - openai/harmony
insightLLM (@insightLLM) · 5 months
Why does GRPO make responses shorter as training goes on? 🤔 While reproducing DeepSeek-R1, we noticed a weird thing: the longer you train with GRPO, the shorter the responses get 📉. Totally opposite to what the paper shows (see Figure 1)! Some blame KL divergence, others say the
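To make the debate concrete, here is a minimal Python sketch (my own illustration, not this thread's code) of GRPO's group-relative advantages and the per-response length normalization that often comes up in these length-drift discussions; the function names and toy numbers are assumptions.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    mean/std of its group (G samples drawn for the same prompt)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def per_response_losses(logprob_sums, lengths, advantages):
    """Policy-gradient loss per response with per-response token averaging
    (dividing by response length). This normalization is one commonly
    suspected culprit in length-drift debates: the same advantage is
    diluted across long responses and concentrated on short ones, so
    the gradient pressure a response feels depends on its length."""
    return [-(a * lp / n) for lp, n, a in zip(logprob_sums, lengths, advantages)]

# Example: 4 rollouts for one prompt with binary rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
losses = per_response_losses([-50.0, -120.0, -40.0, -90.0], [100, 240, 80, 180], adv)
```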
insightLLM (@insightLLM) · 6 months
🎉 Our latest work! https://t.co/ooMFSxYoVf
insightLLM (@insightLLM) · 6 months
Meaningful! 💡 The reason behind this phenomenon: the base model (from pre-training) already possesses strong mathematical and logical reasoning abilities. Post-training merely serves to elicit these abilities, so the ground-truth answers themselves are actually not that important.
Stella Li (@StellaLisy) · 6 months
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵 Blogpost: https://t.co/jBPlm7cyhr
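For intuition about what "random" or "incorrect" rewards mean mechanically, here is a hedged sketch; `make_spurious_reward` and its signature are my own illustration, not the paper's released code.

```python
import random
from typing import Callable, Optional

def make_spurious_reward(kind: str,
                         verify: Optional[Callable[[str], bool]] = None
                         ) -> Callable[[str], float]:
    """Build reward functions that ignore or invert correctness
    (illustrative, not the paper's code):
    'random'    -> coin-flip reward, independent of the response
    'incorrect' -> reward 1.0 only when the verifier says the answer is WRONG
    'truth'     -> ordinary verifiable reward, for comparison"""
    if kind == "random":
        return lambda resp: float(random.random() < 0.5)
    if kind in ("incorrect", "truth"):
        assert verify is not None, "need a ground-truth verifier"
        reward_wrong = (kind == "incorrect")
        return lambda resp: float(verify(resp) != reward_wrong)
    raise ValueError(f"unknown reward kind: {kind}")
```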
Mingyang Chen (@chen_mingyang) · 7 months
🚀 Introducing ReCall, learning to Reason with Tool Call via RL.
- Multi-turn Reinforcement Learning
- No need for supervised data on tool use or reasoning steps
- Empowers LLMs to agentically use and combine arbitrary tools
Fully open-source! A work in progress and we are
insightLLM (@insightLLM) · 6 months
📚 Experiments show that structure constraints and length optimization alone can yield performance comparable to baseline methods that rely on ground-truth answers.
insightLLM (@insightLLM) · 6 months
✨ We find that constructing surrogate signals based on format correctness and response length, combined with the GRPO algorithm, enables effective training of large models to solve mathematical problems.
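A minimal sketch of such a surrogate reward, assuming an R1-style <think>…</think> + \boxed{} output format and a simple length target; the weights, tag names, and whitespace tokenization are illustrative assumptions, not the paper's actual implementation.

```python
import re

def surrogate_reward(response: str,
                     target_len: int = 1024,
                     w_format: float = 1.0,
                     w_length: float = 0.5) -> float:
    # (1) Format correctness: require a <think>...</think> block and a
    #     final \boxed{...} answer (assumed tags; no ground truth used).
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    has_boxed = re.search(r"\\boxed\{.+?\}", response) is not None
    format_score = float(has_think and has_boxed)

    # (2) Length: encourage longer reasoning up to a cap, again without
    #     ever comparing the answer against a reference solution.
    n_tokens = len(response.split())  # crude whitespace tokenization
    length_score = min(n_tokens / target_len, 1.0)

    return w_format * format_score + w_length * length_score

# Example: a well-formatted response earns the format bonus plus a small
# length bonus; GRPO then normalizes these rewards within each group.
r = surrogate_reward("<think> try n=3 ... </think> Final answer: \\boxed{42}")
```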
insightLLM (@insightLLM) · 6 months
💡 The base model is like an excellent student who has already mastered mathematical reasoning but performs poorly on the test paper; it simply needs to develop good answering habits to achieve outstanding exam results and unlock the capabilities it already possesses.