This is my summary of a podcast interviewing Xiangyu Zhang. I found it very insightful. Since I used Whisper to transcribe and Gemini to translate it, it may contain errors; still, judging from the overall flow of the content, I think it should be largely accurate.
rStar2-Agent: Agentic Reasoning Technical Report. Agentic RL. To improve the quality of interactions, they oversample rollouts and then subsample the positive ones to retain only high-quality trajectories (instead of penalizing low-quality ones through the reward).
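A minimal sketch of this oversample-then-filter loop; `policy.generate`, `reward_fn`, and `quality_fn` are hypothetical stand-ins, and the paper's exact selection criteria may differ:

```python
import random

def select_rollouts(prompt, policy, reward_fn, quality_fn,
                    group_size=8, oversample_factor=2):
    """Oversample rollouts, then keep only the cleanest positives."""
    rollouts = [policy.generate(prompt)
                for _ in range(group_size * oversample_factor)]
    positives = [r for r in rollouts if reward_fn(r) > 0]
    negatives = [r for r in rollouts if reward_fn(r) <= 0]

    # Rank positives by quality (e.g. fewer tool-call errors) and keep
    # the best, rather than encoding quality penalties into the reward.
    positives.sort(key=quality_fn, reverse=True)
    n_pos = min(len(positives), group_size // 2)
    kept = positives[:n_pos]
    kept += random.sample(negatives, min(len(negatives), group_size - n_pos))
    return kept
```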
MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training. Tames attention logits in large-batch training by rescaling updates with a max-normalized, element-wise ratio between weights and updates. How would this compare with MuonClip?
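My own rough reconstruction of such a ratio-based rescaling, sketched from the title and abstract (the actual MERIT rule is likely more involved):

```python
import torch

def merit_style_update(weight, update, lr, eps=1e-8):
    """Hedged sketch: LAMB-like trust ratio, but max-normalized and
    combined with an element-wise cap so no single coordinate (and thus
    no attention logit) blows up. Not the paper's exact algorithm.
    """
    # Layer-wise ratio using the max (infinity) norm.
    ratio = weight.abs().max() / (update.abs().max() + eps)
    # Element-wise ratio, capped so each coordinate moves at most
    # proportionally to its own magnitude.
    elem_ratio = torch.clamp(weight.abs() / (update.abs() + eps), max=1.0)
    with torch.no_grad():
        weight -= lr * ratio * elem_ratio * update
    return weight
```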
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. Increasing the number of activated experts or the total number of experts can decrease downstream task (GSM) performance even when train/valid loss itself decreases. In other words, reasoning performance is more sensitive to these sparsity choices than the loss curves suggest.
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning. Improved product-key memory + tensor decomposition, combined with FFN, and revised initialization. Competitive with MoE.
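For context, a minimal sketch of the underlying product-key lookup (Lample et al.'s memory layers, which this line of work builds on); UltraMemV2's tensor-decomposed retrieval and FFN-side changes are not shown:

```python
import torch

def product_key_lookup(q, keys1, keys2, values, k=4):
    """Product-key memory: score n*n composite keys at O(n) cost by
    splitting the query and searching two sub-key sets of size n.

    q:      (d,) query
    keys1:  (n, d/2) first sub-key set
    keys2:  (n, d/2) second sub-key set
    values: (n*n, d_v), value for composite key (i1, i2) at i1 * n + i2
    """
    d = q.shape[0]
    q1, q2 = q[: d // 2], q[d // 2:]

    # Top-k on each sub-key set independently.
    s1, i1 = (keys1 @ q1).topk(k)
    s2, i2 = (keys2 @ q2).topk(k)

    # Composite scores of the k*k candidates are sums of sub-scores.
    n = keys2.shape[0]
    scores = (s1[:, None] + s2[None, :]).flatten()
    idx = (i1[:, None] * n + i2[None, :]).flatten()

    top_s, top_i = scores.topk(k)
    w = torch.softmax(top_s, dim=0)
    return w @ values[idx[top_i]]  # weighted sum of k retrieved values
```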
Predicting the Order of Upcoming Tokens Improves Language Modeling. Multi-token prediction may be too hard as an auxiliary task, so instead let the model predict the order of upcoming tokens, i.e. how far away each one is within a window.
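A hedged sketch of how per-position ordering targets might be built (my own reconstruction; the paper's target construction and ranking loss may differ):

```python
import torch

def token_order_targets(token_ids, vocab_size, window=8):
    """For each position t, give every token that appears within the
    next `window` steps a score that decays with distance (the next
    token scores `window`, a token i steps ahead scores window - i + 1);
    all other tokens score 0. An auxiliary head is then trained to
    reproduce this ordering.
    """
    T = token_ids.shape[0]
    targets = torch.zeros(T, vocab_size)
    for t in range(T):
        for i in range(1, min(window, T - 1 - t) + 1):
            tok = token_ids[t + i]
            if targets[t, tok] == 0:  # keep the first occurrence only
                targets[t, tok] = window - i + 1
    return targets
```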
Gemini 2.5 Flash Image Preview, a.k.a. nano banana. It is quite fast (about 8-9 seconds to generate a 1024px image; 1290 tokens). Editing is almost pixel-perfect, though I found that slight shifting sometimes happens. It also lightens the cast shadow from the front gear, due to the change of lighting.
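A minimal generation call via the google-genai Python SDK, using the model name above (my sketch; the prompt and output handling are illustrative):

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents="A road bike leaning against a brick wall, late afternoon light",
)

# Image bytes come back as inline_data parts alongside any text parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("out.png", "wb") as f:
            f.write(part.inline_data.data)
```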