YIFENG LIU
@YIFENGLIU_AI
Followers
215
Following
126
Media
14
Statuses
56
CS Ph.D. student on LLM @ UCLA AGI Lab. Previous works: RPG, MARS, TPA, Kimi-1.5....
Los Angeles
Joined April 2024
The most creative open-source team I've ever known.
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built
0
0
2
Interestingly, many tech reports for industrial LLMs trained with μP don't mention "μP" but instead describe "special" scaling laws, in the shadow of MetaP's failure. This is much like the AI winter two decades ago, when people referred to AI as cognitive systems or computational intelligence.
0
0
2
🎉 Our paper "Tensor Product Attention Is All You Need" has been accepted as a NeurIPS 2025 Spotlight (Top 3%)! The camera-ready version of TPA is now publicly available on arXiv: https://t.co/5AJoEjl6oH ⚡️TPA is stronger and faster than GQA and MLA, and is compatible
🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See https://t.co/5AJoEjl6oH and https://t.co/0y5bySxU4S for details!
12
104
726
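A toy numpy sketch of the tensor-product factorization idea behind TPA: each token's per-head activation matrix is built as a sum of rank-1 outer products, so only the small factors need to be cached. Shapes and weight names here are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def tpa_qkv(x, Wa, Wb, rank, n_heads, d_head):
    """Toy tensor-product factorization of an attention activation.

    Each token's (n_heads x d_head) matrix is a sum of rank-1 outer
    products a_r b_r^T; a KV cache would store only the small factors
    A and B instead of the full matrix. Hypothetical shapes/weights.
    """
    A = x @ Wa                        # (seq, rank * n_heads)
    B = x @ Wb                        # (seq, rank * d_head)
    A = A.reshape(-1, rank, n_heads)  # head factors
    B = B.reshape(-1, rank, d_head)   # dimension factors
    # contract over the rank axis: (seq, n_heads, d_head)
    return np.einsum("srh,srd->shd", A, B) / rank
```

For rank much smaller than `n_heads * d_head`, caching `A` and `B` is cheaper than caching the full per-head matrices.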
it’s an improved version of Gated DeltaNet. enjoy ^^
5
15
203
@HuizhuoY @QuanquanGu 8/n We are glad to announce that our MARS-M paper is released: https://t.co/2a2Vb0G96u. It extends the variance reduction framework of MARS to matrix-level optimizers, demonstrating consistent performance gains over Muon in LLM pretraining tasks. GitHub Repo:
arxiv.org
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language...
0
0
3
We are so back. Long live REINFORCE! See also our RPG paper: https://t.co/BNclBdUg4Z
The first fantastic paper on scaling RL with LLMs just dropped. I strongly recommend taking a look and will be sharing more thoughts on the blog soon. The Art of Scaling Reinforcement Learning Compute for LLMs Khatri & Madaan et al.
2
12
112
🥂A friendly mapping for folks reading the MiniMax‑M1 report: the awesome CISPO objective can be written as a clean instantiation of RPG‑REINFORCE (https://t.co/H8FZPFLfSL), an off-policy REINFORCE algorithm with a clipped importance sampling weight.
arxiv.org
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of...
1
8
69
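A minimal numpy sketch of the general pattern named above, off-policy REINFORCE with a clipped importance-sampling weight; this illustrates the idea, not the exact CISPO or RPG objective, and the clip value is a placeholder.

```python
import numpy as np

def clipped_is_reinforce_loss(logp_new, logp_old, advantages, clip=2.0):
    """Off-policy REINFORCE with a clipped importance-sampling weight.

    The IS ratio pi_new / pi_old is clipped from above and treated as
    a constant (in an autodiff framework a stop-gradient would be
    applied); the policy gradient comes only from log pi_new, as in
    vanilla REINFORCE.
    """
    ratio = np.exp(logp_new - logp_old)   # importance weight
    w = np.minimum(ratio, clip)           # one-sided clipping
    return -(w * advantages * logp_new).mean()
```

With on-policy samples (`logp_new == logp_old`) the weight is 1 and this reduces to the plain REINFORCE surrogate loss.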
6/n Thanks to @HuizhuoY and @QuanquanGu for the great collaboration.
1
0
8
5/n Tagging some people who might be interested: @JingyuanLiu123 , @kellerjordan0 , @MatPagliardini , @_arohan_ , @ShamKakade6 , @bremen79 , @aaron_defazio , @peter_richtarik , @HazanPrinceton , @MarkSchmidtUBC , @zhiyuanli_ , @cloneofsimo , @jxbz , @HessianFree
1
1
8
4/n Our work combines the best of both worlds: variance reduction (MARS) and matrix-based optimization (Muon). Both have been shown to be among the most effective methods for improving AdamW in recent optimizer benchmarks by @wen_kaiyue, @tengyuma, @percyliang, @AndreiSemenov17.
1
1
7
3/n MARS-M empirically validates that variance reduction techniques, like MARS, can be adapted to matrix-based optimizers to achieve consistent and stable performance improvements.
1
0
10
2/n MARS-M enhances Muon by incorporating scaled gradient correction into momentum, which significantly reduces variance during model training. In GPT-2 experiments, MARS-M beats Muon at every step with the same hyperparameters, and outperforms AdamW on both training and validation loss.
1
2
15
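A rough numpy sketch of the MARS-style variance-reduced momentum step described in 2/n: the raw gradient is corrected with a scaled difference between the current and previous gradients before entering the EMA momentum, which MARS-M then feeds to a Muon-style matrix update. The constants are placeholders, not the paper's tuned values.

```python
import numpy as np

def mars_momentum(m_prev, g_t, g_prev, beta=0.95, gamma=0.025):
    """One MARS-style variance-reduced momentum step (illustrative).

    c_t corrects the gradient with a scaled gradient difference; the
    corrected gradient then updates the usual EMA momentum buffer.
    """
    c_t = g_t + gamma * (beta / (1.0 - beta)) * (g_t - g_prev)
    return beta * m_prev + (1.0 - beta) * c_t
```

When consecutive gradients agree (`g_t == g_prev`) the correction vanishes and this reduces to standard momentum.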
1/n We introduce MARS-M, which extends our variance reduction framework, MARS, to matrix-based optimizer Muon (Moonlight). MARS-M demonstrates consistent performance gains over Muon in LLM pretraining tasks. Github Repo: https://t.co/vuubxbk3iQ
4
12
67
🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See https://t.co/5AJoEjl6oH and https://t.co/0y5bySxU4S for details!
github.com
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) - tensorgi/TPA
7
12
64
Interesting conclusions for MARS in these papers. Glad to see that MARS performs well in these studies.
Amazing "competing" work from @wen_kaiyue @tengyuma @percyliang There are some good stories about optimizers to tell this week 😃 https://t.co/z0K0kG90mW
https://t.co/KziMZlzwGj
0
0
7
Another fantastic benchmark of optimizers. Key takeaways: 1. Variance-reduced Adam variants (e.g., MARS) achieve significant speedups over the AdamW baseline. 2. Matrix-based optimizers (e.g., Muon, SOAP) consistently outperform their scalar-based counterparts (e.g., Lion).
Fantastic Pretraining Optimizers and Where to Find Them "we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1–8× the Chinchilla optimum)." "we find that all the fastest optimizers such as Muon
5
22
188
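To make the scalar-vs-matrix distinction in the takeaways above concrete, here is a plain Newton-Schulz orthogonalization sketch, the kind of matrix-level preconditioning Muon applies to its momentum buffer instead of Adam's elementwise rescaling. This is the textbook cubic iteration; Muon itself uses a tuned quintic variant.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=10):
    """Approximately map M to the nearest semi-orthogonal matrix.

    Cubic Newton-Schulz iteration X <- (3X - X X^T X) / 2, which
    drives all singular values toward 1 while preserving the
    singular vectors. Frobenius normalization keeps the initial
    singular values in the iteration's basin of convergence.
    """
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X
```

Applying this to the momentum matrix equalizes update magnitudes across directions, which is what a scalar-based optimizer like Lion cannot do.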
Why does CANADA try to prevent AI researchers from attending conferences in Canada? I doubt whether Canada wants to develop its AI industry. Why does CANADA try to prevent people named after the maple from entering Canada? I doubt whether Canadians love maples.
0
0
3
Which optimizer (from 100+ optimizers for DL models) is best for training Large Language Models? 🤔 https://t.co/GDVmzzmSts
0
2
10
6/6 🚀Experimental Results: RPG beats GRPO/DAPO/REINFORCE++. We ran RL training experiments on math datasets with Qwen-2.5-7B-Instruct and Qwen-2.5-Math-7B, achieving more stable training and better performance than baselines including GRPO, DAPO, and REINFORCE++.
1
1
13