YIFENG LIU

@YIFENGLIU_AI

Followers
215
Following
126
Media
14
Statuses
56

CS Ph.D. student working on LLMs @ UCLA AGI Lab. Previous works: RPG, MARS, TPA, Kimi-1.5....

Los Angeles
Joined April 2024
@YIFENGLIU_AI
YIFENG LIU
10 days
The most creative open-source team I've ever known.
@Kimi_Moonshot
Kimi.ai
10 days
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built
0
0
2
@YIFENGLIU_AI
YIFENG LIU
12 days
Interestingly, many tech reports for industrial LLMs trained with μP do not mention "μP" but rather "special" scaling laws, in the shadow of MetaP's failure. This is much like the AI winter 20 years ago, when people referred to AI as "cognitive systems" or "computational intelligence".
0
0
2
@yifan_zhang_
Yifan Zhang
19 days
🎉 Our paper "Tensor Product Attention Is All You Need" has been accepted as NeurIPS 2025 Spotlight (Top 3%)! The Camera Ready version of TPA has been publicly available on the arXiv now: https://t.co/5AJoEjl6oH ⚡️TPA is stronger and faster than GQA and MLA, and is compatible
@yifan_zhang_
Yifan Zhang
2 months
🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See https://t.co/5AJoEjl6oH and https://t.co/0y5bySxU4S for details!
12
104
726
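For context on the TPA claim above, here is a minimal sketch of what a tensor-product factorization of the keys could look like: each token's per-head keys are built as a sum of rank-1 (head axis × head-dim axis) outer products, so only the small factors need to be cached. Shapes, names, and the rank are illustrative assumptions, not the paper's reference implementation.

```python
import torch

# Hypothetical shapes: batch B, sequence T, model dim D,
# H heads of head dim Dh, rank R for the K factorization.
B, T, D, H, Dh, R = 2, 16, 256, 8, 32, 2

x = torch.randn(B, T, D)

# Factor projections (illustrative): each token t produces
#   K_t = (1/R) * sum_r a_r(x_t) outer b_r(x_t),  a_r in R^H, b_r in R^Dh
proj_a_k = torch.nn.Linear(D, R * H)    # head-axis factors
proj_b_k = torch.nn.Linear(D, R * Dh)   # head-dim-axis factors

a_k = proj_a_k(x).view(B, T, R, H)
b_k = proj_b_k(x).view(B, T, R, Dh)

# Contract the rank axis to materialize per-head keys: (B, T, H, Dh).
# Only a_k and b_k (R*(H+Dh) numbers per token) would need to be cached,
# instead of H*Dh numbers for full per-head keys.
k = torch.einsum('btrh,btrd->bthd', a_k, b_k) / R

print(k.shape)  # torch.Size([2, 16, 8, 32])
```

The same factorization would apply to values (and optionally queries); the memory saving comes from caching the factors rather than the materialized heads.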
@SonglinYang4
Songlin Yang
19 days
it’s an improved version of Gated DeltaNet. enjoy ^^
@eliebakouch
elie
19 days
Kimi Delta Attention PR in FLA, very nice @yzhang_cs and team, i'm sooo excited for this model
5
15
203
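For readers unfamiliar with the delta rule referenced above, here is a minimal sketch of a gated DeltaNet-style recurrent state update, written from the published formulation rather than the FLA kernels; the gate, step-size names, and the read-out choice are illustrative assumptions.

```python
import torch

# Toy dims: key dim Dk, value dim Dv, sequence length T.
Dk, Dv, T = 8, 8, 32

S = torch.zeros(Dk, Dv)                 # fast-weight memory state
ks = torch.randn(T, Dk)
vs = torch.randn(T, Dv)
betas = torch.rand(T)                   # per-token write strength in (0, 1)
alphas = torch.rand(T)                  # per-token decay gate (the "gated" part)

outs = []
for k, v, beta, alpha in zip(ks, vs, betas, alphas):
    k = k / k.norm()                    # delta rule assumes (near-)unit keys
    # Gated delta rule: decay the old memory, erase the value currently
    # bound to k, then write the new value.
    S = alpha * (S - beta * torch.outer(k, k @ S)) + beta * torch.outer(k, v)
    outs.append(k @ S)                  # toy read-out using the key as query

y = torch.stack(outs)                   # (T, Dv)
print(y.shape)
```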
@YIFENGLIU_AI
YIFENG LIU
19 days
@HuizhuoY @QuanquanGu 8/n We are glad to announce that our MARS-M paper is released: https://t.co/2a2Vb0G96u. It extends the variance reduction framework of MARS to matrix-level optimizers, demonstrating consistent performance gains over Muon in LLM pretraining tasks. GitHub Repo:
arxiv.org
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language...
0
0
3
@yifan_zhang_
Yifan Zhang
1 month
We are so back, Long live REINFORCE! See also our RPG paper: https://t.co/BNclBdUg4Z
@natolambert
Nathan Lambert
1 month
The first fantastic paper on scaling RL with LLMs just dropped. I strongly recommend taking a look and will be sharing more thoughts on the blog soon. The Art of Scaling Reinforcement Learning Compute for LLMs Khatri & Madaan et al.
2
12
112
@yifan_zhang_
Yifan Zhang
1 month
🥂A friendly mapping for folks reading the MiniMax‑M1 report: the awesome work CISPO can be written as a clean instantiation of RPG‑REINFORCE (https://t.co/H8FZPFLfSL), an off-policy REINFORCE algorithm with a clipped importance-sampling weight.
arxiv.org
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of...
1
8
69
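A minimal sketch of the kind of objective being mapped above: an off-policy REINFORCE loss whose importance-sampling weight is clipped and stop-gradiented, so the gradient flows only through the current-policy log-probabilities. Names and the clipping threshold are illustrative; see the RPG paper for the exact formulation CISPO is mapped onto.

```python
import torch

def clipped_is_reinforce_loss(logp_new, logp_old, advantages, eps=0.2):
    """Off-policy REINFORCE with a clipped importance-sampling weight (sketch).

    logp_new:   log-probs of sampled tokens under the current policy (requires grad)
    logp_old:   log-probs under the behavior policy that generated the samples
    advantages: per-token (or per-sequence) advantage / reward signal
    """
    # Importance weight between current and behavior policy.
    ratio = torch.exp(logp_new - logp_old.detach())
    # Clip the weight itself and stop its gradient: the gradient then flows
    # only through logp_new, as in a score-function (REINFORCE) estimator.
    w = torch.clamp(ratio, max=1.0 + eps).detach()
    # Maximize w * A * log pi  ->  minimize the negative.
    return -(w * advantages.detach() * logp_new).mean()

# Toy usage with fake log-probs and advantages.
logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(8)
adv = torch.randn(8)
loss = clipped_is_reinforce_loss(logp_new, logp_old, adv)
loss.backward()
```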
@YIFENGLIU_AI
YIFENG LIU
1 month
@HuizhuoY @QuanquanGu 7/n MARS-M and an approximated version of MARS-M:
2
1
10
@YIFENGLIU_AI
YIFENG LIU
1 month
6/n Thanks to @HuizhuoY and @QuanquanGu for the great collaboration.
1
0
8
@YIFENGLIU_AI
YIFENG LIU
1 month
4/n Our work combines the best of both worlds: variance reduction (MARS) and matrix-based optimization (Muon). Both have been shown to be among the most effective methods for improving over AdamW in recent optimizer benchmarks by @wen_kaiyue, @tengyuma, @percyliang, @AndreiSemenov17.
1
1
7
@YIFENGLIU_AI
YIFENG LIU
1 month
3/n MARS-M empirically validates that variance reduction techniques, like MARS, can be adapted to matrix-based optimizers to achieve consistent and stable performance improvements.
1
0
10
@YIFENGLIU_AI
YIFENG LIU
1 month
2/n MARS-M enhances Muon by incorporating scaled gradient correction into momentum, which significantly reduces variance during model training. In GPT-2 experiments, MARS-M beats Muon at every step with the same hyper-params, and outperforms AdamW on training & validation loss.
1
2
15
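Not from the MARS-M repo; a minimal sketch of the update described in the tweet above, assuming a MARS-style scaled gradient-correction term added to the momentum, followed by a Muon-style orthogonalized (Newton-Schulz) step on a matrix parameter. Coefficient names, the scaling, and the omitted clipping are illustrative assumptions.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately orthogonalize a matrix (Muon-style Newton-Schulz iteration)."""
    X = M / (M.norm() + 1e-7)
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly used quintic coefficients
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def mars_m_step(W, grad, prev_grad, m, lr=0.02, beta=0.95, gamma=0.025):
    """One illustrative MARS-M-like update on a matrix parameter W."""
    # MARS-style variance reduction: add a scaled correction based on the
    # change of the stochastic gradient between consecutive steps.
    c = grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
    m = beta * m + (1.0 - beta) * c
    # Muon-style matrix step: orthogonalize the momentum before applying it.
    update = newton_schulz_orthogonalize(m)
    W = W - lr * update * max(1.0, W.shape[0] / W.shape[1]) ** 0.5
    return W, m

# Toy usage on a single weight matrix.
W = torch.randn(64, 32)
m = torch.zeros_like(W)
prev_grad = torch.zeros_like(W)
for _ in range(3):
    grad = torch.randn_like(W)          # stand-in for a stochastic gradient
    W, m = mars_m_step(W, grad, prev_grad, m)
    prev_grad = grad
```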
@YIFENGLIU_AI
YIFENG LIU
1 month
1/n We introduce MARS-M, which extends our variance reduction framework, MARS, to the matrix-based optimizer Muon (Moonlight). MARS-M demonstrates consistent performance gains over Muon in LLM pretraining tasks. GitHub Repo: https://t.co/vuubxbk3iQ
4
12
67
@yifan_zhang_
Yifan Zhang
2 months
🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See https://t.co/5AJoEjl6oH and https://t.co/0y5bySxU4S for details!
github.com
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) - tensorgi/TPA
7
12
64
@YIFENGLIU_AI
YIFENG LIU
2 months
Interesting conclusions about MARS in these papers. Glad to see that MARS performs well in these studies.
@AndreiSemenov17
Andrei Semenov
2 months
Amazing "competing" work from @wen_kaiyue @tengyuma @percyliang There are some good stories about optimizers to tell this week 😃 https://t.co/z0K0kG90mW https://t.co/KziMZlzwGj
0
0
7
@QuanquanGu
Quanquan Gu
2 months
Another fantastic benchmark of optimizers. Key takeaways: 1. Variance-reduced Adam variants (e.g., MARS) achieve significant speedups over the AdamW baseline. 2. Matrix-based optimizers (e.g., Muon, SOAP) consistently outperform their scalar-based counterparts (e.g., Lion).
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
2 months
Fantastic Pretraining Optimizers and Where to Find Them "we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1–8× the Chinchilla optimum)." "we find that all the fastest optimizers such as Muon
5
22
188
@YIFENGLIU_AI
YIFENG LIU
4 months
Why does CANADA try to prevent AI researchers from attending conferences in Canada? I doubt whether Canada wants to develop its AI industry. Why does CANADA try to prevent people named for the maple from entering Canada? I doubt whether Canadians love maples.
0
0
3
@YIFENGLIU_AI
YIFENG LIU
5 months
Which optimizer (from 100+ optimizers for DL models) is best for training Large Language Models? 🤔 https://t.co/GDVmzzmSts
0
2
10
@YIFENGLIU_AI
YIFENG LIU
6 months
6/6 🚀Experimental Results: RPG beats GRPO/DAPO/REINFORCE++ We implement RL training experiments on math datasets with Qwen-2.5-7B-Instruct and Qwen-2.5-Math-7B, achieving more stable and better performance than baselines including GRPO, DAPO, and REINFORCE++.
1
1
13