YIFENG LIU
@YIFENGLIU_AI
Followers
215
Following
126
Media
14
Statuses
56
CS Ph.D. student on LLM @ UCLA AGI Lab. Previous works: RPG, MARS, TPA, Kimi-1.5....
Los Angeles
Joined April 2024
The most creative open-source team I've ever known.
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built
0
0
2
Interestingly, many tech reports for industrial LLMs trained with μP don't mention "μP" but instead describe "special" scaling laws, in the shadow of MetaP's failure. This is much like the AI winter two decades ago, when people referred to AI as cognitive systems or computational intelligence.
0
0
2
🎉 Our paper "Tensor Product Attention Is All You Need" has been accepted as a NeurIPS 2025 Spotlight (Top 3%)! The camera-ready version of TPA is now publicly available on arXiv: https://t.co/5AJoEjl6oH ⚡️TPA is stronger and faster than GQA and MLA, and is compatible
🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See https://t.co/5AJoEjl6oH and https://t.co/0y5bySxU4S for details!
12
104
726
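A toy numpy sketch of the tensor-product factorization idea behind TPA: each token's per-head activation matrix is built as a sum of rank-1 outer products, so only the small factors need to be cached. Shapes and weight names here are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def tpa_qkv(x, Wa, Wb, rank, n_heads, d_head):
    """Toy tensor-product factorization of an attention activation.

    Each token's (n_heads x d_head) matrix is a sum of rank-1 outer
    products a_r b_r^T; a KV cache would store only the small factors
    A and B instead of the full matrix. Hypothetical shapes/weights.
    """
    A = x @ Wa                        # (seq, rank * n_heads)
    B = x @ Wb                        # (seq, rank * d_head)
    A = A.reshape(-1, rank, n_heads)  # head factors
    B = B.reshape(-1, rank, d_head)   # dimension factors
    # contract over the rank axis: (seq, n_heads, d_head)
    return np.einsum("srh,srd->shd", A, B) / rank
```

For rank much smaller than `n_heads * d_head`, caching `A` and `B` is cheaper than caching the full per-head matrices.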
it’s an improved version of Gated DeltaNet. enjoy ^^
5
15
203
@HuizhuoY @QuanquanGu 8/n We are glad to announce that our MARS-M paper is released: https://t.co/2a2Vb0G96u. It extends the variance reduction framework of MARS to matrix-level optimizers, demonstrating consistent performance gains over Muon in LLM pretraining tasks. GitHub Repo:
arxiv.org
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language...
0
0
3
We are so back. Long live REINFORCE! See also our RPG paper: https://t.co/BNclBdUg4Z
The first fantastic paper on scaling RL with LLMs just dropped. I strongly recommend taking a look and will be sharing more thoughts on the blog soon. The Art of Scaling Reinforcement Learning Compute for LLMs Khatri & Madaan et al.
2
12
112
🥂A friendly mapping for folks reading the MiniMax‑M1 report: the awesome CISPO objective can be written as a clean instantiation of RPG‑REINFORCE (https://t.co/H8FZPFLfSL), an off-policy REINFORCE algorithm with a clipped importance sampling weight.
arxiv.org
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of...
1
8
69
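A minimal numpy sketch of the general pattern named above, off-policy REINFORCE with a clipped importance-sampling weight; this illustrates the idea, not the exact CISPO or RPG objective, and the clip value is a placeholder.

```python
import numpy as np

def clipped_is_reinforce_loss(logp_new, logp_old, advantages, clip=2.0):
    """Off-policy REINFORCE with a clipped importance-sampling weight.

    The IS ratio pi_new / pi_old is clipped from above and treated as
    a constant (in an autodiff framework a stop-gradient would be
    applied); the policy gradient comes only from log pi_new, as in
    vanilla REINFORCE.
    """
    ratio = np.exp(logp_new - logp_old)   # importance weight
    w = np.minimum(ratio, clip)           # one-sided clipping
    return -(w * advantages * logp_new).mean()
```

With on-policy samples (`logp_new == logp_old`) the weight is 1 and this reduces to the plain REINFORCE surrogate loss.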
6/n Thanks to @HuizhuoY and @QuanquanGu for the great collaboration.
1
0
8
5/n Tagging some people who might be interested: @JingyuanLiu123 , @kellerjordan0 , @MatPagliardini , @_arohan_ , @ShamKakade6 , @bremen79 , @aaron_defazio , @peter_richtarik , @HazanPrinceton , @MarkSchmidtUBC , @zhiyuanli_ , @cloneofsimo , @jxbz , @HessianFree
1
1
8
4/n Our work combines the best of both worlds: variance reduction (MARS) and matrix-based optimization (Muon). Both have been shown to be among the most effective methods for improving AdamW in recent optimizer benchmarks by @wen_kaiyue, @tengyuma, @percyliang, @AndreiSemenov17.
1
1
7
3/n MARS-M empirically validates that variance reduction techniques, like MARS, can be adapted to matrix-based optimizers to achieve consistent and stable performance improvements.
1
0
10
2/n MARS-M enhances Muon by incorporating scaled gradient correction into momentum, which significantly reduces variance during model training. In GPT-2 experiments, MARS-M beats Muon at every step with the same hyperparameters, and outperforms AdamW on both training and validation loss.
1
2
15
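A rough numpy sketch of the MARS-style variance-reduced momentum step described in 2/n: the raw gradient is corrected with a scaled difference between the current and previous gradients before entering the EMA momentum, which MARS-M then feeds to a Muon-style matrix update. The constants are placeholders, not the paper's tuned values.

```python
import numpy as np

def mars_momentum(m_prev, g_t, g_prev, beta=0.95, gamma=0.025):
    """One MARS-style variance-reduced momentum step (illustrative).

    c_t corrects the gradient with a scaled gradient difference; the
    corrected gradient then updates the usual EMA momentum buffer.
    """
    c_t = g_t + gamma * (beta / (1.0 - beta)) * (g_t - g_prev)
    return beta * m_prev + (1.0 - beta) * c_t
```

When consecutive gradients agree (`g_t == g_prev`) the correction vanishes and this reduces to standard momentum.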
1/n We introduce MARS-M, which extends our variance reduction framework, MARS, to matrix-based optimizer Muon (Moonlight). MARS-M demonstrates consistent performance gains over Muon in LLM pretraining tasks. Github Repo: https://t.co/vuubxbk3iQ
4
12
67
🎉Our paper, "Tensor Product Attention Is All You Need," has been accepted for a spotlight presentation at the 2025 Conference on Neural Information Processing Systems (NeurIPS 2025)! See https://t.co/5AJoEjl6oH and https://t.co/0y5bySxU4S for details!
github.com
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) - tensorgi/TPA
7
12
64
Interesting conclusions for MARS in these papers. Glad to see that MARS performs well in these studies.
Amazing "competing" work from @wen_kaiyue @tengyuma @percyliang There are some good stories about optimizers to tell this week 😃 https://t.co/z0K0kG90mW
https://t.co/KziMZlzwGj
0
0
7
Another fantastic benchmark of optimizers. Key takeaways: 1. Variance-reduced Adam variants (e.g., MARS) achieve significant speedups over the AdamW baseline. 2. Matrix-based optimizers (e.g., Muon, SOAP) consistently outperform their scalar-based counterparts (e.g., Lion).
Fantastic Pretraining Optimizers and Where to Find Them "we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1–8× the Chinchilla optimum)." "we find that all the fastest optimizers such as Muon
5
22
188
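To make the scalar-vs-matrix distinction in the takeaways above concrete, here is a plain Newton-Schulz orthogonalization sketch, the kind of matrix-level preconditioning Muon applies to its momentum buffer instead of Adam's elementwise rescaling. This is the textbook cubic iteration; Muon itself uses a tuned quintic variant.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=10):
    """Approximately map M to the nearest semi-orthogonal matrix.

    Cubic Newton-Schulz iteration X <- (3X - X X^T X) / 2, which
    drives all singular values toward 1 while preserving the
    singular vectors. Frobenius normalization keeps the initial
    singular values in the iteration's basin of convergence.
    """
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X
```

Applying this to the momentum matrix equalizes update magnitudes across directions, which is what a scalar-based optimizer like Lion cannot do.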
Why does CANADA try to prevent AI researchers from attending conferences in Canada? I doubt whether Canada wants to develop its AI industry. Why does CANADA try to prevent people named after the maple from entering Canada? I doubt whether Canadians love maples.
0
0
3
Which optimizer (from 100+ optimizers for DL models) is best for training Large Language Models? 🤔 https://t.co/GDVmzzmSts
0
2
10
6/6 🚀Experimental Results: RPG beats GRPO/DAPO/REINFORCE++. We ran RL training experiments on math datasets with Qwen-2.5-7B-Instruct and Qwen-2.5-Math-7B, achieving more stable training and better performance than baselines including GRPO, DAPO, and REINFORCE++.
1
1
13