Yunzhen Feng Profile
Yunzhen Feng

@feeelix_feng

Followers: 346 · Following: 220 · Media: 7 · Statuses: 89

PhD at CDS, NYU. Ex-Intern at FAIR @AIatMeta. Previously undergrad at @PKU1898

Joined May 2022
@feeelix_feng
Yunzhen Feng
2 months
RT @KunhaoZ: 🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨 That’s not a bug — it’s a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you’re optimizing. You get…
0
139
0
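For context on the pass@1 vs. pass@k distinction in the retweet above: below is the standard unbiased estimator of pass@k as commonly used in code-generation evaluation. This is background, not something stated in the truncated tweet; the function name and example numbers are illustrative.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator for one problem:
    # n = samples drawn, c = correct samples, k = evaluation budget.
    # pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one
    # of k samples (chosen without replacement from the n) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples with 10 correct.
print(pass_at_k(100, 10, 1))    # 0.10 (pass@1)
print(pass_at_k(100, 10, 16))   # noticeably higher (pass@16)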
@feeelix_feng
Yunzhen Feng
2 months
Check out our poster tmr at 10am at the ICLR Bidirectional Human-AI Alignment workshop! We cover how on-policy preference sampling can be biased, and present our optimal response sampling for human labeling. @NYUDataScience @AIatMeta @KempeLab @YaqiDuanPKU
Tweet media one
@feeelix_feng
Yunzhen Feng
3 months
RT @roydanroy: I need to offer some clarification for this post because it would be wrong for people to lump this situation in with ones wh…
0
1
0
@feeelix_feng
Yunzhen Feng
3 months
RT @dohmatobelvis: We refused to cite the paper due to severe misconduct of the authors of that paper: plagiarism of our own prior work,…
0
27
0
@feeelix_feng
Yunzhen Feng
4 months
RT @dohmatobelvis: Papers accepted at @iclr_conf 2025:
- An Effective Theory of Bias Amplification
- Pitfalls of…
0
26
0
@feeelix_feng
Yunzhen Feng
4 months
RT @aviral_kumar2: A lot of work focuses on test-time scaling. But we aren't scaling it optimally, simply training a long CoT doesn't mean…
0
33
0
@feeelix_feng
Yunzhen Feng
4 months
RT @liuzhuang1234: How different are the outputs of various LLMs, and in what ways do they differ? Turns out, very very different, up to t…
0
85
0
@feeelix_feng
Yunzhen Feng
5 months
RT @xszheng2020: 🎉 Excited to share that "NullModel" has been accepted to #ICLR2025 as an oral presentation! I am on the Job Market! I am…
0
8
0
@feeelix_feng
Yunzhen Feng
5 months
RT @vivek_myers: Current robot learning methods are good at imitating tasks seen during training, but struggle to compose behaviors in new…
0
27
0
@feeelix_feng
Yunzhen Feng
5 months
PILAF's principles apply to DPO, PPO, and beyond!
For researchers working on:
✔️ Reward modeling theory
✔️ LLM alignment dynamics
This paper offers new insights: what makes preference data effective for RLHF? 📄 (11/11)
1
0
6
@feeelix_feng
Yunzhen Feng
5 months
This work reframes preference data collection: on-policy data is not enough for RLHF. It's NOT just about "gathering more data" or even "gathering on-policy data" – it's about strategically sampling data that maximally reduces reward model bias during policy evolution. 🧠 (10/11)
1
0
7
@feeelix_feng
Yunzhen Feng
5 months
Extreme test: Start with overfitted π.
Standard on-policy: limited reward with large KL and long responses.
PILAF: Escapes via interpolation-guided exploration and converges much better. (9/11)
Tweet media one
1
0
5
@feeelix_feng
Yunzhen Feng
5 months
PILAF vs. baselines in iterative/online DPO:
✅ Higher reward with lower KL divergence
✅ Saves annotation + computation for similar performance
✅ No hyperparameters to tune!
(8/11)
Tweet media one
Tweet media two
1
0
6
@feeelix_feng
Yunzhen Feng
5 months
Why does interpolation work?
1️⃣ Optimization:
➔ PILAF's gradient aligns with oracle reward r*'s policy gradient
➔ Ensures policy updates maximize r* (human values)
2️⃣ Statistical:
➔ Samples focus on high-sensitivity regions of r*
➔ Converges with consistently high rewards
(7/11)
Tweet media one
1
0
6
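To make the "gradient alignment" claim in (7/11) concrete, these are the two standard gradients being compared, written in generic RLHF notation (a sketch from the usual definitions, not necessarily the paper's exact statement); σ is the logistic function and β the KL coefficient.

% Gradient of the Bradley-Terry MLE loss on one labeled pair (y_w, y_l):
\nabla_\theta \ell(\theta)
  = -\bigl(1 - \sigma(\Delta)\bigr)
    \bigl(\nabla_\theta \hat{r}_\theta(x, y_w) - \nabla_\theta \hat{r}_\theta(x, y_l)\bigr),
  \qquad \Delta = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)

% Policy gradient of the oracle objective
% J(\pi_\phi) = \mathbb{E}_{y \sim \pi_\phi}[r^*(x, y)] - \beta\,\mathrm{KL}(\pi_\phi \,\|\, \pi_{\mathrm{ref}}):
\nabla_\phi J
  = \mathbb{E}_{y \sim \pi_\phi}\Bigl[\nabla_\phi \log \pi_\phi(y \mid x)
    \Bigl(r^*(x, y) - \beta \log \tfrac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Bigr)\Bigr]

The thread's claim, as I read it, is that PILAF's sampling makes the first (data-dependent) gradient point in a direction consistent with the second.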
@feeelix_feng
Yunzhen Feng
5 months
PILAF: Policy-Interpolated Learning for Aligned Feedback.
An approximate sampling scheme: for each prompt,
with 50% prob: sample y₁, y₂ ~ π (current policy)
with 50% prob: sample y₁ ~ (1+β)π - βπ_ref (optimistic) and y₂ ~ (1-β)π + βπ_ref (conservative)
(6/11)
1
0
6
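A minimal, self-contained sketch of the pair-sampling rule in (6/11), over a toy discrete response space. The clip-and-renormalize handling of the extrapolated (1+β)π - βπ_ref distribution, and all names and numbers below, are illustrative assumptions, not the paper's decoding procedure.

import numpy as np

rng = np.random.default_rng(0)

def mixture_probs(p_pi, p_ref, beta, optimistic=True):
    # Literal reading of the formulas in (6/11), in probability space.
    # The optimistic case (1 + beta)*pi - beta*pi_ref can assign negative mass,
    # so clipping to zero and renormalizing is an assumption made here.
    if optimistic:
        p = (1 + beta) * p_pi - beta * p_ref
    else:
        p = (1 - beta) * p_pi + beta * p_ref
    p = np.clip(p, 0.0, None)
    return p / p.sum()

def sample_response(p):
    # Toy stand-in for generating a response: draw one item from a discrete distribution.
    return rng.choice(len(p), p=p)

def sample_preference_pair(p_pi, p_ref, beta=0.1):
    # PILAF-style pair sampling for one prompt, as described in (6/11):
    # with 50% probability take both responses from the current policy pi;
    # otherwise take one optimistic and one conservative sample.
    if rng.random() < 0.5:
        return sample_response(p_pi), sample_response(p_pi)
    y1 = sample_response(mixture_probs(p_pi, p_ref, beta, optimistic=True))
    y2 = sample_response(mixture_probs(p_pi, p_ref, beta, optimistic=False))
    return y1, y2

# Toy example: five candidate "responses" with made-up probabilities.
p_pi  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])   # current policy pi
p_ref = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # reference policy pi_ref
print(sample_preference_pair(p_pi, p_ref, beta=0.1))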
@feeelix_feng
Yunzhen Feng
5 months
Solution: Inject interpolations into on-policy data!
PILAF's philosophy:
🔍 Sample responses via policy interpolation → balance optimism (explore better actions) and conservatism (anchor to the reference policy)
➔ Generates more informative preference pairs for learning r̂ (5/11)
Tweet media one
1
0
8
@feeelix_feng
Yunzhen Feng
5 months
@KempeLab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience Why does on-policy sampling fail?
➔ Trains the reward model r̂ on π's local behaviors (overconfidence)
➔ Misses information for global alignment toward r* (underestimation)
The result: r̂ fails to guide policy optimization toward r*, yielding suboptimal policies! (4/11)
1
1
4
@feeelix_feng
Yunzhen Feng
5 months
@KempeLab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience Standard RLHF pipeline (repeated iteratively):
1️⃣ Collect preference data by sampling (y₁, y₂) from the current policy π
2️⃣ Train the reward model r̂ via MLE (assuming a Bradley-Terry model with human value r*)
3️⃣ Optimize π with r̂
(3/11)
Tweet media one
1
1
6
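As a reference for steps 2️⃣ and 3️⃣ in (3/11), these are the standard Bradley-Terry likelihood and KL-regularized objective usually meant by this pipeline (a sketch of the common formulation; the notation here is generic, not copied from the paper).

% Bradley-Terry preference model over a labeled pair (y_w \succ y_l) for prompt x:
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% Step 2: reward-model MLE on the collected preference pairs:
\hat{r} = \arg\max_{r_\theta}\;
  \mathbb{E}_{(x,\,y_w,\,y_l)}\bigl[\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr]

% Step 3: KL-regularized policy optimization against the learned reward:
\max_{\pi}\;
  \mathbb{E}_{x,\;y \sim \pi(\cdot\mid x)}\bigl[\hat{r}(x, y)\bigr]
  - \beta\,\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr)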
@feeelix_feng
Yunzhen Feng
5 months
You think on-policy sampling gives the best reward models? Think again! 🔥
Our finding: Even with on-policy data, reward models misalign with policy optimization goals!
Introducing PILAF—strategic sampling that fixes this fundamentally. (1/11)
Tweet media one
7
39
220