Yunzhen Feng Profile
Yunzhen Feng

@feeelix_feng

Followers: 346 · Following: 220 · Media: 7 · Statuses: 89

PhD at CDS, NYU. Ex-Intern at FAIR @AIatMeta. Previously undergrad at @PKU1898

Joined May 2022
@feeelix_feng
Yunzhen Feng
2 months
RT @KunhaoZ: 🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨 That’s not a bug — it’s a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you’re optimizing. You get…
0
139
0
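For context on the pass@1 vs. pass@k distinction in the retweet above: below is the standard unbiased estimator of pass@k as commonly used in code-generation evaluation. This is background, not something stated in the truncated tweet; the function name and example numbers are illustrative.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator for one problem:
    # n = samples drawn, c = correct samples, k = evaluation budget.
    # pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one
    # of k samples (chosen without replacement from the n) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples with 10 correct.
print(pass_at_k(100, 10, 1))    # 0.10 (pass@1)
print(pass_at_k(100, 10, 16))   # noticeably higher (pass@16)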
@feeelix_feng
Yunzhen Feng
2 months
Check out our poster tmr at 10am at the ICLR Bidirectional Human-AI Alignment workshop! We cover how on-policy preference sampling can be biased, and present our optimal response sampling for human labeling. @NYUDataScience @AIatMeta @KempeLab @YaqiDuanPKU
Tweet media one
@feeelix_feng
Yunzhen Feng
3 months
RT @roydanroy: I need to offer some clarification for this post because it would be wrong for people to lump this situation in with ones wh…
0
1
0
@feeelix_feng
Yunzhen Feng
3 months
RT @dohmatobelvis: We refused to cite the paper due to severe misconduct of the authors of that paper: plagiarism of our own prior work,…
0
27
0
@feeelix_feng
Yunzhen Feng
4 months
RT @dohmatobelvis: Papers accepted at @iclr_conf 2025:
- An Effective Theory of Bias Amplification
- Pitfalls of…
0
26
0
@feeelix_feng
Yunzhen Feng
4 months
RT @aviral_kumar2: A lot of work focuses on test-time scaling. But we aren't scaling it optimally, simply training a long CoT doesn't mean…
0
33
0
@feeelix_feng
Yunzhen Feng
4 months
RT @liuzhuang1234: How different are the outputs of various LLMs, and in what ways do they differ? Turns out, very very different, up to t…
0
85
0
@feeelix_feng
Yunzhen Feng
5 months
RT @xszheng2020: 🎉 Excited to share that "NullModel" has been accepted to #ICLR2025 as an oral presentation! I am on the Job Market! I am…
0
8
0
@feeelix_feng
Yunzhen Feng
5 months
RT @vivek_myers: Current robot learning methods are good at imitating tasks seen during training, but struggle to compose behaviors in new…
0
27
0
@feeelix_feng
Yunzhen Feng
5 months
PILAF's principles apply to DPO, PPO, and beyond!
For researchers working on:
✔️ Reward modeling theory
✔️ LLM alignment dynamics
This paper offers new insights: what makes preference data effective for RLHF? 📄 (11/11)
1
0
6
@feeelix_feng
Yunzhen Feng
5 months
This work reframes preference data collection: on-policy data is not enough for RLHF. It's NOT just about "gathering more data" or even "gathering on-policy data" – it's about strategically sampling data that maximally reduces reward model bias during policy evolution. 🧠 (10/11)
1
0
7
@feeelix_feng
Yunzhen Feng
5 months
Extreme test: Start with overfitted π.
Standard on-policy: limited reward with large KL and long responses.
PILAF: Escapes via interpolation-guided exploration and converges much better. (9/11)
Tweet media one
1
0
5
@feeelix_feng
Yunzhen Feng
5 months
PILAF vs. baselines in iterative/online DPO:
✅ Higher reward with lower KL divergence
✅ Saves annotation + computation for similar performance
✅ No hyperparameters to tune!
(8/11)
Tweet media one
Tweet media two
1
0
6
@feeelix_feng
Yunzhen Feng
5 months
Why does interpolation work?
1️⃣ Optimization:
➔ PILAF's gradient aligns with oracle reward r*'s policy gradient
➔ Ensures policy updates maximize r* (human values)
2️⃣ Statistical:
➔ Samples focus on high-sensitivity regions of r*
➔ Converges with consistently high rewards
(7/11)
Tweet media one
1
0
6
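To make the "gradient alignment" claim in (7/11) concrete, these are the two standard gradients being compared, written in generic RLHF notation (a sketch from the usual definitions, not necessarily the paper's exact statement); σ is the logistic function and β the KL coefficient.

% Gradient of the Bradley-Terry MLE loss on one labeled pair (y_w, y_l):
\nabla_\theta \ell(\theta)
  = -\bigl(1 - \sigma(\Delta)\bigr)
    \bigl(\nabla_\theta \hat{r}_\theta(x, y_w) - \nabla_\theta \hat{r}_\theta(x, y_l)\bigr),
  \qquad \Delta = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)

% Policy gradient of the oracle objective
% J(\pi_\phi) = \mathbb{E}_{y \sim \pi_\phi}[r^*(x, y)] - \beta\,\mathrm{KL}(\pi_\phi \,\|\, \pi_{\mathrm{ref}}):
\nabla_\phi J
  = \mathbb{E}_{y \sim \pi_\phi}\Bigl[\nabla_\phi \log \pi_\phi(y \mid x)
    \Bigl(r^*(x, y) - \beta \log \tfrac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\Bigr)\Bigr]

The thread's claim, as I read it, is that PILAF's sampling makes the first (data-dependent) gradient point in a direction consistent with the second.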
@feeelix_feng
Yunzhen Feng
5 months
PILAF: Policy-Interpolated Learning for Aligned Feedback.
An approximate sampling scheme: for each prompt,
with 50% prob: sample y₁, y₂ ~ π (current policy)
with 50% prob: sample y₁ ~ (1+β)π - βπ_ref (optimistic) and y₂ ~ (1-β)π + βπ_ref (conservative)
(6/11)
1
0
6
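A minimal, self-contained sketch of the pair-sampling rule in (6/11), over a toy discrete response space. The clip-and-renormalize handling of the extrapolated (1+β)π - βπ_ref distribution, and all names and numbers below, are illustrative assumptions, not the paper's decoding procedure.

import numpy as np

rng = np.random.default_rng(0)

def mixture_probs(p_pi, p_ref, beta, optimistic=True):
    # Literal reading of the formulas in (6/11), in probability space.
    # The optimistic case (1 + beta)*pi - beta*pi_ref can assign negative mass,
    # so clipping to zero and renormalizing is an assumption made here.
    if optimistic:
        p = (1 + beta) * p_pi - beta * p_ref
    else:
        p = (1 - beta) * p_pi + beta * p_ref
    p = np.clip(p, 0.0, None)
    return p / p.sum()

def sample_response(p):
    # Toy stand-in for generating a response: draw one item from a discrete distribution.
    return rng.choice(len(p), p=p)

def sample_preference_pair(p_pi, p_ref, beta=0.1):
    # PILAF-style pair sampling for one prompt, as described in (6/11):
    # with 50% probability take both responses from the current policy pi;
    # otherwise take one optimistic and one conservative sample.
    if rng.random() < 0.5:
        return sample_response(p_pi), sample_response(p_pi)
    y1 = sample_response(mixture_probs(p_pi, p_ref, beta, optimistic=True))
    y2 = sample_response(mixture_probs(p_pi, p_ref, beta, optimistic=False))
    return y1, y2

# Toy example: five candidate "responses" with made-up probabilities.
p_pi  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])   # current policy pi
p_ref = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # reference policy pi_ref
print(sample_preference_pair(p_pi, p_ref, beta=0.1))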
@feeelix_feng
Yunzhen Feng
5 months
Solution: Inject interpolations into on-policy data!
PILAF's philosophy:
🔍 Sample responses via policy interpolation → balance optimism (explore better actions) and conservatism (anchor to the reference policy)
➔ Generates more informative preference pairs for learning r̂ (5/11)
Tweet media one
1
0
8
@feeelix_feng
Yunzhen Feng
5 months
@KempeLab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience Why does on-policy sampling fail?
➔ Trains the reward model r̂ on π's local behaviors (overconfidence)
➔ Misses information for global alignment toward r* (underestimation)
The result: r̂ fails to guide policy optimization toward r*, yielding suboptimal policies! (4/11)
1
1
4
@feeelix_feng
Yunzhen Feng
5 months
@KempeLab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience Standard RLHF pipeline (repeated iteratively):
1️⃣ Collect preference data by sampling (y₁, y₂) from the current policy π
2️⃣ Train the reward model r̂ via MLE (assuming a Bradley-Terry model with human value r*)
3️⃣ Optimize π with r̂
(3/11)
Tweet media one
1
1
6
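As a reference for steps 2️⃣ and 3️⃣ in (3/11), these are the standard Bradley-Terry likelihood and KL-regularized objective usually meant by this pipeline (a sketch of the common formulation; the notation here is generic, not copied from the paper).

% Bradley-Terry preference model over a labeled pair (y_w \succ y_l) for prompt x:
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% Step 2: reward-model MLE on the collected preference pairs:
\hat{r} = \arg\max_{r_\theta}\;
  \mathbb{E}_{(x,\,y_w,\,y_l)}\bigl[\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr]

% Step 3: KL-regularized policy optimization against the learned reward:
\max_{\pi}\;
  \mathbb{E}_{x,\;y \sim \pi(\cdot\mid x)}\bigl[\hat{r}(x, y)\bigr]
  - \beta\,\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr)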
@feeelix_feng
Yunzhen Feng
5 months
You think on-policy sampling gives the best reward models? Think again! 🔥
Our finding: Even with on-policy data, reward models misalign with policy optimization goals!
Introducing PILAF—strategic sampling that fixes this fundamentally. (1/11)
Tweet media one
7
39
220