
Yunzhen Feng
@feeelix_feng
Followers: 346
Following: 220
Media: 7
Statuses: 89
PhD at CDS, NYU. Ex-Intern at FAIR @AIatMeta. Previously undergrad at @PKU1898
Joined May 2022
RT @KunhaoZ: 🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨 That’s not a bug — it’s a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you’re optimizing. You get…
0
139
0
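The retweeted thread is truncated, but the standard argument it gestures at can be written out. A minimal reconstruction, assuming a binary correctness reward and k i.i.d. samples per prompt (the i.i.d. formula and the two-prompt numbers are my illustration, not the thread's):

    J(\pi) \;=\; \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi(\cdot\mid x)}\bigl[r(x,y)\bigr] \;=\; \mathbb{E}_{x}\bigl[\mathrm{pass@1}(x)\bigr],
    \qquad \mathrm{pass@}k(x) \;=\; 1-\bigl(1-\mathrm{pass@1}(x)\bigr)^{k}.

Because 1-(1-p)^k is concave in p, raising pass@1 on prompts the policy already solves while losing it on hard ones can increase the RL objective yet decrease average pass@k. Two prompts with k = 4: p = (0.5, 0.5) gives mean pass@1 = 0.50 and mean pass@4 ≈ 0.94, while p = (1.0, 0.1) gives mean pass@1 = 0.55 but mean pass@4 ≈ 0.67.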
Check out our poster tomorrow at 10am at the ICLR Bidirectional Human-AI Alignment workshop! We cover how on-policy preference sampling can be biased, and our optimal response sampling for human labeling. @NYUDataScience @AIatMeta @KempeLab @YaqiDuanPKU
You think on-policy sampling gives the best reward models? Think again! 🔥
Our finding: Even with on-policy data, reward models misalign with policy optimization goals!
Introducing PILAF—strategic sampling that fixes this fundamentally. (1/11)
1
7
22
RT @roydanroy: I need to offer some clarification for this post because it would be wrong for people to lump this situation in with ones wh…
0
1
0
RT @dohmatobelvis: We refused to cite the paper due to severe misconduct of the authors of that paper: plagiarism of our own prior work,…
0
27
0
RT @dohmatobelvis: Papers accepted at @iclr_conf 2025:
- An Effective Theory of Bias Amplification
- Pitfalls of…
0
26
0
RT @aviral_kumar2: A lot of work focuses on test-time scaling. But we aren't scaling it optimally; simply training a long CoT doesn't mean…
0
33
0
RT @liuzhuang1234: How different are the outputs of various LLMs, and in what ways do they differ? Turns out, very very different, up to t…
0
85
0
RT @xszheng2020: 🎉 Excited to share that "NullModel" has been accepted to #ICLR2025 as an oral presentation! I am on the Job Market! I am…
0
8
0
RT @vivek_myers: Current robot learning methods are good at imitating tasks seen during training, but struggle to compose behaviors in new…
0
27
0
@KempeLab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience Why does on-policy sampling fail?
➔ Trains reward model r̂ on π's local behaviors (Overconfidence)
➔ Misses information for global alignment toward r* (Underestimation)
Both fail to guide policy optimization toward r*: suboptimal policies! (4/11)
1
1
4
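To make the failure mode in that tweet concrete: with the Bradley-Terry assumption from the pipeline (next tweet down, 3/11), the reward model is fit by maximum likelihood only on pairs drawn from the current policy. The display below is my paraphrase of that setup, not a formula quoted from the paper:

    \mathcal{L}(\hat r) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y_1, y_2 \sim \pi(\cdot\mid x)}\Bigl[-\log \sigma\bigl(\hat r(x, y^{+}) - \hat r(x, y^{-})\bigr)\Bigr],

where (y⁺, y⁻) is the human ordering of (y₁, y₂) under r* and σ is the logistic function. Since the expectation runs only over π's own samples, r̂ is tightly constrained where π already puts probability mass (overconfidence) and barely constrained on the responses π would need to move toward under r* (underestimation), so it gives poor guidance for policy optimization.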
@KempeLab @ArielKwiatkowsk @KunhaoZ @YaqiDuanPKU @AIatMeta @NYUDataScience Standard RLHF pipeline (iterated):
1️⃣ Collect preference data by sampling (y₁, y₂) from the current policy π
2️⃣ Train reward model r̂ via MLE (assuming a Bradley-Terry model with human value r*)
3️⃣ Optimize π with r̂
(3/11)
1
1
6
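A minimal sketch of that three-step loop in PyTorch-style Python, just to fix the moving parts. The policy and reward-model interfaces and the helper callables (a preference labeler and an RL update such as PPO) are placeholders assumed for illustration, not the paper's or any library's API; only the Bradley-Terry MLE step is spelled out.

import torch.nn.functional as F

def rlhf_loop(policy, reward_model, prompts, label_fn, rl_step_fn, optimizer_r, n_rounds=3):
    """Iterated RLHF pipeline as in the tweet (3/11):
    1) sample response pairs from the current policy pi,
    2) fit r_hat by MLE under a Bradley-Terry model of the human reward r*,
    3) optimize pi against r_hat."""
    for _ in range(n_rounds):
        # 1) Collect preference data: (y1, y2) ~ pi(. | x)
        pairs = [(x, policy.sample(x), policy.sample(x)) for x in prompts]

        # 2) Bradley-Terry MLE: minimize -log sigma(r_hat(y+) - r_hat(y-))
        for x, y1, y2 in pairs:
            y_pos, y_neg = label_fn(x, y1, y2)          # human (or proxy) ordering
            margin = reward_model(x, y_pos) - reward_model(x, y_neg)
            loss = -F.logsigmoid(margin)
            optimizer_r.zero_grad()
            loss.backward()
            optimizer_r.step()

        # 3) Policy optimization against the learned reward (e.g. PPO)
        policy = rl_step_fn(policy, reward_model, prompts)
    return policy, reward_model

As I read the thread, PILAF's intervention is in step 1 — which (y₁, y₂) get sampled for human labeling — rather than in the loop structure itself.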