Sangkyu Lee

@oddqueue

Followers: 27
Following: 5
Media: 6
Statuses: 7

M.S. student, Yonsei University

Seoul, Republic of Korea
Joined February 2024
@oddqueue
Sangkyu Lee
1 year
Want to know how to do genuine SELF-improvement? ☝️ We present ⚖️ SELF-JUDGE ⚖️, which teaches pairwise comparison through instruction tuning for on-policy alignment learning. All you need is a single policy model! ( ❌ Reward Model ❌ Teacher Model) [1/n]
2
12
33
@oddqueue
Sangkyu Lee
1 year
Please check out our work for more details and experiments! 🙌 Paper: Code: (available soon!) This work was done with my best co-authors! @SungdongKim4, @ashkan_yousefpr, @seo_minjoon, @kaniblu, @YoungjaeYu3
github.com
[ACL 2024] The official implementation of "Aligning Large Language Models by On-Policy Self-Judgment" - oddqueue/self-judge
0
0
1
@oddqueue
Sangkyu Lee
1 year
Further analysis reveals that simply training with principles and rationales for judgment leads to more robust evaluation, without chain-of-thought reasoning during inference. 🧠 ❌ [6/n]
1
0
1
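A rough illustration of what [6/n] above describes, with a hypothetical target format (the actual templates are defined in the paper and repo): training targets for the judgment task carry a principle and a rationale, while inference reads only the final decision token, so no chain-of-thought is generated.

```python
# Hypothetical target format; the real templates live in the paper/repo.
def build_judge_target(principle: str, rationale: str, decision: str,
                       training: bool = True) -> str:
    """Training targets include a principle and rationale; at inference only
    the single decision token is read, so no chain-of-thought is generated."""
    if training:
        return f"Principle: {principle}\nRationale: {rationale}\nDecision: {decision}"
    return decision  # e.g. "A" or "B"
```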
@oddqueue
Sangkyu Lee
1 year
Experimental results show that this simple framework outperforms existing methods, even though on-policy training requires only the parameters of the current policy and the reference policy, the same amount as offline DPO. 💪 [5/n]
1
0
0
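For context on the memory claim in [5/n] above: a generic DPO-style preference loss over a self-judged (chosen, rejected) pair only needs log-probabilities from the current policy and a frozen reference policy. This is a standard sketch of that loss, not the exact objective from the paper.

```python
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO-style loss: only the current policy and a frozen reference
    policy are held in memory, the same footprint as offline DPO."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```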
@oddqueue
Sangkyu Lee
1 year
This initial policy can self-improve by judging responses it generates itself, for both training and inference: 1️⃣ Perform judgments on the current policy's on-policy samples, not only act as a reference policy. 2️⃣ Conduct rejection sampling through a tournament. [4/n]
1
0
0
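A sketch of the tournament-style rejection sampling described in [4/n] above, assuming the candidate responses have already been sampled from the current policy and `judge` is a pairwise comparator such as the single-token judge sketched under the [3/n] tweet below. The names here are illustrative, not taken from the repo.

```python
import random
from typing import Callable, List

def tournament_select(instruction: str, responses: List[str],
                      judge: Callable[[str, str, str], str]) -> str:
    """Pick the best of several on-policy samples via pairwise self-judgment.

    judge(instruction, a, b) returns "A" if a is preferred, else "B".
    """
    candidates = list(responses)
    random.shuffle(candidates)  # avoid a fixed bracket order
    while len(candidates) > 1:
        next_round = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            next_round.append(a if judge(instruction, a, b) == "A" else b)
        if len(candidates) % 2 == 1:  # an odd one out advances automatically
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]
```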
@oddqueue
Sangkyu Lee
1 year
To overcome the trade-off of separate evaluators for on-policy learning, we train an initial policy that can also perform pairwise judgments between response pairs with a single token prediction, simply by changing the prompt. ⚖️ [3/n]
1
0
0
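A minimal sketch of the single-token pairwise judgment in [3/n] above, using Hugging Face Transformers. The prompt template, the model path, and the choice tokens "A"/"B" are assumptions for illustration; the real format is in the paper and repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical judge prompt; the actual template is defined in the paper/repo.
JUDGE_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Which response is better? Answer with A or B.\n"
    "Answer:"
)

def single_token_judge(model, tokenizer, instruction, response_a, response_b):
    """Judge a response pair with one next-token prediction from the policy."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    id_a = tokenizer.convert_tokens_to_ids("A")
    id_b = tokenizer.convert_tokens_to_ids("B")
    return "A" if next_token_logits[id_a] >= next_token_logits[id_b] else "B"

# Usage (hypothetical checkpoint path):
# model = AutoModelForCausalLM.from_pretrained("path/to/self-judge-policy")
# tokenizer = AutoTokenizer.from_pretrained("path/to/self-judge-policy")
```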
@oddqueue
Sangkyu Lee
1 year
Previous studies have chosen one of two strategies: 1️⃣ Use separate evaluators for on-policy learning. 2️⃣ Give up separate evaluators to avoid the overhead. In other words, on-policy learning was not available for self-improvement, but not for us! 😎 [2/n]
1
0
1