Sangkyu Lee

@oddqueue

Followers: 27
Following: 5
Media: 6
Statuses: 7

M.S. student, Yonsei University

Seoul, Republic of Korea
Joined February 2024
@oddqueue
Sangkyu Lee
1 year
Want to know how to do genuine SELF-improvement? ☝️ We present ⚖️ SELF-JUDGE ⚖️, which teaches pairwise comparison through instruction tuning for on-policy alignment learning. All you need is a single policy model! ( ❌ Reward Model ❌ Teacher Model) [1/n]
2
12
33
@oddqueue
Sangkyu Lee
1 year
Please check out our work for more details and experiments! 🙌 Paper: Code: (available soon!) This work was done with my best co-authors! @SungdongKim4, @ashkan_yousefpr, @seo_minjoon, @kaniblu, @YoungjaeYu3
github.com
[ACL 2024] The official implementation of "Aligning Large Language Models by On-Policy Self-Judgment" - oddqueue/self-judge
0
0
1
@oddqueue
Sangkyu Lee
1 year
Further analysis reveals that simply training with principles and rationales for judgment leads to more robust evaluation, without chain-of-thought reasoning during inference. 🧠 ❌ [6/n]
1
0
1
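A rough illustration of what [6/n] above describes, with a hypothetical target format (the actual templates are defined in the paper and repo): training targets for the judgment task carry a principle and a rationale, while inference reads only the final decision token, so no chain-of-thought is generated.

```python
# Hypothetical target format; the real templates live in the paper/repo.
def build_judge_target(principle: str, rationale: str, decision: str,
                       training: bool = True) -> str:
    """Training targets include a principle and rationale; at inference only
    the single decision token is read, so no chain-of-thought is generated."""
    if training:
        return f"Principle: {principle}\nRationale: {rationale}\nDecision: {decision}"
    return decision  # e.g. "A" or "B"
```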
@oddqueue
Sangkyu Lee
1 year
Experimental results show that this simple framework outperforms existing methods, even though on-policy training requires only the parameters of the current policy and the reference policy, the same amount as offline DPO. 💪 [5/n]
1
0
0
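For context on the memory claim in [5/n] above: a generic DPO-style preference loss over a self-judged (chosen, rejected) pair only needs log-probabilities from the current policy and a frozen reference policy. This is a standard sketch of that loss, not the exact objective from the paper.

```python
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO-style loss: only the current policy and a frozen reference
    policy are held in memory, the same footprint as offline DPO."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```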
@oddqueue
Sangkyu Lee
1 year
This initial policy can self-improve by judging responses it generates itself, for both training and inference: 1️⃣ Perform judgments on the current policy's on-policy samples, not only act as a reference policy. 2️⃣ Conduct rejection sampling through a tournament. [4/n]
1
0
0
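A sketch of the tournament-style rejection sampling described in [4/n] above, assuming the candidate responses have already been sampled from the current policy and `judge` is a pairwise comparator such as the single-token judge sketched under the [3/n] tweet below. The names here are illustrative, not taken from the repo.

```python
import random
from typing import Callable, List

def tournament_select(instruction: str, responses: List[str],
                      judge: Callable[[str, str, str], str]) -> str:
    """Pick the best of several on-policy samples via pairwise self-judgment.

    judge(instruction, a, b) returns "A" if a is preferred, else "B".
    """
    candidates = list(responses)
    random.shuffle(candidates)  # avoid a fixed bracket order
    while len(candidates) > 1:
        next_round = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            next_round.append(a if judge(instruction, a, b) == "A" else b)
        if len(candidates) % 2 == 1:  # an odd one out advances automatically
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]
```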
@oddqueue
Sangkyu Lee
1 year
To overcome the trade-off of separate evaluators for on-policy learning, we train an initial policy that can also perform pairwise judgments between response pairs with a single token prediction, simply by changing the prompt. ⚖️ [3/n]
1
0
0
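A minimal sketch of the single-token pairwise judgment in [3/n] above, using Hugging Face Transformers. The prompt template, the model path, and the choice tokens "A"/"B" are assumptions for illustration; the real format is in the paper and repo.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical judge prompt; the actual template is defined in the paper/repo.
JUDGE_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Which response is better? Answer with A or B.\n"
    "Answer:"
)

def single_token_judge(model, tokenizer, instruction, response_a, response_b):
    """Judge a response pair with one next-token prediction from the policy."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    id_a = tokenizer.convert_tokens_to_ids("A")
    id_b = tokenizer.convert_tokens_to_ids("B")
    return "A" if next_token_logits[id_a] >= next_token_logits[id_b] else "B"

# Usage (hypothetical checkpoint path):
# model = AutoModelForCausalLM.from_pretrained("path/to/self-judge-policy")
# tokenizer = AutoTokenizer.from_pretrained("path/to/self-judge-policy")
```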
@oddqueue
Sangkyu Lee
1 year
Previous studies have chosen one of two strategies: 1️⃣ Use separate evaluators for on-policy learning. 2️⃣ Give up separate evaluators to avoid the overhead. In other words, on-policy learning was not available for self-improvement, but not for us! 😎 [2/n]
1
0
1