Kevin Yang
@kevinyang41
Followers
494
Following
73
Media
13
Statuses
69
Research scientist at @scaledcognition, previously PhD at @BerkeleyNLP, interested in better control and factuality for LLM outputs, especially for long context.
Mountain View
Joined September 2019
Most enterprises we engage with face an important decision when it comes to AI in customer experience: ⛓️ Use Dialog Trees: Predictable but rigid. Customers feel like they’re in an escape room just trying to reach a human. 🤖 Use LLMs: Flexible but unreliable. Hallucinations,
0
3
6
We’re actively hiring researchers! If you’re interested in building highly reliable specialized models for agentic use cases, come join us @ScaledCognition! Our work ranges from low-level modeling advances to synthetic data generation and evaluation, and is directly impacting
2
6
10
@ScaledCognition + @Genesys = a new era of action-driven AI for CX. Together, we’re helping enterprises deploy deterministic systems that deliver reliable, policy-aligned outcomes — built for action, not just words. Learn more →
genesys.com
Partnership includes Genesys investment in Scaled Cognition to support large action model innovation for CX workflows that enable a new level of reliabi...
0
9
13
Will be at NAACL next week, excited to share two of our papers: FACTTRACK: Time-Aware World State Tracking in Story Outlines https://t.co/1KcL0aCWCI THOUGHTSCULPT: Reasoning with Intermediate Revision and Search https://t.co/ZGqvEeReHr Shoutout to first authors @ZhihengLyu and
0
4
10
Been building some very cool stuff at @ScaledCognition over the past year, excited to finally be out there!
We’re Scaled Cognition, developing the first ever models trained specifically for agentic applications: 1. Our first system, APT-1, is now #1 on agentic benchmarks. 2. It was developed by a US team for a total cost of less than $11M. 3. Khosla Ventures led our seed round ($21M
1
0
30
(6/6) 🎉 Huge thanks to our amazing team: @ZhihengLyu @kevinyang41 @ikekong and Dan from @hkuniversity and @BerkeleyNLP! 🌟 Excited to push FactTrack further in dynamic reasoning and real-world applications! 🚀 #NAACL2025 #NLP #AI #FactTracking
1
1
3
Cool! This work is very similar to ours published a couple of months ago. https://t.co/b9swWuTuh9 Really glad to see the idea shown effective in different tasks! Also wonder what the performance would be like with larger LLaMA models or GPT-4
arxiv.org
We present THOUGHTSCULPT, a general reasoning and search method for tasks with outputs that can be decomposed into components. THOUGHTSCULPT explores a search tree of potential solutions using...
"Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-38B" From 25.47% to 45.49% in GSM-Hard 🤯 Also noting in this regard, the head of Deepmind said last year that augmenting LLMs with Monte Carlo Tree Search may be the fastest path
0
1
2
We propose a reasoning method, ThoughtSculpt, that focuses on replanning and revision for tasks with inherently compositional outputs. Great work by Yizhou!
📝Presenting ThoughtSculpt - a general reasoning & search approach for tasks with decomposable outputs. Leveraging Monte Carlo Tree Search, it surpasses existing methods across diverse tasks! (1/N) arxiv: https://t.co/b9swWuTuh9
0
0
5
I'll be presenting this RLCD work at ICLR in Vienna next week! Come stop by if you want to chat!
Excited to share our new preprint on simulating RLHF preference data more effectively: "RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment"! RLCD outperforms strong baselines on three alignment tasks across multiple LLaMA scales. 1/7
0
1
7
Our RLCD paper was accepted at #ICLR2024! Super simple yet effective approach to generate data for training reward models, avoiding expensive/laborious human labeling. Performance is particularly strong on 7B models, which cannot do RLAIF properly. Thanks @kevinyang41 for the
Now there is a new way to do RLHF! We propose RLCD (Reinforcement Learning from Contrast Distillation) that uses contrastive (i.e., positive/negative) prompts to generate pos/neg responses useful to train reward models for RLHF. Compared to Constitutional AI, RLCD achieves
0
5
25
Excited to share our new EMNLP Findings paper on better control of pacing--e.g., how vague or detailed a text passage is, and keeping it more consistent--in long-form outputs (stories) by LLMs! Great work by Yichen (undergrad!), who did pretty much all the work for this paper.
Honored to share our exciting paper on pacing!🎉 #EMNLP2023 Have you suffered overly verbose or vague LLM outputs? 👺 ✨Pacing is vital!✨ We try to improve pacing in long-form story planning.📚 All applause and thanks to my mentor @kevinyang41 first! [1/11]
1
0
11
Generate a high quality story plot containing thousands of tokens automatically with one click and in less than 30 seconds! 😺 Introducing our end-to-end story plot generator, E2EPlot, which is fast and easy to fine-tune! https://t.co/Ousik2zm6o
1
4
14
Evaluators might have different preferences in open-ended text generation, so we explore personalizing eval. Great results (outperforming GPT-4)! Was really fun to collaborate on this project-- thanks @dqwang122, Hanlin Zhu, @BIT_silence, @andrew_e_cohen, @tydsh, @lileics!
📚🌟 Evaluate any story to your heart's content with our new personalized story evaluation model, PerSE! No more worries about diverse preferences - get your own story evaluation report now! 📝🎯 https://t.co/uRIGBlnGAI 1/5
0
1
5
Introducing COLM (https://t.co/7T42bAAQa4), the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models. Submissions: March 15 (it's pronounced "collum" 🕊️)
31
425
2K
Thanks to great coauthors Dan Klein, @real_asli, @VioletNPeng, @tydsh! Check out our paper at https://t.co/rLCWBvE23J for details, and code coming soon! 7/7
2
0
10
Very excited to continue exploring new/better ways to do the preference data simulation in these alignment pipelines; I think there's a lot of potential room for further improvement and RLCD is just the tip of the iceberg!
1
0
4
Check out the difference in this output example on the helpfulness prompts. Obviously not every example looks like this, but the difference is really very large at LLaMA-7B scale for preference data simulation. 5/7
1
0
7
RLCD is equal to or better than baselines--in most cases substantially better--on human evals across multiple tasks and LLaMA model scales for preference data simulation. RLAIF is competitive at 30B, but RLCD is otherwise much better than baselines, especially so at 7B. 4/7
1
0
7
Why's this better? We'd like training examples to be (1) close to the label boundary, yet still (2) accurately labeled. But the initial unaligned LLM may give very noisy labels; RLCD trades off (1) for much better (2), i.e. pushes outputs farther apart to get cleaner labels. 3/7
1
0
9