Kevin Yang
@kevinyang41
Followers
494
Following
73
Media
13
Statuses
69
Research scientist at @scaledcognition, previously PhD at @BerkeleyNLP, interested in better control and factuality for LLM outputs, especially for long context.
Mountain View
Joined September 2019
Most enterprises we engage with face an important decision when it comes to AI in customer experience: ⛓️ Use Dialog Trees: Predictable but rigid. Customers feel like they’re in an escape room just trying to reach a human. 🤖 Use LLMs: Flexible but unreliable. Hallucinations,
0
3
6
We’re actively hiring researchers! If you’re interested in building highly reliable specialized models for agentic use cases, come join us @ScaledCognition! Our work ranges from low-level modeling advances to synthetic data generation and evaluation, and is directly impacting
2
6
10
@ScaledCognition + @Genesys = a new era of action-driven AI for CX. Together, we’re helping enterprises deploy deterministic systems that deliver reliable, policy-aligned outcomes — built for action, not just words. Learn more →
genesys.com
Partnership includes Genesys investment in Scaled Cognition to support large action model innovation for CX workflows that enable a new level of reliabi...
0
9
13
Will be at NAACL next week, excited to share two of our papers: FACTTRACK: Time-Aware World State Tracking in Story Outlines https://t.co/1KcL0aCWCI THOUGHTSCULPT: Reasoning with Intermediate Revision and Search https://t.co/ZGqvEeReHr Shoutout to first authors @ZhihengLyu and
0
4
10
Been building some very cool stuff at @ScaledCognition over the past year, excited to finally be out there!
We’re Scaled Cognition, developing the first ever models trained specifically for agentic applications: 1. Our first system, APT-1, is now #1 on agentic benchmarks. 2. It was developed by a US team for a total cost of less than $11M. 3. Khosla Ventures led our seed round ($21M
1
0
30
(6/6) 🎉 Huge thanks to our amazing team: @ZhihengLyu @kevinyang41 @ikekong and Dan from @hkuniversity and @BerkeleyNLP! 🌟 Excited to push FactTrack further in dynamic reasoning and real-world applications! 🚀 #NAACL2025 #NLP #AI #FactTracking
1
1
3
Cool! This work is very similar to ours published a couple of months ago. https://t.co/b9swWuTuh9 Really glad to see the idea shown effective in different tasks! Also wonder what the performance would be like with larger LLaMA models or GPT-4
arxiv.org
We present THOUGHTSCULPT, a general reasoning and search method for tasks with outputs that can be decomposed into components. THOUGHTSCULPT explores a search tree of potential solutions using...
"Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-38B" From 25.47% to 45.49% in GSM-Hard 🤯 Also noting in this regard, the head of Deepmind said last year that augmenting LLMs with Monte Carlo Tree Search may be the fastest path
0
1
2
We propose a reasoning method, ThoughtSculpt, that focuses on replanning and revision for tasks with inherently compositional outputs. Great work by Yizhou!
📝Presenting ThoughtSculpt - a general reasoning & search approach for tasks with decomposable outputs. Leveraging Monte Carlo Tree Search, it surpasses existing methods across diverse tasks! (1/N) arxiv: https://t.co/b9swWuTuh9
0
0
5
I'll be presenting this RLCD work at ICLR in Vienna next week! Come stop by if you want to chat!
Excited to share our new preprint on simulating RLHF preference data more effectively: "RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment"! RLCD outperforms strong baselines on three alignment tasks across multiple LLaMA scales. 1/7
0
1
7
Our RLCD paper was accepted at #ICLR2024! Super simple yet effective approach to generate data for training reward models, avoiding expensive/laborious human labeling. Performance is particularly strong on 7B models, which cannot do RLAIF properly. Thanks @kevinyang41 for the
Now there is a new way to do RLHF! We propose RLCD (Reinforcement Learning from Contrast Distillation) that uses contrastive (i.e., positive/negative) prompts to generate pos/neg responses useful to train reward models for RLHF. Compared to Constitutional AI, RLCD achieves
0
5
25
Excited to share our new EMNLP Findings paper on better control of pacing--e.g., how vague or detailed a text passage is, and keeping it more consistent--in long-form outputs (stories) by LLMs! Great work by Yichen (undergrad!), who did pretty much all the work for this paper.
Honored to share our exciting paper on pacing!🎉 #EMNLP2023 Have you suffered overly verbose or vague LLM outputs? 👺 ✨Pacing is vital!✨ We try to improve pacing in long-form story planning.📚 All applause and thanks to my mentor @kevinyang41 first! [1/11]
1
0
11
Generate a high quality story plot containing thousands of tokens automatically with one click and in less than 30 seconds! 😺 Introducing our end-to-end story plot generator, E2EPlot, which is fast and easy to fine-tune! https://t.co/Ousik2zm6o
1
4
14
Evaluators might have different preferences in open-ended text generation, so we explore personalizing eval. Great results (outperforming GPT-4)! Was really fun to collaborate on this project-- thanks @dqwang122, Hanlin Zhu, @BIT_silence, @andrew_e_cohen, @tydsh, @lileics!
📚🌟 Evaluate any story to your heart's content with our new personalized story evaluation model, PerSE! No more worries about diverse preferences - get your own story evaluation report now! 📝🎯 https://t.co/uRIGBlnGAI 1/5
0
1
5
Introducing COLM (https://t.co/7T42bAAQa4), the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models. Submissions: March 15 (it's pronounced "collum" 🕊️)
31
425
2K
Thanks to great coauthors Dan Klein, @real_asli, @VioletNPeng, @tydsh! Check out our paper at https://t.co/rLCWBvE23J for details, and code coming soon! 7/7
2
0
10
Very excited to continue exploring new/better ways to do the preference data simulation in these alignment pipelines; I think there's a lot of potential room for further improvement and RLCD is just the tip of the iceberg!
1
0
4
Check out the difference in this output example on the helpfulness prompts. Obviously not every example looks like this, but the difference is really very large at LLaMA-7B scale for preference data simulation. 5/7
1
0
7
RLCD is equal to or better than baselines--in most cases substantially better--on human evals across multiple tasks and LLaMA model scales for preference data simulation. RLAIF is competitive at 30B, but RLCD is otherwise much better than baselines, especially so at 7B. 4/7
1
0
7
Why's this better? We'd like training examples to be (1) close to the label boundary, yet still (2) accurately labeled. But the initial unaligned LLM may give very noisy labels; RLCD trades off (1) for much better (2), i.e. pushes outputs farther apart to get cleaner labels. 3/7
1
0
9