Sohee Yang Profile
Sohee Yang

@soheeyang_

Followers: 2K
Following: 4K
Media: 39
Statuses: 185

PhD student/research scientist intern at @ucl_nlp/@GoogleDeepMind (50/50 split). Previously MS at @kaist_ai and research engineer at Naver Clova. #NLProc & ML

London, United Kingdom
Joined August 2020
@soheeyang_
Sohee Yang
3 months
Our paper "Do Large Language Models Perform Latent Multi-Hop Reasoning without exploiting shortcuts?" will be presented at #ACL2025 today. ๐Ÿ“ Mon 18:00-19:30 Findings Posters (Hall X4 X5) Please visit our poster if you are interested! โœจ
@soheeyang_
Sohee Yang
11 months
🚨 New Paper 🚨 Can LLMs perform latent multi-hop reasoning without exploiting shortcuts? We find the answer is yes – they can recall and compose facts not seen together in training or guessing the answer, but success greatly depends on the type of the bridge entity (80%+ for
0
10
73
@megamor2
Mor Geva
2 months
More parameters and inference-time compute are NOT always better. In @soheeyang_'s #EMNLP2025 Findings paper, we show that larger reasoning models struggle more to recover from injected unhelpful thoughts 💉 This fragility extends to jailbreak attacks 🦹‍♂️ https://t.co/gYDrNhGSeW
@soheeyang_
Sohee Yang
5 months
🚨 New Paper 🧵 How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
0
1
24
@joshuaongg21
Joshua Ong @ EMNLP2025
2 months
We introduce PiCSAR (Probabilistic Confidence Selection And Ranking) 💡: a simple, training-free method that scores samples by probabilistic confidence and selects the reasoning chain with the highest confidence from multiple sampled responses. ✍️ PiCSAR is generalisable across
2
30
93
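The selection rule described in the tweet above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the `Sample` container and the use of mean token log-probability as the "probabilistic confidence" score are assumptions made for the example.

```python
# Illustrative sketch of confidence-based selection over sampled reasoning chains
# (best-of-N, training-free). The paper's exact confidence definition may differ;
# here it is approximated by the length-normalized log-probability of each sample.
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str                     # full reasoning chain + final answer
    token_logprobs: List[float]   # per-token log-probabilities from the decoder


def confidence(sample: Sample) -> float:
    """Mean token log-probability as a stand-in for probabilistic confidence."""
    return sum(sample.token_logprobs) / max(len(sample.token_logprobs), 1)


def select_most_confident(samples: List[Sample]) -> Sample:
    """Training-free selection: keep the sampled chain scored as most confident."""
    return max(samples, key=confidence)
```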
@soheeyang_
Sohee Yang
3 months
🔮 What will Trump's tariffs be? Which AI company will build the best LLM this year? When will AGI arrive? 📈 We just released a position paper arguing that the time is ripe for large-scale training to approach superforecaster-level event forecasting LLMs!
@sangwoolee_
Sang-Woo Lee
3 months
Paper Title: Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts Link: https://t.co/CYnf3Qp2TK w/ @soheeyang_ @eastin_kwak @noahysiegel
0
0
7
@megamor2
Mor Geva
3 months
๐Ÿ“2025-07-28 18:00 - 19:30 Hall 4/5 (and GEM workshop) @soheeyang_ will present the results of our investigation at @GoogleDeepMind on whether LLMs can perform latent multi-hop reasoning without exploiting shortcuts https://t.co/SyPL0ARWUP @KassnerNora @elenagri_ @riedelcastro
@soheeyang_
Sohee Yang
11 months
🚨 New Paper 🚨 Can LLMs perform latent multi-hop reasoning without exploiting shortcuts? We find the answer is yes – they can recall and compose facts not seen together in training or guessing the answer, but success greatly depends on the type of the bridge entity (80%+ for
1
2
6
@soheeyang_
Sohee Yang
5 months
We call for improving self-reevaluation for safer & more reliable reasoning models! Work done w/ @sangwoolee_, @KassnerNora, @dhgottesman, @riedelcastro, and @megamor2 mainly at @TelAvivUni with some at @GoogleDeepMind ✨ See our paper for details 👉 https://t.co/ShwG5HSuVg 🧵🔚
0
0
9
@soheeyang_
Sohee Yang
5 months
- Normal scaling for the attack in the user input for R1-Distill models: robustness doesn't transfer between attack formats
- Real-world concern: large reasoning models (e.g., OpenAI o1) perform tool use in their thinking process, which can expose them to harmful thought injection 13/N
1
0
8
@soheeyang_
Sohee Yang
5 months
Implications for Jailbreak Robustness 🚨 We perform an "irrelevant harmful thought injection attack" w/ HarmBench:
- Harmful question (irrelevant to the user input) + jailbreak prompt in the thinking process
- Non/inverse-scaling trend: smallest models are the most robust for 3 model families! 12/N
1
0
6
@soheeyang_
Sohee Yang
5 months
We also test:
- Explicit instruction to self-reevaluate ➡ minimal gains (−0.05 to +0.02)
- "Aha moment" trigger, appending "But wait, let me think again" ➡ some help (+0.15 to +0.34 for incorrect/misdirecting thoughts), but absolute performance is still low, <~50% of that w/o injection 11/N
1
0
7
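A minimal sketch of the two mitigations mentioned in the tweet above. The instruction wording and function names are illustrative assumptions, not the exact templates from the paper; the trigger phrase is the one quoted in the tweet.

```python
# Illustrative only: the exact instruction and prefill format in the paper may differ.

def add_reevaluation_instruction(user_prompt: str) -> str:
    """Mitigation 1: explicitly instruct the model to reevaluate its own thinking."""
    return (
        user_prompt
        + "\nBefore answering, check whether your reasoning so far actually addresses this question."
    )


def add_aha_trigger(prefilled_thinking: str) -> str:
    """Mitigation 2: append an "aha moment" trigger after the injected thought."""
    return prefilled_thinking + "\nBut wait, let me think again."
```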
@soheeyang_
Sohee Yang
5 months
Failure (majority of cases):
- 28/30 are completely distracted and continue following the style of the irrelevant thought
- In 29/30 of the cases, "aha moments" are triggered, but only for local self-reevaluation
- Models' self-reevaluation ability is far from general "meta-cognitive" awareness 10/N
1
0
6
@soheeyang_
Sohee Yang
5 months
Our manual analysis of 30 thought continuations for short irrelevant thoughts reveals that ➡️
Success (minority of the cases):
- 16/30 use "aha moments" to recognize the wrong question
- 9/30 ground back to the given question with CoT in the response
- 5/30 are correct by chance for MCQA 9/N
1
0
6
@soheeyang_
Sohee Yang
5 months
Surprising Finding: Non/Inverse-Scaling 📉 Larger models struggle MORE with short (cut at 10%) irrelevant thoughts!
- 7B model shows 1.3x higher absolute performance than 70B model
- Consistent across R1-Distill, s1.1, and EXAONE Deep families and all evaluation datasets 8/N
1
0
6
@soheeyang_
Sohee Yang
5 months
Stage 2 Results: Dramatic Recovery Failures ❌ Severe reasoning performance drop across all thought types:
- Drops for ALL unhelpful thought injection
- Most severe: irrelevant, incorrect, and full-length misdirecting thoughts
- Extreme case: 92% relative performance drop 7/N
1
0
6
@soheeyang_
Sohee Yang
5 months
Stage 1 Results: Good at Identification ✅ Five (7B-70B) R1-Distill models show high classification accuracy for most unhelpful thoughts:
- Uninformative & irrelevant thoughts: ~90%+ accuracy
- Performance improves with model size
- Only struggle with incorrect thoughts 6/N
1
0
6
@soheeyang_
Sohee Yang
5 months
We evaluate on 5 reasoning datasets across 3 domains: AIME 24 (math), ARC Challenge (science), GPQA Diamond (science), HumanEval (coding), and MATH-500 (math). 5/N
1
0
6
@soheeyang_
Sohee Yang
5 months
We test four types of unhelpful thoughts:
1. Uninformative: rambling w/o problem-specific information
2. Irrelevant: solving completely different questions
3. Misdirecting: tackling slightly different questions
4. Incorrect: thoughts with mistakes leading to wrong answers 4/N
1
0
6
@soheeyang_
Sohee Yang
5 months
We use a two-stage evaluation ⚖️
Identification Task:
- Can models identify unhelpful thoughts when explicitly asked?
- Kinda prerequisite for recovery
Recovery Task:
- Can models recover when unhelpful thoughts are injected into their thinking process?
- Self-reevaluation test 3/N
1
0
7
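A rough sketch of how the two tasks above could be set up with a chat-style reasoning model. The prompt templates and the `<think>` prefill convention are assumptions made for illustration; they are not the paper's exact protocol.

```python
# Illustrative sketch of the two-stage evaluation; wording and tags are assumptions.

def identification_prompt(question: str, thought: str) -> str:
    """Stage 1: ask the model directly whether a candidate thought is helpful."""
    return (
        f"Question: {question}\n\n"
        f"Candidate thought: {thought}\n\n"
        "Is this thought helpful for solving the question? Answer 'helpful' or 'unhelpful'."
    )


def recovery_prefill(question: str, unhelpful_thought: str) -> str:
    """Stage 2: inject an unhelpful thought by prefilling the model's thinking
    process, then let it continue generating to see whether it recovers."""
    return f"{question}\n<think>\n{unhelpful_thought}"
```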
@soheeyang_
Sohee Yang
5 months
Reasoning models show impressive problem-solving performance via thinking with "aha moments" where they pause & reevaluate their approach; some refer to this as "meta-cognitive" behavior. But how effectively do they perform self-reevaluation, e.g., recover from unhelpful thoughts?
1
0
7
@soheeyang_
Sohee Yang
5 months
🚨 New Paper 🧵 How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
3
27
131
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
5 months
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? "We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant
4
12
114