Yu Feng Profile
Yu Feng (@AnnieFeng6)

373 Followers · 19 Following · 17 Media · 28 Statuses

CS PhD Student @Penn | NLP & ML @cogcomp @upennnlp @duke_nlp @RUC | 🧗🏻‍♀️🎨🩰🎹

Joined August 2018
@AnnieFeng6
Yu Feng
7 months
#ICLR2025 Oral LLMs often struggle with reliable and consistent decisions under uncertainty 😵‍💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD …
2 replies · 40 reposts · 259 likes
@AnnieFeng6
Yu Feng
13 days
Building on this verification capability, we leverage VeriCoT to actively enhance an LLM’s reasoning. - We use VeriCoT for inference-time self-reflection - VeriCoT helps create a high-fidelity dataset of verified CoTs for SFT and as a source of pairwise reward signals for DPO
0 replies · 0 reposts · 1 like
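A minimal sketch of how VeriCoT verdicts could be turned into DPO preference pairs as described above; `verify_cot`, the pair format, and the pairing strategy are my assumptions, not the authors' released code. The same verified CoTs can be kept as-is for the SFT set.

```python
# Hypothetical sketch: build DPO preference pairs from VeriCoT verdicts.
# For a given question, any CoT that passes verification is preferred
# ("chosen") over any CoT that fails ("rejected").
from itertools import product

def build_dpo_pairs(question, candidate_cots, verify_cot):
    """verify_cot(question, cot) -> bool stands in for VeriCoT itself."""
    verdicts = [(cot, verify_cot(question, cot)) for cot in candidate_cots]
    chosen = [cot for cot, ok in verdicts if ok]
    rejected = [cot for cot, ok in verdicts if not ok]
    return [
        {"prompt": question, "chosen": good, "rejected": bad}
        for good, bad in product(chosen, rejected)
    ]
```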
@AnnieFeng6
Yu Feng
13 days
VeriCoT can detect ungrounded or incorrect reasoning, and VeriCoT-validation is a strong predictor of final answer correctness, with validated CoT attaining higher precision than task-level accuracy. (4/n)
1 reply · 0 reposts · 1 like
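The claim above reduces to a simple computation; a sketch with hypothetical record fields, not the paper's data format:

```python
# Hypothetical sketch: precision among answers whose CoT passed
# VeriCoT validation, compared with overall task-level accuracy.
def precision_vs_accuracy(records):
    """records: dicts with boolean 'cot_validated' and 'answer_correct' keys."""
    validated = [r for r in records if r["cot_validated"]]
    precision = sum(r["answer_correct"] for r in validated) / len(validated)
    accuracy = sum(r["answer_correct"] for r in records) / len(records)
    return precision, accuracy  # the tweet reports precision > accuracy
```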
@AnnieFeng6
Yu Feng
13 days
VeriCoT provides multi-faceted feedback. It identifies: - Whether a CoT can be represented in formal logic - How the CoT’s steps are logically supported - What underlying NL premises need to be accepted in order to accept the CoT’s reasoning (3/n)
1 reply · 0 reposts · 1 like
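One way to picture that multi-faceted feedback is as a record type; the field names here are illustrative, not the paper's API:

```python
# Hypothetical sketch of VeriCoT's three kinds of feedback as a record.
from dataclasses import dataclass, field

@dataclass
class VeriCoTFeedback:
    formalizable: bool                      # can the CoT be cast in formal logic?
    step_support: dict = field(default_factory=dict)        # CoT step -> how it is logically supported
    required_premises: list = field(default_factory=list)   # NL premises one must accept
```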
@AnnieFeng6
Yu Feng
13 days
Here's how VeriCoT works: 1. It maintains a growing set of first-order-logic (FOL) premises by inferring them from the NL context. 2. It autoformalizes each new CoT step into an FOL formula. 3. Using a constraint solver, it checks if this new step is logically entailed by the …
1 reply · 0 reposts · 1 like
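The three-step loop above maps naturally onto an off-the-shelf solver. A minimal sketch with Z3 and toy propositional premises (my stand-ins; the real system autoformalizes first-order formulas from the NL context):

```python
# Sketch of step 3: a new CoT step is entailed by the premise set iff
# premises AND NOT(step) is unsatisfiable.
from z3 import Solver, Bools, Implies, Not, unsat

# Step 1: premises inferred from the NL context (toy examples here).
rainy, wet_ground = Bools("rainy wet_ground")
premises = [rainy, Implies(rainy, wet_ground)]

# Step 2: the new CoT step, autoformalized into a formula.
step = wet_ground

# Step 3: check entailment with the solver.
solver = Solver()
solver.add(*premises)
solver.add(Not(step))
if solver.check() == unsat:
    print("step is entailed: grounded reasoning")
else:
    print("step is NOT entailed: flag as ungrounded")
```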
@AnnieFeng6
Yu Feng
13 days
LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that. VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT …
1 reply · 9 reposts · 23 likes
@realliyifei
Li S. Yifei
3 months
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using …
1 reply · 24 reposts · 61 likes
@JerrryKun
Zhikun Xu
5 months
🚀Excited to introduce BOW: A novel RL framework that rethinks vanilla next-word prediction as reasoning path exploration! Across 10 benchmarks, we show BOW leads to better zero-shot capabilities and next-word reasoning. 📄Paper: https://t.co/JAkGqOQ8kf 🧵Details below
1 reply · 6 reposts · 6 likes
@AnnieFeng6
Yu Feng
5 months
👥 We’re looking for reviewers for the COLM 2025 Workshop on AI Agents: Capabilities & Safety @COLM_conf! 🔗 Sign up: https://t.co/9lHImy8qpk Help shape exciting research on AI agents, their capabilities, and the safety challenges they raise. 🧠 #AI #AIagents #COLM2025
docs.google.com: "Thank you for your interest in serving as a Program Committee (PC) member for the COLM 2025 Workshop on AI Agents: Capabilities and Safety. This workshop will bring together leading researchers and..."
@AnnieFeng6
Yu Feng
6 months
🚨COLM 2025 Workshop on AI Agents: Capabilities and Safety @COLM_conf This workshop explores AI agents' capabilities—including reasoning and planning, interaction and embodiment, and real-world applications—as well as critical safety challenges related to reliability, ethics, …
1 reply · 7 reposts · 42 likes
@jeffrey_ch0
Jeffrey (Young-Min) Cho
6 months
🤖💬 Herding instincts… in AIs? Yes, even LLMs can follow the crowd! • 📉 Conformity ↑ when agents lack confidence but trust peers • 🧠 Presentation format shapes peer influence • 🎯 Controlled herding can boost collaboration outcomes 👉 Read more: https://t.co/Ym0rtKyVzH
0 replies · 8 reposts · 13 likes
@cogcomp
Cognitive Computation Group
7 months
Excited to share our papers at #ICLR2025 in Singapore! Check out the summaries on our blog (https://t.co/ySVrTtA0W6), and then check out the papers at oral session 1B (BIRD) and poster session 2 (for all three)! @AnnieFeng6, @XingyuFu2, @BenZhou96, @muhao_chen, @DanRothNLP
0 replies · 5 reposts · 8 likes
@JunlinWang3
Junlin Wang
7 months
Excited to share work from my @togethercompute internship—a deep dive into inference‑time scaling methods 🧠 We rigorously evaluated verifier‑free inference-time scaling methods across both reasoning and non‑reasoning LLMs. Some key findings: 🔑 Even with huge rollout budgets, …
1 reply · 52 reposts · 174 likes
@AnnieFeng6
Yu Feng
7 months
🚀 BIRD + Llama-70B outperforms GPT-4 by a massive 30% on probability estimation accuracy! We tested using human pairwise preference judgments: Can the model tell which supporting evidence is stronger by assigning higher probability? (Unlike unfair direct comparisons (EC) …
0 replies · 0 reposts · 6 likes
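The evaluation above boils down to a pairwise win rate; a sketch where `model_prob` is a stand-in for BIRD's probability estimate:

```python
# Hypothetical sketch: the model "wins" a pair when it assigns the
# human-preferred evidence the higher probability.
def pairwise_accuracy(pairs, model_prob):
    """pairs: (context, preferred_evidence, other_evidence) triples."""
    wins = sum(
        model_prob(ctx, preferred) > model_prob(ctx, other)
        for ctx, preferred, other in pairs
    )
    return wins / len(pairs)
```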
@AnnieFeng6
Yu Feng
7 months
BIRD's probability magic ✨: We use constrained optimization (not directly asking LLMs) to estimate conditional probabilities. Key steps: - Minimize the distributional distance between LLM-verbalized coarse probabilities under complete information and BIRD estimated conditional …
1 reply · 0 reposts · 7 likes
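A rough sketch of the constrained-optimization idea; the factor-level aggregation and the squared-distance objective here are my simplifications, not the paper's exact formulation:

```python
# Hypothetical sketch: fit per-factor conditional probabilities so the
# implied scenario probabilities match the LLM's coarse verbalized ones.
import numpy as np
from scipy.optimize import minimize

scenarios = [(0, 1), (1, 2), (0, 2)]     # factor indices active per scenario
verbalized = np.array([0.7, 0.3, 0.5])   # coarse LLM targets ("likely", ...)

def implied(p):
    # Simplification: a scenario's probability is the mean of its factors'.
    return np.array([p[list(s)].mean() for s in scenarios])

def distance(p):  # distributional distance to the verbalized targets
    return np.sum((implied(p) - verbalized) ** 2)

result = minimize(distance, x0=np.full(3, 0.5), bounds=[(0.0, 1.0)] * 3)
print(result.x)  # estimated conditional probability per factor
```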
@AnnieFeng6
Yu Feng
7 months
While LLMs struggle with reliable decisions under uncertainty alone, they shine at: ✅ Identifying key real-world factors influencing a decision ✅ Producing coarse verbalized probabilities if given complete context ✅ Linking specific contexts to relevant factors through …
1 reply · 0 reposts · 7 likes
@XingyuFu2
Xingyu Fu
2 years
Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔 Can they solve the vision tasks that humans can in the blink of an eye? 😉 tldr; NO, they are far worse than us 💁🏻‍♀️ Introducing BLINK👁 https://t.co/7Ia9u9e0EY, a novel benchmark that studies visual perception …
@_akhaliq
AK
2 years
BLINK Multimodal Large Language Models Can See but Not Perceive We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans …
9 replies · 125 reposts · 408 likes
@AnnieFeng6
Yu Feng
2 years
Please check our paper for more details if you are interested! Data and code are here: https://t.co/yIy91kTQpt Many thanks to my collaborators @BenZhou96 @Haoyu_Wang_97 @DanRothNLP @cogcomp (N/N)
0 replies · 0 reposts · 1 like