Yu Feng
@AnnieFeng6
373 Followers · 19 Following · 17 Media · 28 Statuses
CS PhD Student @Penn | NLP & ML @cogcomp @upennnlp @duke_nlp @RUC | 🧗🏻♀️🎨🩰🎹
Joined August 2018
#ICLR2025 Oral LLMs often struggle with reliable and consistent decisions under uncertainty 😵💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD…
Building on this verification capability, we leverage VeriCoT to actively enhance an LLM’s reasoning:
- We use VeriCoT for inference-time self-reflection.
- VeriCoT helps create a high-fidelity dataset of verified CoTs for SFT, and serves as a source of pairwise reward signals for DPO (see the sketch below).
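A minimal sketch of the DPO-pair construction idea, assuming hypothetical `sample_cots` and `verify_cot` helpers in place of the actual model sampling and VeriCoT verifier:

```python
# Turn verification outcomes into DPO preference pairs (illustrative only).
def build_dpo_pairs(questions, sample_cots, verify_cot):
    pairs = []
    for q in questions:
        # Verify each sampled candidate CoT once.
        results = [(c, verify_cot(q, c)) for c in sample_cots(q)]
        passed = [c for c, ok in results if ok]
        failed = [c for c, ok in results if not ok]
        # Preference signal: a verified CoT is preferred over an unverified one.
        for chosen in passed:
            for rejected in failed:
                pairs.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return pairs
```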
VeriCoT can detect ungrounded or incorrect reasoning, and VeriCoT validation is a strong predictor of final-answer correctness: validated CoTs attain higher precision than task-level accuracy. (4/n)
VeriCoT provides multi-faceted feedback. It identifies:
- whether a CoT can be represented in formal logic,
- how the CoT’s steps are logically supported, and
- what underlying NL premises need to be accepted in order to accept the CoT’s reasoning. (3/n)
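A hypothetical container for these three facets (names are illustrative, not VeriCoT's actual schema):

```python
from dataclasses import dataclass

@dataclass
class VeriCoTFeedback:
    formalizable: bool           # can the CoT be represented in formal logic?
    step_support: list[str]      # per-step status, e.g. "entailed" or "ungrounded"
    assumed_premises: list[str]  # NL premises one must accept for the reasoning to hold
```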
Here's how VeriCoT works:
1. It maintains a growing set of first-order-logic (FOL) premises by inferring them from the NL context.
2. It autoformalizes each new CoT step into an FOL formula.
3. Using a constraint solver, it checks if this new step is logically entailed by the…
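Step 3 is the classic reduction of entailment to unsatisfiability: the premises entail a formula φ iff premises ∧ ¬φ is UNSAT. A minimal sketch with Z3 as the solver, using toy propositional stand-ins for the autoformalized formulas (the real system works over FOL):

```python
# Toy entailment check: do the premises logically entail the new step?
from z3 import Bool, Implies, Not, Solver, unsat

rains = Bool("rains")                 # stand-in atoms for illustration
ground_wet = Bool("ground_wet")

premises = [Implies(rains, ground_wet), rains]  # inferred from NL context
step = ground_wet                               # autoformalized CoT step

# premises entail step  iff  premises AND NOT(step) is unsatisfiable.
solver = Solver()
solver.add(*premises)
solver.add(Not(step))
print("entailed" if solver.check() == unsat else "not entailed")
```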
LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that. VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT…
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using…
🚀Excited to introduce BOW: A novel RL framework that rethinks vanilla next-word prediction as reasoning path exploration! Across 10 benchmarks, we show BOW leads to better zero-shot capabilities and next-word reasoning. 📄Paper: https://t.co/JAkGqOQ8kf 🧵Details below
👥 We’re looking for reviewers for the COLM 2025 Workshop on AI Agents: Capabilities & Safety @COLM_conf! 🔗 Sign up: https://t.co/9lHImy8qpk Help shape exciting research on AI agents, their capabilities, and the safety challenges they raise. 🧠 #AI #AIagents #COLM2025
🔗 docs.google.com: “Thank you for your interest in serving as a Program Committee (PC) member for the COLM 2025 Workshop on AI Agents: Capabilities and Safety. This workshop will bring together leading researchers and…”
🚨COLM 2025 Workshop on AI Agents: Capabilities and Safety @COLM_conf This workshop explores AI agents’ capabilities—including reasoning and planning, interaction and embodiment, and real-world applications—as well as critical safety challenges related to reliability, ethics, …
🤖💬 Herding instincts… in AIs? Yes, even LLMs can follow the crowd!
• 📉 Conformity ↑ when agents lack confidence but trust peers
• 🧠 Presentation format shapes peer influence
• 🎯 Controlled herding can boost collaboration outcomes
👉 Read more: https://t.co/Ym0rtKyVzH
Excited to share our papers at #ICLR2025 in Singapore! Check out the summaries on our blog (https://t.co/ySVrTtA0W6), and then check out the papers at oral session 1B (BIRD) and poster session 2 (for all three)! @AnnieFeng6, @XingyuFu2, @BenZhou96, @muhao_chen, @DanRothNLP
Excited to share work from my @togethercompute internship—a deep dive into inference-time scaling methods 🧠 We rigorously evaluated verifier-free inference-time scaling methods across both reasoning and non-reasoning LLMs. Some key findings: 🔑 Even with huge rollout budgets, …
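For context, one widely used verifier-free method in this family is self-consistency: sample many reasoning chains and majority-vote the final answers. A minimal sketch, with `generate` and the answer parser as hypothetical stand-ins (not the paper's actual code):

```python
# Self-consistency: sample n reasoning chains, majority-vote the answers.
from collections import Counter

def extract_answer(completion: str) -> str:
    return completion.strip().splitlines()[-1]  # toy parser: last line is the answer

def self_consistency(prompt: str, generate, n: int = 16) -> str:
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```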
🚀 BIRD + Llama-70B outperforms GPT-4 by a massive 30% on probability estimation accuracy! We tested using human pairwise preference judgments: Can the model tell which supporting evidence is stronger by assigning higher probability? (Unlike unfair direct comparisons (EC)…
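A sketch of that pairwise protocol, with `estimate_prob` standing in for the model's probability estimator (hypothetical name):

```python
# Fraction of pairs where the model assigns higher probability to the
# evidence humans judged stronger.
def pairwise_accuracy(pairs, estimate_prob):
    # pairs: list of (context, stronger_evidence, weaker_evidence) per human judgment
    wins = sum(estimate_prob(ctx, strong) > estimate_prob(ctx, weak)
               for ctx, strong, weak in pairs)
    return wins / len(pairs)
```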
BIRD's probability magic ✨: We use constrained optimization (not directly asking LLMs) to estimate conditional probabilities. Key steps:
- Minimize the distributional distance between LLM-verbalized coarse probabilities under complete information and BIRD-estimated conditional…
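A minimal sketch of the constrained-optimization idea: fit factor-level conditional probabilities so that, composed per scenario, they match the LLM's verbalized coarse probabilities. The composition rule and data here are toy assumptions, not BIRD's actual objective:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: each scenario activates some factor states and has an
# LLM-verbalized coarse probability under complete information.
scenarios = [((0, 2), 0.8), ((1, 2), 0.4), ((0, 3), 0.6)]
n_states = 4

def loss(x):
    # Squared distance between verbalized and composed probabilities;
    # toy composition rule: mean of the active factor-state conditionals.
    return sum((v - x[list(idx)].mean()) ** 2 for idx, v in scenarios)

res = minimize(loss, x0=np.full(n_states, 0.5),
               bounds=[(0.0, 1.0)] * n_states)  # keep probabilities in [0, 1]
print(res.x)  # estimated conditional probability per factor state
```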
While LLMs struggle with reliable decisions under uncertainty alone, they shine at:
✅ Identifying key real-world factors influencing a decision
✅ Producing coarse verbalized probabilities if given complete context
✅ Linking specific contexts to relevant factors through…
Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔 Can they solve the vision tasks that humans can in the blink of an eye? 😉 tldr; NO, they are far worse than us 💁🏻♀️ Introducing BLINK👁 https://t.co/7Ia9u9e0EY, a novel benchmark that studies visual perception…
BLINK: Multimodal Large Language Models Can See but Not Perceive
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans…
Please check our paper for more details if you are interested! Data and code are here: https://t.co/yIy91kTQpt Many thanks to my collaborators @BenZhou96 @Haoyu_Wang_97 @DanRothNLP @cogcomp (N/N)