Yu Feng
@AnnieFeng6
373 Followers · 19 Following · 17 Media · 28 Statuses
CS PhD Student @Penn | NLP & ML @cogcomp @upennnlp @duke_nlp @RUC | 🧗🏻♀️🎨🩰🎹
Joined August 2018
#ICLR2025 Oral LLMs often struggle with reliable and consistent decisions under uncertainty 😵💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD…
Building on this verification capability, we leverage VeriCoT to actively enhance an LLM’s reasoning:
- We use VeriCoT for inference-time self-reflection.
- VeriCoT helps create a high-fidelity dataset of verified CoTs for SFT, and serves as a source of pairwise reward signals for DPO (see the sketch below).
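A minimal sketch of the DPO-pair construction idea, assuming hypothetical `sample_cots` and `verify_cot` helpers in place of the actual model sampling and VeriCoT verifier:

```python
# Turn verification outcomes into DPO preference pairs (illustrative only).
def build_dpo_pairs(questions, sample_cots, verify_cot):
    pairs = []
    for q in questions:
        # Verify each sampled candidate CoT once.
        results = [(c, verify_cot(q, c)) for c in sample_cots(q)]
        passed = [c for c, ok in results if ok]
        failed = [c for c, ok in results if not ok]
        # Preference signal: a verified CoT is preferred over an unverified one.
        for chosen in passed:
            for rejected in failed:
                pairs.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return pairs
```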
VeriCoT can detect ungrounded or incorrect reasoning, and VeriCoT validation is a strong predictor of final-answer correctness: validated CoTs attain higher precision than task-level accuracy. (4/n)
VeriCoT provides multi-faceted feedback. It identifies:
- whether a CoT can be represented in formal logic,
- how the CoT’s steps are logically supported, and
- what underlying NL premises need to be accepted in order to accept the CoT’s reasoning. (3/n)
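A hypothetical container for these three facets (names are illustrative, not VeriCoT's actual schema):

```python
from dataclasses import dataclass

@dataclass
class VeriCoTFeedback:
    formalizable: bool           # can the CoT be represented in formal logic?
    step_support: list[str]      # per-step status, e.g. "entailed" or "ungrounded"
    assumed_premises: list[str]  # NL premises one must accept for the reasoning to hold
```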
Here's how VeriCoT works:
1. It maintains a growing set of first-order-logic (FOL) premises by inferring them from the NL context.
2. It autoformalizes each new CoT step into an FOL formula.
3. Using a constraint solver, it checks if this new step is logically entailed by the…
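Step 3 is the classic reduction of entailment to unsatisfiability: the premises entail a formula φ iff premises ∧ ¬φ is UNSAT. A minimal sketch with Z3 as the solver, using toy propositional stand-ins for the autoformalized formulas (the real system works over FOL):

```python
# Toy entailment check: do the premises logically entail the new step?
from z3 import Bool, Implies, Not, Solver, unsat

rains = Bool("rains")                 # stand-in atoms for illustration
ground_wet = Bool("ground_wet")

premises = [Implies(rains, ground_wet), rains]  # inferred from NL context
step = ground_wet                               # autoformalized CoT step

# premises entail step  iff  premises AND NOT(step) is unsatisfiable.
solver = Solver()
solver.add(*premises)
solver.add(Not(step))
print("entailed" if solver.check() == unsat else "not entailed")
```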
LLM CoT reasoning looks smart but can be logically flawed or... just made up. It's time to hold reasoning accountable! We built VeriCoT to do just that. VeriCoT extracts the core argument of the CoT using well-formed symbolic notions of logical support. It formalizes every CoT…
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark to evaluate long-form scholarly question answering at scale across 75 fields, using…
🚀Excited to introduce BOW: A novel RL framework that rethinks vanilla next-word prediction as reasoning path exploration! Across 10 benchmarks, we show BOW leads to better zero-shot capabilities and next-word reasoning. 📄Paper: https://t.co/JAkGqOQ8kf 🧵Details below
👥 We’re looking for reviewers for the COLM 2025 Workshop on AI Agents: Capabilities & Safety @COLM_conf! 🔗 Sign up: https://t.co/9lHImy8qpk Help shape exciting research on AI agents, their capabilities, and the safety challenges they raise. 🧠 #AI #AIagents #COLM2025
🔗 docs.google.com: “Thank you for your interest in serving as a Program Committee (PC) member for the COLM 2025 Workshop on AI Agents: Capabilities and Safety. This workshop will bring together leading researchers and…”
🚨COLM 2025 Workshop on AI Agents: Capabilities and Safety @COLM_conf This workshop explores AI agents’ capabilities—including reasoning and planning, interaction and embodiment, and real-world applications—as well as critical safety challenges related to reliability, ethics, …
🤖💬 Herding instincts… in AIs? Yes, even LLMs can follow the crowd!
• 📉 Conformity ↑ when agents lack confidence but trust peers
• 🧠 Presentation format shapes peer influence
• 🎯 Controlled herding can boost collaboration outcomes
👉 Read more: https://t.co/Ym0rtKyVzH
Excited to share our papers at #ICLR2025 in Singapore! Check out the summaries on our blog (https://t.co/ySVrTtA0W6), and then check out the papers at oral session 1B (BIRD) and poster session 2 (for all three)! @AnnieFeng6, @XingyuFu2, @BenZhou96, @muhao_chen, @DanRothNLP
Excited to share work from my @togethercompute internship—a deep dive into inference-time scaling methods 🧠 We rigorously evaluated verifier-free inference-time scaling methods across both reasoning and non-reasoning LLMs. Some key findings: 🔑 Even with huge rollout budgets, …
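For context, one widely used verifier-free method in this family is self-consistency: sample many reasoning chains and majority-vote the final answers. A minimal sketch, with `generate` and the answer parser as hypothetical stand-ins (not the paper's actual code):

```python
# Self-consistency: sample n reasoning chains, majority-vote the answers.
from collections import Counter

def extract_answer(completion: str) -> str:
    return completion.strip().splitlines()[-1]  # toy parser: last line is the answer

def self_consistency(prompt: str, generate, n: int = 16) -> str:
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```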
🚀 BIRD + Llama-70B outperforms GPT-4 by a massive 30% on probability estimation accuracy! We tested using human pairwise preference judgments: Can the model tell which supporting evidence is stronger by assigning higher probability? (Unlike unfair direct comparisons (EC)…
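A sketch of that pairwise protocol, with `estimate_prob` standing in for the model's probability estimator (hypothetical name):

```python
# Fraction of pairs where the model assigns higher probability to the
# evidence humans judged stronger.
def pairwise_accuracy(pairs, estimate_prob):
    # pairs: list of (context, stronger_evidence, weaker_evidence) per human judgment
    wins = sum(estimate_prob(ctx, strong) > estimate_prob(ctx, weak)
               for ctx, strong, weak in pairs)
    return wins / len(pairs)
```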
BIRD's probability magic ✨: We use constrained optimization (not directly asking LLMs) to estimate conditional probabilities. Key steps:
- Minimize the distributional distance between LLM-verbalized coarse probabilities under complete information and BIRD-estimated conditional…
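A minimal sketch of the constrained-optimization idea: fit factor-level conditional probabilities so that, composed per scenario, they match the LLM's verbalized coarse probabilities. The composition rule and data here are toy assumptions, not BIRD's actual objective:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: each scenario activates some factor states and has an
# LLM-verbalized coarse probability under complete information.
scenarios = [((0, 2), 0.8), ((1, 2), 0.4), ((0, 3), 0.6)]
n_states = 4

def loss(x):
    # Squared distance between verbalized and composed probabilities;
    # toy composition rule: mean of the active factor-state conditionals.
    return sum((v - x[list(idx)].mean()) ** 2 for idx, v in scenarios)

res = minimize(loss, x0=np.full(n_states, 0.5),
               bounds=[(0.0, 1.0)] * n_states)  # keep probabilities in [0, 1]
print(res.x)  # estimated conditional probability per factor state
```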
While LLMs struggle with reliable decisions under uncertainty alone, they shine at:
✅ Identifying key real-world factors influencing a decision
✅ Producing coarse verbalized probabilities if given complete context
✅ Linking specific contexts to relevant factors through…
Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔 Can they solve the vision tasks that humans can in the blink of an eye? 😉 tldr; NO, they are far worse than us 💁🏻♀️ Introducing BLINK👁 https://t.co/7Ia9u9e0EY, a novel benchmark that studies visual perception…
BLINK: Multimodal Large Language Models Can See but Not Perceive
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans…
Please check our paper for more details if you are interested! Data and code are here: https://t.co/yIy91kTQpt Many thanks to my collaborators @BenZhou96 @Haoyu_Wang_97 @DanRothNLP @cogcomp (N/N)