UPenn NLP (@upennnlp)
1K Followers · 99 Following · 3 Media · 82 Statuses
@Penn Natural Language Processing group
Philadelphia · Joined December 2022
📷 New #EMNLP2025 Findings survey paper! “Conflicts in Texts: Data, Implications, and Challenges” Paper: https://t.co/y9l472CyTk Conflicts are everywhere in NLP: news articles reflecting opposing perspectives, annotators who disagree, LLMs that hallucinate
0 replies · 3 reposts · 10 likes
If you are at EMNLP 2025, check out this cool work led by @WeiqiuYou and chat with her about LLM reasoning soundness guarantees. Weiqiu is wrapping up her PhD and just entering the job market; she’s smart, tenacious, and overall amazing. I highly recommend her for your team.
I'll be presenting our work "Probabilistic Soundness Guarantees in LLM Reasoning Chains" at EMNLP 2025 today (Nov 5), Hall C, 14:30-16:00, 802-Main. Blog: https://t.co/OmsR1oFwMv Paper: https://t.co/0JjxNATLPj Code: https://t.co/A6Hqa0ZLGa
1 reply · 3 reposts · 8 likes
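The paper above assigns each reasoning step a soundness probability. As a hedged illustration of how per-step probabilities might compose into a chain-level guarantee, here is a minimal sketch using a simple union bound; this is our assumption for exposition, not necessarily the paper's actual estimator.

```python
def chain_soundness_lower_bound(step_probs):
    """Lower-bound the probability that an entire reasoning chain is sound,
    given per-step soundness probabilities p_i, via a union bound:
    P(all steps sound) >= 1 - sum(1 - p_i).
    (Illustrative composition rule, not the paper's method.)"""
    slack = sum(1.0 - p for p in step_probs)
    return max(0.0, 1.0 - slack)

# Example: four steps, each judged sound with high probability
print(chain_soundness_lower_bound([0.99, 0.98, 0.995, 0.99]))  # ~0.955
```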
📆The full agenda for LM4Sci Workshop (10 Oct 2025, Montreal 🇨🇦): keynotes, panels, poster sessions & more! https://t.co/q01etSzF4P
lm4sci.github.io
Bridge the gap between AI researchers and domain scientists by fostering interdisciplinary dialogue on how foundation models can enhance scientific reasoning, assist human researchers, and transform...
0 replies · 7 reposts · 18 likes
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark for evaluating long-form scholarly question answering across 75 fields, using
1 reply · 23 reposts · 60 likes
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1 reply · 14 reposts · 41 likes
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
5 replies · 29 reposts · 160 likes
In a world of geopolitical conflicts, how can AI help us navigate? Our #ACL2025-F work studies RAG robustness across 49 languages. TL;DR: 📈 boost robustness w/ multilingual RAG, 🤔 take care w/ low-resource citations 📜 https://t.co/1YFiLEAiMG 🤗 https://t.co/wJl062UkCd 1/4 🧵
2 replies · 3 reposts · 10 likes
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
1 reply · 23 reposts · 76 likes
🤖💬 Herding instincts… in AIs? Yes, even LLMs can follow the crowd! • 📉 Conformity ↑ when agents lack confidence but trust peers • 🧠 Presentation format shapes peer influence • 🎯 Controlled herding can boost collaboration outcomes 👉 Read more: https://t.co/Ym0rtKyVzH
0 replies · 8 reposts · 13 likes
#NAACL2025 How can we compare cultural differences with social media data at scale? Our work uses lexica to annotate X 🇺🇸 & Weibo 🇨🇳 posts with valence (😄☹️) & arousal (🔥❄️) scores, revealing cross-cultural differences in emotional expression. https://t.co/2tNFceO9GD
aclanthology.org
Young Min Cho, Dandan Pang, Stuti Thapa, Garrick Sherman, Lyle Ungar, Louis Tay, Sharath Chandra Guntuku. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.
0 replies · 4 reposts · 13 likes
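A minimal sketch of lexicon-based affect scoring, the general technique this tweet describes: each word maps to a valence/arousal weight, and a post's score is the mean over matched words. The word weights below are toy values for illustration, not the lexica used in the paper.

```python
# Assumed toy lexica; the paper uses validated valence/arousal word lists.
VALENCE = {"happy": 0.9, "sad": -0.8, "calm": 0.6, "angry": -0.7}
AROUSAL = {"happy": 0.6, "sad": 0.3, "calm": 0.1, "angry": 0.9}

def score_post(text, lexicon):
    """Average the lexicon weights of all matched words in a post."""
    words = text.lower().split()
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else None  # None: no lexicon coverage

post = "so happy and calm today"
print(score_post(post, VALENCE), score_post(post, AROUSAL))  # ~0.75, ~0.35
```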
🚀 Introducing Science-T2I - Towards bridging the gap between AI imagination and scientific reality in image generation! [CVPR 2025] 📜 Paper: https://t.co/ybG6z3MQbd 🌐 Project: https://t.co/IBJodI0Uvm 💻 Code: https://t.co/voFOyXPRhi 🤗 Dataset: https://t.co/fjKgXkiB8q 🔍
4 replies · 32 reposts · 140 likes
#ICLR2025 Oral LLMs often struggle with reliable and consistent decisions under uncertainty 😵💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD
2 replies · 40 reposts · 260 likes
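BIRD's full factor-based probability estimation is not reproduced here; as a hedged illustration of the underlying building block (turning per-option LLM log-scores into a decision distribution), here is a minimal sketch.

```python
import math

def choice_distribution(option_logprobs, temperature=1.0):
    """Turn per-option log-scores (e.g., LLM log-likelihoods) into a
    probability distribution via a temperature-scaled softmax.
    Generic building block only, not BIRD's estimator."""
    scaled = [lp / temperature for lp in option_logprobs]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

print(choice_distribution([-1.2, -0.3, -2.5]))  # highest probability on option 2
```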
New study with a billion words! Here’s the 60-second version. ⏲️ https://t.co/xW5lgEqbVs
@NaturePortfolio @sharathguntuku @UChicago
7 replies · 22 reposts · 57 likes
We share Code-Guided Synthetic Data Generation: using LLM-generated code to create multimodal datasets for text-rich images, such as charts📊, documents📄, etc., to enhance Vision-Language Models. Website: https://t.co/9IQ4CgeKMF Dataset: https://t.co/yiERrZup8X Paper:
6 replies · 48 reposts · 196 likes
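A minimal sketch of the code-guided idea, under the assumption that charts rendered from known data yield QA pairs that are correct by construction; the file name and question template are our illustration, not the paper's pipeline.

```python
import random
import matplotlib.pyplot as plt

# Render a chart from data we fully control...
categories = ["A", "B", "C", "D"]
values = [random.randint(10, 100) for _ in categories]
plt.bar(categories, values)
plt.title("Synthetic sales by region")
plt.savefig("chart.png")

# ...then emit a QA pair grounded in the same data, so the answer is exact.
best = categories[values.index(max(values))]
qa_pair = {
    "image": "chart.png",
    "question": "Which region has the highest sales?",
    "answer": best,
}
print(qa_pair)
```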
🚨 LLMs must grasp implied language to reason about emotions, social cues, etc. Our @GoogleDeepMind paper presents the Implied NLI dataset. Targeting social norms 🌎 and conversational dynamics 💬, we enhance LLM understanding of real-world implication! https://t.co/qHMoziVf2H
arxiv.org
Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and...
1 reply · 16 reposts · 55 likes
Teach GPT-4o to edit charts and tables to ReFocus 🔍 and facilitate reasoning 🧠! 🔥 We introduce ReFocus, which edits input table and chart images to better reason visually https://t.co/YcmJYSjE9H 🤔 Can we teach smaller models to learn such visual CoT reasoning? 🚀 Yes --
8 replies · 43 reposts · 213 likes
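An illustrative sketch of ReFocus-style image editing with Pillow: drawing a box on a chart image to direct the model's attention to the relevant region. The function name and coordinates are hypothetical, not the released code.

```python
from PIL import Image, ImageDraw

def highlight_region(image_path, box, out_path="refocused.png"):
    """Draw an attention box on a chart/table image (illustrative edit)."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline="red", width=4)  # (left, top, right, bottom)
    img.save(out_path)
    return out_path

# Hypothetical coordinates for the column of interest
highlight_region("chart.png", (120, 40, 220, 300))
```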
🗣️ New Paper 🗣️ Can a single AI text detector generalize to a fixed set of LLMs and domains? Our shared task results suggest yes! Winners @pangramlabs and @LeidosInc got over 99% TPR across 467k documents spanning 11 LLMs, 8 domains, and 4 decoding strategies See thread 🧵
3 replies · 9 reposts · 18 likes
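For reference, the TPR quoted above is the fraction of machine-generated documents a detector correctly flags; a minimal sketch with toy labels follows.

```python
def true_positive_rate(labels, preds):
    """labels/preds: 1 = machine-generated, 0 = human-written."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: 4 machine-generated docs, 3 caught by the detector
print(true_positive_rate([1, 1, 1, 0, 1], [1, 1, 0, 0, 1]))  # 0.75
```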
Excited to share our first preprint on a comprehensive analysis of withdrawn papers from arXiv spanning its entire history through Sept 2024, in collaboration with @tdietterich and Jonathan Young from the @arxiv team! A quick summary and link to the paper in this thread:
3 replies · 21 reposts · 114 likes
Excited to share ✨ Contextualized Evaluations ✨! Benchmarks like Chatbot Arena contain underspecified queries, which can lead to arbitrary eval judgments. What happens if we provide evaluators with context (e.g., who's the user, what's their intent) when judging LM outputs? 🧵↓
2 replies · 31 reposts · 122 likes
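A minimal sketch of what providing context to an evaluator might look like, assuming a simple prompt-construction setup; the function and context fields are our illustration, not the paper's exact prompts.

```python
def contextualized_judge_prompt(query, context, response_a, response_b):
    """Attach user/intent context to an underspecified query before
    asking a judge model to compare two responses (illustrative setup)."""
    ctx = "\n".join(f"- {k}: {v}" for k, v in context.items())
    return (
        f"Query: {query}\n"
        f"Context about the user:\n{ctx}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        "Given the context, which response better serves this user? "
        "Answer 'A' or 'B' with a one-sentence justification."
    )

prompt = contextualized_judge_prompt(
    "What's a good laptop?",
    {"who the user is": "a college student", "intent": "budget machine for note-taking"},
    "The latest 16-inch workstation with a dedicated GPU...",
    "A light, inexpensive ultrabook with long battery life...",
)
print(prompt)
```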