UPenn NLP (@upennnlp)
1K Followers · 99 Following · 3 Media · 82 Statuses
@Penn Natural Language Processing group
Philadelphia · Joined December 2022
📷 New #EMNLP2025 Findings survey paper! “Conflicts in Texts: Data, Implications, and Challenges” Paper: https://t.co/y9l472CyTk Conflicts are everywhere in NLP: news articles reflecting opposing perspectives, annotators who disagree, LLMs that hallucinate
0 replies · 3 reposts · 10 likes
If you are at EMNLP 2025, check out this cool work led by @WeiqiuYou and chat with her about LLM reasoning soundness guarantees. Weiqiu is wrapping up her PhD and just entering the job market; she’s smart, tenacious, and overall amazing. I highly recommend her for your team.
I'll be presenting our work "Probabilistic Soundness Guarantees in LLM Reasoning Chains" at EMNLP 2025 today (Nov 5), Hall C, 14:30-16:00, 802-Main. Blog: https://t.co/OmsR1oFwMv Paper: https://t.co/0JjxNATLPj Code: https://t.co/A6Hqa0ZLGa
1 reply · 3 reposts · 8 likes
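The paper above assigns each reasoning step a soundness probability. As a hedged illustration of how per-step probabilities might compose into a chain-level guarantee, here is a minimal sketch using a simple union bound; this is our assumption for exposition, not necessarily the paper's actual estimator.

```python
def chain_soundness_lower_bound(step_probs):
    """Lower-bound the probability that an entire reasoning chain is sound,
    given per-step soundness probabilities p_i, via a union bound:
    P(all steps sound) >= 1 - sum(1 - p_i).
    (Illustrative composition rule, not the paper's method.)"""
    slack = sum(1.0 - p for p in step_probs)
    return max(0.0, 1.0 - slack)

# Example: four steps, each judged sound with high probability
print(chain_soundness_lower_bound([0.99, 0.98, 0.995, 0.99]))  # ~0.955
```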
📆The full agenda for LM4Sci Workshop (10 Oct 2025, Montreal 🇨🇦): keynotes, panels, poster sessions & more! https://t.co/q01etSzF4P
lm4sci.github.io
Bridge the gap between AI researchers and domain scientists by fostering interdisciplinary dialogue on how foundation models can enhance scientific reasoning, assist human researchers, and transform...
0 replies · 7 reposts · 18 likes
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark for evaluating long-form scholarly question answering across 75 fields, using
1 reply · 23 reposts · 60 likes
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1 reply · 14 reposts · 41 likes
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
5 replies · 29 reposts · 160 likes
In a world of geopolitical conflicts, how can AI help us navigate? Our #ACL2025-F work studies RAG robustness across 49 languages. TL;DR: 📈 boost robustness w/ multilingual RAG, 🤔 take care w/ low-resource citations 📜 https://t.co/1YFiLEAiMG 🤗 https://t.co/wJl062UkCd 1/4 🧵
2 replies · 3 reposts · 10 likes
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
1 reply · 23 reposts · 76 likes
🤖💬 Herding instincts… in AIs? Yes, even LLMs can follow the crowd! • 📉 Conformity ↑ when agents lack confidence but trust peers • 🧠 Presentation format shapes peer influence • 🎯 Controlled herding can boost collaboration outcomes 👉 Read more: https://t.co/Ym0rtKyVzH
0 replies · 8 reposts · 13 likes
#NAACL2025 How can we compare cultural differences with social media data at scale? Our work uses lexica to annotate X 🇺🇸 & Weibo 🇨🇳 posts with valence (😄☹️) & arousal (🔥❄️) scores, revealing cross-cultural differences in emotional expression. https://t.co/2tNFceO9GD
aclanthology.org
Young Min Cho, Dandan Pang, Stuti Thapa, Garrick Sherman, Lyle Ungar, Louis Tay, Sharath Chandra Guntuku. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.
0 replies · 4 reposts · 13 likes
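A minimal sketch of lexicon-based affect scoring, the general technique this tweet describes: each word maps to a valence/arousal weight, and a post's score is the mean over matched words. The word weights below are toy values for illustration, not the lexica used in the paper.

```python
# Assumed toy lexica; the paper uses validated valence/arousal word lists.
VALENCE = {"happy": 0.9, "sad": -0.8, "calm": 0.6, "angry": -0.7}
AROUSAL = {"happy": 0.6, "sad": 0.3, "calm": 0.1, "angry": 0.9}

def score_post(text, lexicon):
    """Average the lexicon weights of all matched words in a post."""
    words = text.lower().split()
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else None  # None: no lexicon coverage

post = "so happy and calm today"
print(score_post(post, VALENCE), score_post(post, AROUSAL))  # ~0.75, ~0.35
```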
🚀 Introducing Science-T2I - Towards bridging the gap between AI imagination and scientific reality in image generation! [CVPR 2025] 📜 Paper: https://t.co/ybG6z3MQbd 🌐 Project: https://t.co/IBJodI0Uvm 💻 Code: https://t.co/voFOyXPRhi 🤗 Dataset: https://t.co/fjKgXkiB8q 🔍
4 replies · 32 reposts · 140 likes
#ICLR2025 Oral LLMs often struggle with reliable and consistent decisions under uncertainty 😵💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD
2 replies · 40 reposts · 260 likes
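BIRD's full factor-based probability estimation is not reproduced here; as a hedged illustration of the underlying building block (turning per-option LLM log-scores into a decision distribution), here is a minimal sketch.

```python
import math

def choice_distribution(option_logprobs, temperature=1.0):
    """Turn per-option log-scores (e.g., LLM log-likelihoods) into a
    probability distribution via a temperature-scaled softmax.
    Generic building block only, not BIRD's estimator."""
    scaled = [lp / temperature for lp in option_logprobs]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

print(choice_distribution([-1.2, -0.3, -2.5]))  # highest probability on option 2
```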
New study with a billion words! Here’s the 60-second version. ⏲️ https://t.co/xW5lgEqbVs
@NaturePortfolio @sharathguntuku @UChicago
7 replies · 22 reposts · 57 likes
We share Code-Guided Synthetic Data Generation: using LLM-generated code to create multimodal datasets for text-rich images, such as charts📊, documents📄, etc., to enhance Vision-Language Models. Website: https://t.co/9IQ4CgeKMF Dataset: https://t.co/yiERrZup8X Paper:
6 replies · 48 reposts · 196 likes
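A minimal sketch of the code-guided idea, under the assumption that charts rendered from known data yield QA pairs that are correct by construction; the file name and question template are our illustration, not the paper's pipeline.

```python
import random
import matplotlib.pyplot as plt

# Render a chart from data we fully control...
categories = ["A", "B", "C", "D"]
values = [random.randint(10, 100) for _ in categories]
plt.bar(categories, values)
plt.title("Synthetic sales by region")
plt.savefig("chart.png")

# ...then emit a QA pair grounded in the same data, so the answer is exact.
best = categories[values.index(max(values))]
qa_pair = {
    "image": "chart.png",
    "question": "Which region has the highest sales?",
    "answer": best,
}
print(qa_pair)
```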
🚨 LLMs must grasp implied language to reason about emotions, social cues, etc. Our @GoogleDeepMind paper presents the Implied NLI dataset. Targeting social norms 🌎 and conversational dynamics 💬, we enhance LLM understanding of real-world implication! https://t.co/qHMoziVf2H
arxiv.org
Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and...
1 reply · 16 reposts · 55 likes
Teach GPT-4o to edit charts and tables to ReFocus 🔍 and facilitate reasoning 🧠! 🔥 We introduce ReFocus, which edits input table and chart images to better reason visually https://t.co/YcmJYSjE9H 🤔 Can we teach smaller models to learn such visual CoT reasoning? 🚀 Yes --
8 replies · 43 reposts · 213 likes
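An illustrative sketch of ReFocus-style image editing with Pillow: drawing a box on a chart image to direct the model's attention to the relevant region. The function name and coordinates are hypothetical, not the released code.

```python
from PIL import Image, ImageDraw

def highlight_region(image_path, box, out_path="refocused.png"):
    """Draw an attention box on a chart/table image (illustrative edit)."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline="red", width=4)  # (left, top, right, bottom)
    img.save(out_path)
    return out_path

# Hypothetical coordinates for the column of interest
highlight_region("chart.png", (120, 40, 220, 300))
```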
🗣️ New Paper 🗣️ Can a single AI text detector generalize to a fixed set of LLMs and domains? Our shared task results suggest yes! Winners @pangramlabs and @LeidosInc got over 99% TPR across 467k documents spanning 11 LLMs, 8 domains, and 4 decoding strategies See thread 🧵
3 replies · 9 reposts · 18 likes
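For reference, the TPR quoted above is the fraction of machine-generated documents a detector correctly flags; a minimal sketch with toy labels follows.

```python
def true_positive_rate(labels, preds):
    """labels/preds: 1 = machine-generated, 0 = human-written."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: 4 machine-generated docs, 3 caught by the detector
print(true_positive_rate([1, 1, 1, 0, 1], [1, 1, 0, 0, 1]))  # 0.75
```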
Excited to share our first preprint on a comprehensive analysis of withdrawn papers from arXiv spanning its entire history through Sept 2024, in collaboration with @tdietterich and Jonathan Young from the @arxiv team! A quick summary and link to the paper in this thread:
3 replies · 21 reposts · 114 likes
Excited to share ✨ Contextualized Evaluations ✨! Benchmarks like Chatbot Arena contain underspecified queries, which can lead to arbitrary eval judgments. What happens if we provide evaluators with context (e.g., who's the user, what's their intent) when judging LM outputs? 🧵↓
2 replies · 31 reposts · 122 likes
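A minimal sketch of what providing context to an evaluator might look like, assuming a simple prompt-construction setup; the function and context fields are our illustration, not the paper's exact prompts.

```python
def contextualized_judge_prompt(query, context, response_a, response_b):
    """Attach user/intent context to an underspecified query before
    asking a judge model to compare two responses (illustrative setup)."""
    ctx = "\n".join(f"- {k}: {v}" for k, v in context.items())
    return (
        f"Query: {query}\n"
        f"Context about the user:\n{ctx}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        "Given the context, which response better serves this user? "
        "Answer 'A' or 'B' with a one-sentence justification."
    )

prompt = contextualized_judge_prompt(
    "What's a good laptop?",
    {"who the user is": "a college student", "intent": "budget machine for note-taking"},
    "The latest 16-inch workstation with a dedicated GPU...",
    "A light, inexpensive ultrabook with long battery life...",
)
print(prompt)
```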