UPenn NLP

@upennnlp

1K Followers · 99 Following · 3 Media · 82 Statuses

@Penn Natural Language Processing group

Philadelphia
Joined December 2022
@liusiyi64198
Siyi Liu
6 days
📷 New #EMNLP2025 Findings survey paper! “Conflicts in Texts: Data, Implications, and Challenges” Paper: https://t.co/y9l472CyTk Conflicts are everywhere in NLP — news articles reflecting different perspectives or opposing views, annotators who disagree, LLMs that hallucinate
0 replies · 3 reposts · 10 likes
@deliprao
Delip Rao e/σ
8 days
If you are at EMNLP 2025, check out this cool work led by @WeiqiuYou and chat with her about LLM reasoning soundness guarantees. Weiqiu is wrapping up her PhD and just entering the job market; she’s smart, tenacious, and overall amazing. I highly recommend her for your team.
@WeiqiuYou
Weiqiu You @ EMNLP2025
8 days
I'll be presenting our work "Probabilistic Soundness Guarantees in LLM Reasoning Chains" at EMNLP 2025 Today (Nov 5) Hall C 14:30-16:00 802-Main Blog: https://t.co/OmsR1oFwMv Paper: https://t.co/0JjxNATLPj Code: https://t.co/A6Hqa0ZLGa
1 reply · 3 reposts · 8 likes
@realliyifei
Li S. Yifei
2 months
How well can LLMs & deep research systems synthesize long-form answers to *thousands of research queries across diverse domains*? Excited to announce 🎓📖 ResearchQA: a large-scale benchmark for evaluating long-form scholarly question answering across 75 fields, using
1 reply · 23 reposts · 60 likes
@TomerWolfson
Tomer Wolfson
3 months
Many factual QA benchmarks have become saturated, yet factuality still poses a very real issue! ✨We present MoNaCo, an Ai2 benchmark of human-written time-consuming questions that, on average, require 43.3 documents per question!✨ 📣Blogpost: https://t.co/GQD83gdHgg 🧵(1/5)
1 reply · 14 reposts · 41 likes
@allen_ai
Ai2
4 months
In our new paper, “Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries,” we find that adding just a bit of missing context can reorder model leaderboards—and surface hidden biases. 🧵👇
5 replies · 29 reposts · 160 likes
@bryanlics
Bryan Li
4 months
In a world of geopolitical conflicts, how can AI help us navigate? Our #ACL2025-F work studies RAG robustness across 49 languages. TL;DR: 📈 boost robustness w/ multilingual RAG, 🤔 take care w/ low-resource citations 📜 https://t.co/1YFiLEAiMG 🤗 https://t.co/wJl062UkCd 1/4 🧵
2 replies · 3 reposts · 10 likes
@cmalaviya11
Chaitanya Malaviya
5 months
Ever wondered what makes language models generate overly verbose, vague, or sycophantic responses? Our new paper investigates these and other idiosyncratic biases in preference models, and presents a simple post-training recipe to mitigate them! Thread below 🧵↓
1 reply · 23 reposts · 76 likes
@jeffrey_ch0
Jeffrey (Young-Min) Cho
6 months
🤖💬 Herding instincts… in AIs? Yes, even LLMs can follow the crowd! • 📉 Conformity ↑ when agents lack confidence but trust peers • 🧠 Presentation format shapes peer influence • 🎯 Controlled herding can boost collaboration outcomes 👉 Read more: https://t.co/Ym0rtKyVzH
0 replies · 8 reposts · 13 likes
@jeffrey_ch0
Jeffrey (Young-Min) Cho
6 months
#NAACL2025 How can we compare cultural differences at scale using social media data? Our work uses lexica to annotate X 🇺🇸 & Weibo 🇨🇳 posts with valence (😄☹️) & arousal (🔥❄️) scores, revealing cross-cultural differences in emotional expression. https://t.co/2tNFceO9GD
aclanthology.org
Young Min Cho, Dandan Pang, Stuti Thapa, Garrick Sherman, Lyle Ungar, Louis Tay, Sharath Chandra Guntuku. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.
0 replies · 4 reposts · 13 likes
@JialuoLi1007
Jialuo Li
7 months
🚀 Introducing Science-T2I - Towards bridging the gap between AI imagination and scientific reality in image generation! [CVPR 2025] 📜 Paper: https://t.co/ybG6z3MQbd 🌐 Project: https://t.co/IBJodI0Uvm 💻 Code: https://t.co/voFOyXPRhi 🤗 Dataset: https://t.co/fjKgXkiB8q 🔍
4 replies · 32 reposts · 140 likes
@AnnieFeng6
Yu Feng
7 months
#ICLR2025 Oral LLMs often struggle with reliable and consistent decisions under uncertainty 😵‍💫 — largely because they can't reliably estimate the probability of each choice. We propose BIRD 🐦, a framework that significantly enhances LLM decision making under uncertainty. BIRD
2 replies · 40 reposts · 260 likes
@deliprao
Delip Rao e/σ
7 months
@soldni Heard good things about DataDreamer from @upennnlp https://t.co/Qq4VGHlJ9f
1 reply · 1 repost · 6 likes
@ThomasTalhelm
Thomas Talhelm
8 months
New study with a billion words! Here’s the 60-second version. ⏲️ https://t.co/xW5lgEqbVs @NaturePortfolio @sharathguntuku @UChicago
7 replies · 22 reposts · 57 likes
@YueYangAI
Yue Yang
9 months
We share Code-Guided Synthetic Data Generation: using LLM-generated code to create multimodal datasets for text-rich images, such as charts📊, documents📄, etc., to enhance Vision-Language Models. Website: https://t.co/9IQ4CgeKMF Dataset: https://t.co/yiERrZup8X Paper:
6 replies · 48 reposts · 196 likes
@shreyahavaldar
Shreya Havaldar
10 months
🚨 LLMs must grasp implied language to reason about emotions, social cues, etc. Our @GoogleDeepMind paper presents the Implied NLI dataset. Targeting social norms 🌎 and conversational dynamics 💬, we enhance LLM understanding of real-world implication! https://t.co/qHMoziVf2H
arxiv.org
Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and...
1 reply · 16 reposts · 55 likes
@XingyuFu2
Xingyu Fu
10 months
Teach GPT-4o to edit on charts and tables to ReFocus 🔍 and facilitate reasoning 🧠! 🔥 We introduce ReFocus, which edits input table and chart images to better reason visually https://t.co/YcmJYSjE9H 🤔 Can we teach smaller models to learn such visual CoT reasoning? 🚀 Yes --
8 replies · 43 reposts · 213 likes
@LiamDugan_
Liam Dugan
10 months
🗣️ New Paper 🗣️ Can a single AI text detector generalize to a fixed set of LLMs and domains? Our shared task results suggest yes! Winners @pangramlabs and @LeidosInc got over 99% TPR across 467k documents spanning 11 LLMs, 8 domains, and 4 decoding strategies See thread 🧵
3 replies · 9 reposts · 18 likes
@deliprao
Delip Rao e/σ
11 months
Excited to share our first preprint on a comprehensive analysis of withdrawn papers from arXiv spanning its entire history through Sept 2024, in collaboration with @tdietterich and Jonathan Young from the @arxiv team! A quick summary and link to the paper in this thread:
3 replies · 21 reposts · 114 likes
@cmalaviya11
Chaitanya Malaviya
1 year
Excited to share ✨ Contextualized Evaluations ✨! Benchmarks like Chatbot Arena contain underspecified queries, which can lead to arbitrary eval judgments. What happens if we provide evaluators with context (e.g. who's the user, what's their intent) when judging LM outputs? 🧵↓
2 replies · 31 reposts · 122 likes