Eve Fleisig @ EMNLP 2025

@enfleisig

Followers: 680 · Following: 493 · Media: 8 · Statuses: 211

PhD student @Berkeley_EECS | Princeton ‘21 | NLP, ethical + equitable AI, and sociolinguistics enthusiast

Joined July 2020
@enfleisig
Eve Fleisig @ EMNLP 2025
4 days
Excited to present this joint work with @morlikow today, 2-3:30PM, in the NLPerspectives poster session at #EMNLP2025!
Replies: 0 · Retweets: 0 · Likes: 4
@NLPerspectivWS
NLPerspectives Workshop
3 days
@CamachoCollados We finished with the panel feat. @CamachoCollados, @enfleisig, and @Beiduo_Chen discussing the state and challenges of NLPerspectivism. See you at the 5th edition in 2026!
Replies: 0 · Retweets: 2 · Likes: 7
@enfleisig
Eve Fleisig @ EMNLP 2025
4 days
It's challenging to maintain data quality while preserving variation in data labels! We find that spam filtering for data annotation removes annotators who disagree instead of actual spammers, distorting data label distributions. 📄 https://t.co/ccwyvArvqV
Replies: 1 · Retweets: 4 · Likes: 30
@enfleisig
Eve Fleisig @ EMNLP 2025
5 days
Very excited to join the NLPerspectives invited panel this Saturday!
@NLPerspectivWS
NLPerspectives Workshop
5 days
Detailed programme now up on website. Looking forward to 14 research papers, results of the 3rd Shared Task on Learning with Disagreements (LeWiDi), a talk from @CamachoCollados, and a panel discussion feat. Jose, @enfleisig, and @Beiduo_Chen. See you in Room A305 or online!
Replies: 0 · Retweets: 1 · Likes: 7
@tiancheng_hu
Tiancheng Hu
14 days
Can AI simulate human behavior? 🧠 The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality? To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
Replies: 3 · Retweets: 22 · Likes: 53
@enfleisig
Eve Fleisig @ EMNLP 2025
22 days
Lucy is an absolute gem, run don’t walk to go work with her!
@miserlis_
Alexander Hoyle
22 days
Work with Lucy! Not only does she do great work, she is also great people
Replies: 0 · Retweets: 0 · Likes: 7
@enfleisig
Eve Fleisig @ EMNLP 2025
22 days
It was a pleasure working with @kaylalee278 at Berkeley—wonderful to see what she’s up to now!
@ycombinator
Y Combinator
1 month
Game Arena from @generalityinc is the largest LLM strategy game tournament to date. Games are great for measuring LLMs on instruction following, long-horizon planning, and problem-solving. In fact, models that are Olympiad-level at math and coding often struggle to make accurate…
Replies: 0 · Retweets: 1 · Likes: 2
@jessicadai_
jessica dai
4 months
individual reporting for post-deployment evals — a little manifesto (& new preprints!) tldr: end users have unique insights about how deployed systems are failing; we should figure out how to translate their experiences into formal evaluations of those systems.
Replies: 7 · Retweets: 31 · Likes: 142
@joabaum
Joachim Baumann
2 months
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: https://t.co/24Fyb4Ik3v
Replies: 16 · Retweets: 109 · Likes: 515
@oshaikh13
Omar Shaikh
4 months
BREAKING NEWS! Most people aren’t prompting models with IMO problems :) They’re prompting with tasks that need more context, like “plz make talk slides.” In an ACL oral, I’ll cover challenges in human-LM grounding (in 60K+ real interactions) & introduce a benchmark: RIFTS. 🧵
Replies: 5 · Retweets: 54 · Likes: 274
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
Correction: this is in Room 1.62 (Human-Centered NLP session) at 2!
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
I’m at #ACL2025! Stop by Room 1.61 at 2pm Monday for the oral presentation on GRACE, our new, uniquely human-grounded calibration benchmark—or come talk to me & @YooYeonSung1 about how we directly compare with humans to measure LLM overconfidence and avoid miscalibration.
Replies: 0 · Retweets: 0 · Likes: 12
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
And, if you’re interested in making LLMs serve the needs of complicated and varied populations—whether that manifests as building fair models, incorporating disagreement, preventing undue trust in miscalibrated models, or preventing harms in general—I’m always happy to chat!
Replies: 1 · Retweets: 0 · Likes: 1
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
I’m at #ACL2025! Stop by Room 1.61 at 2pm Monday for the oral presentation on GRACE, our new, uniquely human-grounded calibration benchmark—or come talk to me & @YooYeonSung1 about how we directly compare with humans to measure LLM overconfidence and avoid miscalibration.
Replies: 1 · Retweets: 1 · Likes: 7
@ChengleiSi
CLS
4 months
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Replies: 12 · Retweets: 194 · Likes: 635
@chengmyra1
Myra Cheng
5 months
Do people actually like human-like LLMs? In our #ACL2025 paper HumT DumT, we find a kind of uncanny valley effect: users dislike LLM outputs that are *too human-like*. We thus develop methods to reduce human-likeness without sacrificing performance.
Replies: 5 · Retweets: 27 · Likes: 170
@enfleisig
Eve Fleisig @ EMNLP 2025
6 months
Same in reverse:
Replies: 0 · Retweets: 0 · Likes: 2
@enfleisig
Eve Fleisig @ EMNLP 2025
6 months
It's 2025 and Google Translate still consistently mistranslates the genders of stereotypically gender-associated professions. Surely we can do better!
Replies: 1 · Retweets: 0 · Likes: 20
@YooYeonSung1
Yoo Yeon Sung@ACL2025
6 months
I feel so honored to win this award at #naaclmeeting #naacl2025 🥹 I cannot say how grateful I am to my wonderful advisor @boydgraber, and I could not have done it without @maharshigor, @enfleisig, @IshaniMond66436 🙏
Replies: 1 · Retweets: 8 · Likes: 52
@YooYeonSung1
Yoo Yeon Sung@ACL2025
7 months
🏆ADVSCORE won an Outstanding Paper Award at #NAACL2025 @naaclmeeting!! If you want to learn how to make your benchmark *actually* adversarial, come find me: 📍Poster Session 5 - HC: Human-centered NLP 📅May 1 @ 2PM Hiring for human-focused AI dev/LLM eval? Let’s talk! 💼
@YooYeonSung1
Yoo Yeon Sung@ACL2025
9 months
Thrilled to share that our AdvScore paper has been accepted to the NAACL Main track 🚀! Looking forward to pushing forward human-centered model evaluation and benchmark creation. Huge thanks to my amazing collaborators! 🎈🌵🏜️ #NAACL2025
Replies: 2 · Retweets: 19 · Likes: 65
@arduinfindeis
Arduin Findeis
8 months
🕵🏻💬 Introducing Feedback Forensics: a new tool to investigate pairwise preference data. Feedback data is notoriously difficult to interpret and has many known issues – our app aims to help! Try it at https://t.co/4HubCg52Pi Three example use-cases 👇🧵
Replies: 2 · Retweets: 11 · Likes: 33