Eve Fleisig @ EMNLP 2025
@enfleisig
Followers: 680 · Following: 493 · Media: 8 · Statuses: 211
PhD student @Berkeley_EECS | Princeton ‘21 | NLP, ethical + equitable AI, and sociolinguistics enthusiast
Joined July 2020
Excited to present this joint work with @morlikow today, 2-3:30PM, in the NLPerspectives poster session at #EMNLP2025!
@CamachoCollados We finished with the panel feat. @CamachoCollados, @enfleisig, and @Beiduo_Chen discussing the state and challenges of NLPerspectivism. See you at the 5th edition in 2026!
It's challenging to maintain data quality while preserving variation in data labels! We find that spam filtering for data annotation removes annotators who disagree instead of actual spammers, distorting data label distributions. 📄 https://t.co/ccwyvArvqV
Very excited to join the NLPerspectives invited panel this Saturday!
Detailed programme now up on website. Looking forward to 14 research papers, results of the 3rd Shared Task on Learning with Disagreements (LeWiDi), a talk from @CamachoCollados, and a panel discussion feat. Jose, @enfleisig, and @Beiduo_Chen. See you in Room A305 or online!
Can AI simulate human behavior? 🧠 The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality? To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
It was a pleasure working with @kaylalee278 at Berkeley—wonderful to see what she’s up to now!
Game Arena from @generalityinc is the largest LLM strategy game tournament to date. Games are great for measuring LLMs on instruction following, long-horizon planning, and problem-solving. In fact, models that are Olympiad-level at math and coding often struggle to make accurate
individual reporting for post-deployment evals — a little manifesto (& new preprints!) tldr: end users have unique insights about how deployed systems are failing; we should figure out how to translate their experiences into formal evaluations of those systems.
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: https://t.co/24Fyb4Ik3v
BREAKING NEWS! Most people aren’t prompting models with IMO problems :) They’re prompting with tasks that need more context, like “plz make talk slides.” In an ACL oral, I’ll cover challenges in human-LM grounding (in 60K+ real interactions) & introduce a benchmark: RIFTS. 🧵
Correction: this is in Room 1.62 (Human-Centered NLP session) at 2!
I’m at #ACL2025! Stop by Room 1.61 at 2pm Monday for the oral presentation on GRACE, our new, uniquely human-grounded calibration benchmark—or come talk to me & @YooYeonSung1 about how we directly compare with humans to measure LLM overconfidence and avoid miscalibration.
And, if you’re interested in making LLMs serve the needs of complicated and varied populations—whether that manifests as building fair models, incorporating disagreement, preventing undue trust in miscalibrated models, or preventing harms in general—I’m always happy to chat!
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Do people actually like human-like LLMs? In our #ACL2025 paper HumT DumT, we find a kind of uncanny valley effect: users dislike LLM outputs that are *too human-like*. We thus develop methods to reduce human-likeness without sacrificing performance.
It's 2025 and Google Translate still consistently mistranslates the genders of stereotypically gender-associated professions. Surely we can do better!
I feel so honored to win this award at #naaclmeeting #naacl2025 🥹 I can't say how grateful I am to my wonderful advisor @boydgraber, and I could not have done it without @maharshigor, @enfleisig, @IshaniMond66436 🙏
🏆ADVSCORE won an Outstanding Paper Award at #NAACL2025 @naaclmeeting!! If you want to learn how to make your benchmark *actually* adversarial, come find me: 📍Poster Session 5 - HC: Human-centered NLP 📅May 1 @ 2PM Hiring for human-focused AI dev/LLM eval? Let’s talk! 💼
Thrilled to share that AdvScore paper has been accepted to NAACL Main 🚀! Looking forward to pushing forward human-centered model evaluation and benchmark creation. Huge thanks to my amazing collaborators! 🎈🌵🏜️ #NAACL2025
🕵🏻💬 Introducing Feedback Forensics: a new tool to investigate pairwise preference data. Feedback data is notoriously difficult to interpret and has many known issues – our app aims to help! Try it at https://t.co/4HubCg52Pi Three example use-cases 👇🧵