Eve Fleisig @ EMNLP 2025

@enfleisig

Followers: 680 · Following: 493 · Media: 8 · Statuses: 211

PhD student @Berkeley_EECS | Princeton ‘21 | NLP, ethical + equitable AI, and sociolinguistics enthusiast

Joined July 2020
@enfleisig
Eve Fleisig @ EMNLP 2025
4 days
Excited to present this joint work with @morlikow today, 2-3:30PM, in the NLPerspectives poster session at #EMNLP2025!
Replies: 0 · Retweets: 0 · Likes: 4
@NLPerspectivWS
NLPerspectives Workshop
3 days
@CamachoCollados We finished with the panel feat. @CamachoCollados, @enfleisig, and @Beiduo_Chen discussing the state and challenges of NLPerspectivism. See you at the 5th edition in 2026!
Replies: 0 · Retweets: 2 · Likes: 7
@enfleisig
Eve Fleisig @ EMNLP 2025
4 days
It's challenging to maintain data quality while preserving variation in data labels! We find that spam filtering for data annotation removes annotators who disagree instead of actual spammers, distorting data label distributions. 📄 https://t.co/ccwyvArvqV
Replies: 1 · Retweets: 4 · Likes: 30
@enfleisig
Eve Fleisig @ EMNLP 2025
5 days
Very excited to join the NLPerspectives invited panel this Saturday!
@NLPerspectivWS
NLPerspectives Workshop
5 days
Detailed programme now up on website. Looking forward to 14 research papers, results of the 3rd Shared Task on Learning with Disagreements (LeWiDi), a talk from @CamachoCollados, and a panel discussion feat. Jose, @enfleisig, and @Beiduo_Chen. See you in Room A305 or online!
Replies: 0 · Retweets: 1 · Likes: 7
@tiancheng_hu
Tiancheng Hu
14 days
Can AI simulate human behavior? 🧠 The promise is revolutionary for science & policy. But there’s a huge "IF": Do these simulations actually reflect reality? To find out, we introduce SimBench: The first large-scale benchmark for group-level social simulation. (1/9)
Replies: 3 · Retweets: 22 · Likes: 53
@enfleisig
Eve Fleisig @ EMNLP 2025
22 days
Lucy is an absolute gem, run don’t walk to go work with her!
@miserlis_
Alexander Hoyle
22 days
Work with Lucy! Not only does she do great work, she is also great people
Replies: 0 · Retweets: 0 · Likes: 7
@enfleisig
Eve Fleisig @ EMNLP 2025
22 days
It was a pleasure working with @kaylalee278 at Berkeley—wonderful to see what she’s up to now!
@ycombinator
Y Combinator
1 month
Game Arena from @generalityinc is the largest LLM strategy game tournament to date. Games are great for measuring LLMs on instruction following, long-horizon planning, and problem-solving. In fact, models that are Olympiad-level at math and coding often struggle to make accurate…
Replies: 0 · Retweets: 1 · Likes: 2
@jessicadai_
jessica dai
4 months
individual reporting for post-deployment evals — a little manifesto (& new preprints!) tldr: end users have unique insights about how deployed systems are failing; we should figure out how to translate their experiences into formal evaluations of those systems.
Replies: 7 · Retweets: 31 · Likes: 142
@joabaum
Joachim Baumann
2 months
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: https://t.co/24Fyb4Ik3v
Replies: 16 · Retweets: 109 · Likes: 515
@oshaikh13
Omar Shaikh
4 months
BREAKING NEWS! Most people aren’t prompting models with IMO problems :) They’re prompting with tasks that need more context, like “plz make talk slides.” In an ACL oral, I’ll cover challenges in human-LM grounding (in 60K+ real interactions) & introduce a benchmark: RIFTS. 🧵
Replies: 5 · Retweets: 54 · Likes: 274
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
Correction: this is in Room 1.62 (Human-Centered NLP session) at 2!
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
I’m at #ACL2025! Stop by Room 1.61 at 2pm Monday for the oral presentation on GRACE, our new, uniquely human-grounded calibration benchmark—or come talk to me & @YooYeonSung1 about how we directly compare with humans to measure LLM overconfidence and avoid miscalibration.
Replies: 0 · Retweets: 0 · Likes: 12
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
And, if you’re interested in making LLMs serve the needs of complicated and varied populations—whether that manifests as building fair models, incorporating disagreement, preventing undue trust in miscalibrated models, or preventing harms in general—I’m always happy to chat!
Replies: 1 · Retweets: 0 · Likes: 1
@enfleisig
Eve Fleisig @ EMNLP 2025
4 months
I’m at #ACL2025! Stop by Room 1.61 at 2pm Monday for the oral presentation on GRACE, our new, uniquely human-grounded calibration benchmark—or come talk to me & @YooYeonSung1 about how we directly compare with humans to measure LLM overconfidence and avoid miscalibration.
Replies: 1 · Retweets: 1 · Likes: 7
@ChengleiSi
CLS
4 months
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Replies: 12 · Retweets: 194 · Likes: 635
@chengmyra1
Myra Cheng
5 months
Do people actually like human-like LLMs? In our #ACL2025 paper HumT DumT, we find a kind of uncanny valley effect: users dislike LLM outputs that are *too human-like*. We thus develop methods to reduce human-likeness without sacrificing performance.
Replies: 5 · Retweets: 27 · Likes: 170
@enfleisig
Eve Fleisig @ EMNLP 2025
6 months
Same in reverse:
Replies: 0 · Retweets: 0 · Likes: 2
@enfleisig
Eve Fleisig @ EMNLP 2025
6 months
It's 2025 and Google Translate still consistently mistranslates the genders of stereotypically gender-associated professions. Surely we can do better!
Replies: 1 · Retweets: 0 · Likes: 20
@YooYeonSung1
Yoo Yeon Sung@ACL2025
6 months
I feel so honored to win this award at #naaclmeeting #naacl2025 🥹 I cannot say how grateful I am to my wonderful advisor @boydgraber, and I could not have done it without @maharshigor, @enfleisig, @IshaniMond66436 🙏
Replies: 1 · Retweets: 8 · Likes: 52
@YooYeonSung1
Yoo Yeon Sung@ACL2025
7 months
🏆ADVSCORE won an Outstanding Paper Award at #NAACL2025 @naaclmeeting!! If you want to learn how to make your benchmark *actually* adversarial, come find me: 📍Poster Session 5 - HC: Human-centered NLP 📅May 1 @ 2PM Hiring for human-focused AI dev/LLM eval? Let’s talk! 💼
@YooYeonSung1
Yoo Yeon Sung@ACL2025
9 months
Thrilled to share that our AdvScore paper has been accepted to the NAACL Main track 🚀! Looking forward to pushing forward human-centered model evaluation and benchmark creation. Huge thanks to my amazing collaborators! 🎈🌵🏜️ #NAACL2025
Replies: 2 · Retweets: 19 · Likes: 65
@arduinfindeis
Arduin Findeis
8 months
🕵🏻💬 Introducing Feedback Forensics: a new tool to investigate pairwise preference data. Feedback data is notoriously difficult to interpret and has many known issues – our app aims to help! Try it at https://t.co/4HubCg52Pi Three example use-cases 👇🧵
Replies: 2 · Retweets: 11 · Likes: 33