Lisa Dunlap @lisabdunlap X Profile

Lisa Dunlap

@lisabdunlap

Followers

1K

Following

1K

Media

79

Statuses

508

messin around with model evals @berkeley_ai | prev @lmarena_ai

https://t.co/kDBTP6V6zb

Joined October 2021

Lisa Dunlap

@lisabdunlap

2 days

Amazing work!!!

Aryan Vichare

@aryanvichare10

3 days

Excited for you all to try Code Arena – our new evaluation platform to test models' agentic coding capabilities in building real-world apps and websites. Let me know if you have any feedback or what you think :)

1

0

4

Lisa Dunlap

@lisabdunlap

4 days

trust me, you could not ask for a better advisor :D

Ritwik Gupta 🇺🇦

@Ritwik_G

4 days

I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵

0

1

21

Lisa Dunlap

@lisabdunlap

8 days

Honored to be part of the Slingshot inaugural batch! It's been amazing working with the team :)

Laude Institute

@LaudeInstitute

9 days

Meet Slingshots // One. This inaugural batch includes leading-edge researchers advancing the science and practice of AI - with benchmarks, frameworks, and agents that ship real impact into the world. We're honored to support research from: @alexgshaw @Mike_A_Merrill

0

19

Lisa Dunlap

@lisabdunlap

13 days

So is the formula to just name the most famous institutions and call it an X paper? Neither the first nor the last author is from Anthropic or Stanford. I get that reputation matters for publicity but it does seem a little disrespectful

Rohan Paul

@rohanpaul_ai

13 days

New Stanford+Anthropic paper shows long step-by-step prompts can break model safety and trigger harmful answers. 😟 Long reasoning can quietly neutralize safety checks that people assume are working. The trick adds a benign puzzle and long reasoning before the harmful ask, plus

18

25

436

Melissa Pan

@melissapan

15 days

The Sky’s Fun Committee, representing the ppl of sky, just dropped the new lab theme: ⚫️💖 Black Pink x Halloween 🎃🦇 We have: - Gru & the minions - kpop ??? 🫰😉

8

52

Charles Packer

@charlespacker

16 days

Today we're releasing Context-Bench, a benchmark (and live leaderboard!) measuring LLMs on Agentic Context Engineering. C-Bench measures an agent's ability to manipulate its own context window, a necessary skill for AI agents that can self-improve and continually learn.

19

34

328

Christopher Chou

@chrischou03

17 days

Check out our 32B base model and our entire training recipe (including failures)!

Percy Liang

@percyliang

17 days

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:

0

1

5

Elana Simon

@ElanaPearl

23 days

New blog post: The bug that taught me more about PyTorch than years of using it started with a simple training loss plateau... ended up digging through optimizer states, memory layouts, kernel dispatch, and finally understanding how PyTorch works!

44

178

2K

Tony Wang

@TonyWangIV

23 days

New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵

13

58

576

Lisa Dunlap

@lisabdunlap

24 days

@ICCVConference

0

Lisa Dunlap

@lisabdunlap

24 days

Come see my poster this afternoon @ICCVConference! 🖊️ #252, 2:45pm Exhibit Hall 1 🔗 https://t.co/GPBVjfERzl We build a framework for automatically uncovering semantic concepts that diffusion models represent differently.

1

0

3

Lisa Dunlap

@lisabdunlap

25 days

The new version of asking your friend's crush out for them is asking your frontend coding agent to construct a prompt to give to your backend coding agent... I'm assuming there is an easier cross-repo agent system but I kinda like the nostalgia of telephone

0

4

Lisa Dunlap

@lisabdunlap

25 days

Iconic concert art (very cool work as well)

Tsung-Han (Patrick) Wu

@tsunghan_wu

25 days

Humans handle dynamic situations easily, what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉Check out https://t.co/wKrnsMkiFY

1

0

3

Lisa Dunlap

@lisabdunlap

26 days

@arena in my ~20 samples only got 1 Meta one

1

0

Lisa Dunlap

@lisabdunlap

26 days

@arena imagen is slightly more diverse

1

0

1

Lisa Dunlap

@lisabdunlap

26 days

Looks like self-bias extends to image generation: asked a ton of models on @arena to generate the name of the company that is leading the LLM market. OAI models in particular have quite a lot of brand loyalty

1

0

6

Chris Donahue

@chrisdonahuey

26 days

🎵Music Arena ⚔️ was accepted to the NeurIPS 2025 Creativity Track, and we've released a big update to celebrate! Includes new models from @SonautoAI and @elevenlabsio. Also, Music Arena is now available as a 🤗 @huggingface space and dataset!

1

8

42

Lisa Dunlap

@lisabdunlap

26 days

how tf did this get 200k likes

Elon Musk

@elonmusk

26 days

𝕏 works

1

0

3

Lisa Dunlap

@lisabdunlap

26 days

We need more in-the-wild evals, plus the images here are very fun to look through ;)

Jiaxin Ge

@aomaru_21490

26 days

✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ

1

0

7