Lisa Dunlap
@lisabdunlap
Followers
1K
Following
1K
Media
79
Statuses
508
messin around with model evals @berkeley_ai | prev @lmarena_ai
Joined October 2021
trust me, you could not ask for a better advisor :D
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy. We're ranked #3 in AI on @CSrankings! Specific details in 🧵
0
1
21
Honored to be part of the Slingshot inaugural batch! It's been amazing working with the team :)
Meet Slingshots // One. This inaugural batch includes leading-edge researchers advancing the science and practice of AI - with benchmarks, frameworks, and agents that ship real impact into the world. We're honored to support research from: @alexgshaw @Mike_A_Merrill
0
0
19
So is the formula to just name the most famous institutions and call it an X paper? Neither the first nor the last author is from Anthropic or Stanford. I get that reputation matters for publicity but it does seem a little disrespectful
New Stanford+Anthropic paper shows long step-by-step prompts can break model safety and trigger harmful answers. 😟 Long reasoning can quietly neutralize safety checks that people assume are working. The trick adds a benign puzzle and long reasoning before the harmful ask, plus
18
25
436
The Sky’s Fun Committee, representing the ppl of sky, just dropped the new lab theme: ⚫️💖 Black Pink x Halloween 🎃🦇 We have: - Gru & the minions - kpop ??? 🫰😉
8
8
52
Today we're releasing Context-Bench, a benchmark (and live leaderboard!) measuring LLMs on Agentic Context Engineering. C-Bench measures an agent's ability to manipulate its own context window, a necessary skill for AI agents that can self-improve and continually learn.
19
34
328
Check out our 32B base model and our entire training recipe (including failures)!
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
0
1
5
New blog post: The bug that taught me more about PyTorch than years of using it started with a simple training loss plateau... ended up digging through optimizer states, memory layouts, kernel dispatch, and finally understanding how PyTorch works!
44
178
2K
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
13
58
576
Come see my poster this afternoon @ICCVConference ! 🖊️ #252, 2:45pm Exhibit Hall 1 🔗 https://t.co/GPBVjfERzl We build a framework for automatically uncovering semantic concepts that diffusion models represent differently.
1
0
3
The new version of asking your friend's crush out for them is asking your frontend coding agent to construct a prompt to give to your backend coding agent. ..I'm assuming there is an easier cross repo agent system but i kinda like the nostalgia of telephone
0
0
4
Iconic concert art (very cool work as well)
Humans handle dynamic situations easily, what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉Check out https://t.co/wKrnsMkiFY
1
0
3
Looks like self bias extends to image generation: asked a ton of models on @arena to generate the name of the company that is leading the LLM market: OAI models particularly have quite a lot of brand loyalty
1
0
6
🎵Music Arena ⚔️ was accepted to the NeurIPS 2025 Creativity Track, and we've released a big update to celebrate! Includes new models from @SonautoAI and @elevenlabsio. Also, Music Arena is now available as a 🤗 @huggingface space and dataset!
1
8
42
We need more in the wild evals, plus the images here are very fun to look through ;)
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
1
0
7