Lisa Dunlap

@lisabdunlap

Followers
1K
Following
1K
Media
79
Statuses
508

messin around with model evals @berkeley_ai | prev @lmarena_ai

Joined October 2021
@lisabdunlap
Lisa Dunlap
2 days
Amazing work!!!
@aryanvichare10
Aryan Vichare
3 days
Excited for you all to try Code Arena – our new evaluation platform to test models' agentic coding capabilities in building real-world apps and websites. Let me know if you have any feedback or what you think :)
1
0
4
@lisabdunlap
Lisa Dunlap
4 days
trust me, you could not ask for a better advisor :D
@Ritwik_G
Ritwik Gupta 🇺🇦
4 days
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
0
1
21
@lisabdunlap
Lisa Dunlap
8 days
Honored to be part of the Slingshot inaugural batch! It's been amazing working with the team :)
@LaudeInstitute
Laude Institute
9 days
Meet Slingshots // One. This inaugural batch includes leading-edge researchers advancing the science and practice of AI - with benchmarks, frameworks, and agents that ship real impact into the world. We're honored to support research from: @alexgshaw @Mike_A_Merrill
0
0
19
@lisabdunlap
Lisa Dunlap
13 days
So is the formula to just name the most famous institutions and call it an X paper? Neither the first nor the last author is from Anthropic or Stanford. I get that reputation matters for publicity, but it does seem a little disrespectful
@rohanpaul_ai
Rohan Paul
13 days
New Stanford+Anthropic paper shows long step-by-step prompts can break model safety and trigger harmful answers. 😟 Long reasoning can quietly neutralize safety checks that people assume are working. The trick adds a benign puzzle and long reasoning before the harmful ask, plus
18
25
436
@melissapan
Melissa Pan
15 days
The Sky’s Fun Committee, representing the ppl of sky, just dropped the new lab theme: ⚫️💖 Black Pink x Halloween 🎃🦇 We have: - Gru & the minions - kpop ??? 🫰😉
8
8
52
@charlespacker
Charles Packer
16 days
Today we're releasing Context-Bench, a benchmark (and live leaderboard!) measuring LLMs on Agentic Context Engineering. C-Bench measures an agent's ability to manipulate its own context window, a necessary skill for AI agents that can self-improve and continually learn.
19
34
328
@chrischou03
Christopher Chou
17 days
Check out our 32B base model and our entire training recipe (including failures)!
@percyliang
Percy Liang
17 days
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
0
1
5
@ElanaPearl
Elana Simon
23 days
New blog post: The bug that taught me more about PyTorch than years of using it started with a simple training loss plateau... ended up digging through optimizer states, memory layouts, kernel dispatch, and finally understanding how PyTorch works!
44
178
2K
@TonyWangIV
Tony Wang
23 days
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
13
58
576
@lisabdunlap
Lisa Dunlap
24 days
Come see my poster this afternoon @ICCVConference ! 🖊️ #252, 2:45pm Exhibit Hall 1 🔗 https://t.co/GPBVjfERzl We build a framework for automatically uncovering semantic concepts that diffusion models represent differently.
1
0
3
@lisabdunlap
Lisa Dunlap
25 days
The new version of asking your friend's crush out for them is asking your frontend coding agent to construct a prompt to give to your backend coding agent. ...I'm assuming there is an easier cross-repo agent system but I kinda like the nostalgia of telephone
0
0
4
@lisabdunlap
Lisa Dunlap
25 days
Iconic concert art (very cool work as well)
@tsunghan_wu
Tsung-Han (Patrick) Wu
25 days
Humans handle dynamic situations easily, what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉Check out https://t.co/wKrnsMkiFY
1
0
3
@lisabdunlap
Lisa Dunlap
26 days
@arena in my ~20 samples only got 1 Meta one
1
0
0
@lisabdunlap
Lisa Dunlap
26 days
@arena imagen is slightly more diverse
1
0
1
@lisabdunlap
Lisa Dunlap
26 days
Looks like self-bias extends to image generation: asked a ton of models on @arena to generate the name of the company that is leading the LLM market. OAI models in particular have quite a lot of brand loyalty
1
0
6
@chrisdonahuey
Chris Donahue
26 days
🎵Music Arena ⚔️ was accepted to the NeurIPS 2025 Creativity Track, and we've released a big update to celebrate! Includes new models from @SonautoAI and @elevenlabsio. Also, Music Arena is now available as a 🤗 @huggingface space and dataset!
1
8
42
@lisabdunlap
Lisa Dunlap
26 days
how tf did this get 200k likes
@elonmusk
Elon Musk
26 days
𝕏 works
1
0
3
@lisabdunlap
Lisa Dunlap
26 days
We need more in-the-wild evals, plus the images here are very fun to look through ;)
@aomaru_21490
Jiaxin Ge
26 days
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
1
0
7