Valentina Pyatkin
@valentina__py
3K Followers · 7K Following · 58 Media · 759 Statuses
Postdoc at the Allen Institute for AI @allen_ai and @uwnlp
Zürich
Joined October 2016
Our new @NeurIPSConf paper: Measuring What Matters 📄 We reviewed 445 LLM benchmarks from top AI conferences and found systematic weaknesses in: 1️⃣ Statistical rigour 2️⃣ Concept definition 3️⃣ Dataset construction. Blog + paper 👇 https://t.co/orubCJ3V8G
Agent benchmarks don't measure true *AI* advances. We built one that's hard & trustworthy. 👉 AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems 👉 SOTA results across 22 agent *classes* 👉 AgentBaselines agents suite 🆕 https://t.co/BFjdGCAp1w 🧵👇
arxiv.org
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry;...
N-gram novelty is widely used as a measure of creativity and generalization. But if LLMs produce highly n-gram novel expressions that don’t make sense or sound awkward, should they still be called creative? In a new paper, we investigate how n-gram novelty relates to creativity.
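The n-gram novelty measure mentioned above can be sketched as the fraction of a generation's n-grams that never appear in a reference corpus. This is a minimal illustration, not the paper's implementation; whitespace tokenization and the function name are assumptions here:

```python
def ngram_novelty(text: str, corpus: str, n: int = 2) -> float:
    """Fraction of the text's n-grams that do not occur in the corpus.

    Whitespace tokenization is a simplifying assumption; a real
    pipeline would tokenize and normalize more carefully.
    """
    def ngrams(s: str):
        toks = s.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    corpus_grams = set(ngrams(corpus))
    text_grams = ngrams(text)
    if not text_grams:
        return 0.0
    novel = sum(1 for g in text_grams if g not in corpus_grams)
    return novel / len(text_grams)
```

For example, against the corpus "the cat sat on the mat", the generation "the cat sat on a rug" has five bigrams of which two ("on a", "a rug") are unseen, giving a novelty of 0.4 — a score that says nothing about whether the novel phrase is coherent, which is exactly the gap the paper probes.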
This semester I’m teaching a seminar on data attribution. As researchers, it’s always gratifying when someone reads your paper, let alone an entire class! But we rarely get to hear about it. So this thread is a shoutout to the papers and authors we’ve read and discussed in class.
New HF bible just out!! Learn anything you need to train amazing LLMs (from the combined work of our science teams): data, pre-training, post-training, evals, infra, and way more! https://t.co/pRBAeA2FQn Congrats to the amazing @LoubnaBenAllal1 who led this effort! 🤩
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵
Are LLMs biased when they write about political issues? We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before. Long 🧵with spicy results 👇
I will be giving a talk at @ETH_AI_Center next week, on RLVR for verifiable instruction following, generalization, and reasoning! 📢 Join if you are in Zurich and interested in hearing about IFBench and our latest Olmo and Tülu works at @allen_ai
What are good optimizers for diffusion models? 🍂 TLDR: Muon and SOAP are very good. Paper: https://t.co/TYqRpfcu5t
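The tweet above only names the optimizers, so here is a rough sketch of Muon's core idea: rather than stepping along raw momentum, the momentum matrix is approximately orthogonalized first. This standalone NumPy version uses the classic cubic Newton-Schulz iteration; the released Muon uses a tuned quintic variant, and the hyperparameters below are illustrative, not from the paper:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 20) -> np.ndarray:
    """Approximate the nearest orthogonal factor (polar factor) of g via
    the cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X,
    which converges when the normalized singular values lie in (0, sqrt(3))."""
    x = g / (np.linalg.norm(g) + 1e-8)  # Frobenius normalization keeps singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_like_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update on a weight matrix: accumulate momentum,
    then step along its (approximately) orthogonalized direction
    instead of the raw momentum."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return w - lr * update, momentum
```

The orthogonalization equalizes the update's singular values, so no single direction in the weight matrix dominates the step; whether that intuition is what helps diffusion models specifically is the paper's question, not something this sketch demonstrates.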
Come talk to @Ara_Krishnan and me about our recent paper on frequency effects of unlearning and how @allen_ai 's Olmo model and toolkit made this work so much easier. 🚀
Olmo isn’t just open weights—it’s an open research stack. Try it in the Ai2 Playground: https://t.co/qGd4UW8ALv AMA on Discord: Tues, Oct 28 @ 8:00 AM PT with some of the researchers behind these studies + an Ai2 Olmo teammate. Join: https://t.co/GnxLPhM3MW
Thank you to @Ale_Raganato for hosting me in Milano and for listening to me talk about verifiable constraints and RLVR!
Cool to see that Tinker has Tulu 3 SFT as an example in their cookbook :) https://t.co/D09igpMEJG
Claude Skills shows performance benefits from leveraging LLM skill catalogs at inference time. Our previous work (linked under thread 5/5) showed the same 6 months ago! 🌟Our new work, STAT, shows that leveraging skills during training can greatly help too‼️, e.g., Qwen can
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵
Go work with Jonathan! I’m sure he’ll be a fantastic advisor!
🎺 Big personal news: I've joined @imperialcollege as a Visiting Professor! 🎓Excited to collaborate with brilliant colleagues and students. If you're interested in a Machine Learning PhD, please reach out 📨 More exciting news to follow soon...
@tvergarabrowne and I were quiet over the summer with our podcast "Behind the Research of AI"... But now we're back! And with an awesome guest! We interviewed @jxmnop during @COLM_conf and had a blast chatting, eating snacks together, and reflecting on PhD life and research ideas
and that’s a wrap of COLM and SoLaR!
💡We kicked off the SoLaR workshop at #COLM2025 with a great opinion talk by Michelle Ding & Jo Gasior Kavishe (joint work with Victor Ojewale and @SureshVenkat46) on "Testing LLMs in a sandbox isn't responsible. Focusing on community use and needs is."