
Dr. Karen Ullrich
@karen_ullrich
Followers
5K
Following
584
Media
37
Statuses
272
Research scientist at FAIR NY + collab w/ Vector Institute. ❤️ Machine Learning + Information Theory. Previously, PhD at UoAmsterdam, intern at DeepMind + MSRC.
she / her
Joined December 2013
#Tokenization is undeniably a key player in the success story of #LLMs, but we poorly understand why. I want to highlight progress we have made in understanding the role of tokenization, developing core insights, and mitigating its problems. 🧵👇
15
95
608
Plus, we generate importance maps showing where in the transformer the concept is encoded — providing interpretable insights into model internals.
1
0
4
SAMI: Diminishes or amplifies the identified modules to control the concept's influence. By scaling the importance of these modules, SAMI can either amplify or suppress specific concepts.
1
0
2
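The intervention can be sketched as a simple rescaling. This is a toy illustration under the assumption that per-head outputs are additively combined downstream; it is not the authors' actual implementation, and the function name and shapes are hypothetical.

```python
import numpy as np

def apply_sami(head_outputs, selected_heads, scale):
    """Rescale the outputs of selected (layer, head) pairs.

    head_outputs: array of shape (n_layers, n_heads, d_model).
    scale > 1 amplifies the concept's influence; 0 <= scale < 1
    suppresses it; scale = 0 ablates the heads entirely.
    """
    out = head_outputs.copy()  # leave the original activations intact
    for layer, head in selected_heads:
        out[layer, head] *= scale
    return out
```

Setting `scale = 0` corresponds to fully "forgetting" the concept carried by those heads, while intermediate values allow graded control.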
SAMD: Finds the attention heads most correlated with a concept. Using SAMD, we find that only a few attention heads are crucial for a wide range of concepts—confirming the sparse, modular nature of knowledge in transformers.
1
0
2
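A minimal sketch of the head-selection idea, assuming a concept is represented by an embedding vector and heads are scored by cosine similarity of their mean output with it. The function name, shapes, and scoring rule are hypothetical stand-ins; the paper's actual SAMD procedure may differ.

```python
import numpy as np

def find_concept_heads(head_outputs, concept_vec, top_k=5):
    """Return the (layer, head) pairs most aligned with a concept.

    head_outputs: array (n_layers, n_heads, d_model), e.g. mean head
    output over prompts mentioning the concept.
    concept_vec: (d_model,) embedding representing the concept.
    """
    # cosine similarity between every head's output and the concept
    norms = np.linalg.norm(head_outputs, axis=-1) * np.linalg.norm(concept_vec)
    sims = head_outputs @ concept_vec / np.clip(norms, 1e-9, None)
    # flatten, take the k highest-scoring heads
    flat = np.argsort(sims.ravel())[::-1][:top_k]
    return [divmod(int(i), sims.shape[1]) for i in flat]
```

The sparsity observation above then corresponds to a small `top_k` sufficing across many concepts.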
How would you make an LLM "forget" the concept of dog, or any other arbitrary concept? 🐶❓ We introduce SAMD & SAMI, a novel, concept-agnostic approach to identify and manipulate attention modules in transformers.
3
12
78
Aligned Multi-Objective Optimization (A-🐮) has been accepted at #ICML2025! 🎉 We explore optimization scenarios where objectives align rather than conflict, introducing new scalable algorithms with theoretical guarantees. #MachineLearning #AIResearch #Optimization #MLCommunity
3
12
88
RT @buutphan: Our work got accepted to #ICLR2025 @iclr_conf! Learn more about tokenization bias and how to convert your tokenized LLM to by….
github.com
Example implementation of "Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles" by Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Mat...
0
4
0
0
0
4
🎉 Our paper just got accepted to #ICLR2025! 🎉 Byte-level LLMs without training and guaranteed performance? Curious how? Dive into our work! 📚✨ Paper: Github:
2
14
111
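As a toy illustration of the byte-level idea, much simpler than the paper's exact algorithm (which corrects for token-boundary effects): a next-byte distribution can be read off a next-token distribution by summing the probability of every token whose byte encoding starts with that byte. All names here are illustrative.

```python
from collections import defaultdict

def next_byte_probs(token_probs):
    """Collapse a next-token distribution into a next-byte distribution.

    token_probs: dict mapping token strings to probabilities.
    Returns a dict mapping a single leading byte to total probability.
    """
    byte_probs = defaultdict(float)
    for tok, p in token_probs.items():
        if tok:  # skip empty tokens
            # credit the token's probability to its first byte
            byte_probs[tok.encode("utf-8")[:1]] += p
    return dict(byte_probs)
```

For example, `next_byte_probs({"the": 0.5, "then": 0.3, "a": 0.2})` assigns 0.8 to `b"t"` and 0.2 to `b"a"`, since both "the" and "then" begin with the byte `t`.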
RT @brandondamos: 📢 My team at Meta is hiring visiting PhD students from CMU, UW, Berkeley, and NYU! We study core ML, optimization, amorti….
0
42
0
RT @hall__melissa: Excited to release EvalGIM, an easy-to-use evaluation library for generative image models. EvalGIM ("EvalGym") unifies….
github.com
🦾 EvalGIM (pronounced as "EvalGym") is an evaluation library for generative image models. It enables easy-to-use, reproducible automatic evaluations of text-to-image models and su...
0
15
0
Thursday is busy:
9-11am: I will be at the Meta AI Booth
12.30-2pm: Mission Impossible: A Statistical Perspective on Jailbreaking LLMs ( OR End-To-End Causal Effect Estimation from Unstructured Natural Language Data (
0
0
8
RT @KempeLab: For those into jailbreaking LLMs: our poster "Mission Impossible" today shows the fundamental limits of LLM alignment - and i….
0
4
0
Starting with Fei-Fei Li’s talk at 2.30; after that I will mostly be meeting people and wandering the poster sessions.
0
0
3
Folks, I am posting my NeurIPS schedule daily in hopes of seeing folks, thanks @tkipf for the idea ;)
11-12.30: WiML round tables
1.30-4: Beyond Decoding, Tutorial
0
0
10
RT @NYUDataScience: Researchers at CDS and @AIatMeta prove vulnerabilities in AI language models are unavoidable, but introduce E-RLHF, a m….
nyudatascience.medium.com
CDS and Meta AI researchers have shown that language model vulnerabilities are inevitable but have developed a new method to make them…
0
9
0
What do you think: do we need to sharpen our understanding of tokenization? Or will we soon be rid of it by developing models such as "MegaByte" by @liliyu_lili et al. (@lukezettlemoyer)? And add more papers to the thread!
3
2
29
In "The Foundations of Tokenization: Statistical and Computational Concerns", Gastaldi et al. take first steps towards defining what a tokenizer should be and the properties it ought to have.
1
1
26
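One property a well-behaved tokenizer ought to satisfy is round-trip consistency: detokenization should invert tokenization. A toy check, with a whitespace tokenizer standing in for a real subword tokenizer (illustrative only, not taken from the paper):

```python
def tokenize(text):
    # toy whitespace tokenizer, standing in for a real BPE tokenizer
    return text.split(" ")

def detokenize(tokens):
    return " ".join(tokens)

def is_consistent(text):
    # round-trip property: decoding the encoding recovers the text
    return detokenize(tokenize(text)) == text
```

Real subword tokenizers can violate this kind of property in subtle ways (e.g. around whitespace normalization), which is part of why a formal treatment is useful.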