Pietro Lesci
@pietro_lesci
Followers
718
Following
8K
Media
23
Statuses
763
Final-year PhD student @cambridge_uni. Causality & language models | ex @bainandcompany @ecb @amazonscience. Passionate musician, professional debugger.
Cambridge
Joined July 2018
Super excited and grateful that our paper received the Best Paper Award at #ACL2024 🎉 Huge thanks to my fantastic co-authors (@clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms), the reviewers who recommended our paper, and the award committee #ACL2024NLP
Happy to share our #ACL2024 paper: "Causal Estimation of Memorisation Profiles" 🎉 Drawing from econometrics, we propose a principled and efficient method to estimate memorisation using only observational data! See 🧵 +@clara__meister, Thomas Hofmann, @vlachos_nlp, @tpimentelms
7
7
76
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
283
1K
11K
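A minimal sketch of what the injectivity claim implies, not the paper's reconstruction method: if distinct prompts always map to distinct embeddings, then an embedding pins down its prompt within a candidate set, e.g., by nearest-neighbour matching. The model choice ("gpt2" via the transformers library), the final-token pooling, and the candidate prompts below are all illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def embed(text: str) -> torch.Tensor:
    """Use the final token's last hidden state as the prompt embedding."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden[0, -1]

candidates = ["The cat sat on the mat.", "The dog sat on the mat.", "Hello world!"]
target = embed("The dog sat on the mat.")

# If embeddings are injective, the nearest candidate recovers the prompt.
distances = [torch.dist(target, embed(c)).item() for c in candidates]
print(candidates[distances.index(min(distances))])
```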
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
2
39
119
🚨New Preprint! In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs. Our Parity-aware BPE algorithm is a step toward addressing this issue: 🧵
5
30
283
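To make the premium concrete, here is an illustrative measurement, not the Parity-aware BPE algorithm itself: count how many tokens a single tokeniser spends on roughly parallel sentences. The tokeniser choice ("gpt2") and the example sentences are assumptions; a byte-level BPE trained mostly on English typically spends far more tokens on Greek.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Roughly parallel sentences (illustrative, not from the paper).
parallel = {
    "en": "How much does this cost?",
    "de": "Wie viel kostet das?",
    "el": "Πόσο κοστίζει αυτό;",
}

baseline = len(tok.encode(parallel["en"]))
for lang, sentence in parallel.items():
    n = len(tok.encode(sentence))
    print(f"{lang}: {n} tokens (premium vs en: {n / baseline:.2f}x)")
```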
Had a really great and fun time with @yanaiela, @niloofar_mire, and @rzshokri discussing memorisation at the @l2m2_workshop panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
I had a lot of fun thinking about memorization questions at the @l2m2_workshop panel yesterday together with @niloofar_mire and @rzshokri, moderated by @pietro_lesci, who did a fantastic job! #ACL2025
0
7
40
Starting in one hour at 11:00! See you in Room 1.32
And I'm giving a talk at the @l2m2_workshop on Distributional Memorization next Friday! Curious what that's all about? Make sure to attend the workshop!
0
2
17
Super excited for our new #ACL2025 workshop tomorrow on LLM Memorization, featuring talks by the fantastic @rzshokri @yanaiela and @niloofar_mire, and with a dream team of co-organizers @johntzwei @vernadankers @pietro_lesci @tpimentelms @pratyushmaini @YangsiboHuang !
L2M2 will be tomorrow at VIC, room 1.31-32! We hope you will join us for a day of invited talks, orals, and posters on LLM memorization. The full schedule and accepted papers are now on our website:
0
12
47
Well deserved!!!
Honoured to receive two (!!) Senior Area Chair awards at #ACL2025 😁 (Conveniently placed on the same slide!) With the amazing Philip Whittington, @GregorBachmann1 and @weGotlieb, @CuiDing_CL, Giovanni Acampa, @a_stadt, @tamaregev
0
0
6
L2M2 is happening this Friday in Vienna at @aclmeeting #ACL2025NLP! We look forward to bringing together memorization researchers from across the NLP community. Invited talks by @yanaiela, @niloofar_mire, and @rzshokri; see our website for the full program.
sites.google.com
Program
0
13
27
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
0
0
0
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
All modern LLMs run on top of a tokeniser, an often overlooked “preprocessing detail”. But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias. Let’s talk about it and why it matters👇 @aclmeeting #ACL2025 #NLProc
1
4
19
@tokshop2025 @icmlconf @tweetByZeb @pietro_lesci @julius_gulius @cambridgenlp I will also be sharing more tokenisation work from @cambridgenlp at TokShop, this time on tokenisation bias by @pietro_lesci and @vlachos_nlp, @clara__meister, Thomas Hofmann, and @tpimentelms.
0
2
5
I'm in Vancouver for TokShop @tokshop2025 at ICML @icmlconf to present joint work with my labmates, @tweetByZeb, @pietro_lesci and @julius_gulius, and Paula Buttery. Our work, ByteSpan, is an information-driven subword tokenisation method inspired by human word segmentation.
1
6
20
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No! ⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis 🧵
8
33
241
Looking forward to this year's edition! With great speakers: Ryan McDonald @yulanhe Vlad Niculae @anas_ant @raquel_dmg @annargrs @preslav_nakov @mohitban47 @eunsolc @MarieMarneffe !
📢 10 Days Left to apply for the AthNLP - Athens Natural Language Processing Summer School! ✍ Get your applications in before June 15th! https://t.co/2D4hlmXvsn
1
12
24
Paper 📄: https://t.co/CzVwyzyq6j Code 💻: https://t.co/U4cD4z2h25 Joint work with amazing collaborators: @clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms!
github.com/pietrolesci/tokenisation-bias
0
0
2
Also, we find that:
– Tokenisation bias appears early in training
– It doesn't go away as models improve or with scale
We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
1
0
1
As our main result, we find that when a token is in a model's vocabulary (i.e., when its characters are tokenised as a single symbol), the model may assign it up to 17x more probability than if it had been split into two tokens instead.
1
0
1
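A rough sketch of the quantity being compared, not the paper's causal estimator: score the same word under its in-vocabulary single-token encoding versus a forced two-token split, given the same context. The model and tokeniser ("gpt2"), the context, and the particular split are illustrative assumptions.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob(context_ids, target_ids):
    """Sum of log p(target_i | context, target_<i) under the model."""
    ids = torch.tensor([context_ids + target_ids])
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    total = 0.0
    for i, t in enumerate(target_ids):
        pos = len(context_ids) + i - 1  # logits at pos predict the token at pos + 1
        total += logprobs[0, pos, t].item()
    return total

context = tok.encode("I just wanted to say")
single = tok.encode(" hello")                  # in-vocab: a single token
split = tok.encode(" hell") + tok.encode("o")  # alternative segmentation (assumed two tokens)

gap = logprob(context, single) - logprob(context, split)
print(f"single-token encoding is {math.exp(gap):.1f}x more likely")
```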
The trick: tokenisers build vocabularies incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocabulary appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in the 📄!
1
0
1
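A hedged sketch of the regression discontinuity idea on synthetic data, not the paper's exact estimator: order tokens by the tokeniser's rank, treat the vocabulary size as the cutoff, and compare local linear fits of a per-token outcome just inside versus just outside it. The outcome, cutoff, and bandwidth below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
cutoff = 32_000  # vocabulary size: ranks below it are in-vocab (one symbol)

# Running variable: token rank. Outcome: a per-token quantity with a jump at the cutoff.
rank = rng.integers(28_000, 36_000, size=5_000)
outcome = -1e-4 * rank + 2.0 * (rank < cutoff) + rng.normal(0, 0.5, size=rank.size)

bandwidth = 2_000
left = (rank >= cutoff - bandwidth) & (rank < cutoff)   # just inside the vocabulary
right = (rank >= cutoff) & (rank < cutoff + bandwidth)  # just outside the vocabulary

# Local linear fits on each side, both evaluated at the cutoff.
fit_left = np.polyfit(rank[left], outcome[left], deg=1)
fit_right = np.polyfit(rank[right], outcome[right], deg=1)
effect = np.polyval(fit_left, cutoff) - np.polyval(fit_right, cutoff)
print(f"estimated jump at the cutoff: {effect:.2f}")  # ≈ 2.0 by construction
```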
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
1
0
1
While intuitive, this question is tricky. We can't just compare:
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura"), as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩), as the model only sees one during training
1
0
1