Pietro Lesci

@pietro_lesci

718 Followers · 8K Following · 23 Media · 763 Statuses

Final-year PhD student @cambridge_uni. Causality & language models | ex @bainandcompany @ecb @amazonscience. Passionate musician, professional debugger.

Cambridge
Joined July 2018
@pietro_lesci
Pietro Lesci
1 year
Super excited and grateful that our paper received the best paper award at #ACL2024 🎉 Huge thanks to my fantastic co-authors — @clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms — the reviewers who recommended our paper, and the award committee #ACL2024NLP
@pietro_lesci
Pietro Lesci
1 year
Happy to share our #ACL2024 paper: "Causal Estimation of Memorisation Profiles" 🎉 Drawing from econometrics, we propose a principled and efficient method to estimate memorisation using only observational data! See 🧵 +@clara__meister, Thomas Hofmann, @vlachos_nlp, @tpimentelms
7 replies · 7 reposts · 76 likes
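The tweet's econometrics angle can be made concrete with a toy difference-in-differences contrast: compare how log-likelihood changes across a training step for instances trained on at that step versus instances not yet seen. A minimal simulated sketch follows; the group sizes, effect sizes, and noise levels are all made up, and the paper's actual estimator is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Log-likelihoods of instances at two checkpoints, before and after step t.
# "Treated" instances are trained on exactly at step t; "control" instances
# have not been seen yet. All numbers are simulated.
before_treated = rng.normal(-3.0, 0.5, n)
before_control = rng.normal(-3.0, 0.5, n)
trend = 0.4           # improvement every instance gets as training progresses
memorisation = 0.9    # extra boost from being trained on (the target effect)
after_treated = before_treated + trend + memorisation + rng.normal(0, 0.1, n)
after_control = before_control + trend + rng.normal(0, 0.1, n)

# Difference-in-differences: subtracting the control group's change removes
# the shared trend, isolating the effect of being trained on.
did = (after_treated - before_treated).mean() - (after_control - before_control).mean()
print(f"estimated memorisation effect: {did:.2f}")  # ~0.9
```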
@GladiaLab
GLADIA Research Lab
9 days
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
283 replies · 1K reposts · 11K likes
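To make the inversion claim concrete, here is a brute-force sketch, not the paper's algorithm: if distinct inputs yield distinct hidden states, a token can be recovered by matching the observed state against candidates. The model choice, the single-token setup, and the restricted candidate set are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # illustrative model choice
model = AutoModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def hidden_state(token_id: int) -> torch.Tensor:
    """Last-layer hidden state of a single-token input."""
    return model(torch.tensor([[token_id]])).last_hidden_state[0, 0]

target_id = tok.encode(" secret")[0]
observed = hidden_state(target_id)                # the state we try to invert

# Exhaustive search: injectivity means exactly one candidate matches.
# (Restricted to 1k candidates plus the target to keep the sketch fast.)
candidates = list(range(1_000)) + [target_id]
recovered = min(candidates, key=lambda i: (hidden_state(i) - observed).norm().item())
print(tok.decode([recovered]))                    # " secret"
```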
@johntzwei
Johnny Tian-Zheng Wei
12 days
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
2 replies · 39 reposts · 119 likes
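A sketch of what "controlled insertion" could look like in the simplest case; this is a hypothetical pipeline, not Hubble's actual one: plant chosen texts into the pretraining stream at fixed duplication counts, so memorization can later be studied as a function of exposure.

```python
import random

def insert_controlled(corpus, planted, copies, seed=0):
    """Plant each text a fixed number of times into the stream, then shuffle."""
    stream = list(corpus)
    for text, k in zip(planted, copies):
        stream.extend([text] * k)   # k controlled duplicates
    random.Random(seed).shuffle(stream)
    return stream

corpus = [f"web document {i}" for i in range(100)]
planted = ["a book passage", "a fake biography"]   # hypothetical inserted texts
stream = insert_controlled(corpus, planted, copies=[1, 16])
print(stream.count("a fake biography"))            # 16: exposure is controlled
```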
@clara__meister
Clara Isabel Meister
3 months
🚨New Preprint! In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs. Our Parity-aware BPE algorithm is a step toward addressing this issue: 🧵
5 replies · 30 reposts · 283 likes
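The premium the tweet describes is easy to see by tokenising parallel text. A minimal check, with the model and sentences chosen purely for illustration; Parity-aware BPE itself changes how merges are learned, which this snippet does not do:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # illustrative tokenizer choice
parallel = {
    "en": "The weather is nice today.",
    "fi": "Sää on tänään mukava.",
    "sw": "Hali ya hewa ni nzuri leo.",
}
for lang, text in parallel.items():
    n = len(tok.encode(text))
    # More tokens for the same meaning = higher API cost, shorter context.
    print(f"{lang}: {n} tokens, {n / len(text):.2f} tokens/char")
```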
@pietro_lesci
Pietro Lesci
3 months
Had a really great and fun time with @yanaiela, @niloofar_mire, and @rzshokri discussing memorisation at the @l2m2_workshop panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
@yanaiela
Yanai Elazar
3 months
I had a lot of fun contemplating memorization questions at the @l2m2_workshop panel yesterday together with @niloofar_mire and @rzshokri, moderated by @pietro_lesci who did a fantastic job! #ACL2025
0 replies · 7 reposts · 40 likes
@yanaiela
Yanai Elazar
3 months
Starting in one hour at 11:00! See you in Room 1.32
@yanaiela
Yanai Elazar
3 months
And I'm giving a talk at the @l2m2_workshop on Distributional Memorization next Friday! Curious what that's all about? Make sure to attend the workshop!
0 replies · 2 reposts · 17 likes
@robinomial
Robin Jia
3 months
Super excited for our new #ACL2025 workshop tomorrow on LLM Memorization, featuring talks by the fantastic @rzshokri @yanaiela and @niloofar_mire, and with a dream team of co-organizers @johntzwei @vernadankers @pietro_lesci @tpimentelms @pratyushmaini @YangsiboHuang !
@l2m2_workshop
Workshop on Large Language Model Memorization
3 months
L2M2 will be tomorrow at VIC, room 1.31-32! We hope you will join us for a day of invited talks, orals, and posters on LLM memorization. The full schedule and accepted papers are now on our website:
0 replies · 12 reposts · 47 likes
@pietro_lesci
Pietro Lesci
3 months
Well deserved!!!
@tpimentelms
Tiago Pimentel
3 months
Honoured to receive two (!!) Senior Area Chair awards at #ACL2025 😁 (Conveniently placed on the same slide!) With the amazing Philip Whittington, @GregorBachmann1 and @weGotlieb, @CuiDing_CL, Giovanni Acampa, @a_stadt, @tamaregev
0 replies · 0 reposts · 6 likes
@l2m2_workshop
Workshop on Large Language Model Memorization
3 months
L2M2 is happening this Friday in Vienna at @aclmeeting #ACL2025NLP! We look forward to the gathering of memorization researchers in the NLP community. Invited talks include: @yanaiela @niloofar_mire @rzshokri and see our website for the full program.
sites.google.com · Program
0 replies · 13 reposts · 27 likes
@pietro_lesci
Pietro Lesci
3 months
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
0 replies · 0 reposts · 0 likes
@pietro_lesci
Pietro Lesci
3 months
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
@pietro_lesci
Pietro Lesci
5 months
All modern LLMs run on top of a tokeniser, an often overlooked “preprocessing detail”. But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias. Let’s talk about it and why it matters👇 @aclmeeting #ACL2025 #NLProc
1 reply · 4 reposts · 19 likes
@suchirsalhan
Suchir Salhan
4 months
@tokshop2025 @icmlconf @tweetByZeb @pietro_lesci @julius_gulius @cambridgenlp I will also be sharing more Tokenisation work from @cambridgenlp at TokShop, this time on Tokenisation Bias by @pietro_lesci, @vlachos_nlp, @clara__meister, Thomas Hofmann, and @tpimentelms.
0 replies · 2 reposts · 5 likes
@suchirsalhan
Suchir Salhan
4 months
I'm in Vancouver for TokShop @tokshop2025 at ICML @icmlconf to present joint work with my labmates, @tweetByZeb, @pietro_lesci and @julius_gulius, and Paula Buttery. Our work, ByteSpan, is an information-driven subword tokenisation method inspired by human word segmentation.
1 reply · 6 reposts · 20 likes
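In the spirit of the tweet's information-driven idea, here is a toy segmenter that opens a new span wherever per-character surprisal spikes, echoing how surprisal peaks line up with word boundaries in human segmentation. The surprisal values are hand-picked for the example, and this is not the ByteSpan algorithm itself.

```python
def segment(chars, surprisal, threshold):
    """Start a new span whenever surprisal jumps above `threshold`."""
    spans, current = [], chars[0]
    for ch, s in zip(chars[1:], surprisal[1:]):
        if s > threshold:
            spans.append(current)   # boundary: high information content here
            current = ch
        else:
            current += ch
    spans.append(current)
    return spans

chars = list("thecatsat")
# Hypothetical surprisal values; a real system would use a trained model.
surprisal = [3.1, 0.4, 0.2, 2.8, 0.3, 0.2, 2.9, 0.5, 0.3]
print(segment(chars, surprisal, threshold=2.0))  # ['the', 'cat', 'sat']
```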
@tpimentelms
Tiago Pimentel
4 months
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No! ⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis 🧵
8 replies · 33 reposts · 241 likes
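To see where the linear representation hypothesis sneaks in, consider the standard steering-style intervention: edit an activation along a single direction. The direction and values below are random placeholders; the point is that the edit only makes sense if the feature really is a linear direction in activation space.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
hidden = rng.normal(size=d)          # an activation from some layer (simulated)
feature_dir = rng.normal(size=d)     # hypothetical learned feature direction
feature_dir /= np.linalg.norm(feature_dir)

def intervene(h, direction, alpha):
    """Project out the current feature value, then set it to alpha.

    Implicitly assumes the feature *is* this linear direction."""
    return h - (h @ direction) * direction + alpha * direction

steered = intervene(hidden, feature_dir, alpha=3.0)
print("feature value before/after:", hidden @ feature_dir, steered @ feature_dir)
```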
@vlachos_nlp
Andreas Vlachos
5 months
Looking forward to this year's edition! With great speakers: Ryan McDonald @yulanhe Vlad Niculae @anas_ant @raquel_dmg @annargrs @preslav_nakov @mohitban47 @eunsolc @MarieMarneffe !
@AthensNLP
Athens NLP Summer School
5 months
📢 10 Days Left to apply for the AthNLP - Athens Natural Language Processing Summer School! ✍ Get your applications in before June 15th! https://t.co/2D4hlmXvsn
1 reply · 12 reposts · 24 likes
@pietro_lesci
Pietro Lesci
5 months
Also, we find that:
– Tokenisation bias appears early in training
– Doesn't go away as models improve or with scale
We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
1 reply · 0 reposts · 1 like
@pietro_lesci
Pietro Lesci
5 months
As our main result, we find that when a token is in a model's vocabulary—i.e., when its characters are tokenised as a single symbol—the model may assign it up to 17x more probability than if it had been split into two tokens instead.
1 reply · 0 reposts · 1 like
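The quantity being compared can be probed directly: score the same string once as its in-vocabulary token and once under a forced multi-token split. A rough sketch with an off-the-shelf model; the paper's causal estimate does not come from this kind of naive pairwise comparison, as the tweets below explain:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def seq_logprob(context_ids, target_ids):
    """log p(target | context), summed over the target tokens."""
    ids = torch.tensor([context_ids + target_ids])
    logprobs = model(ids).logits.log_softmax(-1)
    total = 0.0
    for i, t in enumerate(target_ids):
        # logits at position p predict the token at position p + 1
        total += logprobs[0, len(context_ids) + i - 1, t].item()
    return total

context = tok.encode("I said")
single = tok.encode(" hello")                      # one in-vocab token
split = tok.encode(" hell") + tok.encode("o")      # same characters, split up
# (Check tok.tokenize on your tokenizer: the split must reconstruct " hello".)
print(seq_logprob(context, single), seq_logprob(context, split))
```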
@pietro_lesci
Pietro Lesci
5 months
The trick: tokenisers build vocabularies incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocabulary appear as one symbol while those just outside appear as two. Perfect setup for regression discontinuity! Details in 📄!
1 reply · 0 reposts · 1 like
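Regression discontinuity in a nutshell, on simulated data: treat the token's position in the merge ordering as the running variable, the vocabulary size as the cutoff, and fit separate local linear trends on each side; the jump at the cutoff estimates the effect of being in-vocabulary. Everything below (ranks, jump size, outcome) is simulated for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
rank = rng.uniform(30_000, 34_000, size=2_000)  # position in merge ordering
cutoff = 32_000                                 # vocabulary size
in_vocab = (rank < cutoff).astype(float)

# Simulated outcome with a jump of 1.5 at the cutoff (hypothetical effect).
y = -0.001 * (rank - cutoff) + 1.5 * in_vocab + rng.normal(0, 1, rank.size)

# Local linear RDD: intercept, treatment, and separate slopes on each side.
centered = rank - cutoff
X = sm.add_constant(np.column_stack([in_vocab, centered, centered * in_vocab]))
fit = sm.OLS(y, X).fit()
print("estimated jump at cutoff:", fit.params[1])  # coefficient on in_vocab
```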
@pietro_lesci
Pietro Lesci
5 months
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
1 reply · 0 reposts · 1 like
@pietro_lesci
Pietro Lesci
5 months
While intuitive, this question is tricky. We can't just compare:
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura"), as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩), as the model only sees one during training
1 reply · 0 reposts · 1 like