Pietro Lesci
@pietro_lesci
Followers
718
Following
8K
Media
23
Statuses
763
Final-year PhD student @cambridge_uni. Causality & language models | ex @bainandcompany @ecb @amazonscience. Passionate musician, professional debugger.
Cambridge
Joined July 2018
Super excited and grateful that our paper received the Best Paper Award at #ACL2024 🎉 Huge thanks to my fantastic co-authors (@clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms), the reviewers who recommended our paper, and the award committee #ACL2024NLP
Happy to share our #ACL2024 paper: "Causal Estimation of Memorisation Profiles" 🎉 Drawing from econometrics, we propose a principled and efficient method to estimate memorisation using only observational data! See 🧵 +@clara__meister, Thomas Hofmann, @vlachos_nlp, @tpimentelms
7
7
76
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
283
1K
11K
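A minimal sketch of what the injectivity claim implies, not the paper's reconstruction method: if distinct prompts always map to distinct embeddings, then an embedding pins down its prompt within a candidate set, e.g., by nearest-neighbour matching. The model choice ("gpt2" via the transformers library), the final-token pooling, and the candidate prompts below are all illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def embed(text: str) -> torch.Tensor:
    """Use the final token's last hidden state as the prompt embedding."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden[0, -1]

candidates = ["The cat sat on the mat.", "The dog sat on the mat.", "Hello world!"]
target = embed("The dog sat on the mat.")

# If embeddings are injective, the nearest candidate recovers the prompt.
distances = [torch.dist(target, embed(c)).item() for c in candidates]
print(candidates[distances.index(min(distances))])
```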
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
2
39
119
🚨New Preprint! In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs. Our Parity-aware BPE algorithm is a step toward addressing this issue: 🧵
5
30
283
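To make the premium concrete, here is an illustrative measurement, not the Parity-aware BPE algorithm itself: count how many tokens a single tokeniser spends on roughly parallel sentences. The tokeniser choice ("gpt2") and the example sentences are assumptions; a byte-level BPE trained mostly on English typically spends far more tokens on Greek.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Roughly parallel sentences (illustrative, not from the paper).
parallel = {
    "en": "How much does this cost?",
    "de": "Wie viel kostet das?",
    "el": "Πόσο κοστίζει αυτό;",
}

baseline = len(tok.encode(parallel["en"]))
for lang, sentence in parallel.items():
    n = len(tok.encode(sentence))
    print(f"{lang}: {n} tokens (premium vs en: {n / baseline:.2f}x)")
```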
Had a really great and fun time with @yanaiela, @niloofar_mire, and @rzshokri discussing memorisation at the @l2m2_workshop panel. Thanks to the entire organising team and attendees for making this such a fantastic workshop! #ACL2025
I had a lot of fun thinking about memorization questions at the @l2m2_workshop panel yesterday together with @niloofar_mire and @rzshokri, moderated by @pietro_lesci, who did a fantastic job! #ACL2025
0
7
40
Starting in one hour at 11:00! See you in Room 1.32
And I'm giving a talk at the @l2m2_workshop on Distributional Memorization next Friday! Curious what that's all about? Make sure to attend the workshop!
0
2
17
Super excited for our new #ACL2025 workshop tomorrow on LLM Memorization, featuring talks by the fantastic @rzshokri @yanaiela and @niloofar_mire, and with a dream team of co-organizers @johntzwei @vernadankers @pietro_lesci @tpimentelms @pratyushmaini @YangsiboHuang !
L2M2 will be tomorrow at VIC, room 1.31-32! We hope you will join us for a day of invited talks, orals, and posters on LLM memorization. The full schedule and accepted papers are now on our website:
0
12
47
Well deserved!!!
Honoured to receive two (!!) Senior Area Chair awards at #ACL2025 😁 (Conveniently placed on the same slide!) With the amazing Philip Whittington, @GregorBachmann1 and @weGotlieb, @CuiDing_CL, Giovanni Acampa, @a_stadt, @tamaregev
0
0
6
L2M2 is happening this Friday in Vienna at @aclmeeting #ACL2025NLP! We look forward to bringing together memorization researchers from across the NLP community. Invited talks by @yanaiela, @niloofar_mire, and @rzshokri; see our website for the full program.
sites.google.com
Program
0
13
27
Also, got burning questions about memorisation? Send them my way—we'll make sure to pose them to our panelists during the workshop!
0
0
0
Headed to Vienna for #ACL2025 to present our tokenisation bias paper and co-organise the L2M2 workshop on memorisation in language models. Reach out to chat about tokenisation, memorisation, and all things pre-training (esp. data-related topics)!
All modern LLMs run on top of a tokeniser, an often overlooked “preprocessing detail”. But what if that tokeniser systematically affects model behaviour? We call this tokenisation bias. Let’s talk about it and why it matters👇 @aclmeeting #ACL2025 #NLProc
1
4
19
@tokshop2025 @icmlconf @tweetByZeb @pietro_lesci @julius_gulius @cambridgenlp I will also be sharing more tokenisation work from @cambridgenlp at TokShop, this time on tokenisation bias by @pietro_lesci and @vlachos_nlp, @clara__meister, Thomas Hofmann, and @tpimentelms.
0
2
5
I'm in Vancouver for TokShop @tokshop2025 at ICML @icmlconf to present joint work with my labmates, @tweetByZeb, @pietro_lesci and @julius_gulius, and Paula Buttery. Our work, ByteSpan, is an information-driven subword tokenisation method inspired by human word segmentation.
1
6
20
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No! ⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis 🧵
8
33
241
Looking forward to this year's edition! With great speakers: Ryan McDonald @yulanhe Vlad Niculae @anas_ant @raquel_dmg @annargrs @preslav_nakov @mohitban47 @eunsolc @MarieMarneffe !
📢 10 Days Left to apply for the AthNLP - Athens Natural Language Processing Summer School! ✍ Get your applications in before June 15th! https://t.co/2D4hlmXvsn
1
12
24
Paper 📄: https://t.co/CzVwyzyq6j Code 💻: https://t.co/U4cD4z2h25 Joint work with amazing collaborators: @clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms!
github.com/pietrolesci/tokenisation-bias
0
0
2
Also, we find that:
– Tokenisation bias appears early in training
– It doesn't go away as models improve or with scale
We hope this approach can support:
– More principled vocabulary design
– Better understanding of generalisation trade-offs
– Fairer and more stable LMs
1
0
1
As our main result, we find that when a token is in a model's vocabulary (i.e., when its characters are tokenised as a single symbol), the model may assign it up to 17x more probability than if it had been split into two tokens instead.
1
0
1
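A rough sketch of the quantity being compared, not the paper's causal estimator: score the same word under its in-vocabulary single-token encoding versus a forced two-token split, given the same context. The model and tokeniser ("gpt2"), the context, and the particular split are illustrative assumptions.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob(context_ids, target_ids):
    """Sum of log p(target_i | context, target_<i) under the model."""
    ids = torch.tensor([context_ids + target_ids])
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    total = 0.0
    for i, t in enumerate(target_ids):
        pos = len(context_ids) + i - 1  # logits at pos predict the token at pos + 1
        total += logprobs[0, pos, t].item()
    return total

context = tok.encode("I just wanted to say")
single = tok.encode(" hello")                  # in-vocab: a single token
split = tok.encode(" hell") + tok.encode("o")  # alternative segmentation (assumed two tokens)

gap = logprob(context, single) - logprob(context, split)
print(f"single-token encoding is {math.exp(gap):.1f}x more likely")
```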
The trick: tokenisers build vocabularies incrementally up to a fixed size (e.g., 32k). This defines a "cutoff": tokens near it are similar (e.g., in frequency), but those just inside the vocabulary appear as one symbol while those just outside appear as two. A perfect setup for regression discontinuity! Details in the 📄!
1
0
1
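A hedged sketch of the regression discontinuity idea on synthetic data, not the paper's exact estimator: order tokens by the tokeniser's rank, treat the vocabulary size as the cutoff, and compare local linear fits of a per-token outcome just inside versus just outside it. The outcome, cutoff, and bandwidth below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
cutoff = 32_000  # vocabulary size: ranks below it are in-vocab (one symbol)

# Running variable: token rank. Outcome: a per-token quantity with a jump at the cutoff.
rank = rng.integers(28_000, 36_000, size=5_000)
outcome = -1e-4 * rank + 2.0 * (rank < cutoff) + rng.normal(0, 0.5, size=rank.size)

bandwidth = 2_000
left = (rank >= cutoff - bandwidth) & (rank < cutoff)   # just inside the vocabulary
right = (rank >= cutoff) & (rank < cutoff + bandwidth)  # just outside the vocabulary

# Local linear fits on each side, both evaluated at the cutoff.
fit_left = np.polyfit(rank[left], outcome[left], deg=1)
fit_right = np.polyfit(rank[right], outcome[right], deg=1)
effect = np.polyval(fit_left, cutoff) - np.polyval(fit_right, cutoff)
print(f"estimated jump at the cutoff: {effect:.2f}")  # ≈ 2.0 by construction
```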
So, did we train thousands of models, with and without each token in our vocabulary? No! Our method works observationally! 👀📊
1
0
1
While intuitive, this question is tricky. We can't just compare:
1️⃣ in- vs. out-of-vocab words (like "hello" vs "appoggiatura"), as they differ systematically, e.g., in frequency
2️⃣ different tokenisations (e.g., ⟨he,llo⟩ or ⟨hello⟩), as the model only sees one during training
1
0
1