Unveiling the Secret Sauce: A Causal Look at Data Memorisation and Tokenisation in Language Models
- 👤 Speaker: Pietro Lesci (University of Cambridge)
- 📅 Date & Time: Friday 30 May 2025, 12:00 - 13:00
- 📍 Venue: Room FW26 (hybrid format). Zoom link for those who wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09
Abstract
While model design gets much of the spotlight, subtle data choices, such as which documents a model sees and how they are represented, can profoundly shape its behaviour. Training data is the secret sauce behind a language model's success, yet it remains relatively understudied. In this talk, I will discuss how training data influences a model's behaviour via two key phenomena: memorisation and tokenisation bias. First, I'll present our work on memorisation, asking: to what extent does a model remember specific documents it was trained on? Answering this question directly is computationally expensive. Instead, we frame memorisation as a causal question and introduce an efficient method to estimate it without re-training, revealing how memorisation depends on factors such as data order and model size. Next, I'll discuss how subword tokenisation, often treated as a preprocessing detail, systematically biases model predictions. We ask: how would a model's output change if a piece of text were tokenised as one subword instead of two? Using tools from econometrics, we estimate this counterfactual without re-training the model with a different vocabulary. We show that when a piece of text is tokenised into fewer subwords, it consistently receives a higher probability. Together, these results show that training data profoundly shapes a model's behaviour, and that causal methods let us efficiently estimate and understand these phenomena, offering insight into how to better train language models.
Bio: Pietro Lesci is a final-year PhD student in Computer Science at the University of Cambridge, working with Prof Andreas Vlachos. His research explores how training data shapes a model's behaviour, focusing on memorisation, tokenisation, and generalisation, drawing on causal methods from econometrics. His work has been presented at major machine learning conferences such as ICLR, ACL, NAACL, and EMNLP. He has received the Best Paper Award at ACL 2024, the Paper of the Year Award from Cambridge's Department of Computer Science and Technology, and funding from Translated's Imminent Research Grant. Pietro's experience spans academia and industry, including 3+ years working in research labs, consulting firms, and international institutions. He holds an MSc in Economic and Social Sciences from Bocconi University.
Series: This talk is part of the NLIP Seminar Series.
Included in Lists
- All Talks (aka the CURE list)
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Computer Education Research
- Computing Education Research
- Department of Computer Science and Technology talks and seminars
- Graduate-Seminars
- Guy Emerson's list
- Interested Talks
- Language Sciences for Graduate Students
- ndk22's list
- NLIP Seminar Series
- ob366-ai4er
- PMRFPS's
- rp587
- School of Technology
- Simon Baker's List
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.