Unveiling the Secret Sauce: A Causal Look at Data Memorisation and Tokenisation in Language Models
- 👤 Speaker: Pietro Lesci (University of Cambridge)
- 📅 Date & Time: Friday 30 May 2025, 12:00 - 13:00
- 📍 Venue: Room FW26 (hybrid format). Zoom link for those who wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09
Abstract
While model design gets much of the spotlight, subtle data choices, such as which documents a model sees and how they are represented, can profoundly shape its behaviour. Training data is the secret sauce behind a language model's success, yet it remains relatively understudied. In this talk, I will discuss how training data influences a model's behaviour via two key phenomena: memorisation and tokenisation bias. First, I'll present our work on memorisation, asking: to what extent does a model remember specific documents it was trained on? Answering this question directly is computationally expensive. Instead, we frame memorisation as a causal question and introduce an efficient method to estimate it without re-training, revealing how memorisation depends on factors such as data order and model size. Next, I'll discuss how subword tokenisation, often treated as a preprocessing detail, systematically biases model predictions. We ask: how would a model's output change if a piece of text were tokenised as one subword instead of two? Using tools from econometrics, we estimate this counterfactual without re-training the model with a different vocabulary. We show that when a piece of text is tokenised into fewer subwords, it consistently receives a higher probability. Together, these results show that training data profoundly shapes a model's behaviour, and that causal methods let us efficiently estimate and understand these phenomena, offering insight into how to better train language models.
Bio: Pietro Lesci is a final-year PhD student in Computer Science at the University of Cambridge, working with Prof Andreas Vlachos. His research explores how training data shapes a model's behaviour, focusing on memorisation, tokenisation, and generalisation, drawing on causal methods from econometrics. His work has been presented at major machine learning conferences such as ICLR, ACL, NAACL, and EMNLP. He has received the Best Paper Award at ACL 2024, the Paper of the Year Award from Cambridge's Department of Computer Science and Technology, and funding from Translated's Imminent Research Grant. Pietro's experience spans academia and industry, including 3+ years working in research labs, consulting firms, and international institutions. He holds an MSc in Economic and Social Sciences from Bocconi University.
Series: This talk is part of the NLIP Seminar Series.
Included in Lists
- All Talks (aka the CURE list)
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Computer Education Research
- Computing Education Research
- Department of Computer Science and Technology talks and seminars
- Graduate-Seminars
- Guy Emerson's list
- Interested Talks
- Language Sciences for Graduate Students
- ndk22's list
- NLIP Seminar Series
- ob366-ai4er
- PMRFPS's
- rp587
- School of Technology
- Simon Baker's List
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.