BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Unveiling the Secret Sauce: A Causal Look at Data Memorisation and
  Tokenisation in Language Models - Pietro Lesci (University of Cambridge)
DTSTART:20250530T110000Z
DTEND:20250530T120000Z
UID:TALK225856@talks.cam.ac.uk
CONTACT:Suchir Salhan
DESCRIPTION:While model design gets much of the spotlight\, subtle data ch
 oices\, such as which documents are seen and how they’re represented\, c
 an profoundly shape the behaviour of language models. Nowadays\, training 
 data is the secret sauce behind a language model’s success\, yet it rema
 ins relatively understudied. In this talk\, I will discuss how training da
 ta influences a model’s behaviour via two key phenomena: **memorisation*
 * and **tokenisation bias**.\nFirst\, I’ll present our work on **memoris
 ation**\, asking: *To what extent does a model remember specific documents
  it was trained on?* Directly answering this question is computationally e
 xpensive. Instead\, we frame memorisation as a causal question and introdu
 ce an efficient method to estimate it without re-training. This reveals ho
 w memorisation depends on factors such as data order and model size.\nNext
 \, I’ll discuss how **subword tokenisation**\, often seen as a preproces
 sing detail\, systematically biases model predictions. We ask: *How would 
 a model’s output change if a piece of text were tokenised as one subword
  instead of two?* Using tools from econometrics\, we answer this counter
 factual question without re-training the model with a different vocabula
 ry. We show that when a piece of text is tokenised into fewer subwords\, it
  consistently receives a higher probability.\nTogether\, these results sho
 w that training data profoundly shapes a model’s behaviour. Causal metho
 ds let us efficiently estimate and understand these phenomena\, offering i
 nsight into how to better train language models.\n\nBio: Pietro Lesci is a
  final-year PhD student in Computer Science at the University of Cambridge
 \, working with Prof Andreas Vlachos. His research explores how training 
 data shapes a model’s behaviour\, focusing on memorisation\, tokenisation\
 , and generalisation. To study this question\, he draws on causal methods 
 from econometrics. His work has been presented at major machine learning c
 onferences such as ICLR\, ACL\, NAACL\, and EMNLP. He has received the Bes
 t Paper Award at ACL 2024\, the Paper of the Year Award from Cambridge’s
  Department of Computer Science and Technology\, and funding from Translat
 ed’s Imminent Research Grant. Pietro’s experience spans academia and i
 ndustry\, including 3+ years working in research labs\, consulting firms\,
  and international institutions. He holds an MSc in Economic and Social Sc
 iences from Bocconi University.
LOCATION:Room FW26 (hybrid format). Here is the Zoom link for those who w
 ish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0
 eG1wZldVWG1GVVhrTzFIZz09
END:VEVENT
END:VCALENDAR
