BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Rethinking the role of tokenization in the NLP pipeline - Kris Cao
  (DeepMind)
DTSTART:20221202T120000Z
DTEND:20221202T130000Z
UID:TALK192341@talks.cam.ac.uk
CONTACT:Michael Schlichtkrull
DESCRIPTION:Abstract:\n\nTokenization is an integral part of the modern NL
 P pipeline\, yet it is often treated as a black box without regard for the
  design choices that must be made when choosing a tokenizer. I will give a
 n overview of how the two currently dominant tokenization algorithms work\
 , and discuss their limitations from both a computational and a typologica
 l perspective. I will then talk about my recent EMNLP paper\, which sugges
 ts using multiple tokenizations from the tokenizer to overcome the limitat
 ions of taking a single tokenization. Finally\, I will discuss some ongoin
 g work which uses character-based tokenization for masked language modelli
 ng\, and examines which modelling architectures work well in this setting.
 \n\nBio:\n\nKris is a senior research scientist in the Language team at De
 epMind. His research interests are at the intersection of linguistics\, NL
 P and machine learning\, and he is primarily focused on problems of unsupe
 rvised structure induction from language. He received his PhD from the Uni
 versity of Cambridge\, where he worked on deep generative models for text 
 generation.
LOCATION:Computer Lab\, FW26
END:VEVENT
END:VCALENDAR
