Rethinking the role of tokenization in the NLP pipeline
- Speaker: Kris Cao (DeepMind)
- Date & Time: Friday 02 December 2022, 12:00 - 13:00
- Venue: Computer Lab, FW26
Abstract:
Tokenization is an integral part of the modern NLP pipeline, yet it is often treated as a black box without regard for the design choices that must be made when choosing a tokenizer. I will give an overview of how the two currently dominant tokenization algorithms work, and discuss their limitations from both a computational and a typological perspective. I will then talk about my recent EMNLP paper, which suggests using multiple tokenizations from the tokenizer to overcome the limitations of taking a single tokenization. Finally, I will discuss some ongoing work which uses character-based tokenization for masked language modelling, and examines which modelling architectures work well in this setting.
Bio:
Kris is a senior research scientist in the Language team at DeepMind. His research interests are at the intersection of linguistics, NLP and machine learning, and he is primarily focused on problems of unsupervised structure induction from language. He received his PhD from the University of Cambridge, where he worked on deep generative models for text generation.
This talk is part of the NLIP Seminar Series.