BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Balanced and Efficient Tokenization Across Languages - Prof. Hila 
 Gonen (University of British Columbia)
DTSTART:20250515T150000Z
DTEND:20250515T160000Z
UID:TALK231034@talks.cam.ac.uk
CONTACT:Shun Shao
DESCRIPTION:Abstract: In this talk I will present our work showing dispar
 ities in the way different languages are processed in today's language m
 odels and point to the challenges of current tokenization schemes. I wil
 l then propose two different ways to overcome those challenges: (1) by i
 mplicitly tokenizing the text during training\; and (2) by removing toke
 nization and working with a new byte-level mapping. Together\, those met
 hods pave the way to a more controlled and balanced preprocessing of mul
 tiple languages\, resulting in more efficient language modeling.\n\nBio:
  Hila is an incoming Assistant Professor at UBC\, currently a postdoctor
 al researcher at the University of Washington. In her research\, Hila wo
 rks towards two main goals: (1) developing algorithms and methods for co
 ntrolling the model's behavior\; and (2) making cutting-edge language te
 chnology available and fair across speakers of different languages and u
 sers of different socio-demographic groups.\n\nBefore joining UW\, Hila 
 was a postdoctoral researcher at Amazon and Meta AI. Prior to that\, she
  did her Ph.D. in Computer Science at the NLP lab at Bar-Ilan University
 . She obtained her M.Sc. in Computer Science from the Hebrew University.
  Hila is the recipient of several prestigious postdoc awards and an EECS
  Rising Stars award. Her work received best paper awards at CoNLL 2019 a
 nd at the RepL4NLP workshop in 2022.
LOCATION:https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBd
 XVpOXFvdz09
END:VEVENT
END:VCALENDAR
