BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:“End-to-end multi-speaker neural TTS with LLM-based prosody pred
 iction” - Penny Karanasou\, Amazon R&D
DTSTART:20240129T120000Z
DTEND:20240129T130000Z
UID:TALK210628@talks.cam.ac.uk
CONTACT:Simon Webster McKnight
DESCRIPTION:In recent years\, Neural Text-to-Speech (NTTS) has revolutioni
 sed the TTS field and resulted in more natural\, more expressive speech. A
 t Amazon\, with products like Alexa and the AWS Polly service\, we bring g
 enerated speech in tens of voices and languages to millions of people. In 
 Amazon TTS Research\, we tackle a variety of research problems\, from gene
 rative TTS\, prosody transfer\, and neural front-ends to machine dubbing a
 nd on-device TTS. In this presentation\, I will focus on part of my team's
  research as published at Interspeech and SSW 2023. First\, I will give a 
 summary of the Amazon TTS papers presented at these two conferences in 202
 3. I will then present our work on eCat\, a novel end-to-end multi-speaker
  model capable of: a) generating long-context speech with expressive and c
 ontextually appropriate prosody\, and b) performing fine-grained prosody t
 ransfer between any pair of seen speakers. eCat improves TTS performance o
 ver our previous internal baselines\, and when compared to VITS\, a state-
 of-the-art TTS model\, it is statistically significantly preferred. I will
  continue with a comparative study of fifteen pretrained language models f
 or two TTS tasks: prosody prediction and pause prediction. Our findings re
 vealed a logarithmic relationship between model size and quality\, as well
  as significant performance differences between neutral and expressive pro
 sody.
LOCATION:Hybrid: JDB Teaching Room\, Engineering Department or Zoom: https
 ://cam-ac-uk.zoom.us/j/87012963681?pwd=bXRwNis2SW93aHhxUndScnp2MUVTQT09
END:VEVENT
END:VCALENDAR
