BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Selection of Talks from Interspeech 2020 - Yiting 'Edie' Lu\, Vyas
  Raina\, Qingyun Dou\, Cambridge University Speech Research Group
DTSTART:20210126T120000Z
DTEND:20210126T130000Z
UID:TALK156676@talks.cam.ac.uk
CONTACT:Dr Kate Knill
DESCRIPTION:The first seminar of Lent Term will be three presentations of
 papers from Interspeech 2020:\n\n* *Spoken Language ‘Grammatical Error
 Correction’\, Yiting 'Edie' Lu*\n* *Universal Adversarial Attacks on
 Spoken Language Assessment Systems\, Vyas Raina*\n* *Attention Forcing
 for Speech Synthesis\, Qingyun Dou*\n\n*Spoken Language ‘Grammatical
 Error Correction’\, Yiting 'Edie' Lu\, Mark J.F. Gales\, Yu Wang*\n\nSp
 oken language ‘g
 rammatical error correction’ (GEC) is an important mechanism to help lea
 rners of a foreign language\, here English\, improve their spoken grammar.
  GEC is challenging for non-native spoken language due to interruptions fr
 om disfluent speech events such as repetitions and false starts\, and
 issues in strictly defining what is acceptable in spoken language.
 Furthermore\, there is little labelled data to train models. One way to
 mitigate the impa
 ct of speech events is to use a disfluency detection (DD) model. Removing 
 the detected disfluencies converts the speech transcript to be closer to w
 ritten language\, which has significantly more labelled training data. Thi
 s paper considers two types of approaches to leveraging DD models to boost
 spoken GEC performance. One is sequential: a separately trained DD model
  acts as a pre-processing module providing a more structured input to the 
 GEC model. The second approach is to train DD and GEC models in an end-to-
 end fashion\, simultaneously optimising both modules. Embeddings enable en
 d-to-end models to have a richer information flow. Experimental results sh
 ow that DD effectively regulates GEC input\; end-to-end training works wel
 l when fine-tuned on limited labelled in-domain data\; and improving DD by
  incorporating acoustic information helps improve spoken GEC.\n\n*Universa
 l Adversarial Attacks on Spoken Language Assessment Systems\, Vyas Raina\,
  Mark J.F. Gales\, Kate M. Knill*\n\nThere is an increasing demand for aut
 omated spoken language assessment (SLA) systems\, partly driven by the per
 formance improvements that have come from deep learning based approaches. 
 One aspect of deep learning systems is that they do not require expert-der
 ived features\, operating directly on the original signal such as a speech
  recognition (ASR) transcript. This\, however\, increases their potential 
 susceptibility to adversarial attacks as a form of candidate malpractice. 
 In this paper the sensitivity of SLA systems to a universal black-box atta
 ck on the ASR text output is explored. The aim is to obtain a single\, uni
 versal phrase to maximally increase any candidate’s score. Four approach
 es to detect such adversarial attacks are also described. All the systems\
 , and associated detection approaches\, are evaluated on a free (spontaneo
 us) speaking section from a Business English test. It is shown that on dee
 p learning based SLA systems the average candidate score can be increased 
 by almost one grade level using a single six word phrase appended to the e
 nd of the response hypothesis. Although these large gains can be obtained\
 , they can be easily detected based on shifts from the scores of a
 “traditional” Gaussian Process based grader.\n\n*Attention Forcing for
 Speech Synthesis\, Qingyun Dou\, Joshua Efiong\, Mark J.F. Gales*\n\nAut
 o-regressive sequence-to-sequence models with attention mechanisms hav
 e achieved state-of-the-art performance in various tasks including speech 
 synthesis. Training these models can be difficult. The standard approach g
 uides a model with the reference output history during training. However\,
 during synthesis the generated output history must be used. This mismatch c
 an impact performance. Several approaches have been proposed to handle thi
 s\, normally by selectively using the generated output history. To make tr
 aining stable\, these approaches often require a heuristic schedule or an 
 auxiliary classifier. This paper introduces attention forcing\, which guid
 es the model with the generated output history and reference attention. Th
 is approach reduces the training-evaluation mismatch without the need for 
 a schedule or a classifier. Additionally\, for standard training approache
 s\, the frame rate is often reduced to prevent models from copying the out
 put history. As attention forcing does not feed the reference output histo
 ry to the model\, it allows using a higher frame rate\, which improves the
  speech quality. Finally\, attention forcing allows the model to generate 
 output sequences aligned with the references\, which is important for some
  down-stream tasks such as training neural vocoders. Experiments show that
 attention forcing allows doubling the frame rate\, and yields a
 significant gain in speech quality.\n
LOCATION:Zoom: https://zoom.us/j/95352633552?pwd=RzJVK2UzOGZyNU5mVHd1Y1VPT
 2tDUT09
END:VEVENT
END:VCALENDAR
