BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:ICASSP presentations - Various
DTSTART:20140425T110000Z
DTEND:20140425T120000Z
UID:TALK51337@talks.cam.ac.uk
CONTACT:Rogier van Dalen
DESCRIPTION:The ICASSP conference will be on 4–9 May. These people from 
 the speech group will present their ICASSP papers:\n\n*Posters*\n\n* Pierr
 e Lanchantin et al.\, _Multiple-Average-Voice-based Speech Synthesis_\n\n*
  Jingzhou Yang et al.\, _Infinite Structured Support Vector Machines for S
 peech Recognition_\n\n* Chao Zhang et al.\, _Standalone Training of Contex
 t-Dependent Deep Neural Network Acoustic Models_\n\n* Xie Chen et al.\, _I
 mpact Of Single-Microphone Dereverberation On DNN-Based Meeting Transcript
 ion Systems_\n\n* Pirros Tsiakoulis et al.\, _Dialogue Context Sensitive H
 MM-Based Speech Synthesis_\n\n* Anton Ragni et al.\, _Investigation Of Uns
 upervised Adaptation Of DNN Acoustic Models With Filter Bank Input_\n\n*A
 bstracts*\n\n*Pierre Lanchantin*\, Mark Gales\, Simon King\, Junichi Yamag
 ishi\n\n_Multiple-Average-Voice-based Speech Synthesis_\n\nThis paper desc
 ribes a novel approach for the speaker adaptation of statistical parametri
 c speech synthesis systems based on the interpolation of a set of average 
 voice models (AVM). Recent results have shown that the quality/naturalness
  of adapted voices depends on the distance from the average voice model us
 ed for speaker adaptation. This suggests the use of several AVMs trained o
 n carefully chosen speaker clusters from which a more suitable AVM can be 
 selected/interpolated during the adaptation. In the proposed approach\, a
  set of AVMs\, a multiple-AVM\, is trained on distinct clusters of speake
 rs which are iteratively re-assigned during the estimation process\, init
 ialised according to metadata. During adaptation\, each AVM from the mult
 iple-AVM
  is first adapted towards the target speaker. The adapted means from the A
 VMs are then interpolated to yield the final speaker adapted mean for synt
 hesis. It is shown\, by performing speaker adaptation on a corpus of Brit
 ish speakers with various regional accents\, that the quality/naturalness
  of s
 ynthetic speech of adapted voices is significantly higher than when consid
 ering a single factor-independent AVM selected according to the target spe
 aker characteristics.\n\n*Jingzhou Yang*\, Rogier van Dalen\, Shi-Xiong Zh
 ang and Mark Gales\n\n_Infinite Structured Support Vector Machines for Spe
 ech Recognition_\n\nDiscriminative models\, like support vector machines (
 SVMs)\, have been successfully applied to speech recognition and have im
 proved performance. A Bayesian non-parametric version of the SVM\, the i
 nfinite S
 VM\, improves on the SVM by allowing more flexible decision boundaries. Ho
 wever\, like SVMs\, infinite SVMs model each class separately\, which rest
 ricts them to classifying one word at a time. A generalisation of the SVM 
 is the structured SVM\, whose classes can be sequences of words that share
  parameters. This paper studies a combination of Bayesian non-parametrics 
 and structured models. One specific instance\, the infinite structured S
 VM\, is discussed in detail; it brings the advantages of the infinite SV
 M to continuous speech recognition.\n\n*Chao Zhang*\, Phil Woodland\n\n_S
 tandalone Training of Context-Dependent Deep Neural Network Acoustic Model
 s_\n\nRecently\, context-dependent (CD) deep neural network (DNN) hidden M
 arkov models (HMMs) have been widely used as acoustic models for speech re
 cognition. However\, the standard method to build such models requires tar
 get training labels from a system using HMMs with Gaussian mixture model o
 utput distributions (GMM-HMMs). In this paper\, we introduce a method for 
 training state-of-the-art CD-DNN-HMMs without relying on such a pre-existi
 ng system. We achieve this in two steps: build a context-independent (CI) 
 DNN iteratively with word transcriptions\, and then cluster the equivalent
  output distributions of the untied CD-DNN HMM states using the decision t
 ree based state tying approach. Experiments have been performed on the Wa
 ll Street Journal corpus\, and the resulting system gave word error rates
  (WERs) comparable to those of CD-DNNs built using GMM-HMM alignments and
  state clustering.\n\nTakuya Yoshioka\, *Xie Chen*\, and Mark Gales\n\n_I
 mpact Of Single-Microphone Dereverberation On DNN-Based Meeting Transcrip
 tion Systems_\
 n\nOver the past few decades\, a range of front-end techniques have been p
 roposed to improve the robustness of automatic speech recognition systems 
 against environmental distortion. While these techniques are effective for
  small tasks consisting of carefully designed data sets\, especially when 
 used with a classical acoustic model\, there has been limited evidence tha
 t they are useful for a state-of-the-art system with large-scale realisti
 c data. This paper focuses on reverberation as a type of distortion and i
 nve
 stigates the degree to which dereverberation processing can improve the pe
 rformance of various forms of acoustic models based on deep neural network
 s (DNNs) in a challenging meeting transcription task using a single distan
 t microphone. Experimental results show that dereverberation improves the 
 recognition performance regardless of the acoustic model structure and the
  type of the feature vectors input into the neural networks\, providing ad
 ditional relative improvements of 4.7% and 4.1% to our best-configured s
 peaker-independent and speaker-adaptive DNN-based systems\, respectively
 .\n\
 n*Pirros Tsiakoulis*\, Catherine Breslin\, Milica Gasic\, Matthew Henderso
 n\, Dongho Kim\, Martin Szummer\, Blaise Thomson\, Steve Young\n\n_Dialogu
 e Context Sensitive HMM-Based Speech Synthesis_\n\nThe focus of this work
  is speech synthesis tailored to the needs of spoken dialogue systems. Mo
 re specifically\, the framework of HMM-based speech synthesis is utilized
  to train an emphatic voice that also considers dialogue context for deci
 sion tree state clustering. To achieve this\, we designed and recorded a
  speech corpus comprising system prompts from human-computer interaction
 \, as well as additional prompts for slot-level emphasis. This corpus\,
  combined with a general purpose text-to-speech one\, was used to train
  voices using a) baseline context features\, b) additional emphasis feat
 ures\, and c) additional dialogue context features. Both emphasis and di
 alogue context features are extracted from the dialogue act semantic rep
 resentation. The voices were evaluated in pairs for dialogue appropriate
 ness using a preference listening test. The results show that the empha
 tic voice is preferred to the baseline when emphasis markup is present\,
  while the dialogue context-sensitive voice is preferred to the plain em
 phatic one when no emphasis markup is present and preferable to the base
 line in both cases. This demonstrates that including dialogue context fe
 atures for decision tree state clustering significantly improves the qua
 lity of the synthetic voice for dialogue.\n\nTakuya Yoshioka\, *Anton Ra
 gni*\, Mark J. F. Gales\n\n_I
 nvestigation Of Unsupervised Adaptation Of DNN Acoustic Models With Filter
  Bank Input_\n\nAdaptation to speaker variations is an essential component
  of speech recognition systems. One common approach to adapting deep neura
 l network (DNN) acoustic models is to perform global constrained maximum l
 ikelihood linear regression (CMLLR) at some point of the systems. Using CM
 LLR (or more generally\, generative approaches) is advantageous especially
  in unsupervised adaptation scenarios with high baseline error rates. On t
 he other hand\, as the DNNs are less sensitive to the increase in the in
 put dimensionality than GMMs\, it is becoming more popular to use rich s
 peech representations\, such as log mel-filter bank channel outputs\, in
 stead of conventional low-dimensional feature vectors\, such as MFCCs an
 d PLP coefficients. This work discusses and compares three different con
 figurations of DNN acoustic models that allow CMLLR-based speaker adapti
 ve training (SAT) to be performed in systems with filter bank inputs. Re
 sults of unsupervised adaptation experiments conducted on three differen
 t data sets are presented\, demonstrating that\, by choosing an appropri
 ate configuration\, SAT with CMLLR can improve the performance of a well
 -trained filter bank-based speaker independent DNN system by 10.6% relat
 ive in a challenging task with a baseline error rate above 40%. It is al
 so shown that the filter bank features are more advantageous than the co
 nventional features even when they are used with SAT models. Some other
  insights are also presented\, including the effects of block diagonal t
 ransforms and system combination.
LOCATION:Department of Engineering - LR6
END:VEVENT
END:VCALENDAR
