TY - CONF
AB - This paper presents an approach to voice conversion which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers that are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker induced variations are separated based on the notion that content induced variations change at a much shorter time scale, i.e., at the segment level, than speaker induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better phone recognition accuracy on the latent segment variables at frame level due to their better temporal resolution. For voice conversion, the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet, which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance.
AU - Gburrek, Tobias
AU - Glarner, Thomas
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
AU - Wagner, Petra
ID - 15237
T2 - Proc. 10th ISCA Speech Synthesis Workshop
TI - Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion
ER -

TY - CONF
AB - The invention of the Variational Autoencoder enables the application of Neural Networks to a wide range of tasks in unsupervised learning, including the field of Acoustic Unit Discovery (AUD). The recently proposed Hidden Markov Model Variational Autoencoder (HMMVAE) allows joint training of a neural network based feature extractor and a structured prior for the latent space given by a Hidden Markov Model. It has been shown that the HMMVAE significantly outperforms pure GMM-HMM based systems on the AUD task. However, the HMMVAE cannot autonomously infer the number of acoustic units and thus relies on the GMM-HMM system for initialization. This paper introduces the Bayesian Hidden Markov Model Variational Autoencoder (BHMMVAE), which solves these issues by embedding the HMMVAE in a Bayesian framework with a Dirichlet Process prior for the distribution of the acoustic units and diagonal or full-covariance Gaussians as emission distributions. Experiments on TIMIT and Xitsonga show that the BHMMVAE is able to autonomously infer a reasonable number of acoustic units, can be initialized without supervision by a GMM-HMM system, achieves computationally efficient stochastic variational inference by using natural gradient descent, and, additionally, improves the AUD performance over the HMMVAE.
AU - Glarner, Thomas
AU - Hanebrink, Patrick
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
ID - 11907
T2 - INTERSPEECH 2018, Hyderabad, India
TI - Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery
ER -

TY - CONF
AB - In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need of a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervised acoustic unit-to-grapheme converter, and a word discovery system which is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segmentation performance by a large margin.
AU - Glarner, Thomas
AU - Boenninghoff, Benedikt
AU - Walter, Oliver
AU - Haeb-Umbach, Reinhold
ID - 11770
T2 - INTERSPEECH 2017, Stockholm, Sweden
TI - Leveraging Text Data for Word Segmentation for Underresourced Languages
ER -

TY - CONF
AB - Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based approximate Bayesian inference for observation models for which exact inference is intractable. Their extension, the so-called Structured VAE (SVAE), allows inference in the presence of both discrete and continuous latent variables. Inspired by this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent models. We applied the resulting HMM-VAE to the task of acoustic unit discovery in a zero resource scenario. Starting from an initial model based on variational inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities, the accuracy of the acoustic unit discovery could be significantly improved by the HMM-VAE. In doing so, we were able to demonstrate for an unsupervised learning task what is well known in the supervised learning case: neural networks provide superior modeling power compared to GMMs.
AU - Ebbers, Janek
AU - Heymann, Jahn
AU - Drude, Lukas
AU - Glarner, Thomas
AU - Haeb-Umbach, Reinhold
AU - Raj, Bhiksha
ID - 11759
T2 - INTERSPEECH 2017, Stockholm, Sweden
TI - Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery
ER -

TY - CONF
AB - This paper is concerned with speech presence probability estimation employing an explicit model of the temporal and spectral correlations of speech. An undirected graphical model is introduced, based on a Factor Graph formulation. It is shown that this undirected model cures some of the theoretical issues of an earlier directed graphical model. Furthermore, we formulate a message passing inference scheme based on an approximate graph factorization, identify this inference scheme as a particular message passing schedule based on the turbo principle, and suggest further alternative schedules. The experiments show an improved performance over speech presence probability estimation based on an IID assumption, and a slightly better performance of the turbo schedule over the alternatives.
AU - Glarner, Thomas
AU - Mahdi Momenzadeh, Mohammad
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
ID - 11771
T2 - 12. ITG Fachtagung Sprachkommunikation (ITG 2016)
TI - Factor Graph Decoding for Speech Presence Probability Estimation
ER -