TY - CONF
AB - This paper presents an approach to voice conversion which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers that are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker induced variations are separated based on the notion that content induced variations change at a much shorter time scale, i.e., at the segment level, than speaker induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better phone recognition accuracy on the latent segment variables at frame level due to their better temporal resolution. For voice conversion, the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet, which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance.
AU - Gburrek, Tobias
AU - Glarner, Thomas
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
AU - Wagner, Petra
ID - 15237
T2 - Proc. 10th ISCA Speech Synthesis Workshop
TI - Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion
ER -

TY - CONF
AB - The invention of the Variational Autoencoder enables the application of Neural Networks to a wide range of tasks in unsupervised learning, including the field of Acoustic Unit Discovery (AUD). The recently proposed Hidden Markov Model Variational Autoencoder (HMMVAE) allows joint training of a neural network based feature extractor and a structured prior for the latent space given by a Hidden Markov Model. It has been shown that the HMMVAE significantly outperforms pure GMM-HMM based systems on the AUD task. However, the HMMVAE cannot autonomously infer the number of acoustic units and thus relies on the GMM-HMM system for initialization. This paper introduces the Bayesian Hidden Markov Model Variational Autoencoder (BHMMVAE), which solves these issues by embedding the HMMVAE in a Bayesian framework with a Dirichlet Process prior for the distribution of the acoustic units and diagonal or full-covariance Gaussians as emission distributions. Experiments on TIMIT and Xitsonga show that the BHMMVAE is able to autonomously infer a reasonable number of acoustic units, can be initialized without supervision by a GMM-HMM system, achieves computationally efficient stochastic variational inference by using natural gradient descent, and, additionally, improves the AUD performance over the HMMVAE.
AU - Glarner, Thomas
AU - Hanebrink, Patrick
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
ID - 11907
T2 - INTERSPEECH 2018, Hyderabad, India
TI - Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery
ER -

TY - CONF
AB - In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need of a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervised acoustic unit-to-grapheme converter, and a word discovery system which is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segmentation performance by a large margin.
AU - Glarner, Thomas
AU - Boenninghoff, Benedikt
AU - Walter, Oliver
AU - Haeb-Umbach, Reinhold
ID - 11770
T2 - INTERSPEECH 2017, Stockholm, Sweden
TI - Leveraging Text Data for Word Segmentation for Underresourced Languages
ER -

TY - CONF
AB - Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based approximate Bayesian inference for observation models for which exact inference is intractable. Their extension, the so-called Structured VAE (SVAE), allows inference in the presence of both discrete and continuous latent variables. Inspired by this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent models. We applied the resulting HMM-VAE to the task of acoustic unit discovery in a zero resource scenario. Starting from an initial model based on variational inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities, the accuracy of the acoustic unit discovery could be significantly improved by the HMM-VAE. In doing so, we were able to demonstrate for an unsupervised learning task what is well known in the supervised learning case: neural networks provide superior modeling power compared to GMMs.
AU - Ebbers, Janek
AU - Heymann, Jahn
AU - Drude, Lukas
AU - Glarner, Thomas
AU - Haeb-Umbach, Reinhold
AU - Raj, Bhiksha
ID - 11759
T2 - INTERSPEECH 2017, Stockholm, Sweden
TI - Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery
ER -

TY - CONF
AB - This paper is concerned with speech presence probability estimation employing an explicit model of the temporal and spectral correlations of speech. An undirected graphical model is introduced, based on a Factor Graph formulation. It is shown that this undirected model cures some of the theoretical issues of an earlier directed graphical model. Furthermore, we formulate a message passing inference scheme based on an approximate graph factorization, identify this inference scheme as a particular message passing schedule based on the turbo principle, and suggest further alternative schedules. The experiments show an improved performance over speech presence probability estimation based on an IID assumption, and a slightly better performance of the turbo schedule over the alternatives.
AU - Glarner, Thomas
AU - Mahdi Momenzadeh, Mohammad
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
ID - 11771
T2 - 12. ITG Fachtagung Sprachkommunikation (ITG 2016)
TI - Factor Graph Decoding for Speech Presence Probability Estimation
ER -