Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion

T. Gburrek, T. Glarner, J. Ebbers, R. Haeb-Umbach, P. Wagner, in: Proc. 10th ISCA Speech Synthesis Workshop, 2019, pp. 81–86.

Conference Paper | English
Abstract
This paper presents an approach to voice conversion which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers which are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker-induced variations are separated based on the notion that content-induced variations change at a much shorter time scale, i.e., at the segment level, than speaker-induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better phone recognition accuracy on the latent segment variables at frame level due to their better temporal resolution. For voice conversion the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance.
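The conversion step described in the abstract — keeping the segment-level (content) latents and replacing the mean of the utterance-level (speaker) latents with the target speaker's estimated mean — can be illustrated with a minimal sketch. This is a hypothetical, simplified rendering of that idea, not the authors' implementation: the function name `replace_utterance_latent` and the NumPy arrays standing in for FHVAE latents are assumptions for illustration.

```python
import numpy as np

def replace_utterance_latent(z2_source: np.ndarray,
                             mu2_source: np.ndarray,
                             mu2_target: np.ndarray) -> np.ndarray:
    """Shift an utterance-level latent toward the target speaker.

    Subtracting the source speaker's estimated mean and adding the
    target speaker's mean relocates the speaker-dependent latent
    while preserving its within-utterance variation. The segment-level
    (content) latents are left untouched.
    """
    return z2_source - mu2_source + mu2_target

# Toy example with 2-dimensional utterance latents:
z2_src = np.array([1.0, 2.0])    # latent drawn for the source utterance
mu_src = np.array([1.0, 1.0])    # estimated mean of the source speaker
mu_tgt = np.array([3.0, 3.0])    # estimated mean of the target speaker

z2_converted = replace_utterance_latent(z2_src, mu_src, mu_tgt)
```

In the paper's pipeline, the decoder would then reconstruct log-mel spectra from the unchanged segment latents together with `z2_converted`, and those spectra would condition a WaveNet for waveform synthesis.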
Publishing Year
2019
Proceedings Title
Proc. 10th ISCA Speech Synthesis Workshop
Page
81–86
Conference
10th ISCA Speech Synthesis Workshop
Conference Location
Vienna
Cite this

Gburrek T, Glarner T, Ebbers J, Haeb-Umbach R, Wagner P. Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. In: Proc. 10th ISCA Speech Synthesis Workshop. ; 2019:81-86. doi:10.21437/SSW.2019-15
Gburrek, T., Glarner, T., Ebbers, J., Haeb-Umbach, R., & Wagner, P. (2019). Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. Proc. 10th ISCA Speech Synthesis Workshop, 81–86. https://doi.org/10.21437/SSW.2019-15
@inproceedings{Gburrek_Glarner_Ebbers_Haeb-Umbach_Wagner_2019, title={Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion}, DOI={10.21437/SSW.2019-15}, booktitle={Proc. 10th ISCA Speech Synthesis Workshop}, author={Gburrek, Tobias and Glarner, Thomas and Ebbers, Janek and Haeb-Umbach, Reinhold and Wagner, Petra}, year={2019}, pages={81–86} }
Gburrek, Tobias, Thomas Glarner, Janek Ebbers, Reinhold Haeb-Umbach, and Petra Wagner. “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion.” In Proc. 10th ISCA Speech Synthesis Workshop, 81–86, 2019. https://doi.org/10.21437/SSW.2019-15.
T. Gburrek, T. Glarner, J. Ebbers, R. Haeb-Umbach, and P. Wagner, “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion,” in Proc. 10th ISCA Speech Synthesis Workshop, Vienna, 2019, pp. 81–86, doi: 10.21437/SSW.2019-15.
Gburrek, Tobias, et al. “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion.” Proc. 10th ISCA Speech Synthesis Workshop, 2019, pp. 81–86, doi:10.21437/SSW.2019-15.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]

Link(s) to Main File(s)
Access Level
Restricted Closed Access
External material:
Supplementary Material
Description
Listening examples
