---
_id: '15237'
abstract:
- lang: eng
  text: This paper presents an approach to voice conversion which requires neither
    parallel data nor speaker or phone labels for training. It can convert between
    speakers that are not in the training set by employing the previously proposed
    concept of a factorized hierarchical variational autoencoder. Here, linguistic
    and speaker-induced variations are separated based on the notion that content-induced
    variations change at a much shorter time scale, i.e., at the segment level, than
    speaker-induced variations, which vary at the longer utterance level. In this
    contribution we propose to employ convolutional instead of recurrent network
    layers in the encoder and decoder blocks, which is shown to achieve better phone
    recognition accuracy on the latent segment variables at frame level due to their
    better temporal resolution. For voice conversion, the mean of the utterance variables
    is replaced with the respective estimated mean of the target speaker. The resulting
    log-mel spectra of the decoder output are used as local conditions of a WaveNet,
    which is utilized for the synthesis of the speech waveforms. Experiments show
    both good disentanglement properties of the latent space variables and good voice
    conversion performance.
author:
- first_name: Tobias
  full_name: Gburrek, Tobias
  id: '44006'
  last_name: Gburrek
- first_name: Thomas
  full_name: Glarner, Thomas
  id: '14169'
  last_name: Glarner
- first_name: Janek
  full_name: Ebbers, Janek
  id: '34851'
  last_name: Ebbers
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
- first_name: Petra
  full_name: Wagner, Petra
  last_name: Wagner
citation:
  ama: 'Gburrek T, Glarner T, Ebbers J, Haeb-Umbach R, Wagner P. Unsupervised Learning
    of a Disentangled Speech Representation for Voice Conversion. In: <i>Proc. 10th
    ISCA Speech Synthesis Workshop</i>. ; 2019:81-86. doi:<a href="https://doi.org/10.21437/SSW.2019-15">10.21437/SSW.2019-15</a>'
  apa: Gburrek, T., Glarner, T., Ebbers, J., Haeb-Umbach, R., &#38; Wagner, P. (2019).
    Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion.
    <i>Proc. 10th ISCA Speech Synthesis Workshop</i>, 81–86. <a href="https://doi.org/10.21437/SSW.2019-15">https://doi.org/10.21437/SSW.2019-15</a>
  bibtex: '@inproceedings{Gburrek_Glarner_Ebbers_Haeb-Umbach_Wagner_2019, title={Unsupervised
    Learning of a Disentangled Speech Representation for Voice Conversion}, DOI={<a
    href="https://doi.org/10.21437/SSW.2019-15">10.21437/SSW.2019-15</a>}, booktitle={Proc.
    10th ISCA Speech Synthesis Workshop}, author={Gburrek, Tobias and Glarner, Thomas
    and Ebbers, Janek and Haeb-Umbach, Reinhold and Wagner, Petra}, year={2019}, pages={81–86}
    }'
  chicago: Gburrek, Tobias, Thomas Glarner, Janek Ebbers, Reinhold Haeb-Umbach, and
    Petra Wagner. “Unsupervised Learning of a Disentangled Speech Representation for
    Voice Conversion.” In <i>Proc. 10th ISCA Speech Synthesis Workshop</i>, 81–86,
    2019. <a href="https://doi.org/10.21437/SSW.2019-15">https://doi.org/10.21437/SSW.2019-15</a>.
  ieee: 'T. Gburrek, T. Glarner, J. Ebbers, R. Haeb-Umbach, and P. Wagner, “Unsupervised
    Learning of a Disentangled Speech Representation for Voice Conversion,” in <i>Proc.
    10th ISCA Speech Synthesis Workshop</i>, Vienna, 2019, pp. 81–86, doi: <a href="https://doi.org/10.21437/SSW.2019-15">10.21437/SSW.2019-15</a>.'
  mla: Gburrek, Tobias, et al. “Unsupervised Learning of a Disentangled Speech Representation
    for Voice Conversion.” <i>Proc. 10th ISCA Speech Synthesis Workshop</i>, 2019,
    pp. 81–86, doi:<a href="https://doi.org/10.21437/SSW.2019-15">10.21437/SSW.2019-15</a>.
  short: 'T. Gburrek, T. Glarner, J. Ebbers, R. Haeb-Umbach, P. Wagner, in: Proc.
    10th ISCA Speech Synthesis Workshop, 2019, pp. 81–86.'
conference:
  location: Vienna
  name: 10th ISCA Speech Synthesis Workshop
date_created: 2019-12-04T08:12:29Z
date_updated: 2023-11-17T06:20:39Z
department:
- _id: '54'
doi: 10.21437/SSW.2019-15
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://www.isca-speech.org/archive/pdfs/ssw_2019/gburrek19_ssw.pdf
oa: '1'
page: 81-86
publication: Proc. 10th ISCA Speech Synthesis Workshop
quality_controlled: '1'
related_material:
  link:
  - description: Listening examples
    relation: supplementary_material
    url: http://go.upb.de/vcex
status: public
title: Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion
type: conference
user_id: '44006'
year: '2019'
...
---
_id: '11907'
abstract:
- lang: eng
  text: The invention of the Variational Autoencoder enables the application of Neural
    Networks to a wide range of tasks in unsupervised learning, including the field
    of Acoustic Unit Discovery (AUD). The recently proposed Hidden Markov Model Variational
    Autoencoder (HMMVAE) allows a joint training of a neural network based feature
    extractor and a structured prior for the latent space given by a Hidden Markov
    Model. It has been shown that the HMMVAE significantly outperforms pure GMM-HMM
    based systems on the AUD task. However, the HMMVAE cannot autonomously infer the
    number of acoustic units and thus relies on the GMM-HMM system for initialization.
    This paper introduces the Bayesian Hidden Markov Model Variational Autoencoder
    (BHMMVAE) which solves these issues by embedding the HMMVAE in a Bayesian framework
    with a Dirichlet Process Prior for the distribution of the acoustic units, and
    diagonal or full-covariance Gaussians as emission distributions. Experiments on
    TIMIT and Xitsonga show that the BHMMVAE is able to autonomously infer a reasonable
    number of acoustic units, can be initialized without supervision by a GMM-HMM
    system, achieves computationally efficient stochastic variational inference by
    using natural gradient descent, and, additionally, improves the AUD performance
    over the HMMVAE.
author:
- first_name: Thomas
  full_name: Glarner, Thomas
  id: '14169'
  last_name: Glarner
- first_name: Patrick
  full_name: Hanebrink, Patrick
  last_name: Hanebrink
- first_name: Janek
  full_name: Ebbers, Janek
  id: '34851'
  last_name: Ebbers
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Glarner T, Hanebrink P, Ebbers J, Haeb-Umbach R. Full Bayesian Hidden Markov
    Model Variational Autoencoder for Acoustic Unit Discovery. In: <i>INTERSPEECH
    2018, Hyderabad, India</i>. ; 2018.'
  apa: Glarner, T., Hanebrink, P., Ebbers, J., &#38; Haeb-Umbach, R. (2018). Full
    Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery.
    <i>INTERSPEECH 2018, Hyderabad, India</i>.
  bibtex: '@inproceedings{Glarner_Hanebrink_Ebbers_Haeb-Umbach_2018, title={Full Bayesian
    Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery}, booktitle={INTERSPEECH
    2018, Hyderabad, India}, author={Glarner, Thomas and Hanebrink, Patrick and Ebbers,
    Janek and Haeb-Umbach, Reinhold}, year={2018} }'
  chicago: Glarner, Thomas, Patrick Hanebrink, Janek Ebbers, and Reinhold Haeb-Umbach.
    “Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery.”
    In <i>INTERSPEECH 2018, Hyderabad, India</i>, 2018.
  ieee: T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesian Hidden
    Markov Model Variational Autoencoder for Acoustic Unit Discovery,” 2018.
  mla: Glarner, Thomas, et al. “Full Bayesian Hidden Markov Model Variational Autoencoder
    for Acoustic Unit Discovery.” <i>INTERSPEECH 2018, Hyderabad, India</i>, 2018.
  short: 'T. Glarner, P. Hanebrink, J. Ebbers, R. Haeb-Umbach, in: INTERSPEECH 2018,
    Hyderabad, India, 2018.'
date_created: 2019-07-12T05:30:34Z
date_updated: 2023-11-22T08:29:22Z
department:
- _id: '54'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://groups.uni-paderborn.de/nt/pubs/2018/INTERSPEECH_2018_Glarner_Paper.pdf
oa: '1'
publication: INTERSPEECH 2018, Hyderabad, India
quality_controlled: '1'
related_material:
  link:
  - description: Slides
    relation: supplementary_material
    url: https://groups.uni-paderborn.de/nt/pubs/2018/INTERSPEECH_2018_Glarner_Slides.pdf
status: public
title: Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit
  Discovery
type: conference
user_id: '34851'
year: '2018'
...
---
_id: '11770'
abstract:
- lang: eng
  text: 'In this contribution we show how to exploit text data to support word discovery
    from audio input in an underresourced target language. Given audio, of which a
    certain amount is transcribed at the word level, and additional unrelated text
    data, the approach is able to learn a probabilistic mapping from acoustic units
    to characters and utilize it to segment the audio data into words without the
    need of a pronunciation dictionary. This is achieved by three components: an unsupervised
    acoustic unit discovery system, an acoustic unit-to-grapheme converter trained
    with supervision, and a word discovery system, which is initialized with a language
    model trained on the text data. Experiments for multiple setups show that the
    initialization of the language model with text data improves the word segmentation
    performance by a large margin.'
author:
- first_name: Thomas
  full_name: Glarner, Thomas
  id: '14169'
  last_name: Glarner
- first_name: Benedikt
  full_name: Boenninghoff, Benedikt
  last_name: Boenninghoff
- first_name: Oliver
  full_name: Walter, Oliver
  last_name: Walter
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Glarner T, Boenninghoff B, Walter O, Haeb-Umbach R. Leveraging Text Data for
    Word Segmentation for Underresourced Languages. In: <i>INTERSPEECH 2017, Stockholm,
    Schweden</i>. ; 2017.'
  apa: Glarner, T., Boenninghoff, B., Walter, O., &#38; Haeb-Umbach, R. (2017). Leveraging
    Text Data for Word Segmentation for Underresourced Languages. In <i>INTERSPEECH
    2017, Stockholm, Schweden</i>.
  bibtex: '@inproceedings{Glarner_Boenninghoff_Walter_Haeb-Umbach_2017, title={Leveraging
    Text Data for Word Segmentation for Underresourced Languages}, booktitle={INTERSPEECH
    2017, Stockholm, Schweden}, author={Glarner, Thomas and Boenninghoff, Benedikt
    and Walter, Oliver and Haeb-Umbach, Reinhold}, year={2017} }'
  chicago: Glarner, Thomas, Benedikt Boenninghoff, Oliver Walter, and Reinhold Haeb-Umbach.
    “Leveraging Text Data for Word Segmentation for Underresourced Languages.” In
    <i>INTERSPEECH 2017, Stockholm, Schweden</i>, 2017.
  ieee: T. Glarner, B. Boenninghoff, O. Walter, and R. Haeb-Umbach, “Leveraging Text
    Data for Word Segmentation for Underresourced Languages,” in <i>INTERSPEECH 2017,
    Stockholm, Schweden</i>, 2017.
  mla: Glarner, Thomas, et al. “Leveraging Text Data for Word Segmentation for Underresourced
    Languages.” <i>INTERSPEECH 2017, Stockholm, Schweden</i>, 2017.
  short: 'T. Glarner, B. Boenninghoff, O. Walter, R. Haeb-Umbach, in: INTERSPEECH
    2017, Stockholm, Schweden, 2017.'
date_created: 2019-07-12T05:27:55Z
date_updated: 2022-01-06T06:51:08Z
department:
- _id: '54'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://groups.uni-paderborn.de/nt/pubs/2017/INTERSPEECH_2017_Glarner_paper.pdf
oa: '1'
publication: INTERSPEECH 2017, Stockholm, Schweden
related_material:
  link:
  - description: Poster
    relation: supplementary_material
    url: https://groups.uni-paderborn.de/nt/pubs/2017/INTERSPEECH_2017_Glarner_poster.pdf
status: public
title: Leveraging Text Data for Word Segmentation for Underresourced Languages
type: conference
user_id: '44006'
year: '2017'
...
---
_id: '11759'
abstract:
- lang: eng
  text: 'Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based
    approximate Bayesian inference for observation models for which exact inference
    is intractable. Their extension, the so-called Structured VAE (SVAE), allows inference
    in the presence of both discrete and continuous latent variables. Inspired by
    this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent
    models. We applied the resulting HMM-VAE to the task of acoustic unit discovery
    in a zero resource scenario. Starting from an initial model based on variational
    inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities,
    the accuracy of the acoustic unit discovery could be significantly improved by
    the HMM-VAE. In doing so we were able to demonstrate for an unsupervised learning
    task what is well-known in the supervised learning case: Neural networks provide
    superior modeling power compared to GMMs.'
author:
- first_name: Janek
  full_name: Ebbers, Janek
  id: '34851'
  last_name: Ebbers
- first_name: Jahn
  full_name: Heymann, Jahn
  id: '9168'
  last_name: Heymann
- first_name: Lukas
  full_name: Drude, Lukas
  id: '11213'
  last_name: Drude
- first_name: Thomas
  full_name: Glarner, Thomas
  id: '14169'
  last_name: Glarner
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
- first_name: Bhiksha
  full_name: Raj, Bhiksha
  last_name: Raj
citation:
  ama: 'Ebbers J, Heymann J, Drude L, Glarner T, Haeb-Umbach R, Raj B. Hidden Markov
    Model Variational Autoencoder for Acoustic Unit Discovery. In: <i>INTERSPEECH
    2017, Stockholm, Schweden</i>. ; 2017.'
  apa: Ebbers, J., Heymann, J., Drude, L., Glarner, T., Haeb-Umbach, R., &#38; Raj,
    B. (2017). Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery.
    <i>INTERSPEECH 2017, Stockholm, Schweden</i>.
  bibtex: '@inproceedings{Ebbers_Heymann_Drude_Glarner_Haeb-Umbach_Raj_2017, title={Hidden
    Markov Model Variational Autoencoder for Acoustic Unit Discovery}, booktitle={INTERSPEECH
    2017, Stockholm, Schweden}, author={Ebbers, Janek and Heymann, Jahn and Drude,
    Lukas and Glarner, Thomas and Haeb-Umbach, Reinhold and Raj, Bhiksha}, year={2017}
    }'
  chicago: Ebbers, Janek, Jahn Heymann, Lukas Drude, Thomas Glarner, Reinhold Haeb-Umbach,
    and Bhiksha Raj. “Hidden Markov Model Variational Autoencoder for Acoustic Unit
    Discovery.” In <i>INTERSPEECH 2017, Stockholm, Schweden</i>, 2017.
  ieee: J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and B. Raj, “Hidden
    Markov Model Variational Autoencoder for Acoustic Unit Discovery,” 2017.
  mla: Ebbers, Janek, et al. “Hidden Markov Model Variational Autoencoder for Acoustic
    Unit Discovery.” <i>INTERSPEECH 2017, Stockholm, Schweden</i>, 2017.
  short: 'J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, B. Raj, in:
    INTERSPEECH 2017, Stockholm, Schweden, 2017.'
date_created: 2019-07-12T05:27:42Z
date_updated: 2023-11-22T08:29:06Z
department:
- _id: '54'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://groups.uni-paderborn.de/nt/pubs/2017/INTERSPEECH_2017_Ebbers_paper.pdf
oa: '1'
publication: INTERSPEECH 2017, Stockholm, Schweden
quality_controlled: '1'
related_material:
  link:
  - description: Poster
    relation: supplementary_material
    url: https://groups.uni-paderborn.de/nt/pubs/2017/INTERSPEECH_2017_Ebbers_poster.pdf
  - description: Slides
    relation: supplementary_material
    url: https://groups.uni-paderborn.de/nt/pubs/2017/INTERSPEECH_2017_Ebbers_slides.pdf
status: public
title: Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery
type: conference
user_id: '34851'
year: '2017'
...
---
_id: '11771'
abstract:
- lang: eng
  text: This paper is concerned with speech presence probability estimation employing
    an explicit model of the temporal and spectral correlations of speech. An undirected
    graphical model is introduced, based on a Factor Graph formulation. It is shown
    that this undirected model cures some of the theoretical issues of an earlier
    directed graphical model. Furthermore, we formulate a message passing inference
    scheme based on an approximate graph factorization, identify this inference scheme
    as a particular message passing schedule based on the turbo principle and suggest
    further alternative schedules. The experiments show an improved performance over
    speech presence probability estimation based on an IID assumption, and a slightly
    better performance of the turbo schedule over the alternatives.
author:
- first_name: Thomas
  full_name: Glarner, Thomas
  id: '14169'
  last_name: Glarner
- first_name: Mohammad
  full_name: Mahdi Momenzadeh, Mohammad
  last_name: Mahdi Momenzadeh
- first_name: Lukas
  full_name: Drude, Lukas
  id: '11213'
  last_name: Drude
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Glarner T, Mahdi Momenzadeh M, Drude L, Haeb-Umbach R. Factor Graph Decoding
    for Speech Presence Probability Estimation. In: <i>12. ITG Fachtagung Sprachkommunikation
    (ITG 2016)</i>. ; 2016.'
  apa: Glarner, T., Mahdi Momenzadeh, M., Drude, L., &#38; Haeb-Umbach, R. (2016).
    Factor Graph Decoding for Speech Presence Probability Estimation. In <i>12. ITG
    Fachtagung Sprachkommunikation (ITG 2016)</i>.
  bibtex: '@inproceedings{Glarner_Mahdi Momenzadeh_Drude_Haeb-Umbach_2016, title={Factor
    Graph Decoding for Speech Presence Probability Estimation}, booktitle={12. ITG
    Fachtagung Sprachkommunikation (ITG 2016)}, author={Glarner, Thomas and Mahdi
    Momenzadeh, Mohammad and Drude, Lukas and Haeb-Umbach, Reinhold}, year={2016}
    }'
  chicago: Glarner, Thomas, Mohammad Mahdi Momenzadeh, Lukas Drude, and Reinhold Haeb-Umbach.
    “Factor Graph Decoding for Speech Presence Probability Estimation.” In <i>12.
    ITG Fachtagung Sprachkommunikation (ITG 2016)</i>, 2016.
  ieee: T. Glarner, M. Mahdi Momenzadeh, L. Drude, and R. Haeb-Umbach, “Factor Graph
    Decoding for Speech Presence Probability Estimation,” in <i>12. ITG Fachtagung
    Sprachkommunikation (ITG 2016)</i>, 2016.
  mla: Glarner, Thomas, et al. “Factor Graph Decoding for Speech Presence Probability
    Estimation.” <i>12. ITG Fachtagung Sprachkommunikation (ITG 2016)</i>, 2016.
  short: 'T. Glarner, M. Mahdi Momenzadeh, L. Drude, R. Haeb-Umbach, in: 12. ITG Fachtagung
    Sprachkommunikation (ITG 2016), 2016.'
date_created: 2019-07-12T05:27:56Z
date_updated: 2022-01-06T06:51:08Z
department:
- _id: '54'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://groups.uni-paderborn.de/nt/pubs/2016/itgspeech2016_08_Glarner.pdf
oa: '1'
publication: 12. ITG Fachtagung Sprachkommunikation (ITG 2016)
related_material:
  link:
  - description: Slides
    relation: supplementary_material
    url: https://groups.uni-paderborn.de/nt/pubs/2016/itgspeech2016_08_Glarner_slides.pdf
status: public
title: Factor Graph Decoding for Speech Presence Probability Estimation
type: conference
user_id: '44006'
year: '2016'
...
