Leveraging Text Data for Word Segmentation for Underresourced Languages

T. Glarner, B. Boenninghoff, O. Walter, R. Haeb-Umbach, in: INTERSPEECH 2017, Stockholm, Sweden, 2017.

Conference Paper | English
Author
Glarner, Thomas; Boenninghoff, Benedikt; Walter, Oliver; Haeb-Umbach, Reinhold
Abstract
In this contribution, we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach learns a probabilistic mapping from acoustic units to characters and uses it to segment the audio data into words without the need for a pronunciation dictionary. This is achieved with three components: an unsupervised acoustic unit discovery system, an acoustic unit-to-grapheme converter trained with supervision, and a word discovery system that is initialized with a language model trained on the text data. Experiments for multiple setups show that initializing the language model with text data improves the word segmentation performance by a large margin.
Publishing Year
2017
Proceedings Title
INTERSPEECH 2017, Stockholm, Sweden

Cite this

Glarner T, Boenninghoff B, Walter O, Haeb-Umbach R. Leveraging Text Data for Word Segmentation for Underresourced Languages. In: INTERSPEECH 2017, Stockholm, Sweden; 2017.
Glarner, T., Boenninghoff, B., Walter, O., & Haeb-Umbach, R. (2017). Leveraging Text Data for Word Segmentation for Underresourced Languages. In INTERSPEECH 2017, Stockholm, Sweden.
@inproceedings{Glarner_Boenninghoff_Walter_Haeb-Umbach_2017, title={Leveraging Text Data for Word Segmentation for Underresourced Languages}, booktitle={INTERSPEECH 2017, Stockholm, Sweden}, author={Glarner, Thomas and Boenninghoff, Benedikt and Walter, Oliver and Haeb-Umbach, Reinhold}, year={2017} }
Glarner, Thomas, Benedikt Boenninghoff, Oliver Walter, and Reinhold Haeb-Umbach. “Leveraging Text Data for Word Segmentation for Underresourced Languages.” In INTERSPEECH 2017, Stockholm, Sweden, 2017.
T. Glarner, B. Boenninghoff, O. Walter, and R. Haeb-Umbach, “Leveraging Text Data for Word Segmentation for Underresourced Languages,” in INTERSPEECH 2017, Stockholm, Sweden, 2017.
Glarner, Thomas, et al. “Leveraging Text Data for Word Segmentation for Underresourced Languages.” INTERSPEECH 2017, Stockholm, Sweden, 2017.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]

Link(s) to Main File(s)
Access Level
Restricted Closed Access
External material:
Supplementary Material
Description
Poster
