Lexicon Discovery for Language Preservation using Unsupervised Word Segmentation with Pitman-Yor Language Models (FGNT-2015-01)

O. Walter, R. Haeb-Umbach, J. Strunk, N. P. Himmelmann, Lexicon Discovery for Language Preservation Using Unsupervised Word Segmentation with Pitman-Yor Language Models (FGNT-2015-01), 2015.

Report | English
Author
Walter, Oliver; Haeb-Umbach, Reinhold; Strunk, Jan; Himmelmann, Nikolaus P.
Abstract
In this paper we show that recently developed algorithms for unsupervised word segmentation can be a valuable tool for the documentation of endangered languages. We applied an unsupervised word segmentation algorithm based on a nested Pitman-Yor language model to two Austronesian languages, Wooi and Waima'a. The algorithm was then modified and parameterized to cater to linguists' need for high precision in lexical discovery: we obtained a lexicon precision of 69.2% and 67.5% for Wooi and Waima'a, respectively, when single-letter words and words found fewer than three times were discarded. A comparison with an English word segmentation task showed comparable performance, verifying that the assumptions underlying the Pitman-Yor language model, namely the universality of Zipf's law and the power of n-gram structures, also hold for languages as exotic as Wooi and Waima'a.
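The filtering step mentioned in the abstract (discarding single-letter words and words found fewer than three times) and the resulting lexicon precision can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy English data, and the definition of precision as the fraction of discovered word types that appear in a reference lexicon are assumptions made for illustration only.

```python
from collections import Counter


def discovered_lexicon(segmented_utterances, min_length=2, min_count=3):
    """Return the word types that survive the length and frequency filters.

    min_length=2 drops single-letter words; min_count=3 drops words found
    fewer than three times. Both defaults mirror the thresholds named in the
    abstract, but the function itself is a hypothetical helper.
    """
    counts = Counter(word for utt in segmented_utterances for word in utt)
    return {w for w, c in counts.items() if len(w) >= min_length and c >= min_count}


def lexicon_precision(discovered, reference):
    """Fraction of discovered word types also present in the reference lexicon
    (assumed definition; the report may define it differently)."""
    if not discovered:
        return 0.0
    return len(discovered & reference) / len(discovered)


# Toy output of a word segmenter (made-up data): "sat", "ate" and "cat"
# were each wrongly split, which creates the spurious word type "at".
segmented = [
    ["the", "dog", "s", "at"],
    ["the", "dog", "ran"],
    ["the", "dog", "at", "e"],
    ["the", "c", "at", "ran"],
]
reference = {"the", "dog", "sat", "ran", "ate", "cat"}

lexicon = discovered_lexicon(segmented)
precision = lexicon_precision(lexicon, reference)
print(sorted(lexicon), round(precision, 3))
```

In this toy case the filters remove the one-letter fragments and the rare words, while the frequent but incorrect type "at" survives and lowers the precision. Raising the thresholds would trade lexicon size for precision.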
Publishing Year
2015
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]

Link(s) to Main File(s)
Access Level
Restricted Closed Access
