IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran

Gusmita, Ria Hari; Firmansyah, Asep Fajar; Moussallem, Diego; Ngonga Ngomo, Axel-Cyrille

R.H. Gusmita, A.F. Firmansyah, D. Moussallem, A.-C. Ngonga Ngomo, in: Natural Language Processing and Information Systems, Springer Nature Switzerland, Cham, 2023.

DOI

10.1007/978-3-031-35320-8_12

Book Chapter | Published | English

Author

Gusmita, Ria Hari; Firmansyah, Asep Fajar; Moussallem, Diego; Ngonga Ngomo, Axel-Cyrille

Department

Fakultät für Elektrotechnik, Informatik und Mathematik
Data Science / Heinz Nixdorf Institut

Abstract

Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancement of NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside the IndoBERT pre-trained language model. A second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the general domain; none is domain-specific. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world’s largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second, achieving an F1 score of 0.98. This outcome indicates that IndQNER may be an acceptable evaluation benchmark for Indonesian NER tasks in the aforementioned domain, widening the domain range of Indonesian NER research.

Keywords

NER benchmark dataset; Indonesian; specific domain

Publishing Year

2023

Book Title

Natural Language Processing and Information Systems

Conference

International Conference on Applications of Natural Language to Information Systems (NLDB) 2023

Conference Location

Derby, UK

Conference Date

2023-06-21 – 2023-06-23

ISBN

9783031353192, 9783031353208

ISSN

0302-9743, 1611-3349

LibreCat-ID

46572

Cite this

Gusmita RH, Firmansyah AF, Moussallem D, Ngonga Ngomo A-C. IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran. In: Natural Language Processing and Information Systems. Springer Nature Switzerland; 2023. doi:10.1007/978-3-031-35320-8_12

Gusmita, R. H., Firmansyah, A. F., Moussallem, D., & Ngonga Ngomo, A.-C. (2023). IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran. In Natural Language Processing and Information Systems. International Conference on Applications of Natural Language to Information Systems (NLDB) 2023, Derby, UK. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-35320-8_12

@inbook{Gusmita_Firmansyah_Moussallem_Ngonga_Ngomo_2023, place={Cham}, title={IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran}, DOI={10.1007/978-3-031-35320-8_12}, booktitle={Natural Language Processing and Information Systems}, publisher={Springer Nature Switzerland}, author={Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille}, year={2023} }

Gusmita, Ria Hari, Asep Fajar Firmansyah, Diego Moussallem, and Axel-Cyrille Ngonga Ngomo. “IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran.” In Natural Language Processing and Information Systems. Cham: Springer Nature Switzerland, 2023. https://doi.org/10.1007/978-3-031-35320-8_12.

R. H. Gusmita, A. F. Firmansyah, D. Moussallem, and A.-C. Ngonga Ngomo, “IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran,” in Natural Language Processing and Information Systems, Cham: Springer Nature Switzerland, 2023.

Gusmita, Ria Hari, et al. “IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran.” Natural Language Processing and Information Systems, Springer Nature Switzerland, 2023, doi:10.1007/978-3-031-35320-8_12.

URL

https://link.springer.com/chapter/10.1007/978-3-031-35320-8_12
