IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains

R.H. Gusmita, M.F.A. Abshar, D. Moussallem, A.-C. Ngonga Ngomo, in: Lecture Notes in Computer Science, Springer Nature Switzerland, Cham, 2024.

Book Chapter | Published | English
Abstract
In recent years, there has been a surge in natural language processing research focused on low-resource languages (LrLs), underscoring the growing recognition that LrLs deserve the same attention as high-resource languages (HrLs). This shift is crucial for ensuring linguistic diversity and inclusivity in the digital age. Despite Indonesian ranking as the 11th most spoken language globally, it remains under-resourced in terms of computational tools and datasets. Within the semantic web domain, Entity Linking (EL) is pivotal, linking textual entity mentions to their corresponding entries in knowledge bases. This process is foundational for advanced information extraction tasks, including relation extraction and event detection. To bolster EL research in Indonesian, we introduce IndEL, the first benchmark dataset tailored for both general and specific domains. IndEL was manually curated using Wikidata, adhering to a rigorous set of annotation guidelines. We used two Named Entity Recognition (NER) benchmark datasets for entity extraction: NER UI for the general domain and IndQNER for the specific domain. IndQNER focused on entities from the Indonesian translation of the Quran. IndEL comprises 4765 entities in the general domain and 2453 in the specific domain. Using the GERBIL framework, we use IndEL to evaluate the performance of various EL systems, such as Babelfy, DBpedia Spotlight, MAG, OpenTapioca, and WAT. Our further investigation reveals that within Wikidata, a significant number of NIL entities remain unlinked due to the limited number of Indonesian labels and the use of acronyms. Especially in the specific domain, transliteration and translation processes performed to create the Indonesian translation of the Quran contribute to the presence of entities in a descriptive form and as synonyms.
Publishing Year
2024
Book Title
Lecture Notes in Computer Science
Conference
The 29th Annual International Conference on Natural Language & Information Systems (NLDB 2024)
Conference Location
Turin, Italy
Conference Date
2024-06-25 – 2024-06-27

Cite this

Gusmita RH, Abshar MFA, Moussallem D, Ngonga Ngomo A-C. IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains. In: Lecture Notes in Computer Science. Springer Nature Switzerland; 2024. doi:10.1007/978-3-031-70239-6_34
Gusmita, R. H., Abshar, M. F. A., Moussallem, D., & Ngonga Ngomo, A.-C. (2024). IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains. In Lecture Notes in Computer Science. The 29th Annual International Conference on Natural Language & Information Systems (NLDB 2024), Turin, Italy. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-70239-6_34
@inbook{Gusmita_Abshar_Moussallem_Ngonga_Ngomo_2024, place={Cham}, title={IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains}, DOI={10.1007/978-3-031-70239-6_34}, booktitle={Lecture Notes in Computer Science}, publisher={Springer Nature Switzerland}, author={Gusmita, Ria Hari and Abshar, Muhammad Faruq Amiral and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille}, year={2024} }
Gusmita, Ria Hari, Muhammad Faruq Amiral Abshar, Diego Moussallem, and Axel-Cyrille Ngonga Ngomo. “IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains.” In Lecture Notes in Computer Science. Cham: Springer Nature Switzerland, 2024. https://doi.org/10.1007/978-3-031-70239-6_34.
R. H. Gusmita, M. F. A. Abshar, D. Moussallem, and A.-C. Ngonga Ngomo, “IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains,” in Lecture Notes in Computer Science, Cham: Springer Nature Switzerland, 2024.
Gusmita, Ria Hari, et al. “IndEL: Indonesian Entity Linking Benchmark Dataset for General and Specific Domains.” Lecture Notes in Computer Science, Springer Nature Switzerland, 2024, doi:10.1007/978-3-031-70239-6_34.