Named Entity Recognition for Nepali: Data Sets and Algorithms

Nobal Niraula; Jeevan Chapagain

doi:10.32473/flairs.v35i.130725

Authors

Nobal Niraula Nowa Lab
Jeevan Chapagain

Keywords:

Named Entity Recognition, Data Set, Nepali, Low-resource

Abstract

The Named Entity Recognition (NER) task involves locating Named Entities (NEs) in free text and classifying them into predefined categories such as Person Name, Location, and Organization. Although NER has been studied widely in resource-rich languages, it has not been studied thoroughly for Nepali, a resource-poor language. In this paper, we present a systematic study of NER for the Nepali language, with clear Annotation Guidelines that yield high inter-annotator agreement. The annotation produces EverestNER, the largest human-annotated NER data set for Nepali, with 24,587 entities in total. It has 308,353 tokens corresponding to 15,798 sentences, annotated into five categories: Person, Location, Organization, Date, and Event. We split the EverestNER data set into EverestNER-train and EverestNER-test. These standard data sets therefore become the first benchmark data sets for evaluating Nepali NER systems. We release the EverestNER benchmark data sets at https://github.com/nowalab/everest-ner to facilitate research on the Nepali language. We report a comprehensive evaluation of state-of-the-art Neural and Transformer models using these data sets. We also discuss the remaining challenges in discovering NEs for Nepali.
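
To make the task description concrete, the following is a minimal sketch of how token-level tags can be grouped into entity spans. It assumes the common BIO tagging convention and the short label names PER, LOC, ORG, DATE, and EVENT for the five categories; the abstract does not state the tag scheme or label strings used in EverestNER, so both are assumptions. The function name `bio_to_spans` and the transliterated example sentence are hypothetical and are not taken from the data set.

```python
from typing import List, Tuple

# Assumed short labels for the five categories named in the abstract.
LABELS = {"PER", "LOC", "ORG", "DATE", "EVENT"}


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "PER"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label in LABELS:
            # A new entity starts; close any entity that is still open.
            if current_label is not None:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I" and label == current_label:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O", an unknown label, or a stray "I-" tag: close the open entity.
            if current_label is not None:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Close an entity that runs to the end of the sentence.
    if current_label is not None:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Invented example sentence (transliterated), for illustration only.
tokens = ["Ram", "Sharma", "visited", "Kathmandu", "in", "2020"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
print(bio_to_spans(tokens, tags))
```

In this sketch a stray continuation tag that does not match the open entity is dropped rather than repaired, which is one of several reasonable choices for handling malformed tag sequences.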
