DanfeNER - Named Entity Recognition in Nepali Tweets

Nobal Niraula; Jeevan Chapagain

doi:10.32473/flairs.36.133384

Authors

Nobal Niraula
Jeevan Chapagain, University of Memphis, https://orcid.org/0009-0009-7185-0815

DOI:

https://doi.org/10.32473/flairs.36.133384

Abstract

Twitter allows users to easily post tweets on any subject or event at any time, generating massive amounts of rich text content on diverse topics. Automated methods such as Named Entity Recognition (NER) are required to process this massive volume of tweet data. Processing tweets, however, poses a special challenge: they are informal posts with incomplete context and, because of length constraints, often contain acronyms, hashtags, misspellings, abbreviations, and URLs. This paper presents the first systematic study of NER in Nepali tweets, covering five entity types: Person Name (PER), Location (LOC), Organization (ORG), Date (DAT), and Event (EVT). We develop DanfeNER, the first human-labeled, high-quality NER benchmark data set for the low-resource language Nepali. DanfeNER contains 5,366 records and 3,463 entities in its train set and 2,301 records and 1,503 entities in its test set. Using this data set, we benchmark several state-of-the-art Nepali monolingual and multilingual transformer models, obtaining micro-averaged F1 scores of up to 81%.
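As an illustration of the micro-averaged F1 metric mentioned in the abstract, the following is a minimal sketch, not the authors' evaluation code. It assumes BIO-style tags (for example `B-PER`, `I-PER`, `O`) and exact-match, entity-level scoring; the abstract does not state the tagging scheme or the matching criterion. The function names `extract_spans` and `micro_f1` and the toy data are hypothetical.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); a hypothetical representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # An open span ends at "O", at any "B-", or at an "I-" of another type.
        if label is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            spans.add((start, i, label))
            start, label = None, None
        # Open a new span; a stray "I-" with no open span is treated as a start.
        if tag != "O" and label is None:
            start, label = i, tag[2:]
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: pool counts over all sentences and entity types."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)  # spans with identical boundaries and type
        fp += len(p - g)  # predicted spans not in the gold annotation
        fn += len(g - p)  # gold spans the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example using the five entity types named in the abstract.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "B-DAT"]]
pred = [["B-PER", "I-PER", "O", "O"], ["B-ORG", "O", "B-EVT"]]
score = micro_f1(gold, pred)
```

In this toy case two predicted spans match the gold annotation, one predicted span has the wrong type, and two gold spans are missed, so precision is 2/3 and recall is 2/4. Micro-averaging pools these counts across all entity types before computing F1, so frequent types weigh more heavily than rare ones.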
