Systematic Analysis of Tokenization Properties in Low-Resource Polysynthetic NMT

DOI: https://doi.org/10.32473/flairs.39.1.141784

Abstract

Tokenization is critical for neural machine translation (NMT), especially for polysynthetic and low-resource languages with complex morphology. While subword methods like Byte Pair Encoding (BPE) are common in modern NMT, their effectiveness varies across languages, particularly for languages unseen during tokenizer training. For languages like Cherokee, characterized by long, information-dense words and limited parallel data, it remains unclear which tokenization properties most influence translation performance. This work examines the impact of different tokenization characteristics on English-Cherokee neural machine translation. We evaluate seven BPE-based tokenizers by fine-tuning a BART model with a frozen decoder to isolate the effects of tokenization. Using intrinsic metrics, we analyze how tokenizer properties relate to translation performance.
Our results show that translation performance depends on balancing token reusability and meaningful lexical representation, with normalized entropy exhibiting the strongest correlation. While higher information density can improve performance, extreme vocabulary compression, as in character-level tokenization, does not improve BLEU scores. This suggests that both vocabulary compactness and semantic richness are necessary for effective tokenization in polysynthetic, low-resource languages.
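
As an illustration of the kind of intrinsic metric referred to above, the following minimal sketch computes a normalized entropy over a tokenized corpus. The abstract does not give the exact definition used in the paper, so this assumes one common form: the Shannon entropy of the token frequency distribution divided by the log of the number of distinct token types, giving a value between 0 and 1. The function name `normalized_entropy` and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(tokens):
    """Shannon entropy of token frequencies, scaled to [0, 1].

    Assumed definition (not taken from the paper): H / log2(V), where H is
    the entropy of the empirical token distribution and V is the number of
    distinct token types observed.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # With zero or one distinct token there is no uncertainty to measure.
    if len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


# Toy token sequences (hypothetical, for illustration only).
uniform = ["a", "b", "a", "b"]   # every type equally frequent
skewed = ["a", "a", "a", "b"]    # one type dominates

print(normalized_entropy(uniform))
print(normalized_entropy(skewed))
```

Under this assumed definition, values near 1 indicate that tokens are used about equally often, while values near 0 indicate that a few tokens dominate the corpus.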

Published

06-05-2026

How to Cite

Ndayegamiye, D., Roberts, J., & Frey, B. (2026). Systematic Analysis of Tokenization Properties in Low-Resource Polysynthetic NMT. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141784

Section

Special Track: Applied Natural Language Processing