Systematic Analysis of Tokenization Properties in Low-Resource Polysynthetic NMT
DOI: https://doi.org/10.32473/flairs.39.1.141784
Abstract
Tokenization is critical for neural machine translation (NMT), especially for polysynthetic and low-resource languages with complex morphology. While subword methods like Byte Pair Encoding (BPE) are common in modern NMT, their effectiveness varies across languages, particularly those unseen during tokenizer training. For languages like Cherokee, characterized by long, information-dense words and limited parallel data, it remains unclear which tokenization properties most influence translation performance. This work examines the impact of different tokenization characteristics on English-Cherokee neural machine translation in these settings. We evaluate seven BPE-based tokenizers by fine-tuning a BART model with a frozen decoder to isolate the effects of tokenization. Using intrinsic metrics, we analyze how tokenizer properties relate to translation performance.
Our results show that translation performance depends on balancing token reusability and meaningful lexical representation, with normalized entropy exhibiting the strongest correlation. While higher information density can improve performance, extreme compression, like character-level tokenization, does not improve BLEU scores. This suggests that both vocabulary compactness and semantic richness are necessary for effective tokenization in polysynthetic, low-resource languages.
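As an illustration of the kind of intrinsic metric mentioned above, the following is a minimal sketch of one common way to compute normalized entropy over a tokenized corpus. The abstract does not give the exact formula, so the definition used here (Shannon entropy of the token unigram distribution divided by the log of the number of distinct token types) is an assumption, and the function name and toy token lists are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(tokens):
    """Shannon entropy of the token unigram distribution, scaled to [0, 1].

    Assumed definition (not taken from the paper): H / log2(K), where H is
    the entropy in bits and K is the number of distinct token types observed.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if len(counts) < 2:
        # With zero or one distinct type there is no uncertainty to measure.
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


# Toy token streams (placeholders, not real Cherokee data).
uniform_stream = ["a", "b", "a", "b"]   # every type is used equally often
skewed_stream = ["a", "a", "a", "b"]    # one type dominates

print(normalized_entropy(uniform_stream))
print(normalized_entropy(skewed_stream))
```

Under this assumed definition, a value near 1 means the vocabulary is used evenly, while a value near 0 means a few tokens dominate the corpus. Comparing this score across tokenizers trained on the same text is one way to relate a tokenizer property to downstream BLEU.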
License
Copyright (c) 2026 Daniel Ndayegamiye, Jesse Roberts, Benjamin Frey

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.