Domain Adapted NLP for Multi-Label Crash Narrative Classification under Extreme Class Imbalance

Authors

  • Md Sajjad Hossain, Prairie View A&M University
  • Lin Li
  • Judy Perkins
  • John Clary
  • Joel Meyer
  • Ramon Gomez

DOI:

https://doi.org/10.32473/flairs.39.1.141831

Abstract

Police crash reports contain unstructured narrative descriptions that document the circumstances and contributing factors of roadway incidents. Manual analysis of these narratives is labor-intensive and difficult to scale, limiting timely and consistent safety analysis by transportation agencies. Automating this task is challenging due to domain-specific shorthand, inconsistent writing, and extreme class imbalance across contributing factor categories. In this study, we investigate multi-label classification of police crash narratives and present a practical end-to-end Natural Language Processing (NLP) pipeline tailored to real-world crash data. Through a systematic comparison of neural architectures, including transformer-based models, we find that increasing model complexity does not guarantee significant performance gains under severe label imbalance, while representation quality and inference strategy play a more critical role. We introduce a domain-aware vocabulary normalization method to recover semantic information from crash-specific abbreviations and apply per-class decision threshold optimization to improve minority-class detection. Experiments on a real-world crash narrative dataset show that the proposed approach achieves classification accuracy around 72% and improves minority-class F1 scores over standard preprocessing and fixed-threshold baselines, demonstrating that lightweight models combined with domain adaptation and calibrated inference provide effective and scalable automation for safety-critical crash narrative analysis.
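The two techniques named in the abstract, vocabulary normalization and per-class threshold optimization, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abbreviation map, the function names (normalize_narrative, f1_binary, tune_thresholds), the threshold grid, and the choice of F1 as the tuning objective are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical abbreviation map; the paper's actual vocabulary is not given.
ABBREVIATIONS = {
    "v1": "vehicle one",
    "v2": "vehicle two",
    "nb": "northbound",
    "sb": "southbound",
    "ftyr": "failed to yield right of way",
}


def normalize_narrative(text, mapping=ABBREVIATIONS):
    """Lowercase the narrative and expand whole-word abbreviations."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text.lower())


def f1_binary(y_true, y_pred):
    """F1 score for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(y_true, y_prob, grid=None):
    """Pick, for each label, the grid threshold that maximizes validation F1.

    y_true: list of rows of 0/1 labels; y_prob: list of rows of probabilities.
    Ties are resolved in favor of the lowest threshold on the grid.
    """
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    thresholds = []
    for j in range(len(y_true[0])):
        truth = [row[j] for row in y_true]
        probs = [row[j] for row in y_prob]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            score = f1_binary(truth, [int(p >= t) for p in probs])
            if score > best_f1:
                best_t, best_f1 = t, score
        thresholds.append(best_t)
    return thresholds
```

In this sketch the thresholds would be tuned on a held-out validation split and then applied unchanged at test time. A rare label whose predicted probabilities seldom reach 0.5 can still be detected once its own cutoff is lowered, which is the intuition behind improving minority-class F1 over a fixed-threshold baseline.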

Published

06-05-2026

How to Cite

Hossain, M. S., Li, L., Perkins, J., Clary, J., Meyer, J., & Gomez, R. (2026). Domain Adapted NLP for Multi-Label Crash Narrative Classification under Extreme Class Imbalance. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141831

Section

Special Track: Applied Natural Language Processing