Domain Adapted NLP for Multi-Label Crash Narrative Classification under Extreme Class Imbalance
DOI: https://doi.org/10.32473/flairs.39.1.141831
Abstract
Police crash reports contain unstructured narrative descriptions that document the circumstances and contributing factors of roadway incidents. Manual analysis of these narratives is labor-intensive and difficult to scale, which limits timely and consistent safety analysis by transportation agencies. Automating the task is challenging because of domain-specific shorthand, inconsistent writing, and extreme class imbalance across contributing-factor categories. In this study, we investigate multi-label classification of police crash narratives and present a practical end-to-end Natural Language Processing (NLP) pipeline tailored to real-world crash data. Through a systematic comparison of neural architectures, including transformer-based models, we find that increasing model complexity does not guarantee significant performance gains under severe label imbalance, while representation quality and inference strategy play a more critical role. We introduce a domain-aware vocabulary normalization method to recover semantic information from crash-specific abbreviations and apply per-class decision threshold optimization to improve minority-class detection. Experiments on a real-world crash narrative dataset show that the proposed approach achieves classification accuracy of around 72% and improves minority-class F1 scores over standard preprocessing and fixed-threshold baselines. These results demonstrate that lightweight models combined with domain adaptation and calibrated inference provide effective and scalable automation for safety-critical crash narrative analysis.
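The two techniques named in the abstract, abbreviation normalization and per-class threshold optimization, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abbreviation map, the function names (`normalize`, `tune_thresholds`, `predict`), the threshold grid, and the choice of F1 as the tuning criterion are all assumptions made for illustration. In practice the thresholds would be tuned on a held-out validation set using probabilities from a trained multi-label classifier.

```python
import re

# Hypothetical abbreviation map (assumption); the paper's vocabulary is not shown here.
ABBREVIATIONS = {
    "v1": "vehicle one",
    "v2": "vehicle two",
    "nb": "northbound",
    "sb": "southbound",
}

def normalize(narrative, mapping=ABBREVIATIONS):
    """Expand whole-word abbreviations, ignoring case."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], narrative)

def f1(y_true, y_pred):
    """Binary F1 score; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_thresholds(y_true, y_prob, grid=None):
    """Pick, for each label, the grid threshold that maximizes that label's F1.

    y_true: rows of 0/1 labels; y_prob: rows of predicted probabilities.
    """
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    thresholds = []
    for j in range(len(y_true[0])):
        truth = [row[j] for row in y_true]
        probs = [row[j] for row in y_prob]
        best = max(grid, key=lambda t: f1(truth, [p >= t for p in probs]))
        thresholds.append(best)
    return thresholds

def predict(y_prob, thresholds):
    """Apply one decision threshold per label instead of a fixed 0.5."""
    return [[int(p >= t) for p, t in zip(row, thresholds)] for row in y_prob]

if __name__ == "__main__":
    print(normalize("V1 was traveling NB when V2 merged"))
    # Toy validation data: label 1 is rare and never scores above 0.5.
    y_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
    y_prob = [[0.9, 0.1], [0.2, 0.05], [0.8, 0.3], [0.4, 0.2]]
    thresholds = tune_thresholds(y_true, y_prob)
    print(thresholds, predict(y_prob, thresholds))
```

In the toy data, a fixed threshold of 0.5 would never predict the rare second label, because its highest probability is 0.3. A per-label threshold lets that label be detected, which is the intuition behind tuning thresholds per class under severe imbalance.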
License
Copyright (c) 2026 Md Sajjad Hossain, Lin Li, Judy Perkins, John Clary, Joel Meyer, Ramon Gomez

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.