The Impact of Data Augmentation on the Hate Speech Detection in Portuguese Language


  • Félix Silva, Universidade Federal de Pelotas
  • Artur Cerri, Universidade Federal de Pelotas
  • Ulisses Brisolara Corrêa, Universidade Federal de Pelotas
  • Larissa A. de Freitas, Universidade Federal de Pelotas



Online communities allow users to establish a web presence, manage their identities, and stay connected with others. The internet has made global outreach possible with a single click. However, the current landscape of online social media platforms is marred by various issues, with hate speech among the most prominent. Hate speech is characterized by hostile and malicious language driven by prejudice, targeting individuals or groups based on their innate, natural, or perceived characteristics. Detecting such speech is crucial for maintaining a safe online environment. This study examines the impact of dataset regularization techniques on the performance of BERTimbau-based models when applied to four Portuguese hate speech datasets: Fortuna et al. (2019), OFFCOMBR-2, ToLD-BR, and Hate-BR. Four Data Augmentation techniques are evaluated: Oversampling, Undersampling, Text Augmentation, and Synonym Replacement. Our experiments revealed that, except on the Fortuna et al. (2019) dataset, the Data Augmentation techniques did not significantly improve hate speech detection performance.
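To make three of the techniques named above concrete, the following is a minimal sketch of random oversampling, random undersampling, and dictionary-based synonym replacement. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `synonyms` dictionary, and the replacement probability `p` are all hypothetical, and the abstract does not say how the techniques were configured.

```python
import random
from collections import defaultdict


def _group_by_label(texts, labels):
    """Collect texts into one list per class label."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    return by_label


def random_oversample(texts, labels, seed=0):
    """Duplicate random examples of smaller classes until every class
    is as large as the largest one."""
    rng = random.Random(seed)
    by_label = _group_by_label(texts, labels)
    target = max(len(items) for items in by_label.values())
    out = []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((t, label) for t in items + extra)
    rng.shuffle(out)
    return out


def random_undersample(texts, labels, seed=0):
    """Drop random examples of larger classes until every class
    is as small as the smallest one."""
    rng = random.Random(seed)
    by_label = _group_by_label(texts, labels)
    target = min(len(items) for items in by_label.values())
    out = []
    for label, items in by_label.items():
        out.extend((t, label) for t in rng.sample(items, target))
    rng.shuffle(out)
    return out


def synonym_replace(text, synonyms, p=0.5, seed=0):
    """Replace each word found in `synonyms` (a hypothetical word -> list
    of alternatives mapping) with probability p."""
    rng = random.Random(seed)
    words = text.split()
    replaced = [
        rng.choice(synonyms[w]) if (w in synonyms and rng.random() < p) else w
        for w in words
    ]
    return " ".join(replaced)
```

In a setup like this, resampling and augmentation would normally be applied to the training split only, so that the evaluation data keeps its original class distribution.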




How to Cite

Silva, F., Cerri, A., Brisolara Corrêa, U., & de Freitas, L. A. (2024). The Impact of Data Augmentation on the Hate Speech Detection in Portuguese Language. The International FLAIRS Conference Proceedings, 37(1).



Special Track: Applied Natural Language Processing