Comparative Analysis of Naive Bayes and Support Vector Machine for Hate Speech Classification

Rolanda Difandana; Ian Imaduddin; Indra Indra

doi:10.57152/malcom.v6i1.2571

Authors

Rolanda Difandana, Universitas Budi Luhur
Ian Imaduddin, Universitas Budi Luhur
Indra Indra, Universitas Budi Luhur

DOI:

https://doi.org/10.57152/malcom.v6i1.2571

Keywords:

Abusive Language, Hate Speech Detection, Naive Bayes, Support Vector Machine, Text Classification

Abstract

This study addresses the growing need for automated hate speech detection in Indonesia, driven by the rapid growth of social media and the rise of abusive online content. It compares the performance of the Naive Bayes (NB) and Support Vector Machine (SVM) algorithms in classifying Indonesian-language tweets into three categories: hate speech (27.52%), abusive language (34.25%), and neutral content (38.23%). The dataset consists of 13,169 manually annotated tweets collected from Twitter (now X); its moderate class imbalance was handled with stratified sampling. Text preprocessing comprised tokenization, case folding, stopword removal, and stemming with the Nazief–Adriani algorithm, followed by TF-IDF feature extraction in a unigram configuration (min_df=3, max_df=0.95). Both algorithms were evaluated using 10-fold stratified cross-validation, with accuracy, precision, recall, and F1-score as performance metrics. SVM with a linear kernel outperformed NB, achieving an accuracy of 93.28%, a precision of 92.45%, and an F1-score of 92.89%, compared with NB’s accuracy of 84.71%, precision of 83.56%, and F1-score of 84.12%. The study is limited to classical machine learning approaches with TF-IDF features and does not incorporate deep learning or contextual embeddings. Within that scope, the results offer practical guidance for algorithm selection in Indonesian hate speech detection systems.
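
The evaluation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn (the abstract does not name a library, although min_df and max_df match the parameter names of its TfidfVectorizer), it assumes macro-averaged precision, recall, and F1, and it uses a small synthetic corpus with hypothetical vocabularies in place of the 13,169 annotated tweets. Tokenization beyond the vectorizer default, stopword removal, and Nazief–Adriani stemming are omitted.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy vocabularies standing in for preprocessed tweets
# (assumption: the real dataset is not reproduced here).
VOCAB = {
    "hate": ["benci", "usir", "kafir", "musuh", "hina"],
    "abusive": ["bodoh", "tolol", "goblok", "sialan", "brengsek"],
    "neutral": ["makan", "jalan", "cuaca", "kerja", "libur"],
}
SHARED = ["kamu", "dia", "hari", "ini"]


def make_toy_corpus(docs_per_class=20, seed=0):
    """Build a small synthetic three-class corpus with a fixed seed."""
    rng = random.Random(seed)
    texts, labels = [], []
    for label, words in VOCAB.items():
        for _ in range(docs_per_class):
            tokens = rng.choices(words, k=4) + rng.choices(SHARED, k=2)
            texts.append(" ".join(tokens))
            labels.append(label)
    return texts, labels


def evaluate(model, texts, labels, n_splits=10, seed=0):
    """Mean scores over stratified k-fold CV for a TF-IDF + classifier pipeline."""
    pipeline = make_pipeline(
        # Unigram TF-IDF with the thresholds quoted in the abstract;
        # the vectorizer lowercases by default (case folding).
        TfidfVectorizer(ngram_range=(1, 1), min_df=3, max_df=0.95),
        model,
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Macro averaging is an assumption; the abstract does not state it.
    scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
    scores = cross_validate(pipeline, texts, labels, cv=cv, scoring=scoring)
    return {name: scores[f"test_{name}"].mean() for name in scoring}


texts, labels = make_toy_corpus()
results = {
    "NB": evaluate(MultinomialNB(), texts, labels),
    "SVM": evaluate(SVC(kernel="linear"), texts, labels),
}
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Fitting the vectorizer inside the pipeline means the TF-IDF vocabulary and document frequencies are learned only from each training fold, which avoids leaking test-fold statistics into the features. Scores on this toy corpus say nothing about the figures reported in the abstract.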


