The Optimization of n-Gram Feature Extraction Based on Term Occurrence for Cyberbullying Classification

Yudi Setiawan; Nur Ulfa Maulidevi; Kridanto Surendro

doi:10.5334/dsj-2024-031

Data Science Journal (May 2024)

Affiliations

Yudi Setiawan: School of Electrical Engineering and Informatics, Institute of Technology Bandung, Bandung
Nur Ulfa Maulidevi: School of Electrical Engineering and Informatics, Institute of Technology Bandung, Bandung
Kridanto Surendro: School of Electrical Engineering and Informatics, Institute of Technology Bandung, Bandung

DOI: https://doi.org/10.5334/dsj-2024-031
Journal volume & issue: Vol. 23
p. 31

Abstract

Cyberbullying communications need to be grouped into classes because online harassment is growing and has serious consequences. The high incidence of cyberbullying calls for robust text classification algorithms to protect individuals and communities. The n-Gram approach models language by collecting sequences of ‘n’ components, usually words or characters, from a text; it captures how words relate to one another and whether key terms or sentences belong to cyberbullying document types. This research improves term value generation and text classification accuracy by extracting features with TF-IDF and n-Gram. The optimal TF-IDF feature extraction pattern demonstrated the usefulness of n-Gram in classifying cyberbullying documents. The field demands both good classification and good feature extraction, and because cyberbullying takes many forms and occurs in many venues, broadly applicable classification is essential. To test unigram, bigram, and trigram approaches across different text lengths and term frequencies, this study uses several parameter values. The research also identifies the limitations and gaps in earlier methods and underscores the need for varied n-Gram parameter values to handle the complexity of cyberbullying text. Short-sentence documents, fluctuating data frequencies, and dynamic online interactions call for more sophisticated solutions. Optimal n-Gram patterns improve cyberbullying text classification and add context to the field. This research acknowledges the prevalence and effects of cyberbullying, the need for effective classification methods, and the limitations of current techniques, opening the way for more comprehensive and adaptive strategies to combat online harassment.
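
As a rough illustration of the feature extraction the abstract describes, the following Python sketch builds word-level unigrams, bigrams, and trigrams and weights them with a textbook TF-IDF formula. The function names `word_ngrams` and `tf_idf`, the toy messages, and the unsmoothed weighting are assumptions made for illustration; the abstract does not specify the authors' exact term-occurrence scheme or parameter values.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of `text` as a list of tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs, n):
    """Return one {n-gram: weight} dict per document (textbook TF-IDF)."""
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    weights = []
    for c in counts:
        total = sum(c.values())  # number of n-grams in this document
        weights.append({
            gram: (freq / total) * math.log(len(docs) / df[gram])
            for gram, freq in c.items()
        })
    return weights


# Hypothetical toy messages, not taken from the study's dataset.
docs = ["you are so stupid", "you are so kind", "have a nice day"]
for n in (1, 2, 3):  # unigram, bigram, trigram
    print(n, tf_idf(docs, n)[0])
```

In this sketch, an n-gram that appears in only one message receives a higher weight than one shared across messages, which is the property that makes such features useful to a downstream classifier. Documents shorter than `n` words simply yield no n-grams, one reason short texts may need different `n` values.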

Published in Data Science Journal

ISSN: 1683-1470 (Online)
Publisher: Ubiquity Press
Country of publisher: United Kingdom
LCC subjects: Science: Science (General)
Website: http://datascience.codata.org/
