News Classification for Identifying Traffic Incident Points in a Spanish-Speaking Country: A Real-World Case Study of Class Imbalance Learning

Gilberto Rivera; Rogelio Florencia; Vicente García; Alejandro Ruiz; J. Patricia Sánchez-Solís

doi:10.3390/app10186253

Applied Sciences (Sep 2020)

Affiliations

Gilberto Rivera: Departamento de Eléctrica y Computación, División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Macías Delgado #18100, Cd. Juárez 32000, Chihuahua, Mexico
Rogelio Florencia: Departamento de Eléctrica y Computación, División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Macías Delgado #18100, Cd. Juárez 32000, Chihuahua, Mexico
Vicente García: Departamento de Eléctrica y Computación, División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Macías Delgado #18100, Cd. Juárez 32000, Chihuahua, Mexico
Alejandro Ruiz: Departamento de Eléctrica y Computación, División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Macías Delgado #18100, Cd. Juárez 32000, Chihuahua, Mexico
J. Patricia Sánchez-Solís: Departamento de Eléctrica y Computación, División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Macías Delgado #18100, Cd. Juárez 32000, Chihuahua, Mexico

DOI: https://doi.org/10.3390/app10186253
Journal volume & issue: Vol. 10, no. 18, p. 6253

Abstract

‘El Diario de Juárez’ is a local newspaper in a city of 1.5 million Spanish-speaking inhabitants; it publishes texts that citizens read on both a website and an RSS (Really Simple Syndication) service. This research applies natural-language-processing and machine-learning algorithms to the news items provided by the RSS service in order to classify them according to whether or not they report a traffic incident, with the ultimate aim of notifying citizens where such incidents occur. The classification process explores the bag-of-words technique with five learners (Classification and Regression Tree (CART), Naïve Bayes, k-nearest neighbors (kNN), Random Forest, and Support Vector Machine (SVM)) on a class-imbalanced benchmark; the imbalance is addressed with five sampling algorithms: synthetic minority oversampling technique (SMOTE), borderline SMOTE, adaptive synthetic sampling, random oversampling, and random undersampling. The final classifier reaches a sensitivity of 0.86 and an area under the precision-recall curve of 0.86, which is an acceptable performance considering the complexity of analyzing unstructured texts in Spanish.
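To make the pipeline in the abstract concrete, the following is a minimal sketch of one of its combinations: bag-of-words features, random oversampling of the minority class, and a Naïve Bayes learner, scored by sensitivity. It is not the authors' implementation. The headlines, labels, function names, and the whitespace tokenizer are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def random_oversample(samples, labels, seed=0):
    """Re-draw samples (with replacement) from the smaller classes until
    every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, bags, labels):
        self.vocab = set()
        self.word_counts = {}
        self.class_counts = Counter(labels)
        for bag, y in zip(bags, labels):
            self.word_counts.setdefault(y, Counter()).update(bag)
            self.vocab.update(bag)
        return self

    def predict(self, bag):
        n_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for y, n_y in self.class_counts.items():
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            score = math.log(n_y / n_docs)  # log prior
            for word, k in bag.items():
                if word in self.vocab:  # ignore words never seen in training
                    score += k * math.log((self.word_counts[y][word] + 1) / denom)
            if score > best_score:
                best, best_score = y, score
        return best


def sensitivity(y_true, y_pred, positive=1):
    """Recall of the positive class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


# Invented toy headlines (NOT from the paper's corpus); 1 = traffic incident.
train_texts = [
    "choque entre dos autos deja un herido en la avenida",
    "volcadura de camion en el bulevar",
    "el alcalde inaugura un parque",
    "sube el precio del dolar",
    "anuncian festival de musica en el centro",
    "equipo local gana el partido",
    "abren nueva escuela en la colonia",
    "pronostican lluvia para el fin de semana",
]
train_labels = [1, 1, 0, 0, 0, 0, 0, 0]

train_bags = [bag_of_words(t) for t in train_texts]
balanced_bags, balanced_labels = random_oversample(train_bags, train_labels)
model = NaiveBayes().fit(balanced_bags, balanced_labels)

test_texts = ["choque de camion en la avenida", "el alcalde anuncia festival"]
test_labels = [1, 0]
predictions = [model.predict(bag_of_words(t)) for t in test_texts]
print(predictions)  # → [1, 0]
print(sensitivity(test_labels, predictions))  # → 1.0
```

The perfect score here reflects only the tiny invented data set and says nothing about the 0.86 reported in the abstract. Oversampling is applied to the training split only, so that duplicated minority examples cannot leak into evaluation; the other samplers named in the abstract (SMOTE and its variants) would replace `random_oversample` but need numeric feature vectors rather than raw token counts in a `Counter`.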

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
