Semi-Automatic Classification and Duplicate Detection From Human Loss News Corpus

Adnan Abid; Waqas Ali; Muhammad Shoaib Farooq; Uzma Farooq; Nabeel Sabir Khan; Kamran Abid

doi:10.1109/ACCESS.2020.2995789

IEEE Access (Jan 2020)

Affiliations

Adnan Abid: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Waqas Ali: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Muhammad Shoaib Farooq: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Uzma Farooq: Department of Software Engineering, University of Management and Technology, Lahore, Pakistan
Nabeel Sabir Khan: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Kamran Abid: Department of Electrical Engineering, University of the Punjab, Lahore, Pakistan

Journal volume & issue: Vol. 8
pp. 97737 – 97747

Abstract

Automatic news repository collection systems rely on a news crawler that extracts articles from different news portals. The extracted articles must then be processed to determine their category, e.g. sports, politics, or showbiz. This process poses two main challenges. The first is to place each news article under the right category. The second is to detect duplicates: when news is extracted from multiple sources, the same story is highly likely to arrive from several portals, and failing to detect such duplicates may lead to inconsistent statistics after pre-processing the news text. The problem becomes more pertinent for human loss news, i.e. articles about crime, accidents, and similar events, because the system may count the same incident many times and produce misleading statistics. To address these problems, this research makes the following contributions. Firstly, a news corpus comprising human loss news of different categories has been developed by gathering data from well-known and authentic news websites; the corpus also includes a number of duplicate news articles. Secondly, different classification approaches have been compared to find empirically the most suitable text classifier for categorizing the sub-categories of human loss news. Lastly, methods have been proposed and compared for detecting duplicate news in the corpus, combining different pre-processing techniques with two widely used similarity measures, cosine similarity and Jaccard's coefficient. The results show that conventional text classifiers are still relevant and perform well in text classification tasks, as multinomial naive Bayes (MNB) achieved 89.5% accuracy. Meanwhile, the Jaccard coefficient performs much better than cosine similarity for duplicate news detection across the different pre-processing variations, with an average accuracy of 83.16%.
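
The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the use of raw term frequencies for cosine similarity, the helper names (tokenize, jaccard, cosine, is_duplicate), the example headlines, and the 0.5 threshold are all assumptions chosen for illustration, and the pre-processing variations studied in the paper are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens.

    Assumption: the paper's actual pre-processing is not described in the
    abstract, so this simple tokenizer is only a stand-in.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the two token sets."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for an empty text; use 0
    return dot / (norm_a * norm_b)


def is_duplicate(a, b, measure=jaccard, threshold=0.5):
    """Flag a pair as duplicate when its similarity reaches the threshold.

    The threshold of 0.5 is a hypothetical value, not taken from the paper.
    """
    return measure(a, b) >= threshold


# Hypothetical headlines describing the same incident.
first = "Two killed in road accident"
second = "Two killed in accident"
print(jaccard(first, second))
print(cosine(first, second))
print(is_duplicate(first, second))
```

Jaccard compares only which words occur, while cosine similarity also weighs how often they occur, so the two measures can rank the same pair of articles differently. Any threshold would need to be tuned on labeled duplicate pairs.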

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
