IEEE Access (Jan 2020)

Semi-Automatic Classification and Duplicate Detection From Human Loss News Corpus

  • Adnan Abid,
  • Waqas Ali,
  • Muhammad Shoaib Farooq,
  • Uzma Farooq,
  • Nabeel Sabir Khan,
  • Kamran Abid

DOI
https://doi.org/10.1109/ACCESS.2020.2995789
Journal volume & issue
Vol. 8
pp. 97737 – 97747

Abstract

Automatic news repository collection systems rely on a news crawler that extracts news from different news portals. The extracted news then needs to be processed to determine the category of each article, e.g. sports, politics, or showbiz. This process poses two main challenges. The first is to place a news article under the right category. The second is to detect duplicate news: when news is extracted from multiple sources, it is highly probable that the same story will be obtained from many different portals, and failing to detect such duplicates may lead to inconsistent statistics after the news text is pre-processed. This problem becomes more pertinent for human loss news, such as articles about crime and accidents, because the system may count the same event many times and produce misleading statistics. To address these problems, this research makes the following contributions. Firstly, a news corpus comprising human loss news of different categories has been developed by gathering data from well-known and authentic news websites. The corpus also includes a number of duplicate news articles. Secondly, different classification approaches have been compared to determine empirically the most suitable text classifier for the sub-categories of human loss news. Lastly, methods have been proposed and compared for detecting duplicate news in the corpus, combining different pre-processing techniques with two widely used similarity measures, cosine similarity and Jaccard's coefficient. The results show that conventional text classifiers are still relevant and perform well in text classification tasks, as MNB achieved an accuracy of 89.5%. For duplicate news detection, the Jaccard coefficient performs much better than cosine similarity across the different pre-processing variations, with an average accuracy of 83.16%.
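
To make the two similarity measures named in the abstract concrete, the following is a minimal Python sketch of how they could be applied to a pair of news texts. It is not the authors' implementation: the tokenizer, the function names (tokenize, jaccard, cosine, is_duplicate), the sample headlines and the 0.5 threshold are all assumptions chosen for illustration, and the abstract does not specify the pre-processing steps or thresholds actually used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Assumption: a very simple pre-processing step; the abstract does not
    describe the pre-processing variations that were evaluated.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| over token sets."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between term-frequency vectors of the two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for an empty text; report 0
    return dot / (norm_a * norm_b)


def is_duplicate(a, b, measure=jaccard, threshold=0.5):
    """Flag a pair as duplicate when the similarity reaches the threshold.

    Assumption: the threshold of 0.5 is a placeholder, not a value taken
    from the paper.
    """
    return measure(a, b) >= threshold


if __name__ == "__main__":
    # Hypothetical headlines, invented for this example.
    first = "Two killed in road accident near Lahore"
    second = "Road accident near Lahore leaves two killed"
    print(jaccard(first, second), cosine(first, second))
    print(is_duplicate(first, second))
```

Jaccard compares the sets of distinct tokens, whereas cosine similarity also weighs how often each token occurs, so the two measures can rank the same pair of articles differently depending on the pre-processing applied.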

Keywords