Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web

Mark Abraham Magumba; Peter Nabende

doi:10.1186/s40537-021-00528-5

Journal of Big Data (Oct 2021)

Affiliations

Mark Abraham Magumba: Department of Information Systems, School of Computing and Informatics Technology, Makerere University College of Computing and Information Sciences
Peter Nabende: Department of Information Systems, School of Computing and Informatics Technology, Makerere University College of Computing and Information Sciences

DOI: https://doi.org/10.1186/s40537-021-00528-5
Journal volume & issue: Vol. 8, no. 1, pp. 1–17

Abstract

Twitter, and social media as a whole, have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions. This makes them susceptible to false positives, because keyword volume can be influenced by several social phenomena that may be unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web. In addition, we offer a systematic comparison of the impact of different input representations on performance. Specifically, we compare continuous representations against one-hot encoding for word-based units, class-based (ontology-based) units, and subword units in the form of byte pair encodings. We also establish the desirable performance characteristics for multilingual semantic filtering approaches and offer an in-depth discussion of the implications for end-to-end surveillance.
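The input representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`one_hot`, `dense_lookup`, `bpe_merge_once`), the toy vocabulary, and the random placeholder vectors are all assumptions made for illustration. The sketch contrasts a sparse one-hot encoding with a dense continuous lookup for word tokens, and shows a single merge step of byte pair encoding over character sequences.

```python
import random
from collections import Counter


def one_hot(tokens, vocab):
    """Map each token to a sparse indicator vector over the vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        if tok in index:  # out-of-vocabulary tokens stay all-zero
            vec[index[tok]] = 1
        vectors.append(vec)
    return vectors


def dense_lookup(tokens, vocab, dim=4, seed=0):
    """Map each token to a low-dimensional real-valued vector.

    The vectors here are random placeholders (an assumption for this
    sketch); a real system would learn them from data.
    """
    rng = random.Random(seed)
    table = {word: [rng.uniform(-1, 1) for _ in range(dim)] for word in vocab}
    zero = [0.0] * dim
    return [table.get(tok, zero) for tok in tokens]


def bpe_merge_once(words):
    """Apply one byte pair encoding merge to a list of symbol sequences.

    Counts adjacent symbol pairs and joins every occurrence of the most
    frequent pair into a single symbol.
    """
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    if not pairs:
        return words, None
    best = pairs.most_common(1)[0][0]
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged, best


if __name__ == "__main__":
    vocab = ["flu", "fever", "cough"]      # toy vocabulary (hypothetical)
    tokens = ["flu", "fever", "malaria"]   # "malaria" is out of vocabulary
    print(one_hot(tokens, vocab))
    print(dense_lookup(tokens, vocab))
    print(bpe_merge_once([list("flu"), list("flue"), list("fly")]))
```

In this sketch the one-hot vectors grow with the vocabulary size, while the dense vectors keep a fixed dimension. The subword merge lets words that share a prefix, such as "flu" and "flue", share a symbol.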

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
