Convolutional Neural Network Based Ensemble Approach for Homoglyph Recognition

Md. Taksir Hasan Majumder; Md. Mahabur Rahman; Anindya Iqbal; M. Sohel Rahman

doi:10.3390/mca25040071

Mathematical and Computational Applications (Oct 2020)

Affiliations

Md. Taksir Hasan Majumder: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Md. Mahabur Rahman: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
Anindya Iqbal: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
M. Sohel Rahman: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh

DOI: https://doi.org/10.3390/mca25040071
Journal volume & issue: Vol. 25, no. 4, p. 71

Abstract

Homoglyphs are pairs of visual representations of Unicode characters that look similar to the human eye. Identifying homoglyphs is extremely useful for building a strong defence mechanism against many phishing and spoofing attacks, ID imitation, profanity abuse, etc. Although the Unicode Consortium publishes a list of discovered homoglyphs, the regular expansion of Unicode character scripts necessitates a robust and reliable algorithm capable of identifying all possible new homoglyphs. In this article, we first show that shallow Convolutional Neural Networks are capable of identifying homoglyphs. We propose two variations, both of which obtain very high accuracy (99.44%) on our benchmark dataset. We also report that adopting transfer learning allows another model to achieve 100% recall on our dataset. We ensemble these three methods to obtain 99.72% accuracy on our independent test dataset. These results illustrate the superiority of our ensembled model in detecting homoglyphs and suggest that our model can be used to detect new homoglyphs as more Unicode characters are added. As a by-product, we also prepare a benchmark dataset based on the currently available list of homoglyphs.
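
As an illustration only: the abstract does not state how the three models are combined, so the sketch below assumes simple majority voting over binary "homoglyph / not homoglyph" predictions. The function name `majority_vote` and the example predictions are hypothetical and are not taken from the article. The well-known pair Latin "a" (U+0061) and Cyrillic "а" (U+0430) is used to show what a homoglyph pair looks like at the code-point level.

```python
import unicodedata


def majority_vote(predictions):
    """Combine binary predictions (0 or 1) from several models.

    Assumption: plain majority voting; the article may combine
    its models differently.
    """
    return int(sum(predictions) * 2 > len(predictions))


# Two characters that render almost identically but are distinct code points.
latin_a = "\u0061"
cyrillic_a = "\u0430"
names = (unicodedata.name(latin_a), unicodedata.name(cyrillic_a))

# Hypothetical outputs of three classifiers for this pair (1 = homoglyph).
example_predictions = [1, 1, 0]
ensemble_decision = majority_vote(example_predictions)
```

With three voters, a single dissenting model cannot change the decision, which is one common motivation for combining an odd number of classifiers.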

Published in Mathematical and Computational Applications

ISSN: 1300-686X (Print); 2297-8747 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Applied mathematics. Quantitative methods; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/mca
