IEEE Access (Jan 2024)

Polish Word Recognition Based on n-Gram Methods

  • Piotr Wojcicki
  • Tomasz Zientarski

DOI
https://doi.org/10.1109/ACCESS.2024.3385113
Journal volume & issue
Vol. 12
pp. 49817 – 49825

Abstract

Recognizing words in Slavic languages is difficult because of their complex declension and variety of diacritical marks. Polish is a representative West Slavic language written in the Latin alphabet. Automatic handwritten word recognition in such languages is further hampered by the poor recognition rate of letters with diacritical marks and by the lack of good handwritten text corpora for languages with declension. The main aim of this research is to investigate the possibility of correcting typos made in the final phase of recognizing Polish words. The method combines letter recognition by means of convolutional neural networks (CNNs) with text matching algorithms applied to the resulting words. In the first stage, a purpose-designed CNN recognizes individual characters. In the second stage, after the letters have been combined into words, a post-processing error correction method is applied, which improves the recognition of misspelled words. We evaluated word matching for several measures of word similarity, namely edit distance (Damerau-Levenshtein), string matching (Sørensen-Dice), and a list of candidates. In addition, we examined how word length and the number of misplaced letters affect the behaviour of the algorithms. The analysis was carried out for bigram and trigram methods. Combining different similarity measures yielded better lists of proposed words. The article proposes an innovative post-processing method for correcting errors in recognized Polish words, with the rate of correct word matching ranging from 76% to 99%, depending on the similarity measure and word length.
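
The two similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tiny lexicon, and the ranking rule (edit distance first, Sørensen-Dice score as a tie-breaker) are assumptions made for illustration, and the edit distance shown is the restricted "optimal string alignment" variant of Damerau-Levenshtein.

```python
from collections import Counter


def ngrams(word, n=2):
    """Return the overlapping character n-grams of a word (bigrams by default)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_similarity(a, b, n=2):
    """Sørensen-Dice coefficient computed over character n-gram multisets."""
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both words are shorter than n characters: fall back to equality.
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * overlap / total


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def rank_candidates(word, lexicon, n=2, top=3):
    """Hypothetical ranking rule: smallest edit distance, then highest Dice score."""
    scored = [(osa_distance(word, c), -dice_similarity(word, c, n), c) for c in lexicon]
    return [c for _, _, c in sorted(scored)[:top]]


# Illustrative lexicon (an assumption, not data from the article).
lexicon = ["książka", "książki", "kwiatek"]
print(rank_candidates("ksiazka", lexicon, top=1))
```

Passing `n=3` switches the Dice score from bigrams to trigrams. A word recognized without its diacritics, such as "ksiazka", differs from "książka" by two substitutions, so a dictionary lookup of this kind can restore the marks that the character recognizer missed.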

Keywords