Dynamic N-Gram System Based on an Online Croatian Spellchecking Service

Gordan Gledec; Renato Soic; Sandor Dembitz

doi:10.1109/ACCESS.2019.2947898

IEEE Access (Jan 2019)

Affiliations

Gordan Gledec: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Renato Soic: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Sandor Dembitz: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

DOI: https://doi.org/10.1109/ACCESS.2019.2947898
Journal volume & issue: Vol. 7
pp. 149988 – 149995

Abstract

Large-scale lexical n-gram databases are important data systems today because they provide infrastructure that accelerates the development of natural language processing applications. However, building such systems for the world’s minority languages in the way the Google n-gram project did runs into many obstacles. This paper presents an innovative approach to large-scale n-gram system creation, applied to the Croatian language. Instead of using the Web as the world’s largest text repository, our n-gram collection process relies on Hascheck, the Croatian online academic spellchecker, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, in contrast to the publicly available Google n-gram systems, in which cutoff criteria were applied. After 12 years of collecting, the Croatian n-gram system has reached the size of the largest Google Version 1 n-gram systems. Because it relies on a service in constant use, the Croatian n-gram system is dynamic. These dynamics allowed n-gram count behavior to be modeled through Heaps’ law, which led to interesting results. Like many minority languages, Croatian lacks sophisticated language processing systems in many application areas. The paper also illustrates, by example, how a rich lexical n-gram infrastructure enables rapid breakthroughs in new application areas.
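The abstract does not state how the Heaps’ law modeling was carried out. As a rough illustration only, the textbook form of the law is V(N) = K · N^β, where N is the number of items processed and V the number of distinct items observed; taking logarithms turns it into a straight line that can be fitted by least squares. The sketch below assumes that form. The function name `fit_heaps` and the sample values are hypothetical and synthetic, not data from the paper.

```python
import math

def fit_heaps(total_counts, distinct_counts):
    """Fit V = K * N**beta by least squares on log V = log K + beta * log N.

    total_counts    -- N values (items processed so far)
    distinct_counts -- V values (distinct items seen at each N)
    Returns (K, beta). Hypothetical helper, not the authors' method.
    """
    xs = [math.log(n) for n in total_counts]
    ys = [math.log(v) for v in distinct_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)   # intercept, mapped back from log space
    return k, beta

# Synthetic points generated from an exact power law (K = 30, beta = 0.6),
# used only to show that the fit recovers the parameters.
sizes = [10**3, 10**4, 10**5, 10**6]
distinct = [30.0 * n ** 0.6 for n in sizes]
k, beta = fit_heaps(sizes, distinct)
print(f"K = {k:.3f}, beta = {beta:.3f}")
```

With real measurements the points will not lie exactly on a line, so the fitted K and β are estimates whose quality should be checked against the residuals.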

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
