Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies

Gyeongmin Kim; Junyoung Son; Jinsung Kim; Hyunhee Lee; Heuiseok Lim

doi:10.1109/access.2021.3126882

IEEE Access (Jan 2021)

Affiliations

Gyeongmin Kim: ORCiD; Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Junyoung Son: ORCiD; Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Jinsung Kim: Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Hyunhee Lee: ORCiD; Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Heuiseok Lim: ORCiD; Department of Computer Science and Engineering, Korea University, Seoul, South Korea

DOI: https://doi.org/10.1109/access.2021.3126882
Journal volume & issue: Vol. 9, pp. 151814–151823

Abstract

Tokenization is an essential first step in training a Pre-trained Language Model (PLM), and it alleviates the challenging Out-of-Vocabulary problem in Natural Language Processing. Because the tokenization strategy can change how a model understands language, the composition of input features must take the characteristics of the language into account to achieve good model performance. This study addresses the question “Which tokenization strategy enhances the characteristics of the Korean language for the Named Entity Recognition (NER) task based on a language model?”, focusing on tokenization, which significantly affects the quality of input features. We present two significant challenges that the agglutinative nature of Korean poses for the NER task. We then quantitatively and qualitatively analyze how each tokenization strategy copes with these challenges. By adopting various linguistic segmentation units, such as morpheme, syllable, and subcharacter, we demonstrate the effectiveness of each tokenization strategy and compare the performance of the PLMs built on it. We validate that the most consistent strategy for the challenges of the Korean language is syllable-level tokenization based on SentencePiece.
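To make the segmentation granularities named in the abstract concrete, the following minimal sketch splits Korean text into syllables and into subcharacters (jamo) using standard Unicode Hangul arithmetic. It is an illustration only and not the authors' pipeline: the function names `to_syllables` and `to_subcharacters` are hypothetical, whitespace is simply dropped, and morpheme segmentation and SentencePiece training are not shown because they require an external analyzer or a trained model.

```python
# Illustrative sketch, not the paper's implementation.
# Precomposed Hangul syllables occupy U+AC00..U+D7A3 and are ordered as
# (lead consonant, vowel, optional tail consonant).
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3
NUM_VOWELS = 21
NUM_TAILS = 28  # includes the "no tail consonant" case at index 0


def to_syllables(text):
    """Return one token per non-whitespace character (syllable level)."""
    return [ch for ch in text if not ch.isspace()]


def to_subcharacters(text):
    """Decompose each Hangul syllable into its conjoining jamo."""
    tokens = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            lead = index // (NUM_VOWELS * NUM_TAILS)
            vowel = (index % (NUM_VOWELS * NUM_TAILS)) // NUM_TAILS
            tail = index % NUM_TAILS
            tokens.append(chr(0x1100 + lead))   # leading consonant
            tokens.append(chr(0x1161 + vowel))  # vowel
            if tail:                            # tail index 0 means none
                tokens.append(chr(0x11A7 + tail))
        elif not ch.isspace():
            tokens.append(ch)  # non-Hangul characters pass through
    return tokens


if __name__ == "__main__":
    sample = "한국"
    print(to_syllables(sample))
    print(to_subcharacters(sample))
```

The subcharacter sequence is always at least as long as the syllable sequence, which gives a sense of how the choice of unit changes the length and vocabulary of the model input.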

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
