Building a Korean morphological analyzer using two Korean BERT models

Yong-Seok Choi; Yo-Han Park; Kong Joo Lee

doi:10.7717/peerj-cs.968

PeerJ Computer Science (May 2022)

Affiliations

Yong-Seok Choi: Department of Radio and Information Communications Engineering, Chungnam National University, Daejeon, South Korea
Yo-Han Park: Department of Radio and Information Communications Engineering, Chungnam National University, Daejeon, South Korea
Kong Joo Lee: Department of Radio and Information Communications Engineering, Chungnam National University, Daejeon, South Korea

DOI: https://doi.org/10.7717/peerj-cs.968
Journal volume & issue: Vol. 8
p. e968

Abstract

A morphological analyzer plays an essential role in identifying the functional suffixes of Korean words. The analyzer's input and output differ in both length and surface strings, a mismatch that an encoder-decoder architecture can handle. We adopt the Transformer, an encoder-decoder architecture that uses self-attention rather than recurrent connections, to implement a Korean morphological analyzer. Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular pretrained representation models; it produces an encoding of the input word sequence that takes contextual information into account. We initialize the Transformer encoder and decoder with two types of Korean BERT, one pretrained on a raw corpus and the other pretrained on a morphologically analyzed dataset. Implementing a Transformer-based Korean morphological analyzer therefore becomes a fine-tuning process that needs only a relatively small corpus. A series of experiments showed that initializing parameters from pretrained models can alleviate the chronic lack of training data and reduce the time required for training. In addition, the experiments let us determine the number of encoder and decoder layers that optimizes the performance of a Korean morphological analyzer.
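
The initialization scheme described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The checkpoint layout, the helper names `make_bert_checkpoint` and `init_stack`, the layer counts (12 and 6), and the pairing of the raw-corpus BERT with the encoder and the morpheme-level BERT with the decoder are all assumptions made for illustration. The sketch also assumes that decoder cross-attention, which has no counterpart in BERT, is initialized from scratch.

```python
import random


def make_bert_checkpoint(num_layers, dim, seed):
    """Build a stand-in for a pretrained BERT checkpoint (hypothetical layout)."""
    rng = random.Random(seed)
    ckpt = {"embeddings": [rng.gauss(0, 0.02) for _ in range(dim)]}
    for i in range(num_layers):
        for part in ("self_attention", "feed_forward"):
            ckpt[f"layer.{i}.{part}"] = [rng.gauss(0, 0.02) for _ in range(dim)]
    return ckpt


def init_stack(ckpt, num_layers, dim, cross_attention=False, seed=0):
    """Copy the embeddings and the first `num_layers` layers from `ckpt`.

    Assumption: cross-attention weights do not exist in BERT, so they are
    drawn fresh here instead of being copied.
    """
    rng = random.Random(seed)
    params = {"embeddings": list(ckpt["embeddings"])}
    for i in range(num_layers):
        for part in ("self_attention", "feed_forward"):
            key = f"layer.{i}.{part}"
            params[key] = list(ckpt[key])
        if cross_attention:
            params[f"layer.{i}.cross_attention"] = [
                rng.gauss(0, 0.02) for _ in range(dim)
            ]
    return params


DIM = 8  # toy hidden size, for illustration only

# Two stand-in checkpoints: one "pretrained" on raw text, one on morphemes.
word_bert = make_bert_checkpoint(num_layers=12, dim=DIM, seed=1)
morph_bert = make_bert_checkpoint(num_layers=12, dim=DIM, seed=2)

# Assumed pairing: raw-text BERT -> encoder, morpheme BERT -> decoder.
# The layer counts are arbitrary placeholders, not the paper's settings.
encoder = init_stack(word_bert, num_layers=12, dim=DIM)
decoder = init_stack(morph_bert, num_layers=6, dim=DIM,
                     cross_attention=True, seed=3)
```

Because every copied layer starts from pretrained values, only the cross-attention weights in this sketch begin from random initialization, which is why the subsequent training can be viewed as fine-tuning. Varying the `num_layers` arguments mirrors the kind of layer-count comparison the abstract mentions.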

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
