Scientific Reports (Dec 2023)
Comparison of the accuracy of Japanese synonym identifications using word embeddings in the radiological technology field
Abstract
Abstract The terminology in radiological technology is crucial, encompassing a broad range of principles from radiation to medical imaging, and involving various specialists. This study aimed to evaluate the accuracy of automatic synonym detection considering the characteristics of the Japanese language by Word2vec and fastText in the radiological technology field for the terminology elaboration. We collected around 340 thousand abstracts in Japanese. First, preprocessing of the abstract data was performed. Then, training models were created with Word2vec and fastText with different architectures: continuous bag-of-words (CBOW) and skip-gram, and vector sizes. Baseline synonym sets were curated by two experts, utilizing terminology resources specific to radiological technology. A term in the dataset input into the generated models, and the top-10 synonym candidates which had high cosine similarities were obtained. Subsequently, precision, recall, F1-score, and accuracy for each model were calculated. The fastText model with CBOW at 300 dimensions was most precise in synonym detection, excelling in cases with shared n-grams. Conversely, fastText with skip-gram and Word2vec were favored for synonyms without common n-grams. In radiological technology, where n-grams are prevalent, fastText with CBOW proved advantageous, while in informatics, characterized by abbreviations and transliterations, Word2vec with CBOW was more effective.