Bridging Natural Language Processing and psycholinguistics: computationally grounded semantic similarity datasets for Basque and Spanish

Josu Goikoetxea; Itziar San Martin; Miren Arantzeta

doi:10.3389/flang.2024.1458887

Frontiers in Language Sciences (Nov 2024)

Affiliations

Josu Goikoetxea: HiTZ Research Center, Bilbao School of Engineering (EHU/UPV), Bilbao, Spain
Itziar San Martin: The Bilingual Mind - Micaela Portilla Research Center, Basque Language and Communication (EHU/UPV), Vitoria-Gasteiz, Spain
Miren Arantzeta: The Bilingual Mind - Micaela Portilla Research Center, Linguistics and Basque Studies (EHU/UPV), Vitoria-Gasteiz, Spain

Journal volume & issue: Vol. 3

Abstract

Introduction: Semantic relations are crucial in various cognitive processes, highlighting the need to understand how concepts interact and how such relations are represented in the brain. Psycholinguistics research requires computationally grounded datasets that include word similarity measures controlled for the variables that play a significant role in lexical processing. This work presents a dataset of noun pairs in Basque and European Spanish based on two well-known Natural Language Processing resources: text corpora and knowledge bases.

Methods: The dataset creation consisted of three steps: (1) computing four key psycholinguistic features for each noun, namely concreteness, frequency, and semantic and phonological neighborhood density; (2) pairing nouns across these four variables; (3) assigning three types of word similarity measurements to each noun pair, computed from text-based, WordNet-based, and hybrid embeddings.

Results: The result is a dataset of noun pairs in Basque and Spanish with three types of word similarity measurements, along with four lexical features for each noun in the pair: word frequency, concreteness, and semantic and phonological neighborhood density. The selection of the nouns for each pair was controlled by these variables, which play a significant role in lexical processing. The three similarity measurements differ in the embeddings they are computed from: semantic relatedness from text-based embeddings, pure similarity from WordNet-based embeddings, and both categorical and associative relations from hybrid embeddings.

Discussion: The present work fills an existing gap in Basque and Spanish, namely the lack of datasets that include both word similarity and detailed lexical properties, and thereby provides a more useful resource for psycholinguistics research in those languages.
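The third step of the method, assigning one similarity score per embedding type to each noun pair, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the similarity function, so cosine similarity is assumed here as a common choice for embeddings. The vectors are made-up toy values, and the names `cosine_similarity`, `pair_similarities`, and `toy_embeddings` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings for two Basque nouns ("katu" = cat,
# "txakur" = dog). Real embeddings would have hundreds of dimensions
# and be learned from a corpus, from WordNet, or from both.
toy_embeddings = {
    "text": {"katu": [0.9, 0.1, 0.3], "txakur": [0.8, 0.2, 0.4]},
    "wordnet": {"katu": [0.2, 0.7, 0.1], "txakur": [0.3, 0.6, 0.2]},
    "hybrid": {"katu": [0.5, 0.4, 0.2], "txakur": [0.6, 0.4, 0.3]},
}


def pair_similarities(word_a, word_b, embeddings):
    """Compute one similarity score per embedding space for a noun pair."""
    return {
        space: cosine_similarity(vectors[word_a], vectors[word_b])
        for space, vectors in embeddings.items()
    }


scores = pair_similarities("katu", "txakur", toy_embeddings)
for space, score in scores.items():
    print(f"{space}: {score:.3f}")
```

Keeping the three embedding spaces in one dictionary means each noun pair ends up with one score per space, mirroring the three similarity columns the abstract describes.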

Published in Frontiers in Language Sciences

ISSN: 2813-4605 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Language and Literature
Website: https://www.frontiersin.org/journals/language-sciences