Metadata-Based Clustering and Selection of Metadata Items for Similar Dataset Discovery and Data Combination Tasks

Takeshi Sakumoto; Teruaki Hayashi; Hiroki Sakaji; Hirofumi Nonaka

doi:10.1109/ACCESS.2024.3375750

IEEE Access (Jan 2024)

Affiliations

Takeshi Sakumoto: Department of Engineering, Nagaoka University of Technology, Nagaoka, Niigata, Japan
Teruaki Hayashi: Department of Engineering, The University of Tokyo, Bunkyo, Tokyo, Japan
Hiroki Sakaji: Faculty of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido, Japan
Hirofumi Nonaka: Faculty of Business Administration, Aichi Institute of Technology, Toyota, Aichi, Japan

DOI: https://doi.org/10.1109/ACCESS.2024.3375750
Journal volume & issue: Vol. 12, pp. 40213–40224

Abstract

Data integration, which aims to solve problems and create new services by combining datasets, has attracted considerable attention, and discovering similar datasets that can be combined is a critical step. In similar dataset discovery, the discovery method should be selected to suit each information need, such as the domain. However, prior studies have evaluated discovery methods under differing conditions (domains, test datasets, and evaluation metrics), which prevents appropriate method selection for each situation. Furthermore, although prior studies argue that combining different methods is important, the specific effects of such combinations are not well understood. This study attempts to identify (1) the similarity indicators that should be employed for each domain and (2) the effects of combining different indicators on performance. We evaluated 16 inter-dataset clustering models based on different metadata-based similarity indicators, using unified evaluation metrics and datasets for 15 domains. Our results (1) suggest which similarity indicators should be used for each domain and (2) demonstrate that most combinations of different methods can improve clustering performance.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
