Scaling Identifiers and their Metadata to Gigascale: An Architecture to Tackle the Challenges of Volume and Variety

Jens Klump; Doug Fils; Anusuriya Devaraju; Sarah Ramdeen; Jess Robertson; Lesley Wyborn; Kerstin Lehnert

doi:10.5334/dsj-2023-005

Data Science Journal (Mar 2023)

Affiliations

Jens Klump: Mineral Resources, CSIRO, Perth, WA
Doug Fils: Ocean Leadership, Washington, DC
Anusuriya Devaraju: Mineral Resources, CSIRO, Perth, WA; TERN, The University of Queensland, Brisbane, QLD
Sarah Ramdeen: Lamont-Doherty Earth Observatory, Columbia University of New York, Palisades, NY
Jess Robertson: Ministry of Business, Innovation and Employment, Wellington
Lesley Wyborn: Australian Research Data Commons, Canberra, ACT
Kerstin Lehnert: Lamont-Doherty Earth Observatory, Columbia University of New York, Palisades, NY

DOI: https://doi.org/10.5334/dsj-2023-005
Journal volume & issue: Vol. 22, no. 1

Abstract

Persistent identifiers are applied to an ever-increasing variety of research objects, including software, samples, models, people, instruments, grants, and projects, and there is a growing need to apply identifiers at ever finer granularity. Unfortunately, the systems developed over two decades ago to manage identifiers and the metadata describing the identified objects no longer scale. Communities working with physical samples have grappled with the challenges of the increasing volume and variety of identified objects for many years. To address this dual challenge, the IGSN 2040 project explored how metadata and catalogues for physical samples could be shared at the scale of billions of samples across an ever-growing variety of users and disciplines. In this paper, we focus on how to scale identifiers and their describing metadata to billions of objects and on who the actors involved in this system are. Our analysis of these requirements resulted in the definition of a minimum viable product and the design of an architecture that not only addresses the challenges of increasing volume and variety but, more importantly, is easy to implement because it reuses commonly used Web components. Our solution is based on a Web architectural model that utilises Schema.org, JSON-LD, and sitemaps. Applying these architectural patterns, which are in common use on the Web, allows us not only to handle increasing variety but also to enable better compliance with the FAIR Guiding Principles.
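
The abstract names Schema.org, JSON-LD, and sitemaps as the building blocks of the architecture but does not show how they fit together. The following Python sketch is one possible illustration, not the authors' implementation: it describes a sample as a Schema.org JSON-LD document and lists sample landing pages in a sitemap that a harvester could crawl. The function names, the identifier value, the example.org URLs, and the choice of the generic Schema.org type "Thing" with a minimal set of properties are all assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sample_jsonld(identifier, name, landing_page):
    """Describe one sample as a Schema.org JSON-LD document.

    The field selection is a hypothetical minimum, not a prescribed profile.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Thing",
        "@id": landing_page,
        "identifier": identifier,
        "name": name,
        "url": landing_page,
    }


def build_sitemap(entries):
    """Build a sitemap XML string from (url, last-modified date) pairs."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical sample record and landing pages, for illustration only.
doc = sample_jsonld(
    identifier="10.00000/EXAMPLE0001",
    name="Example rock sample",
    landing_page="https://example.org/samples/EXAMPLE0001",
)
sitemap_xml = build_sitemap([
    ("https://example.org/samples/EXAMPLE0001", "2023-03-01"),
    ("https://example.org/samples/EXAMPLE0002", "2023-03-02"),
])

print(json.dumps(doc, indent=2))
print(sitemap_xml)
```

In a setup like this, each landing page would embed its JSON-LD document, and the sitemap would tell a harvester which pages exist and when they last changed, so that only new or updated records need to be fetched again.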

Published in Data Science Journal

ISSN: 1683-1470 (Online)
Publisher: Ubiquity Press
Country of publisher: United Kingdom
LCC subjects: Science: Science (General)
Website: http://datascience.codata.org/
