Bio-Strings: A Relational Database Data-Type for Dealing with Large Biosequences

Sergio Lifschitz; Edward H. Haeusler; Marcos Catanho; Antonio B. de Miranda; Elvismary Molina de Armas; Alexandre Heine; Sergio G. M. P. Moreira; Cristian Tristão

doi:10.3390/biotech11030031

BioTech (Jul 2022)

Affiliations

Sergio Lifschitz: Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro 22451-900, Brazil
Edward H. Haeusler: Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro 22451-900, Brazil
Marcos Catanho: Lab. Genética Molecular de Microrganismos, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro 21040-900, Brazil
Antonio B. de Miranda: Lab. Genética Molecular de Microrganismos, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro 21040-900, Brazil
Elvismary Molina de Armas: Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro 22451-900, Brazil
Alexandre Heine: Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro 22451-900, Brazil
Sergio G. M. P. Moreira: Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro 22451-900, Brazil
Cristian Tristão: Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro 22451-900, Brazil

DOI: https://doi.org/10.3390/biotech11030031
Journal volume & issue: Vol. 11, no. 3, p. 31

Abstract

DNA sequencers output large sets of very long biological data strings, which should be persisted in databases rather than in plain text files. Many data models and database management systems (DBMS) can address both the storage and the efficiency issues of genomic datasets. In particular, there is a need to handle strings of variable size while preserving their biological meaning. Relational database management systems (RDBMS) provide several data types that could be further explored in the genomics context. Moreover, they enforce integrity and consistency and enable good abstractions for more conventional data. We propose the relational text data type to represent and manipulate biological sequences and their derivatives. We present a logical schema for representing the core biological information, which may be inferred from a given biological conceptual data schema and the corresponding function manipulations. We implement these stored functions in an actual RDBMS and evaluate them for both efficacy and efficiency. We show that it is possible to enforce basic and complex requirements for the genomic domain. We claim that the well-established relational text data type in RDBMS may appropriately handle the representation and persistence of biological sequences. We base our approach on the idea of domain-specific abstract data types that can store data with semantically defined functions while hiding those details from non-technical end-users.
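The idea of a text column paired with domain-specific functions can be illustrated with a minimal sketch. It is not the schema or the function set from the paper: the table name `sequence`, the column names, and the functions `reverse_complement` and `gc_content` are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for whatever RDBMS the authors used. The `CHECK` constraint shows how the database itself could reject strings that are not valid DNA.

```python
import sqlite3

# Complement table for unambiguous DNA bases; N maps to itself.
# (Assumption: only A, C, G, T, N are allowed; other IUPAC codes are out of scope.)
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(seq):
    """Return the fraction of G and C bases (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def open_database():
    """Create an in-memory database with a hypothetical biosequence table."""
    conn = sqlite3.connect(":memory:")
    # Register the Python functions so that SQL queries can call them by name.
    conn.create_function("reverse_complement", 1, reverse_complement)
    conn.create_function("gc_content", 1, gc_content)
    # The sequence is stored in a plain TEXT column; the CHECK constraint
    # enforces a basic domain rule (non-empty, DNA alphabet only).
    conn.execute(
        """
        CREATE TABLE sequence (
            id       INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            residues TEXT NOT NULL
                CHECK (residues <> '' AND residues NOT GLOB '*[^ACGTN]*')
        )
        """
    )
    return conn


if __name__ == "__main__":
    conn = open_database()
    conn.execute(
        "INSERT INTO sequence (name, residues) VALUES (?, ?)",
        ("demo", "ATGCGC"),
    )
    query = (
        "SELECT name, reverse_complement(residues), gc_content(residues) "
        "FROM sequence"
    )
    for row in conn.execute(query):
        print(row)
```

An end-user would only write the `SELECT` statement; the string manipulation stays hidden behind the function names, which is the abstract-data-type effect the abstract describes. In a server RDBMS the same functions would more likely be written as stored functions in the system's own procedural language.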

Published in BioTech

ISSN: 2673-6284 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology: Biotechnology
Website: https://www.mdpi.com/journal/biotech
