An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

Roman Zoun; Kay Schallert; David Broneske; Ivayla Trifonova; Xiao Chen; Robert Heyer; Dirk Benndorf; Gunter Saake

doi:10.3390/a14020059

Algorithms (Feb 2021)

Affiliations

Roman Zoun: Line of Business Life Science, Adesso Schweiz AG, 8048 Zürich, Switzerland
Kay Schallert: Bioprocess Engineering, University of Magdeburg, 39106 Magdeburg, Germany
David Broneske: Databases and Software Engineering, University of Magdeburg, 39106 Magdeburg, Germany
Ivayla Trifonova: Line of Business Life Science, Adesso Schweiz AG, 8048 Zürich, Switzerland
Xiao Chen: Databases and Software Engineering, University of Magdeburg, 39106 Magdeburg, Germany
Robert Heyer: Bioprocess Engineering, University of Magdeburg, 39106 Magdeburg, Germany
Dirk Benndorf: Bioprocess Engineering, University of Magdeburg, 39106 Magdeburg, Germany
Gunter Saake: Databases and Software Engineering, University of Magdeburg, 39106 Magdeburg, Germany

DOI: https://doi.org/10.3390/a14020059
Journal volume & issue: Vol. 14, no. 2, p. 59

Abstract

Mass spectrometers enable the identification of proteins in biological samples, which leads to biomarkers for biological process parameters and diseases. However, the bioinformatic evaluation of mass spectrometer data requires a standardized workflow and a system that stores the protein sequences. Due to their standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems that uses a column-oriented index to store sequence data. To achieve high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, an HDDmap, a database engine, and an extended radix tree, and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.
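To make the transformation step concrete, the following is a minimal sketch of grouping FASTA records by sequence with a plain radix tree (compressed trie), so that each distinct sequence becomes one index key with all of its protein identifiers attached. It is an illustration under stated assumptions, not the authors' extended radix tree: the abstract does not describe the extension or the exact target schema, and the names `parse_fasta`, `RadixTree`, and `fasta_to_index` are hypothetical. Taking the first whitespace-delimited token of the header as the identifier is likewise an assumption.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


class RadixNode:
    def __init__(self) -> None:
        # Maps the first character of an edge label to (edge label, child node).
        self.children: Dict[str, Tuple[str, "RadixNode"]] = {}
        # Identifiers of all records whose sequence ends at this node.
        self.ids: List[str] = []


class RadixTree:
    """Plain radix tree; shared sequence prefixes are stored only once."""

    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, key: str, value: str) -> None:
        node = self.root
        while key:
            entry = node.children.get(key[0])
            if entry is None:
                # No edge starts with this character: add a new leaf.
                leaf = RadixNode()
                node.children[key[0]] = (key, leaf)
                node = leaf
                break
            label, child = entry
            # Length of the common prefix of the edge label and the key.
            n = 0
            while n < len(label) and n < len(key) and label[n] == key[n]:
                n += 1
            if n < len(label):
                # The key diverges inside the edge: split the edge at n.
                mid = RadixNode()
                mid.children[label[n]] = (label[n:], child)
                node.children[key[0]] = (label[:n], mid)
                child = mid
            node = child
            key = key[n:]
        node.ids.append(value)

    def items(self) -> Iterator[Tuple[str, List[str]]]:
        """Yield (sequence, identifiers) for every stored sequence."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.ids:
                yield prefix, node.ids
            for label, child in node.children.values():
                stack.append((prefix + label, child))


def fasta_to_index(text: str) -> Dict[str, List[str]]:
    """Group record identifiers by sequence (assumed index layout)."""
    tree = RadixTree()
    for header, sequence in parse_fasta(text):
        # Assumption: the identifier is the first space-delimited header token.
        tree.insert(sequence, header.split(" ", 1)[0])
    return dict(tree.items())
```

A call such as `fasta_to_index(open("proteins.fasta").read())` (with a hypothetical file name) would return one entry per distinct sequence. Compared with a flat in-memory map keyed by the full sequence string, the tree stores common prefixes once, which is the general property that makes radix trees attractive for large sets of similar sequences.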

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
