PLoS Computational Biology (Mar 2023)

Towards self-describing and FAIR bulk formats for biomedical data.

  • Michael Lukowski,
  • Andrew Prokhorenkov,
  • Robert L Grossman

DOI
https://doi.org/10.1371/journal.pcbi.1010944
Journal volume & issue
Vol. 19, no. 3
p. e1010944

Abstract

We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. PFB is based on Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third-party controlled vocabularies. In general, each data element in the data dictionary is associated with a third-party controlled vocabulary, which makes it easier for applications to harmonize two or more PFB files. We also introduce an open-source software development kit (SDK) called PyPFB for creating, exploring, and modifying PFB files. We describe experimental studies showing the performance improvements obtained when importing and exporting bulk biomedical data in the PFB format rather than in JSON or SQL formats.
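
The idea of a self-describing file whose data elements point to controlled vocabularies can be illustrated with a minimal Python sketch. It uses only the standard library and does not reproduce the actual PFB layout or the PyPFB API: the attribute names `ontology_reference` and `ontology_code`, the helper functions `make_file` and `shared_terms`, and the vocabulary name and codes are all hypothetical. The sketch assumes an Avro-style record schema, where extra attributes on a field can serve as metadata, and shows how two files might be aligned by matching fields that reference the same vocabulary term.

```python
import json


def make_file(node_name, fields, records):
    """Bundle an Avro-style schema (the data dictionary) with its records.

    `fields` maps a field name to a (vocabulary, code) pair. The attribute
    names used for that pair are assumptions made for this illustration.
    """
    schema = {
        "type": "record",
        "name": node_name,
        "fields": [
            {
                "name": name,
                "type": ["null", "string"],
                "default": None,
                "ontology_reference": vocab,  # hypothetical attribute
                "ontology_code": code,        # hypothetical attribute
            }
            for name, (vocab, code) in fields.items()
        ],
    }
    return {"schema": schema, "records": records}


def shared_terms(file_a, file_b):
    """Map field names in file_a to field names in file_b that point at
    the same controlled-vocabulary term."""
    def index(f):
        return {
            (fld["ontology_reference"], fld["ontology_code"]): fld["name"]
            for fld in f["schema"]["fields"]
        }

    a, b = index(file_a), index(file_b)
    return {a[key]: b[key] for key in a.keys() & b.keys()}


# Two sources that name the same concept differently (illustrative data).
site_a = make_file(
    "subject",
    {"sex": ("EXAMPLE_VOCAB", "C0001"), "age_group": ("EXAMPLE_VOCAB", "C0002")},
    [{"sex": "female", "age_group": "adult"}],
)
site_b = make_file(
    "participant",
    {"gender": ("EXAMPLE_VOCAB", "C0001")},
    [{"gender": "F"}],
)

mapping = shared_terms(site_a, site_b)
print(json.dumps(mapping))  # prints {"sex": "gender"}
```

Because the schema travels with the records, an application can discover that `sex` and `gender` denote the same concept without any out-of-band documentation. A real PFB file would store this information in Avro's binary container format rather than in plain Python dictionaries.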