PLoS Computational Biology (Mar 2023)
Towards self-describing and FAIR bulk formats for biomedical data.
Abstract
We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. The Portable Format for Biomedical data is based upon Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third party controlled vocabularies. In general, each data element in the data dictionary is associated with a third party controlled vocabulary to make it easier for applications to harmonize two or more PFB files. We also introduce an open source software development kit (SDK) called PyPFB for creating, exploring and modifying PFB files. We describe experimental studies showing the performance improvements when importing and exporting bulk biomedical data in the PFB format versus using JSON and SQL formats.