Journal of Big Data (Dec 2021)

Performance-efficient distributed transfer and transformation of big spatial histopathology datasets in the cloud

  • Esma Yildirim

DOI
https://doi.org/10.1186/s40537-021-00546-3
Journal volume & issue
Vol. 8, no. 1
pp. 1 – 26

Abstract

Read online

Abstract Whole Slide Image (WSI) datasets are giga-pixel resolution, unstructured histopathology datasets that consist of extremely big files (each can be as large as multiple GBs in compressed format). These datasets have utility in a wide range of diagnostic and investigative pathology applications. However, the datasets present unique challenges: The size of the files, propriety data formats, and lack of efficient parallel data access libraries limit the scalability of these applications. Commercial clouds provide dynamic, cost-effective, scalable infrastructure to process these datasets, however, we lack the tools and algorithms that will transfer/transform them onto the cloud seamlessly, providing faster speeds and scalable formats. In this study, we present novel algorithms that transfer these datasets onto the cloud while at the same time transforming them into symmetric scalable formats. Our algorithms use intelligent file size distribution, and pipelining transfer and transformation tasks without introducing extra overhead to the underlying system. The algorithms, tested in the Amazon Web Services (AWS) cloud, outperform the widely used transfer tools and algorithms, and also outperform our previous work. The data access to the transformed datasets provides better performance compared to the related work. The transformed symmetric datasets are fed into three different analytics applications: a distributed implementation of a content-based image retrieval (CBIR) application for prostate carcinoma datasets, a deep convolutional neural network application for classification of breast cancer datasets, and to show that the algorithms can work with any spatial dataset, a Canny Edge Detection application on satellite image datasets. Although different in nature, all of the applications can easily work with our new symmetric data format and performance results show near-linear speed-ups as the number of processors increases.

Keywords