IEEE Access (Jan 2023)

Privacy-Preserving Machine Learning on Apache Spark

  • Claudia V. Brito,
  • Pedro G. Ferreira,
  • Bernardo L. Portela,
  • Rui C. Oliveira,
  • João T. Paulo

DOI
https://doi.org/10.1109/ACCESS.2023.3332222
Journal volume & issue
Vol. 11, pp. 127907–127930

Abstract

The adoption of third-party machine learning (ML) cloud services is highly dependent on the security guarantees and the performance penalty they incur on workloads for model training and inference. This paper explores security/performance trade-offs for the distributed Apache Spark framework and its ML library. Concretely, we build upon a key insight: in specific deployment settings, one can reveal carefully chosen non-sensitive operations (e.g. statistical calculations). This allows us to considerably improve the performance of privacy-preserving solutions without exposing the protocol to pervasive ML attacks. In more detail, we propose Soteria, a system for distributed privacy-preserving ML that leverages Trusted Execution Environments (e.g. Intel SGX) to run computations over sensitive information in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41% when compared to previous related work. Our protocol is accompanied by a security proof and a discussion regarding resilience against a wide spectrum of ML attacks.
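The hybrid scheme described in the abstract can be illustrated with a minimal sketch: operations tagged as sensitive are dispatched to a trusted enclave, while carefully chosen non-sensitive ones (e.g. statistical calculations) run in plain, untrusted executors. All names below (`Operation`, `run_pipeline`, the enclave dispatch function) are illustrative assumptions, not Soteria's actual API; a real deployment would cross the SGX enclave boundary instead of calling a local function.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Operation:
    """One stage of an ML pipeline, tagged with its sensitivity."""
    name: str
    fn: Callable[[list], list]
    sensitive: bool  # True -> must execute inside a trusted enclave

def run_in_enclave(fn: Callable[[list], list], data: list) -> list:
    # Placeholder for enclave dispatch; in a real SGX deployment this
    # would perform an enclave transition and operate on decrypted
    # data only inside protected memory.
    return fn(data)

def run_plain(fn: Callable[[list], list], data: list) -> list:
    # Non-sensitive work stays in the untrusted executor, avoiding
    # enclave transition and protected-memory overheads.
    return fn(data)

def run_pipeline(ops: List[Operation], data: list) -> list:
    """Route each stage inside or outside the enclave by its tag."""
    for op in ops:
        runner = run_in_enclave if op.sensitive else run_plain
        data = runner(op.fn, data)
    return data

# Example: normalization touches raw values (sensitive), while a
# simple count statistic is revealed and computed outside.
pipeline = [
    Operation("normalize", lambda xs: [x / max(xs) for x in xs], sensitive=True),
    Operation("append_count", lambda xs: xs + [float(len(xs))], sensitive=False),
]
print(run_pipeline(pipeline, [1.0, 2.0, 4.0]))
```

The performance gain reported in the paper comes precisely from routing the non-sensitive stages outside the enclave, since enclave transitions and protected-memory paging are costly.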

Keywords