Closed-Form Models of Accuracy Loss due to Subsampling in SVD Collaborative Filtering

Samin Poudel; Marwan Bikdash

doi:10.26599/BDMA.2022.9020024

Big Data Mining and Analytics (Mar 2023)

Affiliations

Samin Poudel: Department of Computational Data Science and Engineering, North Carolina A & T State University, Greensboro, NC 27401, USA
Marwan Bikdash: Department of Computational Data Science and Engineering, North Carolina A & T State University, Greensboro, NC 27401, USA

DOI: https://doi.org/10.26599/BDMA.2022.9020024
Journal volume & issue: Vol. 6, no. 1
pp. 72 – 84

Abstract

We postulate and analyze a nonlinear subsampling accuracy loss (SSAL) model based on the root mean square error (RMSE) and two SSAL models based on the mean square error (MSE), suggested by extensive preliminary simulations. The SSAL models predict accuracy loss in terms of subsampling parameters such as the fraction of users dropped (FUD) and the fraction of items dropped (FID). We seek to investigate whether the models depend on the characteristics of the dataset in a constant way across datasets when using the singular value decomposition (SVD) collaborative filtering (CF) algorithm. The dataset characteristics considered include various densities of the rating matrix and the numbers of users and items. Extensive simulations and rigorous regression analysis led to empirical symmetrical SSAL models in terms of FID and FUD whose coefficients depend only on the data characteristics. The SSAL models turned out to be multi-linear in terms of odds ratios of dropping a user (or an item) vs. not dropping it. Moreover, one MSE deterioration model turned out to be linear in the FID and FUD odds, with a zero coefficient on their interaction term. Most importantly, the models are constant in the sense that they are written in closed form using the considered data characteristics (densities and numbers of users and items). The models are validated through extensive simulations based on 850 synthetically generated primary (pre-subsampling) matrices derived from the 25M MovieLens dataset. Nearly 460 000 subsampled rating matrices were then simulated and subjected to the SVD CF algorithm. Further validation was conducted using the 1M MovieLens and the Yahoo! Music Rating datasets. The models were constant and significant across all three datasets.
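
As an illustration of the model form described in the abstract, the following is a minimal Python sketch and not the authors' fitted model. It assumes that the "odds" of dropping a user or item is f / (1 - f) for a dropped fraction f, and that the loss has no intercept, so that no subsampling gives zero loss. The function names and the coefficients a, b, and c are hypothetical; in the paper the coefficients are closed-form functions of the dataset characteristics, which the abstract does not state.

```python
def drop_odds(fraction_dropped: float) -> float:
    """Odds of dropping vs. keeping a user or item: f / (1 - f).

    Assumed reading of the abstract's "odds ratios"; valid for 0 <= f < 1.
    """
    if not 0.0 <= fraction_dropped < 1.0:
        raise ValueError("fraction_dropped must be in [0, 1)")
    return fraction_dropped / (1.0 - fraction_dropped)


def multilinear_loss(fud: float, fid: float, a: float, b: float, c: float) -> float:
    """Hypothetical multi-linear accuracy-loss form in the FUD and FID odds.

    loss = a * O_u + b * O_i + c * O_u * O_i
    The coefficients a, b, c are placeholders, not values from the paper.
    Setting c = 0 gives the purely linear variant with no interaction term.
    """
    odds_users = drop_odds(fud)
    odds_items = drop_odds(fid)
    return a * odds_users + b * odds_items + c * odds_users * odds_items


if __name__ == "__main__":
    # Dropping half the users and half the items gives odds of 1 for each.
    print(multilinear_loss(0.5, 0.5, a=1.0, b=2.0, c=3.0))
    # Linear variant: the interaction coefficient is zero.
    print(multilinear_loss(0.5, 0.5, a=1.0, b=2.0, c=0.0))
```

Writing the model in the odds rather than in the raw fractions makes the predicted loss grow without bound as the dropped fraction approaches 1, which matches the intuition that removing nearly all users or items leaves too little data to train on.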

Published in Big Data Mining and Analytics

ISSN: 2096-0654 (Print)
Publisher: Tsinghua University Press
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=8254253
