Data Science and Engineering (Sep 2019)

# Selectivity Estimation on Set Containment Search

• Yang Yang,
• Wenjie Zhang,
• Ying Zhang,
• Xuemin Lin,
• Liping Wang

DOI
https://doi.org/10.1007/s41019-019-00104-1
Journal volume & issue
Vol. 4, no. 3
pp. 254 – 268

## Abstract

Abstract In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset $${\mathcal {S}}$$ S , we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over $${\mathcal {S}}$$ S . We first extend existing distinct value estimating techniques to solve this problem and develop an inverted list and G-KMV sketch-based approach IL-GKMV. We analyze that the performance of IL-GKMV degrades with the increase in vocabulary size. Motivated by limitations of existing techniques and the inherent challenges of the problem, we resort to developing effective and efficient sampling approaches and propose an ordered trie structure-based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate compared with simple random sampling method and IL-GKMV. To further enhance the performance, a divide-and-conquer-based sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to explore the pruning opportunities. Meanwhile, we consider weighted set containment selectivity estimation and devise stratified random sampling approach named StrRS. We theoretically analyze the proposed techniques regarding various accuracy estimators. Our comprehensive experiments on nine real datasets verify the effectiveness and efficiency of our proposed techniques.