Applied Sciences (Jul 2022)

Generalized Sketches for Streaming Sets

  • Wenhua Guo,
  • Kaixuan Ye,
  • Yiyan Qi,
  • Peng Jia,
  • Pinghui Wang

DOI
https://doi.org/10.3390/app12157362
Journal volume & issue
Vol. 12, no. 15
p. 7362

Abstract

Many real-world datasets arrive as a stream of user–interest pairs, where each pair represents a link from a user (e.g., a network host) to an interest (e.g., a website) and may appear more than once in the stream. Applications such as network anomaly detection widely monitor and mine statistics of users’ interest sets on high-speed streams, including cardinality, intersection cardinality, and Jaccard similarity. Although estimating each of these three statistics individually is well studied, there is no effective method that provides a one-shot solution for estimating all three. To address this challenge, we develop a novel framework, SimCar. SimCar builds an order-hashing (OH) sketch online for each user occurring in the data stream of interest. At any time of interest, one can query the cardinalities, intersection cardinalities, and Jaccard similarities of users’ interest sets. Specifically, using OH sketches, we develop maximum likelihood estimation (MLE) methods to estimate the cardinalities and intersection cardinalities of users’ interest sets. In addition, we use OH sketches to estimate Jaccard similarities of users’ interest sets and build locality-sensitive hashing tables to search for users with similar interests in sub-linear time. We evaluate the performance of our methods on real-world datasets. The experimental results demonstrate the superiority of our methods.
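
To make the setting concrete, the following is a minimal sketch of per-user sketching over a stream of user–interest pairs. It uses plain MinHash as a stand-in, not the OH sketch or the MLE estimators of the paper, whose construction the abstract does not specify. The names K, new_sketch, update, build_sketches, and jaccard_estimate are hypothetical and chosen for illustration only.

```python
import hashlib
from collections import defaultdict

K = 256          # registers per sketch (hypothetical parameter, not from the paper)
EMPTY = 2 ** 64  # sentinel larger than any 64-bit hash value


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def new_sketch() -> list:
    """An empty MinHash sketch: every register starts at the sentinel."""
    return [EMPTY] * K


def update(sketch: list, interest: str) -> None:
    """Fold one interest into a sketch; repeated interests leave it unchanged."""
    for i in range(K):
        h = _hash(i, interest)
        if h < sketch[i]:
            sketch[i] = h


def build_sketches(stream) -> dict:
    """Maintain one sketch per user while reading (user, interest) pairs."""
    sketches = defaultdict(new_sketch)
    for user, interest in stream:
        update(sketches[user], interest)
    return sketches


def jaccard_estimate(a: list, b: list) -> float:
    """Fraction of non-empty registers on which two sketches agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != EMPTY)
    return matches / K
```

Because each register keeps only a minimum, duplicate pairs in the stream do not change a sketch, and any two users' sketches can be compared at query time without revisiting the stream. Estimating cardinalities and intersection cardinalities from the same sketch, as SimCar does, requires the estimators developed in the paper and is not covered by this stand-in.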

Keywords