IEEE Access (Jan 2020)

A Hive-Based Retrieval Optimization Scheme for Long-Term Storage of Massive Call Detail Records

  • Xi Peng,
  • Liang Liu,
  • Lei Zhang

DOI
https://doi.org/10.1109/ACCESS.2019.2961692
Journal volume & issue
Vol. 8
pp. 431 – 444

Abstract


With the dramatic rise in mobile internet users and the administrative requirements for long-term data retention, telecom providers face increasingly challenging storage and retrieval problems for call detail records (CDRs). Existing storage systems meet only the requirements of online query and offline analysis of CDRs. However, to the best of our knowledge, few studies have focused on optimizing CDR retrieval under long-term storage. To improve retrieval speed while ensuring a high compression ratio, in this paper we propose a novel hash storage scheme, termed dual-column bucketing (DCB), built on the Hive platform and making use of its bucketing feature. Compared with the conventional scheme, the proposed DCB scheme improves performance for both CDR compression and CDR queries. In addition, similar storage scenarios, such as the storage of SMS, email, and extended detail records (XDRs), fall within the optimization scope of DCB. Experiments on real-world CDRs show that, compared with the conventional scheme, the proposed DCB scheme reduces storage space by approximately 40%, cuts the amount of data read from disk to 2%, and improves the retrieval speed of known-phone-number queries by up to seven times.
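The abstract does not spell out how DCB works, so the following is only a minimal sketch of the general idea behind hash bucketing, not the authors' scheme. It shows why a query on a known phone number needs to read just one bucket instead of the whole table. Everything in it is an assumption made for illustration: the record layout (caller, callee, duration), the bucket count, the function names, and the use of CRC32 as a stand-in for Hive's own bucketing hash. It also buckets on a single column only.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # assumed bucket count, chosen only for illustration


def bucket_of(phone: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a phone number to a bucket index with a deterministic hash.

    CRC32 is a stand-in here; Hive applies its own hash function.
    """
    return zlib.crc32(phone.encode("utf-8")) % num_buckets


def build_buckets(records, num_buckets: int = NUM_BUCKETS):
    """Group (caller, callee, duration) records by the caller's bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_of(record[0], num_buckets)].append(record)
    return buckets


def query_by_caller(buckets, phone: str, num_buckets: int = NUM_BUCKETS):
    """Scan only the one bucket that can contain the given caller."""
    candidates = buckets.get(bucket_of(phone, num_buckets), [])
    return [record for record in candidates if record[0] == phone]
```

Because every record of a given caller hashes to the same bucket, `query_by_caller` touches roughly one bucket out of `NUM_BUCKETS` rather than all of them, which is the kind of read reduction the abstract reports. Grouping similar records together may also help compression, though this sketch does not model that.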

Keywords