Shenzhen Daxue xuebao. Ligong ban (Sep 2022)

An algorithm for matching original experimental records based on improved CDC

  • CAI Yina,
  • CHEN Xin,
  • QIN Zhiwu,
  • WANG Xin,
  • BAO Xianyu,
  • PENG Jinxue,
  • LIN Yongqi,
  • LI Junlin

DOI
https://doi.org/10.3724/SP.J.1249.2022.05509
Journal volume & issue
Vol. 39, no. 5
pp. 509 – 514

Abstract

Generating laboratory test reports is currently slow and occasionally error-prone. To address this, we present a general technique, based on a fence factor, for automatically capturing original experimental records. First, the files read on a given day are filtered precisely by computing a hash value over each whole file. Then an improved content-defined chunking (CDC) algorithm is used for chunking. The improvements to CDC are twofold: the unit of the sliding window is set to the spacing between two lines, and the byte size within the sliding window is restricted to a set range. Once the text has been chunked, a string matching algorithm based on pattern strings performs the matching. It builds the mapping between pattern strings and data blocks in a data block index table, then uses that table to quickly match a pattern string Pn to its corresponding data block. The algorithm was tested on original experimental record files from a customs laboratory, where it occupied the least memory and achieved the highest chunking throughput.
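The three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of SHA-256 and CRC-32, the boundary mask, and the default size limits are all assumptions chosen for illustration, and the abstract does not specify how the fence factor enters the computation, so it is omitted here.

```python
import hashlib
import zlib
from collections import defaultdict


def filter_new_files(files, seen_digests):
    """Whole-file hash filter (assumed behaviour): keep a file only if the
    digest of its full content has not been recorded before."""
    fresh = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        if digest not in seen_digests:
            seen_digests.add(digest)
            fresh[name] = data
    return fresh


def chunk_by_lines(text, min_size=64, max_size=512, mask=0x3):
    """Line-granular CDC sketch: the window moves one line at a time, and a
    chunk may only end once it holds at least min_size bytes. A boundary is
    declared when the CRC-32 of the current line matches the mask, or when
    the chunk has reached max_size (a soft limit, checked after each line)."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        current.append(line)
        size += len(line.encode("utf-8"))
        if size < min_size:
            continue  # chunk still below the lower size bound
        fingerprint = zlib.crc32(line.encode("utf-8"))
        if (fingerprint & mask) == 0 or size >= max_size:
            chunks.append("".join(current))
            current, size = [], 0
    if current:
        chunks.append("".join(current))  # trailing lines form the last chunk
    return chunks


def build_index(chunks, patterns):
    """Data block index table sketch: map each pattern string to the
    positions of the chunks that contain it."""
    index = defaultdict(list)
    for position, chunk in enumerate(chunks):
        for pattern in patterns:
            if pattern in chunk:
                index[pattern].append(position)
    return dict(index)


if __name__ == "__main__":
    record = "Sample: A1\nMoisture: 12.5\nAsh: 0.8\n" * 5  # made-up record text
    blocks = chunk_by_lines(record, min_size=20, max_size=60)
    table = build_index(blocks, ["Moisture", "Ash"])
    for pattern, positions in table.items():
        print(pattern, "->", positions)
```

Because chunk boundaries always fall on line ends in this sketch, a pattern that sits within one line of a record is never split across two data blocks, which is what lets the index table resolve a pattern string to a block with a single lookup.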

Keywords