Dianzi Jishu Yingyong (May 2019)
Research on the parallelization of the trigram N-gram algorithm based on MapReduce
Abstract
Training on large-scale corpora is fundamental groundwork for the automatic detection of Chinese text with the trigram N-gram algorithm. With new media platforms generating up to one million items to be processed per day, constructing a trigram N-gram language model lexicon on a single node runs into a computational bottleneck. Building on an in-depth study of the trigram N-gram algorithm, this paper proposes parallelizing it with the MapReduce programming model. Under MapReduce, the computing tasks are distributed evenly across m nodes. In the Map function, the trigram N-gram algorithm counts, over the local input split, how many times each word occurs together with its two preceding words; the Reduce function then merges the occurrence counts emitted by the Map tasks into global statistics. Experimental results show that the MapReduce-based parallel trigram N-gram algorithm running on a Hadoop cluster offers good performance and scalability: for a training corpus of 12 billion words per day, the speedup of training in the cluster environment is close to linear.
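To make the Map/Reduce split described above concrete, the following is a minimal Hadoop sketch of a trigram counter. It assumes the corpus has already been word-segmented into whitespace-delimited tokens; the class names and job configuration are illustrative assumptions, not the authors' implementation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrigramCount {

    // Map: for every trigram in the local input split, emit ("w1 w2 w3", 1),
    // i.e. count locally how often each word follows its two preceding words.
    public static class TrigramMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text trigram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes pre-segmented text: tokens separated by whitespace.
            String[] tokens = value.toString().trim().split("\\s+");
            for (int i = 0; i + 2 < tokens.length; i++) {
                trigram.set(tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2]);
                context.write(trigram, ONE);
            }
        }
    }

    // Reduce: merge the partial counts from all map tasks into a global
    // occurrence count for each trigram.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "trigram count");
        job.setJarByClass(TrigramCount.class);
        job.setMapperClass(TrigramMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation cuts shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer as a combiner is what keeps the design scalable: most duplicate trigram counts are merged on the map side before the shuffle, so the global merge in the Reduce phase handles far fewer records.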
Keywords