Dianzi Jishu Yingyong (May 2019)
Research on the parallelization of the trigram N-gram algorithm based on MapReduce
Abstract
Training on large-scale corpora is fundamental groundwork for the automatic detection of Chinese text with the trigram N-gram algorithm. With new media platforms generating up to one million items to be processed per day, constructing a trigram N-gram language model lexicon on a single node runs into a computational bottleneck. Building on an in-depth study of the trigram N-gram algorithm, this paper proposes parallelizing it with the MapReduce programming model. Under MapReduce, the computing tasks are distributed evenly across m nodes. In the Map function, the trigram N-gram algorithm counts, over the local input split, how many times each word occurs together with its two preceding words; the Reduce function then merges the occurrence counts emitted by the Map tasks into global statistics. Experimental results show that the MapReduce-based parallel trigram N-gram algorithm running on a Hadoop cluster offers good performance and scalability: for a training corpus of 12 billion words per day, the speedup of training in the cluster environment is close to linear.
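To make the Map/Reduce split described above concrete, the following is a minimal Hadoop sketch of a trigram counter. It assumes the corpus has already been word-segmented into whitespace-delimited tokens; the class names and job configuration are illustrative assumptions, not the authors' implementation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrigramCount {

    // Map: for every trigram in the local input split, emit ("w1 w2 w3", 1),
    // i.e. count locally how often each word follows its two preceding words.
    public static class TrigramMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text trigram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes pre-segmented text: tokens separated by whitespace.
            String[] tokens = value.toString().trim().split("\\s+");
            for (int i = 0; i + 2 < tokens.length; i++) {
                trigram.set(tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2]);
                context.write(trigram, ONE);
            }
        }
    }

    // Reduce: merge the partial counts from all map tasks into a global
    // occurrence count for each trigram.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "trigram count");
        job.setJarByClass(TrigramCount.class);
        job.setMapperClass(TrigramMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation cuts shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer as a combiner is what keeps the design scalable: most duplicate trigram counts are merged on the map side before the shuffle, so the global merge in the Reduce phase handles far fewer records.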
Keywords