Jisuanji kexue (Dec 2021)
Parallel WMD Algorithm Based on GPU Acceleration
Abstract
Word Mover's Distance (WMD) is a method of measuring text similarity.It defines the difference between two texts as the minimum distance between the word embedding vectors of the text.WMD uses the vocabulary to represent the text as a normalized bag-of-words vector.Since the words of the text occupies a small proportion in the corpus,the document vector gene-rated by the bag-of-words model is very sparse.Multiple documents can form a high-dimensional sparse matrix,such a sparse matrix will generate a lot of unnecessary operations.By calculating the WMD of a single source document for multiple target documents at once,the calculation process can be highly parallelized.Aiming at the sparsity of text vectors,this paper proposes a GPU-based parallel Sinkhorn-WMD algorithm,which uses compressed format to store target text to improve memory utilization,and reduces intermediate calculations based on the sparse structure.The pre-trained word embedding vector is used to calculate the word distance matrix,the WMD algorithm is improved,and the optimization algorithm is verified on two public news data sets.The experimental results show that the parallel algorithm on NVIDIA TITAN RTX can achieve a speedup of up to 67.43× compared with the CPU serial algorithm.
Keywords