Tongxin xuebao (Jan 2011)

Applying MapReduce frameworks to a virtualization platform for Deep Web data source discovery

  • XIN Jie,
  • CUI Zhi-ming,
  • ZHAO Peng-peng,
  • ZHANG Guang-ming,
  • XIAN Xue-feng

Journal volume & issue
Vol. 32
pp. 189 – 195

Abstract

To improve the performance of Deep Web crawlers in discovering and searching data source interfaces, a new method was proposed for processing the mass data of the Deep Web in parallel by combining the MapReduce programming model with virtualization technology. The new crawling architecture was designed with three procedures: the link-classification MapReduce, the page-classification MapReduce, and the form-classification MapReduce. Server virtualization was adopted to simulate a cluster environment in order to test performance. Experimental results indicate that the method is capable of large-scale parallel data computing, improves crawling efficiency, and avoids wasteful expenditure, which demonstrates the feasibility of applying cloud technologies to the field of Deep Web data mining.
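
The three-procedure pipeline described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the `run_mapreduce` helper, the keyword rule in `LINK_HINTS`, the in-memory `pages` dictionary standing in for fetched pages, and the "text input without password input" test for a searchable form are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Hypothetical keyword rule for promising links (assumption, not from the paper).
LINK_HINTS = ("search", "query", "find")


def link_mapper(link):
    url, anchor = link
    if any(hint in (url + " " + anchor).lower() for hint in LINK_HINTS):
        yield url, 1


def link_reducer(url, counts):
    yield url  # grouping by URL removes duplicate links


def make_page_mapper(pages):
    """pages is a hypothetical url -> HTML mapping that replaces real fetching."""
    def page_mapper(url):
        html = pages.get(url, "")
        if "<form" in html.lower():
            yield url, html
    return page_mapper


def page_reducer(url, htmls):
    yield url, htmls[0]


FORM_RE = re.compile(r"<form\b.*?</form>", re.IGNORECASE | re.DOTALL)


def form_mapper(page):
    url, html = page
    for form in FORM_RE.findall(html):
        lower = form.lower()
        # Assumed heuristic: a query form has a text box and no password box.
        yield url, 'type="text"' in lower and 'type="password"' not in lower


def form_reducer(url, flags):
    if any(flags):
        yield url


def discover_sources(links, pages):
    """Chain the link, page, and form classification jobs in order."""
    candidate_urls = run_mapreduce(links, link_mapper, link_reducer)
    form_pages = run_mapreduce(candidate_urls, make_page_mapper(pages), page_reducer)
    return run_mapreduce(form_pages, form_mapper, form_reducer)
```

Each job consumes the previous job's output, so every stage narrows the candidate set before the more expensive next stage runs. On a real cluster each `run_mapreduce` call would be a distributed job rather than an in-memory loop.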

Keywords