Dianxin kexue (Dec 2013)

A Parallel ETL Tool Based on an Improved Chain-MapReduce Framework

  • Bin Wu
  • Xinguang Liu

Journal volume & issue
Vol. 29
pp. 1 – 8

Abstract


Related work on parallel ETL and common methods for handling multiple MapReduce jobs are introduced. An improved chain-MapReduce framework is then presented, and a parallel ETL tool is designed on top of it. Several ETL optimization rules are presented that make the ETL process generate fewer MapReduce jobs, avoiding unnecessary I/O and network cost. The tool is evaluated on real queries and real large datasets. Compared with Hive, it reduces running time by 10% to 20% on average.
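To illustrate why chaining can reduce the number of MapReduce jobs, the following minimal sketch groups a linear sequence of ETL operators into jobs. It is not the authors' framework or their optimization rules, which the abstract does not spell out. It assumes the generic chain pattern, as supported by Hadoop's ChainMapper and ChainReducer, in which one job may run several map-only steps, one shuffle step, and further map-only steps. The operator names and the `Op` and `plan_jobs` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    """A hypothetical ETL operator in a linear pipeline."""
    name: str
    needs_shuffle: bool  # True for operators such as group-by or join


def plan_jobs(ops: List[Op]) -> List[List[str]]:
    """Pack operators into jobs with at most one shuffle operator each.

    Map-only operators are chained onto the current job, either before
    its shuffle (map side) or after it (reduce side). A new job starts
    only when a second shuffle operator is encountered.
    """
    jobs: List[List[str]] = []
    current: List[str] = []
    has_shuffle = False
    for op in ops:
        if op.needs_shuffle and has_shuffle:
            # The current job already has a shuffle: close it.
            jobs.append(current)
            current, has_shuffle = [], False
        current.append(op.name)
        has_shuffle = has_shuffle or op.needs_shuffle
    if current:
        jobs.append(current)
    return jobs


# Assumed example pipeline; the operators are illustrative only.
pipeline = [
    Op("filter", False),
    Op("project", False),
    Op("group_by", True),
    Op("derive", False),
    Op("join", True),
    Op("rename", False),
]

naive_job_count = len(pipeline)   # one job per operator
chained_jobs = plan_jobs(pipeline)

print(naive_job_count, len(chained_jobs))
for job in chained_jobs:
    print(" -> ".join(job))
```

In this sketch, each job boundary that is avoided means one fewer round of writing intermediate results to the distributed file system and reading them back, which is the kind of I/O and network cost the abstract refers to.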

Keywords