Getting Started with Hadoop: MapReduce Applications With Combiners

Apache Hadoop is a collection of open-source software utilities that facilitates solving data science problems. Hadoop enables fast analysis of large datasets by distributing them across a cluster and processing them in parallel. Explore the use of Combiners to make MapReduce applications more efficient by minimizing data transfers within the cluster.

  • recognize the need for Combiners to optimize the execution of a MapReduce application by minimizing data transfers within a cluster
  • recall the steps involved in processing data in a MapReduce application
  • describe the working of a Combiner in performing a partial reduction of the data that is output from the Mapper
  • configure a Combiner to optimize a MapReduce application that calculates an average value
  • use Maven to create a new project for a MapReduce application and plan out the Map and Reduce phases by examining the auto prices dataset
  • develop the Mapper and Reducer for the application that will calculate the average price for each make of automobile in the input dataset
  • create the driver program for the MapReduce application
  • run the MapReduce application and check the output to get the average price for each automobile make
  • code up a Combiner for the MapReduce application and configure the Driver to use it for a partial reduction on the Mapper nodes of the cluster
  • fix the bug in the previous application by defining a type that holds both the aggregate price and the count of automobiles, so that the average price can be calculated correctly
  • compare the output of the modified application with the previous buggy version and verify that the average prices for the vehicles are being calculated correctly
  • identify the shortcomings of regular MapReduce operations that Combiners address, and describe how Combiners differ from Reducers
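
The objectives above mention a bug that appears once a Combiner is added to an averaging job, and a fix based on a type that carries both an aggregate price and a count. The sketch below is one plausible illustration of that situation, not the course's actual code. It assumes the bug is the common pitfall of averaging partial averages, and it simulates the Mapper nodes with plain Java lists instead of using the Hadoop API. The class name `CombinerAverageSketch`, the `SumCount` type, both method names, and the prices are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class CombinerAverageSketch {

    // Hypothetical value type: a partial sum of prices plus the number of prices it covers.
    static final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Merging is associative and commutative, so it is safe to apply
        // on each Mapper node first and again on the Reducer.
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return sum / count;
        }
    }

    // Assumed buggy approach: each node emits a partial average,
    // and the final step averages those partial averages.
    static double averageOfAverages(List<List<Double>> pricesPerNode) {
        double total = 0;
        for (List<Double> node : pricesPerNode) {
            double sum = 0;
            for (double price : node) {
                sum += price;
            }
            total += sum / node.size();
        }
        return total / pricesPerNode.size();
    }

    // Fixed approach: each node emits a (sum, count) pair,
    // and the final step merges the pairs before dividing once.
    static double averageViaSumCount(List<List<Double>> pricesPerNode) {
        SumCount merged = new SumCount(0, 0);
        for (List<Double> node : pricesPerNode) {
            SumCount partial = new SumCount(0, 0);
            for (double price : node) {
                partial = partial.merge(new SumCount(price, 1));
            }
            merged = merged.merge(partial);
        }
        return merged.average();
    }

    public static void main(String[] args) {
        // Made-up prices for a single make, split unevenly across two simulated nodes.
        List<List<Double>> pricesPerNode = Arrays.asList(
                Arrays.asList(10000.0, 20000.0, 30000.0),
                Arrays.asList(40000.0));

        System.out.println(averageOfAverages(pricesPerNode));  // 30000.0 (skewed)
        System.out.println(averageViaSumCount(pricesPerNode)); // 25000.0 (true average)
    }
}
```

The two results differ because the nodes hold different numbers of records, so each partial average deserves a different weight. Carrying the count alongside the sum preserves that weight. In a real Hadoop job a type like this would typically be written as a custom `Writable` so it can be serialized between the Map and Reduce phases.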
