Apache Hadoop is a collection of open-source software utilities that facilitate the distributed storage and processing of large data sets. In this course, you will discover how to use Hadoop's MapReduce framework, including how to provision a Hadoop cluster on the cloud and then build a "hello world" application that uses MapReduce to calculate the word frequencies in a text document.
Getting Started with Hadoop: Developing a Basic MapReduce Application
create and configure a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service
work with the YARN Cluster Manager and HDFS NameNode web applications that come packaged with Hadoop
use Maven to create a new Java project for the MapReduce application
develop a Mapper for the word frequency application that includes the logic to parse one line of the input file and produce a collection of keys and values as output
create a Reducer for the application that will collect the Mapper output and calculate the word frequencies in the input text file
specify the configuration of the MapReduce application in the Driver program and the project's pom.xml file
build the MapReduce word frequency application with Maven to produce a jar file, and then prepare to execute it from the master node of the Hadoop cluster
run the application and examine the outputs generated to get the word frequencies in the input text document
identify the apps packaged with Hadoop and the purposes they serve, and recall the classes and methods used in the Map and Reduce phases of a MapReduce application
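The Map and Reduce phases outlined above can be sketched in plain Java, without the Hadoop API, to show the word-frequency logic itself: the map step parses one line and emits (word, 1) pairs, and the reduce step sums the values collected under each word. This is a minimal illustration only; the class and method names are illustrative, and the real application developed in the course extends Hadoop's Mapper and Reducer classes, which handle input splitting, shuffling, and grouping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch of the word-frequency logic (illustrative names).
public class WordFrequencySketch {

    // Map phase: parse one line of the input file and emit a (word, 1)
    // pair for each word, mirroring what a Mapper writes to its context.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the values grouped under each key, mirroring
    // what a Reducer does for each word and its list of counts.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }
}
```

In the real framework, Hadoop performs the grouping between the two phases across the cluster; here the reduce method simply aggregates the pairs in memory.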
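For reference, the Mapper, Reducer, and Driver pieces are typically wired together along the lines of the canonical WordCount example from the official Hadoop MapReduce tutorial. The class names below (WordCount, TokenizerMapper, IntSumReducer) are the tutorial's, not necessarily this course's, and the code assumes the hadoop-client dependency is declared in the project's pom.xml; it runs only when submitted to a Hadoop cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // map() receives one line of the input file and emits (word, 1) pairs
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // reduce() receives a word and all of its counts, and sums them
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: specifies the job configuration and submits it to the cluster
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The Driver's main() is the program run from the master node via the jar produced by Maven; the input and output paths are HDFS locations passed as command-line arguments.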