
Getting Started with Hadoop: Filtering Data Using MapReduce

Overview/Description

Apache Hadoop is a collection of open-source software utilities that facilitates solving data science problems. Extracting only the meaningful information from a dataset can be painstaking, especially when the dataset is very large. In this course, you will examine how Hadoop's MapReduce can be used to speed up this operation.
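The filtering pattern the course covers can be sketched in plain Python (this is a conceptual simulation of the map, shuffle, and reduce phases, not actual Hadoop code; the records and the age predicate are illustrative assumptions):

```python
# Conceptual sketch of MapReduce-style filtering in plain Python.
# The map phase emits only records that pass a predicate; the reduce
# phase is an identity pass-through, since filtering happened in map.
from itertools import groupby

records = [("Alice", 29), ("Bob", 17), ("Carol", 42), ("Dave", 15)]

def mapper(record):
    name, age = record
    # Emit a (key, value) pair only for records we want to keep.
    if age >= 18:
        yield (name, age)

def reducer(key, values):
    # Identity reducer: the map phase already dropped unwanted records.
    for v in values:
        yield (key, v)

# Simulate the shuffle: sort mapper output and group it by key.
mapped = [pair for rec in records for pair in mapper(rec)]
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
result = [out for key, group in shuffled
          for out in reducer(key, (v for _, v in group))]
print(result)  # only the records that passed the filter
```

On a real cluster, the shuffle-and-sort step is performed by the Hadoop framework between the Mapper and Reducer tasks; `groupby(sorted(...))` stands in for it here.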



Expected Duration (hours)
1.0

Lesson Objectives

Getting Started with Hadoop: Filtering Data Using MapReduce

  • create a new project and code up the Mapper for an application to count the number of passengers in each class of the Titanic in the input dataset
  • develop a Reducer and Driver for the application to generate the final passenger counts in each class of the Titanic
  • build the project using Maven and run it on the Hadoop master node to check that the output correctly shows the numbers in each passenger class
  • apply MapReduce to filter through only the surviving passengers on the Titanic from the input dataset
  • execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS NameNode web UIs
  • use MapReduce to obtain a distinct set of the cuisines offered by the restaurants in a dataset
  • build and run the application and confirm the output using HDFS from both the command line and the web application
  • identify the configuration functions used to customize a MapReduce job and recognize the input and output types when null values are transmitted from the Mapper to the Reducer
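The class-count application described in the objectives above can be simulated in plain Python (the course builds this as a Java Mapper and Reducer run on Hadoop; the two-field line layout below is a simplified assumption, not the real Titanic dataset):

```python
# Plain-Python simulation of the passenger-class count Mapper/Reducer.
# Each input line holds: passenger class, survived flag (1 = survived).
from itertools import groupby

lines = ["1,1", "3,0", "2,1", "3,1", "1,0", "3,0"]

def mapper(line):
    pclass, _survived = line.split(",")
    # Emit (class, 1) for every passenger, analogous to writing
    # a key with an IntWritable(1) value in the Java Mapper.
    yield (pclass, 1)

def reducer(key, counts):
    # Sum the ones for each class to get the passenger count.
    yield (key, sum(counts))

# Simulate the shuffle: sort mapper output and group it by key.
mapped = [pair for line in lines for pair in mapper(line)]
grouped = groupby(sorted(mapped), key=lambda kv: kv[0])
totals = dict(out for key, group in grouped
              for out in reducer(key, (c for _, c in group)))
print(totals)  # passenger count per class
```

Filtering only the survivors (another objective above) would follow the same shape, with the Mapper emitting a pair only when the survived flag is set.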
Course Number
it_dshpfddj_03_enus

Expertise Level
Beginner