Getting Started with Hadoop: Advanced Operations Using MapReduce

Overview/Description

Apache Hadoop is a collection of open-source software utilities for storing and processing large datasets across clusters of computers. In this course, explore how to use MapReduce to extract the five most expensive vehicles from a dataset, and then how to build an inverted index of the words appearing in a set of text files.
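
As a rough illustration of the inverted index idea, the following sketch simulates the map and reduce phases in plain Java, without the Hadoop API. The class name InvertedIndexSketch, its method names, and the sample file names are hypothetical and are not taken from the course. The map step emits (word, file name) pairs, and the reduce step groups them so that each word points to the files that contain it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical plain-Java simulation; a real job would use Hadoop's Mapper and Reducer classes.
class InvertedIndexSketch {

    // Map phase: emit one (word, fileName) pair per word found in the file's text.
    static List<Map.Entry<String, String>> map(String fileName, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, fileName));
            }
        }
        return pairs;
    }

    // Shuffle and reduce phase: group pairs by word and collect the distinct file names.
    static Map<String, Set<String>> reduce(List<Map.Entry<String, String>> pairs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), k -> new TreeSet<>()).add(pair.getValue());
        }
        return index;
    }

    // Runs the map step over every file, then reduces all emitted pairs into one index.
    static Map<String, Set<String>> build(Map<String, String> files) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            pairs.addAll(map(file.getKey(), file.getValue()));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "Hadoop stores data");
        files.put("b.txt", "MapReduce processes data");
        // Prints each word followed by the set of files it appears in.
        build(files).forEach((word, names) -> System.out.println(word + " -> " + names));
    }
}
```

In this sketch the word "data" would map to both sample files, while every other word maps to a single file.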



Expected Duration (hours)
0.8

Lesson Objectives

  • define a vehicle type that can be used to represent automobiles to be stored in a Java PriorityQueue
  • configure a Mapper to use a PriorityQueue to store the five most expensive vehicles it has processed from the dataset
  • use a PriorityQueue in the Reducer of the application to receive the five most expensive automobiles from each Mapper and write the top five vehicles overall to the output
  • execute the application and examine the output on HDFS to confirm that the five most expensive automobiles have been written out
  • define the Mapper for a MapReduce application to build an inverted index from a set of text files
  • configure the Reducer and the Driver for the inverted index application
  • run the application and examine the inverted index on HDFS
  • recognize the data structures and configurations involved when extracting the top N values from a dataset
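
The top N pattern named in the objectives above can be sketched in plain Java, without the Hadoop API. This is a minimal sketch under stated assumptions: the class TopNSketch, the Vehicle type with its model and price fields, and the sample data are all hypothetical, since the course's own classes are not shown here. A min-heap PriorityQueue keeps only the N most expensive vehicles seen so far. Each simulated mapper applies it to its own split, and the simulated reducer applies it again to the combined candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical plain-Java simulation of the top N pattern.
class TopNSketch {

    // Hypothetical vehicle type; the fields are assumptions for illustration.
    static final class Vehicle {
        final String model;
        final double price;

        Vehicle(String model, double price) {
            this.model = model;
            this.price = price;
        }
    }

    // Orders vehicles from cheapest to most expensive.
    static final Comparator<Vehicle> BY_PRICE = Comparator.comparingDouble(v -> v.price);

    // Returns the n most expensive vehicles, most expensive first.
    static List<Vehicle> topN(List<Vehicle> vehicles, int n) {
        // Min-heap: the head is always the cheapest vehicle currently kept.
        PriorityQueue<Vehicle> heap = new PriorityQueue<>(BY_PRICE);
        for (Vehicle v : vehicles) {
            heap.offer(v);
            if (heap.size() > n) {
                heap.poll(); // evict the cheapest so at most n vehicles remain
            }
        }
        List<Vehicle> result = new ArrayList<>(heap);
        result.sort(BY_PRICE.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Two input splits, each handled by its own simulated mapper.
        List<Vehicle> splitA = Arrays.asList(
                new Vehicle("A", 10.0), new Vehicle("B", 50.0), new Vehicle("C", 30.0));
        List<Vehicle> splitB = Arrays.asList(
                new Vehicle("D", 40.0), new Vehicle("E", 20.0), new Vehicle("F", 60.0));

        // Simulated reducer: merge each mapper's local top 2, then take the overall top 2.
        List<Vehicle> candidates = new ArrayList<>(topN(splitA, 2));
        candidates.addAll(topN(splitB, 2));
        for (Vehicle v : topN(candidates, 2)) {
            System.out.println(v.model + " " + v.price);
        }
    }
}
```

Because each mapper forwards only its local top N, the reducer has to compare at most N candidates per mapper rather than the full dataset.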

Course Number
it_dshpfddj_05_enus

Expertise Level
Intermediate