Data Analysis Using the Spark DataFrame API

Apache Spark 2.3
  • 16 Videos | 1h 17m 46s
  • Includes Assessment
  • Earns a Badge
Apache Spark, an open-source cluster-computing framework used for data science, has become the de facto big data framework. In this Skillsoft Aspire course, learners explore how to analyze real data sets by using DataFrame API methods. Discover how to optimize operations with shared variables and how to combine data from multiple DataFrames by using joins. Explore the features that make Spark 2.x significantly faster than Spark 1.x. Other topics include how to create a Spark DataFrame from a CSV file; apply DataFrame transformations, grouping, and aggregation; perform operations on a DataFrame to analyze categories of data in a data set; and visualize the contents of a Spark DataFrame with Matplotlib. Conclude by studying how to work with broadcast variables and how to store DataFrame contents in a text file.

WHAT YOU WILL LEARN

  • recognize the features that make Spark 2.x versions significantly faster than Spark 1.x
  • specify the reasons for using shared variables in your Spark application and distinguish between the two options available for sharing variables
  • create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame
  • define a transformation to view a random sample of data from a large DataFrame
  • apply grouping and aggregation operations on a DataFrame to analyze categories of data in a dataset
  • use Matplotlib to visualize the contents of a Spark DataFrame
  • perform operations to prepare your dataset for analysis by trimming unnecessary columns and rows containing missing data
  • define and apply a generic transformation on a DataFrame
  • apply complex transformations on a DataFrame to extract meaningful information from a dataset
  • work with broadcast variables and perform a join operation with a DataFrame that has been broadcast
  • use a Spark accumulator as a counter
  • store the contents of a DataFrame in a text file for archiving or sharing
  • define and work with a custom accumulator to count a vector of values
  • perform different join operations on Spark DataFrames to combine data from multiple sources
  • analyze data using the DataFrame API

IN THIS COURSE

  1. Course Overview | 2m 25s
  2. Performance Improvements in Spark 2 | 6m 14s
  3. Broadcast Variables and Accumulators | 4m 54s
  4. Loading Data into a DataFrame | 6m 11s
  5. Sampling the Contents of a DataFrame | 4m 9s
  6. Grouping and Aggregations | 6m 23s
  7. Visualizing Data in a DataFrame | 7m 34s
  8. Trimming and Cleaning Data | 4m 32s
  9. User-Defined Functions and DataFrames | 4m 36s
  10. Combining Filters, Aggregations, and Sorting | 3m 31s
  11. Using Broadcast Variables | 3m 39s
  12. Using Accumulators | 3m 59s
  13. Exporting DataFrame Contents | 2m 15s
  14. Custom Accumulators | 2m 56s
  15. Join Operations | 3m 28s
  16. Exercise: Data Analysis Using the DataFrame API | 4m 1s

EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft gives you the opportunity to earn a digital badge upon successful completion of this course. The badge can be shared on any social network or business platform.

Digital badges are yours to keep, forever.
