Data Analysis Using the Spark DataFrame API
Apache Spark 2.3 | Beginner
- 16 Videos | 1h 10m 46s
- Includes Assessment
- Earns a Badge
An open-source cluster-computing framework used for data science, Apache Spark has become the de facto big data framework. In this Skillsoft Aspire course, learners explore how to analyze real data sets by using DataFrame API methods. Discover how to optimize operations with shared variables and combine data from multiple DataFrames using joins. Explore the Spark 2.x features that make it significantly faster than Spark 1.x. Other topics include how to create a Spark DataFrame from a CSV file; apply DataFrame transformations, grouping, and aggregation; perform operations on a DataFrame to analyze categories of data in a data set; and visualize the contents of a Spark DataFrame with Matplotlib. Conclude by studying how to work with broadcast variables and store DataFrame contents in text file format.
WHAT YOU WILL LEARN
- recognize the features that make Spark 2.x versions significantly faster than Spark 1.x
- specify the reasons for using shared variables in your Spark application and distinguish between the two options available for sharing variables
- create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame
- define a transformation to view a random sample of data from a large DataFrame
- apply grouping and aggregation operations on a DataFrame to analyze categories of data in a dataset
- use Matplotlib to visualize the contents of a Spark DataFrame
- perform operations to prepare your dataset for analysis by trimming unnecessary columns and rows containing missing data
- define and apply a generic transformation on a DataFrame
- apply complex transformations on a DataFrame to extract meaningful information from a dataset
- work with broadcast variables and perform a join operation with a DataFrame that has been broadcast
- use a Spark accumulator as a counter
- store the contents of a DataFrame in a text file for archiving or sharing
- define and work with a custom accumulator to count a vector of values
- perform different join operations on Spark DataFrames to combine data from multiple sources
- analyze data using the DataFrame API
IN THIS COURSE
- 1. Course Overview | 2m 25s
- 2. Performance Improvements in Spark 2 | 6m 14s
- 3. Broadcast Variables and Accumulators | 4m 54s
- 4. Loading Data into a DataFrame | 6m 11s
- 5. Sampling the Contents of a DataFrame | 4m 9s
- 6. Grouping and Aggregations | 6m 23s
- 7. Visualizing Data in a DataFrame | 7m 34s
- 8. Trimming and Cleaning Data | 4m 32s
- 9. User-Defined Functions and DataFrames | 4m 36s
- 10. Combining Filters, Aggregations, and Sorting | 3m 31s
- 11. Using Broadcast Variables | 3m 39s
- 12. Using Accumulators | 3m 59s
- 13. Exporting DataFrame Contents | 2m 15s
- 14. Custom Accumulators | 2m 56s
- 15. Join Operations | 3m 28s
- 16. Exercise: Data Analysis Using the DataFrame API | 4m 1s
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft offers you the opportunity to earn a digital badge upon successful completion of this course. The badge can be shared on any social network or business platform.
Digital badges are yours to keep, forever.