Aspire Journeys

Data Analyst to Data Scientist

  • 101 Courses | 88h 23m 4s
  • 4 Labs | 32h
  • Includes Test Prep
This Skillsoft Aspire journey will first provide a foundation of data architecture, statistics, and data analysis programming skills using Python and R, the first step in acquiring the knowledge to transition away from disparate and legacy data sources. You will then learn to wrangle the data using Python and R and integrate that data with Spark and Hadoop. Next, you will learn how to operationalize and scale data while considering compliance and governance. To complete the journey, you will learn how to take that data and visualize it to inform smart business decisions.

Track 1: Data Analyst

In this track of the data science Skillsoft Aspire journey, the focus is on the data analyst role, covering Python, R, architecture, statistics, and Spark.

  • 26 Courses | 24h 29m 43s
  • 1 Lab | 8h

Track 2: Data Wrangler

In this track of the data science Skillsoft Aspire journey, the focus will be on the data wrangler role. We will explore areas such as: wrangling with Python, Mongo, and Hadoop.

  • 25 Courses | 22h 5m 36s
  • 1 Lab | 8h

Track 3: Data Ops

For this track of the data science Skillsoft Aspire journey, the focus will be on the Data Ops role. Here we will explore areas such as: governance, security, and harnessing volume and velocity.

  • 24 Courses | 18h 27m 26s
  • 1 Lab | 8h

Track 4: Data Scientist

For this track of the data science Skillsoft Aspire journey, the focus will be on the Data Scientist role. Here we will explore areas such as: visualization, APIs, and ML and DL algorithms.

  • 26 Courses | 23h 20m 19s
  • 1 Lab | 8h

COURSES INCLUDED

Data Architecture Getting Started
In this 12-video course, learners explore how to define data, its lifecycle, the importance of privacy, SQL and NoSQL database solutions, and key data management concepts as they relate to big data. First, look at the relationship between data, information, and analysis. Learn to recognize personally identifiable information (PII), protected health information (PHI), and common data privacy regulations. Then, study the six phases of the data lifecycle. Compare and contrast SQL and NoSQL database solutions, and look at using Visual Paradigm to create a relational database entity-relationship diagram (ERD). A SQL solution is implemented by deploying Microsoft SQL Server in the Amazon Web Services (AWS) cloud, and a NoSQL solution by deploying DynamoDB in the AWS cloud. Explore definitions of big data and governance. Learners will examine various types of data architecture, including TOGAF (The Open Group Architecture Framework) enterprise architecture. Finally, learners study data analytics and reporting, and how organizations can derive value from the data they have. The concluding exercise looks at implementing effective data management solutions.
13 videos | 1h 2m | Assessment | Badge
Data Engineering Getting Started
Data engineering is the area of data science that focuses on practical applications of data collection and analysis. This 12-video course helps learners explore distributed systems, batch versus in-memory processing, NoSQL uses, the various tools available for data management and big data, and the ETL (extract, transform, and load) process. Begin with an overview of distributed systems from a data perspective. Then look at differences between batch and in-memory processing. Learn about NoSQL stores and their use, and the tools available for data management. Explore ETL: what it is, the process, and the different tools available. Learn to use Talend Open Studio to showcase the ETL concept. Next, examine data modeling and creating a data model in Talend Open Studio. Explore the hierarchy of needs when working with AI and machine learning. In another tutorial, learn how to create a data partition. Then move on to data engineering best practices, with a look at approaches to building and using data reporting tools. Conclude with an exercise designed to create a data model.
13 videos | 45m | Assessment | Badge
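The extract, transform, and load sequence described above can be sketched in plain Python without a tool like Talend; the field names and values below are hypothetical:

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source (hypothetical fields).
raw = io.StringIO("name,price\nwidget,9.99\ngadget,19.50\nwidget,9.99\n")
rows = list(csv.DictReader(raw))

# Transform: normalize names, convert prices to floats, drop duplicates.
seen, clean = set(), []
for r in rows:
    key = (r["name"].strip().lower(), float(r["price"]))
    if key not in seen:
        seen.add(key)
        clean.append(key)

# Load: insert the cleaned records into a SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", clean)
count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2 distinct products after de-duplication
```

A tool such as Talend Open Studio wraps these same three stages in a visual job designer.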
Python - Introduction to NumPy for Multi-dimensional Data
This Skillsoft Aspire course explores NumPy, a Python library used in data science and big data. NumPy provides a framework to express data in the form of arrays and is the fundamental building block for several other Python libraries. For this course, you will need to know the basics of programming in Python 3, and should also have some familiarity with Jupyter notebooks. You will learn how to create NumPy arrays and perform basic mathematical operations on them. Next, you will see how to modify, index, slice, and reshape the arrays, and examine the NumPy library's universal array functions that operate on an element-by-element basis. Conclude by learning about the various options for iterating through NumPy arrays.
11 videos | 59m | Assessment | Badge
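As a rough preview of the operations this course covers, the following sketch creates an array, applies element-wise math and a universal function, then reshapes and iterates; the values are arbitrary:

```python
import numpy as np

# Create an array and apply element-wise math.
a = np.arange(6)            # [0 1 2 3 4 5]
b = a * 2 + 1               # element-wise: [1 3 5 7 9 11]

# Index and slice.
first_three = b[:3]         # [1 3 5]

# Reshape into a 2x3 matrix.
m = b.reshape(2, 3)

# Universal functions (ufuncs) operate element by element.
roots = np.sqrt(m)

# Iterate over every element regardless of shape.
total = sum(x for x in np.nditer(m))
print(total)  # 36
```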
Python - Advanced Operations with NumPy Arrays
NumPy is one of the fundamental packages for scientific computing that allows data to be represented in multidimensional arrays. This course covers the array operations you can undertake, such as image manipulation, fancy indexing, and broadcasting. To take this Skillsoft Aspire course, you should be comfortable with how to create, index, and slice NumPy arrays, and apply aggregate and universal functions. Among the topics, you will learn about the several options available in NumPy to split arrays. You will learn how to use NumPy to work with digital images, which are multidimensional arrays. Next, you will observe how to manipulate a color image, perform slicing operations to view sections of the image, and use a SciPy package for image manipulation. You will learn how to use masks and arrays of index values to access multiple elements of an array simultaneously, an approach referred to as fancy indexing. Finally, this course covers broadcasting to perform operations between mismatched arrays.
13 videos | 1h 7m | Assessment | Badge
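A minimal sketch of the splitting, masking, fancy indexing, and broadcasting operations named above (image manipulation aside), using arbitrary sample values:

```python
import numpy as np

arr = np.arange(12)

# Splitting: divide an array into equal parts.
left, right = np.split(arr, 2)          # two arrays of 6 elements each

# Boolean masks select elements that satisfy a condition.
evens = arr[arr % 2 == 0]               # [0 2 4 6 8 10]

# Fancy indexing: pass an array of indices to pick several elements at once.
picked = arr[[1, 5, 7]]                 # [1 5 7]

# Broadcasting: operate on mismatched shapes; the length-3 row is
# stretched across each row of the (2, 3) matrix.
matrix = np.ones((2, 3))
shifted = matrix + np.array([10, 20, 30])
print(shifted.tolist())
```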
Python - Introduction to Pandas and DataFrames
Simplify data analysis with Pandas DataFrames. Pandas is a Python library that enables you to work with series and tabular data, including initialization and population. For this course, learners do not need prior experience working with Pandas, but should be familiar with Python 3 and Jupyter notebooks. Topics include the following: defining your own index for a Pandas Series object; loading data from a CSV (comma-separated values) file to create a Pandas DataFrame; adding and removing data from your Pandas DataFrame; analyzing a portion of your DataFrame; and examining how to reshape or reorient data and create a pivot table. Finally, represent multidimensional data in two-dimensional DataFrames with multiple or hierarchical indexes.
14 videos | 1h 4m | Assessment | Badge
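The topics above can be previewed in a short sketch; the sales figures and column names here are hypothetical:

```python
import io
import pandas as pd

# A Series with a custom index instead of the default 0..n-1.
s = pd.Series([250, 180], index=["north", "south"])

# Load a DataFrame from CSV text (invented sales data).
csv_text = "region,quarter,sales\nnorth,Q1,250\nnorth,Q2,300\nsouth,Q1,180\nsouth,Q2,210\n"
df = pd.read_csv(io.StringIO(csv_text))

# Add, then remove, a column.
df["target_met"] = df["sales"] > 200
df = df.drop(columns=["target_met"])

# Reshape with a pivot table: regions as rows, quarters as columns.
pivot = df.pivot_table(index="region", columns="quarter", values="sales")
print(pivot.loc["north", "Q2"])
```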
Python - Manipulating & Analyzing Data in Pandas DataFrames
Explore advanced data manipulation and analysis with Pandas DataFrames, a Python library that shares similarities with relational databases. To take this course, prior basic experience is needed with Pandas DataFrames, data loading, and Jupyter Notebook data manipulation. You will learn to iterate data in your DataFrame. See how to export data to Excel files, JSON (JavaScript Object Notation) files, and CSV (comma separated values) files. Sort the contents of a DataFrame and manage missing data. Group data with a multi-index. Merge disparate data into a single DataFrame through join and concatenate operations. Finally, you will determine when and where to integrate data with structured queries, similar to SQL.
10 videos | 44m | Assessment | Badge
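A compact illustration of the merge, missing-data, and sorting operations described above, using made-up city and temperature values:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3, 4],
                     "city": ["Oslo", "Lima", "Pune", "Kiel"]})
right = pd.DataFrame({"id": [1, 2, 3],
                      "temp": [5.0, np.nan, 12.0]})

# Merge disparate data into one DataFrame (SQL-style inner join on id;
# id 4 has no match and is dropped).
merged = left.merge(right, on="id", how="inner")

# Handle missing data: fill NaN temperatures with the column mean.
merged["temp"] = merged["temp"].fillna(merged["temp"].mean())

# Sort by temperature, descending.
merged = merged.sort_values("temp", ascending=False)
print(merged.iloc[0]["city"])  # Pune
```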
R Data Structures
R is a programming language that is an essential skill for statistical computing and graphics. It is the tool of choice for data science professionals in every industry and field, not only to create reproducible high-quality analyses, but to take advantage of R's great graphic and charting capabilities. In this 11-video Skillsoft Aspire course, you will explore the fundamental data structures used in R, including working with vectors, lists, matrices, factors, and data frames. The key concepts in this course include: creating vectors in R and manipulating and performing operations on them; how to sort vectors in R; and how to use lists in R, exploring example code line by line by executing each line with the Run Current Line command. You will also examine creating matrices and performing matrix operations in R; creating factors and data frames in R; performing data frame operations in R; and how to create and use a data frame.
11 videos | 51m | Assessment | Badge
Importing & Exporting Data using R
The programming language R is an essential skill for statistical computing and graphics and the tool of choice for data science professionals in every industry and field, both for taking advantage of R's great graphic and charting capabilities and for creating reproducible high-quality analyses. In this 8-video Skillsoft Aspire course, you will discover how to use R to import and export tabular data in CSV (comma-separated values), Excel, and HTML formats. The key concepts covered in this course include how to read data from a CSV-formatted text file and from an Excel spreadsheet; how to read tabular data from an HTML file; and how to export tabular data from R to a CSV file and to an Excel spreadsheet. In addition, learners will explore exporting tabular data from R to an HTML table; how to read data from an HTML table and export it to CSV; and how to confirm that the contents of the CSV file were written correctly.
8 videos | 33m | Assessment | Badge
Data Exploration using R
The tool of choice for data science professionals in every modern industry and field, the programming language R has become an essential skill for statistical computing and graphics. It both creates reproducible high-quality analyses and takes advantage of superior graphic and charting capabilities. In this 10-video Skillsoft Aspire course, you will explore data in R by using the dplyr library, including working with tabular data, piping data, mutating data, summarizing data, combining datasets, and grouping data. Key concepts covered in this course include using the dplyr library to load data frames; selecting subsets of data by using dplyr; and how to filter tabular data using dplyr. You will also learn to perform multiple operations by using the pipe operator; how to create new columns with the mutate method; and how to summarize data using summary functions. Next, use the dplyr join functions to combine data. Then learn how to use the group_by method from the dplyr library, and how to query data with various dplyr library functions.
10 videos | 40m | Assessment | Badge
R Regression Methods
The programming language R has become an essential skill for statistical computing and graphics and the tool of choice for data science professionals in every industry and field. R creates reproducible high-quality analyses and allows users to take advantage of its great graphic and charting capabilities. In this 8-video Skillsoft Aspire course, you will discover how to apply regression methods to data science problems by using R. Key concepts covered in this course include preparing a data set before creating a linear regression model; creating a linear regression model with the lm method in R; and extracting statistical results of a linear regression problem. You will also learn how to test the predict method on a linear model; perform the preparatory steps needed to create a logistic model; and apply the generalized linear model (glm) method to a logistic regression problem. Finally, learners see how to create a linear regression model and use the predict method on a linear model.
8 videos | 36m | Assessment | Badge
R Classification & Clustering
Explore the advantages of the programming language R in this 8-video Skillsoft Aspire course. An essential skill for statistical computing and graphics, R is the tool of choice for data science professionals in every industry and field. It both creates reproducible high-quality analyses and offers unparalleled graphic and charting capabilities. Learners will examine how to apply classification and clustering methods to data science problems by using R. Key concepts covered in this course include performing the preparatory steps needed to create a classification and decision tree; using the rpart library and ctree library to build a decision tree; and how to perform the preparatory steps needed to carry out clustering. Next, explore use of the k-means clustering method; using hierarchical clustering with the hclust and cutree methods; and applying a decision tree method to a classification problem. Finally, learn to train a decision tree classifier by using the data and the relationships within it.
8 videos | 38m | Assessment | Badge
Simple Descriptive Statistics
Along the career path to Data Science, a fundamental understanding of statistics and modeling is required. The goal of all modeling is generalizing as well as possible from a sample to the population of big data as a whole. In this 10-video Skillsoft Aspire course, learners explore the first step in this process. Key concepts covered here include the objectives of descriptive and inferential statistics, and distinguishing between the two; objectives of population and sample, and distinguishing between the two; and objectives of probability and non-probability sampling and distinguishing between them. Learn to define the average of a data set and its properties; the median and mode of a data set and their properties; and the range of a data set and its properties. Then study the inter-quartile range of a data set and its properties; the variance and standard deviation of a data set and their properties; and how to differentiate between inferential and descriptive statistics, the two most important types of descriptive statistics, and the formula for standard deviation.
10 videos | 1h 10m | Assessment | Badge
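The measures introduced in this course can all be computed with Python's standard statistics module; the data set below is arbitrary:

```python
import statistics as st

data = [4, 8, 8, 5, 3, 9, 7, 8]

mean = st.mean(data)                 # the average
median = st.median(data)             # middle value of the sorted data
mode = st.mode(data)                 # most frequent value
rng = max(data) - min(data)          # range: spread from smallest to largest

# Inter-quartile range: spread of the middle 50% of the data.
q1, q2, q3 = st.quantiles(data, n=4)
iqr = q3 - q1

var = st.pvariance(data)             # population variance
sd = st.pstdev(data)                 # population standard deviation
print(mean, median, mode, rng)
```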
Common Approaches to Sampling Data
Data science is an interdisciplinary field that seeks to find interesting generalizable insights within data and then puts those insights to monetizable use. In this 8-video Skillsoft Aspire course, learners can explore the first step in obtaining a representative sample from which meaningful generalizable insights can be obtained. Examine basic concepts and tools in statistical theory, including the two most important approaches to sampling-probability and nonprobability sampling-and common sampling techniques used for both approaches. Learn about simple random sampling, systematic random sampling, and stratified random sampling, including their advantages and disadvantages. Next, explore sampling bias. Then consider what is probably the most popular type of nonprobability sampling technique-the case study, used in medical education, business education, and other fields. A concluding exercise on efficient sampling invites learners to review their new knowledge by defining the two properties of all probability sampling techniques; enumerating the three types of probability sampling techniques; and listing two types of nonprobability sampling.
8 videos | 46m | Assessment | Badge
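The three probability sampling techniques mentioned above can be sketched in a few lines of Python; the population and strata here are hypothetical:

```python
import random

random.seed(42)
population = list(range(100))        # a made-up population of 100 units

# Simple random sampling: every unit has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic random sampling: random start, then every k-th unit.
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sampling: split into strata, sample each one.
strata = {"low": population[:50], "high": population[50:]}
stratified = [u for group in strata.values() for u in random.sample(group, 5)]
print(len(simple), len(systematic), len(stratified))
```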
Inferential Statistics
In this Skillsoft Aspire course on data science, learners can explore hypothesis testing, which finds wide applications in data science. This beginner-level, 10-video course builds upon previous coursework by introducing simple inferential statistics, called the backbone of data science because they seek to posit and prove or disprove relationships within data. You will start by learning the steps in simple hypothesis testing: the null and alternative hypotheses, the test statistic, and the p-value, as each term is introduced and explained. Next, listen to an informative discussion of a specific family of hypothesis tests, the t-test. Then learn to describe their applications and become familiar with use cases including linear regression. Learn about the Gaussian distribution and the related concepts of correlation, which measures relationships between any two variables, and autocorrelation, a special form used in time-series analysis. In the closing exercise, review your knowledge by differentiating between the null and the alternative hypotheses in a hypothesis testing procedure, then enumerating four distinct uses for different types of t-tests.
10 videos | 1h 1m | Assessment | Badge
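As a concrete example of the hypothesis-testing steps described, here is a two-sample t-test using SciPy; the sample values are invented, and SciPy is an illustrative choice rather than a tool named by the course:

```python
from scipy import stats

# Two hypothetical samples, e.g. page-load times for two site versions.
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

# Independent two-sample t-test:
# H0 (null): the group means are equal; H1 (alternative): they differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value is evidence against the null hypothesis.
reject_null = p_value < 0.05
print(reject_null)
```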
Apache Spark Getting Started
Explore the basics of Apache Spark, an analytics engine used for big data processing. It's an open-source, cluster-computing framework built on top of Hadoop. Discover how it allows operations on data with both its own library methods and with SQL, while delivering great performance. Learn the characteristics, components, and functions of Spark, Hadoop, RDDs, the Spark session, and master and worker nodes. Install PySpark. Then, initialize a Spark context, and create a Spark DataFrame from the contents of an RDD and from another DataFrame. Configure a DataFrame with a map function. Retrieve and transform data. Finally, convert Spark DataFrames to Pandas DataFrames and vice versa.
15 videos | 1h 6m | Assessment | Badge
Hadoop & MapReduce Getting Started
In this course, learners will explore the theory behind big data analysis using Hadoop, and how MapReduce enables parallel processing of large data sets distributed on a cluster of machines. Begin with an introduction to big data and the various sources and characteristics of data available today. Look at challenges involved in processing big data and options available to address them. Next, a brief overview of Hadoop, its role in processing big data, and the functions of its components such as the Hadoop Distributed File System (HDFS), MapReduce, and YARN (Yet Another Resource Negotiator). Explore the working of Hadoop's MapReduce framework to process data in parallel on a cluster of machines. Recall steps involved in building a MapReduce application and specifics of the Map phase in processing each row of the input file's data. Recognize the functions of the Shuffle and Reduce phases in sorting and interpreting the output of the Map phase to produce a meaningful output. To conclude, complete an exercise on the fundamentals of Hadoop and MapReduce.
8 videos | 1h 3m | Assessment | Badge
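The Map, Shuffle, and Reduce phases described above can be imitated in a single Python process; a real Hadoop job distributes the same steps across a cluster of machines:

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each input line is mapped to (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key so each word's counts land together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum each word's counts to produce the final frequencies.
frequencies = {word: sum(counts) for word, counts in groups.items()}
print(frequencies["the"], frequencies["fox"])  # 3 2
```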
Developing a Basic MapReduce Hadoop Application
In this Skillsoft Aspire course, discover how to use Hadoop's MapReduce; provision a Hadoop cluster on the cloud; and build an application with MapReduce to calculate word frequencies in a text document. To start, create a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service; then work with the YARN Cluster Manager and HDFS (Hadoop Distributed File System) NameNode web applications that come packaged with Hadoop. Use Maven to create a new Java project for the MapReduce application, and develop a mapper for word frequency application. Create a Reducer for the application that will collect Mapper output and calculate word frequencies in input text files, and identify configurations of MapReduce applications in the Driver program and the project's pom.xml file. Next, build the MapReduce word frequency application with Maven to produce a jar file and prepare for execution from the master node of the Hadoop cluster. Finally, run the application and examine outputs generated to get word frequencies in the input text document. The exercise involves developing a basic MapReduce application.
10 videos | 1h 13m | Assessment | Badge
Hadoop HDFS Getting Started
Explore the concepts of analyzing large data sets in this 12-video Skillsoft Aspire course, which deals with Hadoop and its Hadoop Distributed File System (HDFS), which enables parallel processing of big data efficiently in a distributed cluster. The course assumes a conceptual understanding of Hadoop and its components; purely theoretical, it contains no labs, with just enough information provided to understand how Hadoop and HDFS allow processing big data in parallel. The course opens by explaining the ideas of vertical and horizontal scaling, then discusses functions served by Hadoop to horizontally scale data processing tasks. Learners explore functions of YARN, MapReduce, and HDFS, covering how HDFS keeps track of where all pieces of large files are distributed, replication of data, and how HDFS is used with ZooKeeper, a tool maintained by the Apache Software Foundation and used to provide coordination and synchronization in distributed systems, along with other services related to distributed computing, such as a naming service and configuration management. Learn about Spark, a data analytics engine for distributed data processing.
12 videos | 1h 14m | Assessment | Badge
Introduction to the Shell for Hadoop HDFS
In this Skillsoft Aspire course, learners discover how to set up a Hadoop cluster on the cloud and explore bundled web apps: the YARN Cluster Manager app and the HDFS (Hadoop Distributed File System) NameNode UI. This 9-video course assumes a good understanding of what Hadoop is, and how HDFS enables processing of big data in parallel by distributing large data sets across a cluster; learners should also be familiar with running commands from the Linux shell, with some fluency in basic Linux file system commands. The course opens by exploring two web applications which are packaged with Hadoop: the UI for the YARN cluster manager, and the NameNode UI for HDFS. Learners then explore two shells which can be used to work with HDFS, the Hadoop FS shell and the Hadoop DFS shell. Next, you will explore basic commands which can be used to navigate HDFS; discuss their similarities with Linux file system commands; and discuss distributed computing. In a closing exercise, practice identifying web applications used to explore and also monitor Hadoop.
9 videos | 52m | Assessment | Badge
Working with Files in Hadoop HDFS
In this Skillsoft Aspire course, learners will encounter basic Hadoop file system operations such as viewing the contents of directories and creating new ones. This 8-video course assumes a good understanding of what Hadoop is, and how HDFS enables processing of big data in parallel by distributing large data sets across a cluster; learners should also be familiar with running commands from the Linux shell, with some fluency in basic Linux file system commands. Begin by working with files in various ways, including transferring files between a local file system and HDFS (Hadoop Distributed File System), and explore ways to create and delete files on HDFS. Then examine different ways to modify files on HDFS. After exploring the distributed computing concept, prepare to begin working with HDFS in a production setting. In the closing exercise, write a command to create a directory /data/products/files on HDFS, where /data/products may not yet exist; then list two commands for two copy operations: one from the local file system to HDFS, and another for the reverse transfer, from HDFS to the local host.
8 videos | 47m | Assessment | Badge
Hadoop HDFS File Permissions
Explore reasons why not all users should have free rein over all data sets when managing a data warehouse. In this 9-video Skillsoft Aspire course, learners explore how file permissions can be viewed and configured in HDFS (Hadoop Distributed File System) and how the NameNode UI is used to monitor and explore HDFS. For this course, you need a good understanding of Hadoop and HDFS, along with familiarity with the HDFS shells, and confidence in working with and manipulating files on HDFS and exploring it from the command line. The course focuses on different ways to view the permissions linked to files and directories, and how these can be modified. Learners explore automating many tasks involving HDFS by simply scripting them, and how to use the HDFS NameNode UI to monitor the distributed file system and explore its contents. Review distributed computing and big data. The closing exercise involves writing a command for the hdfs dfs shell to count the number of files within a directory on HDFS, and performing related tasks.
9 videos | 48m | Assessment | Badge
Data Silos, Lakes, & Streams Introduction
This 11-video course discusses the transition of data warehousing to cloud-based solutions using the AWS (Amazon Web Services) cloud platform. You will examine various implications involved in storing different types of data from different sources within an organization. You will need to be familiar with provisioning and working with resources on the cloud, basic big data architecture, distributed systems, using shell commands, and a Linux terminal prompt. You will learn that an organization may have data silos that prevent other teams within the organization from accessing data. You will learn how to use data lakes, centralized repositories that store data at scale, as a viable solution to data silos that might exist within an organization. You will learn the difference between a data lake, which stores all kinds of raw data in a native format before the data has been processed, and a data warehouse, which contains data that can be used directly to generate business insights. Finally, this course demonstrates storing data with the AWS Redshift data warehouse.
12 videos | 1h 19m | Assessment | Badge
Data Lakes on AWS
This course discusses the transition of data warehousing to cloud-based solutions using the AWS (Amazon Web Services) cloud platform. In 11 videos, the course explores how data lakes store data using a flat structure, with the data tagged to make it easy to search and query. You will learn how to build a data lake on the AWS cloud by storing data in S3 (Simple Storage Service) buckets. You will learn to set up your data lake architecture using AWS Glue, a fully managed ETL (extract, transform, load) service. You will learn to configure and run Glue crawlers, and you will examine how crawlers can merge data stored in an S3 folder path and generate metadata tables in Glue from data in S3. Learners will use Athena, Amazon's interactive query service, as a simple way to analyze data in S3 using standard SQL. Finally, you will examine how to merge the data crawled by the CSV (comma-separated values) crawler into a single table.
12 videos | 1h 9m | Assessment | Badge
Data Lake Sources, Visualizations, & ETL Operations
This course discusses the transition of data warehousing to cloud-based solutions using the AWS (Amazon Web Services) cloud platform. You will explore Amazon Redshift, a fully managed petabyte-scale data warehouse service which forms part of the larger AWS cloud-computing platform. The 12-video course demonstrates how to create and configure an Amazon Redshift cluster; to load data into it from an S3 (simple storage service) bucket; and configure a Glue crawler for stored data. This course examines how to visualize the data stored in the data lake and how to perform ETL (extract, transform, load) operations on the data using Glue scripts. You will work with the DynamoDB, a NoSQL database service that supports key-value and document data structures. You will learn how to use AWS QuickSight, a high-performance business intelligence service which integrates seamlessly with Glue tables by using the Amazon Athena Query Service. Finally, you will configure jobs to run extract, transform, and load operations on data stored in our data lake.
13 videos | 1h 27m | Assessment | Badge
Applied Data Analysis
In this 14-video course, learners discover how to perform data analysis by using Anaconda, Python, R, and related analytical libraries and tools. Begin by learning how to install and configure Python with Anaconda, and how R is installed by using Anaconda. Jupyter Notebook will be launched to explore data. Next, learn about the import and export of data in Python, how to read data from and write data to files with the Python Pandas library, and how to import and export data in R. Learn to recognize and handle missing data in R and to use the dplyr package to transform data in R. Then learners examine the Python data analysis libraries NumPy and Pandas. Next, perform exploratory data analysis in R by using mean, median, and mode. Discover how to use the Python data analysis library Pandas to analyze data and how to use the ggplot2 library to visualize data with R. Learn about Pandas built-in data visualization tools to visualize data by using Python. The closing exercise deals with performing data analysis with R and Python.
15 videos | 1h 24m | Assessment | Badge
Final Exam: Data Analyst
Final Exam: Data Analyst will test your knowledge and application of the topics presented throughout the Data Analyst track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
1 video | 32s | Assessment | Badge

COURSES INCLUDED

Python - Using Pandas to Work with Series & DataFrames
Pandas, a popular Python library, is part of the open-source PyData stack. In this 10-video Skillsoft Aspire course, you will learn that Pandas represents data in a tabular format which makes it easy and intuitive to perform data manipulation, cleaning, and exploration. You will use the Pandas DataFrame, a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). To take this course, you should already be familiar with the Python programming language; all code writing is in Jupyter notebooks. You will work with basic Pandas data structures, including Pandas Series objects, which represent a single column of data and can store numerical values, strings, Booleans, and more complex data types. Learn how to use the Pandas DataFrame, which represents data in table form. Finally, learn to append and sort series values, handle missing data, add columns, and aggregate data in a DataFrame. The closing exercise involves instantiating a Pandas Series object by using both a list and a dictionary; changing the Series index to something other than the default value; and practicing sorting Series values in place.
11 videos | 1h 10m | Assessment | Badge
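The closing exercise described above might look roughly like this; the names and scores are hypothetical, and pd.concat is used because Series.append was removed in recent Pandas versions:

```python
import pandas as pd

# A Series from a list, with a custom (non-default) index.
from_list = pd.Series([88, 92, 79], index=["alice", "bob", "carol"])

# A Series from a dictionary: keys become the index automatically.
from_dict = pd.Series({"alice": 88, "bob": 92, "carol": 79})

# Append new data, then sort the values in place.
combined = pd.concat([from_list, pd.Series({"dave": 95})])
combined.sort_values(inplace=True)
print(combined.index[0], combined.index[-1])  # carol dave
```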
Python - Using Pandas for Visualizations and Time-Series Data
This 12-video Skillsoft Aspire course uses Python, the preferred programming language for data science, to explore data in Pandas with popular chart types such as the bar graph, histogram, pie chart, and box plot. Discover how to work with time series and string data in data sets. Pandas represents data in a tabular format which makes it easy to perform data manipulation, cleaning, and data exploration, all important parts of any data engineer's toolkit. You will learn how to use Matplotlib, a multiplatform data visualization library built on NumPy, the Python library that is used to work with multidimensional data. Learners will use Pandas features to work with specific kinds of data such as time series data and string data. This course includes a real-world demonstration using Pandas to analyze stock market returns for Amazon. Finally, you will learn how to make data transformations to clean, format, and transform the data into a useful form for further analysis.
13 videos | 1h 28m | Assessment | Badge
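A small sketch of the time series and string handling this course covers; the prices and tickers are invented, and the weekly resampling granularity is an assumption for illustration:

```python
import pandas as pd

# Hypothetical daily closing prices over one week.
days = pd.date_range("2024-01-01", periods=7, freq="D")
prices = pd.Series([100, 102, 101, 105, 107, 106, 110], index=days)

# Daily returns: percentage change from the previous day.
returns = prices.pct_change()

# Resample the daily series down to a weekly mean.
weekly = prices.resample("W").mean()

# String data: vectorized cleanup with the .str accessor.
tickers = pd.Series(["AMZN ", " aapl"])
cleaned = tickers.str.strip().str.upper()
print(cleaned.tolist())
```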
Python - Pandas Advanced Features
This course uses Python, the preferred programming language for data science, to explore Pandas, a popular Python library that is part of the open-source PyData stack. In this 11-video Skillsoft Aspire course, learners will use the Pandas DataFrame to perform advanced category grouping, aggregations, and filtering operations. You will see how to use Pandas to retrieve a subset of your data by performing filtering operations on rows as well as columns. You will perform analysis on multilevel data by using the groupby operation on DataFrames. You will then learn to use data masking, or data obfuscation, to protect classified or commercially sensitive data. Learners will work with duplicate data, an important part of data cleaning. You will examine the two broad categories of data: continuous data, which comprises a continuous range of values, and categorical data, which has discrete, finite values. Pandas automatically generates indexes for each of your DataFrame rows, and here you will learn to perform different types of reindexing operations on DataFrames.
12 videos | 1h 11m has Assessment available Badge
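The grouping, filtering, deduplication, and masking operations mentioned above can be sketched in a few lines; the sales records here are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sales records; names and values are invented for illustration.
df = pd.DataFrame({
    "region":  ["east", "east", "west", "west", "west"],
    "product": ["a", "b", "a", "a", "b"],
    "amount":  [10, 20, 30, 30, 40],
})

# Multilevel grouping: aggregate by region, then product.
totals = df.groupby(["region", "product"])["amount"].sum()
print(totals.loc[("west", "a")])  # 60

# Filtering on rows (region and amount) rather than columns.
big_west = df[(df["region"] == "west") & (df["amount"] > 25)]

# Drop duplicate rows, a routine data-cleaning step.
deduped = df.drop_duplicates()

# Simple masking: hide amounts below a threshold as missing values.
masked = df["amount"].mask(df["amount"] < 25)
```

Grouping on a list of columns yields a MultiIndex result, which is what "multilevel" analysis refers to in this context.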
Cleaning Data in R
R is a programming language that is essential for data science, used for statistical computing and graphics. In this 13-video course, learners explore essential methods for wrangling and cleaning data with R. Begin by recognizing types of unclean data and criteria for ensuring data quality. First, learners see how to fetch a JSON (JavaScript Object Notation) document over HTTP and load data into a dplyr table. Learn how to load multiple sheets from an Excel document and how to handle common errors encountered when reading CSV (comma-separated values) data. Read data from a relational database with a SQL (structured query language) query. Explore joining tabular data, combining two related data sets by using a join operation, and spreading data, reshaping tabular data by spreading values from rows to columns. Look at summarizing data, applying a summary function using dplyr; imputing data, using mean imputation to replace missing values; and extracting matches, using a regular expression and data wrangling tools from the tidyverse package. The closing exercise practices data wrangling functions using R.
13 videos | 1h 2m has Assessment available Badge
Technology Landscape & Tools for Data Management
This Skillsoft Aspire course explores various tools you can utilize to get better data analytics for your organization. You will learn the important factors to consider when selecting tools: velocity, the rate of incoming data; volume, the storage capacity or medium; and variety, the diversified nature of data in different formats. This course discusses the various tools available to implement machine learning and deep learning and to provide AI capabilities for better data analytics. The following tools are discussed: TensorFlow, Theano, Torch, Caffe, Microsoft Cognitive Toolkit, OpenAI, DMTK from Microsoft, Apache SINGA, FeatureFu, DL4J for Java, Neon, and Chainer. You will learn to use scikit-learn, a machine learning library for Python, to implement machine learning, and how to use machine learning in data analytics. This course covers how to recognize the capabilities provided by Python and R in the data management cycle. Learners will explore Python; the libraries NumPy, SciPy, and Pandas to manage data structures; and StatsModels. Finally, you will examine the capabilities of machine learning implementation in the cloud.
9 videos | 26m has Assessment available Badge
Machine Learning & Deep Learning Tools in the Cloud
This Skillsoft Aspire course explores the machine learning solutions provided by AWS (Amazon Web Services) and Microsoft, and compares the tools and frameworks that can be used to implement machine learning and deep learning. You will learn to become efficient in data wrangling by building a foundation with data tools and technology. This course explores the Machine Learning Toolkit provided by Microsoft, which offers various algorithms and applies artificial intelligence and deep learning. Learners will also examine the Distributed Machine Learning Toolkit, which is hosted on Azure. Next, explore the machine learning tools provided by AWS, including DeepRacer and DeepLens, which provide deep learning capabilities. You will learn how to use Amazon SageMaker, and how Jupyter notebooks are used to test machine learning algorithms. You will learn about other tools supported on AWS, including TensorFlow, Apache MXNet, and the Deep Learning AMI. Finally, learn about different tools for data mining and analytics, and how to build and process a data pipeline with KNIME (Konstanz Information Miner).
9 videos | 22m has Assessment available Badge
Data Wrangling with Trifacta
Data wrangling, an increasingly important practice among today's top firms, has become more complex as data grow more unstructured and varied in their sources. In this 13-video Skillsoft Aspire course, you will learn how to simplify the task by organizing and cleaning disparate data to present your data in the best format possible with Trifacta, which accelerates data wrangling to enhance productivity for data scientists. Learn to reshape data, look up data, and pivot data. Explore essential methods for wrangling data, including how to use Trifacta to standardize, format, filter, and extract data. Also covered are other key topics: how to split and merge columns; utilize conditional aggregation; apply transforms to reshape data; and join two data sets into one by using join operations. In the concluding exercise, learners will be asked to start by loading a data set into Trifacta; to replace any missing values, if necessary; and to use a row filter operation, a group by operation, and an aggregate function in the group by operation.
13 videos | 49m has Assessment available Badge
MongoDB Querying
This course explores how to use MongoDB, a cross-platform document-oriented database that has become a popular tool for data wrangling and data science. MongoDB is a NoSQL (not only SQL) database that uses JSON (JavaScript Object Notation)-like documents with schemata. One advantage of MongoDB is the flexibility of how it stores data. You will learn how to perform MongoDB actions related to data wrangling by using Python with the PyMongo library. You will learn how to perform basic CRUD (create, read, update, delete) operations on a MongoDB document. Next, learn how to use the find operation to select documents from a collection, and to use query operators to match document criteria. You will learn how to select documents using a specified criterion, similar to a WHERE clause in an SQL statement. Finally, this course demonstrates how to use the mongoimport tool to import data from JSON or CSV (comma-separated values) files, and mongoexport to export data from a MongoDB collection to JSON or CSV.
15 videos | 1h 6m has Assessment available Badge
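The course itself uses PyMongo against a live database; absent a server, the idea behind query operators such as `$gt` can be sketched with a toy pure-Python matcher. The documents, field names, and the matcher function below are all invented for illustration, not part of PyMongo.

```python
# A toy, pure-Python sketch of how MongoDB-style query operators select
# documents. No live server or PyMongo here; just the matching idea.
def matches(doc, query):
    """Return True if doc satisfies a query like {"price": {"$gt": 10}}."""
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):  # operator form, e.g. {"$gt": 10}
            for op, operand in condition.items():
                if op == "$gt" and not (value is not None and value > operand):
                    return False
                elif op == "$lt" and not (value is not None and value < operand):
                    return False
                elif op == "$in" and value not in operand:
                    return False
        elif value != condition:  # plain equality, like a SQL WHERE clause
            return False
    return True

docs = [
    {"name": "pen", "price": 2},
    {"name": "book", "price": 12},
    {"name": "lamp", "price": 30},
]
found = [d for d in docs if matches(d, {"price": {"$gt": 10}})]
print([d["name"] for d in found])  # ['book', 'lamp']
```

In real PyMongo the equivalent filter document is passed straight to `collection.find()`, and the server does the matching.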
MongoDB Aggregation
This Skillsoft Aspire course explores MongoDB, a cross-platform document-oriented database that has become a popular tool for data wrangling and data science. MongoDB is a NoSQL (not only SQL) database that uses JSON (JavaScript Object Notation)-like documents with schemata. This course demonstrates how to reshape, aggregate, and summarize documents in a MongoDB database; how to gather, filter, modify, and query data; and how to perform MongoDB actions related to data wrangling. Learners observe demonstrations of the structure of aggregate operations in MongoDB; how to use the group operator to perform aggregate computations; and how to use the limit and sort operators in an aggregation pipeline. Next, learn how to use the unwind operator to expand an array field in an aggregation, how to use the lookup operator to perform a join operation between two collections in an aggregation, and how to use the indexStats operator in an aggregation stage to view statistics on indexes. Finally, you will learn how to use a geospatial index for a geosearch operation.
11 videos | 50m has Assessment available Badge
Getting Started with Hive
This 9-video Skillsoft Aspire course focuses solely on theory and involves no programming or query execution. Learners begin by examining what a data warehouse is, and how it differs from a relational database, important because Apache Hive is primarily a data warehouse, despite giving a SQL-like interface to query data. Hive facilitates work on very large data sets, stored as files in the Hadoop Distributed File System, and lets users perform operations in parallel on data in these files by effectively transforming Hive queries into MapReduce operations. Next, you will hear about types of data and operations which data warehouses and relational databases handle, before moving on to basic components of the Hadoop architecture. Finally, the course discusses features of Hive making it popular among data analysts. The concluding exercise recalls differences between online transaction processing and online analytical processing systems, asking learners to identify Hadoop's three major components; list Hadoop offerings on three major cloud platforms (AWS, Microsoft Azure, and Google Cloud Platform); and list benefits of Hive for data analysts.
10 videos | 55m has Assessment available Badge
Loading & Querying Data with Hive
Among the market's most popular data warehouses used for data science, Apache Hive simplifies working with large data sets in files by representing them as tables. In this 12-video Skillsoft Aspire course, learners explore how to create, load, and query Hive tables. For this hands-on course, learners should have a conceptual understanding of Hive and its basic components, prior experience querying data from tables using SQL (structured query language), and familiarity with the command line. Key concepts covered include clusters, joining tables, and modifying tables. Demonstrations include using the Beeline client for Hive for simple operations, and creating tables, loading them with data, and then running queries against them. Only tables with primitive data types are used here, with data loaded into these tables from the HDFS (Hadoop Distributed File System) and from local machines. Learners will work with the Hive metastore and temporary tables and see how they can be used. You will become familiar with the basics of the Hive query language and quite comfortable working with HDFS.
13 videos | 1h 19m has Assessment available Badge
Viewing & Querying Complex Data with Hive
Learners explore working with complex data types in Apache Hive in this Skillsoft Aspire course, which assumes previous work with Hive tables using the Hive query language and comfort using a command-line interface or Hive client to run queries. Learners begin this 12-video, hands-on course by working with Hive tables whose columns are of complex data types (arrays, maps, and structs). Watch demonstrations of set operations and of transforming complex types into tabular form with the explode operation. Then use lateral views to add more data to exploded outputs. Course labs use the Beeline client; the instructor's Beeline terminal runs on the master node of a Hadoop cluster provisioned on Google Cloud Platform using its Dataproc service, and learners are assumed to have access to a Hadoop cluster and Beeline, on-premises or in the cloud. Finally, learners observe how to use views to aggregate the contents of multiple columns. As the course concludes, you should be comfortable working with all types of data in Hive and performing analysis tasks on tables with both primitive and complex data types.
12 videos | 1h 12m has Assessment available Badge
Optimizing Query Executions with Hive
In this 7-video Skillsoft Aspire course, learners can explore optimizations that allow Apache Hive to handle parallel processing of data, while users can still contribute to improving query performance. For this course, learners should have previous experience with Hive and familiarity with querying big data for analysis purposes. The course focuses only on concepts; no queries are run. Learners begin to understand how to optimize query executions in Hive, beginning with exploring the different options available in Hive to query data in an optimal manner. Discussions cover how to split data into smaller chunks, specifically through partitioning and bucketing, so that queries need not scan full data sets each time. Hive truly democratizes access to data stored in a Hadoop cluster, eliminating the need to know MapReduce to process cluster data, and makes that data accessible through the Hive query language by exposing files in Hadoop in the form of tables. Watch demonstrations of structuring queries to reduce the number of MapReduce operations generated by Hive and speed up query executions. Other concepts covered include partitioning, bucketing, and joins.
7 videos | 42m has Assessment available Badge
Using Hive to Optimize Query Executions with Partitioning
Continue to explore the versatility of Apache Hive, among today's most popular data warehouses, in this 10-video Skillsoft Aspire course. Learners are shown ways to optimize query executions, including the powerful technique of partitioning data sets. The hands-on course assumes previous work with Hive tables using the Hive query language and in processing complex data types, along with a theoretical understanding of improving query performance by partitioning very large data sets. Demonstrations focus on the basics of partitioning and how to create partitions and load data into them. Learners work with both Hive-managed tables and external tables to see how partitioning works for each, then watch a demonstration of navigating to the shell of the Hadoop master node and creating new directories in the Hadoop file system. Observe dynamic partitioning of tables and how this simplifies the loading of data into partitions. Finally, you explore how using multiple columns in a table can partition data within it. During this course, learners will acquire a sound understanding of how exactly large data sets can be partitioned into smaller chunks, improving query performance.
10 videos | 1h has Assessment available Badge
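The course works with real Hive tables on HDFS; the core intuition of why partitioning helps can be modeled in plain Python. A Hive table partitioned on a column stores each partition as its own directory, so a query filtered on that column scans only one directory. The records and function below are invented to model that idea, not Hive syntax.

```python
from collections import defaultdict

# Hypothetical order records: (country, amount). A Hive table partitioned
# on "country" would store each country's rows in a separate HDFS directory.
orders = [
    ("us", 100), ("us", 250), ("in", 75), ("in", 40), ("uk", 300),
]

# "Load" the data into partitions keyed on the partition column.
partitions = defaultdict(list)
for country, amount in orders:
    partitions[country].append(amount)

# A query filtered on the partition column touches only one partition,
# not the full data set; this is the essence of partition pruning.
def total_for(country):
    return sum(partitions[country])

print(total_for("in"))  # 115
```

Without partitioning, every query would iterate over all of `orders`; with it, the work is proportional to one partition's size.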
Bucketing & Window Functions with Hive
Learners explore how Apache Hive query executions can be optimized, including techniques such as bucketing data sets, in this Skillsoft Aspire course. Using windowing functions to extract meaningful insights from data is also covered. This 10-video course assumes previous work with partitions in Hive, as well as a conceptual understanding of how buckets can improve query performance. Learners begin by focusing on how to use the bucketing technique to process big data efficiently. Then take a look at HDFS (Hadoop Distributed File System) by navigating to the shell of the Hadoop master node; from there, make use of the hadoop fs -ls command to examine the contents of a directory. Observe three subdirectories corresponding to three partitions based on the value of the category column. You will then explore how to combine the partitioning and bucketing techniques to further improve query performance. Finally, learners will explore the concept of windowing, which helps users analyze a subset of ordered data, and see how this technique can be implemented in Hive.
9 videos | 1h 3m has Assessment available Badge
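The course implements window functions in HiveQL; as a rough analogue without a Hive cluster, the same idea can be sketched with pandas. The data below is invented, and the pandas call is a stand-in for Hive's RANK() OVER (PARTITION BY ... ORDER BY ...), not the course's actual syntax.

```python
import pandas as pd

# Hypothetical sales data. A window function computes a value over an
# ordered subset of rows; here, the rank of each sale within its category,
# analogous to Hive's RANK() OVER (PARTITION BY category ORDER BY amount DESC).
df = pd.DataFrame({
    "category": ["toys", "toys", "toys", "books", "books"],
    "amount":   [50, 80, 20, 70, 90],
})

df["rank_in_category"] = (
    df.groupby("category")["amount"].rank(ascending=False).astype(int)
)
print(df.loc[df["amount"] == 80, "rank_in_category"].iloc[0])  # 1
```

Unlike a plain GROUP BY, a window function keeps every row and attaches the computed value alongside it, which is what makes it useful for "top N per group" style analysis.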
Filtering Data Using Hadoop MapReduce
Extracting meaningful information from a very large dataset can be painstaking. In this Skillsoft Aspire course, learners examine how Hadoop's MapReduce can be used to speed up this operation. In a new project, code the Mapper for an application to count the number of passengers in each Titanic class in the input data set. Then develop a Reducer and Driver to generate final passenger counts for each Titanic class. Build the project by using Maven and run it on the Hadoop master node to check that the output correctly shows passenger class numbers. Apply MapReduce to filter only surviving Titanic passengers from the input data set. Execute the application and verify that the filtering has worked correctly; examine job and output files with the YARN cluster manager and HDFS (Hadoop Distributed File System) NameNode web user interfaces. Using a restaurant app's data set, use MapReduce to obtain the distinct set of cuisines offered. Build and run the application and confirm the output with HDFS from both the command line and the web application. The exercise involves filtering data by using MapReduce.
9 videos | 58m has Assessment available Badge
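The course builds its Mapper and Reducer in Java on a real cluster; the map, shuffle, and reduce phases it describes can be sketched in miniature with plain Python. The records below are invented stand-ins for the Titanic data set.

```python
from collections import defaultdict

# Hypothetical Titanic-style records: (passenger_class, survived).
records = [(1, 1), (3, 0), (3, 1), (2, 0), (1, 0), (3, 0)]

# Mapper: emit (key, 1) for each record -- one count per passenger class.
mapped = [(pclass, 1) for pclass, _survived in records]

# Shuffle: group values by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reducer: sum the counts for each class.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {1: 2, 3: 3, 2: 1}

# Filtering with MapReduce: the mapper simply emits only surviving passengers.
survivors = [(pclass, 1) for pclass, survived in records if survived == 1]
```

The filtering variant shows why MapReduce filters scale well: each mapper drops non-matching rows locally, so only surviving records ever cross the network.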
Hadoop MapReduce Applications With Combiners
In this Skillsoft Aspire course, explore the use of Combiners to make MapReduce applications more efficient by minimizing data transfers. Start by learning about the need for Combiners to optimize the execution of a MapReduce application by minimizing data transfers within a cluster. Recall the steps to process data in a MapReduce application, and look at using a Combiner to perform a partial reduction of the data output from the Mapper. Then create a new Maven project for a MapReduce application that calculates average automobile prices. Next, develop the Mapper and Reducer to calculate the average price for automobile makes in the input data set. Create a driver program for the MapReduce application, run it, and check the output to get the average price per automobile. Learn how to code up a Combiner for a MapReduce application, fix the bug in the application so it can be used to correctly calculate the average price, then run the fixed application to verify that the prices are being calculated correctly. The concluding exercise concerns optimizing MapReduce with Combiners.
13 videos | 1h 23m has Assessment available Badge
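The bug the course fixes is a classic: a Combiner that emits averages produces a wrong final average, because you cannot average averages. A correct Combiner emits partial (sum, count) pairs. Here is that pitfall in miniature, with invented prices:

```python
# The combiner averaging pitfall in miniature: prices are invented.
# Two mappers' outputs for the same key ("sedan"): lists of prices.
mapper1_prices = [10_000, 20_000, 30_000]   # mean 20_000
mapper2_prices = [40_000]                   # mean 40_000

# WRONG: averaging the two per-mapper means gives 30_000, but the true
# mean of all four prices is 25_000.
wrong = (sum(mapper1_prices) / 3 + sum(mapper2_prices) / 1) / 2

# RIGHT: each combiner emits a partial (sum, count) pair...
partial1 = (sum(mapper1_prices), len(mapper1_prices))
partial2 = (sum(mapper2_prices), len(mapper2_prices))

# ...and the reducer merges the partials before dividing.
total, count = partial1[0] + partial2[0], partial1[1] + partial2[1]
right = total / count
print(wrong, right)  # 30000.0 25000.0
```

The general rule: a Combiner is safe only when the partial results it emits can be merged losslessly by the Reducer, which is why sums and counts work where raw averages do not.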
Advanced Operations Using Hadoop MapReduce
In this Skillsoft Aspire course, explore how MapReduce can be used to extract the five most expensive vehicles in a data set, then build an inverted index for the words appearing in a set of text files. Begin by defining a vehicle type that can be used to represent automobiles to be stored in a Java PriorityQueue, then configure a Mapper to use a PriorityQueue to store the five most expensive automobiles it has processed from the dataset. Learn how to use a PriorityQueue in the Reducer of the application to receive the five most expensive automobiles from each mapper and write the top five automobiles overall to the output, then execute the application to verify the results. Next, explore how you can utilize the MapReduce framework in order to generate an inverted index and configure the Reducer and Driver for the inverted index application. This leads on to running the application and examining the inverted index on HDFS (Hadoop Distributed File System). The concluding exercise involves advanced operations using MapReduce.
9 videos | 48m has Assessment available Badge
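The course's Java PriorityQueue trick, keeping only the five most expensive vehicles in each Mapper, can be sketched with Python's heapq. The vehicle names and prices below are invented for illustration.

```python
import heapq

# Hypothetical (name, price) records; a bounded min-heap keeps the cheapest
# of the current top 5 at the root, ready to be evicted, which is the same
# role the Java PriorityQueue plays in the course's Mapper.
vehicles = [
    ("alpha", 18_000), ("bravo", 95_000), ("carro", 42_000),
    ("delta", 63_000), ("echo", 27_000), ("forte", 88_000),
    ("golfo", 51_000),
]

top5 = []
for name, price in vehicles:
    heapq.heappush(top5, (price, name))
    if len(top5) > 5:
        heapq.heappop(top5)  # evict the cheapest of the candidates

# The Reducer would merge each Mapper's heap the same way; here a single
# heap already holds the overall answer.
result = sorted(top5, reverse=True)
print([name for _price, name in result])
# ['bravo', 'forte', 'delta', 'golfo', 'carro']
```

Because each Mapper forwards at most five records, the Reducer sees a handful of candidates rather than the whole data set, which is the point of the technique.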
Data Analysis Using the Spark DataFrame API
An open-source cluster-computing framework used for data science, Apache Spark has become the de facto big data framework. In this Skillsoft Aspire course, learners explore how to analyze real data sets by using DataFrame API methods. Discover how to optimize operations with shared variables and combine data from multiple DataFrames using joins. Explore the Spark 2.x version features that make it significantly faster than Spark 1.x. Other topics include how to create a Spark DataFrame from a CSV file; how to apply DataFrame transformations, grouping, and aggregation; and how to perform operations on a DataFrame to analyze categories of data in a data set. Visualize the contents of a Spark DataFrame with Matplotlib. Conclude by studying how to broadcast variables and save DataFrame contents in text file format.
16 videos | 1h 10m has Assessment available Badge
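The course runs these operations on a Spark cluster; absent one, the same grouping, aggregation, and join pattern can be sketched with pandas. In PySpark the rough equivalents are `df.groupBy(...).agg(...)` and `df.join(other, on=..., how=...)`; the store data below is invented.

```python
import pandas as pd

# Hypothetical sales and store-lookup tables (invented for illustration).
sales = pd.DataFrame({
    "store": ["n1", "n1", "s1", "s1"],
    "amount": [100, 150, 200, 70],
})
stores = pd.DataFrame({
    "store": ["n1", "s1"],
    "region": ["north", "south"],
})

# Grouping and aggregation.
per_store = sales.groupby("store", as_index=False)["amount"].sum()

# Join the aggregate with the small lookup table. In Spark, a small lookup
# side like this is a natural candidate for a broadcast join.
report = per_store.merge(stores, on="store", how="inner")
print(report.to_dict("records"))
```

The pandas version runs on one machine; Spark's DataFrame API applies the same logical plan across a cluster, which is where shared variables and broadcasting come in.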
Data Analysis using Spark SQL
Analyze an Apache Spark DataFrame as though it were a relational database table. During this Aspire course, you will discover the different stages involved in optimizing any query or method call on the contents of a Spark DataFrame. Discover how to create views out of a Spark DataFrame's contents and run queries against them, and how to trim and clean a DataFrame. Next, learn how to perform an analysis of data by running different SQL queries, how to configure a DataFrame with an explicitly defined schema, and what a window is in the context of Spark. Finally, observe how to create and analyze categories of data in a data set by using windows.
9 videos | 54m has Assessment available Badge
Data Lake Framework & Design Implementation
A key component to wrangling data is the data lake framework. In this 9-video Skillsoft Aspire course, discover how to design and implement data lakes in the cloud and on-premises by using standard reference architectures and patterns to help identify the proper data architecture. Learners begin by looking at architectural differences between data lakes and data warehouses, then identifying the features that data lakes provide as part of the enterprise architecture. Learn how to use data lakes to democratize data and look at design principles for data lakes, identifying the design considerations. Explore the architecture of Amazon Web Services (AWS) data lakes and their essential components, then look at implementing data lakes using AWS. You will examine the prominent architectural styles used when implementing data lakes on-premises and on multiple cloud platforms. Next, learners will see the various frameworks that can be used to process data from data lakes. Finally, the concluding exercise compares data lakes and the data warehouse, showing how to specify data lake design patterns, and implement data lakes by using AWS.
10 videos | 33m has Assessment available Badge
Data Lake Architectures & Data Management Principles
A key component to wrangling data is the data lake framework. In this 9-video Skillsoft Aspire course, learners discover how to implement data lakes for real-time management. Explore data ingestion, data processing, and data lifecycle management with Amazon Web Services (AWS) and other open-source ecosystem products. Begin by examining real-time big data architectures and how to implement Lambda and Kappa architectures to manage real-time big data. View the benefits of adopting the Zaloni data lake reference architecture. Examine the essential approach to data ingestion and the comparative benefits provided by the Avro and Parquet file formats. Explore data ingestion with Sqoop, and various data processing strategies provided by MapReduce v2, Hive, Pig, and YARN for processing data with data lakes. Learn how to derive value from data lakes and describe the benefits of critical roles. Learners will explore the steps involved in the data lifecycle and the significance of archival policies. Finally, learn how to implement an archival policy to transition between S3 and Glacier, depending on adopted policies. Close the course with an exercise on ingesting data and archival policy.
10 videos | 34m has Assessment available Badge
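The S3-to-Glacier transition mentioned above is driven by a lifecycle configuration. The sketch below models one such rule as a Python dict; the bucket prefix, rule ID, and day thresholds are invented, and applying it against a real bucket would go through boto3's put_bucket_lifecycle_configuration call.

```python
# A hedged sketch of an S3 lifecycle rule of the kind the course describes:
# objects under an assumed "raw/" prefix move to Glacier after 90 days and
# expire after 365. Names and numbers are illustrative only.
lifecycle_policy = {
    "Rules": [
        {
            "ID": "archive-raw-zone",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

rule = lifecycle_policy["Rules"][0]
print(rule["Transitions"][0]["StorageClass"])  # GLACIER
```

Transition and expiration thresholds like these are how an archival policy is encoded declaratively, so no application code has to move objects between storage classes.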
Data Architecture Deep Dive - Design & Implementation
This 11-video Skillsoft Aspire course explores the numerous types of data architecture that can be used when working with big data, and how to implement strategies by using NoSQL (not only SQL), the CAP theorem (consistency, availability, and partition tolerance), and partitioning to improve performance. Learners examine the core activities essential for data architectures: data security, privacy, integrity, quality, regulatory compliance, and governance. You will learn different methods of partitioning and the criteria for implementing data partitioning. Next, you will install and explore MongoDB, a cross-platform document-oriented database system, and learn about read and write optimizations in MongoDB. You will learn to identify the various important components of hybrid data architecture and to adapt it to your data needs. You will learn how to implement a DAG (directed acyclic graph) by using the Elasticsearch search engine. You will evaluate your needs to determine whether to implement batch processing or stream processing. This course also covers process implementation by using serverless and Lambda architecture. Finally, you will examine the types of data risk involved when implementing data modeling and design.
12 videos | 35m has Assessment available Badge
Data Architecture Deep Dive - Microservices & Serverless Computing
Explore numerous types of data architecture that are effective data wrangling tools when working with big data in this 9-video Skillsoft Aspire course. Learn the strategies, design, and constraints involved in implementing data architecture. You will learn the concepts of data partitioning, the CAP theorem (consistency, availability, and partition tolerance), and process implementation using serverless and Lambda data architecture. This course examines Saga, a pattern newly introduced in the microservices data management pattern catalog; API (application programming interface) composition; CQRS (Command Query Responsibility Segregation); event sourcing; and application events. This course explores the differences between traditional data architecture and serverless architecture, which allows you to use client-side logic and third-party services. You will learn how to use AWS (Amazon Web Services) Lambda to implement a serverless architecture. This course then explores batch processing architecture, which processes data files by using long-running batch jobs to filter actual content; real-time architecture; and machine-learning-at-scale architecture built to serve machine learning algorithms. Finally, you will explore how to build a successful data POC (proof of concept).
10 videos | 25m has Assessment available Badge
Final Exam: Data Wrangler
Final Exam: Data Wrangler will test your knowledge and application of the topics presented throughout the Data Wrangler track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
1 video | 32s has Assessment available Badge

COURSES INCLUDED

Data Science Tools
Explore a variety of new data science tools available today, the different uses for these tools, and the benefits and challenges in deploying them in this 12-video course. First, examine a data science platform, the nucleus of technologies used to perform data science tasks. You will then explore the analysis process to inspect, clean, transform, and model data. Next, the course surveys integrating and exploring data, coding and building models using that data, deploying the models to production, and delivering results through applications or by generating reports. You will see how a great data science platform should be flexible and scalable, and how it should combine multiple features and capabilities that effectively centralize data science efforts. You will learn the six sequential steps of a typical data science workflow, from defining the objective for the project to reporting the results. Finally, explore DevOps, which brings developers and IT together through shared people, processes, and infrastructure, and its typical functionalities, including integration, testing, packaging, and deployment.
13 videos | 47m has Assessment available Badge
Delivering Dashboards: Management Patterns
In this 11-video course, explore the concept of dashboards and the best practices that can be adopted to build effective dashboards. The course also covers how to implement dashboards and visualizations by using PowerBI and ELK and the concepts of leaderboards and scorecards. Learners begin with analytical visualization, recognizing the various types of visualizations that can be used to build concise dashboards. Then you will look at different dashboard types and their associated features and benefits. Learners examine different types of data used in analysis and the types of visualizations that can be created from the data. Learn about dashboard components, the essential components involved in building a productive dashboard. Familiarize yourself with best practices for building a productive dashboard. Learn how to create a dashboard using ELK (Elasticsearch, Logstash, and Kibana) and PowerBI (business intelligence platform). Then look at selection criteria for charts, the critical benefits provided by leaderboards and scorecards, and the various scorecard types. The closing exercise involves creating dashboards by using PowerBI and ELK.
12 videos | 33m has Assessment available Badge
Delivering Dashboards: Exploration & Analytics
Explore the role played by dashboards in data exploration and deep analytics in this 11-video course. Dashboards are especially useful in visualizing data for a wide variety of business users, so that they can better understand the data being analyzed. First, learners examine the essential patterns of dashboard design and how to implement appropriate dashboards by using Kibana, Tableau, and Qlikview. You will begin by learning about data exploration capabilities using charts; then take a look at analytical visualization tools, the prominent tools that can be used to implement charts. Learn how to create bar and line charts and dashboards, and then share those dashboards with Kibana. Learners then watch demonstrations of creating charts and dashboards by using both Tableau and Qlikview tools. You will explore the approach to building dashboards with real-time data updates and how to specify essential design patterns that can be adopted when designing dashboards. Finally, learn about creating monitoring dashboards with ELK (Elasticsearch, Logstash, and Kibana). The concluding exercise deals with creating dashboards by using Kibana, Tableau, and Qlikview.
12 videos | 30m has Assessment available Badge
Cloud Data Architecture: Cloud Architecture & Containerization
In this course, learners discover how to implement cloud architecture for large-scale data science applications, serverless computing, adequate storage, and analytical platforms using DevOps tools and cloud resources. Key concepts covered here include the impact of implementing containerization on cloud hosting environments; the benefits of container implementation, such as lower overhead, increased portability, operational consistency, greater efficiency, and better application development; and the role of cloud container services. You will study the concept of serverless computing and its benefits; the approaches to implementing DevOps in the cloud; and how to implement OpsWorks on AWS by using Puppet, which provides the ability to define which software and configuration a system requires. See demonstrations of how to classify storage from the perspective of capacity and data access technologies; the benefits of implementing machine learning, deep learning, and artificial intelligence in the cloud; and the impact of cloud technology on BI analytics. Finally, learners encounter container and cloud storage types, container and serverless computing benefits, and the advantages of implementing cloud-based BI analytics.
10 videos | 44m has Assessment available Badge
Cloud Data Architecture: Data Management & Adoption Frameworks
Explore how to implement containers and data management on popular cloud platforms like Amazon Web Services (AWS) and Google Cloud Platform (GCP) for data science. Planning big data solutions, disaster recovery, and backup and restore in the cloud are also covered in this course. Key concepts covered here include cloud migration models from the perspective of architectural preferences; prominent big data solutions that can be implemented in the cloud; and the impact of implementing Kubernetes and Docker in the cloud, along with how to implement Kubernetes on AWS. Next, learn how to implement data management on AWS, GCP, and DBaaS; how to implement big data solutions using AWS; how to build backup and restore mechanisms in the cloud; and how to implement disaster recovery planning for cloud applications. Learners will see prominent cloud adoption frameworks and their associated capabilities, and learn the benefits of blockchain technologies in the cloud and how to implement them. Finally, learn how to implement Kubernetes on AWS, build backup and restore mechanisms on GCP, and implement big data solutions in the cloud.
13 videos | 1h 4m has Assessment available Badge
Data Compliance Issues & Strategies
This course examines the key areas for compliance in data protection: policies and legal regulations. Learners examine the legal regulations applicable to data protection, as well as company policies, meaning the internal documents and procedures an organization implements to comply with the law. You will learn how an organization can develop a policy framework and how to establish internal rules for personnel. You will also learn about some of the organizations that have developed regional policies, for example, the APEC (Asia-Pacific Economic Cooperation) Privacy Framework and the OECD (Organisation for Economic Co-operation and Development) Privacy Principles. Finally, you will explore procedures for internal and external reporting, and other responses to data breaches.
13 videos | 43m | Assessment | Badge
Implementing Governance Strategies
This course explores the key concepts behind governance and its relationship with big data. Big data sets are large and complex, often comprising massive amounts of very granular data that need to be protected from misuse. This 12-video course examines the five main requirements to consider when planning and designing an all-encompassing governance strategy. You will first learn to build a data governance plan by identifying the most important data domains. Then learn the importance of assembling a data governance body for an organization's big data activities, and how to identify the stakeholders that need to be part of a data governance program. Next, you will learn why the members of the governance body should be fairly diverse, well trained, and informed of the policies surrounding the collection of data and the procedures for using the data, and why the body should include compliance professionals who understand the rules and regulations applicable to your corporate structure. Finally, you will explore the issues involved in cloud storage of big data.
13 videos | 45m | Assessment | Badge
Data Access & Governance Policies: Data Access Governance & IAM
This course explores how DAG (Data Access Governance), a structured data access framework, can reduce the likelihood of current and future data security breaches. Risk and data safety compliance addresses how to identify threats against an organization's digital data assets. You will learn about legal compliance, industry regulations, and compliance with organizational security policies. You will learn how IAM (identity and access management) relates to users, devices, and software components. Learners will then explore how the PoLP (principle of least privilege) dictates which permissions users are given to access data. You will learn to create an IAM user and group within AWS (Amazon Web Services), and how to assign file system permissions on a Windows server in accordance with the principle of least privilege. Finally, you will examine how vulnerability assessments are used to identify security weaknesses, and different types of preventative security controls, for example, firewalls or malware scanning.
13 videos | 58m | Assessment | Badge
Data Access & Governance Policies: Data Classification, Encryption, & Monitoring
Explore how data classification determines which security measures apply to varying classes of data. This 12-video course classifies data into two main categories: internal data and sensitive data. You will learn to classify data by using Microsoft FSRM (File Server Resource Manager), a role service in Windows Server that enables you to manage and classify data stored on file servers. Learners will explore different tools used to safeguard sensitive information, such as data encryption. You will learn how to enable Microsoft BitLocker, a full volume encryption feature included with Microsoft Windows, to encrypt data at rest. An important aspect of data access governance is securing data that is being transmitted over a network, and you will learn to configure a VPN (virtual private network) using Microsoft System Center Configuration Manager. You will learn to configure a Custom Filtered Log View using MS Windows Event Viewer to track user access to a database. Finally, you will learn to audit file access on an MS Windows Server 2016 host.
13 videos | 1h 18m | Assessment | Badge
Streaming Data Architectures: An Introduction to Streaming Data in Spark
Learn the fundamentals of streaming data with Apache Spark. During this course, you will discover the differences between batch and streaming data. Observe the types of streaming data sources. Learn how to process streaming data, transform the stream, and materialize the results. Decouple a streaming application from the data sources with a message transport. Next, learn about techniques used in Spark 1.x to work with streaming data and how they contrast with processing batch data; how structured streaming in Spark 2.x is able to ease the task of stream processing for the app developer; and how stream processing works in both Spark 1.x and 2.x. Finally, learn how triggers can be set up to periodically process streaming data, and the key aspects of working with structured streaming in Spark.
9 videos | 50m | Assessment | Badge
Streaming Data Architectures: Processing Streaming Data with Spark
Process streaming data with Spark, the distributed analytics engine that integrates with Hadoop. In this course, you will discover how to develop applications in Spark to work with streaming data and generate output. Topics include the following: configure a streaming data source; use Netcat and write applications to process the data stream; learn the effects of using the Update mode on your stream processing application's output; write a monitoring application that listens for new files added to a directory; compare the Append output mode with the Update mode; develop applications to limit files processed in each trigger; use Spark's Complete mode for output; perform aggregation operations on streaming data with the DataFrame API; and process streaming data with Spark SQL queries.
11 videos | 52m | Assessment | Badge
Scalable Data Architectures: Getting Started
Explore theoretical foundations of the need for and characteristics of scalable data architectures in this 8-video course. Learn to use data warehouses to store, process, and analyze big data. Key concepts covered here include how to recognize the need to scale architectures to keep up with needs for storage and processing of big data; how to identify characteristics of data warehouses ideally suiting them to tasks of big data analysis and processing; and how to distinguish between relational databases and data warehouses. Next, learn to recognize specific characteristics of systems meant for online transaction processing and online analytical processing, and how data warehouses are an example of online analytical processing (OLAP) systems. Then, learn to identify various components of data warehouses enabling them to work with varied sources, extract and transform big data, and generate reports of analysis operations efficiently. Finally, study features of Amazon Redshift enabling big data to be processed at scale; features of data warehouses, contrasted with those of relational databases; and two options available to scale compute capacity.
8 videos | 52m | Assessment | Badge
Scalable Data Architectures: Using Amazon Redshift
Using a hands-on lab approach, explore how to use Amazon Redshift to set up and configure a data warehouse on the cloud in this 9-video course. Discover how to interact with the Redshift service with both the console and the Amazon Web Services (AWS) Command Line Interface (CLI). Key concepts covered here include how to use the Amazon Redshift Quick Launch feature to provision a data warehouse; provisioning a Redshift cluster with the default configuration; and the configuration options for a Redshift cluster and the metrics available to optimize a cluster configuration. Next, learn how to create Identity and Access Management (IAM) roles on AWS that include the necessary permissions to interact with the Redshift and S3 services; to provision an IAM user that can connect to and interact with AWS using the CLI; and to install the AWS command-line interface to create and delete Redshift clusters. Then learn to use the Redshift Query Editor to create tables, load data, and run queries; and learn features of Amazon Redshift and the commands and configurations needed to work with Redshift by using the CLI.
9 videos | 54m | Assessment | Badge
Scalable Data Architectures: Using Amazon Redshift & QuickSight
In this 12-video course, explore the loading of data from an external source such as Amazon S3 into a Redshift cluster, as well as configuration of snapshots and resizing of clusters. Discover how to use Amazon QuickSight to visualize data. Key concepts covered in this course include using the AWS console to load data sets to Amazon S3 and then into a table provisioned on a Redshift cluster; running queries on data in a Redshift cluster with the query evaluation feature; and working with SQL Workbench to connect to and query data in a Redshift cluster. Learn how to disable automated snapshots for a Redshift cluster and configure a table to be excluded from snapshots; recover an individual table from the snapshot of an entire cluster; and create a security group rule enabling access from Amazon's QuickSight servers to a Redshift cluster. Next, configure Amazon QuickSight to load data from a table in a Redshift cluster for analysis; and use the QuickSight dashboard to generate a time series plot to visualize sales at a retailer over time.
12 videos | 1h 17m | Assessment | Badge
Building Data Pipelines
Explore data pipelines and methods of processing them with and without ETL (extract, transform, load). In this course, you will learn to create data pipelines by using the Apache Airflow workflow management program. Key concepts covered here include the data pipeline as an application that sits between raw data and a transformed data set, between a data source and a data target; how to build a traditional ETL pipeline with batch processing; and how to build an ETL pipeline with stream processing. Next, learn how to set up and install Apache Airflow; the key concepts of Apache Airflow; and how to instantiate a directed acyclic graph (DAG) in Airflow. Learners are shown how to use tasks and include arguments in Airflow; how to use dependencies in Airflow; how to build an ETL pipeline with Airflow; and how to build an automated pipeline without using ETL. Finally, learn how to test Airflow tasks by using the airflow command line utility, and how to use Apache Airflow to create a data pipeline.
13 videos | 1h 9m | Assessment | Badge
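The batch ETL pattern described above can be sketched in plain Python; the stage functions, field names, and sample records below are hypothetical illustrations, not course material (in the course, stages like these would be scheduled as Airflow tasks):

```python
# Minimal batch ETL sketch: extract -> transform -> load.

def extract():
    # In practice this would read from a file, API, or database;
    # here we return hard-coded sample rows.
    return [{"name": " Alice ", "sales": "120"}, {"name": "Bob", "sales": "95"}]

def transform(rows):
    # Simple transformation rules: trim whitespace, cast types.
    return [{"name": r["name"].strip(), "sales": int(r["sales"])} for r in rows]

def load(rows, target):
    # A real loader would write to a warehouse; here we append to a list.
    target.extend(rows)
    return target

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'sales': 120}
```

In a stream-processing variant, the same transform would be applied record by record as data arrives rather than to the whole batch at once.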
Data Pipeline: Process Implementation Using Tableau & AWS
Explore the concept of data pipelines, the processes and stages involved in building them, and technologies such as Tableau and Amazon Web Services (AWS) that can be used in this 11-video course. Learners begin with an initial look at the data pipeline and its features, and then the steps involved in building one. You will go on to learn about the processes involved in building data pipelines, the different stages of a pipeline, and the various essential technologies that can be used to implement one. Next, learners explore the various types of data sources that are involved in the data pipeline transformation phases. Then you learn to define scheduled data pipelines and list all the associated components, tasks, and attempts. You will learn how to install Tableau Server and command line utilities and then build data pipelines using the Tableau command line utilities. Finally, take a look at the steps involved in building data pipelines on AWS. The closing exercise involves building data pipelines with Tableau.
11 videos | 38m | Assessment | Badge
Data Pipeline: Using Frameworks for Advanced Data Management
Discover how to implement data pipelines using Python Luigi, integrate Spark and Tableau to manage data pipelines, use Dask arrays, and build data pipeline visualization with Python in this 10-video course. Begin by learning about features of Celery and Luigi that can be used to set up data pipelines, then how to implement Python Luigi to set up data pipelines. Next, turn to working with the Dask library, after listing the essential features provided by Dask from the perspective of task scheduling and big data collections. Learn how Dask arrays implement the NumPy application programming interface (API) at scale. Explore frameworks that can be used to implement data exploration and visualization in data pipelines. Integrate Spark and Tableau to manage data pipelines. Move on to streaming data visualization with Python, using Python to build visualizations for streaming data. Then learn about the data pipeline building capabilities provided by Kafka, Spark, and PySpark. The concluding exercise involves setting up Luigi to implement data pipelines, Spark and Tableau integration, and building pipelines with Python.
10 videos | 32m | Assessment | Badge
Data Sources: Integration from the Edge
In this 11-video course, you will examine the architecture of IoT (Internet of Things) solutions and the essential approaches of integrating data sources. Begin by examining the required elements for deploying IoT solutions and its prominent service categories. Take a look at the capabilities provided and the maturity models of IoT solutions. Explore the critical design principles that need to be implemented when building IoT solutions and the cloud architectures of IoT from the perspective of Microsoft Azure, Amazon Web Services, and GCP (Google Cloud Platform). Compare the features and capabilities provided by the MQTT (Message Queuing Telemetry Transport) and XMPP (Extensible Messaging and Presence Protocol) protocols for IoT solutions. Identify key features and applications that can be implemented by using IoT controllers; learn to recognize the concept of IoT data management and the applied lifecycle of IoT data. Examine the list of essential security techniques that can be implemented to secure IoT solutions. The concluding exercise focuses on generating data streams.
11 videos | 39m | Assessment | Badge
Data Sources: Implementing Edge Data on the Cloud
To become proficient in data science, users have to understand edge computing, where data is processed near the source or at the edge of the network, whereas in a typical cloud environment data processing happens in a centralized data storage location. In this 7-video course, learners will explore the implementation of IoT (Internet of Things) on prominent cloud platforms like AWS (Amazon Web Services) and GCP (Google Cloud Platform). Discover how to work with IoT Device Simulator and generate data streams with MQTT (Message Queuing Telemetry Transport). You will next examine the approaches and steps involved in setting up AWS IoT Greengrass, and the essential components of GCP IoT Edge. Then learn how to connect a web application to AWS IoT by using MQTT over WebSockets. The next tutorial demonstrates the essential approach of using IoT Device Simulator, then moves on to generating streams of data by using the MQTT messaging protocol. The concluding exercise involves creating a device type, a user, and a device by using IoT Device Simulator.
7 videos | 30m | Assessment | Badge
Securing Big Data Streams
Learners can explore security risks related to modern data capture, data centers, and processing methods, such as streaming analytics, in this 13-video course. As the value of a company's data increases, the same data become more and more valuable to hackers and other criminals. You will learn up-to-date techniques and tools employed to mitigate security risks, and best practices related to securing big data, including cloud data, trust, and encryption. Begin with an overview of common security concerns for big data and streaming data, as well as concerns related to NoSQL ("not only SQL") databases, distributed processing frameworks, and flaws related to data mining and analytics. Then explore how to secure big data; explore streaming data and data in motion; and see how end-point devices are secured by using validation and filtering, as well as how to use encryption to secure data at rest. In the concluding exercise, practice what you have learned by describing key big data security concerns, key streaming data security concerns, and how end-point devices are secured.
13 videos | 1h 2m | Assessment | Badge
Harnessing Data Volume & Velocity: Turning Big Data into Smart Data
In this course, you will explore the concept of smart data, its associated lifecycle and benefits, and the frameworks and algorithms that can help transition big data to smart data. Begin by comparing big data and smart data from the perspective of volume, variety, velocity, and veracity. Look at smart data capabilities for machine learning and artificial intelligence. Examine how to turn big data into smart data and how to use data volumes; list applications of smart data and smart processes, and recall use cases for smart data application. Then explore the lifecycle of smart data and the associated impacts and benefits. Learn the steps involved in transforming big data into smart data by using the k-NN (k-nearest neighbors) algorithm, and look at various smart data solution implementation frameworks. Recall how to turn smart data into business value by using data sharing and algorithms, and how to implement clustering on smart data. Finally, learn about integrating smart data and its impact on the optimization of data strategy. The exercise concerns transforming big data into smart data.
13 videos | 38m | Assessment | Badge
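As a rough illustration of the k-NN idea mentioned above, here is a minimal from-scratch sketch in Python; the points, labels, and function name are invented for the example and are not course code:

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    `train` is a list of feature tuples, `labels` the matching classes."""
    # Sort all training points by Euclidean distance to the query.
    dists = sorted(
        (math.dist(point, query), label) for point, label in zip(train, labels)
    )
    # Vote among the k closest labels.
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
classes = ["low", "low", "high", "high", "low", "high"]
print(knn_predict(points, classes, (2, 2)))  # low
print(knn_predict(points, classes, (8, 7)))  # high
```

In practice a library implementation (for example, scikit-learn's nearest-neighbors classes) would replace this loop for real data volumes.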
Data Rollbacks: Transaction Rollbacks & Their Impact
In this 9-video course, you will explore the data concepts of transactions, transaction management policies, and rollbacks. Discover how to implement transaction management and rollbacks by using SQL Server. Begin by learning about the concept and characteristics of the rollback process and its impact on transactions. Then take a look at various states of transactions, and prominent types of transactions along with their essential features (distributed and compensating transactions). Moving on, learn about implementing transaction management, along with certain essential elements like commit, savepoint, and release savepoint, using SQL Server. Learners recall the various transaction log operations and their characteristics (transaction recovery and transaction replication). You will learn to recognize the deadlock management capabilities and features provided by SQL Server using lock monitors and traces. Examine the essential rollback mechanism adopted by SQL Server, then see how SQL Server is used to roll back databases to a specific point in time. A concluding exercise involves implementing transaction management and rollbacks by using SQL Server.
10 videos | 35m | Assessment | Badge
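The course demonstrates commit, savepoint, and rollback in SQL Server; the same general pattern can be sketched with Python's built-in sqlite3 module (the table, rows, and savepoint name are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions explicitly
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")

cur.execute("BEGIN")
cur.execute("INSERT INTO accounts VALUES ('alice', 100)")
cur.execute("SAVEPOINT before_bob")              # partial-rollback target
cur.execute("INSERT INTO accounts VALUES ('bob', -50)")
cur.execute("ROLLBACK TO SAVEPOINT before_bob")  # undo only bob's insert
cur.execute("RELEASE SAVEPOINT before_bob")
cur.execute("COMMIT")                            # alice's insert survives

rows = cur.execute("SELECT name, balance FROM accounts").fetchall()
print(rows)  # [('alice', 100)]
```

SQL Server uses the same concepts with slightly different syntax (BEGIN TRANSACTION, SAVE TRANSACTION, ROLLBACK TRANSACTION).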
Data Rollbacks: Transaction Management & Rollbacks in NoSQL
During this 7-video course, learners will explore the differences in transaction management between NoSQL databases, such as MongoDB, and SQL databases, and discover how to implement change data capture in SQL and NoSQL databases. The first tutorial compares the transaction management architecture and capabilities of NoSQL and SQL. Then you will learn how to recognize the transaction management capabilities of MongoDB, along with its impact on consistency and availability. Next, learners will explore how to implement multi-document transaction management by using a replica set in MongoDB. Then the course moves on to examine change data capture, the process of capturing data changes, and the essential SQL Server change data capture features. You will examine the features of change streams in MongoDB, which leads on to creating change streams to enable real-time data change streaming for applications using MongoDB. To conclude the course, an exercise on MongoDB transactions and change streams compares the transaction management architecture and capabilities of NoSQL and SQL.
8 videos | 28m | Assessment | Badge
Final Exam: Data Ops
Final Exam: Data Ops will test your knowledge and application of the topics presented throughout the Data Ops track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
1 video | 32s | Assessment | Badge

COURSES INCLUDED

The Four Vs of Data
The four Vs (volume, variety, velocity, and veracity) of big data and data science are a popular paradigm used to extract meaning and value from massive data sets. In this course, learners discover the four Vs, their purpose and uses, and how to extract value by using the four Vs. Key concepts covered here include the four Vs, their roles in big data analytics, and the overall principle of the four Vs; and ways in which the four Vs relate to each other. Next, study variety and data structure and how they relate to the four Vs; validity and volatility and how they relate to the four Vs; and how the four Vs should be balanced in order to implement a successful big data strategy. Learners are shown the various use cases of big data analytics and the four Vs of big data, and how the four Vs can be leveraged to extract value from big data. Finally, review the four Vs of big data analytics, their differences, and how balance can be achieved.
13 videos | 39m | Assessment | Badge
Data Driven Organizations
Examine data-driven organizations, how they use data science, and the importance of prioritizing data in this 13-video course. Data-driven organizations are committed to holistically gathering and utilizing the data a business needs to gain competitive advantage. You will explore how to create a data-driven culture within an organization by involving management and training employees. You will examine analytic maturity as a metric to measure an organization's progress. Next, learn how to analyze data quality; how it is measured in a relative manner, not an absolute manner; and how it should be measured, weighed, and appropriately applied to determine the value or quality of a data set. You will learn the potential business effects of missing data and the three main reasons why data are not included in a collection: missing at random, missing due to data collection, and missing not at random. This course explores the wide range of impacts when there is duplicate data. You will examine how truncated or censored data can produce inconsistent results. Finally, you will explore data provenance and record-keeping.
13 videos | 1h 14m | Assessment | Badge
Raw Data to Insights: Data Ingestion & Statistical Analysis
Explore how statistical analysis can turn raw data into insights, and then examine how to use the data to improve business intelligence, in this 10-video course. Learn how to scrutinize and perform analytics on the collected data. The course explores several approaches for identifying values and insights from data by using various standard and intuitive principles, including data exploration and data ingestion, along with practical implementation using R. First, you will learn how to detect outliers by using R, and how to compare simple linear regression models, with and without outliers, to improve the quality of the data. Because today's data are available in diversified formats, with large volume and high velocity, this course next demonstrates how to use a variety of technologies to ingest data: Apache Kafka, Apache NiFi, Apache Sqoop, and Wavefront (a streaming metrics and analytics platform). Finally, you will learn how these tools can help users in data extraction, scalability, integration support, and security.
10 videos | 53m | Assessment | Badge
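The course performs outlier detection in R; an analogous sketch in Python, using the common 1.5 * IQR fence rule (the data set and function name are invented for the example):

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 14, 13, 95]  # 95 is an injected outlier
print(iqr_outliers(data))  # [95]
```

Fitting a regression with and without the flagged values, as the course does, then shows how strongly a single outlier can pull the fitted line.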
Raw Data to Insights: Data Management & Decision Making
To master data science, it is important to turn raw data into insights. In this 12-video course, you will learn to apply and implement various essential data correction techniques, transformation rules, deductive correction techniques, and predictive modeling using critical data analytical approaches by using R. The key concepts in this course include: the capabilities and advantages of the application of data-driven decision making; loading data from databases using R; preparing data for analysis; and the concept of data correction, using the essential approaches of simple transformation rules and deductive correction. Next, examine implementing data correction using simple transformation rules and deductive correction; the various essential distributed data management frameworks used to handle big data; and the approach of implementing data analytics using machine learning. Finally, learn how to implement exploratory data analysis by using R; to implement predictive modeling by using machine learning; how to correct data with deductive correction; and how to analyze data in R and facilitate predictive modeling with machine learning.
12 videos | 56m | Assessment | Badge
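A deductive correction, as mentioned above, fixes a record when an edit rule pins down the only possible value. Here is a minimal sketch in Python, assuming a hypothetical rule revenue - cost = profit (the field names are invented, and the course itself works in R):

```python
def deductive_correct(record):
    """If exactly one of revenue, cost, profit is missing (None),
    deduce it from the rule revenue - cost == profit."""
    r, c, p = record.get("revenue"), record.get("cost"), record.get("profit")
    if r is None and c is not None and p is not None:
        record["revenue"] = c + p
    elif c is None and r is not None and p is not None:
        record["cost"] = r - p
    elif p is None and r is not None and c is not None:
        record["profit"] = r - c
    return record

print(deductive_correct({"revenue": 100, "cost": 60, "profit": None}))
# {'revenue': 100, 'cost': 60, 'profit': 40}
```

A simple transformation rule, by contrast, rewrites a value unconditionally (for example, mapping "N/A" to None) without needing other fields to deduce it.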
Tableau Desktop: Real Time Dashboards
To become a data science expert, you must master the art of data visualization. This 12-video course explores how to create and use real-time dashboards with Tableau. Begin with an introduction to real-time dashboards and the differences between real-time and streaming data. Next, take a look at different cloud data sources. Learn how to build a dashboard in Tableau and update it in real time. Discover how to organize your dashboard by adding objects and adjusting the layout. Then customize and format different aspects of dashboards in Tableau and add interactivity using actions like filtering. Look at creating a dashboard starter, a prebuilt dashboard that can be used with Tableau Online to connect to cloud data sources. Add extensions to your dashboard such as the Tableau Extensions API (application program interface). Explore how to put together a simple dashboard story, which consists of a sequence of sheets (each sheet in the sequence is called a story point), and how to share a dashboard in Tableau. In the concluding exercise, learners create a dashboard starter.
13 videos | 1h 7m | Assessment | Badge
Storytelling with Data: Introduction
In this 10-video course, learners can explore the concept of storytelling with data, including the processes involved in storytelling and interpreting data contexts. You will explore prominent types of analysis, visualizations, and graphics tools useful for storytelling. Become familiar with various processes: storytelling with analysis, and its types; storytelling with visualization; and storytelling with scatter plots, line charts, heat maps, and bar charts. Popular software programs are also used: d3.js (Data-Driven Documents), WebDataRocks, Birt, Google Charts, and Cytoscape. Users of storytelling include three types: strategists, who actually build strategy for story making; developers or designers, who often use videos, images, and infographics to create experience architecture; and marketers or salespeople, who use different modes including visual social networks, calendaring, messaging in visual form, digital signage, UGC or employee advocacy, story selling, live streaming, or data storytelling. A concluding exercise asks learners to recall elements of storytelling context; specify types of analysis used to facilitate storytelling with data; list prominent visualizations used to facilitate storytelling with data; and list prominent graphical tools useful for data exploration.
10 videos | 46m | Assessment | Badge
Storytelling with Data: Tableau & Power BI
To convey the true meaning of data most effectively, data scientists and data management professionals need to be able to harness the capabilities of different approaches to storytelling with data. This 14-video course explores how to select the most effective visuals for a storytelling project, how to eliminate clutter, and how to choose the best practices for story design. In addition, learners will see demonstrations of how to work with Tableau and Power BI bar charts to facilitate storytelling with data. Learn to select appropriate visuals for your data storytelling project; how to use slopegraphs; and the important steps for identifying clutter and de-cluttering data visuals. Explore the Gestalt principles, as well as common problems of visual story design. In the concluding exercise, learners will load data by using Power BI from a CSV file; create a bar chart by using the data; and create a pie chart to show the whole-part relation in the data.
14 videos | 56m | Assessment | Badge
Python for Data Science: Basic Data Visualization Using Seaborn
Explore Seaborn, a Python library used in data science that provides an interface for drawing graphs that convey a lot of information and are also visually appealing. To take this course, learners should be comfortable programming in Python and using Jupyter notebooks; familiarity with Pandas or NumPy would be helpful, but is not required. The course explores how Seaborn provides higher-level abstractions over Python's Matplotlib, how it is tightly integrated with the PyData stack, and how it integrates with other data structure libraries such as NumPy and Pandas. You will learn to visualize the distribution of a single column of data in a Pandas DataFrame by using histograms and the kernel density estimation curve, and then slowly begin to customize the aesthetics of the plot. Next, learn to visualize bivariate distributions, which are data with two variables in the same plot, and see the various ways to do it in Seaborn. Finally, you will explore different ways to generate regression plots in Seaborn.
11 videos | 1h 6m | Assessment | Badge
Python for Data Science: Advanced Data Visualization Using Seaborn
Explore Seaborn, a Python library used in data science that provides an interface for drawing graphs that convey a lot of information and are also visually appealing. To take this course, learners should be comfortable programming in Python, have some experience using Seaborn for basic plots and visualizations, and should be familiar with plotting distributions, as well as simple regression plots. You will work with continuous variables to modify plots and put them into a context that can be shared. Next, learn how to plot categorical variables by using box plots, violin plots, swarm plots, and FacetGrids (lattice or trellis plotting). You will learn to plot a grid of graphs for each category of your data. Learners will explore Seaborn standard aesthetic configurations, including the color palette and style elements. Finally, this course teaches learners how to tweak displayed data to convey more information from the graphs.
11 videos | 1h 3m | Assessment | Badge
Data Science Statistics: Using Python to Compute & Visualize Statistics
Learners continue their exploration of data science in this 10-video course, which deals with using the NumPy, Pandas, and SciPy libraries to perform various statistical summary operations on real data sets. This beginner-level course assumes some prior experience with Python programming and an understanding of basic statistical concepts such as mean, standard deviation, and correlation. The course opens by exploring different ways to visualize data by using the Matplotlib library, including univariate and bivariate distributions. Next, you will move to computing descriptive statistics for distributions, such as variance and standard error, by using the NumPy, Pandas, and SciPy libraries. Learn about the concept of the z-score, in which every value in a distribution is expressed in terms of the number of standard deviations from the mean value. Then cover the computation of the z-score for a series using SciPy. In the closing exercise, you will use the Matplotlib data visualization library to plot three points represented by given coordinates, then enumerate all of the details conveyed in a box plot.
10 videos | 1h 15m | Assessment | Badge
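The z-score computation described above can be sketched with SciPy and checked by hand (the sample values below are invented for the example):

```python
import numpy as np
from scipy import stats

# Express each value as its distance from the mean, measured in
# standard deviations.
values = np.array([2.0, 4.0, 6.0, 8.0])

z = stats.zscore(values)  # population std (ddof=0) by default
manual = (values - values.mean()) / values.std()
print(np.round(z, 3))  # [-1.342 -0.447  0.447  1.342]
```

A z-score near 0 means a value sits close to the mean; values beyond roughly +/-3 are often treated as outliers.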
Advanced Visualizations & Dashboards: Visualization Using Python
In this course, learners explore approaches to building and implementing visualizations for data science, as well as plotting and graphing using Python libraries such as Matplotlib, ggplot, bokeh, and Pygal. Key concepts covered here include the importance and relevance of data visualization from the business perspective; libraries that can be used in Python to implement data visualization and how to set up a data visualization environment using Python tools and libraries; and prominent data visualization libraries that can be used with Matplotlib. Then see how to create bar charts by using ggplot in Python; how to create charts using the bokeh and Pygal libraries in Python; and criteria that should be considered when selecting an appropriate data visualization library. Learners observe how to create interactive graphs and image files; how to plot graphs using lines and markers; and how to plot multiple lines in a single graph with different line styles and markers. Finally, see how to create a line chart with Pygal, create an HTML directive to render the line chart, and render the line chart.
12 videos | 37m has Assessment available Badge
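A minimal sketch of a Matplotlib bar chart written to an image file, as covered above; the category data are hypothetical, and the course shows similar charts with ggplot, Bokeh, and Pygal:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical category data.
categories = ["A", "B", "C"]
counts = [10, 24, 17]

fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_title("Counts by category")
ax.set_xlabel("Category")
ax.set_ylabel("Count")
fig.savefig("bar_chart.png")  # write the chart to an image file
```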
R for Data Science: Data Visualization
Continue exploring the advantageous aspects of the programming language R in this 8-video Skillsoft Aspire course. An essential skill for statistical computing and graphics, R has become the tool of choice for data science professionals in every industry and field. Learn how to create reproducible high-quality analyses, while taking advantage of R's great graphic and charting capabilities. Learners will explore how to use R to create plots and charts of data. Key concepts covered in this course include creating a scatter plot by using the built-in R method; creating a line graph on a time series data set; and creating a bar chart with the built-in R function barplot. You will learn how to create a box and whisker plot by using the built-in mtcars data set; to create a histogram with the built-in R function hist, and the equivalent by using the ggplot2 library functions; and how to create a bubble plot with the ggplot2 library. Finally, learn how to use an appropriate plot to visualize data.
8 videos | 32m has Assessment available Badge
Advanced Visualizations & Dashboards: Visualization Using R
Discover how to build advanced charts by using Python and Jupyter Notebook for data science in this course, which explores R and ggplot2 visualization capabilities and how to build charts and graphs with these tools. Key concepts in this course include different types of charts that can be implemented and their relevance in data visualization; how to create a stacked bar plot; how to create Matplotlib animations; and how to use NumPy and Plotly to create interactive 3D plots in Jupyter Notebook. Learners are shown the graphical capabilities of R from the perspective of data visualization; how to build heat maps and scatter plots using R; and how to implement correlograms and build area charts using R. Next, you will explore ggplot2 capabilities from the perspective of data visualization; learn how to build and customize graphs by using ggplot2 in R; and how to create heat maps, a representation of data in the form of a map or diagram. Finally, learn to create scatter plots and create area charts with R.
11 videos | 34m has Assessment available Badge
Data Recommendation Engines
This 13-video course explores recommendation engines, systems which provide various users with items or products that they may be interested in by observing their previous purchasing, search, and behavior histories. They are used in many industries to help users find or explore products and content; for example, to find movies, news, insurance, and a myriad of other products and services. Learners will examine the three main types of recommendation systems: item-based, user-based or collaborative, and content-based. The course next examines how to collect data to be used for learning, training, and evaluation. You will learn how to use RStudio, an open-source IDE (integrated development environment) to import, filter, and massage data into data sets. Learners will create an R function that will give a score to an item based on other user ratings and similarity scores. You will learn to use R to create a function called compareUsers, to create an item-to-item similarity or content score. Finally, learn to validate and score by using the built-in R function RMSE (root mean square error).
13 videos | 1h 4m has Assessment available Badge
Data Insights, Anomalies, & Verification: Handling Anomalies
In this 9-video course, learners examine statistical and machine learning implementation methods and how to manage anomalies and improve data for better data insights and accuracy. The course opens with a thorough look at the sources of data anomalies and comparing differences between data verification and validation. You will then learn about approaches to facilitating data decomposition and forecasting, and steps and formulas used to achieve the desired outcome. Next, recall approaches to data examination and use randomization tests, the null hypothesis, and Monte Carlo. Learners will examine anomaly detection scenarios and categories of anomaly detection techniques and how to recognize prominent anomaly detection techniques. Then learn how to facilitate contextual data and collective anomaly detection by using scikit-learn. After moving on to tools, you will explore the most prominent anomaly detection tools and their key components, and recognize the essential rules of anomaly detection. The concluding exercise shows how to implement anomaly detection with scikit-learn, R, and boxplot.
10 videos | 45m has Assessment available Badge
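The boxplot-based detection used in the closing exercise rests on the interquartile-range rule, which can be sketched in Python; the data here are hypothetical:

```python
import numpy as np

# Hypothetical sample with one obvious anomaly.
data = np.array([10.0, 11.0, 12.0, 10.5, 11.5, 12.5, 11.0, 42.0])

# Boxplot rule: points beyond 1.5 * IQR of the quartiles are outliers.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # the value 42.0 is flagged
```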
Data Insights, Anomalies, & Verification: Machine Learning & Visualization Tools
Discover how to use machine learning methods and visualization tools to manage anomalies and improve data for better data insights and accuracy. This 10-video course begins with an overview of machine learning anomaly detection techniques, by focusing on the supervised and unsupervised approaches of anomaly detection. Then learners compare the prominent anomaly detection algorithms, learning how to detect anomalies by using R, RCP, and the devtools package. Take a look at the components of general online anomaly detection systems and then explore the approaches of using time series and windowing to detect online or real-time anomalies. Examine prominent real-world use cases of anomaly detection, along with learning the steps and approaches adopted to handle the entire process. Learn how to use boxplots and scatter plots for anomaly detection. Look at the mathematical approach to anomaly detection and implementing anomaly detection using a K-means machine learning approach. Conclude your coursework with an exercise on implementing anomaly detection with visualization, cluster, and mathematical approaches.
11 videos | 50m has Assessment available Badge
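One way to realize the K-means approach mentioned above is to flag points that lie unusually far from their cluster centroid; a minimal scikit-learn sketch, with entirely hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical 2-D data: two tight clusters plus one distant point.
cluster_a = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
anomaly = np.array([[10.0, -10.0]])
points = np.vstack([cluster_a, cluster_b, anomaly])

# Fit K-means, then measure each point's distance to its centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
centroids = km.cluster_centers_[km.labels_]
distances = np.linalg.norm(points - centroids, axis=1)

# Flag points whose distance is unusually large (here: > mean + 3 std).
threshold = distances.mean() + 3 * distances.std()
flagged = np.where(distances > threshold)[0]
print(flagged)  # index of the distant point
```

The threshold rule is one simple choice; in practice the cutoff is tuned to the tolerated false-positive rate.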
Data Science Statistics: Applied Inferential Statistics
Explore how different t-tests can be performed by using the SciPy library for hypothesis testing in this 10-video course, which continues your explorations of data science. This beginner-level course assumes prior experience with Python programming, along with an understanding of such terms as skewness and kurtosis and concepts from inferential statistics, such as t-tests and regression. Begin by learning how to perform three different t-tests (the one-sample t-test, the independent or two-sample t-test, and the paired t-test) on various samples of data using the SciPy library. Next, learners explore how to interpret results to accept or reject a hypothesis. The course covers, as an example, how to fit a regression model on the returns on an individual stock, and on the S&P 500 Index, by using the scikit-learn library. Finally, watch demonstrations of measuring skewness and kurtosis in a data set. The closing exercise asks you to list three different types of t-tests, identify values which are returned by t-tests, and write code to calculate the percentage returns from time series data using Pandas.
10 videos | 1h 18m has Assessment available Badge
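The three t-tests named above can be sketched with SciPy; all samples below are hypothetical, drawn from known distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples drawn from known distributions.
sample = rng.normal(loc=5.0, scale=1.0, size=100)
before = rng.normal(loc=50.0, scale=5.0, size=30)
after = before + rng.normal(loc=2.0, scale=1.0, size=30)

# One-sample t-test: is the sample mean plausibly 5.0?
t1, p1 = stats.ttest_1samp(sample, popmean=5.0)

# Independent (two-sample) t-test: do two groups share a mean?
t2, p2 = stats.ttest_ind(before, after)

# Paired t-test: did the same subjects change between measurements?
t3, p3 = stats.ttest_rel(before, after)

# A small p-value (e.g. < 0.05) rejects the null hypothesis;
# a large one fails to reject it.
print(p1, p2, p3)
```

The paired test is far more sensitive here than the independent test because it removes the per-subject variation before testing the +2.0 shift.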
Data Research Techniques
To master data science, you must learn the techniques surrounding data research. In this 10-video course, learners will discover how to apply essential data research techniques, including JMP measurement, and how to evaluate data by using descriptive and inferential methods. Begin by recalling the fundamental concept of data research that can be applied to data inference. Then learners look at steps that can be implemented to draw data hypothesis conclusions. Examine values, variables, and observations that are associated with data from the perspective of quantitative and classification variables. Next, view the different scales of standard measurements with a critical comparison between generic and JMP models. Then learn about the key features of nonexperimental and experimental research approaches when using real-time scenarios. Compare differences between descriptive and inferential statistical analysis and explore the prominent usage of different types of inferential tests. Finally, look at the approaches and steps involved in the implementation of clinical data research and sales data research using real-time scenarios. The concluding exercise involves implementing data research.
11 videos | 32m has Assessment available Badge
Data Research Exploration Techniques
This course explores EDA (exploratory data analysis) and data research techniques necessary to communicate with data management professionals involved in application, implementation, and facilitation of the data research mechanism. You will examine EDA as an important way to analyze extracted data by applying various visual and quantitative methods. In this 10-video course, learners acquire data exploration techniques to examine different data dimensions and derive value from the data. You will learn proper methodologies and principles for various data exploration techniques, analysis, decision-making, and visualizations to gain valuable insights from the data. This course covers how to practically implement data exploration by using the R random number generator, Python, linear algebra, and plots. You will use EDA to build learning sets which can be utilized by various machine learning algorithms or even statistical modeling. You will learn to apply univariate visualization, and to use multivariate visualizations to identify the relationship among the variables. Finally, the course explores dimensionality reduction to apply different dimension reduction algorithms to reduce the data to a state which is useful for analytics.
11 videos | 49m has Assessment available Badge
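A minimal sketch of the univariate and multivariate exploration steps described above, using Pandas on a hypothetical data set:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data set: two related measurements and one label.
n = 200
x = rng.normal(10.0, 2.0, n)
df = pd.DataFrame({
    "height": x,
    "weight": 3.0 * x + rng.normal(0.0, 1.0, n),  # correlated with height
    "group": rng.choice(["a", "b"], n),
})

# Univariate exploration: the distribution of each numeric column.
print(df[["height", "weight"]].describe())

# Multivariate exploration: relationships among the variables.
print(df[["height", "weight"]].corr())
```

The `describe()` summary and correlation matrix are the quantitative counterparts of the univariate and multivariate visualizations the course covers.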
Data Research Statistical Approaches
This 12-video course explores implementation of statistical data research algorithms using R to generate random numbers from standard distributions, and visualizations using R to graphically represent the outcome of data research. You will learn to apply statistical algorithms like the PDF (probability density function), CDF (cumulative distribution function), binomial distribution, and interval estimation for data research. Learners become able to identify the relevance of discrete versus continuous distributions in simplifying data research. This course then demonstrates how to plot visualizations by using R to graphically predict the outcomes of data research. Next, learn to use interval estimation to derive an estimate for an unknown population parameter, and learn to implement point and interval estimation by using R. Learn data integration techniques to aggregate data from different administrative sources. Finally, you will learn to use Python libraries to create histograms, scatter plots, and box plots; and use Python to handle missing values and outliers. The concluding exercise involves loading data in R, generating a scatter chart, and deleting points outside the limit of the x vector and y vector.
13 videos | 42m has Assessment available Badge
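The binomial distribution and interval-estimation ideas above can be sketched in Python with SciPy (the course also implements them in R); the sample values are hypothetical:

```python
import numpy as np
from scipy import stats

# Binomial distribution: 10 trials, success probability 0.5.
n, p = 10, 0.5

# PMF (the discrete analogue of a PDF): P(X = 5).
print(stats.binom.pmf(5, n, p))

# CDF: P(X <= 5).
print(stats.binom.cdf(5, n, p))

# Interval estimation: a 95% confidence interval for the mean
# of a hypothetical sample, using the normal approximation.
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3])
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
low, high = stats.norm.interval(0.95, loc=mean, scale=sem)
print(round(low, 3), round(high, 3))
```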
Machine & Deep Learning Algorithms: Introduction
Examine fundamentals of machine learning (ML) and how Pandas ML can be used to build ML models in this 7-video course. How support vector machines perform classification of data is also covered. Begin by learning about different kinds of machine learning algorithms, such as regression, classification, and clustering, as well as their specific applications. Then look at the process involved in learning relationships between input and output during the training phase of ML. This leads to an introduction to Pandas ML, and the benefits of combining Pandas, scikit-learn, and XGBoost into a single library to ease the task of building and evaluating ML models. You will learn about support vector machines, a supervised machine learning algorithm, and how they are used to find a hyperplane to divide data points into categories. Learners then study the concept of overfitting in machine learning, the problems associated with a model overfitted to training data, and how to mitigate the issue. The course concludes with an exercise in machine learning and classification.
7 videos | 45m has Assessment available Badge
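A minimal sketch of the support vector machine classification described above, using scikit-learn on a synthetic (hypothetical) data set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical two-class data set.
X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=3, n_redundant=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A support vector machine finds a hyperplane that divides
# the data points into categories.
model = SVC(kernel="linear").fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```

Evaluating on a held-out test set, rather than the training data, is also the standard guard against the overfitting problem the course discusses.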
Machine & Deep Learning Algorithms: Regression & Clustering
In this 8-video course, explore the fundamentals of regression and clustering and discover how to use a confusion matrix to evaluate classification models. Begin by examining application of a confusion matrix and how it can be used to measure the accuracy, precision, and recall of a classification model. Then study an introduction to regression and how it works. Next, take a look at the characteristics of regression such as simplicity and versatility, which have led to widespread adoption of this technique in a number of different fields. Learn to distinguish between supervised learning techniques such as regression and classification, and unsupervised learning methods such as clustering. You will look at how clustering algorithms are able to find data points containing common attributes and thus create logical groupings of data. Recognize the need to reduce large data sets with many features into a handful of principal components with the PCA (Principal Component Analysis) technique. Finally, conclude the course with an exercise recalling concepts such as precision and recall, and use cases for unsupervised learning.
8 videos | 48m has Assessment available Badge
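The confusion matrix, precision, and recall discussed above can be sketched with scikit-learn's metrics module; the labels and predictions here are hypothetical:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision: of the points predicted positive, how many were right?
# Recall: of the actual positives, how many did the model find?
print(precision_score(y_true, y_pred))  # 0.8
print(recall_score(y_true, y_pred))     # 0.8
```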
Machine & Deep Learning Algorithms: Data Preparation in Pandas ML
Classification, regression, and clustering are some of the most commonly used machine learning (ML) techniques and there are various algorithms available for these tasks. In this 10-video course, learners can explore their application in Pandas ML. First, examine how to load data from a CSV (comma-separated values) file into a Pandas data frame and prepare the data for training a classification model. Then use the scikit-learn library to build and train a LinearSVC classification model and evaluate its performance with available model evaluation functions. You will explore how to install Pandas ML and define and configure a ModelFrame, then compare training and evaluation in Pandas ML with equivalent tasks in scikit-learn. Learn how to build a linear regression model by using Pandas ML. Then evaluate a regression model by using metrics such as r-square and mean squared error, and visualize its performance with Matplotlib. Work with ModelFrames for feature extraction and encoding, and configure and build a clustering model with the K-Means algorithm, analyzing data clusters to determine unique characteristics. Finally, complete an exercise on regression, classification, and clustering.
10 videos | 1h 3m has Assessment available Badge
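Pandas ML's ModelFrame wraps scikit-learn, so the regression evaluation the course covers (r-square and mean squared error) can be sketched directly with scikit-learn on hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Hypothetical data with a known linear relationship plus noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# r-square: share of variance explained; MSE: average squared error.
print(round(r2_score(y, predictions), 3))
print(round(mean_squared_error(y, predictions), 3))
```

With noise of standard deviation 0.5, the fitted slope lands close to the true 2.5 and the MSE close to the noise variance of 0.25.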
Machine & Deep Learning Algorithms: Imbalanced Datasets Using Pandas ML
The imbalanced-learn library that integrates with Pandas ML (machine learning) offers several techniques to address the imbalance in datasets used for classification. In this course, explore oversampling, undersampling, and a combination of techniques. Begin by using Pandas ML to explore a data set in which samples are not evenly distributed across target classes. Then apply the technique of oversampling with the RandomOverSampler class in the imbalanced-learn library; build a classification model with oversampled data; and evaluate its performance. Next, learn how to create a balanced data set with the Synthetic Minority Oversampling Technique (SMOTE) and how to perform undersampling operations on a data set by applying Near Miss, Cluster Centroids, and neighborhood cleaning rule techniques. Next, look at ensemble classifiers for imbalanced data, applying combination samplers for imbalanced data, and finding correlations in a data set. Learn how to build a multilabel classification model, explore the use of principal component analysis, or PCA, and how to combine use of oversampling and PCA in building a classification model. The exercise involves working with imbalanced data sets.
12 videos | 1h 23m has Assessment available Badge
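The course uses imbalanced-learn's RandomOverSampler; the idea behind random oversampling can be sketched with plain scikit-learn utilities on hypothetical data:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical imbalanced data set: 90 majority vs 10 minority samples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of samples.
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=90, random_state=0)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_balanced))  # equal class counts after oversampling
```

SMOTE goes a step further than this sketch by synthesizing new minority points between existing ones instead of duplicating them.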
Creating Data APIs Using Node.js
Data science skills are of no value unless you have data to work with. Automating your data retrieval through application program interfaces (APIs) is a process that any data scientist must understand. In this 12-video course, learners will explore how to create RESTful OAuth APIs using Node.js. Begin with API prerequisites, installing the prerequisites to create an API using Node.js, and building a RESTful API using Node.js and Express.js. You will next discover how to build a RESTful API with OAuth in Node.js, before examining what OAuth is and why it is required. Learn about creating an HTTP server using Hapi.js; then look at how to use modules in your API using Node.js, and how to return data with JSON using Node.js. Learners explore using nodemon for development workflow in Node.js and learn how to make HTTP requests with Node.js by using the request library. Use Postman to test your Node.js API and deploy APIs with Node.js. Connect to social media APIs with Node.js to return data. A concluding exercise deals with building RESTful APIs.
13 videos | 1h 30m has Assessment available Badge
Final Exam: Data Scientist
Final Exam: Data Scientist will test your knowledge and application of the topics presented throughout the Data Scientist track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
1 video | 32s has Assessment available Badge

EARN A DIGITAL BADGE WHEN YOU COMPLETE THESE TRACKS

Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.