Final Exam: Data Analyst

  • 1 video | 32s
  • Includes Assessment
  • Earns a Badge
Final Exam: Data Analyst will test your knowledge and application of the topics presented throughout the Data Analyst track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.


  • list the six phases of the data lifecycle
    use the ggplot2 library to visualize data using R
    create vectors in R
    crawl data stored in a DynamoDB table
    compare and contrast SQL and NoSQL database solutions
    standardize a distribution to express its values as z-scores and use Pandas to generate a correlation and covariance matrix for your dataset
    specify the configurations of the MapReduce applications in the Driver program and the project's pom.xml file
    explain the concept of a hierarchical index, or multi-index, and why it can be useful
    use the NumPy library to manipulate arrays and the Pandas library to load and analyze a dataset
    delete a Google Cloud Dataproc cluster and all of its associated resources
    configure and view permissions for individual files and directories using the getfacl and chmod commands
    recall how Apache ZooKeeper enables the HDFS NameNode and YARN ResourceManager to run in high-availability mode
    load data into a Redshift cluster from S3 buckets
    describe the options available when iterating over 1-dimensional and multi-dimensional arrays
    run the application and examine the outputs generated to get the word frequencies in the input text document
    use the get and getmerge functions to retrieve one or multiple files from HDFS
    export the contents of a DataFrame into files of various formats
    use the SciPy library to compare independent samples using the independent t-test and related samples using a paired t-test
    create data frames in R
    create and configure simple graphs with lines and markers using the Matplotlib data visualization library
    configure HDFS using the hdfs-site.xml file and identify the properties that can be set in it
    work with the YARN Cluster Manager and HDFS NameNode web applications that come packaged with Hadoop
    install Pandas and create a Pandas Series
    recognize and deal with missing data in R
    use NumPy to compute the correlation and covariance of two distributions and visualize their relationship with scatterplots
    recognize the challenges involved in processing big data and the options available to address them, such as vertical and horizontal scaling
    create and configure a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service
    define the inter-quartile range of a dataset and enumerate its properties
    draw the shape of a Gaussian distribution and enumerate its defining properties
  • execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS NameNode web UIs
    set up a JDBC connection on Glue to the Redshift cluster
    import and export data in R
    deploy DynamoDB in the Amazon Web Services cloud
    use the mutate method
    create matrices in R
    identify different tools available for data management
    describe the ETL process and different tools available
    initialize a Spark DataFrame from the contents of an RDD
    configure a JDBC connection on Glue to the Redshift cluster
    describe NoSQL stores and how they are used
    create and load data into an RDD
    read data from an Excel spreadsheet
    run ETL scripts using Glue
    use fancy indexing with arrays using an index mask
    write a simple bash script
    identify the various GCP services used by Dataproc when provisioning a cluster
    read data from files and write data to files using the Python Pandas library
    define linear regression
    edit individual cells and entire rows and columns in a Pandas DataFrame
    define the mean of a dataset and enumerate its properties
    retrieve specific parts of an array using row and column indices
    recall the steps involved in building a MapReduce application and the specific workings of the Map phase in processing each row of data in the input file
    use NumPy to compute statistics such as the mean and median on your data
    describe the concept of a hierarchical index, or multi-index, and why it can be useful
    define the contents of a DataFrame using the SQLContext
    describe and apply the different techniques involved in handling datasets where some information is missing
    use the dplyr library to load data frames
    build and run the application and confirm the output using HDFS from both the command line and the web application
    transfer files from your local file system to HDFS using the copyFromLocal command
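
As a rough illustration of one objective listed above (standardizing a distribution to express its values as z-scores), here is a minimal sketch. It is not course material. It assumes the population standard deviation and uses only Python's standard library rather than Pandas, and the `z_scores` helper and the sample values are hypothetical names and data chosen for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as a number of standard deviations from the mean.

    Assumption: the population standard deviation (pstdev) is used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so the z-score is undefined.
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(sample)
print(standardized)
```

A standardized distribution has a mean of 0 and a standard deviation of 1, which is a quick way to check the result.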




Skillsoft gives you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.

Digital badges are yours to keep, forever.

