Final Exam: Data Analyst
1 Video | 30m 32s
- Includes Assessment
- Earns a Badge
Final Exam: Data Analyst will test your knowledge and application of the topics presented throughout the Data Analyst track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
WHAT YOU WILL LEARN
- build and run the application and confirm the output using HDFS from both the command line and the web application
- compare and contrast SQL and NoSQL database solutions
- configure a JDBC connection on Glue to the Redshift cluster
- configure and view permissions for individual files and directories using the getfacl and chmod commands
- configure HDFS using the hdfs-site.xml file and identify the properties which can be set from it
- crawl data stored in a DynamoDB table
- create and configure a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service
- create and configure simple graphs with lines and markers using the Matplotlib data visualization library
- create and load data into an RDD
- create data frames in R
- create matrices in R
- create vectors in R
- define linear regression
- define the contents of a DataFrame using the SQLContext
- define the interquartile range of a dataset and enumerate its properties
- define the mean of a dataset and enumerate its properties
- delete a Google Cloud Dataproc cluster and all of its associated resources
- deploy DynamoDB in the Amazon Web Services cloud
- describe and apply the different techniques involved in handling datasets where some information is missing
- describe NoSQL stores and how they are used
- describe the concept of hierarchical index or multi-index and why it can be useful
- describe the ETL process and different tools available
- describe the options available when iterating over 1-dimensional and multi-dimensional arrays
- draw the shape of a Gaussian distribution and enumerate its defining properties
- edit individual cells and entire rows and columns in a Pandas DataFrame
- execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS NameNode web UIs
- explain the concept of hierarchical index or multi-index and why it can be useful
- export the contents of a DataFrame into files of various formats
- identify different tools available for data management
- identify the various GCP services used by Dataproc when provisioning a cluster
- import and export data in R
- initialize a Spark DataFrame from the contents of an RDD
- install Pandas and create a Pandas Series
- list the six phases of the data lifecycle
- load data into a Redshift cluster from S3 buckets
- read data from an Excel spreadsheet
- read data from files and write data to files using the Python Pandas library
- recall how Apache ZooKeeper enables the HDFS NameNode and YARN ResourceManager to run in high-availability mode
- recall the steps involved in building a MapReduce application and the specific workings of the Map phase in processing each row of data in the input file
- recognize and deal with missing data in R
- recognize the challenges involved in processing big data and the options available to address them such as vertical and horizontal scaling
- retrieve specific parts of an array using row and column indices
- run ETL scripts using Glue
- run the application and examine the outputs generated to get the word frequencies in the input text document
- set up a JDBC connection on Glue to the Redshift cluster
- specify the configurations of the MapReduce applications in the Driver program and the project's pom.xml file
- standardize a distribution to express its values as z-scores and use Pandas to generate a correlation and covariance matrix for your dataset
- transfer files from your local file system to HDFS using the copyFromLocal command
- use fancy indexing with arrays using an index mask
- use NumPy to compute statistics such as the mean and median on your data
- use NumPy to compute the correlation and covariance of two distributions and visualize their relationship with scatterplots
- use the dplyr library to load data frames
- use the get and getmerge functions to retrieve one or multiple files from HDFS
- use the ggplot2 library to visualize data using R
- use the NumPy library to manipulate arrays and the Pandas library to load and analyze a dataset
- using the independent t-test and with a related sample using a paired t-test using the SciPy library
- using the mutate method
- work with the YARN Cluster Manager and HDFS NameNode web applications that come packaged with Hadoop
- write a simple bash script
IN THIS COURSE
1. Data Analyst (33s)
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of this course, which can be shared on any social network or business platform. Digital badges are yours to keep, forever.