AWS Certified Machine Learning Specialty: Exploratory Data Analysis Competency

  • 34m
  • 34 questions
The AWS Certified Machine Learning Specialty: Exploratory Data Analysis Competency benchmark measures your ability to explore, sanitize, and prepare data for modeling; perform feature engineering; and analyze and visualize data for machine learning. A learner who scores high on this benchmark demonstrates the skills needed to identify and handle dirty data, transform data, recognize labeled data and identify mitigation strategies, identify and extract features from data sets, graph data, interpret descriptive statistics, and perform clustering.

Topics covered

  • define Bernoulli, uniform, and binomial data distributions
  • define binning and discretization as the process of transforming numerical variables into categorical counterparts
  • define data scaling and normalization and describe why it is important to standardize independent variables
  • define normal, Poisson, and exponential data distributions
  • define the main Amazon QuickSight processes and terms
  • describe the bag-of-words model and compare it to TF-IDF
  • describe how Amazon SageMaker Ground Truth works and name its major benefits
  • describe how data outliers impact data analysis and name common ways to deal with outliers
  • describe how dimensions and features are linked to each other, specifying their impacts on building accurate ML models
  • describe how missing data impacts ML models and name ways to deal with missing data
  • describe how tables, databases, and data catalogs work in Amazon Athena and how to query data from other AWS services in Athena
  • describe how the Apache Spark open-source framework works with Amazon Elastic MapReduce (EMR) and its real-world use cases
  • describe how to perform one-hot encoding and its main purpose
  • describe how to use Amazon SageMaker Feature Store to fully manage repositories for ML features
  • describe the concept of n-grams and why they are used for machine learning
  • describe the process of term frequency-inverse document frequency (TF-IDF) and its uses in text mining
  • describe the use cases for Amazon Elastic MapReduce (EMR), recognize when to deploy it, and compare EMR and Glue
  • differentiate between categorical and numerical data types
  • name and describe modern graphic types used in data analysis
  • name and describe traditional graphic types used in data analysis
  • outline data shuffling and define its role in reducing bias and building more robust models
  • outline how data transformation can be used to make data more useful for data analysis
  • outline how the Apache Hadoop open-source framework works with Amazon Elastic MapReduce (EMR) and its real-world use cases
  • recognize the basic principles behind text feature engineering
  • recognize what's meant by advanced time series analysis concepts, such as trends, seasonality, and autocorrelation
  • specify how skewed data can affect ML classification and ways to address it
  • use Spark and EMR workflows to prepare data for a TF-IDF problem
  • work with Amazon Athena to create databases and tables and run queries
  • work with Amazon QuickSight to create a simple multi-visual analysis and a dashboard
  • work with Amazon SageMaker Feature Store to achieve feature consistency and standardization
  • work with commonly used feature engineering techniques on real data
  • work with data, analyses, visuals, ML insights, and dashboards in Amazon QuickSight
  • work with Python toolkits to implement various types of data visualization
  • work with time series data in Python, implementing data analysis pipelines
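
As a rough illustration of three of the feature engineering topics listed above (binning, one-hot encoding, and bag-of-words versus TF-IDF), here is a minimal sketch in plain Python. It is not taken from the benchmark itself: the function names, the age thresholds, and the sample sentences are all hypothetical, and the TF-IDF weighting shown is the basic unsmoothed form (term frequency times log(N / document frequency)); library implementations often use smoothed variants.

```python
import math
from collections import Counter


def bin_value(value, edges, labels):
    """Binning: map a numeric value to a categorical label.

    `edges` holds the upper bound (exclusive) of every bin except the last,
    so len(labels) == len(edges) + 1.
    """
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]


def one_hot(value, categories):
    """One-hot encoding: a 0/1 vector with a single 1 at the matching category."""
    return [1 if value == category else 0 for category in categories]


def bag_of_words(doc):
    """Bag-of-words: raw term counts, ignoring word order."""
    return Counter(doc.lower().split())


def tf_idf(docs):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = [bag_of_words(doc) for doc in docs]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return weights


if __name__ == "__main__":
    # Hypothetical age bins, chosen only for illustration.
    labels = ["minor", "adult", "senior"]
    category = bin_value(30, edges=[18, 65], labels=labels)
    print(category, one_hot(category, labels))

    # Hypothetical two-document corpus.
    docs = ["the cat sat", "the dog barked"]
    print(bag_of_words(docs[0]))
    print(tf_idf(docs)[0])
```

In this sketch, a term that appears in every document (such as "the" in the sample corpus) gets a TF-IDF weight of zero, whereas bag-of-words counts it like any other term. That difference is the main point of comparing the two representations.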