SRE Proficiency (Advanced Level)

  • 32 minutes
  • 32 questions
The SRE Proficiency benchmark measures whether a learner has had extensive exposure to SRE technologies, practices, and principles across multiple platforms. A learner who scores high on this benchmark demonstrates professional proficiency in all major areas of SRE operations across a variety of platforms and deployments.

Topics covered

  • define the concept of criticality, name four criticality values, and identify the purpose of criticality and each value
  • define the mean time between failures (MTBF) metric and outline when and how to use it for SRE work
  • define the mean time to resolve (MTTR) metric and outline when and how to use it for SRE work
  • define the mean time to respond (also abbreviated MTTR) metric and describe why it might be used in SRE
  • define what is meant by cascading failures and identify situations in which this term is used
  • define what is meant by operational loads, list their types, and describe how they relate to optimal performance
  • define what is meant by resource exhaustion and describe its consequences
  • describe how automation processes can vary
  • describe how server overloads can lead to cascading failures
  • describe the features and benefits of the mean time to failure (MTTF) metric and outline how to use it in SRE work
  • describe the purpose and characteristics of utilization signals
  • determine which factors are the root cause of a problem
  • differentiate between load shedding and graceful degradation
  • differentiate between SRE and DevOps
  • list CPU considerations as they relate to failures and overutilization
  • list factors that can contribute to memory exhaustion
  • list the core tenets of SRE
  • list the potential consequences of overloads, including serious illness among staff
  • outline how to prevent server overloads
  • outline processes for working with overload errors
  • outline steps to ensure efficient queue management
  • outline steps to mitigate overloads
  • provide an overview of common pitfalls associated with troubleshooting systems
  • provide an overview of Service Level Agreements
  • provide an overview of Service Level Objectives
  • provide an overview of Site Reliability Engineering
  • provide an overview of the primary goals of a post-mortem philosophy
  • provide an overview of Service Level Indicators
  • recognize how exhaustion of file descriptors and threads can directly lead to failures
  • recognize how resource exhaustion can lead to service unavailability
  • recognize how resource exhaustion can spread from one resource to another
  • recognize the nine principles of Site Reliability Engineering
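As a rough illustration of the time-based metrics named in the topics above (MTTF, MTBF, and mean time to resolve), the following Python sketch derives them from an incident log. The `Incident` class, the `reliability_metrics` function, and the hour-based timestamps are hypothetical, and the conventions are assumptions: the service is taken to be up at time 0, MTTF is the mean uptime before each failure, and MTBF is measured from the start of one failure to the start of the next. Teams and tools define these metrics slightly differently, so treat this as a sketch rather than a standard implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """One outage (hypothetical record): hours since observation start."""
    started: float   # when the failure began
    resolved: float  # when service was restored


def reliability_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTF, MTTR (mean time to resolve), and MTBF.

    Assumed conventions (others exist):
      - the service is up at time 0,
      - MTTF = mean uptime before each failure,
      - MTTR = mean time from failure start to resolution,
      - MTBF = mean time between the starts of consecutive failures.
    The log must be sorted and non-overlapping.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents to measure time between failures")

    uptimes = []
    previous_end = 0.0
    for incident in incidents:
        if incident.started < previous_end or incident.resolved < incident.started:
            raise ValueError("incidents must be sorted, non-overlapping, and end after they start")
        # Uptime runs from the previous restoration (or time 0) to this failure.
        uptimes.append(incident.started - previous_end)
        previous_end = incident.resolved

    repair_times = [i.resolved - i.started for i in incidents]
    gaps = [b.started - a.started for a, b in zip(incidents, incidents[1:])]
    return {"mttf": mean(uptimes), "mttr": mean(repair_times), "mtbf": mean(gaps)}


if __name__ == "__main__":
    log = [Incident(100.0, 102.0), Incident(250.0, 251.0), Incident(400.0, 403.0)]
    print(reliability_metrics(log))
```

With failures starting at hours 100, 250, and 400 and lasting 2, 1, and 3 hours, the sketch gives a mean time to resolve of 2 hours, an MTBF of 150 hours, and an MTTF of about 132.3 hours (uptimes of 100, 148, and 149 hours).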