Site Reliability Engineer: Managing Cascading Failures

SRE    |    Intermediate
  • 21 videos | 1h 11m 9s
  • Includes Assessment
  • Earns a Badge
Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.


  • Discover the key concepts covered in this course
    Define what is meant by cascading failures and identify situations in which this term is used
    Describe how server overloads can lead to cascading failures
    Define what is meant by resource exhaustion and describe its consequences
    List CPU considerations as they relate to failures and overutilization
    List factors that can contribute to memory exhaustion
    Recognize how file descriptors and threads can directly lead to failures
    Recognize how resource exhaustion can travel from one resource to another
    Recognize how resource exhaustion can lead to service unavailability
    Outline how to prevent server overloads
    Outline steps to ensure efficient queue management
    Differentiate between load shedding and graceful degradation
    Define what is meant by code retries and recognize why it is relevant to the topic of cascading failures
    Recognize the benefits of setting deadlines
    Recognize how propagating cancellations can reduce unneeded work
    Define what is meant by latency considerations, including bimodal latency, and describe how to address this class of problems
    Outline the steps involved in managing slow startups and working with cold caching
    Differentiate between the various cascading failure triggers
    Outline how to test cascading failures
    List steps to immediately address cascading failures
    Summarize the key concepts covered in this course


  • 1m 30s
  • 1m 55s
  • 3.  Server Overloads
    3m 34s
  • 4.  Resource Exhaustion
    3m 4s
    In this video, you'll learn that one of the potential causes for a cascading failure is resource exhaustion. You'll learn what this means and how it relates to system-level resources like memory, CPU, disk space, and so on. For example, imagine having 64 gigabytes of memory and running at or near full capacity. Or imagine running at 100% CPU, or even running completely out of disk space.
  • 5.  CPU Resources
    4m 52s
    In this video, you'll learn more about resource exhaustion, or system overload, in the CPU of an individual server. You'll learn that if the CPU is starved, processes start to run much more slowly. The host will outline the symptoms of a starved CPU and how they cascade into other areas, such as memory and the total number of active threads.
  • 6.  Memory Resources
    2m 50s
    In this video, you'll learn that system resource exhaustion can happen in many different areas. One common area is memory consumption. Memory exhaustion can result from processes taking lots of memory, but it can also happen as the CPU request queue starts to back up. A common symptom of memory exhaustion is dying tasks: as memory needs exceed memory availability, a task might be killed or evicted by the system.
  • 7.  File Descriptors and Threads
    2m 14s
  • 8.  Resource Dependencies
    1m 58s
  • 9.  Unavailable Services
    4m 45s
  • 10.  Preventing Overloads
    4m 47s
  • 11.  Queueing Requests
    2m 59s
    In this video, you'll outline steps to ensure efficient queue management when systems are overloaded. You'll learn that one of the most common resources to be exhausted is the CPU. When the CPU is exhausted, it can no longer handle all the requests coming at it. This means the queue of requests grows larger and larger until the system eventually runs out of memory.
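    One common way to keep a backed-up queue from consuming all available memory is to cap its length and reject new requests once it is full. The sketch below is only an illustration of that idea, not material from the course; `MAX_QUEUE_LENGTH` and `enqueue_request` are hypothetical names chosen for this example.

```python
import queue

# Hypothetical capacity; a real service would size this from measured throughput.
MAX_QUEUE_LENGTH = 3


def enqueue_request(pending: queue.Queue, request: str) -> bool:
    """Try to enqueue a request; return False (reject) when the queue is full."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        # Rejecting early keeps memory use bounded instead of letting the
        # backlog grow until the process runs out of memory.
        return False


pending = queue.Queue(maxsize=MAX_QUEUE_LENGTH)
results = [enqueue_request(pending, f"req-{i}") for i in range(5)]
print(results)  # [True, True, True, False, False]
```

    Rejecting the fourth and fifth requests immediately gives callers a fast failure rather than a slow one, at the cost of turning away work the server might have handled later.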
  • 12.  Load Shedding
    6m 32s
    In this video, you'll learn how to differentiate between load shedding and graceful degradation. The idea behind load shedding is that some less important requests will be dropped as the server approaches an overload condition. You'll learn that the goal of load shedding is to prevent the server from having to execute less important requests, which helps keep the system from running out of CPU or memory and from failing health checks.
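    The load-shedding idea described above can be sketched as a simple admission check. This is a minimal illustration under stated assumptions, not the course's implementation: the `Request` type, the `critical` flag, and the `SHED_THRESHOLD` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    critical: bool  # hypothetical marker for requests that must not be dropped


# Hypothetical utilization level (0.0 to 1.0) at which shedding begins.
SHED_THRESHOLD = 0.8


def should_accept(request: Request, utilization: float) -> bool:
    """Accept everything under normal load; near overload, keep only critical requests."""
    if utilization >= SHED_THRESHOLD:
        return request.critical
    return True


print(should_accept(Request("checkout", critical=True), utilization=0.9))    # True
print(should_accept(Request("thumbnail", critical=False), utilization=0.9))  # False
print(should_accept(Request("thumbnail", critical=False), utilization=0.5))  # True
```

    In practice the utilization signal might come from CPU load, queue length, or in-flight request counts; which signal fits best depends on the service.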
  • 13.  Code Retries
    2m 34s
    In this video, you'll learn what happens when systems are overloaded with requests and discover that retries can make the problem worse. But what is a retry? A retry is a repeat attempt made after a previous failure, on the assumption that the condition that caused the initial failure no longer applies.
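    Because retries add load to a server that may already be struggling, a common mitigation is to cap the number of attempts and wait a randomized, exponentially growing delay between them. The sketch below illustrates that pattern only; `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable failure are assumptions made for this example.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                # Give up rather than keep adding load to a failing server.
                raise
            # Exponential backoff with full jitter: wait a random time between
            # 0 and base_delay * 2^(attempt - 1) so clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


# Usage: a hypothetical operation that fails twice, then succeeds.
outcomes = iter([ConnectionError("overloaded"), ConnectionError("overloaded"), "ok"])


def flaky_operation():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(call_with_retries(flaky_operation, max_attempts=3, base_delay=0.01))  # ok
```

    The `sleep` parameter is injectable so the delay can be replaced in tests; the attempt cap is what keeps a persistent failure from turning into an unbounded stream of repeat requests.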
  • 14.  Implementing Deadlines
    3m 56s
  • 15.  Propagating Cancellations
    1m 33s
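    The course overview notes that setting deadlines and propagating cancellations can reduce or eliminate unneeded work. A minimal sketch of how the two fit together, using Python's `asyncio` as an assumed environment: the handler sets a deadline with `asyncio.wait_for`, and when the deadline passes, the pending downstream call is cancelled instead of running to completion. The names `backend_call` and `handle_request` and the timings are hypothetical.

```python
import asyncio


async def backend_call(log):
    """Simulated slow downstream work that records whether it finished."""
    try:
        await asyncio.sleep(1.0)
        log.append("backend finished")
    except asyncio.CancelledError:
        # The cancellation reached this call, so the remaining work is skipped.
        log.append("backend cancelled")
        raise


async def handle_request(log, deadline_seconds):
    """Serve a request, giving up on the backend once the deadline passes."""
    try:
        await asyncio.wait_for(backend_call(log), timeout=deadline_seconds)
        return "ok"
    except asyncio.TimeoutError:
        return "deadline exceeded"


log = []
result = asyncio.run(handle_request(log, deadline_seconds=0.05))
print(result)  # deadline exceeded
print(log)     # ['backend cancelled']
```

    Without the cancellation, the backend would keep working for the full second on a result nobody is waiting for.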
  • 16.  Dealing with Latency
    2m 20s
  • 17.  Working with Slow Startups
    3m 49s
  • 18.  Cascading Failure Triggers
    4m 50s
  • 19.  Testing Cascading Failures
    4m 30s
  • 20.  Addressing Cascading Failures
    5m 30s
  • 21.  Course Summary
    1m 8s


Skillsoft gives you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.

Digital badges are yours to keep, forever.