Site Reliability Engineer: Managing Cascading Failures

SRE
  • 21 Videos | 1h 20m 39s
  • Includes Assessment
  • Earns a Badge
Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.
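
As a rough illustration of the deadline idea mentioned above, the following Python sketch attaches an absolute deadline to a request and checks it before each stage of work, so that later stages are skipped once the caller has stopped waiting. It is not taken from the course; the names `Request`, `DeadlineExceeded`, and `handle` are hypothetical, and the injectable `clock` parameter is an assumption made so the behavior can be tested.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a request's deadline has already passed."""


class Request:
    """Hypothetical request that carries an absolute deadline."""

    def __init__(self, payload, timeout_s, clock=time.monotonic):
        self.payload = payload
        self._clock = clock
        # Store an absolute deadline so every stage sees the same cutoff.
        self.deadline = clock() + timeout_s

    def remaining(self):
        """Seconds left before the deadline (negative once expired)."""
        return self.deadline - self._clock()

    def check(self):
        """Raise if there is no time left to do useful work."""
        if self.remaining() <= 0:
            raise DeadlineExceeded("deadline passed; skipping remaining work")


def handle(request, stages):
    """Run each stage in order, checking the deadline before each one."""
    results = []
    for stage in stages:
        request.check()  # stop early instead of doing unneeded work
        results.append(stage(request))
    return results
```

Because the deadline travels with the request, a stage that runs long causes every later stage to be skipped, which is the resource-saving effect the description refers to.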

WHAT YOU WILL LEARN

  • discover the key concepts covered in this course
  • define what is meant by cascading failures and identify situations in which this term is used
  • describe how server overloads can lead to cascading failures
  • define what is meant by resource exhaustion and describe its consequences
  • list CPU considerations as they relate to failures and overutilization
  • list factors that can contribute to memory exhaustion
  • recognize how file descriptors and threads can directly lead to failures
  • recognize how resource exhaustion can travel from one resource to another
  • recognize how resource exhaustion can lead to service unavailability
  • outline how to prevent server overloads
  • outline steps to ensure efficient queue management
  • differentiate between load shedding and graceful degradation
  • define what is meant by code retries and recognize why it is relevant to the topic of cascading failures
  • recognize the benefits of setting deadlines
  • recognize how propagating cancellations can reduce unneeded work
  • define what is meant by latency considerations, including bimodal latency, and describe how to address this class of problems
  • outline the steps involved in managing slow startups and working with cold caching
  • differentiate between the various cascading failure triggers
  • outline how to test cascading failures
  • list steps to immediately address cascading failures
  • summarize the key concepts covered in this course
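
The difference between load shedding and graceful degradation listed above can be sketched in a few lines of Python. This is an illustrative sketch rather than course material: the `Server` class, its `degrade_at` and `shed_at` thresholds, and the `full` and `cheap` handler arguments are all hypothetical. Under moderate load the server answers with a cheaper, lower-quality response (graceful degradation); past a hard limit it rejects the request outright (load shedding).

```python
class Overloaded(Exception):
    """Raised when a request is shed instead of served."""


class Server:
    def __init__(self, degrade_at, shed_at):
        # Hypothetical thresholds on the number of in-flight requests.
        self.degrade_at = degrade_at
        self.shed_at = shed_at
        self.in_flight = 0

    def mode(self):
        """Pick a serving mode from the current load."""
        if self.in_flight >= self.shed_at:
            return "shed"
        if self.in_flight >= self.degrade_at:
            return "degraded"
        return "full"

    def handle(self, query, full, cheap):
        mode = self.mode()
        if mode == "shed":
            # Load shedding: refuse the request before spending resources.
            raise Overloaded("too many requests in flight")
        self.in_flight += 1
        try:
            # Graceful degradation: still answer, but with cheaper work.
            handler = cheap if mode == "degraded" else full
            return mode, handler(query)
        finally:
            self.in_flight -= 1
```

In a real service the load signal might be CPU utilization or queue length rather than a simple in-flight counter; the counter is used here only to keep the sketch short.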
