Site Reliability Engineer: Managing Cascading Failures
SRE | Intermediate
- 21 videos | 1h 11m 9s
- Includes Assessment
- Earns a Badge
Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.
WHAT YOU WILL LEARN
- discover the key concepts covered in this course
- define what is meant by cascading failures and identify situations in which this term is used
- describe how server overloads can lead to cascading failures
- define what is meant by resource exhaustion and describe its consequences
- list CPU considerations as they relate to failures and overutilization
- list factors that can contribute to memory exhaustion
- recognize how file descriptors and threads can directly lead to failures
- recognize how resource exhaustion can travel from one resource to another
- recognize how resource exhaustion can lead to service unavailability
- outline how to prevent server overloads
- outline steps to ensure efficient queue management
- differentiate between load shedding and graceful degradation
- define what is meant by code retries and recognize why it is relevant to the topic of cascading failures
- recognize the benefits of setting deadlines
- recognize how propagating cancellations can reduce unneeded work
- define what is meant by latency considerations, including bimodal latency, and describe how to address this class of problems
- outline the steps involved in managing slow startups and working with cold caching
- differentiate between the various cascading failure triggers
- outline how to test cascading failures
- list steps to immediately address cascading failures
- summarize the key concepts covered in this course
IN THIS COURSE
- 1. Course Overview (1m 30s)
- 2. Recognizing Cascading Failures (1m 55s)
- 3. Server Overloads (3m 34s)
- 4. Resource Exhaustion (3m 4s)
- 5. CPU Resources (4m 52s)
- 6. Memory Resources (2m 50s)
- 7. File Descriptors and Threads (2m 14s)
- 8. Resource Dependencies (1m 58s)
- 9. Unavailable Services (4m 45s)
- 10. Preventing Overloads (4m 47s)
- 11. Queueing Requests (2m 59s)
- 12. Load Shedding (6m 32s)
- 13. Code Retries (2m 34s)
- 14. Implementing Deadlines (3m 56s)
- 15. Propagating Cancellations (1m 33s)
- 16. Dealing with Latency (2m 20s)
- 17. Working with Slow Startups (3m 49s)
- 18. Cascading Failure Triggers (4m 50s)
- 19. Testing Cascading Failures (4m 30s)
- 20. Addressing Cascading Failures (5m 30s)
- 21. Course Summary (1m 8s)
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft offers you the opportunity to earn a digital badge upon successful completion of some of our courses. You can share the badge on any social network or business platform.
Digital badges are yours to keep, forever.