SRE Team Management: Managing Operational Loads
SRE | Intermediate
- 17 Videos | 54m 39s
- Includes Assessment
- Earns a Badge
To keep a system in a functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage the system's operational load, which generally falls into three categories: ongoing operational activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational loads at the team level, including the use of support ticketing systems and service level objectives (SLOs). Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between the crucial metrics used for managing them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.
WHAT YOU WILL LEARN
- discover the key concepts covered in this course
- describe what is meant by operational load and outline the three general categories of operational load
- outline how on-call engineers depend on pages to respond to incidents and outages
- outline the steps involved in responding to emergency incidents
- outline the purpose of customer request support tickets and provide examples of simple and complex tickets
- describe the essential components of a typical ticketing system
- recognize how to use service level objectives (SLOs) to ensure timely responses and resolutions
- describe what is meant by toil and provide examples of toil, such as applying schema changes to a database
- differentiate between types of toil, including automated, manual, repetitive, and tactical
- outline steps to track and identify toil and describe why less toil is better
- describe how to measure and calculate toil
- outline steps to minimize or eliminate toil completely
- differentiate between toil and complexity and describe approaches to address complexity
- describe how toil can negatively affect staff, including through low morale and confusion amongst SREs
- list key metrics used for managing interrupts, such as the severity of the interrupt
- outline human element factors to consider when dealing with interrupts, such as distractibility
- summarize the key concepts covered in this course
IN THIS COURSE
- 1. Course Overview (1m 44s)
- 2. Operational Loads (3m 35s)
- 3. Incidents and Outages (2m 53s)
- 4. Responding to Incidents (3m 29s)
- 5. Support Tickets (3m 29s)
- 6. Ticketing Systems (4m 36s)
- 7. Response and Resolution Timeframes (3m 25s)
- 8. Toil in SRE (3m 8s)
- 9. Types of Toil (3m 41s)
- 10. Identifying Toil (3m 7s)
- 11. Calculating Toil (3m 15s)
- 12. Eliminating Toil (3m 21s)
- 13. Addressing Complexity (2m 52s)
- 14. Negative Effects of Toil (3m 18s)
- 15. Working with Interrupts (3m 29s)
- 16. Human Element Factors with Interrupts (4m 4s)
- 17. Course Summary (1m 13s)
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft gives you the opportunity to earn a digital badge upon successful completion of this course. You can share the badge on any social network or business platform.
Digital badges are yours to keep, forever.