SRE Team Management: Managing Operational Loads
SRE | Intermediate
- 17 videos | 54m 39s
- Includes Assessment
- Earns a Badge
To keep a system in a functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage the system's operational load, which generally falls into three categories: ongoing operational activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational load at the team level, including the use of support ticketing systems and service level objectives. Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between the key metrics used for managing them. Lastly, you'll identify the human factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.
WHAT YOU WILL LEARN
- discover the key concepts covered in this course
- describe what is meant by operational load and outline the three general categories of operational load
- outline how on-call engineers depend on pages to respond to incidents and outages
- outline the steps involved in responding to emergency incidents
- outline the purpose of customer request support tickets and provide examples of simple and complex tickets
- describe the essential components of a typical ticketing system
- recognize how to use service level objectives (SLOs) to ensure timely responses and resolutions
- describe what is meant by toil and provide examples of toil, such as applying schema changes to a database
- differentiate between types of toil, including automated, manual, repetitive, and tactical
- outline steps to track and identify toil and describe why less toil is better
- describe how to measure and calculate toil
- outline steps to minimize or eliminate toil completely
- differentiate between toil and complexity and describe approaches to address complexity
- describe how toil can negatively affect staff, including through low morale and confusion amongst SREs
- list key metrics used for managing interrupts, such as the severity of the interrupt
- outline human factors to consider when dealing with interrupts, such as distractibility
- summarize the key concepts covered in this course
IN THIS COURSE
- 1m 44s
- 3m 35s
- 2m 53s
- 3m 29s
- 3m 29s
- 4m 36s
- 3m 25s
- 3m 8s
- 3m 41s
- 3m 7s
- 3m 15s
- 3m 21s
- 2m 52s
- 3m 18s
- 3m 29s
- 4m 4s
- 1m 13s
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft offers you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.
Digital badges are yours to keep, forever.