SRE Team Management: Managing Operational Loads

SRE
  • 17 Videos | 1h 2m 9s
  • Includes Assessment
  • Earns a Badge
To ensure and maintain a system's functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage a system's operational load, which generally falls into three categories: ongoing operational activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational loads at the team level and using support ticketing systems and service level objectives. Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between crucial metrics used for managing them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.

WHAT YOU WILL LEARN

  • discover the key concepts covered in this course
  • describe what is meant by operational load and outline the three general categories of operational load
  • outline how on-call engineers depend on pages to respond to incidents and outages
  • outline the steps involved in responding to emergency incidents
  • outline the purpose of customer request support tickets and provide examples of simple and complex tickets
  • describe the essential components of a typical ticketing system
  • recognize how to use service level objectives (SLOs) to ensure timely responses and resolutions
  • describe what is meant by toil and provide examples of toil, such as applying schema changes to a database
  • differentiate between types of toil including automated, manual, repetitive, and tactical
  • outline steps to track and identify toil and describe why less toil is better
  • describe how to measure and calculate toil
  • outline steps to minimize or eliminate toil completely
  • differentiate between toil and complexity and describe approaches to address complexity
  • describe how toil can negatively affect staff, including through low morale and confusion among SREs
  • list key metrics used for managing interrupts, such as the severity of the interrupt
  • outline human element factors to consider when dealing with interrupts, such as distractibility
  • summarize the key concepts covered in this course

EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of this course, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.