Site Reliability Engineer: Managing Overloads

SRE    |    Intermediate
  • 20 videos | 1h 10m 51s
  • Includes Assessment
  • Earns a Badge
Rating 4.8 of 247 users
Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overload is also a form of occupational stress, which negatively affects an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate overloads and prioritize work before recognizing the specific consequences of overloads. Next, you'll describe how to manage client-side traffic using per-customer limitations and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.


  • Discover the key concepts covered in this course
  • Define what is meant by operational loads, list their types, and describe how they relate to optimal performance
  • Outline the purpose of pages and how to manage them
  • Recognize the benefits of using tickets
  • Outline the activities involved in ongoing operational responsibilities
  • Identify how operational overload occurs and name considerations related to operational threshold
  • Outline steps to mitigate overloads
  • List the potential consequences of overloads, including serious illness to staff
  • Recognize the importance of prioritizing work and tasks
  • Recognize the pitfalls of the queries per second metric
  • Name capacity options, such as per-customer limitations
  • Recognize the benefits of client-side throttling
  • Define the concept of criticality, name four criticality values, and identify the purpose of criticality and each value
  • Describe the purpose and characteristics of utilization signals
  • Outline processes for working with overload errors
  • Describe mechanisms available to avoid retrying requests, such as per-request retry budget and per-client retry budget
  • Outline how counters can help prevent overloads
  • Describe how loads from connections can help recognize and prevent overloads
  • Identify potential problems caused by new connection bursts
  • Summarize the key concepts covered in this course


  • 1m 23s
  • 7m 46s
  • 3.  Operational Load Types: Pages
    5m 55s
  • 4.  Operational Load Types: Tickets
    2m 55s
  • 5.  Ongoing Operational Responsibilities
    4m 26s
    In this video, you'll learn more about SREs' ongoing operational responsibilities, also known as "kicking the can down the road" or toil. Ongoing ops are the work an SRE needs to do to maintain the everyday operations of a system. No matter how much of it you do, there's always more. It takes continuous effort to avoid being overloaded with unplanned items.
  • 6.  Operational Overload
    5m 41s
    In this video, you'll learn more about how to identify operational overload and name considerations related to operational threshold. You'll discover that for an SRE team to run smoothly, it needs a predictable workload. However, work items are consistently inconsistent. For example, a single ticket might look very simple, and maybe it is. But it can also turn out to be highly complex and require a massive investigation.
  • 7.  Mitigating Overload
    5m 59s
    In this video, you'll learn more about the symptoms of an operational overload. The good news is that the symptoms are easy to identify. One symptom is demoralized people, who start to complain or rant about their work and about the business as a whole, and things start to get toxic. Another is an unhealthy task queue, which can be identified by its size, missed deadlines, and old items.
  • 8.  Consequences of Overloads
    3m 22s
    In this video, you'll learn more about the psychosocial risks associated with an operational overload. The severity of the impact varies from person to person. This video outlines some of the ways that an operational overload can cause people to break down.
  • 9.  Prioritizing Work
    3m 43s
    In this video, you'll learn more about what happens when a team gets into an operational overload, meaning it has more work than it can handle. When that happens, the team can't make any progress. It is battling a tidal wave of work that prevents it from managing its workload effectively or making headway on priority items.
  • 10.  Queries Per Second
    3m 54s
    Queries per second (QPS) is a metric used to measure the rate of traffic going through a system. This can be an end-to-end metric, taking into account all the network hops in between. It can also be used in conjunction with how your teams are functioning to assess their ability to handle scale. A good alternative to a single QPS figure is measuring traffic per individual service, which can help you understand system health at a finer granularity.
  • 11.  Per Customer Limitations
    3m 33s
    In this video, you'll learn more about overloads. You'll learn that overloads aren't limited to site reliability engineers; they can hit many teams across the organization when a high volume of work arrives. A global overload results in an all-hands-on-deck situation where everyone has to get involved to handle a massive influx of tickets, pages, and even ongoing operations. You'll learn how to avoid normal overload situations and how global overloads happen.
  • 12.  Client-side Throttling
    2m 53s
    Client-side throttling is a technique used by hosting providers to protect their systems from excessive load. When customers continuously make requests for large amounts of data, the system can become bogged down. By setting up quotas on the customer's account, the provider can limit the amount of data that the customer can download in a day. If the customer exceeds the quota, the provider can then throttle the customer's requests.
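
For comparison, one self-regulating form of client-side throttling is the adaptive throttling scheme described in Google's Site Reliability Engineering book, where each client drops some requests locally once the back end starts rejecting them. The sketch below is a minimal illustration under that assumption, not necessarily the approach shown in the video; the function names and the `rand` parameter are hypothetical.

```python
def rejection_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability that the client drops a request locally.

    requests: requests the application tried to send in the recent window
    accepts:  requests the back end accepted in the same window
    k:        multiplier (assumed default); higher values throttle less eagerly
    """
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_send(requests: int, accepts: int, rand: float, k: float = 2.0) -> bool:
    """Decide locally whether to send; `rand` is a uniform draw in [0, 1)."""
    return rand >= rejection_probability(requests, accepts, k)
```

In practice `rand` would come from `random.random()`; passing it in keeps the decision easy to test. While the back end accepts at least `1/k` of the traffic, the probability stays at zero and nothing is dropped locally.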
  • 13.  Criticality and Criticality Values
    2m 57s
    In this video, you will learn how to define the concept of criticality and name four criticality values. You will learn that forcing a client to back off once its quota has been reached helps make sure other clients can still use the system without being impacted by the actions of others. But what happens if the client needs to run a really critical query? Do we treat all requests the same? This will be up to your specific situation.
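
As a rough illustration of treating requests differently by criticality, the sketch below sheds the least critical traffic first as utilization rises. The four value names follow Google's Site Reliability Engineering book and may differ from the ones used in the video; the thresholds and the `admit` function are hypothetical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Four criticality values, ordered from least to most important."""
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3


# Hypothetical thresholds: utilization at or above which each class is rejected.
SHED_AT = {
    Criticality.SHEDDABLE: 0.70,
    Criticality.SHEDDABLE_PLUS: 0.80,
    Criticality.CRITICAL: 0.90,
    Criticality.CRITICAL_PLUS: 1.00,
}


def admit(criticality: Criticality, utilization: float) -> bool:
    """Accept the request only while utilization is below its class threshold."""
    return utilization < SHED_AT[criticality]
```

With these assumed thresholds, a back end at 75% utilization would reject `SHEDDABLE` requests while still serving everything else.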
  • 14.  Utilization Signals
    2m 57s
    Utilization signals are metrics that can be used to prevent system overload. These signals can be based on local task states or on the load of the task process itself. By monitoring these signals over time, we can determine the health of the task and the system it is running on.
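
One way to picture a utilization signal based on local task state is a smoothed ratio of in-flight requests to capacity. The sketch below is only an assumption about how such a signal could be tracked; the class name, the smoothing factor `alpha`, and the 0.9 threshold are all hypothetical.

```python
class UtilizationSignal:
    """Exponentially smoothed ratio of active requests to task capacity."""

    def __init__(self, capacity: int, alpha: float = 0.5):
        self.capacity = capacity  # requests the task can handle concurrently
        self.alpha = alpha        # weight given to the newest sample
        self.smoothed = 0.0

    def observe(self, active_requests: int) -> float:
        """Fold a new sample into the smoothed signal and return it."""
        sample = active_requests / self.capacity
        self.smoothed = self.alpha * sample + (1 - self.alpha) * self.smoothed
        return self.smoothed

    def overloaded(self, threshold: float = 0.9) -> bool:
        """Report whether the smoothed signal has reached the threshold."""
        return self.smoothed >= threshold
```

Smoothing means a single busy sample does not trip the threshold; the signal has to stay high over several observations.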
  • 15.  Working with Overload Errors
    3m 19s
    In this session, Sven Batalla will discuss how to handle overload errors in a data center. First, he will discuss the two types of overload errors that can occur: large and small. Second, he will go over the steps that should be taken in each situation. Finally, he will discuss load balancing strategies that can be employed to prevent overload.
  • 16.  Retrying Requests
    2m 39s
    Retrying failed requests can amplify an overload, so limiting retries is a common strategy for preventing it. A per-request retry budget allows the caller to make two more attempts before getting an error message, while a per-client retry budget limits how many retries the client can send in a given time frame.
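
The two budgets can be combined in a small client-side helper. The sketch below is a minimal illustration: the three-attempt cap matches the "two more attempts" described above, while the 10% retry ratio, the class name, and the method names are assumptions. A real implementation would also track the counts over a sliding time window rather than forever.

```python
class RetryBudget:
    """Tracks a per-request attempt cap and a per-client retry ratio."""

    def __init__(self, max_attempts: int = 3, max_retry_ratio: float = 0.1):
        self.max_attempts = max_attempts        # first try plus two retries
        self.max_retry_ratio = max_retry_ratio  # assumed per-client limit
        self.requests = 0  # every send, including retries
        self.retries = 0   # sends that were retries

    def record_send(self, attempt: int) -> None:
        """Record one send; `attempt` is 1 for the first try."""
        self.requests += 1
        if attempt > 1:
            self.retries += 1

    def may_retry(self, attempts_so_far: int) -> bool:
        """Allow a retry only if both budgets still have room."""
        if attempts_so_far >= self.max_attempts:
            return False  # per-request budget exhausted
        if self.requests == 0:
            return True
        return self.retries / self.requests < self.max_retry_ratio
```

A caller would check `may_retry` after each failure and surface the error to its own caller as soon as either budget is exhausted.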
  • 17.  Overloads and Counters
    1m 58s
    Counting requests to the back end can help prevent system overloads. Retries can make the problem worse, so implementing a counter can help. The counter can tell the back end how many times a specific request has already been tried, and this information can be used to decide whether or not to allow the request. This strategy relies on the client being honest, but if you control both the client and the back end, that is a reasonable assumption to make.
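
The idea of carrying an attempt counter with each request can be sketched as follows. This is a simplified illustration, not the mechanism shown in the video; the `Request` type, the `attempt` field, and the `max_attempt_when_overloaded` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    payload: str
    attempt: int = 0  # 0 for the first try; the client increments it per retry


def backend_accepts(request: Request, overloaded: bool,
                    max_attempt_when_overloaded: int = 0) -> bool:
    """Accept everything when healthy; when overloaded, drop requests
    whose attempt counter exceeds the allowed value."""
    if not overloaded:
        return True
    return request.attempt <= max_attempt_when_overloaded
```

With the assumed default of 0, an overloaded back end still serves first attempts but turns away anything that has already been retried.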
  • 18.  Connection Loads
    2m 1s
    Connection loads can be a factor in system overloads. Loads from connections can slow down requests and take up more CPU and memory. Monitoring the health of your connections is recommended to avoid overloads.
  • 19.  New Connection Bursts
    2m 24s
    In this session, Sven Batalla will discuss how new connection bursts can occur and how to prevent system overload. He will also discuss two strategies for handling these bursts. The first is cross-datacenter load balancing, which can help distribute the load when a single data center becomes overloaded. The second is the use of proxy batch jobs, which can buffer requests and then forward them to the actual request management tasks.
  • 20.  Course Summary
    1m 9s


Skillsoft offers you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.

