Site Reliability: Engineering

SRE    |    Intermediate
  • 13 videos | 1h 5m 8s
  • Includes Assessment
  • Earns a Badge
Site Reliability Engineers are often considered the link between software development and operations. In this course, you'll explore the principles of site reliability engineering as well as common concerns such as measuring and managing risk, and risk tolerance. You'll also learn how to ensure a satisfactory level of service by implementing Service Level Objectives, Service Level Agreements, and Service Level Indicators.


  • discover the key concepts covered in this course
  • provide an overview of Site Reliability Engineering
  • recognize the nine principles of Site Reliability Engineering
  • list the core tenets of SRE
  • differentiate between SRE and DevOps
  • provide an overview of Service Level Indicators
  • provide an overview of Service Level Objectives
  • provide an overview of Service Level Agreements
  • recognize how to embrace and manage risk in an environment
  • recognize how to measure service risk using metrics such as time-based availability and aggregate availability
  • identify the risk tolerance of infrastructure services
  • provide an overview of error budgets
  • summarize the key concepts covered in this course


  • 1m 17s
  • 4m 31s
    In this video, you'll learn more about the Site Reliability Engineer (SRE) and what the job involves. You'll learn that although the concept isn't new, the role is becoming increasingly common in today's organizations. The SRE bridges the gap between operations and development to build scalable and highly reliable systems.
  • 3.  Principles of Site Reliability Engineering
    7m 30s
    In this video, you'll learn more about the nine core principles of Site Reliability Engineering and about implementing a DevOps approach in your organization. The first principle is that the primary responsibility of the SRE is writing code. The second is to build a team of site reliability engineers who can draw from a pool of developers to help ensure that the system is reliable. The third deals with recognizing the core capabilities of your development team. Explore these and other key principles by watching this video.
  • 4.  Tenets of SRE
    9m 13s
    In this video, you'll learn more about the core tenets of site reliability engineering. These include managing availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Each of them, in one way or another, helps improve the overall reliability of the site, its services, and its applications. The video outlines the ten core activities the SRE will be involved with.
  • 5.  SRE vs. DevOps
    6m 47s
    In this video, you'll learn more about the distinctions between a DevOps position and a Site Reliability Engineer position. You'll learn that in many cases the terms are used interchangeably. In some situations the roles are combined, but in others they're distinct. You'll learn that the goal of an SRE is to bridge the gap between Dev and Ops, thereby creating a DevOps role.
  • 6.  Service Level Indicator
    5m 37s
    In this video, you'll learn more about the concept of a Service Level Indicator. You'll learn how indicators deal with specific aspects of a service and how they deliver actual measured values. You'll learn that several key indicators are of interest to both the provider and the consumer, including request latency, which measures the actual time it takes to respond to a request.
  • 7.  Service Level Objective
    In this video, you'll learn more about the concept of availability, which refers to how well a system can fulfill its purpose and function reliably. Toward that end, the Service Level Objective (SLO) defines a precise numerical target for availability, which serves as a benchmark against which future performance can be compared. Explore this subject by watching this video.
  • 8.  Service Level Agreement
    3m 48s
    In this video, you'll learn more about the Service Level Agreement (SLA), a contract between a provider and a consumer that defines the level of service to be expected. You'll learn that the typical components of an SLA include clearly defined metrics such as the speed of a service, responsibilities in terms of who looks after what, and expectations of what should be done for maintenance and upkeep.
  • 9.  Managing Risk
    3m 8s
  • 10.  Measuring Risk
    7m 27s
    In this video, you'll learn more about methods for measuring the level of risk associated with a system or a service. This usually starts with establishing a target for a given metric or performance value. When it comes to measuring risk, there are many consequences of a failure to consider, including customer dissatisfaction or a loss of trust, lost revenue, and lost customers. Watch this video to find out more about measuring risk and acceptable levels of unplanned downtime.
  • 11.  Risk Tolerance
    4m 48s
    In this video, you'll learn more about considerations for establishing an acceptable level of risk, or risk tolerance. As in any service or solution, there are often many moving parts, each with different considerations when it comes to risk. You'll discover that there is always a risk of physical failure with components such as hardware. This video explores these topics and how to take a top-down approach, focusing on the risk associated with individual components.
  • 12.  Error Budgets
    4m 11s
    In this video, you'll learn more about error budgets. An error budget is the amount of acceptable downtime for a given service or system, which can then be spent on developing new features or improvements. By its nature, the error budget can give rise to conflicts between product development teams and site reliability engineering teams, because the nature of their jobs puts them at odds with each other. This video outlines these issues.
  • 13.  Course Summary
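
The risk metrics named in the course objectives (time-based availability, aggregate availability, and error budgets) can be illustrated with a short sketch. This is not taken from the course videos. It uses the commonly cited definitions, and the function names and sample numbers are hypothetical, chosen only for illustration.

```python
def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    """Fraction of time the service was up: uptime / (uptime + downtime)."""
    total = uptime_s + downtime_s
    if total <= 0:
        raise ValueError("uptime plus downtime must be positive")
    return uptime_s / total


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded: successful / total."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def error_budget_remaining(slo_target: float, successful: int, total: int) -> float:
    """Share of the error budget still unspent (negative if overspent).

    Assumes a request-based SLO, so the budget is (1 - slo_target) * total
    failed requests for the measurement window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 10,000 requests, 5 failures, SLO target of 99.9%.
    print(aggregate_availability(9_995, 10_000))
    print(error_budget_remaining(0.999, 9_995, 10_000))
```

With a 99.9% target over 10,000 requests, the budget allows roughly 10 failed requests, so 5 failures leave about half of the budget unspent.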


Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.

