Distributed Reliability: SRE Critical State Management

SRE    |    Intermediate
  • 14 videos | 1h 13m 47s
  • Includes Assessment
  • Earns a Badge
Rating 4.8 of 181 users
Anticipating failures that will affect your company's systems is a crucial duty of a site reliability engineer. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential in minimizing the likelihood of failures. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. You'll also investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.


  • Discover the key concepts covered in this course
    Describe critical state management and how it applies to distributed systems and affects reliability
    Define the CAP theorem and describe how it relates to distributed systems
    Outline how to coordinate system failures on distributed systems
    Differentiate deterministic and nondeterministic algorithms and how they relate to distributed systems
    Describe the system models that can be used with distributed systems
    Define the concept of distributed consensus and list the stages of validation
    Define the concept of a Byzantine fault and describe how it applies to distributed systems
    Describe the distributed consensus architecture patterns used in distributed systems
    Describe best practices and tips for increasing the performance of distributed systems
    Define the Multi-Paxos protocol and describe how it relates to distributed systems
    Outline how to deploy distributed consensus-based systems and name some key considerations
    Name and describe the key considerations when monitoring distributed consensus systems
    Summarize the key concepts covered in this course


  • 1m 20s
  • 5m 41s
    Site reliability engineering (SRE) is the practice of allowing software developers to run, manage, and maintain ongoing daily operations of their applications and services so that they are available for users to consume. Critical state management is a key part of SRE, as it allows for anticipating and planning for system failures. Distributed consensus is needed for building highly available and robust systems, which leads to the use of distributed locking.
  • Locked
    3.  CAP Theorem
    5m 19s
    In this video, you'll learn more about the properties of typical DBMS transactions. These are known as ACID, which stands for atomicity, consistency, isolation, and durability. The idea is that every transaction performed against a DBMS abides by these properties. You'll learn how to define these terms and how they relate to distributed systems.
  • Locked
    4.  Distributed Systems Coordination Failure
    7m 22s
    In this video, you will outline the primary job of an SRE. You will learn that, as with all systems, distributed systems can sometimes fail. The objective is always to restore a system to full operation. This means that when a failure happens, we need to figure out what is going on and then resolve the issue.
  • Locked
    5.  Deterministic vs. Nondeterministic
    7m 21s
    In this video, you'll learn the difference between deterministic and nondeterministic algorithms. The objective of algorithms is to get an answer. However, not every algorithm can give you a specific answer. This leads to the discussion of deterministic versus nondeterministic algorithms. Deterministic algorithms work through the same states every time to produce an answer. Meanwhile, nondeterministic algorithms might go through completely different states every time they execute.
  • Locked
    6.  Distributed System Models
    5m 51s
    In this video, you will learn about different kinds of distributed systems. You will discover that there are several different categories of distribution, including synchronous and asynchronous models. You will also learn about architectural and fundamental models.
  • Locked
    7.  Distributed Consensus
    5m 9s
    In this video, you will learn more about distributed systems and how to achieve reliability in a system when dealing with faulty processes. You will learn that solving this problem requires that these distributed processes effectively agree on which data values will be committed to a database. You will learn there are many ways to achieve distributed consensus, including a two-phase commit process and a three-phase commit process.
  • Locked
    8.  Byzantine Fault
    4m 51s
  • Locked
    9.  Distributed Consensus Architecture Patterns
    6m 50s
    In this video, you'll learn about distributed consensus algorithms. You'll learn that these algorithms allow nodes to agree on information. They're low-level primitives, but they provide a foundation for practical functionality. You'll also learn how higher-level components such as datastores, configuration stores, queues, locking, and leader election services work together with consensus algorithms.
  • Locked
    10.  Distributed Consensus Performance
    5m 40s
    Distributed consensus can be quite slow and costly, but if it is implemented correctly, it can still function effectively. To improve performance, throughput, latency, and data replication, a number of strategies can be employed.
  • Locked
    11.  Multi-Paxos Detailed Message Flow
    4m 50s
    In this video, you'll learn how to define the Multi-Paxos protocol and describe how it relates to distributed systems. You'll learn that Paxos operates as a sequence of proposals. These proposals are accepted or denied by a majority of the processes in the system. If accepted, the proposals are executed. This means that Paxos makes sure all the operations are performed in a strict order.
  • Locked
    12.  Distributed Consensus-based System Deployment
    7m 33s
    When deploying a distributed consensus-based system, you need to take a number of factors into account, such as the number of replicas needed, the load each system can handle, and the quorum composition. The best way to determine how many replicas you need is to ask questions such as how important reliability is, how often you'll perform planned maintenance, and what level of risk you're willing to accept.
  • Locked
    13.  Distributed Consensus System Monitoring
    4m 54s
    In this video, you will learn more about distributed consensus-based systems. These are great to have because they solve a lot of problems and reduce risk while reliably serving customers. But as an SRE, your main responsibility is keeping that system up and running. Therefore, one thing you want to have is the ability to monitor these systems. You will learn that to monitor a distributed system, you need system metrics and log data collected and stored in a searchable format.
  • Locked
    14.  Course Summary
    1m 7s
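
The two-phase commit process mentioned under video 7 (Distributed Consensus) can be illustrated with a minimal sketch. This is not taken from the course: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names, and the sketch ignores timeouts, crashes, and network failures, which a real implementation has to handle.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant that votes on a transaction, then applies the outcome."""
    name: str
    can_commit: bool = True  # assumption: stands in for a real local readiness check
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Return True if every participant voted yes and the transaction committed."""
    # Phase 1 (voting): the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction on every participant, which is why two-phase commit requires unanimity rather than a majority.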

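The majority rule described under video 11 (Multi-Paxos) and the replica-count question raised under video 12 come down to simple arithmetic for crash (non-Byzantine) failures. The sketch below assumes a plain majority quorum; the function names are hypothetical and the course itself may present different quorum compositions.

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority can still be formed."""
    return replicas - majority_quorum(replicas)


def proposal_accepted(acks: int, replicas: int) -> bool:
    """A proposal is accepted once a majority of replicas acknowledge it."""
    return acks >= majority_quorum(replicas)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority_quorum(n), tolerated_failures(n))
```

Under this assumption, going from three to four replicas does not increase the number of tolerated failures, which is one reason odd replica counts are common.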

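The contrast drawn under video 5 (Deterministic vs. Nondeterministic) can be made concrete by recording the states each kind of search passes through. The sketch below is an illustration only: it uses random choice as a stand-in for nondeterminism, and the function names are hypothetical.

```python
import random


def linear_search_trace(items, target):
    """Deterministic: probes indices 0, 1, 2, ... in the same order on every run."""
    trace = []
    for i, value in enumerate(items):
        trace.append(i)
        if value == target:
            break
    return trace


def random_probe_trace(items, target, rng=None):
    """Randomized stand-in for nondeterminism: probes indices in a shuffled order."""
    rng = rng or random.Random()
    order = list(range(len(items)))
    rng.shuffle(order)
    trace = []
    for i in order:
        trace.append(i)
        if items[i] == target:
            break
    return trace


if __name__ == "__main__":
    data = [4, 8, 15, 16, 23, 42]
    print(linear_search_trace(data, 16))  # same sequence of probes every run
    print(random_probe_trace(data, 16))   # sequence of probes may differ between runs
```

Both functions end on the index of the target when it is present, but only the first is guaranteed to visit the same sequence of states each time.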
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.

