Site Reliability Engineering: Release Engineering Intermediate
  • 2 Courses | 2h 12m
  • 42 Courses | 43h 49m 40s
  • 1 Book | 7h 17m
  • Includes Lab
Likes 137
Explore Site Reliability Engineering (SRE), where software engineering aspects are applied to infrastructure and operation tasks.


Build & Release Engineering Best Practices: Release Engineering

  • Course Overview | 1m 35s
  • What Is Release Engineering? | 5m 53s


Site Reliability: Engineering

  • Course Overview | 1m 17s
  • What Is Site Reliability Engineering? | 4m 31s


Build & Release Engineering Best Practices: Release Engineering
It's important to know why the roles, philosophy, and principles behind release engineering - a relatively new discipline of software engineering - are used for building and delivering software. In this course, you'll learn about the automated release system called Rapid, and how it can be used to provide a framework for delivering reliable software builds and releases. You'll also learn about configuration management and the importance of collaboration between release engineers and site reliability engineers.
15 videos | 59m has Assessment available Badge
Build & Release Engineering Best Practices: Release Management
Release management can guide your software development efforts from planning to deployment, resulting in better customer satisfaction with the end product. In this course, you'll learn about the benefits of using a release management process to manage and improve the development of a software build. You'll then move on to explore key concepts and principles that apply to release management, as well as common considerations and potential challenges to be aware of. Lastly, you'll learn about common toolsets used by release engineers and best practices related to continuous integration and release deployment.
15 videos | 1h 12m has Assessment available Badge


Site Reliability: Engineering
Site Reliability Engineers are often considered the link between software development and operations. In this course, you'll explore the principles of site reliability engineering as well as common concerns such as measuring and managing risk, and risk tolerance. You'll also learn how to ensure a satisfactory level of service by implementing Service Level Objectives, Service Level Agreements, and Service Level Indicators.
13 videos | 1h 5m has Assessment available Badge
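As a rough illustration of how a Service Level Indicator relates to a Service Level Objective, the sketch below computes an availability SLI and the remaining error budget for one measurement window. It is not taken from the course; the function names and the request-counting model are assumptions made for this example.

```python
def availability_sli(good_events, total_events):
    """Fraction of events in the window that met the success criterion."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo_target, good_events, total_events):
    """Failures still allowed in this window before the SLO is missed.

    slo_target is a fraction such as 0.99 (hypothetical value).
    A negative result means the budget is already overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return allowed_failures - actual_failures
```

With a 99% target over 10,000 requests, roughly 100 failures are allowed, so 50 observed failures leave about half of the budget.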
Site Reliability: Tools & Automation
There are numerous tools available to Site Reliability Engineers to help with planning, managing, deploying, automating, and monitoring services and infrastructure. In this course, you'll explore these tools as well as some of the benefits of automation and the automation process. You'll also discover common pitfalls and failures, as well as how to manage incident postmortems.
14 videos | 52m has Assessment available Badge
Best Practices for the SRE: Automation
It has been proven that the automation of processes and systems commonly results in higher production rates and increased productivity. In this course, you'll learn the basics of automation, including benefits such as consistency, efficiency, problem-solving, and cost-savings. You'll examine the potential challenges of automation, including integration, complexity, and security. Lastly, you'll learn the value of automation for a Site Reliability Engineer and how SREs are using automation to improve daily operations and overcome obstacles.
15 videos | 1h has Assessment available Badge
Best Practices for the SRE: Use Cases for Automation
Site Reliability Engineers often use automation and orchestration capabilities to scale security and performance, ensuring sites are reliable and efficient. In this course, you'll learn about common use cases for automating systems and processes. You'll examine PowerShell capabilities that can be used to automate a variety of Windows administrative tasks including user creation, patching and updating, bulk enrollment, and software installations. Lastly, you'll learn about cluster turnup automation, reliability, and enabling failure at scale.
16 videos | 1h 9m has Assessment available Badge
Backup & Recovery: Business Continuity & Disaster Recovery
Disasters can occur at any time and to organizations of any size, so administrators should invest the time and resources to properly plan for business continuity and disaster recovery. In this course, you'll learn how to plan for business continuity, assess risk, and perform business impact assessments. You'll also learn about system resilience, sensitive data types, and data classifications. Lastly, you'll see a comparison of Recovery Time Objective and Recovery Point Objective, and examine what to include when preparing a disaster recovery training plan.
15 videos | 1h 21m has Assessment available Badge
Backup & Recovery: Enterprise Backup Strategies
Critical information must be backed up and protected for a company's survival. In this course, you'll learn about onsite and offsite backup and recovery solutions. You'll examine the three main cloud providers - Amazon Web Services, Microsoft Azure, and Google Cloud. You'll then learn about considerations for local backup and bring your own device backups. Finally, you'll explore the cultural impact involved in moving to the cloud and how employee communication and inclusion could be vital to a successful migration.
11 videos | 45m has Assessment available Badge
Backup & Recovery: Windows Client Backup and Recovery Tools
For the vitality of any company, data protection solutions are essential. There are numerous types of built-in backup and recovery tools available in the Windows 10 operating system. In this course, you'll learn about features such as File History, System Image Backups, and OneDrive and how they can be used to keep data safe and secure. Next, you'll examine how to repair a Windows 10 PC using the Advanced Startup options, enable volume shadow copies, and create a recovery drive for access to the Advanced Startup options. Finally, you'll learn about the various restore features, such as System Restore, that can be used to restore a system to a previously known working version.
14 videos | 1h 11m has Assessment available Badge
Describing Distributed Systems
A distributed system involves numerous computers that work together but appear as a single computer to the operator. In this course, you'll learn how distributed systems can provide numerous benefits, including performance, availability, and autonomy. You'll also explore distributed systems in greater detail, and learn strategies and best practices for monitoring them.
13 videos | 42m has Assessment available Badge
Monitoring Distributed Systems
Principles and techniques are key in building a successful monitoring and alerting system. In this course, you'll explore the 'four golden signals' of monitoring while learning how to differentiate between symptoms and causes. You'll also learn about the guidelines for designing a monitoring system, questions to ask when creating rules for monitoring, and how to monitor for the long term.
14 videos | 30m has Assessment available Badge
Site Reliability Engineering: Scenario Planning
Scenario planning helps site reliability engineers strategically prepare for uncertainties that may disrupt or negatively affect services. In this course, you'll explore scenario planning use cases and the strategies utilized to prepare for disasters. You'll examine the functions of Disaster Recovery Testing (DiRT) and Customer Reliability Engineering teams, which help manage the impact of a disaster or disruption. Next, you'll identify disaster recovery testing events and recognize how to plan and design tests for DiRT. You'll move on to describe the production incident lifecycle and how to minimize production incidents. You'll identify unmanaged responses, how to rectify untrained responses, and the activities used to train response teams. Finally, you'll examine how to test people and how they self-organize and interact using various role-playing and test scenarios.
21 videos | 1h 11m has Assessment available Badge
SRE Simplicity: Software System Complexity
Simple systems and software are proven to be easier to develop, understand, maintain, and test. For site reliability engineers, simplicity should be an end-to-end goal and cover all aspects of the software life cycle. In this course, you'll explore the importance of simple systems and software code. You'll identify the different types of software complexity, such as structural complexity, organizational complexity, complexity of use, and theoretical complexity, and learn how to differentiate between complex and complicated code. You'll move on to recognize how to measure complexity using various metrics, such as cyclomatic complexity, the Halstead metric, and the maintainability index. Lastly, you'll examine class coupling, using NPATH to measure the complexity of a piece of code, and prioritizing the simplification of projects and resources.
18 videos | 1h 16m has Assessment available Badge
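To make the cyclomatic complexity metric mentioned above more concrete, here is a minimal sketch that approximates it for Python source by counting decision points in the syntax tree and adding one. This is a simplification chosen for this example, not the course's method: production tools count more constructs, and the names `cyclomatic_complexity`, `BRANCH_NODES`, and `SAMPLE` are hypothetical.

```python
import ast

# Node types treated as one decision point each (a simplifying assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

# Hypothetical sample: one `if`, one `and`, and one `for` loop.
SAMPLE = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
"""


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths, one per extra operand.
            complexity += len(node.values) - 1
    return complexity
```

For `SAMPLE`, the count is one base path plus three decision points (`if`, `and`, `for`).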
SRE Simplicity: Simple Software Systems
When creating a simple software system, it is essential to identify and remove any unwanted complexity, whether accidental or essential. By eliminating complexity, site reliability engineers can ensure the final software product is more stable and reliable. In this course, you'll learn to differentiate between agility and stability and explore the importance of stability testing. You'll learn about key metrics and methods, such as production analysis and agile process metrics, which can be used by software development teams to ensure business goals are met. Lastly, you'll learn how to avoid introducing potential defects and bugs by limiting the number of negative lines of code in a project.
15 videos | 1h 8m has Assessment available Badge
SRE Postmortems: Blameless Postmortem Culture Creation
There are various, frequently-used premortem and postmortem techniques adopted by site reliability engineers (SREs) to diagnose issues and come up with problem resolution ideas and alternative approaches. To do this effectively, SREs need to account for several factors at play, including the workplace culture and work collaboration. In this course, you'll learn how to promote a blameless culture - one without finger-pointing and animated language. You'll explore the key characteristics of good and bad postmortems, and discover the benefits of reviewing postmortems, sharing knowledge, giving feedback, and rewarding positive behavior. You'll then learn how to respond to postmortem culture implementation failure. Lastly, you'll discover how using the right postmortem templates and postmortem management tools can improve how you write postmortems and manage their associated data.
22 videos | 1h 11m has Assessment available Badge
Cloud and Containers for the SRE: Containers
Containers in cloud computing are a form of operating system virtualization that allows users or administrators to deploy and run applications without the need for virtual machines. Containers can be deployed and run virtually anywhere, and support Linux, Windows, and Mac operating systems. In this course, you'll explore the various types of container solutions, including Kubernetes, Docker, and AWS. You'll outline how containers enable a more efficient continuous integration and delivery system and why they're needed for SRE. You'll also examine container storage, security, and migration. You'll list the high-availability solutions available for containers and investigate the Containers as a Service concept. Lastly, you'll recognize how the container ecosystem is revolutionizing software delivery, and identify the role of Docker and Kubernetes in container orchestration.
20 videos | 1h 21m has Assessment available Badge
Cloud and Containers for the SRE: Implementing Container Solutions
Although containerization technologies such as Docker and Kubernetes can function independently, they can also benefit significantly from one another. Furthermore, open source automation tools such as Jenkins can be used to increase resource utilization and efficiency through pipelines. In this course, you'll explore the many benefits of pipelines, and learn how to use them to build code. You'll outline the benefits of Git and GitHub for revision control and identify the distributed version control tools that can be used to manage source code history. You'll then work with Jenkinsfiles to write pipeline-as-code and code to use at the build stage, after the build and test stages, and for recording failures. Next, you'll use the Jenkins Pipeline to set the environment variables and outline the key steps and factors needed in your code review. Lastly, you'll learn how to use Kubernetes to deploy applications with high availability, scalability, and resilience.
14 videos | 1h 7m has Assessment available Badge
SRE Troubleshooting Processes
Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.
18 videos | 1h 2m has Assessment available Badge
SRE Troubleshooting: Tools
Site reliability engineers (SREs) are typically good problem solvers. They need to think logically to identify problems, correct them, and prevent them from happening again. In this course, you'll explore several built-in and open-source troubleshooting tools SREs can use for resolving system issues. You'll start by examining the techniques of logging and whitebox and blackbox monitoring used to monitor system events. You'll then work with the various built-in Windows troubleshooting tools, namely the Event Viewer, Resource Monitor, and System Information tools. Next, you'll use Google Cloud Dataflow to process logs, before outlining the purpose and benefits of the StatsD standard and the /api/search endpoint. Lastly, you'll identify how Google's Dapper is used for troubleshooting distributed systems, and the open standards tool, Prometheus, for instrumenting software and exposing metrics.
13 videos | 41m has Assessment available Badge
Site Reliability Engineer: Managing Overloads
Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also comprise types of occupational stress, which invariably negatively affect an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate workloads and prioritize work before recognizing the specific consequences of overloads. You'll then describe how to manage client-side traffic using per customer limitations and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.
20 videos | 1h 10m has Assessment available Badge
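One widely cited form of client-side throttling is the adaptive throttling rule described in Google's Site Reliability Engineering book, where a client rejects requests locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below assumes that rule for illustration; whether the course uses the same formula is not stated, and the function names and the default multiplier K = 2 are choices made for this example.

```python
import random


def rejection_probability(requests, accepts, k=2.0):
    """Probability of rejecting a request locally.

    requests: requests the client attempted in the recent window.
    accepts:  requests the backend accepted in the same window.
    k:        multiplier; lower values throttle more aggressively.
    """
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_send(requests, accepts, k=2.0, rng=random.random):
    """Return True if this request should be sent to the backend."""
    return rng() >= rejection_probability(requests, accepts, k)
```

While the backend accepts at least half of the attempted requests (with K = 2), the rejection probability stays at zero and nothing is throttled locally.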
Site Reliability Engineer: Managing Cascading Failures
Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.
21 videos | 1h 11m has Assessment available Badge
SRE Emergency & Incident Response: Responding to Emergencies
Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs. In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of and how to respond to test-induced, change-induced, and process-induced emergencies and what's involved in proactive approaches to emergency testing and planning. You'll then outline the critical steps to correctly documenting emergencies, including the history of outages and mistakes. You'll then differentiate between business continuity and disaster recovery planning and outline how to create both types of plans and conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.
18 videos | 1h 12m has Assessment available Badge
SRE Emergency & Incident Response: Incident Response
A well-prepared and organized approach is key to addressing and managing the aftermath of a system failure, security breach, or cyberattack. In this course, you'll explore the fundamental principles an SRE needs to be familiar with when responding to and managing incidents. You'll identify the goals, requirements, best practices, and key players involved in incident management. You'll learn how to deal with managed and unmanaged incidents and what's involved in an incident response plan. You'll identify incident response roles and responsibilities, and how to use incident metrics to manage incidents at scale. You'll outline what's involved in establishing a computer security incident response team (CSIRT), including each key team member's roles and responsibilities. Lastly, you'll examine what goes into an incident response policy.
17 videos | 1h 24m has Assessment available Badge
Distributed Reliability: SRE Critical State Management
Anticipating failures that will affect your company's systems is a crucial site reliability engineer duty. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential in minimizing the likelihood of failures. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. You'll also investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.
14 videos | 1h 13m has Assessment available Badge
Distributed Reliability: SRE Distributed Periodic Scheduling
A distributed system requires constant maintenance to ensure failures don't interfere with its reliability and availability. Using periodic scheduling and replication, site reliability engineers can minimize the effect failures may have on a system's performance. One way to automate this process is to utilize the system daemon, cron. In this course, you'll explore how to use cron for task scheduling, the purpose, components, and operators involved in cron jobs, and the format and characters of cron syntax. You'll outline how cron works with distributed periodic scheduling and idempotency, and in large-scale deployments. Next, you'll review the Paxos distributed consensus algorithm, best practices for its use, and how it applies to distributed replication. Lastly, you'll practice scheduling a cron job and using cron syntax generators.
14 videos | 57m has Assessment available Badge
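To illustrate the five-field cron syntax mentioned above (minute, hour, day of month, month, day of week), here is a minimal sketch that expands each field into the concrete values it matches. It is a simplified illustration, not a full cron implementation: it assumes only `*`, single numbers, comma lists, `a-b` ranges, and `/step` on `*` or ranges, with days of the week numbered 0 to 6. The names `CRON_FIELDS`, `expand_field`, and `parse_cron` are hypothetical.

```python
# (name, lowest value, highest value) for each of the five fields.
CRON_FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),  # assumption: 0 = Sunday, 7 is not accepted
]


def expand_field(expr, low, high):
    """Expand one cron field into the sorted list of values it matches."""
    values = set()
    for part in expr.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            if step_text:
                # Step on a single value is ambiguous across cron variants.
                raise ValueError(f"step needs '*' or a range: {expr!r}")
            start = end = int(base)
        if step < 1 or not (low <= start <= end <= high):
            raise ValueError(f"invalid cron field: {expr!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def parse_cron(expression):
    """Map each field name to the list of values the expression matches."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError("expected five space-separated fields")
    return {
        name: expand_field(part, low, high)
        for part, (name, low, high) in zip(parts, CRON_FIELDS)
    }
```

For example, `parse_cron("*/15 2 * * 1-5")` describes a job that runs every 15 minutes during the 02:00 hour, Monday through Friday.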
SRE Load Balancing Techniques: Front-end Load Balancing
Today's distributed systems can consist of hundreds or even thousands of servers, and getting them to work together efficiently is a challenge. Load balancing is a multifaceted concept whose many techniques can help SREs face this challenge. In this course, you'll explore how front-end load balancing works and its associated techniques, concepts, and capabilities. You'll examine the characteristics of load balancers, their use in application delivery and security, and the use of DNS load balancers. You'll outline strategies for virtual IP load balancing, cloud load balancing, and handling overload. Finally, you'll learn how the Google Front End Service, Andromeda virtualization stack, Maglev network load balancing service, and the Envoy edge and service proxy are used for load balancing-related tasks.
14 videos | 1h 5m has Assessment available Badge
SRE Load Balancing Techniques: Data Center Load Balancing
A Site Reliability Engineer (SRE) must know how to perform load balancing within the data center, both internally and externally. In this course, you'll learn about load balancing, including various methods for balancing loads in the data center. You'll begin by examining what data center load balancing is and its importance to performance, as well as load balancing policies. You'll then learn how to deal with unhealthy tasks using flow control, and tips and tricks for optimizing load balancing. Next, you'll examine methods for limiting connection pools with subsetting, and the various load balancing components. Lastly, you'll learn how to balance loads internally and externally using HTTPS and TCP/UDP, and how to balance loads using SSL and TCP proxy load balancing.
14 videos | 1h 3m has Assessment available Badge
SRE Products at Scale: Product Launches
Site Reliability Engineers (SREs) often contribute to the launch of new products and features. These launches can occur in rapid iterations and at scale, so SREs need to be prepared to help them succeed. In this course, you'll examine launch coordination engineering to build and release reliable and fast products. You'll identify the criteria for a successful product launch and how to develop and use launch checklists to reduce failure and ensure consistency and completeness. Next, you'll outline the techniques used for reliable launches and how launch coordination engineers can help mitigate the repetition of launch mistakes. You'll investigate the production readiness review model used to identify a service's reliability needs. Lastly, you'll outline the characteristics of SRE engagement and early engagement models, as well as SRE engagement frameworks.
16 videos | 1h 17m has Assessment available Badge
Cloud and Containers for the SRE: Cloud Architectures & Solutions
When deploying a medium- to large-sized cloud solution, there are many factors to consider, such as the numerous cloud environments to choose from and the different levels of management and security they each require. In this course, you'll explore these environments in detail, with a specific focus on their application in SRE. You'll examine the features, purpose, benefits, and potential drawbacks of services such as Software as a Service (SaaS), Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Anything as a Service (XaaS). You'll then investigate private, public, hybrid, and community clouds and on- and off-premises software. Moving on, you'll delve into cloud architecture-related topics, such as orchestration, automation, elasticity, and cloud bursting. Lastly, you'll study cloud payment models, resource allocation, and on-demand self-service.
24 videos | 1h has Assessment available Badge
SRE Testing Tasks: Software Reliability & Testing
Site reliability engineers (SREs) can use various testing techniques to ensure software operations are as failure-free as possible for a specified time in a specified environment. In this course, you'll explore multiple testing techniques, their purposes, and the tasks involved in their execution. You'll start by examining traditional software testing approaches, such as unit tests, integration tests, and system tests. Next, you'll investigate the components and use cases of various reliability metrics applied to SRE testing, including mean time to failure (MTTF), mean time to recover (MTTR), and mean time between failures (MTBF). Lastly, you'll outline several software testing approaches, such as stress, configuration, integration, acceptance, production, and canary testing, among others. You'll identify when, how, and by whom each of these testing types is carried out.
18 videos | 1h 22m has Assessment available Badge
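The reliability metrics named above are related by simple arithmetic. The sketch below assumes one common convention, in which MTBF is the full failure-to-failure cycle (MTBF = MTTF + MTTR) and steady-state availability is MTTF / (MTTF + MTTR). Some texts define MTBF differently, so treat this as an illustration rather than the course's definition; the function names are hypothetical.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures, under the MTBF = MTTF + MTTR convention."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)
```

A system that runs 99 hours on average before failing and takes 1 hour to recover has a 100-hour failure cycle and is available 99% of the time under these assumptions.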
SRE Testing Tasks: Testing Considerations
Site reliability engineers (SREs) need to create a healthy test and build environment to ensure that products being distributed integrate and function as expected. In this course, you'll explore the fundamentals of creating a robust SRE test and build environment, looking at the standard tools and techniques available for testing at scale. You'll examine disaster and statistical testing, and learn about working with deadlines and production configurations. You'll investigate the topic of test failures, identifying why an SRE should expect specific tests to fail and how results for test failures can help maximize knowledge about operations and end-users. Lastly, you'll look at the why and how of incorporating break glass procedures, integration testing configuration files, and fake back-end versions into your testing procedures.
14 videos | 1h 2m has Assessment available Badge
SRE Team Management: Scaling the Team
When adding a new site reliability engineer (SRE) to your team, it's important that the new member not only has the required skills but also receives the proper training. This allows the new SRE to fit into the team and get up to speed as quickly as possible. In this course, you'll learn about the best practices for onboarding a new SRE team member, including methods and tools that can be used during the onboarding process. Next, you'll explore the technical skills that an SRE requires, including the ability to reverse engineer an application to determine the root cause of a problem. Finally, you'll examine the skills and knowledge an SRE requires when on-call, including those needed to provide support and manage support issues.
14 videos | 1h 3m has Assessment available Badge
SRE Data Pipelines & Integrity: Data Pipelines
Site reliability engineers often find data processing complex as demands for faster, more reliable, and more cost-effective results continue to evolve. In this course, you'll explore techniques and best practices for managing a data pipeline. You'll start by examining the various pipeline application models and their recommended uses. You'll then learn how to define and measure service level objectives, plan for dependency failures, and create and maintain pipeline documentation. Next, you'll outline the phases of a pipeline development lifecycle's typical release flow before investigating more challenging topics such as managing data processing pipelines, using big data with simple data pipelines, and using periodic pipeline patterns. Lastly, you'll delve into the components of Google Workflow and recognize how to work with this system.
21 videos | 1h 11m has Assessment available Badge
SRE Data Pipelines & Integrity: Pipeline Design
Site reliability engineers (SREs) encounter numerous and varied pipeline technologies and frameworks in their work. When building a pipeline, SREs need to invest considerable time during the design phase to ensure the results work best for the specific case. In this course, you'll explore the numerous features of a pipeline, such as latency, high availability, development, and operations. You'll also examine the two different pipeline mutations: idempotent and two-phase, as well as the checkpointing technique and various code patterns. You'll then investigate the five core characteristics of the pipeline maturity matrix and outline how they should be used to design the pipeline technology. You'll then identify potential failure modes, outage causes, and different prevention and response techniques. Finally, you'll outline event delivery system design and operations and how to plan for customer integration and support.
17 videos | 1h 2m has Assessment available Badge
SRE Data Pipelines & Integrity: Data Integrity
Data integrity is vital as it ensures end-user data accuracy and consistency in conjunction with an adequate level of service and availability. In this course, you'll learn how to choose a strategy for data integrity, including how to account for any potential upsides and tradeoffs. You'll explore various types of failures that lead to data loss and the existence of the many data failure modes. You'll also identify data integrity challenges. Next, you'll examine in detail the soft deletion, backup and recovery, and early detection layers of defense-in-depth, before investigating the data integrity challenges a cloud developer may encounter in high-velocity environments. Finally, you'll outline considerations for implementing out-of-band data validation and successful data recovery and identify how the primary SRE principles apply to data integrity.
16 videos | 1h 6m has Assessment available Badge
SRE Team Management: Managing Operational Loads
To ensure and maintain a system's functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage a system's operational load, which generally falls into three categories: ongoing operation activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational loads at the team level and using support ticketing systems and service level objectives. Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between crucial metrics used for managing them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.
17 videos | 54m has Assessment available Badge
SRE Team Management: Operational Overload
Site reliability engineers (SREs) are responsible for many administrative tasks, often splitting their time between reactive ops work and special projects. To ensure teams do not become overloaded, SREs may be transferred to a team in order to prevent or help mitigate overload. In this course, you will learn how to deal with operational overload. You'll start by examining ops mode, which is an approach used to ensure services are properly maintained and optimized. You'll discover factors that contribute to team morale and stress. In addition, you will outline emergency planning strategies and best practices, as well as learn how to categorize emergencies and prepare detailed emergency plans. Next, you'll explore how knowledge sharing relates to emergency preparedness, the key to writing successful postmortems, the importance of service level objectives, and how an appropriate level of detail is required to properly explain your findings. Lastly, you'll discover the key factors and attributes of successful teams. You'll examine a team-first approach and differentiate between questioning techniques such as open/closed, funnel, probing, and leading.
14 videos | 55m has Assessment available Badge
SRE Metric Management: Software Reliability Metrics
To improve the chances of creating, monitoring, and maintaining a successful software development project, site reliability engineers and all team members must be aware of which metrics to measure. They also need a working knowledge of both automated and manual testing methods. In this course, you'll learn how to manage and select SRE metrics and how various testing methods work. You'll begin by learning what metrics need to be measured for project management, software development, and APIs, examining in detail CI/CD, cloud API, and software project metrics, to name a few. Next, you'll compare manual and automated testing methods and the goals of each. Lastly, you'll investigate automated testing frameworks and platforms, test cases and types, and best practices and pitfalls to consider.
17 videos | 1h 23m has Assessment available Badge
SRE Metric Management: Software Reliability Monitoring and Reporting
Once SRE metrics have been identified, site reliability engineers (SREs) must know how to perform fault analysis on a system, classify defects, and monitor and report data. In this course, you'll explore the tools and best practices for carrying out these procedures. You'll begin by identifying various fault analysis methods and tools. You'll then classify software defects and bugs with a focus on severity and priority. Next, you'll investigate strategies for monitoring APIs and explore some tools used for this task. You'll then examine in detail several tools for collecting, analyzing, and reporting metric data using a customizable dashboard, including those that comprise the ELK Stack - Elasticsearch, Logstash, and Kibana. Furthermore, you'll explore the data collection tool Beats and the beneficial use cases for Elasticsearch notifications.
17 videos | 1h 17m has Assessment available Badge
Core Skills for Site Reliability Engineers: SRE Collaboration & Communication
Collaboration is key to getting the most out of your team and ensuring your clients receive their desired service. In this course, you'll learn to collaborate and communicate effectively as an SRE. You'll learn how to run traditional and virtual meetings to ensure maximum effectiveness and productivity, whether it's with customers, internal or external team members, or distributed teams. You'll examine how to plan, carry out, and post-analyze meetings using best practices and sufficient preparation, tailoring these methods to suit the participants and the end goal. You'll delve into the unique characteristics of different meeting types, such as those for problem-solving or innovation. You'll explore the advantages and challenges of SRE pair programming. You'll then end the course by investigating some helpful collaboration and communication tools.
14 videos | 1h 5m has Assessment available Badge
SRE Engagement: Production Readiness Review
Production Readiness Review (PRR), the standard first step of SRE engagement, and its phases are used to identify a service's reliability needs. The concept of "early engagement" is then used to evolve the Simple PRR model. In this course, you'll investigate SRE engagement, early engagement, and Production Readiness Review. You'll start by delving into each phase of the SRE Production Readiness Review (PRR) model, namely, engagement, analysis, refactoring, training, onboarding, and continuous improvement. Next, you'll learn how early engagement can be used to evolve the Simple PRR model. You'll then examine how SRE platforms and frameworks can provide structural solutions. Finally, you'll learn how to use the SRE engagement model to manage software projects, comparing it to the traditional System Development Life Cycle (SDLC) model.
14 videos | 1h has Assessment available Badge
SRE Engagement: The SRE Engagement Model
The SRE engagement model and SRE service lifecycle have noteworthy similarities to and differences from the traditional software development life cycle. In this course, you'll explore these differences and investigate the SRE engagement model's components and how to work with it in various circumstances. You'll learn the steps for setting up and building SRE service relationships and establishing a roadmap for sprints and communication. You'll examine how to measure the impact of SRE engagement, set ground rules for SRE teams, and sustain effective relationships with other SREs and developers. Next, you'll study the steps to take for scaling SRE to larger environments and for ending an engagement. Lastly, you'll review case studies to see how others have used the SRE engagement model in real life.
14 videos | 1h 3m has Assessment available Badge
Final Exam: Chaos Engineer
Final Exam: Chaos Engineer will test your knowledge and application of the topics presented throughout the Chaos Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.
1 video | 32s has Assessment available Badge
Final Exam: Network Admin
Final Exam: Network Admin will test your knowledge and application of the topics presented throughout the Network Admin track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.
1 video | 32s has Assessment available Badge


Skillsoft gives you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.

Digital badges are yours to keep, forever.



Chaos Engineering
By reading this book you will learn how Chaos Engineering enables your organization to navigate complexity.
Book | 7h 17m | By Casey Rosenthal

