SRE Emergency & Incident Response: Responding to Emergencies

SRE    |    Intermediate
  • 18 videos | 1h 12m 46s
  • Includes Assessment
  • Earns a Badge
Rating 4.7 of 288 users
Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to deal effectively with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs. In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of test-induced, change-induced, and process-induced emergencies, how to respond to each, and what's involved in proactive approaches to emergency testing and planning. You'll then outline the critical steps for correctly documenting emergencies, including keeping a history of outages and mistakes. Next, you'll differentiate between business continuity and disaster recovery planning, outline how to create both types of plans, and learn how to conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.

WHAT YOU WILL LEARN

  • Discover the key concepts covered in this course
  • Outline the fundamental emergency response principles SREs need to be familiar with and recognize the critical steps to take when a system breaks
  • Recognize the benefits of performing test-induced emergencies and outline what this involves
  • Name the causes and outcomes of change-induced emergencies and outline how to respond to these emergencies
  • Define what is meant by a process-induced emergency, describe its effects, and outline how to respond to one
  • Describe why it is vital to keep a history of outages and mistakes and outline best practices when doing so
  • Recognize the importance of asking important, relevant, and challenging questions
  • Define what is meant by proactive testing, compare it to reactive testing, recognize the importance of encouraging proactive testing, and name best practices when carrying out this type of testing
  • Define what is meant by business continuity and describe why this type of planning matters
  • Outline the six steps involved in developing a business continuity plan
  • Outline methods to test a business continuity plan, recognize the importance of testing this type of plan, and describe some tips when testing
  • Recognize the importance of ongoing efforts to review and improve a business continuity plan and outline how to go about doing it
  • Recognize the importance of having 'top-level' support for business plans and promoting user awareness, and outline how to achieve these goals
  • Define what is meant by a business impact analysis, outline how to conduct one and its typical structure, and name the possible effects on business operations
  • Recognize the importance of developing an IT disaster recovery plan, list the goals of this type of plan, and describe what to consider when developing one
  • Outline key steps to creating a working disaster recovery plan
  • Name some types of IT recovery strategies and recognize the importance of recovery strategies developed for IT systems, applications, and data
  • Summarize the key concepts covered in this course

IN THIS COURSE

  • 1m 47s
  • 5m 39s
    Site reliability engineers (SREs) need to be familiar with the fundamental emergency response principles in order to respond effectively to system failures. The video discusses post-mortem philosophy, triggers for a post-mortem, and steps an SRE should take when a system breaks.
  • 3.  Test-induced Emergencies
    3m 31s
    In this video, you'll learn about the three stages of test-induced emergencies. You'll learn how to induce an emergency, how to respond to it, and what outcomes you can expect. Whether you induce an emergency in a staging environment or in production, it's a form of testing that affects others.
  • 4.  Change-induced Emergencies
    4m 24s
    In this video, you'll learn about change-induced emergencies: emergencies that are a direct result of internal changes to the system, such as configuration pushes and code pushes. You'll learn the various causes of change-induced emergencies, identify the changes that go on in a complex environment, and see how they're managed. You'll also discover how to respond to an emergency efficiently and look at the possible outcomes of change-induced emergencies.
  • 5.  Process-induced Emergencies
    3m 16s
    In this video, you'll learn about process-induced emergencies. You'll examine what constitutes a process-induced emergency, discuss how to respond to one appropriately, and look at its likely outcomes. The video shows a diagram of a large organization with many services clustered within it, grouped by category and color-coded, alongside a large clock labeled 1:00 PM.
  • 6.  Documenting Incidents
    3m 29s
    In this video, you'll learn why it is vital to keep a history of outages and mistakes, along with best practices for doing so. As an SRE, it's your responsibility to learn from incidents so you can avoid them in the future, which makes documenting incidents crucial.
  • 7.  Open-ended Questions
    4m 31s
    This video is about how to ask questions effectively in order to gain insights and improve systems. Open-ended questions are better than closed-ended questions because they often invite more thought and are less predictable. Tough questions are especially useful in the technical arena because they force people to think deeply about their understanding of the subject.
  • 8.  Proactive Testing
    5m 10s
    In this video, you'll learn about proactive testing. You'll define what is meant by proactive testing, compare it to reactive testing, recognize the importance of encouraging proactive testing, and name best practices when carrying out this type of testing.
  • 9.  Business Continuity
    4m 8s
    In this video, you'll learn what a business continuity plan is, why it matters to your business, and the benefits of business continuity planning. You'll also learn that a business continuity plan outlines procedures and instructions to follow should a disaster occur. This typically includes a business impact analysis that outlines the cost of a disaster, broken down by different aspects of your business.
  • 10.  Developing a Business Continuity Plan
    4m 55s
    In this video, you'll learn about developing a business continuity plan, which helps you maintain business functions and weather the storm when disaster strikes. You'll first discuss the process of developing the plan: what you have to take into consideration and which aspects of your business to focus on. Then, you'll learn about common business continuity threats and how to identify acceptable downtime for each critical function.
  • 11.  Testing a Business Continuity Plan
    3m 53s
    In this video, you'll learn how to test a business continuity plan. You'll learn that there are several methods you can use, including a table-top exercise and a structured walk-through. The video also outlines some of the benefits of testing your business continuity plan.
  • 12.  Improving a Business Continuity Plan
    3m 49s
    This video discusses the importance of reviewing and improving a business continuity plan on a regular basis. It covers things to consider before the review, the key aspects of the review, and how to improve the plan following a review.
  • 13.  Business Continuity Plan Awareness
    3m
    In this video, you'll learn about the importance of having 'top-level' support for business plans and promoting user awareness. You'll also outline how to achieve these goals. You'll learn that when an emergency strikes, most areas of a business are vulnerable to some sort of incident. This means it's essential that all business areas are aware of the company's business continuity plan and their role in that plan.
  • 14.  Business Impact Analysis (BIA)
    4m 41s
    In this video, you'll learn about the importance of a business impact analysis, or BIA. You'll learn that a BIA is used to predict the effects of certain disruptions on your business. It does this by gathering and analyzing data relevant to your business.
  • 15.  Disaster Recovery Planning
    7m 2s
    In this video, you'll learn about the IT disaster recovery plan. You'll learn what a disaster recovery plan is, what the structure of a typical IT disaster recovery plan looks like, what some of its benefits are, and how to develop an effective data backup plan. You'll also learn about some of the options you have for data backup.
  • 16.  Creating a Disaster Recovery Plan
    4m 7s
    In this video, you'll learn the details of creating a disaster recovery plan. You'll explore how complex IT systems can involve many assets in the form of software and hardware, critical servers, data, and cloud services. The first step in a disaster recovery plan is to identify all of these assets, their locations, and how they interact with one another. You'll also learn how to identify the context of these assets, such as how they function and how they relate to one another.
  • 17.  IT Recovery Strategies
    4m 5s
    When it comes to an IT recovery plan, there are different strategies to consider for implementing it. There are also some common ways that IT fails, which any strategy you choose should take into account. In this video, we're going to look at IT recovery strategies, including an internal recovery strategy in which a business manages its own internal systems, data backups, and access to them.
  • 18.  Course Summary
    1m 19s

EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft offers you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.

Digital badges are yours to keep, forever.
