SRE Troubleshooting Processes

SRE | Intermediate
  • 18 videos | 1h 2m 34s
  • Includes Assessment
  • Earns a Badge
Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.


  • discover the key concepts covered in this course
  • describe how engineers think differently from novices when it comes to troubleshooting
  • outline best practices and approaches to troubleshooting and how to keep those skills sharp
  • outline an idealized troubleshooting model (e.g., report, triage, examine, diagnose, test/treat, and cure)
  • list potential pitfalls to avoid, such as looking for symptoms that are not relevant
  • outline how to manage operational loads
  • recognize the importance of an adequate initial problem report
  • recognize the importance of triaging problems from the onset
  • recognize the importance of examining each component of a system to understand whether it is functioning properly
  • identify the steps and approaches used to diagnose issues
  • describe methods for testing and treating possible causes to identify actual problems
  • recognize how to simplify and reduce troubleshooting using techniques such as dividing and conquering
  • describe the "what, why, where" technique and how it can be used to diagnose a malfunctioning system
  • describe how determining who last touched a system can help identify what is going on with it
  • define what is meant by "negative results"
  • recognize that systems are complex and that you can often identify only probable cause factors when documenting what went wrong with a system
  • outline steps to make troubleshooting easier
  • summarize the key concepts covered in this course


  • 1m 34s
  • 1m 23s
  • 3.  Troubleshooting Skills
    2m 15s
  • 4.  Troubleshooting Models
    3m 16s
  • 5.  Common Troubleshooting Difficulties
    4m 31s
  • 6.  Managing Operational Load
    5m 1s
  • 7.  Troubleshooting and Issue Reports
    4m 11s
  • 8.  Troubleshooting and Triaging
    2m 21s
  • 9.  Troubleshooting and Examination
    5m 14s
    In this video, you'll learn why it is important to examine each component of a system to understand whether it is functioning properly. When you're troubleshooting a problem, there are system metrics you might want to consult, ranging from memory and CPU consumption to system logs or even audit and change logs. Having these metrics in your tool belt helps you find correlations in the behavior you're seeing.
  • 10.  Troubleshooting and Diagnosis
    2m 15s
  • 11.  Troubleshooting and Testing
    6m 5s
    In this video, you'll learn how to test and treat the possible causes of a problem. You'll start by coming up with a list of possible causes, then use experimentation to rule them out: you run tests, treat the system, and see whether the treatment resolves the problem or at least affects it in some way.
  • 12.  Troubleshooting Simplification and Reduction
    2m 43s
    In this video, you'll learn how to simplify and reduce troubleshooting using techniques such as dividing and conquering. You'll discover how to simplify the problem and identify the connections between components. This allows you to divide and conquer, a useful general-purpose technique for troubleshooting and finding the solution to a problem.
  • 13.  Troubleshooting: Key Questions
    3m 9s
    In this video, you'll learn more about the six essential questions to ask in any investigation: who, what, when, where, why, and how. You'll discover that these questions can lead you to the solution and help prevent the problem from recurring.
  • 14.  Troubleshooting and Recent Change Evaluation
    2m 9s
  • 15.  Troubleshooting and Negative Results
    6m 26s
  • 16.  Troubleshooting and Probable Cause Factors
    4m 21s
  • 17.  Effective Troubleshooting
    4m 27s
    In this video, you'll learn more about how to make troubleshooting easier. You'll discover there are many ways to do this, such as building observability into the system. Logging, insights, status pages, and other outputs can help someone who's performing active troubleshooting gauge the health of any specific component. You'll also learn about consistency and information availability: all components should have well-designed interfaces that are observable.
  • 18.  Course Summary
    1m 13s
    In this video, you'll review what you've learned in the course, including that a site reliability engineer must be able to troubleshoot malfunctioning systems effectively and efficiently.
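
As an illustration of the dividing and conquering technique named in video 12, here is a minimal Python sketch. It is not taken from the course. It assumes a linear chain of components in which a fault at one stage makes every probe at or after that stage fail; the stage names and the `healthy_through` probe are hypothetical placeholders for whatever checks your own system offers.

```python
from typing import Callable, Sequence


def first_failing_stage(
    stages: Sequence[str],
    healthy_through: Callable[[int], bool],
) -> int:
    """Return the index of the first failing stage, or -1 if none fails.

    Assumption (hypothetical model): healthy_through(i) is True when the
    output observed after stage i looks correct, and once one stage is
    broken every later probe also fails.
    """
    lo, hi = 0, len(stages)
    # Invariant: every stage before lo is healthy, and the first failing
    # stage, if there is one, lies in the half-open range [lo, hi).
    while lo < hi:
        mid = (lo + hi) // 2
        if healthy_through(mid):
            lo = mid + 1   # fault, if any, is after mid
        else:
            hi = mid       # fault is at mid or earlier
    return lo if lo < len(stages) else -1


# Hypothetical request path; the fault is placed at "cache" (index 3).
stages = ["load_balancer", "web", "app", "cache", "database"]
faulty = first_failing_stage(stages, lambda i: i < 3)
print(stages[faulty])
```

Each probe halves the range still under suspicion, so a chain of n components needs roughly log2(n) checks instead of n. Real systems are rarely this linear, so treat the result as a starting point for examination and not as a confirmed cause.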


Skillsoft offers you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.

Digital badges are yours to keep, forever.