SRE Troubleshooting Processes

SRE | Intermediate
  • 18 Videos | 1h 10m 34s
  • Includes Assessment
  • Earns a Badge
Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.

WHAT YOU WILL LEARN

  • discover the key concepts covered in this course
  • describe how engineers think differently from "novices" when it comes to troubleshooting
  • outline best practices and approaches to troubleshooting and how to keep those skills sharp
  • outline an idealized troubleshooting model (e.g., report, triage, examine, diagnose, test/treat, and cure)
  • list potential pitfalls to avoid, such as looking for symptoms that are not relevant
  • outline how to manage operational loads
  • recognize the importance of an adequate initial problem report
  • recognize the importance of triaging problems from the outset
  • recognize the importance of examining each component of a system to understand whether it is functioning properly
  • identify the steps and approaches used to diagnose issues
  • describe methods for testing and treating possible causes to identify actual problems
  • recognize how to simplify and reduce troubleshooting using techniques such as dividing and conquering
  • describe the "what, why, where" technique and how it can be used to diagnose a malfunctioning system
  • interpret how determining who last touched a system can be helpful when identifying what is going on with a system
  • define what is meant by "negative results"
  • recognize that systems are complex and that often you can only identify probable cause factors to document what went wrong with a system
  • outline steps to make troubleshooting easier
  • summarize the key concepts covered in this course

IN THIS COURSE

  1. Course Overview (1m 34s)
  2. The Troubleshooting Mindset (1m 23s)
  3. Troubleshooting Skills (2m 15s)
  4. Troubleshooting Models (3m 16s)
  5. Common Troubleshooting Difficulties (4m 31s)
  6. Managing Operational Load (5m 1s)
  7. Troubleshooting and Issue Reports (4m 11s)
  8. Troubleshooting and Triaging (2m 21s)
  9. Troubleshooting and Examination (5m 14s)
  10. Troubleshooting and Diagnosis (2m 15s)
  11. Troubleshooting and Testing (6m 5s)
  12. Troubleshooting Simplification and Reduction (2m 43s)
  13. Troubleshooting: Key Questions (3m 9s)
  14. Troubleshooting and Recent Change Evaluation (2m 9s)
  15. Troubleshooting and Negative Results (6m 26s)
  16. Troubleshooting and Probable Cause Factors (4m 21s)
  17. Effective Troubleshooting (4m 27s)
  18. Course Summary (1m 13s)

EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft offers you the opportunity to earn a digital badge upon successful completion of this course. You can share the badge on any social network or business platform.

Digital badges are yours to keep, forever.
