Anyone who works with technology knows that, eventually, something will go wrong—even in today’s complex, distributed, and highly available IT systems. In the battle to maintain uninterrupted service, your DevOps teams require updated methods for detecting and solving problems fast. In this report, author Jason Hand explores ways to conduct effective post-incident reviews—not only for responding to failure quickly, but also to help you continuously learn about and improve your system.
Traditional techniques for conducting post-incident analyses don’t work well in modern IT organizations, mainly because the command-and-control approach offers team members no incentive to explore the system and detect flaws when they occur. This guide presents an up-to-date approach to post-incident reviews that embraces the human element and adds more eyes for discovering system flaws and potential improvements.
- Examine why traditional post-incident approaches, such as Root Cause Analysis, do little to provide greater availability and reliability of IT services
- Understand the role that team members can play in discovering system flaws
- Learn why it’s often difficult to determine the cause and effect of outages in complex systems
- Understand why sustained success depends on a core value of continuous improvement
- Get a case study that examines the unique phases of an outage incident
- Explore post-incident analysis in depth by moving away from cause and deeper into the phases of the incident lifecycle