Incident Management SRE Tips
Asset Management
You can’t virtualize everything yet. Software services still require physical devices and hardware to function. And organizations need to monitor, manage, and constantly update these devices to keep their services running smoothly. Asset management is also called IT asset management or ITAM.
Knowledge management, policies and procedures
The goal of knowledge management is to reduce redundancy in terms of collecting, analyzing and sharing information within an organization. This helps improve efficiency and ensures that information is consistent, up-to-date and accessible.
Incident Management Life Cycle: Process and Stages
An organization’s response to an incident, whether that means downtime, a security breach or cyberattack, or even long delays and repeated errors, is critical to continued business success and to customer or end-user confidence. SREs must manage complex distributed systems. These systems are more reliable, scalable, and fault tolerant, but they are also extremely complex, which can lead to longer resolution times because problems are harder to detect and pinpoint. The best SRE incident management teams follow a rigorous incident management and resolution process. While the actual steps may vary by organization, most follow the same basic path. Let’s take a look at the steps of the SRE incident management process.
Incident identification
You can’t fix problems you don’t know about. Incident identification begins with some form of monitoring or alerting mechanism. We talked about monitoring distributed systems, and how it applies to SRE teams, in another article. Knowing when and where an error, downtime, or application delay occurs is a critical factor in limiting the impact on users and customers. In some cases, however, an incident becomes known through a support ticket, a phone call, or even social media, and it is never good news when issues are posted publicly for everyone to see.
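As a minimal sketch of the alerting side of identification, the check below flags a service when its error rate crosses a threshold. The function names and the 5% default threshold are assumptions made for illustration, not values from any particular monitoring tool.

```python
def error_rate(errors: int, requests: int) -> float:
    """Return the fraction of requests that failed (0.0 when there is no traffic)."""
    return errors / requests if requests else 0.0


def should_alert(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Return True when the error rate is strictly above the threshold.

    The 0.05 default is an illustrative assumption; real thresholds
    depend on the service and its objectives.
    """
    return error_rate(errors, requests) > threshold
```

In practice a check like this would run over a time window of metrics, so that a single failed request on a quiet service does not page anyone.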
Incident logging
Whatever the detection method, once an incident has been identified, it must be logged. Incident logging serves several purposes. It ensures that there is a formal record of the incident, and it makes it possible to review incident trends later. If the same or a similar incident occurs repeatedly, this may indicate a more complex problem that needs to be addressed. When an incident is logged, relevant information is also included, such as the timestamp, a description of the incident, and who discovered the problem. The more detail, the better.
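One way to picture an incident log is sketched below. It stores the fields mentioned above (timestamp, description, and who discovered the problem) and counts repeats. The class names and the `signature` field used to group similar incidents are hypothetical, introduced only for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class IncidentRecord:
    description: str
    reported_by: str
    # Hypothetical grouping key, e.g. "checkout-api:5xx", used to spot repeats.
    signature: str
    # Recorded in UTC at the moment the incident is logged.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class IncidentLog:
    def __init__(self) -> None:
        self._records: list[IncidentRecord] = []

    def log(self, description: str, reported_by: str, signature: str) -> IncidentRecord:
        """Create a record and append it to the log."""
        record = IncidentRecord(description, reported_by, signature)
        self._records.append(record)
        return record

    def recurring(self, min_count: int = 2) -> dict[str, int]:
        """Return signatures seen at least min_count times, with their counts."""
        counts = Counter(r.signature for r in self._records)
        return {sig: n for sig, n in counts.items() if n >= min_count}
```

Calling `recurring()` during a periodic review is a simple way to surface incidents that keep coming back and may point to a deeper problem.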
Incident categorization
Next, the incident is classified based on factors such as severity, urgency, or functional area of impact. As with logging, the more information provided at this stage, the easier it is later to identify the correct team or person to assign to the incident.
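To illustrate how a category can drive assignment, here is a small sketch that maps a functional area to a responding team. The area names, team names, and fallback team are all made up for the example.

```python
# Hypothetical mapping from functional area to the team that responds.
ROUTING = {
    "database": "storage-oncall",
    "network": "netops-oncall",
    "checkout": "payments-oncall",
}

# Assumed fallback for areas nobody has claimed yet.
DEFAULT_TEAM = "sre-triage"


def assign_team(functional_area: str) -> str:
    """Return the team for a functional area, ignoring case and stray spaces."""
    return ROUTING.get(functional_area.strip().lower(), DEFAULT_TEAM)
```

A fallback team keeps uncategorized incidents from sitting unassigned while someone works out where they belong.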
Incident prioritization
Depending on how the incident has been classified, the next step is to set its priority level. Some of these steps overlap, so in practice categorization and prioritization are often done at the same time. Organizations typically use a simple low, medium, or high scale. However, some incidents automatically fall into a certain level depending on what is affected. For example, an incident related to an outage is automatically given top priority.
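The prioritization rule described above could be sketched roughly as follows. The three-level scale and the outage shortcut come from the text. Taking the higher of severity and urgency for everything else is an assumption made for this example, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Classification:
    severity: Level
    urgency: Level
    functional_area: str
    is_outage: bool = False


def prioritize(c: Classification) -> Level:
    """Return the priority level for a classified incident."""
    # Outages skip the normal scale and go straight to the top level.
    if c.is_outage:
        return Level.HIGH
    # Assumed rule: otherwise use the higher of severity and urgency.
    return max(c.severity, c.urgency)
```

For example, `prioritize(Classification(Level.LOW, Level.MEDIUM, "search"))` returns `Level.MEDIUM`, while any classification with `is_outage=True` returns `Level.HIGH`.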
Incident response, resolution and closure
The last step is to respond to and resolve the incident. This step is more art than science, and there is no easy button. It may take several cycles of attempts to confirm that the incident is finally resolved. Each attempt may bring more information and new theories about why the incident is happening. It can also reveal other places where weaknesses may be present. Once the incident has been addressed, it’s time to close the ticket and respond to the original user who reported it.
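The try-and-verify cycle can be sketched as a simple loop: apply a candidate fix, check whether the system is healthy, and move on to the next theory if it is not. The function and parameter names below are hypothetical, and a real response involves far more human judgment than this loop suggests.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A candidate fix: a human-readable name plus the action that applies it.
Fix = Tuple[str, Callable[[], None]]


def resolve_incident(
    fixes: Iterable[Fix],
    is_healthy: Callable[[], bool],
) -> Tuple[Optional[str], List[str]]:
    """Try fixes in order until the health check passes.

    Returns the name of the fix that worked (or None if none did)
    and the list of everything that was tried, for the incident record.
    """
    tried: List[str] = []
    for name, apply_fix in fixes:
        apply_fix()
        tried.append(name)
        # Verify after every attempt instead of assuming the fix worked.
        if is_healthy():
            return name, tried
    return None, tried
```

Keeping the list of attempts matters as much as the result, since each failed attempt is information that belongs in the incident log.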