SRE Incident Management

In the world of the Site Reliability Engineer (SRE), failure is not only an option but an expectation. Systems, web applications, servers, and devices are all prone to performance issues and unexpected shutdowns at some point. These failures can result in significant losses in revenue and customer confidence and, depending on the industry, possibly fines. Fortunately, SRE incident management is one of the main practices used to limit the disruption caused by unexpected issues. In another article, we talked about chaos engineering and how SRE teams proactively look for and test for failures to prevent the worst. However, as we all know, problems can still slip through the cracks. The goal is to prevent these incidents from becoming large-scale cascading outages. SRE and DevOps teams can also use these incidents to build and improve their systems and services more effectively.

What is an incident?

Before we delve into this topic, we first need to define what an incident is. Where is the line drawn between something that requires immediate action and something that can be investigated later? If every issue were classified as urgent, nothing would get resolved. In an IT (information technology) context, an incident is simply an event or problem that disrupts normal operation or quality of service. It may not be a full outage, but if left unchecked, it can have a greater impact on your services and operations. And incidents usually happen at 2:00 a.m., when you are blissfully asleep and get woken up by the sound of your phone going off. We are joking, of course, but you know something is bad if it happens that early in the morning. Nothing good happens at 2:00 a.m., especially in the IT industry.
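To make the line between "act now" and "look at it later" concrete, here is a minimal Python sketch of a triage rule. Everything in it is an assumption for illustration: the Issue fields, the Urgency labels, and the 5% threshold are hypothetical and not taken from any standard or tool. A real team would define its own criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    PAGE_NOW = "page the on-call engineer"
    HANDLE_LATER = "file a ticket for later"


@dataclass
class Issue:
    summary: str
    disrupts_service: bool      # does it affect normal operation or quality of service?
    affected_users_pct: float   # rough share of users affected, 0 to 100


# Hypothetical threshold; tune this to your own services.
PAGE_THRESHOLD_PCT = 5.0


def triage(issue: Issue) -> Urgency:
    """Decide whether an issue needs immediate action or can wait."""
    if issue.disrupts_service and issue.affected_users_pct >= PAGE_THRESHOLD_PCT:
        return Urgency.PAGE_NOW
    return Urgency.HANDLE_LATER


if __name__ == "__main__":
    checkout = Issue("checkout requests failing", True, 40.0)
    typo = Issue("typo on the status page", False, 100.0)
    print(triage(checkout).value)
    print(triage(typo).value)
```

The point of a rule like this is that not everything pages someone at 2:00 a.m. Only issues that actually disrupt service for a meaningful share of users do.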

What is Incident Management?

Now that we’ve talked about what an incident is, we can define incident management: the process by which teams resolve these events and return systems and services to normal. Incident management is only one element of a broader concept known as IT service management, or ITSM. ITSM defines how teams design, build, and deliver their services, and it covers much more than just IT support. ITSM is the set of policies, processes, and frameworks that underpin the IT service life cycle. One widely used set of ITSM practices is the Information Technology Infrastructure Library, or ITIL.

ITIL provides the framework and guidelines for building ITSM solutions. You may already be familiar with other frameworks such as Business Process Framework (eTOM), Control Objectives for Information and Related Technologies (COBIT), FitSM, ISO/IEC 20000, and Microsoft Operations Framework (MOF).

IT Service Management Framework (ITSM)

If we step back and look briefly at the ITSM framework, there are six other components that make up the ITSM “wheel” along with incident management. We won’t go into much detail here, but it’s important to understand how these pieces fit together with incident management.

Service catalog

An IT service catalog is typically a database or resource that an organization creates to give users information about its operational services and offerings. A service catalog provides useful information about current and planned services, as well as pricing, the procurement process, points of contact, and other details.
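As a rough illustration of what such a catalog might hold, the Python sketch below models one catalog entry with the kinds of information mentioned above. The field names, the sample services, and the find_contact helper are all hypothetical. Real catalogs are usually managed in a dedicated ITSM tool rather than in code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CatalogEntry:
    name: str
    status: str                     # "current" or "planned"
    monthly_price: Optional[float]  # None if pricing is not yet set
    procurement_process: str        # how a user requests the service
    point_of_contact: str


# Hypothetical sample data for illustration only.
CATALOG: List[CatalogEntry] = [
    CatalogEntry("Managed database", "current", 120.0,
                 "Submit a request through the service desk", "db-team@example.com"),
    CatalogEntry("Object storage", "planned", None,
                 "Not yet available", "storage-team@example.com"),
]


def find_contact(catalog: List[CatalogEntry], service_name: str) -> Optional[str]:
    """Return the point of contact for a service, or None if it is not listed."""
    for entry in catalog:
        if entry.name == service_name:
            return entry.point_of_contact
    return None


if __name__ == "__main__":
    print(find_contact(CATALOG, "Managed database"))
```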

Service desk

The service desk can be thought of as the point of contact between the service provider and users such as internal employees, stakeholders, or customers. This is the central “hub” where users go to get help and service. As ITIL defines it, service desk work can take the form of incident resolution or service requests, but in either case, the primary purpose of a service desk is to provide fast and efficient service.

Problem Management

Through incident management, the SRE team can resolve an incident quickly, but the underlying problem may still exist and persist for some time. Problem management is the process of identifying and addressing the root causes of incidents on an ongoing basis, which improves long-term performance and future service deployments.
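One simple way to picture the difference is to look across resolved incidents for root causes that keep coming back. The Python sketch below does this over a hypothetical incident log. The record layout, the sample data, and the recurring_root_causes helper are assumptions for illustration, not part of ITIL or any particular tool.

```python
from collections import Counter
from typing import Dict, List


def recurring_root_causes(incidents: List[Dict[str, str]],
                          min_occurrences: int = 2) -> List[str]:
    """Return root causes seen at least min_occurrences times, most frequent first."""
    counts = Counter(
        incident["root_cause"]
        for incident in incidents
        if incident.get("root_cause")   # skip incidents with no known root cause
    )
    return [cause for cause, n in counts.most_common() if n >= min_occurrences]


# Hypothetical resolved incidents; each was fixed at the time it occurred.
INCIDENT_LOG = [
    {"id": "INC-1", "root_cause": "disk full on log volume"},
    {"id": "INC-2", "root_cause": "expired TLS certificate"},
    {"id": "INC-3", "root_cause": "disk full on log volume"},
    {"id": "INC-4", "root_cause": ""},
    {"id": "INC-5", "root_cause": "disk full on log volume"},
]

if __name__ == "__main__":
    print(recurring_root_causes(INCIDENT_LOG))
```

Each incident in the log was resolved on its own, but the repeated root cause is the kind of signal problem management is meant to act on.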

Change management

There is always an element of risk in any type of change, whether it’s a new service rollout or a personnel change. Change management is the process of determining how changes will affect the deployment of a service and what impact they will have on the business itself. Change management is also sometimes grouped with release management.