The idea of moving on a path toward operational excellence has been on my mind lately.
When we encounter failures in production, these should be treated as significant events. A failure that triggered a page to a person on call is even more critical. The way an organization handles these incidents is its operational response. I have been thinking about the different layers that make up this response. I see it as composed of:
- Communication
- Ownership
- Root cause analysis
- Prioritization
The organization must achieve the right balance of these elements to respond appropriately.
Communication
Documenting the details of an incident is the foundation for addressing the issue. If the appropriate detail is lacking, the operational response is out of balance. Details are required to determine root cause and to prioritize any incident.
Communication begins with the first notice that an incident has occurred. Ideally, for critical issues, that notice comes from a page or another automated response. There should be a record that an incident occurred along with all relevant details: who, what, when, where, why, and how.
The fact that an incident has occurred should be visible and communicated to interested parties.
Following root cause analysis, any changes that need to be made to address an issue should be clearly documented and associated with the incident documentation.
Ownership
Every incident must have a clear owner. The owner is responsible for triage, data collection, and documenting what occurred. Ownership may transfer once data collection makes root cause discovery possible.
The organization must hold the owner accountable for navigating an incident through its lifecycle. Incidents must not fall through the cracks and fail to be investigated.
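One way to picture ownership and lifecycle together is a small record that refuses to exist without an owner and only moves forward one stage at a time. The sketch below is illustrative only: the `Incident` class, the stage names in `Status`, and the `unfinished` helper are assumptions made for this example, not a description of any particular tracking tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    TRIAGED = "triaged"
    ROOT_CAUSE_FOUND = "root_cause_found"
    CLOSED = "closed"


# Hypothetical lifecycle: each status may only advance to the next one.
NEXT_STATUS = {
    Status.OPEN: Status.TRIAGED,
    Status.TRIAGED: Status.ROOT_CAUSE_FOUND,
    Status.ROOT_CAUSE_FOUND: Status.CLOSED,
}


@dataclass
class Incident:
    summary: str
    owner: str  # every incident must have a clear owner
    status: Status = Status.OPEN
    owner_history: list = field(default_factory=list)

    def __post_init__(self):
        if not self.owner.strip():
            raise ValueError("an incident must have an owner")
        self.owner_history.append(self.owner)

    def transfer(self, new_owner: str) -> None:
        """Hand the incident to a new owner, keeping a record of past owners."""
        if not new_owner.strip():
            raise ValueError("an incident must have an owner")
        self.owner = new_owner
        self.owner_history.append(new_owner)

    def advance(self) -> None:
        """Move to the next lifecycle stage; closed incidents cannot advance."""
        if self.status not in NEXT_STATUS:
            raise ValueError("incident is already closed")
        self.status = NEXT_STATUS[self.status]


def unfinished(incidents):
    """Return incidents that have not reached CLOSED, for review with their owners."""
    return [i for i in incidents if i.status is not Status.CLOSED]
```

Running `unfinished` over every known incident before a review meeting gives a list of items that are at risk of falling through the cracks, each with a named owner to ask about it.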
Root Cause Analysis
Once sufficient data has been collected and attached to an incident, the process of root cause analysis can begin. The most important result of this effort is to provide demonstrable proof of the cause of an issue.
Too often I have seen incidents where a user saw that Event X was occurring, took Action Y, and this resolved the incident. Root Cause Analysis requires that we provide evidence that Event X was actually occurring and that taking Action Y directly led to the resolution of the incident. This ties back to the Communication point where we document these findings for future reference.
Prioritization
Operational incidents take priority over other work until the root cause can be identified. This is why ownership is critical: it makes clear who is responsible for this effort.
Recurring operational incidents should be prioritized over other work. Recurring issues are a scourge because they lead to alert fatigue and reduce confidence in systems.
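Spotting recurrence is easier when incidents are logged consistently. As a minimal sketch, assuming each logged incident carries an alert name and a timestamp (both the record shape and the `recurring` helper are hypothetical), the repeat offenders can be surfaced with a simple count:

```python
from collections import Counter


def recurring(incidents, threshold=2):
    """Return (alert_name, count) pairs seen at least `threshold` times.

    `incidents` is an iterable of (alert_name, timestamp) tuples.
    Results are ordered from most to least frequent.
    """
    counts = Counter(alert for alert, _ in incidents)
    return [(alert, n) for alert, n in counts.most_common() if n >= threshold]


# Made-up sample data, for illustration only.
log = [
    ("disk_full", "2024-01-02T03:10"),
    ("queue_backlog", "2024-01-02T09:45"),
    ("disk_full", "2024-01-03T03:12"),
    ("disk_full", "2024-01-04T03:09"),
    ("cert_expiry", "2024-01-05T12:00"),
    ("queue_backlog", "2024-01-06T10:01"),
]

for alert, count in recurring(log):
    print(f"{alert}: {count} occurrences")
```

A list like this can serve as a starting agenda for deciding which root causes to chase first.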
Moving Toward Operational Excellence
At Backstop, we have a weekly meeting where we review production incidents that have occurred and discuss upcoming maintenance items to be performed. One of the ways we can improve is in our communication. The level of detail recorded for an incident varies widely depending on who gets paged, the type of incident, what time they were paged, how many times they have already been paged that day, and so on. Sometimes incidents occur that do not get logged.
Action items from the meeting get assigned, but there is no direct follow-up to ensure items have been addressed appropriately. Often, the list of outstanding action items is not reviewed at all.
Ways to Improve
I authored an incident template that lists the details that should be recorded when an incident has occurred. I even wrote a small script to create this documentation based upon the template!
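To give a sense of what such a script might look like, here is a minimal sketch using Python's standard `string.Template`. It is not the actual template or script: the field layout, the `render_incident` function, and the `TODO` placeholder are all assumptions for illustration, built around the who, what, when, where, why, and how details mentioned earlier.

```python
from datetime import datetime, timezone
from string import Template

# Hypothetical template layout; a real template would likely have more sections.
INCIDENT_TEMPLATE = Template("""\
Incident: $title
Recorded: $recorded
Owner: $owner

Who: $who
What: $what
When: $when
Where: $where
Why: $why
How: $how
""")

DETAIL_FIELDS = ("who", "what", "when", "where", "why", "how")


def render_incident(title, owner, recorded=None, **details):
    """Fill in the template, marking any detail not yet known as TODO."""
    if recorded is None:
        recorded = datetime.now(timezone.utc).isoformat(timespec="seconds")
    values = {name: details.get(name, "TODO") for name in DETAIL_FIELDS}
    return INCIDENT_TEMPLATE.substitute(
        title=title, owner=owner, recorded=recorded, **values
    )


if __name__ == "__main__":
    print(render_incident(
        "API latency spike",
        owner="on-call engineer",
        what="p99 latency exceeded the alert threshold",
    ))
```

Leaving unknown fields as visible `TODO` markers, rather than omitting them, makes gaps in the record obvious when the incident is reviewed later.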
I have also started asking for process changes around the weekly discussion to ensure action items are not lost.
I try to be a voice pushing for root cause analysis when we do not know why an incident has occurred. I ask for details that may be missing on incidents to underscore how important it is to understand why an incident has occurred.