March 2018 – Brian Riehman

I had a small achievement the other day: I made my first contribution to an open source project. We use SpotBugs in several of our tools at work to perform static analysis and inform us of potential issues. We had a problem with the tool attempting to process resource files that were included on our CLASSPATH. When SpotBugs encountered non-class files, it printed a long stack trace. The tool did not actually fail, but the output certainly made it look that way.

After some searching, I discovered that an issue had already been logged for this problem. On digging into the code, I felt that I could actually fix the issue myself. A short time later, I had a checkout of SpotBugs installed locally and had verified that my fix squelched the verbose output. Great, but now what? We try to give back, that’s what!

I opened a pull request, worked with the maintainers to come up with the best solution, and my change was accepted shortly after. How exciting! It is the same feeling as fixing an issue for a user and making their day better. I was happy to be able to solve my own problem and give back to a tool that we use.

Encouraged by this interaction, I am trying to do the same with FitNesse. An incident logged several years ago has resurfaced and causes an error page when saving in certain circumstances. Instead of logging the issue and leaving it at that, I submitted a pull request that details the original issue and when it was reintroduced. I hope to be able to help out here as well!

I hope I have more opportunities to contribute to open source projects that we use. It feels good being part of a community!

The idea of moving on a path toward operational excellence has been on my mind lately.

When we encounter failures in production, these should be treated as significant events. A failure that triggered a page to a person on call is even more critical. The way an organization handles these incidents represents its operational response. I have been thinking about the different layers that make up this response. I see it as composed of:

Communication
Accountability
Root cause analysis
Prioritization

The organization must achieve the right balance of these elements to respond appropriately.

Communication

Documenting the details of an incident is the foundation for addressing the issue. If the appropriate detail is lacking, the operational response is out of balance. Details are required to determine root cause and to prioritize any incident.

Communication begins with the first notice that an incident has occurred. Ideally, for critical issues, that notice comes from a page or another automated alert. There should be a record that an incident occurred along with all relevant details: who, what, when, where, why, and how.
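As a rough illustration of what such a record might look like, here is a minimal Python sketch. The class name, field names, and the `missing_details` helper are all hypothetical; they simply mirror the six questions above and are not taken from any real tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Hypothetical incident record: one field per question."""
    who: Optional[str] = None
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None
    why: Optional[str] = None
    how: Optional[str] = None

    def missing_details(self) -> List[str]:
        """Return the names of the details that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


record = IncidentRecord(
    who="on-call engineer",
    what="API returned errors",
    when="2018-03-01T02:14Z",
)
print(record.missing_details())
```

A check like this makes the gaps visible at the time of the incident, when the details are still fresh.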

The fact that an incident has occurred should be visible and communicated to interested parties.

Following root cause analysis, any changes that need to be made to address an issue should be clearly documented and associated with the incident documentation.

Accountability

Every incident must have a clear owner. The owner is responsible for triage, data collection, and documenting what occurred. Ownership may transfer once data collection makes root cause discovery possible.

The organization must hold the owner accountable for navigating an incident through its lifecycle. No incident should fall through the cracks and go uninvestigated.

Root Cause Analysis

Once sufficient data has been collected and attached to an incident, the process of root cause analysis can begin. The most important outcome of this effort is demonstrable proof of the cause of an issue.

Too often I have seen incidents where a user saw that Event X was occurring, took Action Y, and the incident went away. Root cause analysis requires that we provide evidence that Event X was actually occurring and that taking Action Y directly led to the resolution of the incident. This ties back to Communication: we document these findings for future reference.

Prioritization

Operational incidents take priority over other work until the root cause can be identified. This is why ownership is critical: it must be clear who is driving this effort.

Recurring operational incidents should be prioritized over other work. Recurring issues are a scourge because they lead to alert fatigue and reduce confidence in systems.

Moving Toward Operational Excellence

Current Status

At Backstop, we have a weekly meeting where we review production incidents that have occurred and discuss upcoming maintenance items to be performed. One of the ways we can improve is in our communication. The level of detail that gets recorded for an incident varies widely depending on who gets paged, the type of incident, what time they were paged, how many times they have been paged that day, and so on. Some incidents are never logged at all.

Action items from the meeting get assigned, but there is no direct follow-up to ensure items have been addressed appropriately. Often, the list of outstanding action items is not reviewed at all.

Ways to Improve

I authored an incident template that lists the details that should be recorded when an incident has occurred. I even wrote a small script to create this documentation based upon the template!
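The template and script themselves are not shown here, so the following is only a sketch of how such a script could work, not the one I wrote. The template text, the field names, and the `render_incident` function are assumptions made for illustration; it uses only Python's standard library.

```python
from datetime import datetime, timezone
from string import Template

# Hypothetical template; the real one may list different details.
INCIDENT_TEMPLATE = Template("""\
Incident: $title
Recorded: $recorded_at
Owner: $owner

Who was affected: $who
What happened: $what
When it happened: $when
Where it happened: $where
Why it happened: $why
How it was resolved: $how
""")

FIELDS = ("title", "owner", "who", "what", "when", "where", "why", "how")


def render_incident(details, recorded_at=None):
    """Fill the template, marking any detail that was not supplied as TODO."""
    recorded_at = recorded_at or datetime.now(timezone.utc)
    values = {name: details.get(name) or "TODO" for name in FIELDS}
    values["recorded_at"] = recorded_at.strftime("%Y-%m-%d %H:%M UTC")
    return INCIDENT_TEMPLATE.substitute(values)


print(render_incident({"title": "Disk full", "owner": "on-call engineer"}))
```

Leaving unanswered questions as visible TODO markers, rather than omitting them, keeps the missing detail in front of whoever owns the incident.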

I have also started asking for process changes around the weekly discussion to ensure action items are not lost.

I try to be a voice pushing for root cause analysis when we do not know why an incident has occurred. When details are missing from an incident, I ask for them, to underscore how important it is to understand the cause.
