How We Break Production
At Harvest, we like shipping new features, and we take great care to introduce changes smoothly. We test our code, QA new features and sizable fixes, and go through every change line by line in our daily code reviews.
Harvest is eight years old. It has grown to serve many customers through many features, accumulating substantial legacy code along the way that depends on many moving parts. It’s also maintained by actual people, which means that no matter how hard we try not to break things, due to the nature of the product and our humanness, we’re bound to break something from time to time.
However, we don’t just resign ourselves to this being a fact of life. We are fully aware that any hiccups in our software or our infrastructure affect the teams and businesses that use Harvest daily. That’s why after the storm has passed and our blood pressure has dropped, we take some time to reflect. We think about what went wrong, what we can learn from it, and how to do better in the future. Perhaps most importantly, we share it with the rest of the team.
Over time, we have developed the custom of writing a post on our internal blog with these reflections, in a section that we call “How I Broke Production”. These posts usually share the same structure:
- A narrative of what happened, how we reacted, and how we fixed it. When a problem arises, we usually collaborate through HipChat to notify our customer support experts so they can handle any related support tickets, gather a team to investigate the cause of the problem, and discuss how we can fix it. This recorded, timestamped, textual history makes it easy to build a timeline of events and review how we reacted as a group.
- The root cause. While extinguishing a fire, our first goal is to bring our application back to normal and minimize the impact on our customers. Once that’s done, we dedicate some quiet time to digging into the causes: Was it a bug the tests didn’t catch? Some interaction with third-party software or APIs we didn’t think of? A system malfunction? We are usually able to track down an exception, a log entry, an alert, or some other piece of hard evidence that helps us make sense of what happened. We usually apply the 5 Whys technique.
- The impact. The first of our core values is listening to our customers, and this is especially true when something has gone wrong. Our customer support experts strive to give quick and honest responses to the customers affected by the problem, and follow up with them after the issue has been solved. We keep track of those tickets in Zendesk, and cross-reference them with any issues related to the incident or its investigation. This reminds us that we are here because we have customers: when we break production, we affect the workflow of our customers and the schedules of the team members who help us put out the fire.
- How we can do better in the future. The downside of being human is making mistakes, but the upside is having an unlimited ability to learn and adapt. In these posts, we identify how we can improve our process, individually and collectively. Maybe it’s by increasing the test coverage of some part of the product, changing our process, or thinking more deeply about the potential risks of a change in the future.
- Encouragement and group learning. The goal of these posts isn’t public shaming; they are actually a great way to acknowledge our guilt and let it go. Comments from other team members are always encouraging: you can feel how the people involved grew as software developers by thinking about and fixing the incident. And the rest of the team gets to learn about tricky parts of our product, unexpected situations, and techniques that we can all use in our daily work or when we face a similar situation.
This process helps us grow as a team and lets us move on to building useful features for our customers. It doesn’t guarantee we won’t make mistakes, but it removes the drama and encourages learning from these experiences so we hopefully won’t repeat them. As an accomplished author of my own “How I Broke Production” post, I can attest that when we make mistakes, they are at least new mistakes!
Thanks to Kerri Miller for inspiring this process!