Tech Time

Incantations, poetry, and intellectual detritus flowing from great minds. By the makers of Harvest.

Improving Report Performance

Last year, I made a promise to share the gritty details of how we improved our reporting performance dramatically, by anywhere from 4x to 10x. I spent over a year on this project and I am very pleased with the results. I want to share some of our wins with you all in case they help anyone else.

First, you have to know a little about our data model for these gains to make sense. Our customers have users in their account who track time for tasks on projects they're assigned to.

[Diagram: each Time Entry belongs to a User, a Task, and a Project (many-to-one relationships).]

We use Percona MySQL in production. We have verified that all of our indexes are appropriate, and our queries are using the right indexes. So we knew that in order to get better performance out of these queries we needed to make some architectural changes. I spent around a month experimenting with a few different approaches on hardware similar to production using the production dataset. After trial and error, and looking at the resulting numbers, I finally had a plan.

Improvement 1: Organize time entries table by project

My first change may sound strange to Rails developers, but it made a lot of sense at the MySQL layer. I reorganized the data on disk by changing the primary key of our time entries table, a table that holds over 500 million rows. This is not a MySQL-specific concept. If you're interested, you can also search for "clustered index", but for MySQL the primary key is the clustered index.

MySQL’s docs explain it better than I can:

Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.

Making this change sped up our report queries instantly by about 50% across the board (queries that took 12 seconds now took around 6 seconds).

MySQL stores data in pages, or clumps of rows/records. By default, for typical tables with an auto-incremented id, that means related rows are scattered around the disk based on the order they were inserted. Most of our queries retrieve and report on all time entries for a given project, so when I changed the primary key from (id) to an existing index which is (project_id, spent_at, id), it meant that on disk all time entries for a given project were sitting next to each other, reducing a lot of random access.

To help illustrate the point, let's assume I have a page of rows organized by the id column. It might look something like this:

id    project_id  notes
1000  50          Reviewed Web Design
1001  203         Spoke with client
1002  50          Worked on new logo
1003  203         Wrote down meeting notes

As you can see, two entirely separate projects are intertwined based on the id order. However, once I changed the primary key to be based on the project id, we get something like this:

id    project_id  notes
1000  50          Reviewed Web Design
1002  50          Worked on new logo
2000  50          Helped web designer with CSS
5000  50          Final call with client

All time entries for that project are now grouped together within the pages on disk.

Now, I was really skeptical that this would have any performance boost, because our monster database servers use high-performance enterprise SSDs in a RAID 10. On top of that, we have an extremely large InnoDB Buffer Pool with a hit rate of around 99%. So most data is served from memory, and even when a query does have to hit the disk, it's still much faster than a traditional spinning platter. But our metrics proved me wrong and we were able to take an easy win right away.

To keep Rails happy, we still have the auto-incremented id, but I added a manual unique index on it and I added self.primary_key = :id to the top of our time entry model. But for all intents and purposes this change is transparent to Rails and the rest of our code.
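
For the curious, here is a rough sketch of what that change could look like as a Rails migration. The index name is made up, project_id and spent_at need to be NOT NULL, and this glosses over the operational work of altering a 500-million-row table, but it shows the shape of the change: give id its own unique index (MySQL requires an auto-incremented column to be part of some key), then swap the primary key.

class ReorganizeTimeEntriesByProject < ActiveRecord::Migration
  def up
    # Keep the auto-incremented id addressable on its own.
    execute "ALTER TABLE time_entries ADD UNIQUE KEY index_time_entries_on_id (id)"

    # Swap the clustered index: rows are now stored on disk grouped by
    # project, then by date, with id as a uniqueness tie-breaker.
    execute <<-SQL
      ALTER TABLE time_entries
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (project_id, spent_at, id)
    SQL
  end

  def down
    execute <<-SQL
      ALTER TABLE time_entries
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (id),
        DROP KEY index_time_entries_on_id
    SQL
  end
end

And the model tweak mentioned above is a one-liner:

class TimeEntry < ActiveRecord::Base
  # Rails still addresses rows by id, even though MySQL's clustered
  # index is now (project_id, spent_at, id).
  self.primary_key = :id
end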

This technique is not a free lunch. This will create more data pages overall for the table because MySQL will want to keep a portion of the page empty as defined by the MERGE_THRESHOLD. We are effectively trading disk space for much better read performance, and in our scenario that was perfectly acceptable. The disk space required for the time entries table ballooned after this change (and the next one I’ll talk about). If you are interested in more information about MySQL’s page mechanics, take a look at this great post on Percona’s blog.

Improvement 2: Denormalize important reporting data into time entries

The next step to increasing performance was to denormalize the important reporting data into the time entries table. This is not a new practice at all, but it did have some shocking results for our data set.

One thing to know is that Harvest also allows a lot of flexibility in the way a project can be configured. Those settings determine what time is billable and by how much, which means that some combinations of project settings store important report information in the join tables. For example, a project may bill by a person's hourly rate. So for our report queries we also have to join those tables in order to have all the key pieces of information available. As you can imagine, our reporting queries end up requiring a lot of boilerplate for our joins.

[Diagram: the data model from above, extended with User Assignment and Task Assignment join tables connecting Users and Tasks to Projects.]

If you have used our public API, these concepts will be rather straightforward.

If you look back at that diagram, you can imagine how our queries reporting on time entries would have to continue to join both the user and task assignments tables. Just looking at "what is the billable rate for this time entry?" could end up looking at either the project, user assignment, or task assignment, based on how the project is set up to be billed (billed by project, user, or task). Even asking "is this time entry billable?" will always consult the task assignments table (a task is marked billable or non-billable for a given project).

That means in order to show anything meaningful for our customers, we have to join these tables all the time. The nice thing about that design is that if a value changes in one place, all reports are instantly updated.

However, what I didn’t realize was the cost involved in joining those tables. As I said previously, our indexes are set up just fine. However even with great indexes, when you take millions of time entry rows and join them to hundreds of related task assignment rows, there is still lots of work for MySQL to process. All things considered, it is rather impressive how quickly it can process that much work. But even after I reorganized the time entries table, I wanted better performance than our 6 second worst case query.

So in addition to changing the primary key, I also added a few more columns to the time entries table, like billable and billable_rate. I also updated our Rails code to populate those fields when a time entry was created or updated, and when a project's settings changed dramatically.
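
To give a flavor of what that looks like in the model, here is a heavily simplified, hypothetical sketch (not our actual code; the association, column, and setting names beyond billable and billable_rate are invented from the description above):

class TimeEntry < ActiveRecord::Base
  belongs_to :project
  belongs_to :user
  belongs_to :task

  before_save :denormalize_reporting_fields

  private

  # Copy the reporting values we would otherwise have to join for.
  def denormalize_reporting_fields
    task_assignment = project.task_assignments.find_by(task_id: task_id)
    user_assignment = project.user_assignments.find_by(user_id: user_id)

    # A task is marked billable or non-billable for a given project.
    self.billable = task_assignment.billable?

    # The rate lives in a different place depending on how the
    # project is set up to be billed.
    self.billable_rate =
      case project.bill_by
      when "project" then project.hourly_rate
      when "user"    then user_assignment.hourly_rate
      when "task"    then task_assignment.hourly_rate
      end
  end
end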

Our real denormalization code is complex. However, the upshot is that now all of our reporting queries have immediate access to the necessary data at the time entries level. We no longer need to join any other tables to calculate reporting data, which means our reporting queries are easier to understand and maintain, and less error-prone. It also means that the bulk of the complexity is isolated to one section that the rest of the application depends on.

Shockingly, that took our worst case 6 second query down to around 1.5 seconds by just eliminating those joins.

To step back, with these two changes, our worst case 12 second query was now around 1.5 seconds. Decent sized projects that originally took around 2 seconds now ran in 0.25 seconds. Both of these changes contributed to this massive performance boost.

Just like changing the primary key, this strategy isn’t a free lunch either. Now that the reporting data is denormalized, there’s always a possibility for it to fall out of sync with the original data stored in separate tables. Scenarios like programmer errors, customer support changes, or even freakish once-in-a-blue-moon events all have the potential of messing with that balance.

To mitigate those problems, as part of the Rails changes, we're testing our denormalization code very heavily. We also run a consistency check once a day to make sure all of those denormalized fields contain correct data. These two strategies have worked remarkably well for finding problems when they occur.
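
The shape of that check is simple. As a hypothetical sketch (the task and helper names here are invented), it boils down to recomputing what each field should be and flagging rows that have drifted:

# Runs once a day.
namespace :reports do
  task verify_denormalized_fields: :environment do
    TimeEntry.find_each do |entry|
      # Imaginary helper that recomputes the values from the project,
      # user assignment, and task assignment tables.
      expected = DenormalizedFields.calculate_for(entry)

      next if entry.billable == expected[:billable] &&
              entry.billable_rate == expected[:billable_rate]

      Rails.logger.warn("[consistency] time_entry=#{entry.id} has stale denormalized data")
    end
  end
end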

This also worked out wonderfully as we transparently rolled out these changes to our customers. Even before we started using the new data in the report queries, we could store and verify the contents, allowing me to hunt down any holes in our logic and ensure the data could be trusted. These steps have also helped us with current development as we explore new project settings.

Aside from the logical errors that could happen, we also take another disk space penalty as we widen our large table. With the new primary key and the new columns, our time entries table ballooned by about 25%. But for our data set, this trade off was an easy decision.

Parting thoughts

I am extremely happy with these results and I hope that this information can help other teams as they examine their own application’s query performance. I believe we made these changes at the right time and I would not have started out our application with either of these techniques. Going forward we’ll have to see whether these approaches still apply or if there are smarter choices we can make, but for now, we can celebrate.

Hacksgiving 2015: A Post Mortem

Last week we celebrated our third Hacksgiving at Harvest. This is an opportunity for us to work in small teams on fun projects that interest us, but still benefit Harvest in some way. The only limit is our imagination. A few of us thought it would be fun to make an HTML5 game, and it was a blast! I learned a few things during Hacksgiving that I would like to share.

Keep it Simple

Keep it Simple is one of those obvious things we like to say, but often forget. Initially when I started sketching out ideas, our game was much more complicated. Some ideas that we ended up scrapping early on included: alternating tasks and team budgets, time-based budget increments, and more detailed designs/animations.

When we were figuring out the tech stack we wanted to use, the original plan was for the game to leverage the Harvest API to use an account's real project names, as well as its tasks and people. Doing so would have drastically slowed us down for a variety of reasons:

  • We would’ve had to make sure players of the game had permission to view the projects, tasks and people used in the game.
  • Harvest is a rather large and complicated code base. Working inside the app can be slow just because of its sheer size.
  • There’s an element of risk committing code to Harvest willy nilly at the speed we were developing.

In the end, we went with the fastest and easiest solution: HTML, Sass and CoffeeScript. We built the game on top of a Jekyll server for local development and deployed it to GitHub pages. We also didn’t use any JavaScript libraries, which helped keep our code lean.

Keeping things simple not only made development much faster and easier, but also made the game more enjoyable to play. And this is really the heart of Hacksgiving: build something as fast as you can!

Ideas are Cheap! Lengthy Discussions are Expensive!

While building our game, we didn’t spend a lot of time discussing whether a feature was good or bad or how to implement it. It usually took us only 10 minutes of development to figure out if an idea was worth keeping or scrapping. We had only two full work days to make this game, so we couldn’t spend a ton of time discussing things. Having this mindset forced us to constantly churn out work.

When we work on a normal project, we tend to start with a discussion about it. I think discussion is good, but it can be easy to fall down a rabbit hole of endless discussion. Lengthy discussion in itself isn't a bad thing; however, in some cases, I find that it can slow down a project's momentum.

One thing I like to do to help foster discussions is build quick and dirty prototypes. This gives the team something tangible to play around with so we can get a feel for how the feature will work. More often than not, our prototypes will answer our questions, and help us build better features.

Discussion is a good thing, and so is having a plan. But, when you have that feeling that a discussion has been going on for too long, try getting into some code, and test theories by attempting to execute.

Hanging Out in a Hangout

When you work on a remote team sometimes you miss out on in-person collaboration. It’s a lot easier to ask across the table, “Hey Brad, what do you think of this?” than to ask the question in Slack. You get a real time response complete with emotion and visual cues instead of staring at the screen waiting for Brad to formulate the perfect response only to wait five minutes for a simple, “I don’t like it…” Asking Brad in real time saved me at least five minutes—time I can use to work on something else!

While working on this game, we created a Google Hangout where we could come and go as we pleased, though we were all in that Hangout for probably 75% of the time. It made it so much easier to communicate our thoughts about the game. If I had a JavaScript question, I could ask a developer and get an answer in under 15 seconds. We talked about what we were going to eat for lunch or our plans for the evening. Sometimes 30 minutes of silence would pass. Heck, I didn’t even have the Hangout tab visible most of the time.

Having a personal connection with the people on the team made it so much more enjoyable to work, and made us more productive.

Wrap Up

I always look forward to Hacksgiving. It’s always a great opportunity to try new things and explore ideas—both techwise and socially. I always come away feeling recharged. Now on to the next feature!

Happily Fixing Bugs with a Bugmash

In 2015 we tried a new technique for fixing bugs. Instead of fixing issues as they came in, we triaged them. When enough had collected, our entire team took a break from their normal projects and focused on fixing bugs - a bugmash. Using bugmashes for bug fixing has taken a depressing process of constantly fixing broken software and turned it into a joyful sprint to the finish line. Here's how we go about it.

Look at bugs weekly

Every week a few of us gather in a Google Hangout to talk about the most recently updated bugs in our GitHub issue tracker.

For each new bug reported we take the time to consider the impact of the bug. Does it affect many people? Is it in an area of the application that is soon-to-be redesigned? Is it likely to be difficult to fix?

Our discussion about these questions will lead us to mark the bugs with various labels: Priority: Low, Priority: Medium, Priority: High, Low-hanging Fruit, and so on. If we run into high priority bugs, like security concerns or bugs impacting a large percentage of customers, we will look for someone to solve the problem right away.

Bugs tagged Priority: Medium or Low-hanging Fruit are ideal candidates for a bugmash, and we give them one more label: ¡A GO!

Watch bug counts

Over the weeks and months, the collection of bugs labelled ¡A GO! grows bit by bit. We keep our eye on this number as it helps to determine when we call our next bugmash.

So far we have found when the bug count gets above twenty it’s getting to be time to pay down our debt. Since calling a bugmash involves interrupting the entire team, we look at projects and events going on at Harvest to help us to schedule the bugmash with as little impact as possible. For instance, we’ve found scheduling a bugmash directly after one of our biannual Summits in NYC to be a great fit. When the team returns home from a joyful week of socializing, it feels pretty good to clean house.

Bugmash until it’s done

We do not have bugmash weeks at Harvest, but rather we have bugmashes. Since our weekly bug processing work has led to a pile of bugs that are of excellent value to complete, it makes the most sense to focus on fixing them all rather than constraining ourselves to a one-week window. However, we definitely reserve the right to pull a bug out of the bugmash if we discover it’s not worth the time required to fix it. At Harvest we’re fortunate that the entire organization supports the idea of bugmashing until it is done.

Make it fun

Since a bugmash involves the whole team taking a break from their normal projects, it's a great time to work with someone new. It's also good to take time to cheer on teammates as they mash bugs. It feels great to celebrate polishing up your product. (This is especially easy when you have worked at Harvest as long as I have and you see someone smoosh a four-year-old problem, in one form or another, once and for all!) Shout about progress at the end of each day — it's part of what makes it fun!

HOWTO

Here’s our HOWTO:

  • Grab a bug: Find a bug labeled ¡A GO! and assign it to yo’self.
  • Chat: Devs room in Slack.
  • Track time: Bugmash – August 2015.
  • Celebrate: Announce mashed bugs on Slack with hashtag #bugmash: “With the help of Doug I mashed the bug where PayPal payments weren’t updating invoices. #bugmash”

Easy!

Clarifying CSS with Linking Verbs

While cleaning up code throughout Harvest, I've been addressing how we indicate that an element's state has changed. Some classic examples you might be familiar with include when an element is hidden, visible, or active. I'll use the hidden example to show you how I prefer to indicate an element's state.

In Harvest, we have this reusable utility:

.is-hidden {
  display: none !important;
}

When we apply this class to an element, that element is hidden. There are other class names I could have used, such as .hide or .hidden, but there are some general guidelines I like to follow when naming a class to indicate state change.

State Modifiers Should Start with a Linking Verb

Linking verbs connect the subject of the verb to additional information about the subject. In short, they make text easier to read and provide greater clarity. When you’re talking to someone, do you say, “That button hidden?” No! You say, “That button is hidden.” The most common helping verbs I use when indicating state are is and has. The state of the <button> in this example is easier to read when looking at the class names:

<button class="button is-hidden"></button>

State Modifiers Shouldn’t Be Unnecessarily Verbose

There isn't really a problem with a class name like .button-is-hidden, but it is unnecessarily wordy.

<button class="button button-is-hidden"></button>

A reusable class like .is-hidden that can be applied to buttons as well as to other elements results in less code and is easier to maintain. Sometimes exceptions need to be made, though! Let's switch to a different example for this next scenario. If your .button class needs some additional styles, we can modify the state class when it's chained with the .button class, like so:

.has-loading-animation {
  color: red;
}

.button.has-loading-animation {
  background: blue;
}

This allows us to have one reusable state modifier, while also having a variation of it to use with buttons. Plus, this keeps the code easy to maintain.

The Real Shebang

Our expense entry form at Harvest uses a custom component for attaching receipt images. This is a very watered-down example, but it lets you see how we use the state modifier to toggle the different states.

When there’s no receipt, we show the .attach-receipt element and its children, and hide the .remove-receipt element by adding the state modifier class .is-hidden. When the receipt is attached, we remove the class .is-hidden from the .remove-receipt element, and add .is-hidden to the .attach-receipt element.
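
A bare-bones sketch of the markup in its initial, no-receipt state might look like this (the structure and labels are simplified; only the class names come from the real component):

<!-- No receipt yet: .attach-receipt is visible, .remove-receipt is hidden -->
<div class="attach-receipt">
  <button class="button">Attach a receipt</button>
</div>
<div class="remove-receipt is-hidden">
  <button class="button">Remove receipt</button>
</div>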

That’s it! That’s how we do state modifiers in CSS at Harvest. It’s simple, easy to read, and easy to understand.

How We Break Production

At Harvest, we like shipping new features, and we take great care to introduce changes smoothly. We test our code, QA new features and sizable fixes, and make sure to review our code line by line as part of our daily code reviews.

Harvest is eight years old. It has grown to serve many customers through many features, accumulating substantial legacy code along the way that depends on many moving parts. It’s also maintained by actual people, which means that no matter how hard we try not to break things, due to the nature of the product and our humanness, we’re bound to break something from time to time.

However, we don't just resign ourselves to this being a fact of life. We are fully aware that any hiccups in our software or our infrastructure affect the teams and businesses that use Harvest daily. That's why after the storm has passed and our blood pressure has dropped, we take some time to reflect. We think about what went wrong, what we can learn from it, and how to do better in the future. Perhaps most importantly, we share it with the rest of the team.

Over time, we have developed the custom of writing a post on our internal blog with these reflections, in a section that we call “How I Broke Production”. These posts usually share the same structure:

  • A narrative of what happened, how we reacted, and how we fixed it. When a problem arises, we usually collaborate through HipChat to notify our customer support experts so they can handle any related support tickets, gather a team to investigate the cause of the problem, and discuss how we can fix it. This recorded, timestamped, textual history makes it easy to build a timeline of events and review how we reacted as a group.

  • The root cause. While extinguishing a fire, our first goal is to bring our application back to normal and minimize the impact to our customers. Once that’s done, we dedicate some quiet time to dig up the causes: Was it a bug the tests didn’t catch? Some interaction with third-party software or APIs we didn’t think of? A system malfunction? We are usually able to track an exception, a log entry, an alert, or some piece of hard evidence that helps us make sense of what happened. We usually apply the 5 Whys technique.

  • The impact. The first of our core values is listening to our customers and this is especially true when something has gone wrong. Our customer support experts strive to give quick and honest responses to the customers affected by the problem, and follow up with them after the issue has been solved. We keep track of those tickets in Zendesk, and cross-reference them with any issues related to the incident or its investigation. It reminds us that we are here because we have customers — when we break production, we affect the workflow of our customers and the schedule of other team members when they help us put out the fire.

  • How we can do better in the future. The downside of being human is making mistakes, but the upside is having an unlimited ability to learn and adapt. In these posts, we identify how we can improve our process — individually and collectively. Maybe it's by increasing the test coverage of some part of the product, changing our process, or thinking more deeply about the potential risks of a change in the future.

  • Encouragement and group learning. The goal of these posts isn’t public shaming — they are actually a great way to acknowledge our guilt and let it go. Comments from other team members are always encouraging: you can feel how the people involved grew as software developers by thinking about and fixing the incident. And the rest of the team gets to learn about tricky parts of our product, unexpected situations, and techniques that we all can use in our daily work or when we have to face a similar situation.

This process helps us grow as a team and lets us move on to building useful features for our customers. It doesn’t guarantee we won’t make mistakes, but it removes the drama and encourages learning from these experiences so we hopefully won’t repeat them. As an accomplished author of my own “How I Broke Production” post, I can attest that when we make mistakes, they are at least new mistakes!

Thanks to Kerri Miller for inspiring this process!

Upgrade Rails Without The Risk

As you might have recently read, we’re not very fond of taking risks at Harvest. But what about a major upgrade, like upgrading Rails?

The first version of Harvest was deployed on Rails 0.14 — so we’ve gone through a few Rails upgrades over the years. Our most recent upgrade was to Rails 4.0, and although it was a big change for us, we were able to break it down and deploy many parts of the update before the actual Rails version update went out.

Our actual deploy that upgraded Rails to 4.0 was very small — a few gem version updates and really minor code changes. Here’s the story of how we did it.

Break Everything

So, how did we start? Well, some things are obvious just by reading the official Rails Upgrade Guide. Other things you find by virtue of trying to boot your application and running the tests. Upgrading a Rails app is a lot like climbing a mountain: you keep moving upwards, fixing one thing after another and clearing obstacles until you reach the top.

Release Small Fixes

Many of the changes required were compatible with the previous version of Rails as well — so as we fixed issues on our Rails 4.0 branch, we were able to merge the changes back into our master branch and release them, keeping each release small and easy to understand.

  • Strong Parameters. This is one of the bigger changes in Rails 4. Luckily the Rails team released strong_parameters, a gem that let us add this feature while still on Rails 3. Even better, when set up properly, it let us convert slowly, model by model, so the granularity of deploys could be really small (see the sketch just after this list).
  • match routes requiring the verb. We had tons of routes that needed fixing, but again, this is something we could do beforehand, and in our particular case it was done in 16 different pull requests, all deployed safely and independently.
  • Gem updates. One of the first things you have to do when updating a Rails version is update a bunch of gems. Most of the newer gem versions (except the Rails ones themselves) worked just as well with Rails 3.
  • Autoloading. We had a bunch of errors running our test suite — most of the changes required us to reference classes or modules by their full name (External::Export instead of Export while inside the External module). These are changes that can be merged and deployed at any time.
  • Undigested assets. Rails 4 stopped generating undigested assets, so if you depended on them you had to do something about it. The recommended solutions were to make sure you referenced digested versions by using one of the various Rails helpers, or to just move those assets to /public. Again, a small, simple change that we released before the Rails upgrade.
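
To make the Strong Parameters item above concrete, here is roughly what the model-by-model opt-in looks like with the strong_parameters gem on Rails 3 (the model, controller, and attribute names are only for illustration):

# Gemfile
gem "strong_parameters"

# Opt a single model in while the rest of the app keeps its old
# mass-assignment protection:
class TimeEntry < ActiveRecord::Base
  include ActiveModel::ForbiddenAttributesProtection
end

# The controller for that model then whitelists attributes explicitly:
class TimeEntriesController < ApplicationController
  def create
    @time_entry = TimeEntry.new(time_entry_params)
    # ...
  end

  private

  def time_entry_params
    params.require(:time_entry).permit(:notes, :hours, :spent_at)
  end
end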

Besides that, there were 6 other pull requests with a variety of tiny tweaks in the way we did things that could be reimplemented in a way that worked both with Rails 3 and Rails 4. Every week, after getting a few more tests green on the rails-4 branch, I’d list the commits and see if there was something that could go back to master. You’d be surprised how much stuff can be backported.

Don’t drop the discipline of single meaningful commits. That will help you clearly see which commits can be backported and deployed to production right away. Imagine you usually work off master and you’re working on a rails-4 branch. You can very easily create a backport branch with:

git checkout master
git checkout -b rails-4-backmerge

# Repeat for every commit you think you can backport
git cherry-pick <sha> 
rake # run tests

Once you’re done and your backport is merged into master, make sure to merge master into rails-4 (or if you’re feeling adventurous, rebase rails-4 off of master).

Ignore Deprecation Warnings

We completely ignored deprecation warnings from Rails 4. They’re warnings for a reason — they don’t need to be dealt with immediately. Remember, our goal is to go live with as few changes as possible.

We dealt with all our deprecation warnings after the initial launch — over 18 pull requests in a few weeks, touching more than 1,700 lines of code.

Rehearse a Rollback Plan

With an update of a core gem like Rails, we couldn’t simply rely on our normal cap production deploy:rollback. While we were preparing for the release, we developed a rollback plan, and rehearsed the release and rollback on one of our test machines.

This turned out to be great practice, because the first time we attempted the Rails 4.0 release we discovered something wrong through our checkup plan and quickly rolled back the release without any issues.

Balance Risk Versus Effort

Some people have gone the extra mile and made their apps dual bootable. GitHub has a nice story about it and Discourse had it for a while. That would have given us great flexibility and let us slowly release this upgrade one server at a time.

We considered dual booting, but decided it wasn't worth the effort for our application. The extra complexity of actually implementing it, plus the whole new set of problems that arise when two different versions coexist at the same time, led us to go with backports instead.

After merging most of our changes back into master we realized that what we needed to deploy was actually quite minimal. So we all agreed we’d try a deploy with a well-thought checkup plan and a rehearsed revert strategy instead.

Release With Confidence

Although upgrading to Rails 4.0 was a major change for us, we were able to break down the changes and release them slowly over the course of a few weeks. We merged as many changes back into our master branch (Rails 3) as possible, keeping the changes required for the Rails 4.0 release as small and simple as possible.

We developed a rollback and checkup plan, rehearsed them, and used them successfully — and followed up our release with a series of changes which removed our reliance on deprecated methods.

After following these techniques we were able to painlessly upgrade Harvest, a pretty large application, without scheduling downtime or causing any disruptive customer outage. As a developer, that feels fantastic.

Deploy Safely

We’re not very fond of taking unnecessary risks at Harvest — and the easiest way to avoid big risks is to make small, incremental changes. Like a cook tasting a dish, it allows us to make small adjustments as needed to make Harvest the best it can be. Mmmm, salty passwords.

You’re hopefully unaware, but we update Harvest a lot (for example, 13 times yesterday) — and most of the time it’s easy and natural to make small, simple changes. But how do you make small, incremental changes with upgrades or new features?

Minimize Changes

“Is this the minimum set of changes that we need to go live?”

Sometimes when you find yourself in a section of code that hasn’t been touched in a while it’s tempting to make a whole bunch of changes by accident. “Oh, this is using the old hash syntax, I’ll just make a quick update to take out the hash rockets” can quickly become “Wait, why is this even a class anyway?” — and before you know it, you’ve made a set of changes which has nothing to do with what you set out to do.

“Is this the minimum set of changes that we need to go live?”

Thanks to our amazing developer operations team, there's virtually no cost to making updates to Harvest — so why lump your changes together into something whose behavior is impossible to predict when released?

We try to make our pull requests as boring as possible. They should be simple, plain, and easy to digest. No need to include those hash update changes — you can just as easily follow up afterwards with another pull request.

“Is this the minimum set of changes that we need to go live?”

It Was The Best Of Times, It Was The Worst Of Times

No matter the change, always have a plan for after the pull request is released — a plan for success and a plan for failure.

“We can just cap production deploy:rollback, right?” Well, not always. Some updates can have rippling consequences — sessions expiring and logging everyone out, secure cookies becoming incompatible and unreadable, serialized data suddenly becoming unserializable — and not all of these consequences are immediately apparent.

Deploys should be boring. Rehearsing a revert plan can mean the difference between high-stress downtime and “Whoops, let’s try again tomorrow.” Rehearse what it means to fail, at what point you’ll decide to rollback a release, and what needs to be done (and who will be doing it). We all make mistakes — and we should expect them to happen often, and be prepared for them.

Just as important as a rollback plan is a checkup plan — the plan you follow when an update is successful. You can consider it the definition of success — all the places you’ll need to check once the update is made to make sure things are working the way you expect.

Releasing Large Features

Large features (like our new Projects section) can easily balloon into huge, risky releases — but we’ve adopted using feature flags to break down big features into small, incremental changes.

If you’ve been using Harvest since November of 2013, you’ve been using a copy of Harvest which supports the new Projects section — it’s just been hidden from you (or not, if your account has been in our early access group (thank you!)). Each change we’ve made has been released into our production code, one small change at a time.

Feature flagged releases can be some of the safest releases possible — since we can release a feature to a small subset of our customers and make sure it’s behaving the way we expect (by following the checkup plan).
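
The mechanics behind a flag don't need to be fancy. As a purely illustrative sketch (this is not Harvest's actual implementation; the helper and attribute names are invented), a flag can be little more than a per-account check:

# The new code path only runs for accounts in the early access group.
module Features
  def self.enabled?(feature, account)
    account.enabled_features.include?(feature.to_s)
  end
end

# In a controller action:
if Features.enabled?(:new_projects, current_account)
  render "projects/index_redesigned"
else
  render "projects/index"
end

In a setup like this, widening the rollout is a data change rather than a deploy.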

Real Talk

At Harvest, we like releases to be boring, simple and straightforward, prepared and predictable. Small changes don’t keep you late at work or make you work on the weekends or on holidays. We’ve found that keeping our releases small means constantly asking ourselves and each other “Is this the minimum set of changes that we need to go live?” It means preparing and planning for failure and success. It means using tools like feature flags to break down big, risky releases.

It means we can eat dinner with our families.

Why I enjoy writing software at Harvest

I occasionally help out the team by interviewing new candidates. The most common question I have been asked — “What do you enjoy most about working at Harvest?” — is a great question for an interviewee to ask. There are several stock answers that I think most people reach for. You can talk about how much you like the team, the culture, the flexibility, or the customers you help. I appreciate all of these at Harvest, but there are some specific anecdotes that I think really help paint a picture of what it is like to work at Harvest.

The Codebase

I enjoy working with our codebase. That sounds shallow, since the codebase is just a means to an end: providing a great experience for our customers. However, working daily with Harvest's codebase is the full-time job for me and the team I work with. If it were painful, that'd be a negative daily experience to weigh against the other benefits. I also firmly believe that our team's ability to help solve our customers' needs is directly influenced by the ease with which we can build features and fix problems.

You should know that Harvest launched in 2005 before Rails had even hit version 1.0 and we have our share of legacy code that has persisted over the years. I joined Harvest over 2 years ago and my experience pretty much amounted to enterprises, startups, and consulting. For the longest time my favorite projects were greenfield, where the teams could work on the problem at hand with very little regard to legacy code. So why do I enjoy working on a large old app like Harvest? Because it is amazing and inspiring to see it transform and improve over the years.

It is one thing to launch a project for a client, fix a few bugs, and then move on or throw a ball of hacks together in order to get profitable as soon as possible. It is another world altogether to deploy a feature set, and then continue to improve and polish it as time goes on. We do have our spurts where we will rewrite old code from scratch, and we might even grumble at the previous developer’s short-sighted design choices while we do it (which is funnier when you realize you were the previous developer). But our team recognizes that old code helped get Harvest to where it is today.

There is a respect, a beauty, and an admiration that comes from working long term on a legacy codebase alongside this team.

The Team

The team here at Harvest is critical for our codebase to improve. Working on a legacy codebase with an understaffed crew is definitely unfun. Have you ever accidentally deployed a bug and had to answer a flood of support tickets while you’re trying to figure out a fix and deploy it? Have you ever critically botched the infrastructure and suddenly had to scramble to fix the servers?

I am grateful for the teams we have here at Harvest. I'm a big fan of our Harvest Experts, who genuinely care about solving our customers' problems and diagnosing bug symptoms to pass on to developers. They honestly mean it when they tell our customers to write or call in when they need help.

I am relieved we have a brilliant and hardworking Operations team that is focused on our infrastructure 24x7, dealing with daily problems ranging from security reports and patches to capacity planning.

I am thankful for our Product Design team who thinks a lot about the customer’s interface experience. They also work closely with the Development team and adapt if technical limitations are uncovered.

We have several other roles too, like our Mobile Developers, Quality Assurance, Marketing, External Integrations, and Office Managers that all fit in a unique way to improving our company and product. Because of these roles, I can focus on what I do best.

Roles and Communication

Now you might get the impression that each of these roles is rigidly defined and adds to office bureaucracy; Developers need to submit "Database Change Requests" to Ops and everyone is busy filling out TPS reports. Not at all.

Each one of these teams fluidly communicates with other teams in order to solve the problems at hand. Developers are asked to take over certain support tickets. Ops members are invited to a technical design discussion. Experts are asked about common customer problems while we redesign an existing feature. Developers let Experts know beforehand when we’re going to launch features so they can field support issues with current information.

There isn’t any central authority that dictates these processes. Each team puts together their process and if things go poorly, they talk about it and improve it. Sometimes that requires talking and changing a process between teams. We do have a few people who float between teams that can view issues in the aggregate and suggest improvements.

Empowerment

A legacy codebase would never improve if the team working on it didn’t have the authority to make big changes. The freedom to improve areas is a wonderful advantage. But with that advantage comes responsibility: we still have to prioritize features and get a general consensus among the immediate team.

I’ve worked in some places where the unspoken rule was “It ain’t broken enough, so don’t fix it” and that is not the case at Harvest. Most of us have our own personal short list of areas we want to make better and have the flexibility to squeeze wins here and there within a normal work week.

The team submits and vets bigger feature changes and rewrites, not a Product Manager or a Project Manager, and those projects are prioritized alongside everything else.

A Direct Influence

The direct influence everyone on our team has on our customers means a lot to me. I like that my work positively affects contractors, designers, business owners, and many others. Our customers' needs are very real and have serious impacts on their livelihoods.

Some of our customers are independent workers with various distractions pulling at them. It's very satisfying to know that my work can improve their day-to-day life. Our team has a responsibility to keep a stable infrastructure with as few bugs as possible, because our screw-ups can cause a big disruption for our customers.

A Glimpse

A recent meeting inspired me to write this post. My team is working with another team on an integration that has some particularly interesting (and fun!) performance implications.

We met a few times to try and hammer out how one piece of code should talk to another. These meetings have included Design and Ops to help weigh in with different perspectives. Our last meeting between the teams had five developers focusing on a specific technical issue.

  • We talked about a few straightforward solutions and we agreed on one along with some metrics.
  • As we learn more, we agreed we might need to bring design back into the discussion in order to figure out some user interface compromises because of performance.
  • We agreed we need to take a serious look at the root cause of the performance issue and prioritize that when we can.
  • We ended the meeting with "Let's all expect to rewrite parts of this as we roll it out to 100%. We can consider ourselves lucky if it makes it to production as-is." That got a few chuckles, but everyone nodded in agreement.

This is a great glimpse of what it is like to work here. We know we may have to change as we go. The team is not afraid of being wrong and has the courage to move forward with the information we have. Multiple areas can weigh in with input. The teams are not shutting down the conversation by pointing fingers or huffing “that’s not my problem.”

A Coworker’s Comment

I asked my coworkers to review a draft before I published this. Andrew said parts of this post resonated with him. I liked his comment so much that I asked his permission to end with it.

I love watching This Old House — it’s one of my weekend guilty pleasures. I keep coming back to the show because of the amount of work they put into a project house — it’s incredible. The current project has a closed-in back yard, so when the project called for replacing three stone walls, they had to disassemble the wall and carry it out one wheelbarrow at a time, through the house, to a dumpster on the street.

They care for the house. But it’s not just care — it’s a kind of love. My mother used to tell me, “no matter what you do, you’ll be our son, and we love you for that.” I like to think of the craftspeople on This Old House that way — “No matter what kind of mold and crumbling masonry we find, you’ll always be our project house, and we’ll make it right.”

How We Fix Bugs at Harvest

As Harvest's team has grown, we've had to evolve our bug-fixing process. Years ago we developed a concept called Delta Force as a way to protect the bulk of the development team from a constant need to respond to bugs. On a rotating basis, one person would handle escalated support and fix bugs as time allowed. This used to keep us collectively sane.

In 2014 we discovered this no longer worked well for us. When a developer would take their turn on Delta Force, they would feel the weight of numerous unfixed bugs all day and well into the night. They would wake up with a hopeless feeling. When you fix three bugs in a week while four new bug reports come in, you’re bound to feel disappointed in yourself. The additional strain of a growing customer base, and by extension more need for escalated customer support, made for an untenable situation.

The development team got together and brainstormed ways to make the process of fixing bugs more bearable. Not only did we want to improve the life of whoever was in that Delta Force role, but we also wanted to lower our bug count from week to week.

We found that most of the stress of the Delta Force role was coming from bugfixing, not from escalated support. We decided to separate those roles – to put bugs in the hands of the entire development team. Rather than defining goals like we had in the past (Bug count less than ten! Fix five bugs per week!), we simply asked people to claim a bug at the beginning of the week and do their best to fix it. Some bugs are big and take longer. Some bugs are quick and feel a little cheap. It’s all good!

Through our brainstorming process we also learned that there was wide support across the team for a bugmash week — a week where the whole team would pause their full-time projects and focus on bugfixing. We kept that in our back pocket during the summer. As Harvest continued to grow and bugs continued to accumulate, we decided to give a bugmash week a try.

For a week in late September we each claimed a small pile of bugs and set to work. We touched everything from customer-reported bugs to monster queries to support tools to staging environments to stale, abandoned code. We spent 326 hours closing over fifty-five bugs in the Harvest suite of applications. We couldn’t be happier with the results.

Keep this lesson in mind as your organization grows. Processes that you trusted and felt confident about for years can become obsolete. The trick is to change how you do things without changing the good intentions behind the original structure. And don't forget that you'll probably have to change again in the future.

Of course, it never hurts to have an amazing team that can work together with skill, humility, and supportiveness. :)

Code Reviews at Harvest

Let’s face it — code reviews can be tough. Even if your team fully adheres to a certain style guide, programming is so subjective that smart people can argue great points on conflicting approaches.

We use GitHub Pull Requests heavily at Harvest and we require code reviews for everything meaningful that goes into production. Here are a few guidelines we've developed for reviewing each other's changes so we can stay productive and focused on what's important — launching code.

Our current process

Every company has a different deployment process, but here at Harvest it looks a little something like this:

  1. We discuss what needs to be built, changed, or fixed.
  2. A developer/designer creates a branch off of master.
  3. They work on their branch. When it’s complete they push it up to GitHub and create a Pull Request tagging an appropriate person or team to review it.
  4. One or more tagged people from that team will review the code and give a +1 when they feel it’s ready to be deployed.
  5. The original submitter will then merge it into master and deploy it.

Multiple posts could be written about each of these steps. However, I'm only going to talk about #3 and #4.

We use code reviews at Harvest to help communicate with the team what’s going into production, to help each other learn new tricks and techniques as things evolve, and to point out specific areas to investigate that we may unintentionally break.

If it fixes a serious bug, just let it go

We might disagree on approach, but if there’s a serious issue in Harvest that’s affecting a customer, and someone on the team has a fix for it, we will always let it ship. We can always circle back and fix it better or more thoroughly later.

Know what’s blocking your code

Code reviews can feel unproductive because they don’t have a clear goal. We’ve put together a survey we call FIAS (the Filler Impact Assessment Survey), a tongue-in-cheek acronym named after Patrick Filler, the Harvester who proposed the idea.

The idea is simple: take an educated guess at answering three questions, each on a scale of one to ten:

  • How much of the app is affected?
  • How much of this change is mysterious to you?
  • How easy is it for you to imagine this performing in an unexpected way after deployment?

Then add up the scores for each question.

  • If the score is less than fifteen, you only need one person to give you a +1.
  • If the score is fifteen to twenty, you need two people to give you a +1.
  • If the score is over twenty, you need two people and full QA.
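
Purely to illustrate the arithmetic (the survey itself is a gut check, and these thresholds are just the ones listed above):

# Sum the three 1-10 answers and map the total to the review
# requirements described above.
def fias_requirements(affected, mysterious, risky)
  case affected + mysterious + risky
  when 0...15 then "one +1"
  when 15..20 then "two +1s"
  else             "two +1s and full QA"
  end
end

fias_requirements(4, 3, 5) # => "one +1"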

QA is different for each team. Harvest has staging environments that mirror production and we have scripts for each section that we manually test.

This isn't perfect and the results may not make sense for other teams. For example, another team may want three people to +1 or always QA above ten points. However, the point is that when a Pull Request begins review, we have a clear idea of what it will take to launch it to production.

We also don’t always follow the FIAS. For example, even if we only need one +1, the submitter may still override the FIAS and ask for two to three +1s because they know their changes involve a particularly tough part of the codebase.

Now that we have a set goal, we can work backwards. If some of the changes make a reviewer uncomfortable, can that part be segmented out and the rest launched instead? Can this branch be merged even if there's a heated discussion over using the Single Responsibility Principle vs a single clear object?

With a simple process, we’ve removed the ambiguity that most code reviews start with.

Communicate clearly to reviewers what’s going on

FIAS is a great tool to get a general sense of how large (and risky) a change is to you — but it isn’t always the best tool for communicating to your reviewers what the change actually entails. Sometimes, GIFs or annotated before and after images (made with tools like Monosnap) can be more effective. Pre-commenting certain areas in your own Pull Request is also helpful: “I know this line seems unrelated, but it is because…”

Clearly communicating the change starts the code review on the right foot. The submitter may have invested days working on a change, but the people reviewing it have not.

Clarify blocking comments

Not every comment or question has to be resolved. Text doesn’t always convey emotion, and it’s easy to misread someone’s intent. It’s perfectly fine to ask a reviewer, “Is this a blocking issue?” Often it isn’t, or it can be handled separately.

Pull Requests can also end up being a lightning rod for debate. Discussions among the team can continue on a Pull Request, but they can also be moved elsewhere — to a separate issue, internal tool, or even a meeting.

Our reviewers will normally prefix comments with “[NB]” for non-blocking comments: “[NB] This works, but here’s a quick snippet that’s a little more clear”. A simple prefix like that can help speed along code reviews.

Face-to-face meetings

We often raise a white flag and ask for an impromptu face-to-face meeting or a quick chat, usually by spinning up a Google Hangout or a conversation in HipChat. This seems like an obvious tip; however, recognizing the need for these meetings can be tricky. If two people have posted back and forth at least once on the same topic, it will likely be easier to just hash it out face-to-face. If you find yourself writing two paragraphs, some of your concerns will likely get lost, and you may be able to convey your thoughts better over audio.

It’s nice to see a comment history of how decisions were made. However, you can still accomplish this by posting a quick recap of what was decided in the meeting.

Don’t limit the number of reviewers

Everyone on the team pitches in with code reviews and we don’t have official reviewers here at Harvest. We may purposefully ping someone who we know has had a lot of experience in a particular area, but we never wait on one person to go through all the reviews.

This becomes a bigger deal as the team grows. With an application as large as Harvest, it’s extremely difficult to keep everything in your head. And even if you do feel good about a certain area, it will likely change over time as other people help out with maintenance.

We also notify people to give them a heads-up without expecting them to review the pull request. We do this by prefixing "/cc" before their name: "/cc @zmoazeni this might affect your work on reports".

Not a perfect process

We don’t have any delusions that our process is perfect. However all of these points help speed along our code reviews. Taking some of these tips and morphing it to fit your organization may help out your team too — or, if you think your process could help out our team, write up your process and send us a link! We’re always looking to improve.