Tech Time

A web log primarily about how computers work. By the makers of Harvest.

How We Break Production

At Harvest, we like shipping new features, and we take great care to introduce changes smoothly. We test our code, QA new features and sizable fixes, and make sure to review our code line by line as part of our daily code reviews.

Harvest is eight years old. It has grown to serve many customers through many features, accumulating substantial legacy code along the way that depends on many moving parts. It's also maintained by actual people, which means that no matter how hard we try not to break things, due to the nature of the product and our humanness, we're bound to break something from time to time.

However, we don’t just resign ourselves to this being a fact of life. We are fully aware that any hiccups in our software or our infrastructure affect the teams and businesses that use Harvest daily. That's why after the storm has passed and our blood pressure has dropped, we take some time to reflect. We think about what went wrong, what we can learn from it, and how to do better in the future. Perhaps most importantly, we share it with the rest of the team.

Over time, we have developed the custom of writing a post on our internal blog with these reflections, in a section that we call “How I Broke Production”. These posts usually share the same structure:

  • A narrative of what happened, how we reacted, and how we fixed it. When a problem arises, we usually collaborate through HipChat to notify our customer support experts so they can handle any related support tickets, gather a team to investigate the cause of the problem, and discuss how we can fix it. This recorded, timestamped, textual history makes it easy to build a timeline of events and review how we reacted as a group.

  • The root cause. While extinguishing a fire, our first goal is to bring our application back to normal and minimize the impact to our customers. Once that's done, we dedicate some quiet time to dig up the causes: Was it a bug the tests didn't catch? Some interaction with third-party software or APIs we didn't think of? A system malfunction? We are usually able to track an exception, a log entry, an alert, or some piece of hard evidence that helps us make sense of what happened. We usually apply the 5 Whys technique.

  • The impact. The first of our core values is listening to our customers and this is especially true when something has gone wrong. Our customer support experts strive to give quick and honest responses to the customers affected by the problem, and follow up with them after the issue has been solved. We keep track of those tickets in Zendesk, and cross-reference them with any issues related to the incident or its investigation. It reminds us that we are here because we have customers — when we break production, we affect the workflow of our customers and the schedule of other team members when they help us put out the fire.

  • How we can do better in the future. The downside of being human is making mistakes, but the upside is having an unlimited ability to learn and adapt. In these posts, we identify how we can improve our process — individually and collectively. Maybe it's by increasing the test coverage of some part of the product, changing our process, or thinking more deeply about the potential risks of a change in the future.

  • Encouragement and group learning. The goal of these posts isn't public shaming — they are actually a great way to acknowledge our guilt and let it go. Comments from other team members are always encouraging: you can feel how the people involved grew as software developers by thinking about and fixing the incident. And the rest of the team gets to learn about tricky parts of our product, unexpected situations, and techniques that we all can use in our daily work or when we have to face a similar situation.

This process helps us grow as a team and lets us move on to building useful features for our customers. It doesn't guarantee we won't make mistakes, but it removes the drama and encourages learning from these experiences so we hopefully won’t repeat them. As an accomplished author of my own “How I Broke Production” post, I can attest that when we make mistakes, they are at least new mistakes!

Thanks to Kerri Miller for inspiring this process!

Upgrade Rails Without The Risk

As you might have recently read, we’re not very fond of taking risks at Harvest. But what about a major upgrade, like upgrading Rails?

The first version of Harvest was deployed on Rails 0.14 — so we’ve gone through a few Rails upgrades over the years. Our most recent upgrade was to Rails 4.0, and although it was a big change for us, we were able to break it down and deploy many parts of the update before the actual Rails version update went out.

Our actual deploy that upgraded Rails to 4.0 was very small — a few gem version updates and really minor code changes. Here’s the story of how we did it.

Break Everything

So, how did we start? Well, some things are obvious just by reading the official Rails Upgrade Guide. Other things you find by virtue of trying to boot your application and running the tests. Upgrading a Rails app is a lot like climbing a mountain: you keep moving upwards, fixing one thing after another and clearing obstacles until you reach the top.

Release Small Fixes

Many of the changes required were compatible with the previous version of Rails as well — so as we fixed issues on our Rails 4.0 branch, we were able to merge the changes back into our master branch and release them, keeping each release small and easy to understand.

  • Strong Parameters. This is one of the bigger changes in Rails 4. Luckily the Rails team released strong_parameters, a gem that let us add this feature while still on Rails 3. Even better, when set up properly, it let us convert slowly, model by model, so each deploy could stay really small (see the sketch after this list).
  • match routes requiring the verb. We had tons of routes that needed fixing, but again, this is something we could do beforehand, and in our particular case it was done in 16 different pull requests, all deployed safely and independently.
  • Gem updates. One of the first things you have to do when updating a Rails version is update a bunch of gems. Most of the newer gem versions (except the Rails ones themselves) worked just as well with Rails 3.
  • Autoloading. We had a bunch of errors running our test suite — most of the changes required that we reference classes or modules by their full name (External::Export instead of Export while you're inside the External module). These are changes that can be merged and deployed at any time.
  • Undigested assets. Rails 4 stopped generating undigested assets, so if you depended on them you had to account for that. The recommended solutions were to reference the digested versions by using one of the various Rails helpers, or to move those assets to /public. Again, a small, simple change that we released before the upgrade of Rails.
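
To give a feel for how granular the strong_parameters conversion can be, here is a minimal sketch of opting a single model in while the rest of the app keeps its old attribute protection. The model and attribute names are hypothetical, not Harvest's code; the pattern follows the gem's documented usage.

# One model opts in to strong parameters while still on Rails 3;
# everything else keeps working as before.
class Invoice < ActiveRecord::Base
  include ActiveModel::ForbiddenAttributesProtection
end

class InvoicesController < ApplicationController
  def create
    @invoice = Invoice.create(invoice_params)
    redirect_to @invoice
  end

  private

  # Only whitelisted attributes make it through to the model.
  def invoice_params
    params.require(:invoice).permit(:subject, :amount)
  end
end

Because the protection is opted into model by model, each conversion can ship as its own small pull request.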

Besides that, there were 6 other pull requests with a variety of tiny tweaks in the way we did things that could be reimplemented in a way that worked both with Rails 3 and Rails 4. Every week, after getting a few more tests green on the rails-4 branch, I'd list the commits and see if there was something that could go back to master. You'd be surprised how much stuff can be backported.

Don’t drop the discipline of single meaningful commits. That will help you clearly see which commits can be backported and deployed to production right away. Imagine you usually work off master and you’re working on a rails-4 branch. You can very easily create a backport branch with:

git checkout master
git checkout -b rails-4-backmerge

# Repeat for every commit you think you can backport
git cherry-pick <sha> 
rake # run tests

Once you’re done and your backport is merged into master, make sure to merge master into rails-4 (or if you’re feeling adventurous, rebase rails-4 off of master).

Ignore Deprecation Warnings

We completely ignored deprecation warnings from Rails 4. They’re warnings for a reason — they don’t need to be dealt with immediately. Remember, our goal is to go live with as few changes as possible.

We dealt with all our deprecation warnings after the initial launch — over 18 pull requests in a few weeks, touching more than 1,700 lines of code.

Rehearse a Rollback Plan

With an update of a core gem like Rails, we couldn’t simply rely on our normal cap production deploy:rollback. While we were preparing for the release, we developed a rollback plan, and rehearsed the release and rollback on one of our test machines.

This turned out to be great practice, because the first time we attempted the Rails 4.0 release we discovered something wrong through our checkup plan and quickly rolled back the release without any issues.

Balance Risk Versus Effort

Some people have gone the extra mile and made their apps dual bootable. GitHub has a nice story about it and Discourse had it for a while. That would have given us great flexibility and let us slowly release this upgrade one server at a time.

We considered dual booting, but decided it wasn't worth the effort for our application. The extra complexity of actually implementing it, plus the whole new set of problems that arise when two different versions coexist at the same time, led us to go with backports instead.

After merging most of our changes back into master we realized that what we needed to deploy was actually quite minimal. So we all agreed we’d try a deploy with a well-thought checkup plan and a rehearsed revert strategy instead.

Release With Confidence

Although upgrading to Rails 4.0 was a major change for us, we were able to break down the changes and release them slowly over the course of a few weeks. We merged as many changes back into our master branch (Rails 3) as possible, keeping the changes required for the Rails 4.0 release as small and simple as possible.

We developed a rollback and checkup plan, rehearsed them, and used them successfully — and followed up our release with a series of changes which removed our reliance on deprecated methods.

After following these techniques we were able to painlessly upgrade Harvest, a pretty large application, without scheduling a downtime or any disruptive customer outage. As a developer, that feels fantastic.

Deploy Safely

We’re not very fond of taking unnecessary risks at Harvest — and the easiest way to avoid big risks is to make small, incremental changes. Like a cook tasting a dish, it allows us to make small adjustments as needed to make Harvest the best it can be. Mmmm, salty passwords.

You’re hopefully unaware, but we update Harvest a lot (for example, 13 times yesterday) — and most of the time it’s easy and natural to make small, simple changes. But how do you make small, incremental changes with upgrades or new features?

Minimize Changes

“Is this the minimum set of changes that we need to go live?”

Sometimes when you find yourself in a section of code that hasn't been touched in a while, it's tempting to make a whole bunch of unrelated changes. “Oh, this is using the old hash syntax, I'll just make a quick update to take out the hash rockets” can quickly become “Wait, why is this even a class anyway?” — and before you know it, you've made a set of changes which has nothing to do with what you set out to do.

“Is this the minimum set of changes that we need to go live?”

Thanks to our amazing developer operations team, there's virtually no cost to making updates to Harvest — so why lump your changes together into a release whose behavior is impossible to predict?

We try to make our pull requests as boring as possible. They should be simple, plain, and easy to digest. No need to include those hash update changes — you can just as easily follow up afterwards with another pull request.

“Is this the minimum set of changes that we need to go live?”

It Was The Best Of Times, It Was The Worst Of Times

No matter the change, always have a plan for after the pull request is released — a plan for success and a plan for failure.

“We can just cap production deploy:rollback, right?” Well, not always. Some updates can have rippling consequences — sessions expiring and logging everyone out, secure cookies becoming incompatible and unreadable, serialized data suddenly becoming unserializable — and not all of these consequences are immediately apparent.

Deploys should be boring. Rehearsing a revert plan can mean the difference between high-stress downtime and “Whoops, let's try again tomorrow.” Rehearse what it means to fail, at what point you'll decide to roll back a release, and what needs to be done (and who will be doing it). We all make mistakes — and we should expect them to happen often, and be prepared for them.

Just as important as a rollback plan is a checkup plan — the plan you follow when an update is successful. You can consider it the definition of success — all the places you’ll need to check once the update is made to make sure things are working the way you expect.

Releasing Large Features

Large features (like our new Projects section) can easily balloon into huge, risky releases — but we’ve adopted using feature flags to break down big features into small, incremental changes.

If you’ve been using Harvest since November of 2013, you’ve been using a copy of Harvest which supports the new Projects section — it’s just been hidden from you (or not, if your account has been in our early access group (thank you!)). Each change we’ve made has been released into our production code, one small change at a time.

Feature flagged releases can be some of the safest releases possible — since we can release a feature to a small subset of our customers and make sure it’s behaving the way we expect (by following the checkup plan).
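
To make that concrete, a feature-flagged code path can be as small as a conditional around the new behavior. Here is a minimal sketch, assuming a hypothetical Feature helper, flag name, and current_account method rather than Harvest's actual implementation:

# Hypothetical helper: a flag is "on" only for accounts in its early access list.
module Feature
  EARLY_ACCESS = {
    new_projects: [1042, 2077] # placeholder account ids
  }.freeze

  def self.enabled?(flag, account)
    EARLY_ACCESS.fetch(flag, []).include?(account.id)
  end
end

# The new code path stays hidden until the flag is flipped for an account.
class ProjectsController < ApplicationController
  def index
    if Feature.enabled?(:new_projects, current_account)
      render :redesigned_index
    else
      render :index
    end
  end
end

Widening the release then becomes a data change (adding accounts to the early access list) rather than a new deploy, and the checkup plan can be run against that small group first.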

Real Talk

At Harvest, we like releases to be boring, simple and straightforward, prepared and predictable. Small changes don’t keep you late at work or make you work on the weekends or on holidays. We’ve found that keeping our releases small means constantly asking ourselves and each other “Is this the minimum set of changes that we need to go live?” It means preparing and planning for failure and success. It means using tools like feature flags to break down big, risky releases.

It means we can eat dinner with our families.

Why I enjoy writing software at Harvest

I occasionally help out the team by interviewing new candidates. The most common question I have been asked — “What do you enjoy most about working at Harvest?” — is a great question for an interviewee to ask. There are several stock answers that I think most people reach for. You can talk about how much you like the team, the culture, the flexibility, or the customers you help. I appreciate all of these at Harvest, but there are some specific anecdotes that I think really help paint a picture of what it is like to work at Harvest.

The Codebase

I enjoy working with our codebase. That sounds shallow since the codebase is just a means to an end: providing a great experience for our customers. However, working daily with Harvest's codebase is the full-time job for me and the team I work with. If it were painful, that'd be a negative daily experience to weigh against the other benefits. I also firmly believe that our team's ability to help solve our customers' needs is directly influenced by the ease with which we can build features and fix problems.

You should know that Harvest launched in 2005, before Rails had even hit version 1.0, and we have our share of legacy code that has persisted over the years. I joined Harvest over two years ago, and my experience up to then pretty much amounted to enterprises, startups, and consulting. For the longest time my favorite projects were greenfield, where the teams could work on the problem at hand with very little regard to legacy code. So why do I enjoy working on a large, old app like Harvest? Because it is amazing and inspiring to see it transform and improve over the years.

It is one thing to launch a project for a client, fix a few bugs, and then move on or throw a ball of hacks together in order to get profitable as soon as possible. It is another world altogether to deploy a feature set, and then continue to improve and polish it as time goes on. We do have our spurts where we will rewrite old code from scratch, and we might even grumble at the previous developer’s short-sighted design choices while we do it (which is funnier when you realize you were the previous developer). But our team recognizes that old code helped get Harvest to where it is today.

There is a respect, a beauty, and an admiration that comes from working long term on a legacy codebase alongside this team.

The Team

The team here at Harvest is critical for our codebase to improve. Working on a legacy codebase with an understaffed crew is definitely unfun. Have you ever accidentally deployed a bug and had to answer a flood of support tickets while you’re trying to figure out a fix and deploy it? Have you ever critically botched the infrastructure and suddenly had to scramble to fix the servers?

I am grateful for the teams we have here at Harvest. I'm a big fan of our Harvest Experts, who genuinely care about solving our customers' problems and diagnosing bug symptoms to pass on to developers. They honestly mean it when they tell our customers to write or call in when they need help.

I am relieved we have a brilliant and hardworking Operations team that is focused on our infrastructure 24x7, dealing with daily problems ranging from security reports and patches to capacity planning.

I am thankful for our Product Design team, who think a lot about our customers' experience of the interface. They also work closely with the Development team and adapt if technical limitations are uncovered.

We have several other roles too, like our Mobile Developers, Quality Assurance, Marketing, External Integrations, and Office Managers that all fit in a unique way to improving our company and product. Because of these roles, I can focus on what I do best.

Roles and Communication

Now you might get the impression that each of these roles is rigidly defined and adds to office bureaucracy; Developers need to submit “Database Change Requests” to Ops and everyone is busy filling out TPS reports. Not at all.

Each one of these teams fluidly communicates with other teams in order to solve the problems at hand. Developers are asked to take over certain support tickets. Ops members are invited to a technical design discussion. Experts are asked about common customer problems while we redesign an existing feature. Developers let Experts know beforehand when we’re going to launch features so they can field support issues with current information.

There isn’t any central authority that dictates these processes. Each team puts together their process and if things go poorly, they talk about it and improve it. Sometimes that requires talking and changing a process between teams. We do have a few people who float between teams that can view issues in the aggregate and suggest improvements.

Empowerment

A legacy codebase would never improve if the team working on it didn’t have the authority to make big changes. The freedom to improve areas is a wonderful advantage. But with that advantage comes responsibility: we still have to prioritize features and get a general consensus among the immediate team.

I’ve worked in some places where the unspoken rule was “It ain’t broken enough, so don’t fix it” and that is not the case at Harvest. Most of us have our own personal short list of areas we want to make better and have the flexibility to squeeze wins here and there within a normal work week.

The team submits and vets bigger feature changes and rewrites, not a Product Manager or a Project Manager, and those projects are prioritized alongside everything else.

A Direct Influence

The direct influence everyone on our team has on our customers means a lot to me. I like that my work positively affects contractors, designers, business owners, and many others. Our customers' needs are very real and have serious impacts on their livelihoods.

Some of our customers are independent, with various distractions that pull at them. It's very satisfying to know that my work can improve their day-to-day life. Our team has a responsibility to keep a stable infrastructure with as few bugs as possible, because our screw-ups can cause a big disruption for our customers.

A Glimpse

A recent meeting inspired me to write this post. My team is working with another team on an integration that has some particularly interesting (and fun!) performance implications.

We met a few times to try and hammer out how one piece of code should talk to another. These meetings have included Design and Ops to help weigh in with different perspectives. Our last meeting between the teams had five developers focusing on a specific technical issue.

  • We talked about a few straightforward solutions and agreed on one, along with some metrics to watch.
  • We agreed that, as we learn more, we might need to bring Design back into the discussion to figure out some user-interface compromises because of performance.
  • We agreed we need to take a serious look at the root cause of the performance issue and prioritize that when we can.
  • We ended the meeting with “Let's all expect to rewrite parts of this when we roll it out to 100%. We can consider ourselves lucky if it makes it to production as-is.” That got a few chuckles, but everyone nodded in agreement.

This is a great glimpse of what it is like to work here. We know we may have to change as we go. The team is not afraid of being wrong and has the courage to move forward with the information we have. Multiple areas can weigh in with input. The teams are not shutting down the conversation by pointing fingers or huffing “that’s not my problem.”

A Coworker’s Comment

I asked my coworkers to review a draft before I published this. Andrew said parts of this post resonated with him. I liked his comment so much that I asked his permission to end with it.

I love watching This Old House — it's one of my weekend guilty pleasures. I keep coming back to the show because of the amount of work they put into a project house — it's incredible. The current project has a closed-in back yard, so when the project called for replacing three stone walls, they had to disassemble the walls and carry them out one wheelbarrow at a time, through the house, to a dumpster on the street.

They care for the house. But it’s not just care — it’s a kind of love. My mother used to tell me, “no matter what you do, you’ll be our son, and we love you for that.” I like to think of the craftspeople on This Old House that way — “No matter what kind of mold and crumbling masonry we find, you’ll always be our project house, and we’ll make it right.”

How We Fix Bugs at Harvest

As Harvest's team has grown, we've had to evolve our bug-fixing process. Years ago we developed a concept called Delta Force as a way to protect the bulk of the development team from a constant need to respond to bugs. On a rotating basis, one person would handle escalated support and fix bugs as time allowed. This used to keep us collectively sane.

In 2014 we discovered this no longer worked well for us. When a developer would take their turn on Delta Force, they would feel the weight of numerous unfixed bugs all day and well into the night. They would wake up with a hopeless feeling. When you fix three bugs in a week while four new bug reports come in, you're bound to feel disappointed in yourself. The additional strain of a growing customer base, and by extension more need for escalated customer support, made for an untenable situation.

The development team got together and brainstormed ways to make the process of fixing bugs more bearable. Not only did we want to improve the life of whoever was in that Delta Force role, but we also wanted to lower our bug count from week to week.

We found that most of the stress of the Delta Force role was coming from bugfixing, not from escalated support. We decided to separate those roles – to put bugs in the hands of the entire development team. Rather than defining goals like we had in the past (Bug count less than ten! Fix five bugs per week!), we simply asked people to claim a bug at the beginning of the week and do their best to fix it. Some bugs are big and take longer. Some bugs are quick and feel a little cheap. It's all good!

Through our brainstorming process we also learned that there was wide support across the team for a bugmash week — a week where the whole team would pause their full-time projects and focus on bugfixing. We kept that in our back pocket during the summer. As Harvest continued to grow and bugs continued to accumulate, we decided to give a bugmash week a try.

For a week in late September we each claimed a small pile of bugs and set to work. We touched everything from customer-reported bugs to monster queries to support tools to staging environments to stale, abandoned code. We spent 326 hours closing over fifty-five bugs in the Harvest suite of applications. We couldn't be happier with the results.

Keep this lesson in mind as your organization grows. Processes that you trusted and felt confident about for years can become obsolete. The trick is to change how you do things without changing the good intentions behind the original structure. And don't forget that you'll probably have to change again in the future.

Of course, it never hurts to have an amazing team that can work together with skill, humility, and supportiveness. :)

Code Reviews at Harvest

Let’s face it — code reviews can be tough. Even if your team fully adheres to a certain style guide, programming is so subjective that smart people can argue great points on conflicting approaches.

We use GitHub Pull Requests heavily at Harvest and we require code reviews for everything meaningful that goes into production. Here are a few guidelines that we've developed for reviewing each other's changes so we can stay productive and focused on what's important — launching code.

Our current process

Every company has a different deployment process, but here at Harvest it looks a little something like this:

  1. We discuss what needs to be built, changed, or fixed.
  2. A developer/designer creates a branch off of master.
  3. They work on their branch. When it’s complete they push it up to GitHub and create a Pull Request tagging an appropriate person or team to review it.
  4. One or more tagged people from that team will review the code and give a +1 when they feel it’s ready to be deployed.
  5. The original submitter will then merge it into master and deploy it.

Multiple posts could be written about each of these steps. However I’m only going to talk about #3 and #4.

We use code reviews at Harvest to help communicate with the team what’s going into production, to help each other learn new tricks and techniques as things evolve, and to point out specific areas to investigate that we may unintentionally break.

If it fixes a serious bug, just let it go

We might disagree on approach, but if there’s a serious issue in Harvest that’s affecting a customer, and someone on the team has a fix for it, we will always let it ship. We can always circle back and fix it better or more thoroughly later.

Know what’s blocking your code

Code reviews can feel unproductive because they don’t have a clear goal. We’ve put together a survey we call FIAS (the Filler Impact Assessment Survey), a tongue-in-cheek acronym named after Patrick Filler, the Harvester who proposed the idea.

The idea is simple: take an educated guess at answering three questions, each on a scale of one to ten:

  • How much of the app is affected?
  • How much of this change is mysterious to you?
  • How easy is it for you to imagine this performing in an unexpected way after deployment?

Then add up the scores for each question (a quick sketch of this tally follows the list below):

  • If the score is less than fifteen, you only need one person to give you a +1.
  • If the score is fifteen to twenty, you need two people to give you a +1.
  • If the score is over twenty, you need two people and full QA.
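
To make the thresholds concrete, here is a toy sketch of the tally. The survey itself is just three answers from one to ten; only the cutoffs encoded below come from the process described above.

# Each answer is a score from 1 to 10; the sum decides what the
# Pull Request needs before it can ship.
def fias_requirement(affected, mysterious, unexpected)
  score = affected + mysterious + unexpected
  if score < 15
    "one +1"
  elsif score <= 20
    "two +1s"
  else
    "two +1s and full QA"
  end
end

fias_requirement(3, 5, 4)  # => "one +1"
fias_requirement(6, 7, 9)  # => "two +1s and full QA"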

QA is different for each team. Harvest has staging environments that mirror production and we have scripts for each section that we manually test.

This isn’t perfect and the results may not make sense for other teams. For example, another team may want three people to +1 or always QA above ten points. However, the point is that, when a Pull Request begins review, we have a clear idea what it will take to launch it to production.

We also don’t always follow the FIAS. For example, even if we only need one +1, the submitter may still override the FIAS and ask for two to three +1s because they know their changes involve a particularly tough part of the codebase.

Now that we have a set goal, we can work backwards. If some of the changes make a reviewer uncomfortable, can that part be segmented out and the rest launched instead? Can this branch be merged even if there's a heated discussion over using the Single Responsibility Principle vs a single clear object?

With a simple process, we’ve removed the ambiguity that most code reviews start with.

Communicate clearly to reviewers what’s going on

FIAS is a great tool to get a general sense of how large (and risky) a change is to you — but it isn’t always the best tool for communicating to your reviewers what the change actually entails. Sometimes, GIFs or annotated before and after images (made with tools like Monosnap) can be more effective. Pre-commenting certain areas in your own Pull Request is also helpful: “I know this line seems unrelated, but it is because…”

Clearly communicating the change starts the code review on the right foot. The submitter may have invested days working on a change, but the people reviewing it have not.

Clarify blocking comments

Not every comment or question has to be resolved. Text doesn’t always convey emotion, and it’s easy to misread someone’s intent. It’s perfectly fine to ask a reviewer, “Is this a blocking issue?” Often it isn’t, or it can be handled separately.

Pull Requests can also end up being a lightning rod for debate. Discussions among the team can continue on a Pull Request, but they can also be moved elsewhere — to a separate issue, internal tool, or even a meeting.

Our reviewers will normally prefix comments with “[NB]” for non-blocking comments: “[NB] This works, but here’s a quick snippet that’s a little more clear”. A simple prefix like that can help speed along code reviews.

Face-to-face meetings

We often raise a white flag and ask for an impromptu face-to-face meeting or a quick chat, usually by spinning up a Google Hangout or conversation in HipChat. This seems like an obvious tip. However recognizing the need for these meetings can be tricky. If two people have posted back and forth at least once on the same topic, it will likely be easier to just hash it out face-to-face. If you find yourself writing two paragraphs, some of your concerns will likely get lost. You may be able to convey your thoughts better over audio.

It’s nice to see a comment history of how decisions were made. However, you can still accomplish this by posting a quick recap of what was decided in the meeting.

Don’t limit the number of reviewers

Everyone on the team pitches in with code reviews and we don’t have official reviewers here at Harvest. We may purposefully ping someone who we know has had a lot of experience in a particular area, but we never wait on one person to go through all the reviews.

This becomes a bigger deal as the team grows. With an application as large as Harvest, it’s extremely difficult to keep everything in your head. And even if you do feel good about a certain area, it will likely change over time as other people help out with maintenance.

We also notify people to give them a heads-up without expecting them to review the pull request. We do this by prefixing “/cc” before their name: “/cc @zmoazeni this might affect your work on reports”.

Not a perfect process

We don't have any delusions that our process is perfect. However, all of these points help speed along our code reviews. Taking some of these tips and morphing them to fit your organization may help out your team too — or, if you think your process could help out our team, write up your process and send us a link! We're always looking to improve.

Understanding D3 Selection Operations

As you might have heard, we’ve been working on a few new things around here, including new ways of visualizing data. One of the new libraries we’ve pulled in to help out is D3.

If you're not already familiar with D3, it's a library designed to help transform DOM elements in response to data sets. Although D3 is frequently used with SVG elements (as we are using it at Harvest), using SVG isn't strictly required.

Sometimes when working with D3, especially in the context of animating graph elements, we've discovered that the DOM elements don't quite do what's expected. I had an epiphany moment the other day when trying to understand how D3 selects DOM elements and compares them with the data set — hopefully this post can help you out if you're stuck in the same rut that I was!

The Epiphany

There are two critically important things I missed when I thought I knew how D3 selections work:

  1. D3 stores the data object which is responsible for creating a DOM node on the DOM node itself.
  2. You can control the method by which D3 compares your data objects with those stored on DOM nodes to determine if they are the same.

Each time your graph is drawn, you can think of D3 as grouping the required actions into three segments on a Venn diagram:

[Figure: Venn diagram of two intersecting sets, DOM Elements and Data Set. The intersection is labeled Update, DOM Elements alone is labeled Exit, and Data Set alone is labeled Enter.]

  • Remember, D3 stores the data object that created a DOM element on the element itself. If that data object is no longer present in the data set, it’s considered an “exit”.
  • If a data object exists both in the data set and as a property of a DOM node in the selection, it’s considered an “update”.
  • If the data object has no DOM element in the selection, then it’s considered an “enter”.

Any operation called within the context of the selection.enter() or selection.exit() functions will be executed only during those phases:

var dataset = [{ value: 35 }, { value: 13 }];
var lines = d3.select("body").append("svg").selectAll("line").data(dataset);
var yValue = function (d) { return d.value; };

lines
  .enter()
    .append("line")
    .style("stroke", "black")
    .attr("x1", 0)
    .attr("x2", 100)
    .attr("y1", yValue)
    .attr("y2", yValue);

lines
  .exit()
    .remove();

Run example on jsfiddle.

Anything not within selection.enter() or selection.exit() is considered part of the “update” operation. The “update” operation is also called immediately after “enter”, and updates can easily be animated:

var dataset = [{ value: 35 }, { value: 13 }];
var lines = d3.select("body").append("svg").selectAll("line").data(dataset);
var yValue = function (d) { return d.value; };

lines
  .enter()
    .append("line")
    .style("stroke", "black")
    .attr("x1", 0)
    .attr("x2", 100)
    .attr("y1", 0)
    .attr("y2", 0); // start from zero (entrance animation)

lines
  .transition()
    .attr("y1", yValue)
    .attr("y2", yValue); // called immediately after enter(), and when value changes

lines
  .exit()
    .remove();

Run example on jsfiddle.

Controlling How Objects Are Compared

In order to successfully animate chart elements when data changes, it’s important to keep the relationship between the proper DOM nodes and items in your data set. By default, D3 uses the index of the data in the data array to compare the data set with DOM elements. This isn’t always ideal — but thankfully, you can pass a second (optional) parameter when assigning data to a selection (when calling the .data() function):

var dataset = [{ id: 2553, value: 35 }, { id: 2554, value: 13 }];

var lines = d3.select("body").append("svg").selectAll("line")
  .data(dataset, function (d) { return d.id; });

One Last Thing…

The first time the chart is rendered, you shouldn’t feel strange about creating a selection which has no elements:

var lines = svg.selectAll("line"); // but no <line> elements exist yet!

Just think about it in terms of our diagram, and remember that if the DOM element doesn’t exist yet, you’ll get a chance to create it during selection.enter().

[Figure: Venn diagram of two intersecting sets, DOM Elements and Data Set. The intersection is labeled Update, DOM Elements alone is labeled Exit, and Data Set alone is labeled Enter.]

Happy charting!

If you're looking for further reading on selections, you should consider reading How Selections Work, Object Constancy, and Thinking with Joins (from which the Venn diagram in this article was adapted).

How we migrated our assets to S3 without any downtime

Harvest used to have a local solution for storing user-generated files. We're talking avatars, estimate attachments, or expense receipts. This solution was based on GlusterFS and required that we maintain our own set of file servers. There was also a fair bit of work involved in keeping everything stable.

For example, keeping GlusterFS assets in sync between multiple locations is problematic. We either have to keep on top of a batch-oriented sync process, or use GlusterFS's native geo-replication.

This in itself is not a huge problem, and the amount of data we have isn't that big, but it's data that keeps growing every week.

[Figure: The numbers, a totally outdated rundown of the data we were storing at the time.]

S3 means having a single, decoupled and reliable store with endless capacity. We really like that.

Now, migrating this data without downtime is not that simple. Besides, we like to release things slowly, starting small to catch any problems before they hit all of our users.

We mostly used Paperclip to handle our assets, so we decided to build an extension that would allow us to migrate seamlessly. This extension is called paperclip-multiple, and it provides a new storage called :multiple that stores files both on the filesystem and in S3 (using the fantastic fog gem).

Thanks to that we could migrate our assets one by one without any downtime. The process worked like this:

  1. Change one type of asset (say, user avatars) to use this multiple storage and deploy it. From this moment on, new avatars are stored both on GlusterFS and on Amazon S3 (see the sketch after this list).
  2. Synchronize the local filesystem with S3. For this we used the s3cmd sync command, which worked fantastically and was really fast when we had to run it repeatedly.
  3. Now all files should be on S3, meaning we can change the storage to just :fog and be done with it!
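
For step one, the change to a model is small. Here is a rough sketch of what an attachment definition can look like with the :multiple storage; the model, attribute, path, and bucket are placeholders, and the fog options shown are standard Paperclip settings rather than anything specific to the gem:

class User < ActiveRecord::Base
  # Write every new avatar to the local filesystem and to S3 (via fog).
  has_attached_file :avatar,
    storage: :multiple,
    path: ":rails_root/public/system/:attachment/:id/:style/:filename",
    fog_credentials: {
      provider:              "AWS",
      aws_access_key_id:     ENV["AWS_ACCESS_KEY_ID"],
      aws_secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
    },
    fog_directory: "example-harvest-assets" # placeholder bucket name
end

Once the sync in step two has caught up, step three is just switching storage: :multiple to storage: :fog and dropping the filesystem-only configuration.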

paperclip-multiple allowed even more flexibility. With feature flags we could specify for which users we wanted to store files on S3 or just keep working with the filesystem as before. We could also feature flag which user avatars we wanted to display from S3. This let us try file uploading for a while on our internal accounts, and later on try displaying them from S3.

Now, this process wasn't as simple as that. We discovered some bugs in the process, some of the code didn't use Paperclip properly and needed a lot of modifications, and we also built a proxy to serve all this data even faster and cheaper (but that will be a story for another time).

Even more interesting were the exports, which didn't even use Paperclip and, quite literally, were all over the place. We've rebuilt all of them: they are now stored on S3, automatically expire after 15 days, and the new code is a lot easier to work with and, most importantly, to extend. This also means that what used to be our biggest source of growth per week isn't really growth anymore, because the files are deleted after a certain time.

You can find the code on GitHub, with more in-depth explanations about how to use it and how it works.

Finding the Courage to Constantly Critique

Constantly be thinking, critiquing your work in real time. ~ The Pragmatic Programmer

A big part of my job is to turn a critical eye on the work of others: code review, job applications, and so on. My harshest criticism falls on the code I wrote years ago. I've been at Harvest longer than any other developer on the team. Few at Harvest get to live with their terrible five-year-old code like I get to live with it. Daily it is in my face. In the course of our days we all run into our prior code failures. The idea of turning a critical eye on myself even more frequently than I'm already forced to is daunting, but embracing this fact is probably for the best.

Self reflection helps you become a better programmer. Most programmers go through the process once per year. Hello annual review! I believe a habit of frequent reflection is a habit with compounding returns. Understanding your strengths and weaknesses will allow you to learn, adapt and grow at a rapid pace.

A weekly reflection on your work feels like a nice compromise between annual reflections and a daily journal. I did this for a couple of years during my Getting Things Done® phase. I think weekly reflections can be pretty effective, but they have two failings. First, I don't think a weekly reflection saves any time over a daily journal. A proper weekly reflection should take one or two hours. Second, I don't think a weekly reflection is habit forming. If you miss a single weekly reflection you've pretty much broken the habit. Trust me, the weight of doing a three-hour reflection about the past two weeks is too much to bear.

I've never made a fair attempt at a rigorous daily developer journal. If I try a true developer journal, I will go into it with a checklist of questions:

  • What did you do today? (Be detailed.)
  • What decisions did you make?
  • How could these decisions backfire?
  • Did you have any conflicts today?
  • Did you enjoy the work you did?
  • What can you do better tomorrow?
  • What are you grateful for?

I often struggle with my memory. For whatever reason I find others more capable of retrieving experiences from months ago. Perhaps my brain is just built differently than most. Perhaps a daily journal is just what I need to better internalize the lessons learned from my daily successes and failures. Perhaps a daily journal will help me better collect and communicate my thoughts.

Perhaps I just need the courage to find out.

Segmentation Faults

For some time Harvest had been struggling with unsolved segmentation faults. I'd like to tell everyone the story of this journey. But first...

What is a segmentation fault?

Segmentation faults, or segfaults as we like to call them, are the errors that happen when a process tries to access memory it cannot physically address. Wait, what? Addressing memory? I thought I was working with Ruby and didn't have to take care of that!?

It turns out the Ruby interpreter itself has bugs (surprise!). Gems that have C extensions must worry about segfaults as well.


What was happening?

A segfault is different from other kinds of errors, because it's so terrible that it kills the process. I mean, if things get so messy on the inside that you're accessing memory you shouldn't, the best you can do is stop what you're doing so you don't make it worse.

When the process dies, the error is usually not sent to whatever exception tracking software you use. In our case, it's Passenger that notices the segfaults and dumps the error that Ruby returns into the Passenger logs.

We were aware of the segfaults, but not being on our exception tracker made them very easy to ignore. That is until we started using New Relic and lots of exceptions started happening.

Exceptions?

But you said segfaults didn't cause exceptions, they just killed the process? You're right my friend, but let me get to the point.


What we saw on our exception tracker was a lot of exceptions happening within a three-minute period, all crashing on the same line of code in the newrelic_rpm gem. The first suspicion, of course, was a bug in that gem. We opened a support ticket with New Relic immediately and they started investigating. On our side, we discovered that the errors happened on the same server and, with further debugging, on the same process.

The way New Relic works is that its agent spawns a background thread alongside each Rails process. This thread sends the instrumentation data to New Relic, so collecting it shouldn't get in the way of the requests the primary Rails process is serving.

What we saw in the New Relic logs is that an agent would spawn, only to cause an exception a few minutes later, and it would keep causing exceptions until yet another few minutes later it would stop. We have configured Passenger to have every Rails process respond to only a limited number of requests, which explained this limited lifetime pattern we saw.

The exceptions looked different on almost all these incidents, but they would repeat until the agent stopped. Here are some examples:

ERROR : Caught exception in trace_execution_scoped header.
ERROR : NoMethodError: undefined method `<' for nil:NilClass
ERROR : Caught exception in trace_execution_scoped header.
ERROR : NoMethodError: undefined method `count_segments' for false:FalseClass
ERROR : Caught exception in trace_method_execution footer.
ERROR : NoMethodError: undefined method `tag' for #<Array:0x007f70a3c7bbe8>

On their side, Ben from New Relic pointed to a very funny exception that did get tracked: NotImplementedError: method 'method_missing' called on unexpected T_NONE object (0x007ffa15822ca8 flags=0x400 klass=0x0). According to his analysis, this is a theoretically impossible combination of flags in Ruby Land.

Stuff that shouldn't be false

That exception got me thinking: wait, something is getting corrupted without causing a segfault?

I've been checking our exceptions daily for a few months, and let me tell you, the amount of activity in there was suspicious. Here are some example exceptions:

  • TypeError: false is not a symbol
  • NoMethodError: undefined method 'keys' for nil:NilClass (This error happened on controller#action called false#false)
  • TypeError: can't convert false into String
  • NoMethodError: undefined method 'zero?' for false:FalseClass

I think every Ruby developer is used to NoMethodErrors on nil. I mean, it's the Ruby version of an uncaught exception in Java. You see it almost every day in your work life. But NoMethodErrors on false? It's really not that easy to get a false by mistake.

Corruption

In the end, the data pointed to things getting corrupted. Random variables seemed to end up with values that weren't right. All of it was sprinkled with a healthy amount of segfaults.

My gut feeling was that it all had to be the same problem, but what could be done about it?

The (only) good thing about segfaults is that they leave a trace. So instead of focusing on seemingly random errors, we could at least try to get some useful information from the thousands of segfault logs.

How do you debug a segfault?

Short answer: it's very difficult, almost impossible. As with any bug, the first thing you should try to do is reproduce the error. With these segfaults we never managed to, since they were so random. Besides, of all the information provided by a segfault, Ruby devs only understand the normal stack trace, and that's about it.

So unless you're a C expert, you're pretty much screwed. However, we can use the common wisdom of the masses. If you ever need to fix a segmentation fault, don't waste your time debugging. These are almost always the solutions:

  • It's probably a bug in one of the gems that uses C extensions. Hope that somebody else had the problem first and they solved it. Therefore, update or remove all your gems with C extensions.
  • If not, then it's probably a bug in the Ruby interpreter. Update Ruby to the latest patchlevel.
  • If you're already at the latest patchlevel, then try the next version.

Updating gems

Thanks to the part of the segfault dumps that lists the files and C extensions that were loaded, we compiled a list of possibly offending gems. Then we updated or removed all of these gems, and the segfaults stopped. Our update process pointed to a very outdated version of oj we were using. Oj is a fantastic gem that fixes bugs diligently, but we had not kept it up to date.
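
If you ever need to build a similar list of suspects, RubyGems can tell you which installed gems ship native extensions. A small sketch:

# Print every installed gem that builds a C extension.
require "rubygems"

Gem::Specification.each do |spec|
  puts "#{spec.name} #{spec.version}" unless spec.extensions.empty?
end

Cross-referencing that output with the libraries mentioned in the segfault dumps narrows down the candidates quickly.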

And now our exception tracker has been free of seemingly random exceptions!

Lessons learned

We learned many things, among them:

  • Don't try to understand a segfault. It's almost never worth it.
  • It's really important to keep gems up to date.
  • Things that don't blow up loudly are very easy to ignore.

I'd like to thank New Relic for their support. Their contribution helped a lot, even though in the end it turned out to be a problem in our own system.

In retrospect the solution seems so obvious that it hurts. We're very sorry to anyone who got a totally random error using Harvest. We take errors very seriously and we're very glad that we can again focus on the errors in our application code.