Tech Time

Living in harmony with grizzly bears, writing code, and reflections. By the makers of Harvest.

Deploy Safely

We’re not very fond of taking unnecessary risks at Harvest — and the easiest way to avoid big risks is to make small, incremental changes. Like a cook tasting a dish as it comes together, working in small increments lets us make adjustments as needed to make Harvest the best it can be. Mmmm, salty passwords.

You’re hopefully unaware, but we update Harvest a lot (for example, 13 times yesterday) — and most of the time it’s easy and natural to make small, simple changes. But how do you make small, incremental changes with upgrades or new features?

Minimize Changes

“Is this the minimum set of changes that we need to go live?”

Sometimes when you find yourself in a section of code that hasn’t been touched in a while it’s tempting to make a whole bunch of changes by accident. “Oh, this is using the old hash syntax, I’ll just make a quick update to take out the hash rockets” can quickly become “Wait, why is this even a class anyway?” — and before you know it, you’ve made a set of changes which has nothing to do with what you set out to do.

“Is this the minimum set of changes that we need to go live?”

Thanks to our amazing developer operations team, there’s virtually no cost to making updates to Harvest — so why lump your changes together into a release whose behavior is impossible to predict?

We try to make our pull requests as boring as possible. They should be simple, plain, and easy to digest. No need to include those hash update changes — you can just as easily follow up afterwards with another pull request.

“Is this the minimum set of changes that we need to go live?”

It Was The Best Of Times, It Was The Worst Of Times

No matter the change, always have a plan for after the pull request is released — a plan for success and a plan for failure.

“We can just cap production deploy:rollback, right?” Well, not always. Some updates can have rippling consequences — sessions expiring and logging everyone out, secure cookies becoming incompatible and unreadable, serialized data suddenly becoming unserializable — and not all of these consequences are immediately apparent.

Deploys should be boring. Rehearsing a revert plan can mean the difference between high-stress downtime and “Whoops, let’s try again tomorrow.” Rehearse what it means to fail, at what point you’ll decide to roll back a release, and what needs to be done (and who will be doing it). We all make mistakes — and we should expect them to happen often, and be prepared for them.

Just as important as a rollback plan is a checkup plan — the plan you follow when an update is successful. You can consider it the definition of success — all the places you’ll need to check once the update is made to make sure things are working the way you expect.

Releasing Large Features

Large features (like our new Projects section) can easily balloon into huge, risky releases — but we’ve adopted feature flags to break down big features into small, incremental changes.

If you’ve been using Harvest since November of 2013, you’ve been using a copy of Harvest which supports the new Projects section — it’s just been hidden from you (unless your account has been in our early access group, in which case: thank you!). Each change we’ve made has been released into our production code, one small change at a time.

Feature flagged releases can be some of the safest releases possible — since we can release a feature to a small subset of our customers and make sure it’s behaving the way we expect (by following the checkup plan).
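
The mechanics behind a flag don’t need to be fancy. Here’s a minimal sketch of the idea in Ruby (the FeatureFlag model and method names are illustrative assumptions, not Harvest’s actual implementation):

class FeatureFlag < ActiveRecord::Base
  # Assumed columns: name (string), account_id (integer)

  # Is the named feature switched on for this account?
  def self.enabled?(name, account)
    where(name: name.to_s, account_id: account.id).exists?
  end

  # Flip the feature on for a single account (say, an early access member).
  def self.enable!(name, account)
    where(name: name.to_s, account_id: account.id).first_or_create
  end
end

# Anywhere in the app, the new code path stays hidden until the flag flips:
if FeatureFlag.enabled?(:new_projects, current_account)
  render "projects/new_overview"
else
  render "projects/legacy_overview"
end

Because the flag is just a row per account, rolling the feature out (or back) becomes a data change rather than a deploy.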

Real Talk

At Harvest, we like releases to be boring, simple and straightforward, prepared and predictable. Small changes don’t keep you late at work or make you work on the weekends or on holidays. We’ve found that keeping our releases small means constantly asking ourselves and each other “Is this the minimum set of changes that we need to go live?” It means preparing and planning for failure and success. It means using tools like feature flags to break down big, risky releases.

It means we can eat dinner with our families.

Why I enjoy writing software at Harvest

I occasionally help out the team by interviewing new candidates. The most common question I have been asked — “What do you enjoy most about working at Harvest?” — is a great question for an interviewee to ask. There are several stock answers that I think most people reach for. You can talk about how much you like the team, the culture, the flexibility, or the customers you help. I appreciate all of these at Harvest, but there are some specific anecdotes that I think really help paint a picture of what it is like to work at Harvest.

The Codebase

I enjoy working with our codebase. That sounds shallow, since the codebase is just a means to an end: providing a great experience for our customers. However, working daily with Harvest’s codebase is the full-time job for me and the team I work with. If it were painful, that’d be a negative daily experience to weigh against the other benefits. I also firmly believe that our team’s ability to help solve our customers’ needs is directly influenced by the ease with which we can build features and fix problems.

You should know that Harvest launched in 2005, before Rails had even hit version 1.0, and we have our share of legacy code that has persisted over the years. I joined Harvest over two years ago, and my experience up to that point pretty much amounted to enterprises, startups, and consulting. For the longest time my favorite projects were greenfield, where the teams could work on the problem at hand with very little regard for legacy code. So why do I enjoy working on a large, old app like Harvest? Because it is amazing and inspiring to see it transform and improve over the years.

It is one thing to launch a project for a client, fix a few bugs, and then move on or throw a ball of hacks together in order to get profitable as soon as possible. It is another world altogether to deploy a feature set, and then continue to improve and polish it as time goes on. We do have our spurts where we will rewrite old code from scratch, and we might even grumble at the previous developer’s short-sighted design choices while we do it (which is funnier when you realize you were the previous developer). But our team recognizes that old code helped get Harvest to where it is today.

There is a respect, a beauty, and an admiration that comes from working long-term on a legacy codebase alongside this team.

The Team

The team here at Harvest is critical for our codebase to improve. Working on a legacy codebase with an understaffed crew is definitely unfun. Have you ever accidentally deployed a bug and had to answer a flood of support tickets while you’re trying to figure out a fix and deploy it? Have you ever critically botched the infrastructure and suddenly had to scramble to fix the servers?

I am grateful for the teams we have here at Harvest. I’m a big fan of our Harvest Experts, who genuinely care about solving our customers’ problems and diagnosing bug symptoms to pass on to developers. They honestly mean it when they tell our customers to write or call in when they need help.

I am relieved we have a brilliant and hardworking Operations team that is focused on our infrastructure 24x7, dealing with daily problems ranging from security reports and patches to capacity planning.

I am thankful for our Product Design team, who think a lot about the customers’ interface experience. They also work closely with the Development team and adapt if technical limitations are uncovered.

We have several other roles too, like our Mobile Developers, Quality Assurance, Marketing, External Integrations, and Office Managers, each of which fits into improving our company and product in its own way. Because of these roles, I can focus on what I do best.

Roles and Communication

Now you might get the impression that each of these roles is rigidly defined and adds to office bureaucracy: Developers need to submit “Database Change Requests” to Ops and everyone is busy filling out TPS reports. Not at all.

Each one of these teams fluidly communicates with other teams in order to solve the problems at hand. Developers are asked to take over certain support tickets. Ops members are invited to a technical design discussion. Experts are asked about common customer problems while we redesign an existing feature. Developers let Experts know beforehand when we’re going to launch features so they can field support issues with current information.

There isn’t any central authority that dictates these processes. Each team puts together their process and if things go poorly, they talk about it and improve it. Sometimes that requires talking and changing a process between teams. We do have a few people who float between teams that can view issues in the aggregate and suggest improvements.

Empowerment

A legacy codebase would never improve if the team working on it didn’t have the authority to make big changes. The freedom to improve areas is a wonderful advantage. But with that advantage comes responsibility: we still have to prioritize features and get a general consensus among the immediate team.

I’ve worked in some places where the unspoken rule was “It ain’t broken enough, so don’t fix it” and that is not the case at Harvest. Most of us have our own personal short list of areas we want to make better and have the flexibility to squeeze wins here and there within a normal work week.

The team submits and vets bigger feature changes and rewrites, not a Product Manager or a Project Manager, and those projects are prioritized alongside everything else.

A Direct Influence

The direct influence everyone on our team has on our customers means a lot to me. I like that my work positively affects contractors, designers, business owners, and many others. Our customers’ needs are very real and have serious impacts on their livelihood.

Some of our customers are independent workers with plenty of other demands pulling at them. It’s very satisfying to know that my work can improve their day-to-day life. Our team has a responsibility to keep a stable infrastructure with as few bugs as possible, because our screw-ups can cause a big disruption for our customers.

A Glimpse

A recent meeting inspired me to write this post. My team is working with another team on an integration that has some particularly interesting (and fun!) performance implications.

We met a few times to try and hammer out how one piece of code should talk to another. These meetings have included Design and Ops to help weigh in with different perspectives. Our last meeting between the teams had five developers focusing on a specific technical issue.

  • We talked about a few straightforward solutions and we agreed on one along with some metrics.
  • As we learn more, we agreed we might need to bring design back into the discussion in order to figure out some user interface compromises because of performance.
  • We agreed we need to take a serious look at the root cause of the performance issue and prioritize that when we can.
  • We ended the meeting with “Let’s all expect to rewrite parts of this as we roll it out to 100%. We can consider ourselves lucky if it makes it to production as-is.” That got a few chuckles, but everyone nodded in agreement.

This is a great glimpse of what it is like to work here. We know we may have to change as we go. The team is not afraid of being wrong and has the courage to move forward with the information we have. Multiple areas can weigh in with input. The teams are not shutting down the conversation by pointing fingers or huffing “that’s not my problem.”

A Coworker’s Comment

I asked my coworkers to review a draft before I published this. Andrew said parts of this post resonated with him. I liked his comment so much that I asked his permission to end with it.

I love watching This Old House — it’s one of my weekend guilty pleasures. I keep coming back to the show because of the amount of work they put into a project house — it’s incredible. The current project has a closed-in back yard, so when the project called for replacing three stone walls, they had to disassemble the walls and carry them out one wheelbarrow at a time, through the house, to a dumpster on the street.

They care for the house. But it’s not just care — it’s a kind of love. My mother used to tell me, “no matter what you do, you’ll be our son, and we love you for that.” I like to think of the craftspeople on This Old House that way — “No matter what kind of mold and crumbling masonry we find, you’ll always be our project house, and we’ll make it right.”

How We Fix Bugs at Harvest

As Harvest's team has grown, we've had to evolve our bug-fixing process. Years ago we developed a concept called Delta Force as a way to protect the bulk of the development team from a constant need to respond to bugs. On a rotating basis, one person would handle escalated support and fix bugs as time allowed. This used to keep us collectively sane.

In 2014 we discovered this no longer worked well for us. When a developer would take their turn on Delta Force, they would feel the weight of numerous unfixed bugs all day and well into the night. They would wake up with a hopeless feeling. When you fix three bugs in a week while four new bug reports come in, you're bound to feel disappointed in yourself. The additional strain of a growing customer base, and by extension more need for escalated customer support, made for an untenable situation.

The development team got together and brainstormed ways to make the process of fixing bugs more bearable. Not only did we want to improve the life of whoever was in that Delta Force role, but we also wanted to lower our bug count from week to week.

We found that most of the stress of the Delta Force role was coming from bugfixing, not from escalated support. We decided to separate those roles – to put bugs in the hands of the entire development team. Rather than defining goals like we had in the past (Bug count less than ten! Fix five bugs per week!), we simply asked people to claim a bug at the beginning of the week and do their best to fix it. Some bugs are big and take longer. Some bugs are quick and feel a little cheap. It's all good!

Through our brainstorming process we also learned that there was wide support across the team for a bugmash week — a week where the whole team would pause their full-time projects and focus on bugfixing. We kept that in our back pocket during the summer. As Harvest continued to grow and bugs continued to accumulate, we decided to give a bugmash week a try.

For a week in late September we each claimed a small pile of bugs and set to work. We touched everything from customer-reported bugs to monster queries to support tools to staging environments to stale, abandoned code. We spent 326 hours closing over fifty-five bugs in the Harvest suite of applications. We couldn't be happier with the results.

Keep this lesson in mind as your organization grows. Processes that you trusted and felt confident about for years can become obsolete. The trick is to change how you do things without changing the good intentions behind the original structure. And don't forget that you'll probably have to change again in the future.

Of course, it never hurts to have an amazing team that can work together with skill, humility, and supportiveness. :)

Code Reviews at Harvest

Let’s face it — code reviews can be tough. Even if your team fully adheres to a certain style guide, programming is so subjective that smart people can argue great points on conflicting approaches.

We use GitHub Pull Requests heavily at Harvest and we require code reviews for everything meaningful that goes into production. Here are a few guidelines that we’ve developed for reviewing each other’s changes so we can stay productive and focused on what’s important — launching code.

Our current process

Every company has a different deployment process, but here at Harvest it looks a little something like this:

  1. We discuss what needs to be built, changed, or fixed.
  2. A developer/designer creates a branch off of master.
  3. They work on their branch. When it’s complete they push it up to GitHub and create a Pull Request tagging an appropriate person or team to review it.
  4. One or more tagged people from that team will review the code and give a +1 when they feel it’s ready to be deployed.
  5. The original submitter will then merge it into master and deploy it.

Multiple posts could be written about each of these steps. However, I’m only going to talk about #3 and #4.

We use code reviews at Harvest to communicate to the team what’s going into production, to help each other learn new tricks and techniques as things evolve, and to point out specific areas we may unintentionally break so they get a closer look.

If it fixes a serious bug, just let it go

We might disagree on approach, but if there’s a serious issue in Harvest that’s affecting a customer, and someone on the team has a fix for it, we will always let it ship. We can always circle back and fix it better or more thoroughly later.

Know what’s blocking your code

Code reviews can feel unproductive because they don’t have a clear goal. We’ve put together a survey we call FIAS (the Filler Impact Assessment Survey), a tongue-in-cheek acronym named after Patrick Filler, the Harvester who proposed the idea.

The idea is simple: take an educated guess at each of three questions, scoring them on a scale of one to ten:

  • How much of the app is affected?
  • How much of this change is mysterious to you?
  • How easy is it to imagine this change behaving in an unexpected way after deployment?

Then add up the scores for each question.

  • If the score is less than fifteen, you only need one person to give you a +1.
  • If the score is fifteen to twenty, you need two people to give you a +1.
  • If the score is over twenty, you need two people and full QA.

QA is different for each team. Harvest has staging environments that mirror production and we have scripts for each section that we manually test.

This isn’t perfect and the results may not make sense for other teams. For example, another team may want three people to +1 or always QA above ten points. However, the point is that, when a Pull Request begins review, we have a clear idea what it will take to launch it to production.
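
To make the arithmetic concrete, here’s a small sketch of those thresholds in Ruby. It’s purely illustrative — the survey is a quick gut check, not a tool we actually run:

# Illustrative only: the FIAS thresholds expressed as a tiny helper.
# Each answer is an educated guess on a scale of one to ten.
def fias_requirements(affected, mysterious, surprising)
  score = affected + mysterious + surprising

  if score < 15
    { plus_ones: 1, qa: false }
  elsif score <= 20
    { plus_ones: 2, qa: false }
  else
    { plus_ones: 2, qa: true }
  end
end

fias_requirements(3, 2, 4)  # => { plus_ones: 1, qa: false }
fias_requirements(8, 7, 9)  # => { plus_ones: 2, qa: true }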

We also don’t always follow the FIAS. For example, even if we only need one +1, the submitter may still override the FIAS and ask for two to three +1s because they know their changes involve a particularly tough part of the codebase.

Now that we have a set goal, we can work backwards. If some of the changes make a reviewer uncomfortable, can that part be segmented out and the rest launched instead? Can this branch be merged even if there’s a heated discussion over using the Single Responsibility Principle vs a single clear object?

With a simple process, we’ve removed the ambiguity that most code reviews start with.

Communicate clearly to reviewers what’s going on

FIAS is a great tool to get a general sense of how large (and risky) a change is to you — but it isn’t always the best tool for communicating to your reviewers what the change actually entails. Sometimes, GIFs or annotated before and after images (made with tools like Monosnap) can be more effective. Pre-commenting certain areas in your own Pull Request is also helpful: “I know this line seems unrelated, but it is because…”

Clearly communicating the change starts the code review on the right foot. The submitter may have invested days working on a change, but the people reviewing it have not.

Clarify blocking comments

Not every comment or question has to be resolved. Text doesn’t always convey emotion, and it’s easy to misread someone’s intent. It’s perfectly fine to ask a reviewer, “Is this a blocking issue?” Often it isn’t, or it can be handled separately.

Pull Requests can also end up being a lightning rod for debate. Discussions among the team can continue on a Pull Request, but they can also be moved elsewhere — to a separate issue, internal tool, or even a meeting.

Our reviewers will normally prefix comments with “[NB]” for non-blocking comments: “[NB] This works, but here’s a quick snippet that’s a little more clear”. A simple prefix like that can help speed along code reviews.

Face-to-face meetings

We often raise a white flag and ask for an impromptu face-to-face meeting or a quick chat, usually by spinning up a Google Hangout or a conversation in HipChat. This seems like an obvious tip; however, recognizing the need for these meetings can be tricky. If two people have posted back and forth at least once on the same topic, it will likely be easier to just hash it out face-to-face. If you find yourself writing two paragraphs, some of your concerns will likely get lost, and you may be able to convey your thoughts better over audio.

It’s nice to see a comment history of how decisions were made. However, you can still accomplish this by posting a quick recap of what was decided in the meeting.

Don’t limit the number of reviewers

Everyone on the team pitches in with code reviews and we don’t have official reviewers here at Harvest. We may purposefully ping someone who we know has had a lot of experience in a particular area, but we never wait on one person to go through all the reviews.

This becomes a bigger deal as the team grows. With an application as large as Harvest, it’s extremely difficult to keep everything in your head. And even if you do feel good about a certain area, it will likely change over time as other people help out with maintenance.

We also notify people to give them a heads-up without expecting them to review the pull request. We do this by prefixing “/cc” before their name: “/cc @zmoazeni this might affect your work on reports”.

Not a perfect process

We don’t have any delusions that our process is perfect. However, all of these points help speed along our code reviews. Taking some of these tips and morphing them to fit your organization may help out your team too — or, if you think your process could help out our team, write up your process and send us a link! We’re always looking to improve.

Understanding D3 Selection Operations

As you might have heard, we’ve been working on a few new things around here, including new ways of visualizing data. One of the new libraries we’ve pulled in to help out is D3.

If you’re not already familiar with D3, it’s a library designed to help transform DOM elements in response to data sets. Although D3 is frequently used with SVG elements (as we are using it at Harvest), it doesn’t specifically require SVG.

Sometimes when working with D3, especially in the context of animating graph elements, we’ve discovered that the DOM elements don’t quite do what’s expected. I had an epiphany the other day while trying to understand how D3 selects DOM elements and compares them with the data set — hopefully this post can help you out if you’re stuck in the same rut that I was!

The Epiphany

There are two critically important things I missed when I thought I knew how D3 selections work:

  1. D3 stores the data object which is responsible for creating a DOM node on the DOM node itself.
  2. You can control the method by which D3 compares your data objects with those stored on DOM nodes to determine if they are the same.

Each time your graph is drawn, you can think of D3 as grouping the required actions into three segments on a Venn diagram:

(Venn diagram: the DOM Elements and Data Set circles intersect. The intersection is labeled Update, DOM Elements alone is labeled Exit, and Data Set alone is labeled Enter.)

  • Remember, D3 stores the data object that created a DOM element on the element itself. If that data object is no longer present in the data set, it’s considered an “exit”.
  • If a data object exists both in the data set and as a property of a DOM node in the selection, it’s considered an “update”.
  • If the data object has no DOM element in the selection, then it’s considered an “enter”.

Any operation called within the context of the selection.enter() or selection.exit() functions will be executed only during those phases:

var dataset = [{ value: 35 }, { value: 13 }];
var lines = d3.select("body").append("svg").selectAll("line").data(dataset);
var yValue = function (d) { return d.value; };

lines
  .enter()
    .append("line")
    .style("stroke", "black")
    .attr("x1", 0)
    .attr("x2", 100)
    .attr("y1", yValue)
    .attr("y2", yValue);

lines
  .exit()
    .remove();

Run example on jsfiddle.

Anything not within selection.enter() or selection.exit() is considered part of the “update” operation. The “update” operation is also called immediately after “enter”, and updates can easily be animated:

var dataset = [{ value: 35 }, { value: 13 }];
var lines = d3.select("body").append("svg").selectAll("line").data(dataset);
var yValue = function (d) { return d.value; };

lines
  .enter()
    .append("line")
    .style("stroke", "black")
    .attr("x1", 0)
    .attr("x2", 100)
    .attr("y1", 0)
    .attr("y2", 0); // start from zero (entrance animation)

lines
  .transition()
    .attr("y1", yValue)
    .attr("y2", yValue); // called immediately after enter(), and when value changes

lines
  .exit()
    .remove();

Run example on jsfiddle.

Controlling How Objects Are Compared

In order to successfully animate chart elements when data changes, it’s important to maintain the association between the right DOM nodes and the items in your data set. By default, D3 uses the index of the data in the data array to compare the data set with DOM elements. This isn’t always ideal — but thankfully, you can pass a second (optional) parameter when assigning data to a selection (when calling the .data() function):

var dataset = [{ id: 2553, value: 35 }, { id: 2554, value: 13 }];

var lines = d3.select("body").append("svg").selectAll("line")
  .data(dataset, function (d) { return d.id; });

One Last Thing…

The first time the chart is rendered, you shouldn’t feel strange about creating a selection which has no elements:

var lines = svg.selectAll("line"); // but no <line> elements exist yet!

Just think about it in terms of our diagram, and remember that if the DOM element doesn’t exist yet, you’ll get a chance to create it during selection.enter().

Happy charting!

If you’re looking for further reading on selections, consider reading How Selections Work, Object Constancy, and Thinking with Joins (from which the Venn diagram in this article was adapted).

How we migrated our assets to S3 without any downtime

Harvest used to have a local solution for storing user-generated files. We're talking avatars, estimate attachments, or expense receipts. This solution was based on GlusterFS and required that we maintain our own set of file servers. There was also a fair bit of work involved in keeping everything stable.

For example, keeping GlusterFS assets in sync between multiple locations is problematic: we either have to keep on top of a batch-oriented sync process, or use GlusterFS's native geo-replication.

This in itself is not a huge problem, and the amount of data we have isn't that big, but it's data that keeps growing every week.

S3 means having a single, decoupled and reliable store with endless capacity. We really like that.

Now, migrating this data without downtime is not that simple. Besides, we like to release things slowly, starting small to catch any problems before they hit all of our users.

We mostly used Paperclip to handle our assets, so we decided to build an extension that would allow us to migrate seamlessly. This extension is called paperclip-multiple, and it provides a new storage called :multiple that stores files both on the filesystem and in S3 (using the fantastic fog gem).

Thanks to that we could migrate our assets one by one without any downtime. The process worked like this:

  1. Change one type of asset (say, user avatars) to use this multiple storage and deploy it. From this moment on, new avatars are stored on both GlusterFS and Amazon S3.
  2. Synchronize the local filesystem with S3. For this we used the s3cmd sync command, which worked fantastically and was really fast on repeated syncs.
  3. Now all files should be on S3, meaning we can change the storage to just :fog and be done with it!

paperclip-multiple allowed even more flexibility. With feature flags we could specify for which users we wanted to store files on S3 or just keep working with the filesystem as before. We could also feature flag which user avatars we wanted to display from S3. This let us try file uploading for a while on our internal accounts, and later on try displaying them from S3.
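
For the curious, wiring an attachment to the dual storage looked roughly like the sketch below. The :multiple storage name comes from paperclip-multiple, but the remaining option names are illustrative guesses rather than the gem's documented interface; check its README for the real configuration.

class User < ActiveRecord::Base
  has_attached_file :avatar,
    storage: :multiple,    # write to the filesystem and to S3 at the same time
    path: ":rails_root/public/system/avatars/:id/:style/:filename",
    fog_credentials: {
      provider: "AWS",
      aws_access_key_id: ENV["AWS_ACCESS_KEY_ID"],
      aws_secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
    },
    fog_directory: "example-assets-bucket"  # hypothetical S3 bucket name
end

Once the s3cmd sync in step 2 has caught up and everything checks out, switching storage: :multiple to storage: :fog completes the migration for that asset type.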

Now, this process wasn't as simple as that. We discovered some bugs in the process, some of the code didn't use Paperclip properly and needed a lot of modifications, and we also built a proxy to serve all this data even faster and cheaper (but that will be a story for another time).

Even more interesting were the exports, which didn't even use Paperclip and were, quite literally, all over the place. We've rebuilt all of them: they are stored on S3 and automatically expire after 15 days, and the new code is a lot easier to work with and, most importantly, to extend. This also means that the biggest weekly growth we used to have is not really growth anymore, because the files are deleted after a certain time.

You can find the code on GitHub, with more in-depth explanations about how to use it and how it works.

Finding the Courage to Constantly Critique

Constantly be thinking, critiquing your work in real time. ~ The Pragmatic Programmer

A big part of my job is to turn a critical eye on the work of others: code review, job applications, and so on. My harshest criticism falls on the code I wrote years ago. I've been at Harvest longer than any other developer on the team. Few at Harvest get to live with their terrible five-year-old code like I get to live with it. Daily it is in my face. In the course of our days we all run into our prior code failures. The idea of turning a critical eye on myself even more frequently than I'm already forced to is daunting, but embracing this fact is probably for the best.

Self-reflection helps you become a better programmer. Most programmers go through the process once per year. Hello, annual review! I believe a habit of frequent reflection is a habit with compounding returns: understanding your strengths and weaknesses will allow you to learn, adapt, and grow at a rapid pace.

A weekly reflection on your work feels like a nice compromise between annual reflections and a daily journal. I did this for a couple of years during my Getting Things Done® phase. I think weekly reflections can be pretty effective, but they have two failings. First, I don't think a weekly reflection saves any time over a daily journal. A proper weekly reflection should take one or two hours. Second, I don't think a weekly reflection is habit forming. If you miss a single weekly reflection you've pretty much broken the habit. Trust me, the weight of doing a three-hour reflection about the past two weeks is too much to bear.

I've never made a fair attempt at a rigorous daily developer journal. If I try a true developer journal, I will go into it with a checklist of questions:

  • What did you do today? (Be detailed.)
  • What decisions did you make?
  • How could these decisions backfire?
  • Did you have any conflicts today?
  • Did you enjoy the work you did?
  • What can you do better tomorrow?
  • What are you grateful for?

I often struggle with my memory. For whatever reason I find others more capable of retrieving experiences from months ago. Perhaps my brain is just built differently than most. Perhaps a daily journal is just what I need to better internalize the lessons learned from my daily successes and failures. Perhaps a daily journal will help me better collect and communicate my thoughts.

Perhaps I just need the courage to find out.

Segmentation Faults

For some time Harvest had been struggling with unsolved segmentation faults. I'd like to tell everyone the story of this journey. But first...

What is a segmentation fault?

Segmentation faults, or segfaults as we like to call them, are the errors that happen when a process tries to access memory it cannot physically address. Wait, what? Addressing memory? I thought I was working with Ruby and didn't have to take care of that!?

It turns out the Ruby interpreter itself has bugs (surprise!). Gems that have C extensions must worry about segfaults as well.

What was happening?

Segfaults are different from other kinds of errors because they're so terrible that they kill the process. I mean, if things get so messy on the inside that you're accessing memory you shouldn't, the best you can do is stop doing what you're doing so you don't make it worse.

When the process dies, the error is usually not sent to whatever exception tracking software you use. In our case, it's Passenger that notices the segfaults and dumps the error that Ruby returns into the Passenger logs.

We were aware of the segfaults, but since they never reached our exception tracker they were very easy to ignore. That is, until we started using New Relic and lots of exceptions started happening.

Exceptions?

But you said segfaults don't cause exceptions, they just kill the process? You're right, my friend, but let me get to the point.

What we saw on our exception tracker was a lot of exceptions happening within a three-minute period, all crashing on the same line of code in the newrelic_rpm gem. The first suspicion, of course, was a bug in that gem. We opened a support ticket with New Relic immediately and they started investigating. On our side, we discovered that the errors happened on the same server and, with further debugging, on the same process.

The way New Relic works is that its agent spawns a background thread alongside the Rails request processing. That thread sends instrumentation data to New Relic out of band, so the reporting shouldn't noticeably affect the performance of the primary Rails requests.

What we saw in the New Relic logs is that an agent would spawn, only to start raising exceptions a few minutes later and keep raising them until, yet another few minutes later, it would stop. We have configured Passenger to have every Rails process respond to only a limited number of requests, which explained the limited lifetime pattern we saw.

The exceptions looked different on almost all these incidents, but they would repeat until the agent stopped. Here are some examples:

ERROR : Caught exception in trace_execution_scoped header.
ERROR : NoMethodError: undefined method `<' for nil:NilClass
ERROR : Caught exception in trace_execution_scoped header.
ERROR : NoMethodError: undefined method `count_segments' for false:FalseClass
ERROR : Caught exception in trace_method_execution footer.
ERROR : NoMethodError: undefined method `tag' for #<Array:0x007f70a3c7bbe8>

On their side, Ben from New Relic pointed to a very funny exception that did get tracked: NotImplementedError: method 'method_missing' called on unexpected T_NONE object (0x007ffa15822ca8 flags=0x400 klass=0x0). According to his analysis, this is a theoretically impossible combination of flags in Ruby Land.

Stuff that shouldn't be false

That exception got me thinking: wait, something is getting corrupted without causing a segfault?

I'd been checking our exceptions daily for a few months, and let me tell you, the amount of activity in there was suspicious. Here are some example exceptions:

  • TypeError: false is not a symbol
  • NoMethodError: undefined method 'keys' for nil:NilClass (This error happened on controller#action called false#false)
  • TypeError: can't convert false into String
  • NoMethodError: undefined method 'zero?' for false:FalseClass

I think every Ruby developer is used to NoMethodErrors on nil. I mean, it's the Ruby version of an Uncaught Exception in Java. You see it almost every day in your work life. But NoMethodErrors on false? It's really not that easy to get a false by mistake.

Corruption

In the end, the data pointed to things getting corrupted. Random variables seemed to take on values that weren't right. All of it sprinkled with a healthy amount of segfaults.

My gut feeling was that it all had to be the same problem, but what could be done about it?

The (only) good thing about segfaults is that they leave a trace. So instead of focusing on seemingly random errors, we could at least try to get some useful information from the thousands of segfault logs.

How do you debug a segfault?

Short answer: it's very difficult, verging on impossible. As with any bug, the first thing you should try to do is reproduce the error. With these segfaults we never managed to, since they were so random. Besides, of all the information provided by a segfault, Ruby devs only understand the normal stack trace, and that's about it.

So unless you're a C expert, you're pretty much screwed. However, we can use the common wisdom of the masses. If you ever need to fix a segmentation fault, don't waste your time debugging. These are almost always the solutions:

  • It's probably a bug in one of the gems that use C extensions. Hope that somebody else hit the problem first and solved it. Therefore, update or remove all your gems with C extensions (the snippet after this list shows one way to find them).
  • If not, then it's probably a bug in the Ruby interpreter. Update Ruby to the latest patchlevel.
  • If you're already at the latest patchlevel, then try the next version.
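
If you need a starting list of suspects, RubyGems can tell you which installed gems ship native extensions. A quick sketch:

# List every installed gem that ships a C (native) extension -- the usual
# suspects when you're chasing segfaults.
require "rubygems"

native = Gem::Specification.select { |spec| spec.extensions.any? }

native.sort_by(&:name).each do |spec|
  puts "#{spec.name} #{spec.version}"
end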

Updating gems

Thanks to the part of the segfault output that lists the loaded files and C extensions, we compiled a list of possibly offending gems. Then we updated or removed all of these gems, and the segfaults stopped. Our update process pointed to a very outdated version of oj we were using. Oj is a fantastic gem that fixes bugs diligently, but we had not kept it up to date.

And our exception tracker has been free of seemingly random exceptions ever since!

Lessons learned

We learned many things, among them:

  • Don't try to understand a segfault. It's almost never worth it.
  • It's really important to keep gems up to date.
  • Things that don't blow up loudly are very easy to ignore.

I'd like to thank New Relic for their support. Their contribution helped a lot, even though in the end it turned out to be a problem in our own system.

In retrospect the solution seems so obvious that it hurts. We're very sorry to anyone who got a totally random error using Harvest. We take errors very seriously and we're very glad that we can again focus on the errors in our application code.

Deploying Multiple Heroku Apps From a Single Repo

We’ve been toying around with socket.io recently at Harvest to make our apps more responsive — and one of the big challenges that presented itself was how to manage and deploy a polyglot application on Heroku (as most of our application code is in Ruby, but socket.io is a node.js project).

Since the “web” process is the only process to receive HTTP traffic and only one web process per application is allowed, we needed to deploy the Ruby and node.js parts of the application separately.

Traditionally, there is a one-to-one mapping of repositories to apps on Heroku — but we wanted to keep both apps in the same repository (as they’re really the same thing, conceptually) and yet deploy them as two apps on Heroku. So, what’s standing in the way?

Heroku Buildpacks

Heroku uses this thing called buildpacks — the utilities and frameworks that are required to fetch your app and install dependencies. There are different buildpacks for ruby and node.js on Heroku.

Buildpacks are usually automatically chosen — but since we’re pushing the same code to both apps, the order in which buildpacks are selected matters. Turns out that ruby is preferred over node.js.

No worries, you can select a buildpack manually with an environment variable on Heroku’s servers — which is what we did for the node.js app, and it worked like a charm.

Procfiles and Foreman

The only other thing standing in our way was the way that Heroku starts web workers — using Foreman and a Procfile to define processes.

With Ruby applications, the Procfile is optional — Heroku will try to start a Rails application if one isn’t specified.

When deploying node.js to Heroku, you must specify a Procfile — there is no default command to start a server. Since we want to be able to push the same code to both Heroku remotes without any modifications, we got a little clever with the web process in the Procfile — it executes a shell script which examines the environment and runs the proper web process.

#!/bin/bash
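# Decide which web process to start based on the RAILS_DEPLOYMENT config var (set per Heroku app).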

if [ "$RAILS_DEPLOYMENT" == "true" ]; then
  bundle exec rails server -p $PORT
else
  node node/index.js
fi

Procfile:

web: bin/web

Hat tip to David Dollar (the creator of Foreman) for the idea.

Happy polyglotin’!

Harvest Performance in July: A Troubleshooting Tale

I hope you slept well during July. Unfortunately, the Ops team here at Harvest did not. Harvest had a series of very short outages during July, all occurring at approximately the same time (12:30 a.m. ET). We still achieved 99.95% uptime for the month, but our availability report for July shows all of these events (too many for our liking). I want to apologize to customers who were affected by these events.

The summary is that an overloaded and incorrectly designed logging system was impacting Harvest application requests when a nightly archive job ran. For others who have implemented rsyslog and the RELP protocol, I urge you to be careful about the potential for blocking applications which use the syslog service. Read on for the details.

The problem begins

At 12:31 a.m. on July 2nd, we received our first notification that some requests were timing out. By the time I looked at the situation on the servers, we were many thousands of log entries past the event and things were back to normal. Odd. I assumed at the time that a brief network issue had degraded performance between two components of our stack, and that the issue was between the app servers and the database servers. This was pure conjecture, since no logs showed any evidence supporting this theory, or any other theory I was able to formulate.

A similar incident occurred on July 9th, then again on July 16th, both around the same time. Each time the event occurred, we would see a large number of Rails requests become very slow, and our available pool of Passenger processes for inbound requests would be depleted, causing some customer requests to receive errors.

On the night of July 16th, 2013, I bade farewell to sleep and proposed to caffeine as I turned my Harvest cap backwards.

Searching for evidence

For a few nights, I was logging in to our systems to watch for this event in action, looking for its genesis. This watched pot did not boil, and the issue never occurred during the 3 nights I was watching for it.

Developers and ops folks reading this post are probably rolling their eyes. Logs and performance metrics exist to find these patterns. It should be simple to find the issue, right?

Right. But also wrong.

At this point, we knew:

  • The event lasts for a few minutes (3–7 minutes).
  • The timeframe of the event was predictable to within 10 minutes.
  • Most dynamic requests would become slow, then suddenly recover.
  • The load balancer logs showed us how slow the requests were (some greater than 15 seconds, which is an aeon).
  • The Rails logs, on the other hand, indicated that requests were fast once processed.
  • So many slow requests exhausted our application server resources, and we couldn't queue further requests for processing. Customers saw errors at this point.

On July 23rd, I was able to watch the issue live for the first time. As request times increased, I was profiling the performance of network traffic between components and watching every metric I could for the smoking gun. I had a bunch of systems utilities logging their data so I could review the open network connections for Harvest's backend services. I was extracting performance numbers from streaming application logs and was hoping for something to be glaring at me in the details. Nothing. The issue only lasted for 3 minutes while I was watching it. We then recovered in the space of a second, and that was it.

Without a smoking gun, we had to invent theories. The first theory was that the issue was caused by an external service that becomes slow for a short interval.

The first external service we (incorrectly) looked at was Skylight, which Harvest uses to log performance data. We spoke with Carl, explained our issue, and asked for his thoughts. A patient and wise man, Carl analyzed all the Harvest data on their end and concluded that although Harvest was indeed slow during these intervals, the way that Skylight is implemented would not cause this issue. We decided to remove Skylight temporarily to be sure, and on July 25th, without Skylight being deployed, we saw the issue recur. Skylight was not the culprit. Thanks to Carl and the Skylight crew for entertaining our theories.

At this point, I asked our excellent data center providers, Server Central, to perform a network traffic analysis to determine if there was a pattern of network traffic anywhere at all during the outage windows, or if some network path was being saturated and degrading performance. The team at Server Central really stepped up and did an exhaustive check of all of our network traffic and determined that nothing untoward was happening during the time we were seeing the issue. (If you are looking for a place to host your servers, you should talk to these guys.)

So now we have Rails logs that say things are fast, load balancer logs that say things are slow, an external performance analysis service which agrees with the load balancers, and a predictable window for the issue.

We split our global pool of application servers into a few different function-based pools. The theory was that if the problem was exhibited in only one of the pools, we would have narrowed down the possible scope of the issue somewhat. However, the last problem window had occurred across all of the pools, so it wasn't isolated to one area of Harvest.

Narrow down the scope of the investigation

We had to decide at this point if we believed the issue was caused by inbound requests which were doing something unexpected, or if the issue was caused by something on our end which was performing poorly and then recovering. We started our investigation on the former.

Harvest handles a lot of traffic. We also have a lot of hardware focused on serving this traffic. This means two things: first, there is a lot of data to sift through if you need to analyze our traffic in a new way; second, if this issue was being caused by external requests, there would have to be a very large number of them to have this kind of effect.

We spent a couple of hours looking at our traffic logs, filtering them in various ways, trying to find a large number of requests isolated in the problem timeframe and looking like they could have wide ranging performance effects. We didn't find anything.

In order to concentrate our energy, we decided to abandon the idea that the problem was caused by inbound traffic and focus our efforts on searching for an anomaly in our systems.

Working through possible theories

I set out to collect and systematically debunk any theories we could come up with, no matter how absurd they sounded. I drew up a list of 10 theories to explore and debunk.

Was memcache becoming slow for some reason? No. Was MySQL becoming slow for some reason? No. Was nginx encountering a bug somewhere and logging the evidence? No. And so on, through five or six different theories. Next on the list: was the problem caused by a Passenger bug?

If you use Passenger, you are no doubt familiar with using a SIGABRT signal to force Passenger to log a backtrace to the Rails log. Testing this procedure, I was unable to locate the backtrace data in our Rails logs. I asked Hongli Lai over at Phusion to help me find this data. We even arranged for Hongli to be available as a consultant to help us troubleshoot the problem if required. Hongli was very gracious and I would like to thank him for his willingness to help.

Trying to force the Passenger backtrace into the Rails log ultimately led me to the cause of the problem, though it wasn't a Passenger bug. We use rsyslog to send our Rails logs from each application server to central log servers where various tools parse and analyze the application logs. The Passenger backtraces were not appearing in the Rails logs, and I was unsuccessfully searching every possible log location for them.

At the same time, we were doing yet another fine grained audit of every single scheduled task which gets run across the entire Harvest architecture, looking for a scheduled task near the outage window which could possibly cause the problem.

Finally, the smoking gun

I noticed a harmless looking cron job responsible for archiving old log files on one of our rsyslog servers. It ran every day a few minutes before the window. Looking at the system metrics for this log server, I noticed increased levels of IOwait while this archiving ran through the outage window.

(Graph: CPU I/O wait on the log server spikes while the nightly archive job runs.)

When we implemented rsyslog we had chosen to use the RELP protocol, because we wanted maximum log reliability. We had assumed at the time that the impact of any rsyslog performance or availability issues would be contained by the rsyslog subsystem, not spread to the applications which were using the syslog facility. This turns out to have been naïve.

We found some reports of other folks having similar experiences with rsyslog/RELP. Performance issues could easily lead to rsyslog local queue exhaustion, which could lead to applications blocking while waiting on the syslog facility.

At last, a credible theory.

I readied a move away from TCP/RELP towards UDP for rsyslog traffic in Chef and deployed it. There was a syntax bug in our config which immediately proved our theory to be correct. While rsyslog messages were being queued locally because of the syntax bug, Passenger processes began to pile up as Rails requests blocked on calls to syslog. This was the exact behavior we would see during the mysterious outage windows.

The problem that we were struggling to diagnose because our application logging didn't confirm the existence of the problem turned out to be the application logging itself.

Fixing the design

We ended up moving back to using RELP in rsyslog, but instead of Rails using the syslog facility directly, Rails now writes logs to a local file on the filesystem and rsyslog uses this text file as an input. Rails logs are now shipped to the central logging servers in a way which is decoupled from Rails entirely. This way, any future rsyslog performance issues should be contained within rsyslog and not impact our application.
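
On the Rails side, that decoupling is just the stock file logger. A minimal sketch of the idea (the paths here are illustrative, not our production configuration):

# config/environments/production.rb
# Rails only ever writes to a plain local file. Shipping that file to the
# central log servers is rsyslog's job (via its file input), so a slow or
# blocked rsyslog can no longer stall Rails requests.
Rails.application.configure do
  config.logger = Logger.new(Rails.root.join("log", "production.log"))
end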

The mystery finally solved, caffeine and I took some time apart. (We have since gotten back together.)

I'm not proud of how long it took us to find the problem, and I am really not proud of the number of events on our uptime report during the time it took us to discover the cause. I am very proud of the team that helped troubleshoot this issue, though. Harvest has an amazing team who are good at working under pressure.

Like a great joke, or a visionary piece of technology, it seems obvious once you see it. Someone had to see it for the first time, though.

Thanks for being patient while we figured this one out.