Tech Time

Science, musings, humor, writing, and thoughts that are all kinds of random. By the makers of Harvest. We’re hiring

How We Fix Bugs at Harvest

As Harvest's team has grown, we've had to evolve our bug-fixing process. Years ago we developed a concept called Delta Force as a way to protect the bulk of the development team from a constant need to respond bugs. On a rotating basis, one person would handle escalated support and fix bugs as time allowed. This used to keep us collectively sane.

In 2014 we discovered this no longer worked well for us. When a developer would take their turn on Delta Force, they would feel the weight of numerous unfixed bugs all day and well into the night. They would wake up with a hopeless feeling. When you fix three bugs in a week while four new bug reports come in, you're bound to feel disappointed in yourself. The additional strain of a growing customer base, and by extension more need for escalated customer support, made for an untenable situation.

The development team got together and brainstormed ways to make the process of fixing bugs more bearable. Not only did we want to improve the life of whomever was in that Delta Force role, but we also wanted to lower our bug count from week to week.

We found that most of the stress of the Delta Force role was coming from bugfixing, not from escalated support. We decided to separate those roles – to put bugs in the hands of the entire development team. Rather than defining goals like we had in the past (Bug count less than ten! Fix five bugs per week!), we simply asked people to claim a bug at the beginning of the week and do their best to fix it. Some bugs are big and take longer. Some bugs are quick and feel a little cheap. It's all good!

Through our brainstorming process we also learned that there was wide support across the team for a bugmash week — a week where the whole team would pause their full-time projects and focus on bugfixing. We kept that in our back pocket during the summer. As Harvest continued to grow and bugs continued to accumulate, we decided to give a bugmash week a try.

For a week in late September we each claimed a small pile of bugs and set to work. We touched everything from customer-reported bugs to monster queries to support tools to staging environments to stale, abandoned code. We spent 326 hours closing over fifty-five bugs in the Harvest suite of applications. We couldn't be happier with the results.

Keep this lesson in mind as your organization grows. Processes that you trusted and felt confident about for years can become obsolete. The trick is to change how you do things without changing the good intentions behind the original structure. And don't forget that you‘ll probably have to change again in the future.

Of course, it never hurts to have an amazing team that can work together with skill, humility, and supportiveness. :)

Code Reviews at Harvest

Let’s face it — code reviews can be tough. Even if your team fully adheres to a certain style guide, programming is so subjective that smart people can argue great points on conflicting approaches.

We use GitHub Pull Requests heavily at Harvest and we require code reviews for everything meaningful that goes into production. Here’re a few guidelines that we’ve developed for reviewing each other’s changes so we can stay productive and focused on what’s important — launching code.

Our current process

Every company has a different deployment process, but here at Harvest it looks a little something like this:

  1. We discuss what needs to be built, changed, or fixed.
  2. A developer/designer creates a branch off of master.
  3. They work on their branch. When it’s complete they push it up to GitHub and create a Pull Request tagging an appropriate person or team to review it.
  4. One or more tagged people from that team will review the code and give a +1 when they feel it’s ready to be deployed.
  5. The original submitter will then merge it into master and deploy it.

Multiple posts could be written about each of these steps. However I’m only going to talk about #3 and #4.

We use code reviews at Harvest to help communicate with the team what’s going into production, to help each other learn new tricks and techniques as things evolve, and to point out specific areas to investigate that we may unintentionally break.

If it fixes a serious bug, just let it go

We might disagree on approach, but if there’s a serious issue in Harvest that’s affecting a customer, and someone on the team has a fix for it, we will always let it ship. We can always circle back and fix it better or more thoroughly later.

Know what’s blocking your code

Code reviews can feel unproductive because they don’t have a clear goal. We’ve put together a survey we call FIAS (the Filler Impact Assessment Survey), a tongue-in-cheek acronym named after Patrick Filler, the Harvester who proposed the idea.

The idea is simple: take an educated guess at ranking three questions on a scale of one to ten:

  • How much of the app is affected?
  • How much of this change is mysterious to you?
  • How easy is it for you to imagine this performs in an unexpected way after deployment?

Then add up the scores for each question.

  • If the score is less than fifteen, you only need one person to give you a +1.
  • If the score is fifteen to twenty, you need two people to give you a +1.
  • If the score is over twenty, you need two people and full QA.

QA is different for each team. Harvest has staging environments that mirror production and we have scripts for each section that we manually test.

This isn’t perfect and the results may not make sense for other teams. For example, another team may want three people to +1 or always QA above ten points. However, the point is that, when a Pull Request begins review, we have a clear idea what it will take to launch it to production.

We also don’t always follow the FIAS. For example, even if we only need one +1, the submitter may still override the FIAS and ask for two to three +1s because they know their changes involve a particularly tough part of the codebase.

Now that we have a set goal, we can work backwards. If some of the changes makes a reviewer uncomfortable, can that part be segmented out and the rest launched instead? Can this branch be merged even if there’s a heated discussion over using the Single Responsibility Principle vs a single clear object?

With a simple process, we’ve removed the ambiguity that most code reviews start with.

Communicate clearly to reviewers what’s going on

FIAS is a great tool to get a general sense of how large (and risky) a change is to you — but it isn’t always the best tool for communicating to your reviewers what the change actually entails. Sometimes, GIFs or annotated before and after images (made with tools like Monosnap) can be more effective. Pre-commenting certain areas in your own Pull Request is also helpful: “I know this line seems unrelated, but it is because…”

Clearly communicating the change starts the code review on the right foot. The submitter may have invested days working on a change, but the people reviewing it have not.

Clarify blocking comments

Not every comment or question has to be resolved. Text doesn’t always convey emotion, and it’s easy to misread someone’s intent. It’s perfectly fine to ask a reviewer, “Is this a blocking issue?” Often it isn’t, or it can be handled separately.

Pull Requests can also end up being a lightning rod for debate. Discussions among the team can continue on a Pull Request, but they can also be moved elsewhere — to a separate issue, internal tool, or even a meeting.

Our reviewers will normally prefix comments with “[NB]” for non-blocking comments: “[NB] This works, but here’s a quick snippet that’s a little more clear”. A simple prefix like that can help speed along code reviews.

Face-to-face meetings

We often raise a white flag and ask for an impromptu face-to-face meeting or a quick chat, usually by spinning up a Google Hangout or conversation in HipChat. This seems like an obvious tip. However recognizing the need for these meetings can be tricky. If two people have posted back and forth at least once on the same topic, it will likely be easier to just hash it out face-to-face. If you find yourself writing two paragraphs, some of your concerns will likely get lost. You may be able to convey your thoughts better over audio.

It’s nice to see a comment history of how decisions were made. However, you can still accomplish this by posting a quick recap of what was decided in the meeting.

Don’t limit the number of reviewers

Everyone on the team pitches in with code reviews and we don’t have official reviewers here at Harvest. We may purposefully ping someone who we know has had a lot of experience in a particular area, but we never wait on one person to go through all the reviews.

This becomes a bigger deal as the team grows. With an application as large as Harvest, it’s extremely difficult to keep everything in your head. And even if you do feel good about a certain area, it will likely change over time as other people help out with maintenance.

We also notify people to give them a heads-up without expecting them to review the pull request. We do this by prefixing “/cc” before their name “/cc @zmoazeni this might affect your work on reports”.

Not a perfect process

We don’t have any delusions that our process is perfect. However all of these points help speed along our code reviews. Taking some of these tips and morphing it to fit your organization may help out your team too — or, if you think your process could help out our team, write up your process and send us a link! We’re always looking to improve.

Understanding D3 Selection Operations

As you might have heard, we’ve been working on a few new things around here, including new ways of visualizing data. One of the new libraries we’ve pulled in to help out is D3.

If you’re not already familiar with D3, it’s library designed to help transform DOM elements in response to data sets. Although D3 is frequently used with SVG elements (as we are using it at Harvest) it’s not specifically required to use SVG.

Sometimes when working with D3 especially in the context of animating graph elements, we’ve discovered that the DOM elements don’t quite do what’s expected. I had an epiphany moment the other day when trying to understand how D3 selects DOM elements and compares them with the data set — hopefully this post can help you out if you’re stuck in the same rut that I was!

The Epiphany

There are two critically important things I missed when I thought I knew how D3 selections work:

  1. D3 stores the data object which is responsible for creating a DOM node on the DOM node itself.
  2. You can control the method by which D3 compares your data objects with those stored on DOM nodes to determine if they are the same.

Each time your graph is drawn, you can think of D3 as grouping the required actions into three segments on a Venn diagram:

Venn diagram showing two intersecting sets, the first being DOM Elements and the second being Data Set. The intersection is labeled Update, DOM Elements alone is labeled Exit, and Data Set alone is labeled Enter.
  • Remember, D3 stores the data object that created a DOM element on the element itself. If that data object is no longer present in the data set, it’s considered an “exit”.
  • If a data object exists both in the data set and as a property of a DOM node in the selection, it’s considered an “update”.
  • If the data object has no DOM element in the selection, then it’s considered an “enter”.

Any operation called within the context of the selection.enter() or selection.exit() functions will be executed only during those phases:

var dataset = [{ value: 35 }, { value: 13 }];
var lines = d3.select("body").append("svg").selectAll("line").data(dataset);
var yValue = function (d) { return d.value; };

lines
  .enter()
    .append("line")
    .style("stroke", "black")
    .attr("x1", 0)
    .attr("x2", 100)
    .attr("y1", yValue)
    .attr("y2", yValue);

lines
  .exit()
    .remove();

Run example on jsfiddle.

Anything not within selection.enter() or selection.exit() is considered part of the “update” operation. The “update” operation is also called immediately after “enter”, and updates can easily be animated:

var dataset = [{ value: 35 }, { value: 13 }];
var lines = d3.select("body").append("svg").selectAll("line").data(dataset);
var yValue = function (d) { return d.value; };

lines
  .enter()
    .append("line")
    .style("stroke", "black")
    .attr("x1", 0)
    .attr("x2", 100)
    .attr("y1", 0)
    .attr("y2", 0); // start from zero (entrance animation)

lines
  .transition()
    .attr("y1", yValue)
    .attr("y2", yValue); // called immediately after enter(), and when value changes

lines
  .exit()
    .remove();

Run example on jsfiddle.

Controlling How Objects Are Compared

In order to successfully animate chart elements when data changes, it’s important to keep the relationship between the proper DOM nodes and items in your data set. By default, D3 uses the index of the data in the data array to compare the data set with DOM elements. This isn’t always ideal — but thankfully, you can pass a second (optional) parameter when assigning data to a selection (when calling the .data() function):

var dataset = [{ id: 2553, value: 35 }, { id: 2554, value: 13 }];

var lines = d3.select("body").append("svg").selectAll("line")
  .data(dataset, function (d) { return d.id; });

One Last Thing…

The first time the chart is rendered, you shouldn’t feel strange about creating a selection which has no elements:

var lines = svg.selectAll("line"); // but no <line> elements exist yet!

Just think about it in terms of our diagram, and remember that if the DOM element doesn’t exist yet, you’ll get a chance to create it during selection.enter().

Venn diagram showing two intersecting sets, the first being DOM Elements and the second being Data Set. The intersection is labeled Update, DOM Elements alone is labeled Exit, and Data Set alone is labeled Enter.

Happy charting!

If you’re looking for further reading on selections, you should consider reading How Selections Work, Object Constancy, and Thinking with Joins (where the Venn Diagram for this article (adapted) was taken from).

How we migrated our assets to S3 without any downtime

Harvest used to have a local solution for storing user-generated files. We're talking avatars, estimate attachments or expense receipts. This solution was based on GlusterFS and required that we maintain our own set of file servers. There was also a fair bit of work involved to keep everything stable.

For example, keeping GlusterFS assets in sync between multiple locations is problematic. There is either a batch oriented sync process we have to keep on top of, or we have to use GlusterFS native geo-replication.

This in itself is not a huge problem, and the amount of data we have isn't that big, but it's data that keeps growing every week.

For example, this is a totally outdated rundown of the numbers:

The numbers

S3 means having a single, decoupled and reliable store with endless capacity. We really like that.

Now, migrating this data without downtime is not that simple. Besides, we like to release things slow starting small to catch up any problems before they hit all of our users.

We used mostly Paperclip to handle our assets, so we decided to build an extension that would allow us to migrate seamlessly. This extension is called paperclip-multiple and provided a new storage called :multiple that stored files both in the filesystem and in S3 (using the fantastic fog gem).

Thanks to that we could migrate our assets one by one without any downtime. The process worked like this:

  1. Change one type of asset (say, user avatars) to use this multiple storage and deploy it. From this moment on new avatars will be both on GlusterFS and Amazon S3.
  2. Synchronize the local filesystem with S3. For this we used the s3cmd sync command, which worked fantastic and was really fast if you had to do multiple syncs.
  3. Now all files should be on S3, meaning we can change the storage to just :fog and be done with it!

paperclip-multiple allowed even more flexibility. With feature flags we could specify for which users we wanted to store files on S3 or just keep working with the filesystem as before. We could also feature flag which user avatars we wanted to display from S3. This let us try file uploading for a while on our internal accounts, and later on try displaying them from S3.

Now, this process wasn't as simple as that. We discovered some bugs in the process, some of the code didn't use Paperclip properly and needed a lot of modifications, and we also built a proxy to serve all this data even faster and cheaper (but that will be a story for another time).

Even more interesting were the exports, which didn't even use Paperclip and, quite literally, were all over the place. We've rebuilt all of them, they are stored on S3, automatically expired after 15 days, and the new code is a lot easier to work with and, most importantly, extend. This actually meant that the biggest growth per week that we used to have is not really growth anymore because the files are deleted after certain time.

You can find the code on github, with more in-depth explanations about how to use it and how it works.

Finding the Courage to Constantly Critique

Constantly be thinking, critiquing your work in real time. ~ The Pragmatic Programmer

A big part of my job is to turn a critical eye on the work of others: code review, job applications, and so on. My harshest criticism falls on the code I wrote years ago. I've been at Harvest longer than any other developer on the team. Few at Harvest get to live with their terrible five-year-old code like I get to live with it. Daily it is in my face. In the course of our days we all run into our prior code failures. The idea of turning a critical eye on myself even more frequently than I'm already forced to is daunting, but embracing this fact is probably for the best.

Self reflection helps you become a better programmer. Most programmers go through the process once per year. Hello annual review! I believe a habit of frequent reflection is a habit with compounding returns. Understanding your strengths and weaknesses will allow you to learn, adapt and grow at a rapid pace.

A weekly reflection on your work feels like a nice compromise between annual reflections and a daily journal. I did this for a couple of years during my Getting Things Done® phase. I think weekly reflections can be pretty effective, but they have two failings. First, I don't think a weekly reflection saves any time over a daily journal. A proper weekly reflection should take one or two hours. Second, I don't think a weekly reflection is habit forming. If you miss a single weekly reflection you've pretty much broken the habit. Trust me, the weight of doing a three-hour reflection about the past two weeks is too much to bear.

I've never made a fair attempt at a rigorous daily developer journal. If I try a true developer journal, I will go into it with a checklist of questions:

  • What did you do today? (Be detailed.)
  • What decisions did you make?
  • How could these decisions backfire?
  • Did you have any conflicts today?
  • Did you enjoy the work you did?
  • What can you do better tomorrow?
  • What are you grateful for?

I often struggle with my memory. For whatever reason I find others more capable of retrieving experiences from months ago. Perhaps my brain is just built differently than most. Perhaps a daily journal is just what I need to better internalize the lessons learned from my daily successes and failures. Perhaps a daily journal will help me better collect and communicate my thoughts.

Perhaps I just need the courage to find out.

Segmentation Faults

For some time Harvest had been struggling with unsolved segmentation faults. I'd like to tell everyone the story of this journey. But first...

What is a segmentation fault?

Segmentation faults, or segfaults as we like to call them, are the errors that happen when a process tries to access memory it cannot physically address. Wait, what? Addressing memory? I thought I was working with Ruby and didn't have to take care of that!?

It turns out the Ruby interpreter itself has bugs (surprise!). Gems that have C extensions must worry about segfaults as well.

Segfault

What was happening?

A segfault is different from other kinds of errors, because they're so terrible that they kill the process. I mean, if things get so messy in the inside that you're accessing memory you shouldn't, the best you can do is stop doing what you're doing so you don’t make it worse.

When the process dies it is usually not sent to whatever exception tracking software you use. In our case, it's Passenger that notices the segfaults and dumps the error that Ruby returns into the passenger logs.

We were aware of the segfaults, but not being on our exception tracker made them very easy to ignore. That is until we started using New Relic and lots of exceptions started happening.

Exceptions?

But you said segfaults didn't cause exceptions, they just killed the process? You're right my friend, but let me get to the point.

Segmentation Fault

What we saw on our exception tracker was a lot of exceptions happening within a three-minute period, all crashing on the same line of code in the newrelic_rpm gem. The first suspicion, of course, was a bug in that gem. We opened a support ticket with New Relic immediately and they started investigating. On our side, we discovered that the errors happened on the same server and, with further debugging, on the same process.

The way New Relic works is it first spawns a parallel thread together with the Rails process. This process sends instrumentation data to New Relic. Since New Relic is a separate process, it cannot affect the performance of the primary Rails process.

What we saw on the New Relic logs is that an agent would spawn, only to cause an exception a few minutes later and keep causing exceptions until yet another few minutes later it would stop. We have configured passenger to have every Rails process respond only to a limited amount of requests, which explained this limited lifetime pattern we saw.

The exceptions looked different on almost all these incidents, but they would repeat until the agent stopped. Here are some examples:

ERROR : Caught exception in trace_execution_scoped header.
ERROR : NoMethodError: undefined method `<' for nil:NilClass
ERROR : Caught exception in trace_execution_scoped header.
ERROR : NoMethodError: undefined method `count_segments' for false:FalseClass
ERROR : Caught exception in trace_method_execution footer.
ERROR : NoMethodError: undefined method `tag' for #<Array:0x007f70a3c7bbe8>

On their side, Ben from New Relic pointed to a very funny exception that did get tracked: NotImplementedError: method 'method_missing' called on unexpected T_NONE object (0x007ffa15822ca8 flags=0x400 klass=0x0). According to his analysis, this is a theoretically impossible combination of flags in Ruby Land.

Stuff that shouldn't be false

That exception got me thinking, wait, something is corrupted without causing a segfault?

I've been checking our exceptions daily for a few months, and let me tell you, the amount of activity in there was suspicious. Here are some example exceptions:

  • TypeError: false is not a symbol
  • NoMethodError: undefined method 'keys' for nil:NilClass (This error happened on controller#action called false#false)
  • TypeError: can't convert false into String
  • NoMethodError: undefined method 'zero?' for false:FalseClass

I think every Ruby developer is used to the NoMethodError's on nil. I mean, it's the Ruby version of an Uncaught Exception in Java. You see it almost every day in your work life. But NoMethodError's on false? It's really not that easy to get a false by mistake.

Corruption

At the end, the data pointed to things getting corrupted. Random variables seemed to become values that weren't right. All of it sprinkled with a healthy amount of segfaults.

My gut feeling was that it all had to be the same, but what can be done about it?

The (only) good thing of segfaults is that they leave a trace. So instead of focusing on seemingly random errors, we could at least try to get some useful information from the thousands of segfault logs.

How do you debug a segfault?

Short answer: it's very difficult and almost impossible. As with any bug, the first thing you should try to do is reproduce the error. With these segfaults we never got to it, since they where so random. Besides, of all the information provided by a segfault, Ruby devs only understand the normal stack trace and that's about it.

So unless you're a C expert, you're pretty much screwed. However, we can use the common wisdom of the masses. If you ever need to fix a segmentation fault, don't waste your time debugging. These are almost always the solutions:

  • It's probably a bug in one of the gems that uses C extensions. Hope that somebody else had the problem first and they solved it. Therefore, update or remove all your gems with C extensions.
  • If not, then it's probably a bug in the Ruby interpreter. Update Ruby to the latest patchlevel.
  • If you're already at the latest patchlevel, then try the next version.

Updating gems

Thanks to the part of the segfaults that list the files and C extensions that were loaded, we compiled a list of possibly offending gems. Then we updated or removed all of these gems, and the segfaults stopped. Our update process pointed to a very outdated version of oj we were using. Oj is a fantastic gem that fixes bugs diligently, but we had not kept it up to date.

And now our exception tracker has been free of seemingly random exceptions!

Lessons learned

We learned many things, among them:

  • Don't try to understand a segfault. It's almost never worth it.
  • It's really important to keep gems up to date.
  • Things that don't blow up loudly are very easy to ignore.

I'd like to thank New Relic for their support. Their contribution helped a lot, even when in the end it turned out to be a problem on our system.

In retrospect the solution seems so obvious that it hurts. We're very sorry to anyone who got a totally random error using Harvest. We take errors very seriously and we're very glad that we can again focus on the errors in our application code.

Deploying Multiple Heroku Apps From a Single Repo

We’ve been toying around with socket.io recently at Harvest to make our apps more responsive — and one of the big challenges that presented itself was how to manage and deploy a polyglot application on Heroku (as most of our application code is in Ruby, but socket.io is a node.js project).

Since the “web” process is the only process to receive HTTP traffic and only one web process per application is allowed, we needed to deploy the Ruby and node.js parts of the application separately.

Traditionally, there is a one-to-one mapping of repositories to apps on Heroku — but we wanted to keep both apps in the same repository (as they’re really the same thing, conceptually) and yet deploy them as two apps on Heroku. So, what’s standing in the way?

Heroku Buildpacks

Heroku uses this thing called buildpacks — the utilities and frameworks that are required to fetch your app and install dependencies. There are different buildpacks for ruby and node.js on Heroku.

Buildpacks are usually automatically chosen — but since we’re pushing the same code to both apps, the order in which buildpacks are selected matters. Turns out that ruby is preferred over node.js.

No worries, you can select a buildpack manually with an environmental variable on Heroku’s server — which is what we did to the node.js app, and it worked like a charm.

Procfiles and Foreman

The only other thing standing in our way was the way that Heroku starts web workers — using Foreman and a Procfile to define processes.

With ruby applications, the Procfile is optional — it will try to start a Rails application if it isn’t specified.

When deploying node.js to Heroku, you must specify a Procfile — there is no default command to start a server. Since we want to be able to push the same code to both Heroku remotes without any modifications, we got a little clever with the web process in the Procfile — it executes a shell script which examines the environment and runs the proper web process.

#!/bin/bash

if [ "$RAILS_DEPLOYMENT" == "true" ]; then
  bundle exec rails server -p $PORT
else
  node node/index.js
fi

Procfile:

web: bin/web

Hat tip to David Dollar (the creator of Foreman) for the idea.

Happy polyglotin’!

Harvest Performance in July: A Troubleshooting Tale

I hope you slept well during July. Unfortunately, the Ops team here at Harvest did not. Harvest had a series of very short outages during July, all occuring at approximately the same time (12:30 a.m. ET). We still achieved 99.95% uptime for the month, but our availablity report for July shows all of these events (too many for our liking). I want to apologize to customers who were affected by these events.

The summary is that an overloaded, and incorrectly designed logging system was impacting Harvest application requests when a nightly archive job ran. For others who have implemented rsyslog and the RELP protocol, I urge you to be careful about the potential for blocking applications which use the syslog service. Read on for the details.

The problem begins

At 12:31 a.m on July 2nd, we received our first notification that some requests were timing out. By the time I looked at the situation on the servers, we were many thousands of log entries past the event and things were back to normal. Odd. I assumed at the time that a brief network issue had degraded performance between two components of our stack, and that the issue was between the app servers and the database servers. This was pure conjecture, since no logs showed any evidence supporting this theory, or any other theory I was able to formulate.

A similar incident occurred on July 9th, then again on July 16th, both around the same time. Each time the event occurred, we would see a large number of Rails requests become very slow, and our available pool of Passenger processes for inbound requests would be depleted, causing some customer requests to receive errors.

On the night of July 16th, 2013, I bade farewell to sleep and proposed to caffeine as I turned my Harvest cap backwards.

Searching for evidence

For a few nights, I was logging in to our systems to watch for this event in action, looking for its genesis. This watched pot did not boil, and the issue never occurred during the 3 nights I was watching for it.

Developers and ops folks reading this post are probably rolling their eyes. Logs and performance metrics exist to find these patterns. It should be simple to find the issue, right?

Right. But also wrong.

At this point, we knew:

  • The event lasts for a few minutes (3–7 minutes).
  • The timeframe of the event was predictable to within 10 minutes.
  • Most dynamic requests would become slow, then suddenly recover.
  • The load balancer logs showed us how slow the requests were (some greater than 15 seconds, which is an aeon).
  • The Rails logs, on the other hand, indicated that requests were fast once processed.
  • This many slow requests was exhausting our application server resources and we couldn't queue further requests for processing. Customers saw errors at this point.

On July 23rd, I was able to watch the issue live for the first time. As request times increased, I was profiling the performance of network traffic between components and watching every metric I could for the smoking gun. I had a bunch of systems utilities logging their data so I could review the open network connections for Harvest's backend services. I was extracting performance numbers from streaming application logs and was hoping for something to be glaring at me in the details. Nothing. The issue only lasted for 3 minutes while I was watching it. We then recovered in the space of a second, and that was it.

Without a smoking gun, we had to invent theories. The first theory was that the issue was caused by an external service that becomes slow for a short interval.

The first external service we (incorrectly) looked at was Skylight, which Harvest uses to log performance data. We spoke with Carl, explained our issue, and asked for his thoughts. A patient and wise man, Carl analyzed all the Harvest data on their end and concluded that although Harvest was indeed slow during these intervals, the way that Skylight is implemented would not cause this issue. We decided to remove Skylight temporarily to be sure, and on July 25th, without Skylight being deployed, we saw the issue recur. Skylight was not the culprit. Thanks to Carl and the Skylight crew for entertaining our theories.

At this point, I asked our excellent data center providers, Server Central, to perform a network traffic analysis to determine if there was a pattern of network traffic anywhere at all during the outage windows, or if some network path was being saturated and degrading performance. The team at Server Central really stepped up and did an exhaustive check of all of our network traffic and determined that nothing untoward was happening during the time we were seeing the issue. (If you are looking for a place to host your servers, you should talk to these guys.)

So now we have Rails logs that say things are fast, load balancer logs that say things are slow, an external performance analysis service which agrees with the load balancers, and a predictable window for the issue.

We split our global pool of application servers into a few different function-based pools. The theory was that if the problem was exhibited in only one of the pools, that we would have narrowed down the possible scope of the issue somewhat. However, the last problem window had occured among all of the pools, so it wasn't isolated to one area of Harvest.

Narrow down the scope of the investigation

We had to decide at this point if we believed the issue was caused by inbound requests which were doing something unexpected, or if the issue was caused by something on our end which was performing poorly and then recovering. We started our investigation on the former.

Harvest handles a lot of traffic. We also have a lot of hardware focused on serving this traffic. This means two things: first, there is a lot of data to analyze if you need to analyze our traffic data in a new way; second, it means that if this issue was being caused by external requests that there would have to be a very large number of them to have this kind of effect.

We spent a couple of hours looking at our traffic logs, filtering them in various ways, trying to find a large number of requests isolated in the problem timeframe and looking like they could have wide ranging performance effects. We didn't find anything.

In order to concentrate our energy, we decided to abandon the idea that the problem was caused by inbound traffic and focus our efforts on searching for an anomaly in our systems.

Working through possible theories

I set out to collect and systematically debunk any theories we could come up with, no matter how absurd they sounded. I drew up a list of 10 theories to explore and debunk.

Was memcache becoming slow for some reason? No. Was MySQL becoming slow for some reason? No. Was nginx encountering a bug somewhere and logging the evidence? No. And so on, through five or six different theories. Next on the list: was the problem caused by a Passenger bug?

If you use Passenger, you are no doubt familiar with using a SIGABRT signal to force Passenger to log a backtrace to the Rails log. Testing this procedure, I was unable to locate the backtrace data in our Rails logs. I asked Hongli Lai over at Phusion to help me find this data. We even arranged for Hongli to be available as a consultant to help us troubleshoot the problem if required. Hongli was very gracious and I would like to thank him for his willingness to help.

Trying to force the Passenger backtrace into the Rails log ultimately led me to the cause of the problem, though it wasn't a Passenger bug. We use rsyslog to send our Rails logs from each application server to central log servers where various tools parse and analyze the application logs. The Passenger backtraces were not appearing in the Rails logs, and I was unsuccessfully searching every possible log location for them.

At the same time, we were doing yet another fine grained audit of every single scheduled task which gets run across the entire Harvest architecture, looking for a scheduled task near the outage window which could possibly cause the problem.

Finally, the smoking gun

I noticed a harmless looking cron job responsible for archiving old log files on one of our rsyslog servers. It ran every day a few minutes before the window. Looking at the system metrics for this log server, I noticed increased levels of IOwait while this archiving ran through the outage window.

log server CPU IOWait spike

When we implemented rsyslog we had chosen to use the RELP protocol, because we wanted maximum log reliability. We had assumed at the time that the impact of any rsyslog performance or availability issues would be contained by the rsyslog subsystem, not spread to the applications which were using the syslog facility. This turns out to have been naïve.

We found some reports of other folks having similar experiences with rsyslog/RELP. Performance issues could easily lead to rsyslog local queue exhaustion, which could lead to applications blocking while waiting on the syslog facility.

At last, a credible theory.

I readied a move away from TCP/RELP towards UDP for rsyslog traffic in Chef and deployed it. There was a syntax bug in our config which immediately proved our theory to be correct. While rsyslog messages were being queued locally because of the syntax bug, Passenger processes began to pile up as Rails requests blocked on calls to syslog. This was the exact behavior we would see during the mysterious outage windows.

The problem that we were struggling to diagnose because our application logging didn't confirm the existence of the problem turned out to be the application logging itself.

Fixing the design

We ended up moving back to using RELP in rsyslog, but instead of Rails using the syslog facility directly, Rails now writes logs to a local file on the filesystem and rsyslog uses this text file as an input. Rails logs are now shipped to the central logging servers in a way which is decoupled from Rails entirely. This way, any future rsyslog performance issues should be contained within rsyslog and not impact our application.

The mystery finally solved, caffeine and I took some time apart. (We have since gotten back together.)

I'm not proud of how long it took us to find the problem, and I am really not proud of the number of events on our uptime report during the time it took us to discover the cause. I am very proud of the team that helped troubleshoot this issue, though. Harvest has an amazing team who are good at working under pressure.

Like a great joke, or a visionary piece of technology, it seems obvious once you see it. Someone had to see it for the first time, though.

Thanks for being patient while we figured this one out.

Focus

The choir doth preach: Focus, programmer, lest ye be distracted to failure. You listen. This week you will define your goals and accomplish them. You will succeed...

Suddenly it is Friday and you know not what you have done. Your back hurts. You've knowledge-worked so hard as to cause yourself physical pain. What was it you set out Monday to achieve?

Focus is a learned skill that must be practiced. You cannot wish yourself to a focused work week or a focused life. Goals are a first step. Limits help you carve out pockets of time. Practice leads to forming one of the most powerful habits you can learn.

It is impossible to maintain focus without defining limits. Pick your goals and plan time to focus on the problem at hand. Clear a day or several mornings. Remove the distractions of email and chat. Communicate your intent.

And focus.

Plan your time to work. Plan your time not to work. Do not break your rules for any reason outside of a true emergency. You cannot be a true practitioner of a focused life if you throw your limits out on a whim. Do not check email, Twitter, and Facebook. Shut off the notifications.

As focus is engrained within you, your accomplishments will speak for themselves. Others will seek to emulate you; your sense of purpose. You will be ruthless in eliminating distractions. You will have more success in your career. You will be intentional with your free time. You will be happier.

All because you learned to focus.

In Defense Of Rem Units

Over the past couple weeks, I’ve had the opportunity to design and build Harvest’s 2012 Year In Work, which afforded me a chance to try out something new: the rem unit of measurement.

You’re probably already familiar with em units – they’ve been a common feature of CSS for the past decade – but a crash course: elements specified using em units are sized relative to the font-size of their parent element. For example, if a paragraph has a specified font-size of 2em, and the div it’s inside of has a font-size of 10px, then the effective font-size of the paragraph is 10px × 2em, which is 20px.

.parent {
    font-size: 10px; }
.parent p.child {
    font-size: 2em; /* outputs to 20px */ }

“Rem” stands for “root em”. Similar to em units, they calculate size based on an ancestor element’s font-size, except rem units always calculate against the html/root element’s base size. So if the html element has a font-size of 16px, then a paragraph with font-size of 2rem will always be 32px, regardless of the its parent div or any other element.

html {
    font-size: 16px; }
.parent {
    font-size: 10px; }
.parent p.child {
    font-size: 2rem; /* relative to html - outputs to 32px */ }

This is incredibly compelling, particularly because...

I try to avoid Em Units

Em units are useful for a variety of reasons, the most important being that they’re flexible and scalable. You can set the highest parent element, html, to have a relative font-size specified with an em value or a percentage, and it will calculate a specific value based on the browser’s default font-size. This is great for printed documents (the typical default browser print size is 12px vs. 16px for screen display), and it allows your site to scale with the user’s preference, especially in older browsers that have a harder time with page/text zoom.

However, em units have their drawbacks as well, particularly for web-apps or teams where many people touch the CSS. Consider Harvest, a web-app with thousands of lines of CSS code. We use modularized CSS for various designs and extensions used in various places throughout the app (like our calendar pop-over). If any of our styles were set in ems, then there’s a possibility that the pop-over would get added to too many different parent elements in too many different places with various font-sizes, causing it to look different each time and ruining the consistency.

As both a product designer and an HTML/CSS developer accustomed to decently-sized dev teams, I’ve learned to avoid em units for the bulk of app development. On smaller and less complicated sites, these issues are less apparent, but still necessary to keep in mind and design around.

Enter the Rem unit

I first heard about rem units a few years ago, but haven’t seen much talk or use since. I’m not sure why, though, as they have always seemed incredibly well-suited for web design – they behave much more predictably like pixels, but retain the flexibility of ems. If an element is always supposed to have a font-size of 16px at base size, regardless of the parent element, then rems make more sense than ems. It’s much easier to convert design comps into code without worrying about parent elements. The same goes for width, height, padding, margin, and all other style attributes that specify size.

I decided to try rem units for Harvest’s 2012 Year In Work to learn their strengths and weaknesses first-hand in a real world environment. I found them to be very useful, and a joy to use for a variety of reasons.

What you should know about Rem units

They help with Responsive Design

Since all rem units are relative to a single number – the html font-size – you need only set the breakpoints on this number for a simple responsive layout. Here’s the code used in the Year In Work’s Sass file:

Responsive HTML font-size

While this isn’t appropriate for every design, being able to set a base size for your entire layout for every breakpoint can be incredibly useful.

**Pro-tip:* While I used pixel values, you could use % values as well.*

Great for Rapid Prototyping

I’ve signed on to the theory of prototyping a design directly in HTML/CSS, and avoiding Photoshop altogether (sometimes). When experimenting with a design right within the browser, the last thing you want to worry about is getting the sizes exactly right.

Set the html font-size to 16px, or whatever you want, and then quickly estimate the sizes for everything else and run with them. The h1 is probably 2.5x that size, h2 probably 2x, h3 probably 1.75x... etc. It’s fast. While you can use em units for this as well, using rem units avoids the issue of having to resize the h3 after putting it inside a div with a font-size on it – which is more likely as you tend to move things around quickly during rapid prototyping.

This is how I designed the Year In Work actually... there were no layout PSDs.

**Pro-tip:* Just like with ems, you’ll find you need to use fractions like 2.25 for some values. I always stuck to .25, .5, or .75 when not using a whole number, since my base size is 16px. However, as you can see in the screenshot above, the base size can change to 13px, which at 2.25rem is 29.25px. Since you can’t have a quarter pixel, some browsers render this as 29px, others as 30px. I found this to be a minor trade-off (but something to keep in mind) for the ease of the responsive layout that I explained above.*

All sizes should be in Rem units

Some developers prefer a mix of pixels, percentages, and ems for different styles on different elements, all within the same page. Personally, I think this makes development confusing, but it can work.

This doesn’t work well when using rem units, though. Using rems on only some styles doesn’t offer any advantage – you might as well just use pixels. You have to commit to replacing all pixel values with rem units if you want the responsive advantages.

**Pro-tip:* You can use ems and rems together, though. Check out the h3 on slide 3 of The Year In Work – I didn’t feel like calculating the size of the “21”, so I gave it a font-size of 1.25 em.*

They can influence your design style

Sometimes you need an element to be only 3px – this translates to 0.1875rem... not very conducive to simplified design code. However, you can allow this to influence your design, and decide to just use 0.2rem (or 0.25rem, like I mentioned above). Often the difference isn’t important.

Or, if that 3px is a border-radius, maybe you decide to remove it entirely. Those are design decisions you have to make. I increasingly found that I prefer numbers to end in 0 or 5, or at least an even number. If an element was 9rem wide, I would just make it 10rem and see how that felt. The majority of the time it worked out just fine, and psychologically I felt like my code and design was clearer, cleaner, and easier to work with.

It’s because of this reason that rem units probably aren’t the best for sophisticated web-apps. They worked great for the Year In Work, but I don’t see us using them in Harvest anytime soon – that would probably slow down development rather than speed it up. There are just too many small pixel sizes in various places for rem units to be useful.

However, if you put the last two points together, and you fully use rems from the start while allowing yourself to design around the strengths and weaknesses of rem units, you could end up with a nice web-app that scales nicely between browser and tablet sizes and page zooms.

They do not work in IE8

This may be the biggest reason why we don’t hear about rem units too often currently, as many sites (especially web-apps) still need to support IE8. Luckily, that was not a requirement for the Year In Work, which we decided should be a showcase of some forward-thinking design and techniques.

If you’re working on a personal or small site, I doubt this will be a big issue for you.

Final Thoughts

Rem units aren’t great for every occasion, but for some sites they can be a great tool. They are worth trying regardless of what you're normally comfortable with:

  • If you’re coming from a background that always uses pixels, rems are virtually the same thing, but they add the responsiveness that was missing in your old designs.
  • If you’re coming from a background that always uses ems, rems can solve your issues of elements-within-elements becoming inconsistent and having to add multiple definitions in your stylesheet.

Either way, I hope to see more rem usage in the future. I don’t feel like it’s been fully explored yet, and think there’s some still use cases that haven’t been discovered.