Case Statement


Why we are Still Deploying Overnight

(and what we’re doing about it)

So a co-worker sent me this blog post by Brian Crescimanno today that asks: why am I still deploying overnight?

Well, it’s a question that’s been bothering me for months. Sometimes when I bring it up I find good reasons, other times I find excuses, and other times I find something no one has really thought about. Looking at the comments on that blog post, there seems to be a lack of concrete examples of how people manage to deploy during the day or why they are stuck deploying at night.

I’ll run over our deployment process, try to explain the historical reasons for what we do, and then talk about what we plan to do about it.

Now, a lot of the advice surrounding the aforementioned post is that getting to the point where you can deploy during the day is very specific to your application. People say the problems your company faces have to be taken case by case because they will be so different from another company’s. I think that’s a bit of a cop-out. Let me know if our situation is at all similar to yours.

Our application is a single-page web app with long-lived (12+ hour) user sessions. Our clients run offices in North America, so we typically see user activity between 6am and 8pm, Monday to Friday. Since we are a single-page application, our compiled JavaScript file is quite large and we do our best to make sure it is cached in clients’ browsers (we use a URL like app.js?version=1.01 to ensure that the caches are invalidated after each release). Because these offices depend on us completely, we cannot take the site down during that 14-hour window, so if we are going to deploy during the day, it must be done without downtime.
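The versioned URL mentioned above can be generated by a small helper at build time. This is only a minimal sketch: the function name `versioned_asset_url` is hypothetical, and it assumes the release version is available as a plain string.

```python
from urllib.parse import urlencode


def versioned_asset_url(path: str, version: str) -> str:
    """Append the release version as a query parameter.

    A browser treats each distinct URL as a separate cache entry, so a new
    version string makes it fetch the new file instead of the cached one.
    """
    return f"{path}?{urlencode({'version': version})}"


print(versioned_asset_url("app.js", "1.01"))
```

Note that this only helps when the page that references the script is reloaded. A session that stays open for 12+ hours keeps running whatever it loaded at sign-in.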

The Deployment

  1. On the Thursday before the release we test our data migrations on a snapshot of client data. This is necessary because too often we were running into situations where a migration would work fine from development to staging but a particular client’s data would cause the migration to fail.
  2. On Friday a script is run to prep the release branch. The script
    • checks out the next release
    • minifies the JavaScript
    • creates a tarball and copies it to our app servers
    • extracts the tarball beside the running code, where it is left alone until Sunday
  3. On Sunday night a script is run that
    • logs out any users
    • swaps the site in Apache with a “we’re down for maintenance” site
    • turns off db replication
    • creates a snapshot of our database
    • runs the data migration
    • restores db replication
    • changes the symlink for our app from the old release directory to the one created on Friday
    • restores the Apache site
    • does a quick smoke test

The Friday step could be done on Sunday, but the JavaScript build and propagating the tarball take about 40 minutes that no one wants to spend on a Sunday night.
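Of the Sunday steps, the symlink change is the one that already works without downtime if it is done as a rename rather than a delete followed by a create. The following is a minimal sketch, not our actual script. It assumes a POSIX filesystem, and `switch_release` and the paths passed to it are hypothetical.

```python
import os


def switch_release(current_link: str, new_release_dir: str) -> None:
    """Repoint current_link at new_release_dir without a gap.

    A temporary symlink is created next to the live one and then renamed
    over it. On POSIX the rename replaces the old link in a single step,
    so there is no moment at which current_link does not exist.
    """
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up after an earlier interrupted run
    os.symlink(new_release_dir, tmp_link)
    os.replace(tmp_link, current_link)
```

With a layout like the one described above, the call would look something like `switch_release("/path/to/app/current", "/path/to/app/release-1.01")`, where both paths are placeholders.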

What Is Stopping Us

So what issues would we face if we tried to deploy during the day?

  • We can’t sign users out during the day. Right now this is a cheap way for us to force them to redownload any new client side files.
  • We can’t use a “we’re down for maintenance” site.
  • Currently, we turn off replication so that we can promote the slave if the data migration fails. Promoting the slave discards everything written since replication was stopped, and we can’t lose a minute of customer data, so replacing the master db in this manner wouldn’t work during the day.

So, in general we have two core issues and two tangential issues:

  1. A client side browser cache that references our old version.
  2. Database migrations that could potentially fail, lock up the application, or cause inconsistent data to be created while the migration was taking place.
  3. Our QA department currently does a full sweep test for regressions before each release. This takes at least two weeks, but most people here feel it is a necessary evil.
  4. Our current development process is also focused on releasing groups of features, so a single patch release is currently cumbersome.

As I said, others at work and I have been thinking over these issues for a while. Here’s what we’re going to do.

First Steps

  • Separate some of our clients to point to a different app bundle so we can release to a subset first. We’ve actually had this for a long time but never use it.
  • Start with trivial changes. We’ve defined this as:
    • Nothing that affects the REST interface. As long as the JSON structures going back and forth remain the same, we avoid issue #1.
    • No database migrations. This way we completely avoid core issue #2.
    • The trivial change cannot do more than one thing. This should alleviate #3: QA can do a branch test, and we as developers have to be extra careful and guard against regressions with functional tests.
    • We’ll just have to deal with issue #4 as necessary.
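Releasing to a subset of clients first comes down to choosing which bundle version a given client is served. Here is a minimal sketch of that choice. The function name, the client ids, and the idea of keeping the early group in a set are all assumptions made for illustration, not a description of our setup.

```python
def bundle_url(client_id: str, early_clients: set,
               stable_version: str, next_version: str) -> str:
    """Return the app bundle URL a client should load.

    Clients in early_clients get the new release; everyone else stays on
    the stable one until the new release has proven itself.
    """
    version = next_version if client_id in early_clients else stable_version
    return f"app.js?version={version}"


# Hypothetical client ids, for illustration only.
early = {"office-a", "office-b"}
print(bundle_url("office-a", early, "1.01", "1.02"))
print(bundle_url("office-z", early, "1.01", "1.02"))
```

Rolling the release out to everyone is then a matter of bumping the stable version, and rolling back means emptying the early group.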

Down the Road

Both issues #1 and #2 are handled by making sure the application is both backwards and forwards compatible.

With regard to the JavaScript cache, this means that the data consumers need to be flexible about added or missing fields. I think we are pretty close here. Using a setAttributes on a model is one side of the equation; checking that fields exist before they are used on the client side is the other. I imagine we’ll run into other issues here, but if we remain responsive I’m sure we’ll be OK.
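The idea of a consumer that tolerates added or missing fields can be sketched as follows. This is an illustration in Python rather than our client code: the `Contact` class, its fields, and its defaults are hypothetical, and `set_attributes` only stands in for the setAttributes mentioned above.

```python
class Contact:
    # Hypothetical fields this version of the code knows about, with defaults.
    FIELDS = {"id": None, "name": "", "phone": ""}

    def __init__(self) -> None:
        for field, default in self.FIELDS.items():
            setattr(self, field, default)

    def set_attributes(self, payload: dict) -> None:
        """Copy known fields from a payload.

        Fields this version does not know about are ignored (forwards
        compatible), and fields the payload lacks keep their defaults
        (backwards compatible).
        """
        for field in self.FIELDS:
            if field in payload:
                setattr(self, field, payload[field])


contact = Contact()
# A newer server adds "email"; an older one might omit "phone".
contact.set_attributes({"id": 7, "name": "Front desk", "email": "desk@example.com"})
print(contact.id, contact.name, repr(contact.phone), hasattr(contact, "email"))
```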

For the database, we need confidence in our migrations. We already test them before deployment, so that helps. In general, our migrations add new tables or columns, create new foreign keys and other constraints, or optimize stored procedures. I think each of these cases is generally compatible, but we will have to watch for table locking. Our segmented users will alleviate a large part of the risk once we choose to run these types of migrations during the day.
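To make "generally compatible" concrete, here is a small sketch of an additive migration running while the old release keeps writing. It uses an in-memory SQLite database purely as a stand-in for our real database, and the `appointments` table and its columns are hypothetical. It says nothing about table locking, which depends on the database engine.

```python
import sqlite3


def old_release_insert(conn: sqlite3.Connection, client: str) -> None:
    # The old release names its columns explicitly, so it never sees new ones.
    conn.execute("INSERT INTO appointments (client) VALUES (?)", (client,))


def additive_migration(conn: sqlite3.Connection) -> None:
    # A new nullable column leaves existing rows and existing queries valid.
    conn.execute("ALTER TABLE appointments ADD COLUMN notes TEXT")


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE appointments (id INTEGER PRIMARY KEY, client TEXT NOT NULL)"
    )
    old_release_insert(conn, "before migration")
    additive_migration(conn)
    old_release_insert(conn, "after migration")  # old code still works
    return conn.execute(
        "SELECT client, notes FROM appointments ORDER BY id"
    ).fetchall()


print(demo())
```

The same reasoning breaks down if old code relies on `SELECT *` with positional access, or if the new column is declared NOT NULL without a default.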

Renaming columns and changing their data types or underlying meaning is a more difficult problem that I don’t quite know how to tackle yet. I’ve seen some vague strategies using db views or staged migrations where data exists in parallel for a few releases. If anyone has information about handling these types of migrations with no downtime, I’d be very interested.
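One shape a staged migration for a column rename could take is sketched below. This is an illustration of the "data exists in parallel for a few releases" idea, not something we have run in production. SQLite stands in for the real database, and the `clients` table, the `phone` and `phone_number` columns, and every function name are hypothetical.

```python
import sqlite3


# Release N: add the new column and start writing to both.
def add_new_column(conn: sqlite3.Connection) -> None:
    conn.execute("ALTER TABLE clients ADD COLUMN phone_number TEXT")


def write_phone(conn: sqlite3.Connection, client_id: int, value: str) -> None:
    # Dual write keeps old readers (phone) and new readers (phone_number) in sync.
    conn.execute(
        "UPDATE clients SET phone = ?, phone_number = ? WHERE id = ?",
        (value, value, client_id),
    )


def read_phone_old(conn: sqlite3.Connection, client_id: int):
    # Reads stay on the old column for the whole of release N.
    return conn.execute(
        "SELECT phone FROM clients WHERE id = ?", (client_id,)
    ).fetchone()[0]


# Between releases, once no code that writes only the old column is running:
def backfill(conn: sqlite3.Connection) -> None:
    # Copy rows that were never touched by a dual write.
    conn.execute(
        "UPDATE clients SET phone_number = phone WHERE phone_number IS NULL"
    )


# Release N+1: reads move to the new column.
def read_phone_new(conn: sqlite3.Connection, client_id: int):
    return conn.execute(
        "SELECT phone_number FROM clients WHERE id = ?", (client_id,)
    ).fetchone()[0]

# Release N+2 (not shown): stop writing the old column, then drop it.
```

Each step on its own is an additive or no-op change from the point of view of whatever code is running at the time, which is what would let it happen during the day. The cost is that a rename is spread over several releases.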

Issue #3 is clearly an automation problem. We’ve started writing Sikuli tests, but we’ve got a long way to go. The more we automate, the more we can expand the scope of our daily releases. Until we have confidence that regressions are being caught, we will likely have to stick with the simplest changes.

As far as the development process is concerned, we have two intertwined issues: continuous deployment and daytime deployment. I believe that by adapting to releasing more features during the day, we will naturally evolve a process that allows us to release more often. Continual deployment of features will require reworking a large part of the development process. We use a git-flow model which, while nice for more traditional releases, is a bit complicated when doing hotfixes. Maybe we want to switch to GitHub flow…

Baby Steps

Over the last year and a bit we’ve gone from deployments that took four to five hours with numerous manual steps to a well-defined, scripted bi-weekly release. It’s difficult to take an existing process and an existing culture that’s always done releases a certain way and move them towards something new.

Being aware of your deployment process and why you’ve built it up in such a way will help you find the way forward.

  • 7 months ago

About

I'm Case Nelson.
Customer Advocate, Leader, Evangelist, and Hacker.
