Teatime: Continuous Integration

2016-04-20 9 min read Teatime

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: Oprah Chai. I expected this to be boring and gimmicky, but it was surprisingly bold, and a pleasant drink all-around. I tried it at a Starbucks before I bought some, which is a nice perk.


Today’s Topic: Continuous Integration

Today’s topic is continuous integration; much of it is adapted from the book Continuous Delivery by Jez Humble and David Farley. When I gave this talk, I noted up front that the book deliberately starts from the worst possible practices and walks them up to the best possible ones. Since my company is far from the worst possible state, a lot of the items were things we were already doing. I’d be interested to hear in the comments what you already do or don’t do.

The Problem

Here are some of the major problems in the industry that Humble and Farley saw when they sat down to write the book, back around 2010:

  • Delivering software is hard! Release day arrives; everyone’s tense, nervous. Nobody’s quite sure the product will even work, and nobody wants their part to be the bit that fails, putting them on the chopping block. We’ve got one shot at this launch, and if we botch it, we cost the company millions as customers go offline. Expect 3am phone calls from Operations — did development forget to write down a step in the process? Did someone forget to put a file in the build folder? Is all the SQL going to run first try? What if some of the data we thought was in prod has changed since we started developing? Are we really, really sure?
  • Manual Testing Sucks, Period. It takes forever to manually test a product, and nobody’s ever quite sure we covered everything. How many bugs do you get where you’re asking yourself, “Didn’t they test this? How did this ever work?” It’s so obvious in hindsight when these things come in, but it’s customers finding them. And it takes weeks, months maybe, to run a full test cycle on a brand new app if it does anything more than CRUD. Oh, and performance testing? Manually? Eh, it seems fast enough, release it. Security testing? Who has time for this crap?
  • “It worked in dev” syndrome. What are the differences between dev and prod? Can you name them all off the top of your head? When do you test in a production-like environment, and what are the differences between production-like and production? Who tested in dev? What did they test? Are you sure you understand how users will interact with your system? How many times do you get bugs where you ask yourself “Why did they even click that?!”
  • No way to test deployment. The only truly prod-like servers are prod; the only process is “a person does a thing”. You can’t test people, and there’s always going to be turnover. How do you know they did it right? How can you audit their process, or improve on it? People aren’t exactly reliable; that’s why we invented machines 😉

The Principles

So here’s what they came up with as guidelines to try and correct the system. These are necessary to pull yourself out of process hell and start building toward Continuous Integration:

  • Every Commit is a Release candidate. Every single one could potentially be released. If it adds value and doesn’t break anything else, it’s ready to release. Whether it’s actually released is going to be up to the BA and/or PM, of course, but you don’t want to commit anything you know is broken; you’ll just waste everyone’s time. If you want the safety blanket of committing early and often, make a feature branch; when you merge that back in, it’s a release candidate.
  • Repeatable, Reliable Release Process. Once you have that commit, you want a standardized process, on paper, that can be repeated with every release, no exceptions. If there ARE exceptions, you document those too, so they’re not exceptions anymore; things like rolling back a failed deployment should be a standard, repeatable process as well. We had one week where we accidentally re-promoted a broken release because I forgot to pull it out of the QA branch after it failed in production the week before. Needless to say, after I made a round of apologies, I documented that step as well!
  • Automate all the things! Automate everything. The authors have never seen a release process that can’t be automated with sufficient work and ingenuity. After I gave this talk the first time, I embarked on a 6-month project to do just that, simplifying our convoluted multiple-branch SVN strategy into a flatter tree, and automating the deployment from Trunk. It took ages and it was painful to implement, but the new system is much more reliable, faster, and generally nicer to use.
  • Keep everything in source control. The goal is to allow a new team member to come in, sit down at a brand new workstation, run a checkout, run a build script, and have a working environment. Yes, that includes the database. Yes, that includes the version of ColdFusion or Node or whatnot. Yes, that includes the Apache or Nginx configuration. It should be possible to see at a glance what version of the application and dependencies are on the servers. Node’s package.json is a great step toward that ideal (there’s a small sketch of that idea after this list).
  • If it hurts, do it more often. Example: Merging sucks. Merging more often, in smaller chunks, is easier than delaying until the end of the project to merge and resolve conflicts. Another example: releasing sucks. So instead of releasing huge products once a quarter, release them once a month, or once a week, or once a day, or once an hour…
  • Build Quality In. This idea comes from Lean: the earlier you find a bug, the cheaper it is to fix. We QA folks tend to know that, and our mantra becomes Test Early, Test Often. If you find a bug in your code before you commit it, that’s maybe ten minutes’ time to fix it, max. If you find it in QA, now the person who found the bug has to write a ticket, the PM has to triage it, you have to read it and understand it, maybe some more clarification back and forth, then you have to hunt through the code to find the problem, and then you fix it. So now we’re looking at hours of time rather than minutes. And if the problem is found in production and we have to run through a whole release cycle? Plus the customers’ time lost trying to work around the bug? A disaster. This is where unit and integration testing are super important (the second sketch after this list shows the kind of test I mean).
  • Done means released. In Waterfall, “done” means “built, ready for testing”. In Agile, “done” means “ready to be released”, which means the developers don’t stop caring about something until it passes testing. DevOps goes one step beyond that: “done” means “released to production”. After all, what good is something that is beautifully crafted and has passed all its tests if the customer can’t use it yet? This ties into the next principle:
  • Everyone is responsible for delivery. In the Waterfall way, the developer builds a thing, tosses it over the wall to QA, and walks away, expecting other people to be responsible for getting it into prod. In the DevOps world, we’re all on the same team together: it doesn’t matter whose fault it is or what went wrong, everyone’s responsible for helping get the code safely into production. The developer should be on hand to chime in with their intimate knowledge of the code while the operations folks are trying to get things running.
  • Continuous Improvement. This is my favorite principle 🙂 The general flow of work should be: Plan, Do, Study, Act. Routinely, we should get together to ask “how could we do this better next time?” We should take controlled risks in order to improve our craft.
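To give the “everything in source control” idea a bit of shape, here’s a minimal sketch of a startup check that compares the Node version a server is actually running against the version pinned in package.json. It assumes an exact pin in the engines field (for example "node": "6.9.1", a made-up value) purely to keep the example dependency-free; a real setup would more likely use a semver range and the semver package.

```typescript
// Sketch: refuse to start if the server's Node version doesn't match the one
// pinned in source control. Assumes package.json has an exact pin, e.g.
// { "engines": { "node": "6.9.1" } } (hypothetical value).
import { readFileSync } from "fs";

const pkg = JSON.parse(readFileSync("package.json", "utf8"));
const pinned: string | undefined = pkg.engines && pkg.engines.node;
const running = process.version.replace(/^v/, ""); // process.version looks like "v6.9.1"

if (pinned && pinned !== running) {
  console.error(`package.json pins Node ${pinned}, but this server is running ${running}.`);
  process.exit(1); // a mis-provisioned box fails loudly instead of failing weirdly later
}
```

The same spirit applies to the Apache or Nginx config and the database schema: if it defines the environment, it lives in version control where anyone can read it.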

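And to make “build quality in” concrete, here’s the kind of fast, commit-stage unit test I mean, using nothing but Node’s built-in assert module. The priceForCart function and its prices are hypothetical stand-ins for whatever logic you’re about to commit.

```typescript
// Sketch of a commit-stage unit test: fast, in-memory, no network, no database.
// If this fails on the developer's machine, the bug costs minutes, not hours.
import { strictEqual } from "assert";

interface LineItem {
  unitPrice: number; // cents, to avoid floating-point surprises
  quantity: number;
}

// Hypothetical function under test.
function priceForCart(items: LineItem[]): number {
  return items.reduce((total, item) => total + item.unitPrice * item.quantity, 0);
}

strictEqual(priceForCart([]), 0, "an empty cart should cost nothing");
strictEqual(
  priceForCart([{ unitPrice: 250, quantity: 2 }]),
  500,
  "two items at $2.50 should total $5.00"
);
console.log("commit-stage tests passed");
```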
The Practices

In order to support the above principles, the following practices need to be in place:

  • Use CI Software. We use Atlassian’s Bamboo to build and deploy changes. It makes 0 sense to have people do this manually; people are good at creative tasks, while computers are good at repetitive, boring tasks.
  • Don’t break the build. Run the tests before you commit; don’t commit if something’s broken. An intern once asked me, “404s aren’t real errors, right?” He was so used to popping open the console and seeing a dozen 404 errors that he didn’t notice the one that mattered. We can’t just have errors sitting around in production that we ignore, or we train ourselves to ignore real errors too.
  • Don’t move on until commit tests pass. The CI server should be fast enough to give you feedback before you move on to another task; you should wait until you’re sure your commit is good before changing gears, so that if there’s something broken, you still have all the information you need loaded into your metaphorical RAM and don’t have to metaphorically swap pages to get to it.
  • You break it, you fix it. Take responsibility for your changes! If your commit breaks something else, it’s not the other author’s problem, it’s your problem, because you made the change. Pointing fingers is a bad habit to get into.
  • Fail fast. The authors suggest failing the build for slow tests. I agree, sheepishly; my functional tests are slow as heck, but I’m always trying to tighten the feedback loop and get developers information as rapidly as possible. They also suggest failing for linting issues, because they can lead to weird bugs later on. They suggest failing for architectural breaches, things like inline SQL when you have a stored-proc architecture, or other problems like that. The more you fail in dev, the less you fail in prod (there’s a small sketch of a time-budgeted test after this list).
  • Deploy to prod-like environments. You should be deploying to environments that mimic production before you get out of the testing cycle, to make sure it’s going to be a clean deploy. More importantly, what you deploy to that environment should be exactly, byte for byte, what you deploy to prod (the second sketch below shows one way to check that). With the new release process I’ve set up, that’s exactly what we do: we physically move the exact files, no building on the server anymore. Staging should be the exact same hardware, the same load balancing, the same OS configuration, the same application stack, with data in a known good state.
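Here’s a minimal sketch of the fail-fast idea applied to commit tests: wrap a test in a time budget and fail the build if it runs over, so the feedback loop never quietly degrades. The 100 ms budget and the test name are assumptions for illustration, not anyone’s standard.

```typescript
// Sketch: treat a slow commit test as a failure, not a warning.
async function withinBudget(name: string, budgetMs: number, test: () => Promise<void>): Promise<void> {
  const start = Date.now();
  await test(); // an assertion failure inside the test rejects and fails the build too
  const elapsed = Date.now() - start;
  if (elapsed > budgetMs) {
    throw new Error(`${name} took ${elapsed} ms, over the ${budgetMs} ms budget`);
  }
}

withinBudget("cart pricing", 100, async () => {
  // ...the actual fast, in-memory assertions would go here...
}).catch((err) => {
  console.error(err.message);
  process.exit(1); // a non-zero exit code is what fails the CI build step
});
```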

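And here’s one way to make “exactly, byte for byte” something you can check rather than something you take on faith: hash the artifact that staging tested and refuse to promote anything whose hash differs. The file paths and version number are hypothetical; the point is that promotion copies a file, it never rebuilds one.

```typescript
// Sketch: verify the candidate artifact is bit-identical to what QA signed off on.
import { createHash } from "crypto";
import { readFileSync } from "fs";

function sha256Of(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

const stagingArtifact = "artifacts/release-1.4.2.tar.gz";  // hypothetical: what staging tested
const candidateArtifact = "incoming/release-1.4.2.tar.gz"; // hypothetical: what we're about to push

if (sha256Of(stagingArtifact) !== sha256Of(candidateArtifact)) {
  console.error("Refusing to deploy: candidate artifact differs from what was tested.");
  process.exit(1);
}
console.log("Artifact verified; safe to promote.");
```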

I know that was a lot of dense information this week, but hopefully it gave you a nice clear picture of the goal state you can work toward. Was it useful? Let me know in the comments!