This is the second in our five-part series from guest blogger J. Paul Reed—build engineer, automation enthusiast, and host of The Ship Show podcast.
Jez Humble, author of Continuous Delivery and one of its founding fathers, has an informal survey he likes to give to audiences.
It starts with a simple question: “Raise your hand if you do continuous integration.” A sea of hands always rises. Then he says, “Put them down if all of the developers on your team don’t check into the main code-line at least once a day.” A large number of hands usually fall. Then he probes: “Put your hands down unless every check-in triggers a build plus a unit test run.” More hands go down. And finally: “Put your hands down if the build isn’t fixed within ten minutes of it going red.” Humble reports that more often than not, only a few hands remain raised.
What’s interesting to note: for someone who spends his days helping organizations construct software delivery pipelines to realize the benefits of continuous delivery, his survey gauntlet (which most organizations don’t make it through) isn’t about continuous delivery at all: it’s about continuous integration and the related behaviors required for CI to be successful.
He’s trying to illustrate the point that unless your organization has a stable, operationalized continuous integration environment and a culture of utilizing it effectively, moving toward continuous delivery will prove to be a painful waste of time and resources. So if you’re looking to move toward the “real-time,” continuously-delivered world we discussed in part 1, the first step is to take a good look at your business’ implementation of continuous integration and ensure it is as stable and reliable as you assume.
Masters and slaves
The first step in determining how operationalized your continuous integration infrastructure is involves looking at its constituent parts: the master continuous integration server and the slaves that do the work. Here are some questions that will give you insight into the current state of your CI world.
For the master:
- Is the machine configuration under configuration management (CFEngine, Chef, Puppet, etc.) so it can be rebuilt from scratch in a totally automated fashion? Can this process run self-contained, or does it require downloading bits (plugins, etc.) from external websites? How long would a rebuild take?
- Is the data contained within the CI master available elsewhere? Are items such as build logs (especially for shipped builds), artifacts, test results, and other build metadata recoverable, or would data as fundamental as build numbers be lost if the master disappeared?
- Is there access control to the system configuration for the master server, or can anyone log in and change anything? How about the individual job configurations? When changes are made, is there an audit trail? Are stakeholders notified?
- Are job configurations in version control or specified within the CI tool itself? Are job configuration changes tracked anywhere?
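To make that last question concrete, one lightweight approach is to fingerprint each job’s live configuration and diff it against the copy checked into version control. The sketch below assumes you can export each job’s configuration as text (Jenkins, for instance, exposes a config.xml per job); the function names are illustrative:

```python
# Sketch: detect CI job configurations that have drifted from the
# version-controlled copies. How you fetch the live config text is
# tool-specific; these functions just do the comparison.
import hashlib

def config_fingerprints(job_configs):
    """Map each job name to a stable hash of its configuration text."""
    return {name: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for name, text in job_configs.items()}

def drifted_jobs(live, committed):
    """Return jobs whose live config differs from (or is absent in) version control."""
    return sorted(name for name, digest in live.items()
                  if committed.get(name) != digest)
```

Run on a schedule, a report like this doubles as the audit trail the previous question asks about.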
For the slaves:
- Are the slaves under total configuration management so they can be re-created in an automated fashion for all supported platforms? (Oddly, I often see Linux slave configuration automated, but Windows and Mac slaves are left in various degrees of manual configuration.)
- Who has login access to the slaves? Are slaves considered “dirty” if logged into, and if so, what happens to that slave after?
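One way to answer the “dirty slave” question systematically is to audit each slave against the manifest your configuration management claims it should have. A minimal sketch, with illustrative package lists standing in for whatever your CM tool reports:

```python
# Sketch: flag a slave as "dirty" when its installed packages diverge
# from the configuration-management manifest. The package names are
# illustrative; gather the real lists from your CM tool.

def audit_slave(installed, manifest):
    """Return packages missing from, or installed beyond, the manifest."""
    installed, manifest = set(installed), set(manifest)
    return {"missing": sorted(manifest - installed),
            "unexpected": sorted(installed - manifest)}

def is_dirty(installed, manifest):
    """A slave is dirty if anything is missing or unexpected."""
    report = audit_slave(installed, manifest)
    return bool(report["missing"] or report["unexpected"])
```

A dirty slave can then be automatically taken out of rotation and rebuilt, rather than quietly producing builds that differ from its peers.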
These may seem like simplistic questions, but it is still incredibly common for jobs to fail on certain slaves yet work on others. Or for developers to demand login access and make configuration changes that never get re-incorporated into the build environment process, whether for the build/release teams or for developers’ local machines.
A general question that applies to both the master and the slaves: who is responsible for maintaining the machines that make up your CI infrastructure and for backing up critical data? Oftentimes, this is a function served by a separate IT team. If so, beware conflicting requirements: a client once had a frustrating problem where the CI server suffered intermittent failures during the day. It turned out a developer had been “helpfully” taking the server down in his free moments to make backups (killing every running build in the process).
In another situation, the QA and release teams were banding together to burn the midnight oil on a huge release. Just as they started the release process, the CI infrastructure went down. Turns out the IT team was adhering to their published backup and maintenance schedule for the CI servers, but no one had thought to communicate the critical release’s schedule. The escalation chain to get the backup process halted and the CI systems back up at 2 am so the company could meet its early morning deadline left everyone in a bad mood. Moral of the stories: no matter who serves these functions in your organization, they need to be part of the communication loop with the CI tool administrators and its users.
It may sound absurd, but a good gauge of your CI infrastructure’s state is knowing the answer to the question: “If we decided to completely switch CI tools, how long would it take us to move all of the configuration, and could we recover all of the metadata, logs, artifacts, and other information from past builds we care about?” If the answers are “a long time” and “no,” then there is work to do. (And you’d be surprised how often this actually happens in practice, as team members, opinions, and tools change.)
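One hedge against that scenario is to persist build metadata outside the CI tool itself, so build numbers, commits, and artifact locations survive a tool migration. A minimal sketch; the field names are assumptions, and you would keep whatever metadata your organization actually cares about:

```python
# Sketch: externalize per-build metadata as small JSON records, one per
# build, in a store independent of the CI tool. Any file share, object
# store, or database works; this just shapes and serializes the record.
import json

def build_record(build_number, commit, artifacts, test_results):
    """Assemble the metadata worth keeping beyond the CI tool's lifetime."""
    return {"build": build_number, "commit": commit,
            "artifacts": list(artifacts), "test_results": test_results}

def archive_entry(record, out_dir="ci-archive"):
    """Return the destination path and serialized payload for one build."""
    path = f"{out_dir}/build-{record['build']}.json"
    return path, json.dumps(record, indent=2, sort_keys=True)
```

With records like these, “could we recover the metadata?” becomes a question about your archive, not about whichever CI tool happens to be installed this year.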
Users of cloud-based CI services may assume they needn’t worry about any of these issues, but it is common as an organization grows or changes for CI functionality to be moved to other cloud infrastructure, perhaps to take advantage of bulk pricing or move to a private cloud. Sometimes, demands for faster builds may necessitate bringing CI infrastructure back in-house or to bare metal… in which case, you’re effectively “switching CI tools,” and must take all the above into consideration.
Baby steps for the QA and release teams
As the teams responsible for your CI infrastructure do their operational and stability reviews and address any issues, that’s a good time for other teams who will help the continuous delivery journey to start looking at the state of their worlds.
Since fully automated (unit and integration) testing and quality assessment is a requirement for continuous delivery to provide any business value—otherwise, you’re just shipping garbage quickly—QA teams can start assessing the work ahead of them. The best methodology I’ve seen is to tackle this on two different fronts. On one, start acculturating the teams to the necessity of writing automated unit or integration tests for each and every filed defect and then start writing them in your most critical components! This will start gnawing away at uncertainty and regression risks from bugs that have been found.
On the other front, integration and functional tests can be written that test not for aberrant behaviors, but for the intended ones. Many of these tests may currently be manual or require human intervention, so the focus is on making them fully automated in such a way that no humans are involved in the execution of the tests. In many cases, this requires the QA team to evaluate and communicate test environment requirements to the team responsible for the CI infrastructure or if they operate their own test infrastructure, go through the above operational exercises as well. For certain types of software, dedicating time to the creation of automated “fuzz tests” can also prove very beneficial and provide a lot of value in a continuous delivery context.
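For teams adopting the test-per-defect habit, a useful convention is naming each regression test after its tracker ID, so the link back to the filed defect is obvious. A sketch using Python’s unittest; the function under test and the bug ID (BUG-1234) are hypothetical:

```python
# Sketch: a regression test tied to a filed defect. The convention is
# what matters: every fixed defect gets an automated test, named after
# the defect, that would fail again if the bug regressed.
import unittest

def normalize_version(version):
    """Toy function under test: pad a dotted version to three components."""
    parts = version.split(".")
    return ".".join(parts + ["0"] * (3 - len(parts)))

class TestBug1234(unittest.TestCase):
    """BUG-1234 (hypothetical): two-component versions compared incorrectly."""

    def test_short_version_is_padded(self):
        self.assertEqual(normalize_version("1.4"), "1.4.0")

    def test_full_version_is_unchanged(self):
        self.assertEqual(normalize_version("1.4.2"), "1.4.2")
```

Once these run on every check-in, each closed defect permanently shrinks the pool of possible regressions.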
When starting to analyze what work may need to be done on your software’s build process, looking at the endpoints of the build/release process is particularly useful:
- Is the coupling between the source control system and the CI infrastructure, the root of the CD pipeline, stable? Sounds like a silly question, but in the world of cloud-based source control and many tiny Git repos instead of one monolithic repo, I run into all sorts of failure modes where a commit doesn’t reliably kick off a build when one would be expected.
- On the other end of the build process, is the final packaging entirely consumable, whether by customers or by the deployment automation? Do those artifacts reside in a place from which it is easy to automate deployments to a production environment, or from which customers can retrieve them? What is the artifact retention and management story for those builds?
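For the first question above, one way to catch the “commit silently didn’t trigger a build” failure mode is to periodically reconcile recent commits against the builds your CI tool reports. A sketch with illustrative data structures; in practice both lists come from your SCM and CI tool APIs:

```python
# Sketch: find commits that no CI build claims to have built, i.e.
# commits whose trigger (webhook, poll, etc.) was silently dropped.

def untriggered_commits(commits, builds):
    """Return commits, in order, that are absent from every build record."""
    built = {build["commit"] for build in builds}
    return [commit for commit in commits if commit not in built]
```

Running this reconciliation on a schedule turns an invisible failure mode into an alert.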
As a fan of Law and Order, I often dub these questions, collectively, as the “Build Chain of Evidence.” Environments still exist today where there is no clear way to figure out what commit(s) went into a particular artifact, where the test data that illustrated a critical regression is, or what build that data relates to, and no one can tell whether or not a particular artifact is important and should be kept. A continuous delivery pipeline relies on this chain containing all of the important (meta-)data and (obviously) that “chain of custody” not breaking at any point within the pipeline.
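That “chain of evidence” can be made concrete as a manifest that travels with each artifact, recording the commits, build number, and test results that produced it. A sketch with assumed field names:

```python
# Sketch of a "chain of evidence" manifest: every artifact carries the
# commits, build number, and test-result pointer that produced it, plus
# a content hash so the artifact itself can be verified later.
import hashlib

def evidence_manifest(build_number, commits, artifact_bytes, test_report_url):
    """Record what went into an artifact and where its test results live."""
    return {
        "build": build_number,
        "commits": list(commits),
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "test_report": test_report_url,
    }

def verify_artifact(manifest, artifact_bytes):
    """Confirm an artifact on disk is the one the manifest describes."""
    return manifest["artifact_sha256"] == hashlib.sha256(artifact_bytes).hexdigest()
```

With a manifest per artifact, “which commits went into this build, and where are its test results?” has a mechanical answer instead of an archaeological one.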
While all of these suggestions might sound like good practice, you might ask yourself why investing so much effort in CI is worth it. “We want continuous delivery! Why spend time on CI?!” The answer: because the two build on each other, and that old analogy about foundations is relevant when building your “CD house.”
Since continuous delivery requires that the software release infrastructure be used… well… continuously, it requires a different perspective. If the CI infrastructure upon which your CD efforts are built is rushed or thrown together, you run the risk that parts of it will fail. Unlike in traditional release models, when that infrastructure breaks, it severely impacts every part of the company. In a more waterfall release model, there is time and space for the crashy CI server to be rebooted, for artifacts to be moved around because the file server filled up again, or for release engineers to log onto every slave and install that package a developer installed on only one slave during a debugging session. In continuous delivery, such problems bust the pipeline for everyone, from developers committing code to customers getting new packages. And such breakages become incredibly obvious to the entire company.
That is why investment in making your organization’s continuous integration infrastructure a first-class citizen will not only pay dividends as you work toward continuous delivery, but is actually a requirement if you are to build a delivery pipeline that won’t spring leaks and burst open in times of increased pressure and development flow.
Once you have a good CI foundation built, you can start looking at how to move toward continuous delivery in your organization. In the next article, we’ll examine a couple of CD pipelines in two very different industries to give you some concrete ideas about what continuous delivery looks like.
Editor’s note: There’s more to CD than just tooling, as we’ve seen here. But great tooling doesn’t hurt! Check out Atlassian Bamboo, the first world-class continuous delivery tool on the market.

