A skeptic’s guide to continuous delivery, part 5: a CD parable

This is the last in our five-part series from guest blogger J. Paul Reed—build engineer, automation enthusiast, and host of The Ship Show podcast.

We’re ending our “Skeptic’s Guide to Continuous Delivery” series with a parable about putting all the techniques we’ve discussed together to start your own CI/CD journey. The following story is based on real people and real companies. Names have been changed to protect the innocent. And the guilty.

“It was the worst of times…”

This release wasn’t going well. But then, that wasn’t new.

Erin joined Data Processing International about nine months ago as its lead build engineer. DPI was starting a giant foray into online services, but their bread and butter was still software that customers actually had to download and install. The role appealed to Erin precisely because of the challenges of shipping software the “old school” way. Plus, the integration work with the online services was going to keep things fresh and interesting. Or so she’d thought.

But here she was, stuck in the office on yet another Saturday, getting the release back on track. Trudging into the office, she tried to remember how she’d gotten here. She wasn’t new to DPI’s release process, but with hotfixes, quarterly patch releases, and betas for the next major version all piling up over the months since she’d joined, she never seemed to be able to make any lasting improvements.

As she sat down in the desolate office, she wasn’t looking forward to opening her terminal. All she knew, she knew from a frantic phone call from the project manager, Sophia. Sophia had called Erin at 11 last night, asking where the build Erin had kicked off was. Erin didn’t know, but she knew she’d have to come in to find out. If there’s one thing she’d learned over the months with DPI’s build infrastructure, it was that you never quite knew whether the cause of a missing (or broken, or hung) build was something trivial… or a frustratingly complex, bottomless rabbit hole of a problem.

Erin had seen just about everything in her short tenure at DPI. Like developers landing fixes on the wrong branch, causing conflicts and breaking the build. (This turned out to be a pretty common occurrence, since DPI created a branch for every release; branches numbered in the hundreds. The “good” news was that developers only had to keep track of about twenty active branches.) If it wasn’t that, QA had a habit of making changes to the acceptance tests that required environment changes… except no one told Erin. And so the environment never got updated on the build machines. One time, this delayed a critical hotfix. Another time, it seemed like no matter how they tried, a security fix for a widely shared library component broke the various DPI apps that relied on it. Fixing one app’s usage of the library broke another. That game of whack-a-mole took almost a week to iron out, so that each application on the various branches worked and the security problem got fixed.

But it wasn’t just the foibles of the development process that caused problems: the build infrastructure resembled a game of Jenga, two turns from the end. The source control server had a tendency to fail at random intervals. At least twice a month, Erin had to navigate a gauntlet of angry developers first thing in the morning, because they couldn’t check out code. IT always did “some magic” to fix it, but kept brushing her off when she asked what it was. Erin never had time to care too much about the details anyway: she was too busy restarting the overnight builds that had failed due to the AWOL source control server as QA lined up at her desk, demanding constant status reports.

The build system itself was a set of shell scripts and batch files that used a twisted maze of telnet and ssh to kick the process off on different machines… but none of the machines communicated with each other and none of the logs were collated. Often, builds failed because the scripts didn’t set the correct error statuses. And even when they did, success wasn’t assured. In these cases, Erin was left to log into at least ten different machines to sift through logs to find the culprit. Often, she found multiple errors, but since they didn’t halt the build process, Erin assumed they were “expected.” (She’d never been able to get a straight answer on that, either.)

When it wasn’t one of these problems, it was often configuration skew in the build environments. All the machines used a shared password, which QA engineers, developers, and IT staff were all given on their first day and made plentiful use of. There had been a number of customer escalations Erin eventually traced back to an innocuous tool being installed on the build machine. (By whom was anyone’s guess.) And then there was the ongoing issue of server capacity: Erin had been working with IT to get more build machines, but every time they provided her with new servers, she had to spend half a week reconfiguring and installing software, server by server, since it was all done by hand. Even she made the occasional mistake while doing this, since the versions of the fifty-some-odd necessary tools often changed. When Erin was looped in on those changes, it was often in the form of an email, pointing her to a Windows file share path and a request to “install this.” She often wondered where those installers came from, but such inquiries were only met with “when will it be installed by?”

Back to the dreary Saturday morning, peering through all the logs, Erin eventually found the offender: the licensing file. DPI generated user license keys with certificates and these certificates had to be generated and checked in for all the builds. Except… not really. Some branches had valid certificates that could be reused. Most of the time. Despite her tenure there, there were still parts of the process she didn’t quite understand and while she’d suggested making a build checklist for these sorts of things, her suggestion was repeatedly spurned: “No time,” Sophia always said. When running into this specific issue, it seemed as if Sophia took delight in brow-beating developers who forgot to generate the certificate on the right branch and QA leads who missed validating it. When they corrected the issue, it still caused at least a day’s delay. And even though Sophia said she’d resolve it for the next release, it kept happening somehow.

Erin called Sophia to give her the bad news. The first five minutes of the conversation were spent on a lecture about why Erin had missed this “major detail” again. Sophia promised to find the on-call developer, but warned that he was camping with his family, so it might take a while to get hold of him. Erin sighed as Sophia explained, in a speech Erin could almost recite now, how important this particular build was to the entire company and how it’d be “really great” if she could stay at the office to kick it off the second the certificate was checked in.

Each build was always “critical to the entire company,” Erin mused, as she sent off an email to her friends, canceling her Saturday plans yet again.

“It was the best of times…”

Bright-eyed and bushy-tailed, Erin sat down at her desk, ready to face the week. As she fired up email and sipped her coffee, a developer ran over to her desk. “Here we go,” Erin thought. “I can’t check out! The server must be down again,” the developer exclaimed. But after a few minutes of diagnosis, a minor configuration problem with his client was solved and Erin was back to her coffee, reviewing the continuous integration dashboard and planning her day.

Things had certainly changed in the last six months. DPI had missed its date for its next big product launch by a full two months, and the online service integrations weren’t going any better. DPI’s VP of Engineering, furious at the slips, started holding meetings to find answers. Team after team pointed to infrastructure problems and as the face of release engineering, Erin had to endure some very uncomfortable conversations. Initially, the VP didn’t buy Erin’s reasoning that the schedule slips were directly related to the obliviousness surrounding DPI’s release infrastructure. When Erin realized her explanations focused just on the issues in her part of the system, and thus sounded a little too “convenient,” she saw that she needed to expand the scope of the conversation. She’d just finished reading a book called The Phoenix Project and strongly suggested he, too, take the time to read it. Shortly thereafter, changes became palpable. The VP immediately cut all features lacking tests from the new release scope. It was a hard sell to the business and product management (especially to those whose pet feature was cut), but he demanded his product managers answer: “We’re already going to be shipping late. Do you want it to be garbage, too?”

The VP took an active interest in the path code took from commit, through to testing and into a shippable product. He added two headcounts to Erin’s team, which they’d been able to fill. One engineer started working with Erin immediately on replacing the home-grown, octopus-esque scripts with a proper continuous integration and orchestration tool. After explaining to the VP that even the source control server wasn’t backed up and failed almost weekly, Erin was immediately sent to training, with the intention of the server being transferred to her team. IT initially cried foul at the idea, until Erin suggested they send someone from IT with her; she suggested Alan, who’d always been interested in source control. The VP loved the idea of eliminating the so-called “bus factor” and having the two be able to cross-train their entire team.

In fact, that pattern played itself out all over the Engineering department: while unfamiliar at first, QA shifted how it worked so that it was pairing more with developers on testing specific features. A few engineers had always done this, but the VP mandated that everyone on the development and QA teams take ownership of quality and work together on tests, whether narrow, like unit tests, or broad, like integration tests. Unsure where to start, they tackled that big shared library every DPI app used. Developers eventually got to the point where they took good-natured glee in rejecting review requests for changes that lacked unit or integration tests.

Erin and her team set up a version-controlled artifact repository, so changes to the build and test environment had to be run through that repository, in code. No more shared drive with bits from various corners of the Internet. This had a side benefit no one expected: IT didn’t have to figure out what to put on new hires’ machines anymore to get them able to compile the product. Developers just ran the configuration management tools in “solo” mode on their desktops. Erin also started hacking away at branches, getting the number down to four active branches. All the teams pondered a branch and merge policy for a few weeks, and one of the projects Erin was most proud of resulted: an auto-merge tool automating the process, now that the flow of the code between releases was better understood and had been standardized.

Some didn’t survive the beginnings of the transformation. The VP asked Sophia point-blank why the certificate problem had occurred more than once. Her answer revealed a collection of checklists and documents she’d been keeping and storing off the file share ever since she’d been given the responsibility of managing releases. She confidently explained that she used the details contained therein to keep various teams “in line and on their toes” and to give her needed visibility into the state of each release. It was announced she’d be “moving on” a week after that meeting.

As Erin prepared an email about a broken build, she thought about the work left to be done. “I suppose it’ll never end,” she thought. But at least it was working on DPI’s software delivery pipeline and all the tooling around it, instead of chasing down developers to fix their broken check-ins. Or spending endless Saturdays rerunning builds. She’d found herself more engaged in her work and actually felt like it wasn’t just an endless treadmill of brokenness and despair. Plus, she loved when she was able to introduce a new bell or whistle on the pipeline that delighted developers and wowed product management.

Just as she was about to click send on the email to Engineering asking if anyone had seen the failure, she noticed a notification on the CI dashboard: a developer had annotated the bustage with a note, checked in a fix, and promised to make sure this check-in passed tests before heading home for the day.

Erin sat back in her chair, smiling to herself, as she clicked cancel on her email. The compose window disappeared, revealing a text editor, with the source code for her next big software delivery pipeline addition.

 

Editor’s note: There’s more to CD than great tooling. But great tooling doesn’t hurt! Check out Atlassian Bamboo–the first world-class continuous delivery tool on the market.
