Comprehensive CI/CD System Design
Continuous integration and delivery is finally becoming a common goal for teams of all sizes. After building a couple of these systems at small and medium scales, I wanted to write down ideas, design choices and lessons learned. This post is the first in a series that explores the design of a custom build system created around common development workflows, using off-the-shelf components where possible. You’ll get an understanding of the basic components, how they interact, and maybe an open source project with example code from which to start your own.
Defining Requirements
I don’t like reinventing the wheel, but experience shows that existing solutions rarely accommodate deviations from their prescribed workflows. When they do, customizing them takes as much or more work than writing your own. That means it’s important to clearly define your requirements before jumping into this type of exercise, even if you’re simply doing an evaluation.
Branching Strategy
While not the first thing that comes to mind when talking about builds, the code commit workflow is probably the most important. The build system must support the way you develop code, not hinder it. Some organizations can afford to be flexible and adjust their process, but most cannot. Remember that usually we add build solutions to existing teams, not the other way around.
A deep discussion of branching strategies is outside the scope of this article. I’ve seen quite a few systems work over the years, but the one I find the most stable is as follows:
- Establish two code branches: one for release candidate code (`master`) and one for released code (`production`).
- Develop all code in a feature branch of `master`.
- Merging to `master` requires code reviews and all automated tests to pass in the feature branch. This only works with infrastructure in place to support it.
- Only allow fast-forward merges onto `master`. This means the feature branch must be up to date with the tip of `master`. It can be an issue if your team moves really fast, but it puts the responsibility of having working code on the person making the changes rather than on someone who doesn’t even know what changed. A sketch of how to enforce this follows the list.
- Commits to `master` automatically deploy to a staging area.
- Merging to `production` requires approvals from stakeholders, and all tests running on staging must pass.
- Commits to `production` can also deploy automatically.
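Enforcing the fast-forward rule is straightforward to automate. Below is a minimal sketch, assuming the repository is already cloned and `origin/master` has been fetched; the branch names are placeholders:

```python
import subprocess

def is_fast_forward(base: str = "origin/master", feature: str = "HEAD") -> bool:
    """Return True if `feature` already contains the tip of `base`, meaning
    merging `feature` into `base` can be done as a fast-forward."""
    # `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", base, feature],
        capture_output=True,
    )
    return result.returncode == 0

if not is_fast_forward():
    raise SystemExit("Feature branch is behind master; rebase before merging.")
```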
This workflow adds a level of organization to your repositories that answers two common questions posed by everyone who interacts with your team:
- How do I get the most stable release of code? Use the tip of the `production` branch.
- How do I get golden release candidate code with the most up-to-date features before they make it to production? Use the tip of the `master` branch.
It’s useful for explaining how things work, training new developers, and checking on the state of a release. But I also find it adds tremendous integration value with other teams.
For example, let’s say you’re responsible for building a REST API as the backend to a GUI, while another organization owns a Python client library that uses that API. It’s very simple for that separate team to monitor the commits in your master branch to make sure that any changes to the API do not affect their client library.
In fact, it’s easy to completely automate that check and report status back to the API repository, preventing a production release when a failure occurs.
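For illustration, here’s what reporting that status back could look like using GitHub’s commit status API; the repository names, context string and token handling are placeholders:

```python
import os
import requests

def report_status(owner: str, repo: str, sha: str, state: str, description: str):
    """Attach a build status to a commit; it shows up in pull requests and
    can block merges when combined with protected branches."""
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    response = requests.post(
        url,
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={
            "state": state,  # "pending", "success", "error" or "failure"
            "context": "client-library/integration",  # hypothetical check name
            "description": description,
        },
    )
    response.raise_for_status()

# e.g. report_status("acme", "rest-api", "abc123", "success", "Python client OK")
```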
Requirements
Following are the main attributes that will make up this system:
- Simple build instructions - must be easy to explain both to a human and a computer how to build a particular deliverable (an example spec follows this list).
- Short 3-stage pipeline - must cover automated tests, end-to-end tests and final deployment across a master / production branch strategy.
- Able to build and test more than just Python code.
- Deliverables in the form of PyPI packages or Docker containers. Consider Flatpak in the future.
- Initiate builds through webhooks - this allows for integration with lots of 3rd party software, as well as custom chatops interfaces and GUIs.
- Provide a way to read log output that facilitates debugging.
- Ability to manage resource pools needed for testing.
- Status reporting in the form of events - this also lets us build integrations with other systems.
- Leave room for manual test efforts - let’s face it, you can’t automate everything, sometimes you need a good ol’ human to manually poke at an interface. This level of interaction must be supported.
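To make the first requirement concrete, here’s what a build specification could look like. This is a hypothetical schema, not a finalized format, shown with PyYAML doing the parsing:

```python
import yaml  # PyYAML, assumed available

# A hypothetical build spec that would live at the repository root.
SPEC = """
image: python:3.11
stages:
  test:
    - pip install -r requirements.txt
    - pytest
  staging:
    - ./deploy.sh staging
  production:
    - ./deploy.sh production
artifacts:
  - type: pypi
"""

spec = yaml.safe_load(SPEC)
for command in spec["stages"]["test"]:
    print("would run:", command)
```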
Picking Technologies
With the goals now clear, let’s start making high-level design choices that help guide the next set of specifications.
Git and GitHub
Integrating with git as the version control system makes the solution available to the largest number of users. There are alternatives, but none of them have such a large ecosystem of tooling, support and documentation.
This is not your typical discussion point, but it’s important to note because of what we’re creating. A build system must complement the version control tools used to write the code it works with.
Speaking of tooling, the platform of choice for those integration points on top of git is GitHub. As you know, GitHub hosts a large portion of open source projects, and its Enterprise version is already used by large corporations all over the world.
GitHub maximizes the number of users, but also gives us the following basic capabilities that tie directly into the requirements:
- Pull Requests - A way to document the changes coming into any branch, along with an interface for code review feedback, and a mechanism to track build and test status.
- WebHooks - The notifications that ping the build code when it’s time to start a new build. From pushing code, opening a pull request, opening an issue, or simply adding a comment, almost every repository event can trigger a webhook (a receiver sketch follows this list).
- Well documented APIs - The GitHub API allows us to perform every action needed to satisfy our goals. It’s possible to download specific files, commit code changes, create pull requests, report status, etc.
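To give a flavor of the webhook side, here’s a minimal receiver sketch using Flask; the `enqueue_build` function is a stand-in for the execution subsystem, and the signature check follows GitHub’s `X-Hub-Signature-256` scheme:

```python
import hashlib
import hmac
import os

from flask import Flask, abort, request

app = Flask(__name__)
SECRET = os.environ["WEBHOOK_SECRET"].encode()  # shared secret configured on GitHub

@app.route("/webhook", methods=["POST"])
def webhook():
    # GitHub signs the raw payload; reject anything that doesn't verify.
    signature = request.headers.get("X-Hub-Signature-256", "")
    expected = "sha256=" + hmac.new(SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(403)

    event = request.headers.get("X-GitHub-Event")
    payload = request.get_json()
    if event == "push":
        enqueue_build(payload["repository"]["full_name"], payload["after"])
    return "", 204

def enqueue_build(repo: str, sha: str):
    print(f"queue build for {repo} at {sha}")  # placeholder for the executor
```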
While GitHub seems like an obvious choice, I should point out that there are viable alternatives. The most common is GitLab, which provides very similar interfaces and workflows. Phabricator, with its support for Mercurial repositories, is also of note.
Containers
Another key decision is the environment in which to build code. You want it fast, repeatable and simple to set up; something that avoids comments like “it worked on my machine”, while also providing a short startup time and minimizing space used. It needs to scale horizontally, allowing you to decrease build time or increase the number of parallel builds easily. I find that Linux containers deliver on all of these points.
I don’t wish to get into a religious war on container orchestration, but we do need a cluster over which to spread execution. Docker Swarm is good enough for this use case, with the added benefit of a simpler cluster setup with fewer dependencies. And the dockerpy Python module is complete enough to interface with it and manage all the services, tasks and containers.
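As a taste of that interface, the sketch below schedules a one-shot build task as a Swarm service through the Docker SDK for Python; the image, command and service name are placeholders:

```python
import docker

client = docker.from_env()  # assumes we're talking to a Swarm manager node

# Launch the build as a service so Swarm picks a node with capacity.
service = client.services.create(
    image="python:3.11",                       # placeholder build image
    command=["python", "-m", "build_runner"],  # hypothetical entry point
    name="build-1234",
    restart_policy=docker.types.RestartPolicy(condition="none"),  # run once
)
print("scheduled:", service.id)
```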
We did say “Linux” containers, didn’t we? Yes, this does limit our environments to Linux, but it doesn’t mean we can’t use other operating systems. In cases where Windows or OSX build and test environments are a must, there’s always message queuing.
Build tasks can run inside the container environment but schedule OS-specific work to run remotely in a pool of workers. Think RabbitMQ, Redis, or other publish / subscribe systems wrapped by Python modules like celery, dramatiq or huey.
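Here’s a rough sketch of that pattern with celery; the broker URL and task body are placeholders, and the queue name is what routes work to the right OS pool:

```python
from celery import Celery

app = Celery("builds", broker="amqp://guest@rabbit//")  # placeholder broker URL

@app.task
def run_native_tests(repo: str, sha: str) -> bool:
    """Body executes on whichever worker consumes the task, e.g. a Windows box."""
    ...  # clone, build, run the OS-specific test suite
    return True

# From inside a Linux build container, route the work to the Windows pool:
result = run_native_tests.apply_async(args=["acme/rest-api", "abc123"], queue="windows")

# A worker on a Windows machine would be started with:
#   celery -A builds worker --queues windows
```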
Anatomy of a Build System
One way or another, most systems implement some version of the following modules:
- Build Specification - describes how to perform a build. Typically implemented as a file in the root directory of the repository.
- Webhook Receiver - the code that receives notifications from repository tooling. Think of it as a REST API that receives JSON data.
- Execution Subsystem - the module and infrastructure that processes the build specification, performing the steps described in it. This is custom Python code running inside Docker in our case.
- Status Reporting - the mechanisms that provide feedback to the user. A lot of times it’s a bunch of REST API calls to other systems like GitHub itself, but also chatops.
- Resource Management - a non-trivial module and accompanying infrastructure that allows the execution code to check out resources, manage queues and track availability (a toy sketch follows this list).
- Logging - in order to debug problems, you’ll need a place to collect, or at least display, build and test execution logs. An API and interface to retrieve the info greatly improves the user experience.
- Interfaces - GitHub is the primary interface in this case, but chatops systems like Slack or Zulip can greatly improve usability.
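To close, here’s a toy version of the resource management idea; a real implementation would persist state in a database or Redis so checkouts survive process restarts, but the checkout semantics stay the same:

```python
import contextlib
import queue

class ResourcePool:
    """Toy in-process pool of named test resources."""

    def __init__(self, resources):
        self._pool = queue.Queue()
        for resource in resources:
            self._pool.put(resource)

    @contextlib.contextmanager
    def checkout(self, timeout=300):
        # Blocks until a resource frees up; raises queue.Empty on timeout.
        resource = self._pool.get(timeout=timeout)
        try:
            yield resource
        finally:
            self._pool.put(resource)  # always return it, even on test failure

pool = ResourcePool(["staging-db-1", "staging-db-2"])
with pool.checkout() as db:
    print("running tests against", db)
```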
What’s Next?
The next post will expand into more details and the design choices for the subsystems outlined in the last section. If you have any questions or any suggestions on what you’d like to see in this series, feel free to discuss with me on Twitter.