Best Tips for Running Enterprise CI/CD in the Cloud
There are many solutions for building code: some are available as cloud services, others run on your own infrastructure, on private clouds, or all of the above. They make it easy to create custom pipelines, along with simple testing and packaging workflows. Some even offer open source, feature-limited “community editions” that you can download and run on-premises for free.
What follows is my experience with the important aspects to consider when choosing a build system for an organization that depends on cloud services. The original intent was to include a detailed review of various online solutions, but I decided to leave that for another time. Instead, I’m writing from an enterprise viewpoint, which is closer to reality in a large organization than the usual how-tos. We’re here to discuss the implications and practicality of running builds in the cloud.
Implementing a build pipeline can leverage building blocks from several open source or paid services at different stages. It’s often more practical to do so, especially for smaller teams. Of course, this is just another trade-off where you exchange complexity of code for complexity of infrastructure. For those who are not familiar, here’s a list of choices and the integration points they provide; almost all of them offer both a cloud and an on-premises solution:
- GitHub - code repository and version control, lifecycle management through Pull Requests and Issues, stores artifacts in the form of Releases or Pages, tracks test or build results through Status or Checks API.
- GitLab - also a code repository and version control solution with lifecycle management through Merge Requests and Issues. Includes a CI/CD build and test pipeline system, stores artifacts as part of builds or Pages, and provides a deployment solution in the form of Docker registries.
- CircleCI - build and test pipelines with integrations to GitHub, Docker registries and other common deployment solutions. Supports multiple operating systems and pipeline stages.
- TravisCI - another multi-OS, multi-stage build and test pipeline that integrates with GitHub, GitLab and others.
- Codeship - cloud-only solution to a build and test pipeline with simple integrations. Provides dedicated build machines and local debugging.
- Azure Pipelines - cloud-only solution with build and test pipelines that integrate with code repositories. Provides deployment solutions within Azure in the form of VMs, Kubernetes, and packaging systems like NuGet or NPM.
- AWS and Google Cloud Platform also provide similar services to Azure.
Any of the managed services mentioned above can also serve as building blocks for a custom solution: one that runs locally on your lab infrastructure, fully remote on VMs inside cloud providers (DigitalOcean, AWS, Azure, GCP, etc.), or as a hybrid with the most sensitive pieces in the local datacenter and an optimized smattering of cloud VMs and managed services.
Can I build in the cloud?
When examining which cloud solutions to use in your organization (build systems or otherwise), the first and most fundamental question to ask is: does the business process allow it? While this can sound like a silly question, the reality is quite complicated. Using a cloud service implies several things that may be counter-productive for the company responsible for the software, and some might even prevent it from doing business at all. Here are a few reasons why that could be the case.
Source code travels over the public internet
Unless an organization connects to a cloud provider over private links, source code will travel over the public internet. Even when private links are available, some providers don’t guarantee that the data moving through them is encrypted unless you’re the one doing the encryption (VPN, SSH, HTTPS, etc.). Examples of private links include Amazon’s Direct Connect, Azure’s ExpressRoute and Google’s Interconnect services, all of which are very expensive.
Even with encrypted channels, there’s always the risk of poor implementations, social engineering, zero-day bugs and more, all of which demand regular maintenance and vigilance. There are many bad actors on the public internet, constantly trying to break into any system they can find using every software or social exploit at their disposal. Remember, it’s not just about trusting the code or process that does the build, but also the cloud provider, their tools, network and people, as well as the build service you integrate with.
If you don’t believe me, just bring up a free-tier server with any provider, open the default SSH port to the internet, enable password authentication, and watch your authentication log (that’s `/var/log/auth.log` on most Linux distributions). In less than a day you’ll see dozens of login attempts, and if you google their IP addresses, you’ll find they show up in several lists of known bots.
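If you want to see the numbers for yourself, a few lines of Python are enough to tally failed attempts per source address. This is a minimal sketch, assuming a Debian/Ubuntu-style sshd log at `/var/log/auth.log`; the path and message format vary by distribution:

```python
# Count failed SSH login attempts per source IP from a Debian/Ubuntu-style auth log.
# Log path and message format are assumptions; adjust for your distribution.
import re
from collections import Counter

LOG_PATH = "/var/log/auth.log"
FAILED_RE = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")

def failed_logins_by_ip(path: str = LOG_PATH) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = FAILED_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for ip, count in failed_logins_by_ip().most_common(10):
        print(f"{ip}\t{count} failed attempts")
```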
Another example to help explain the importance of using trusted 3rd-parties is this recent article about the ‘Cloud Hopper’ Hackers.
If the intent is to build open source software, then having your source, test or build code stolen or leaked is less of a concern. However, organizational rules around these interactions could still exist for various reasons beyond intellectual property - more on this later. It’s important to check for these before embarking on your next build adventure.
When the rules are not permissive, don’t just get annoyed and give up; ask questions and find out why. You may be surprised at the amount of legal or contractual complexities and guarantees your company has to deal with. At the very least, you’ll learn more about the value your organization provides, and you may even find the value proposition that a new build system proposal must overcome: save more money than the deals you risk losing.
Infrastructure and data used for testing or building may live in the cloud
It’s good practice to test software with datasets that closely mirror production systems. Sticking to this principle, many folks decide to clone production data as part of their test setup. While doing so can lead to better software, using the cloud means you need more awareness of the laws and regulations that governments have enacted to protect their citizens’ data (like GDPR) when it crosses borders. And not just your government: every government in a country where your service is available is in scope.
In some cases, if data crosses country lines - an easy thing to do with globally distributed services - you are now required to warn users during signup. Failing to do so could mean you broke the law, making your company - and in some cases you personally - liable for steep penalties. These are sometimes quantified as a percentage of revenue rather than a fixed dollar amount, so they can get quite high.
It’s not just about where the data lives, but also where it’s processed. Maybe instead of cloning the data, you’re simply getting it from a read-only production mirror in a specific datacenter every time a test runs. But if that test executes in a different country, you may also be in breach of data processing regulations.
Given that build and test infrastructure is created and destroyed often and quickly, take care that shortcuts favoring speed do not compromise security - not just the security of the data, but of the infrastructure as well. Consider that someone could gain access to these test servers and modify the code being tested, adding exploits to the source or giving you false-positive results. Depending on the type of status reporting and merge gates used, it’s also possible for someone to mark builds as passed when they aren’t.
Extra special care should go into packaging systems. Add checks that verify the packaged code actually came from the source that was uploaded, especially if you’re using a third-party build service. Cryptographic hashes help validate that the source is unmodified, while signing each commit in the code repository verifies where it came from. Any packaging system should sign the final package to guarantee no one has tampered with it, and should produce hashes that download clients can use to check that the bits transferred came from the correct server. If you’re pushing to a public repository (like PyPI or NPM), that same hash also validates that the public servers didn’t change the package.
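As a concrete illustration, here’s a minimal sketch of the hash check described above, verifying a downloaded artifact against a hash published alongside it. The file name and expected hash are placeholders; in practice the expected value would come from your pipeline’s signed release metadata:

```python
# Verify that a downloaded artifact matches the SHA-256 hash published by the build.
# File name and expected hash below are placeholders for illustration.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as artifact:
        for chunk in iter(lambda: artifact.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: str, expected_hash: str) -> None:
    actual = sha256_of(path)
    if actual != expected_hash:
        raise ValueError(f"hash mismatch for {path}: expected {expected_hash}, got {actual}")

if __name__ == "__main__":
    # The expected hash would normally be read from signed release metadata.
    verify_artifact("mypackage-1.0.0.tar.gz", "deadbeef" * 8)
```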
Compliance
Any organization - public, private, open source or closed - that wishes to do business in certain industries like healthcare, government or aerospace will need to comply with a set of certifications. They are very expensive to obtain, time consuming and detailed. Requirements vary based on industry, security, data sources, processing, and many other factors. You may have already heard of some of them by their acronyms, like HIPAA or SOC.
Complying with most of these certifications requires audit trails that prove the claims made by your build process. Whichever system you choose must answer questions like: Who made changes to the code? Who approved those changes? Do you have a validation process? Do you have proof that you follow it? Did you test the intended code, or a version modified by someone else? Where are the test results stored? Who approved the deployment? Did you package the validated code? Did you push the packaged code? Can I download the same code I pushed?
Providing those answers is easy if you know ahead of time that you’ll need them. The build system must support this and make the evidence simple to retrieve. You don’t pass a compliance audit just once; you have to keep repeating it multiple times a year. Failing one of these audits could mean you lose the certification, which usually means you’re unable to do business with the companies or agencies that require it, regardless of how good your product is.
Whether you’re building in the cloud or not, these requirements are likely to be the same, so keep them in mind when choosing a provider or service. Some systems have audit trails built in and configured for you, some require you to specifically set them up and enable them, and others leave you on your own. The difference with on-premises software (custom or not) is that your internal network and security may reduce the number of things your specific organization is responsible for proving.
Services like GitHub or GitLab already bring some audit capabilities to the table. Commits signed with PGP keys and the pull request process itself take care of the first few proofs. There are also the Status and Checks APIs, which integrate very well with third parties, allowing them to record a history of build steps, test executions and results for every commit. Travis CI is an example of an integration that takes advantage of this.
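To make that concrete, here’s a minimal sketch of a third-party integration recording a build result against a commit through GitHub’s commit status API. The repository, commit SHA, token variable and target URL are placeholders; a production integration would more likely go through a GitHub App and the Checks API:

```python
# Record a build result against a commit using GitHub's commit status API.
# Repository, SHA, token and target URL are placeholders for illustration.
import os
import requests

def report_status(owner: str, repo: str, sha: str, state: str, description: str) -> None:
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    response = requests.post(
        url,
        headers={
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "state": state,  # "pending", "success", "failure" or "error"
            "context": "ci/example-pipeline",
            "description": description,
            "target_url": "https://ci.example.com/builds/1234",
        },
        timeout=30,
    )
    response.raise_for_status()

if __name__ == "__main__":
    report_status("my-org", "my-repo", "0123abcd" * 5, "success", "All tests passed")
```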
Cloud services are not available all the time
Don’t expect a cloud service to be up 100% of the time - not that running locally will be much better. It’s like attempting to accelerate to the speed of light: the closer you get, the more energy it takes, and each extra “nine” of availability costs disproportionately more.
Moving to the cloud will not solve availability problems. It does shift who owns the resolution, but it’s still up to you to design a system that can handle failures, just like when running your own lab; a minimal retry sketch follows the failure list below. Default settings will not guarantee high availability, though they can provide some guidance for folks without a lab admin background. Be aware that this transfer of responsibility to a third party may not be an acceptable risk for your business.
Here are some examples of failures I’ve experienced over the years, maybe they’ll help you plan:
- The service you need is down.
- Provisioned resources are inaccessible.
- The cloud provider has a planned maintenance, but you somehow missed the notification.
- The API you’ve been using for years all of sudden changed its meaning or behavior and now your code is broken.
- There’s a massive security bug that requires immediate maintenance.
- Someone at the cloud service deleted your compute or data. Yes, it happens!
- Your application hit a new peak in usage - yay! - and auto-scaling consumed 80% of your budget in 2% of the time because you didn’t configure it correctly - bummer!
- The backup snapshots of your build servers or packaged artifacts take hours to restore, bringing your entire service down while doing so.
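Many of these failures are transient, so it pays to build retries with backoff into anything that calls out to a cloud or build service. Here’s a minimal sketch of that idea; the endpoint is a placeholder, and a real pipeline would also add jitter, alerting and a circuit breaker:

```python
# Retry a flaky HTTP call to a cloud service with exponential backoff.
# The endpoint is a placeholder; real code should add jitter and alerting.
import time
import requests

def call_with_retries(url: str, attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code < 500:
                return response  # success, or a client error worth surfacing immediately
        except requests.RequestException:
            pass  # network blips, DNS failures, timeouts: treat as retryable
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError(f"{url} still failing after {attempts} attempts")

if __name__ == "__main__":
    print(call_with_retries("https://api.example-build-service.com/v1/builds").status_code)
```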
Why integrate with cloud build services?
The discussion so far has focused on the complexities of managing a process that uses cloud services, now let’s go over the advantages.
Rapid Prototyping
Managed services not only provide the eventing system, compute power, storage, logging and networking you need, they also give you access to proven environments for different languages and operating systems. Testing infrastructure changes, process changes, new deployments, distribution systems and new deliverables, or putting together demo projects, all become easier when provisioning and managing infrastructure is minimized or eliminated.
Testing all the OS variations that your code supports can be daunting, especially when they include obscure releases or restricted licensing. Some smaller organizations simply will not have the budget to do that on-premises, but a cloud service can easily take care of it for you with pre-configured compute.
Scale
The big advantage of cloud build systems is easy scaling of infrastructure. If you’re using a custom system that runs inside a cloud provider, adding more compute power to handle a peak of fixes that need building is not complicated. The same applies when running out of storage space for artifacts and packages. Other popular use cases are decreasing test or compilation time with more compute, or implementing pipeline enhancements by spinning up multiple pre-production environments.
If you’re on a managed build service like TravisCI, building more code, more often and faster, can be even easier because you don’t handle infrastructure at all. Of course that may vary based on the dev, build and test process itself.
Distribution
Usually the last stage of the pipeline, deciding how to distribute built artifacts is a major piece of the puzzle. Doing this locally within an organization sounds simple, but it varies greatly with internal structure.
If the artifacts are meant for internal consumption only, but there are multiple geographical sites (a common case) now you have to worry about replication across those sites. This is not only important for availability, but also because latency becomes a major factor if the distance between sites is large. It also implies that the organization now has to run and maintain high uptime on its internal network.
While a managed build service may not store your artifacts, it usually brings easy integration with other cloud systems that work as content delivery networks (CDNs) or object storage: CloudFlare, CloudFront, S3, Azure Blob Storage, and so on. You won’t have to manage the storage, it’s easy to expand, and it can automatically distribute the files globally so your users can consume them locally.
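For example, once credentials are configured, pushing a build artifact into object storage takes only a few lines. Here’s a minimal sketch using S3 via boto3; the bucket and file names are placeholders:

```python
# Upload a build artifact to an S3 bucket so a CDN or download clients can serve it.
# Bucket and file names are placeholders; credentials come from the environment.
import boto3

def publish_artifact(local_path: str, bucket: str, key: str) -> None:
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)  # handles multipart upload for large files
    print(f"uploaded {local_path} to s3://{bucket}/{key}")

if __name__ == "__main__":
    publish_artifact(
        "dist/mypackage-1.0.0.tar.gz",
        "my-release-artifacts",
        "releases/mypackage-1.0.0.tar.gz",
    )
```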
Along the same lines, there are public repositories for packages like DockerHub, NPM, PyPI, Conda, Cargo, etc. Uploading into these systems is usually pretty easy and already automated by the packaging tools themselves. However, it’s also possible - and in some cases recommended - to run private versions of these systems inside the organization. Some on-premises build services, like GitLab, offer this as part of their solution. For those that don’t, there are always systems like Pulp, Artifactory and others.
Summarizing
It’s easy to cobble together a build system out of the many free and paid services out there, and several developer resources focus on how to do just that. But the real world is much more complicated than pointing a GitHub webhook at TravisCI, running `python setup.py sdist upload` to push a package into PyPI, and calling it a day. I hope this article brought to your attention some design points that might otherwise have been an afterthought. It may save you time and money in the future.