
Designing Continuous Build Systems - Parsing the Specification

Every code repository is different. The execution environment, the framework, the deliverables, or even the linters, all need some sort of customization. Creating a flexible build system requires a mechanism that specifies the steps to follow at different stages of a pipeline.

As the next chapter in the series Designing Continuous Build Systems, this article examines which instructions you’ll want to convey to your custom system and how to parse them. The focus is on a common solution: adding a file to the repository’s root directory that’s read by your execution engine when receiving new webhooks.

Defining the specification

Build instructions should be human readable, meaning it must be easy for anyone to determine how to put a build together, regardless of their knowledge of the languages used inside the repo. Even better, it should be possible to follow the information in this file manually and complete a build on your own.

I can’t stress the value of this enough. I’ve been in situations where something happened to the infrastructure and no one knew how to package and distribute the code by hand. Figuring it out by rummaging through obscure build code and hidden directives, all while trying to meet a deadline, is not fun!

I find that using a common language like YAML or TOML to write the specification works best. Some systems go as far as making domain-specific languages and force their users to learn how to talk “build”. Others don’t even keep that info in the repository; they maintain separate scripts written in different languages to run the builds.

While this can add value for large and complex implementations, it also adds confusion and frustration for everyone involved. It sucks to finish a coding assignment and realize it may take a week of learning new things to understand how to incorporate it into a build.

Directives

After choosing a file in the root directory to hold the instructions for this pipeline, it’s time to decide on directives. Keeping in mind the workflow defined in the first article, plus the use of Python and Docker to isolate builds and distribute compute, the file should contain sections as follows.

image

The name of a base Docker image to start building with. For example python:3-slim.

environment

A dictionary with entries used as the OS environment when starting a new build container through the docker-py package.
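As a sketch of how this section might wire into docker-py, the dictionary maps straight onto the environment keyword of client.containers.run(). The helper below only builds the keyword arguments (anything beyond image and environment is my assumption), and the actual run call, which needs a Docker daemon, is shown in a comment:

```python
def build_container_kwargs(config):
    """Translate parts of the parsed spec into keyword arguments for
    docker-py's client.containers.run() (a sketch; only 'image' and
    'environment' come from the spec file)."""
    return {
        'image': config.get('image', 'python:3-slim'),
        'environment': config.get('environment', {}),
        'detach': True,  # run detached so the build system can drive the container
    }

# With a Docker daemon available, the call would be roughly:
#   import docker
#   client = docker.from_env()
#   container = client.containers.run(**build_container_kwargs(config))
kwargs = build_container_kwargs({'image': 'python:3-slim',
                                 'environment': {'APP_ENV': 'test'}})
```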

install

The list of installation instructions in the form of shell commands that build and prep the execution environment for testing.

These will execute as child shell processes using the subprocess module. If anything here exits with a non-zero return code, the build fails.
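A minimal sketch of what running this section could look like, assuming the parsed config hands you a plain list of command strings:

```python
import subprocess

def run_install_steps(commands):
    """Execute each install command as a child shell process;
    the first non-zero return code fails the build."""
    for cmd in commands:
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            return False
    return True
```

In a real system you’d also capture stdout and stderr for the build logs, but that’s a topic for a later stage.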

linting

Assuming you don’t have pre-commit hooks to take care of this for you, this is an array of shell commands that run one or many linters. The section executes after the source code is set up inside the container and reports status separately.

It’s a good place to throw flake8, bandit or even black if you want to use a formatter as well. If this section fails, then only the linters failed, but not necessarily the build or the tests.

pypi

A dictionary to specify PyPI configuration details for use with pip and setuptools. These settings go into .pypirc and pip.conf files inside the container. You’ll need them to enable pulling and pushing package artifacts to private repos, mirrors or PyPI itself.
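For illustration, a pypi section with repository, username and password keys (those field names are hypothetical, not part of the spec) might be rendered into a .pypirc along these lines:

```ini
[distutils]
index-servers = private

[private]
repository = https://pypi.example.com/
username = builder
password = s3cret
```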

execute

This is where you test the build. It’s another array of shell commands to run with subprocess, and it uses background as an extra keyword to indicate whether a command should run non-blocking in the background. This section runs on every pull request opened against the master branch.

If any of the commands defined here fail, then your code might be packaged correctly, but it’s not fully functional.
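Here’s one way the background keyword could be honored, assuming each entry is either a plain command string or a mapping with command and background keys (that shape is my assumption, not defined by the spec above):

```python
import subprocess

def run_execute_section(steps):
    """Run the 'execute' steps. Each entry is assumed to be either a plain
    shell command string or a mapping with 'command' and an optional
    'background' flag."""
    background_procs = []
    for step in steps:
        if isinstance(step, dict):
            cmd, background = step['command'], step.get('background', False)
        else:
            cmd, background = step, False
        if background:
            # Non-blocking: start the process and keep the handle around
            background_procs.append(subprocess.Popen(cmd, shell=True))
        else:
            # Blocking: a non-zero return code fails the section
            if subprocess.run(cmd, shell=True).returncode != 0:
                return False, background_procs
    return True, background_procs
```

On failure you’d probably also want to terminate any background processes still running, which is why the handles are returned either way.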

docker-registry

Dictionary with the location of a Docker registry (like Docker Hub) and the credentials necessary to push images to it. Similar to the pypi section, it’s used for pushing artifacts to private repos.

staging

Also a list of shell commands, this time describing how to package and publish the code to a staging environment. It could also run higher-level integration tests as needed.

This section only executes when opening a new pull request against the production branch.

production

Same idea as above but for a production environment. It could be as simple as adding image tags to existing staging deployments, or removing release candidate (‘rc’) suffixes from python packages.

It only executes after successfully merging a pull request to production and is the final stage of the build.
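Putting the directives together, a hypothetical build file might look like this (written in YAML, the format discussed in the next section; all names and values are illustrative):

```yaml
image: python:3-slim

environment:
  APP_ENV: test
  DATABASE_URL: sqlite:///test.db

install:
  - pip install -r requirements.txt
  - pip install -e .

linting:
  - flake8 src/
  - bandit -r src/

pypi:
  repository: https://pypi.example.com/
  username: builder

execute:
  - command: python -m myapp.server
    background: true
  - pytest tests/

docker-registry:
  url: registry.example.com
  username: builder

staging:
  - docker build -t registry.example.com/myapp:rc .
  - docker push registry.example.com/myapp:rc

production:
  - docker tag registry.example.com/myapp:rc registry.example.com/myapp:latest
  - docker push registry.example.com/myapp:latest
```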

Parsing the configuration

As mentioned earlier, the file format selected for our configs is YAML. Why not TOML? While I don’t want to start some long internet argument of YAML vs TOML, I do find that YAML is still more human readable.

Maybe it’s because I’ve been writing python code for too long and enjoy indentation; maybe it’s because TOML reminds me of Windows INI files and my distaste for them. Either way, the choice is made, but I do want to spend some time discussing the complexities of serialization with some of these formats.

While I find that JSON is the leading standard for computer-computer data exchange, YAML and TOML seem to do much better in terms of human-computer communications. The main reason, of course, is the aforementioned readability. A human can get lost easily in the braces hierarchy that comes with JSON, whereas the indentation of YAML or bracket sections of TOML make the hierarchy considerably clearer to follow.

Parsing rules make this discussion more interesting. The YAML standard allows for a variety of constructs intended specifically for serialization - some of which can trigger code execution when deserialized - making a fully compliant parser inherently insecure for untrusted input. This is why YAML parsers tend to implement safe loading functions that restrict that syntax to simple types - in python, that’s PyYAML’s safe_load() (JSON’s json.loads() is already safe, since the format defines no such constructs).
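A quick sketch of the difference, assuming PyYAML is installed: safe_load() handles plain mappings fine, but refuses the language-specific tags that make full loading dangerous.

```python
import yaml

# Plain mappings parse into simple Python types
config = yaml.safe_load("image: python:3-slim\nenvironment:\n  DEBUG: '1'\n")

# Under a full loader, a !!python tag can instantiate arbitrary objects;
# safe_load refuses the tag and raises a ConstructorError instead
try:
    yaml.safe_load("!!python/object/apply:os.getcwd []")
except yaml.YAMLError as exc:
    print(f'rejected unsafe tag: {exc.__class__.__name__}')
```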

The way I see it, both formats are fulfilling a different use case today than the original problem they intended to solve, so only a subset of their specifications apply. However, if a parser wants to claim official support for any standard, it must include the entire syntax as defined in its spec. This leads us back to requiring safe functions that avoid the “gotchas”.

Think of it like SQL injection vulnerabilities: you must always use the correct functions when parsing strings that anyone can define.

TOML on the other hand, is designed as a config file format that maps into dictionary structures. Since it doesn’t cover the serialization / deserialization use cases, it’s much simpler and easier to parse while still providing better readability than JSON. If you’re new to TOML and don’t know what it looks like, here’s a short example:

[user]
name = "Your Name"
email = "[email protected]"

[account]
name = "Account Name"

The above is equivalent to the following YAML:

user:
    name: Your Name
    email: [email protected]

account:
    name: Account Name

Both of which parse into this python dictionary:

{
    'user': {
        'name': 'Your Name',
        'email': '[email protected]'
    },
    'account': {
        'name': 'Account Name'
    }
}

Since the PyYAML python package provides a yaml.safe_load() function that avoids security problems, regardless of the standard’s complexities, I still chose to go with YAML for readability.

It’s fine if you don’t agree; this is one of those situations where you can substitute the parser or format of your choice. Just make sure you’re aware of any potential security risks and mitigate them appropriately.

If you still have some concerns, there are a few more options for you. Have a look at the StrictYAML python package. It’s a custom parser that only implements a subset of the YAML spec, raises exceptions when it finds funky syntax, and allows for a schema definition that enables strict type checking.

Getting the configuration file

Now that you know which format to write the file in, it’s time to decide on a strategy to pull the file from a repository that needs building, while avoiding grabbing the entire repo contents.

Remember that you’re designing the initial stage of a pipeline; the only thing that’s happened so far is that the build system received a webhook indicating there might be work to do. No point in wasting time or resources if you don’t have to.

Since GitHub is the chosen repository management system, the solution is quite simple. A quick look at their REST API documentation provides the info needed.

Download files with GET /repos/:owner/:repo/contents/:config_filename, as defined in the Repositories -> Contents section. This works for anything under 1 MB in size. If your build file is larger, then something is not right and you should reconsider the format or directives.

Executing REST API calls against GitHub is easy. They use HTTP Basic Auth for authorization, which integrates well with the requests module.

There are two ways to get permission to access your repositories: use an auth token generated in the Settings -> Developer settings -> Personal access tokens page, or use a GitHub App (OAuth apps will be deprecated soon). This example uses tokens:

import yaml
import requests

from base64 import b64decode

# Request the file contents
response = requests.get(
    f'https://api.github.com/repos/{username}/{repo}/contents/{config_filename}',
    auth=(username, YOUR_TOKEN),
)
response.raise_for_status()

# File content is encoded as base64 inside the 'content' field
raw_config = b64decode(response.json()['content']).decode('utf-8')

# Parse the YAML
config = yaml.safe_load(raw_config)

To generate the token needed in the example above, click the Generate new token button in the personal access tokens page mentioned previously, and select the permissions you need (repo is enough in this case).

Clicking the Generate token button at the bottom of that page will submit the form and show the new token. You’ll only see it once, so make note of it; otherwise you’ll have to create a new one.

If you’re curious about how this works with a GitHub App, check out Integrating Pytest Results with GitHub. There’s a section on how to make an app and how to authenticate with one.

What’s next?

It doesn’t seem like much, but you just completed the design of the most important piece of a custom pipeline and evaluated how to implement it. Most of the choices ahead are consequences of the decisions made here. Although like any other large software project, those future design points may still bring you back here to improve or iterate on this definition.

Coming up next we’ll dive into listening for webhooks and parsing their payloads to kick off the different stages of a build.