tryexceptpass


Using GitHub as a Flat Data Store and AWS Lambda to Update it

I spend most of my day, every day, knee-deep in code. Optimizing, building, fixing and thinking through workflows can be taxing. This means that the last thing I want to do when I come home is deal with more programming. But I also like learning new things and communicating my experiences so they can help others. I do that through the posts on this website.

Maintaining a web presence without dealing with code means you get to use as many off-the-shelf components as possible. You consider things like WordPress or static site generators that let you concentrate on content while handling the user interface for you. Write in markdown, build the website, rinse and repeat with updates. It’s all very easy until you need a little more interactivity, like a comments section or a newsletter signup.

There are lots of services out there that help put together those functions. Some are paid and some are “free”, but selecting them is not so easy. With free solutions, the choice becomes to either host the system yourself - which you wanted to avoid - or use a popular service that does it for you while including advertisements (like Disqus).

My personal opinion is that if anyone advertises on my site, it will be me - or at least through a mechanism where I have final say - otherwise there’s no control over what users see. I don’t want a system that collects all kinds of data from my users. I want one that doesn’t even need user accounts. I’d hate to see some random ad for a hotel in Ibiza while I’m at a tech blog, simply because I googled Mediterranean vacations last night after exhaustion settled in.

A “low-maintenance” solution

Assuming we want to build a system that requires some data backend like a comments section, we’ll need a place to store records and a way to retrieve them. Databases can’t be part of the solution because we don’t want to manage them and we really don’t need anything fancy, not even indexing. All I’m looking for is a straight-up list of records with 3 or 4 fields per record. A simple file will do, but we do want to track history and have some sort of backup mechanism. We can get the last part from a version control system like git.

These days there are plenty of ways to store structured data. The most universal one with built-in libraries across almost all languages is JSON. Others like YAML - I actually started down this path - and TOML may simplify appending to a file, but the universality of JSON in the web makes it more relevant in this situation.

Storing comments in a JSON file isn’t new or different. What’s this proposed “low-maintenance” solution? Stick it in GitHub, use JavaScript to read it, and sprinkle some AWS Lambda to make changes to it. Yes, you still have to write a little code, but it’s not complicated and once you get it going, it needs little to no maintenance.

Configuring GitHub

First you have to choose whether to keep your information in a private or public repository. I went with a private repo because I have the account for it and wanted the option of hiding any base files from the prying eyes of the interwebs. However, the method described below uses GitHub Pages, which works for both.

GitHub Pages is a feature that enables a regular repository to host static files on the internet. The files are located in a specific branch, typically gh-pages. This is excellent for static generators because you track the config and content files in the master branch, then run a build that commits the generated HTML into the gh-pages branch. This keeps the auto-generated files separate from the actual content. A few minutes after committing, the updated website is available at the URL assigned by GitHub (unless you use DNS).

Getting this configured is straightforward:

  • Identify or create the branch where you’ll host the files (the default is gh-pages).
  • Access the GitHub repository settings and scroll down to the GitHub Pages section.
  • Select the branch in the Source section.
  • Configure DNS in the Custom domain section, if needed.

GitHub is now ready to host your files. Test it out by committing an index.html page with some “hello world” text and loading it in your browser with the URL that GitHub gave you.

Reading the Comments

For now, let’s assume we already have a comments.json file with the following structure added into the gh-pages branch:

[
    {
        "user": "ABCDEFG",
        "comment": "12345"
    }
]

It’s on you to edit your site generator so it includes JavaScript code that performs an HTTP GET to the URL that’s hosting this file. There are hundreds of ways of doing that, most of which use 3rd party libraries (like jQuery). If you don’t want to complicate the dependency tree any further, here’s some sample JavaScript that does it with the basic XMLHttpRequest object:

var divcomments = document.getElementById("the_div_where_you_add_comments");

var xhr = new XMLHttpRequest();
xhr.open('GET', "https://your_user.github.io/your_repo/comments.json");
xhr.onreadystatechange = function() {
    if (xhr.readyState === XMLHttpRequest.DONE) {
        if (xhr.status !== 200) {
            // There are no comments
            divcomments.innerHTML = '<p>Be the first to comment on this article.</p>';
        }
        else {
            // Parse the comments into an object that we can iterate
            var data = JSON.parse(xhr.response);

            // Iterate through the comments and add them to the div
            for (var i = 0; i < data.length; i++) {
                divcomments.innerHTML += '<blockquote><p>' + data[i]['comment'] + '</p></blockquote>';
            }
        }
    }
};
xhr.send();

Notice that all I do is parse the response as JSON. This gives us a full JavaScript object for updating the HTML element already on the page.

However, our plan has a flaw. The comment field in our data structure is in plain text. This means that if someone uses special characters in their message, then it can break the entire section. Usually you need to escape the string before placing it into the field, but the most flexible solution is to base64 encode it. JavaScript has built-in functions for doing so: atob() and btoa().
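On the write side, which the Lambda function below handles in Python, the same encoding is available in the standard base64 module. Here’s a quick round-trip sketch (the sample comment string is my own):

```python
from base64 import b64decode, b64encode

# A comment with characters that would otherwise break the HTML.
comment = 'He said: "this <b>breaks</b> layouts" & more'

# Encode before storing (the browser side would decode it with atob()).
encoded = str(b64encode(bytes(comment, 'utf-8')), 'utf-8')

# Decoding recovers the original text unchanged.
decoded = str(b64decode(encoded), 'utf-8')
```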

Adding comments with AWS Lambda

I’ve been looking for a reason to use AWS Lambda for a while. It’s an Amazon service that provides a way of mapping the execution of a function to an event. Functions can be written in a number of languages, one of which is python 3.6, and they execute inside docker containers. Events are internal triggers from various AWS systems, including the API Gateway service. This means that you’re essentially mapping a function to a URL provided to you by the service.

There are no service charges for the first 1 million hits on the API endpoint each month, so the Lambda is basically free for an average site. But I’ll add a word of caution: if you’re running a high traffic site and expecting a lot of submissions, it’s important to look through the pricing page and do the math. Cloud services can come with steep surprises.

To get things going, it’s easy enough to hop on the Lambda website and make a new function based on the default python hello world example.

  • Click Create Function.
  • Select Blueprints.
  • Search for hello-world-python3.
  • Give it a name and type in your code changes.
  • The handler function receives two parameters: event and context. The event is a dictionary with a body field that contains the HTTP request body. If you’re using an HTML form to submit the comments, this is where you’ll find the form-encoded data.
  • Parsing event['body'] is easy enough using urllib to decode the values:

from urllib.parse import unquote_plus

params = [param.split('=') for param in event['body'].split('&')]
params = {k: unquote_plus(v) for k, v in params}

Lambda functions plug into Amazon’s CloudWatch system, so you’ll be able to view output from any print() statements by going into the CloudWatch service.
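Putting those pieces together, a minimal handler sketch might look like this (the user and comment field names are assumptions based on whatever HTML form you build):

```python
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # The API Gateway event carries the raw form-encoded POST body.
    pairs = [param.split('=') for param in event['body'].split('&')]
    params = {k: unquote_plus(v) for k, v in pairs}

    # params now holds the form fields, e.g. params['user'], params['comment'].
    print(params)  # shows up in the CloudWatch logs

    return {'statusCode': 200, 'body': 'ok'}
```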

Now let’s call the GitHub REST API to get the current list of comments, but do it without using 3rd party libraries. This keeps the function minimal; otherwise you have to go through a different process to add those libraries to the container that runs the function. Lambda also charges for GB-hours of usage, so the bigger the container, the higher the chances you’ll see a bill. This means we’re stuck with good ol’ urllib.request.urlopen and urllib.request.Request.

The GitHub REST API documentation covers manipulating files in a repo through the contents endpoint. You can use HTTP Basic Authentication with a personal access token obtained from your GitHub user settings to log in. It’s simple to add the token to the HTTP headers of the request and point it at the correct branch and file:

from urllib.request import urlopen, Request
from base64 import b64encode, b64decode

headers = {
    'content-type': 'application/json',
    'authorization': f"Basic {str(b64encode(bytes('username:access_token', 'utf-8')), 'utf-8')}"
}
resp = urlopen(Request(
    f"https://api.github.com/repos/{github_user}/{github_repo}/contents/comments.json?ref=gh-pages",
    headers=headers
))

The API response contains the base64 encoded file in a content field, so you’ll want to b64decode() it, decode the resulting bytes as utf-8, and then use json.loads() to stick it in a structure that’s easy to change. The response also gives us the SHA of the current file blob, which is information required for the next step: modifying the file.
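Here’s a sketch of unpacking that response. The helper name is mine, and the sample payload below is fabricated to mimic the shape of what the contents API returns:

```python
import json
from base64 import b64decode, b64encode

def parse_contents_response(body):
    """Unpack a GitHub contents API response body into (comments, blob_sha)."""
    info = json.loads(body)
    # The file arrives base64 encoded in the 'content' field.
    comments = json.loads(str(b64decode(info['content']), 'utf-8'))
    # The blob SHA is needed when committing the modified file.
    return comments, info['sha']

# Fabricated response body mimicking the API (a real one comes from resp.read()):
sample = json.dumps({
    'content': str(b64encode(b'[{"user": "ABCDEFG", "comment": "12345"}]'), 'utf-8'),
    'sha': 'abc123',
})
comments, sha = parse_contents_response(sample)
```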

Committing a change is also simple, except we need to pass in metadata describing the modifications along with the new base64 encoded file contents (in its entirety).

data = {
    'message': 'A commit message that makes it easy to figure out what happened',
    'branch': 'gh-pages',
    'sha': sha_value_from_previous_GET_response,
    'content': str(b64encode(bytes(json.dumps(updated_comments), 'utf-8')), 'utf-8')
}

This encode and decode stuff is a bit ugly. We serialize the updated comments back to JSON, but since we need to base64 encode them to pass them to GitHub, we call b64encode(). That function takes bytes, not plain strings, so we have to turn the serialized contents into bytes first. On top of that, the output of b64encode() is also bytes, which means we need to turn it back into a utf-8 string before sticking it in the dictionary that we’ll send along with our HTTP request. Phew!
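To keep the handler readable, the whole dance can live in two tiny helpers (the names are mine):

```python
import json
from base64 import b64decode, b64encode

def to_github_content(obj):
    """JSON-serialize obj, then base64 encode it into a utf-8 string."""
    return str(b64encode(bytes(json.dumps(obj), 'utf-8')), 'utf-8')

def from_github_content(content):
    """The reverse: base64 decode a contents-API string back into an object."""
    return json.loads(str(b64decode(content), 'utf-8'))
```

With these, the content value in the commit payload is just to_github_content(updated_comments).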

The HTTP request to GitHub is a PUT and the authentication is the same as before, but urllib takes data in byte strings (yes, this again!). The call now looks like this:

urlopen(Request(
    f"https://api.github.com/repos/{github_user}/{github_repo}/contents/comments.json",
    headers=headers,
    data=bytes(json.dumps(data), 'utf-8'),
    method='PUT'
))

That’s it. Once the request completes, the gh-pages branch updates with a new commit that contains the message you just passed, and the new file is available to the public. It’s not the fastest solution, but the user will see their comment added to the list and you’ll have a commit history of all the changes over time. You can also add email integration with GitHub so you get a notification for each commit.

Triggering with API Gateway

As mentioned earlier, Lambda can trigger on a number of things within the AWS ecosystem; some of those are internal services, others are external. To make your function accessible over the public internet, you have to plug it into API Gateway. Doing so can be a little confusing the first time, so here’s some info on that.

Once you create an API with the Gateway service, add the URL path and method that you want routed to the function. You can use /, but I decided to use a specific resource like /comments. Do this by clicking the Actions drop-down and selecting Create Resource, then typing in the name and path you chose.

I do recommend also selecting the Enable API Gateway CORS checkbox, which will pre-configure a response to an OPTIONS request. If you decide to use pure JavaScript to call the API (not form submissions), you can specify the CORS headers that browsers look for in the pre-flight requests sent before the actual API requests. Perform the edit by clicking on Integration Response after selecting the OPTIONS method it created.

We have a gateway endpoint and the resource path; now we add the method that calls our Lambda function into the base resource: a POST. Do this by selecting the resource path, clicking on the Actions drop-down and choosing Create Method. Select the POST method and use Lambda Function for the Integration Type field, then choose the AWS region where the function exists and type the function name in the Lambda Function text box.

The API Gateway Dashboard provides the URL for this Lambda function. Use it as the value for the action field in the HTML form of your website.

The workflow is now ready. Here’s a short review:

  • Load a webpage with the new comments section.
  • JavaScript performs an HTTP GET to list existing comments from a publicly available file hosted in GitHub Pages.
  • Form submission for adding a new comment performs an HTTP POST to an AWS API Gateway URL that triggers a Lambda function.
  • The Lambda function uses the GitHub REST API to retrieve the latest list of comments, inserts the new comment and commits an updated list to GitHub.

Final Notes

A complete solution has to consider that one flat file for an entire website is going to be too large - unless you’re like me and have little traffic. To break things up, I recommend identifying the articles or pages where you want separate comment sections by some form of uuid, hash, or a simple numerical identifier that’s stored somewhere in the page. Static site generators usually provide functions that do this for you. Use this identifier as the filename that you commit to the gh-pages branch. This yields a list of files that’s easy to manage, both for editing and reading.
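As one assumed scheme (not something a generator provides under this name), you could hash the page URL into a stable per-page filename:

```python
from hashlib import sha1

def comments_filename(page_url):
    """Derive a stable, per-page comments file name from the page URL."""
    return sha1(page_url.encode('utf-8')).hexdigest() + '.json'

# Every page gets its own small file in the gh-pages branch.
name = comments_filename('https://example.com/posts/my-article/')
```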

It’s also possible to split it up by using a dictionary structure, instead of a list, in one large file. But it still doesn’t solve the problem of loading the entire file before getting to the section that applies to the current page the user is visiting.

When a consumer submits the HTML form, API Gateway pipes the HTTP request to the Lambda function, which then returns a response. This means the user has now navigated away from the site that originated the request. To keep them on the site, or move them somewhere else with actual content, you can add a hidden redirect field to the HTML form and ask API Gateway to return that URL in an HTTP redirect back to the page that submitted the request.

To do this, you need to add a mapping into the Method Response section for the POST method in API Gateway, so that it returns an actual 302 with a Location header when the Lambda function answers with something like this:

return {
    "statusCode": 302,
    "headers": {
        "Location": params['redirect']
    }
}

One last note: if you don’t care about privacy, you can bypass Lambda and perform all the changes in JavaScript. Just keep in mind that any credentials used to log into GitHub will be visible in the JavaScript code. Depending on the application, this may not be an issue.

While this workflow is a tad complicated, it’s not too hard to create and it’s easy to maintain. I hope it serves you in the future; if nothing else, it’s a good exercise in AWS Lambda workflows and the GitHub REST API.