4 Attempts at Packaging Python as an Executable
A few years back I researched how to create a single-file executable of a Python application. Back then, the goal was to build a desktop interface that bundled other files and binaries in a single package. Using PyInstaller I produced a single binary file that looked and ran just like any other application.
Fast forward to today, and I have a similar need but a different use case. I want to run Python code inside a Docker container, but the container image cannot require a Python installation.
Instead of blindly repeating what I tried last time, I decided to investigate more alternatives and discuss them here.
Note: Please don’t read this as “project x is best” or “solution y sucks”; instead, try to learn from the journey. The info here may help you push past problems in your own attempts.
The Environment
All projects have constraints, and this one is no exception. It’s important to understand them because they help focus your problem-solving as you work. Let’s jump into our limitations.
The final packaged file will run on Docker, which means the images are almost guaranteed to be some Linux derivative. There’s no need to worry about Windows yet, though all of the packaging systems I try do support building for Windows, Mac and Linux - it’s good to keep possibilities open.
The build executes from an Ubuntu 16.04 virtual machine, but it’s not a clean install. It’s using Python 3.7.4 compiled from source. There are web proxies, blocked ports, SSH problems and network issues. Compilation environments mix gcc and clang. There are at least a dozen separate Python virtual environments. It’s using both Git and Mercurial repos - yep, they do exist! Plugins that translate between hg and git - yes, people do this as well. Custom Docker images, nginx reverse proxies, the works.
I don’t know about you, but this mess is the reality of working as a software developer in a large company.
I’m sure that trying all this stuff would be infinitely easier with pristine virtual machines accessible only to me, running on network services and hardware with a steady-state configuration, and without dependencies on VPNs, SSH tunnels or cross-geo connectivity.
However, in almost two decades of a professional career, I’ve never experienced such an environment. At least not for more than a month or two before the IT nazis find your rogue VM and kill it, along with all your hopes and dreams :D.
The Attempts
There wasn’t a lot of premeditation in choosing the following four systems. I went with solutions that could meet my needs based on popular answers to questions about packaging.
Cython
Cython is very popular as a method and language for compiling Python code into C modules, giving you the ability to integrate with other C code, along with the typical speed gains of compilation versus running an interpreter. It can integrate both ways: not only making Python code callable from C, but also producing C that imports directly like any other Python module.
At first I tested a simple hello-world script that printed something to the screen, imported subprocess and made a call to run() - something I needed to do in my final solution.
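For reference, the test script was just a few lines along these lines (a sketch; the exact contents are illustrative, not my original file):

# hello.py - minimal test: print something and spawn a subprocess
import subprocess

print("Hello from a packaged Python app!")
# the final tool needs to run external commands, so exercise that too
result = subprocess.run(["uname", "-a"], capture_output=True, text=True)
print(result.stdout)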
Cython doesn’t solve the entire pipeline for you. Its main function is the translation of a .pyx file containing the modified Python code (Cython defines additional language constructs) into a .c file. It doesn’t compile that file for you, but there’s an option for embedding the Python interpreter into it.
If you’ve never compiled C code, this is a two-step process: compile and link. The first step produces an object .o file and the second links it into an executable. These are the steps to do it:
cython --embed -3 -o your_app.c your_app.pyx
gcc -o your_app.o -c your_app.c `python-config --cflags`
gcc -o your_app your_app.o `python-config --ldflags`
The first line is the actual cython call that translates to C. We’re telling it to --embed the interpreter and to use -3 to target Python 3. -o names the output file it writes.
Both compilation steps also require extra information to help the compiler find the necessary pieces of code that make up Python itself and include them in the resulting file. This gets tricky when you’re using virtual environments, and more complicated still with custom flags if you’re using a Python installation built from source.
In determining which flags to use, I discovered the python-config command. Running it inside your virtual environment will point to the appropriate information. Specifically, you’ll need python-config --cflags and python-config --ldflags, whose output can be passed directly into the gcc compiler commands in lines 2 and 3 above.
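To give an idea of what the compiler is being told, on a source-built 3.7 install the output looks something like this (the exact directories and flags depend entirely on how your Python was configured):

python-config --cflags
-I/usr/local/include/python3.7m -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall

python-config --ldflags
-L/usr/local/lib/python3.7/config-3.7m-x86_64-linux-gnu -lpython3.7m -lcrypt -lpthread -ldl -lutil -lm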
I was able to produce a standalone one-file package. It worked very well, until I started to split the Python code into individual modules. It seems Cython will not crawl your imports to figure out what it needs to compile. You have to do that manually yourself (the sketch below shows what that looks like), or at least I couldn’t figure out a way to automate it in the timespan I had available.
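Doing it manually means compiling each imported module into its own shared library that the embedded interpreter can load at runtime. A rough sketch, using a hypothetical helpers module:

# only the entry point gets --embed; supporting modules build as
# regular extension modules
cython -3 -o helpers.c helpers.pyx
gcc -fPIC -shared -o helpers.so helpers.c `python-config --cflags`

The resulting .so files then have to be somewhere the executable can find them at runtime, which defeats much of the point of a one-file package.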
Cython isn’t really built to package code the way I need it. Its main purpose, as mentioned earlier, is integration between C and Python. Enter our next method: Nuitka!
Nuitka
Nuitka has been around for quite a while, almost as long as Cython itself. This system exists primarily for making Python modules into executables - which is exactly our goal. It also compiles your code, like Cython, but using its own algorithms, which means there’s still a possibility of gaining some execution speed.
I’ve used Nuitka in the past as a proof of concept, compiling a flask app into a single file that ran just fine. The command I worked with on this attempt was:
python -m nuitka --standalone --follow-imports your_application.py
Unfortunately, this time around I had a bunch of trouble getting the necessary C includes in place. I saw it crawl all the imports in my code, as well as the modules imported by those modules, and so on. But I was just unable to get past a set of “undefined” errors from missing standard-library functions.
I suspect problems with my virtual environment setup, or just missing environment variables that point to the correct paths. Either way, I ran out of time while researching solutions.
PyOxidizer
I’ve made a point of mentioning the compilation steps in these past few packaging options because the next two are slightly different.
Compiling a module or package along with the Python interpreter (the attempts so far) is not the same as “bundling” an interpreter along with your Python code into one file. The latter is what the next two mechanisms do.
PyOxidizer is the newcomer. It leverages a packaging system developed for the Rust programming language. Just like everything else, it can produce packages for any operating system, but it works by bundling a Python interpreter into the executable, along with your code and its dependencies from a fresh virtual environment.
Executing the binary extracts the relevant files into memory. And as you know, reading from memory is considerably faster than the file system, so you’ll have faster import statements and load times. Other solutions like PyInstaller (described in more detail below) do something similar, but they extract code onto the filesystem.
It can package things in 3 different modes:
- repl - this is really neat, and I consider it a “deliverable” category of its own. It allows you to distribute a one-file interactive Python REPL pre-configured with any and all dependencies. Think of Jupyter Notebook, but from the command line as one executable.
- eval - runs a one-liner string with the Python command of your choice. Their documentation uses import uuid; print(uuid.uuid4()) as an example. It works well with code that does simple things or makes calls into modules with simple interfaces.
- module - loads a module as __main__ and executes it. I expect this is the common choice for larger applications.
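For context, the overall workflow at the time looked something like this (command names may differ between PyOxidizer releases, so check the docs for your version):

pyoxidizer init myapp   # scaffold a project with a default config file
cd myapp
pyoxidizer build        # produce the executable with the embedded interpreter
pyoxidizer run          # build and immediately run it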
I ran a number of tests that worked well in my adventures with PyOxidizer. Packaging choices available in the TOML config file allow for a lot of customization.
I especially loved the 4 options that handle module dependencies: single-file pip installs, using requirements files, providing root directory includes, or simply pointing to a virtual environment with all dependencies configured. A sketch of one such rule follows below.
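As a shape reference only - these key names are from memory of the early TOML format, and newer releases moved to a Starlark-based config, so treat this as illustrative rather than exact:

[[packaging_rule]]
# install a single package with pip at build time (package name illustrative)
type = "pip-install-simple"
package = "requests"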
As I started to do more complicated things, I ran into this problem when trying to import requests. The gist of it: the __file__ attribute isn’t set when running packaged code.
The PyOxidizer folks explain it in detail here. They make the point that this attribute is not required by Python environments, so anyone writing a module that expects it limits where their code can execute.
I’ve actually dealt with __file__ issues before when bundling external resources with PyInstaller, so it wasn’t surprising to me.
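If you control the code being packaged, one way around the problem is to resolve bundled data through the import system instead of building paths off __file__. A minimal sketch, with hypothetical package and file names:

# resolves the resource through the import machinery, so it works
# even when __file__ is never set (importlib.resources is 3.7+)
import importlib.resources

text = importlib.resources.read_text("mypackage", "settings.txt")
print(text)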
It seems that several libraries depend on that attribute - or at least a few basic ones do, and the many higher-level libraries built on them inherit that dependency as well. Remember when we talked about how modularization and dependency trees can make open source complicated?
This problem made the whole effort feel wasted. I looked around for solutions but didn’t find any within my allotted time. However, while writing this post I did run into this section of their documentation that describes how to sidestep the situation by adding some extra settings in the config file. So hope is not lost.
PyInstaller
PyInstaller has been available for several years. I’ve used it many times before to successfully produce single-file desktop application binaries for both Linux and OSX. I was even able to include other software in the bundle, like Chromium or a built Unity3D game.
As mentioned previously, the methodology is very similar to PyOxidizer. Its final executable contains a bundled interpreter and all the files you need in order to run your application. Executing it extracts the bundled files into a directory and loads the rest from there.
In my tests, the code required access to Python shared libraries. This was slightly more complicated in my situation because I’m working from a Python source distribution.
I had to make a conscious decision when configuring the Python source to enable these libraries. You do this by adding the --enable-shared option when running the ./configure step of the source build.
After installing Python in this way, you must set an LD_LIBRARY_PATH environment variable pointing to the shared libraries. Alternatively, you can configure the path in a file under /etc/ld.so.conf.d and run ldconfig.
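Put together, the source build looks something like this (the install prefix and paths are examples; adjust for your system):

./configure --enable-shared --prefix=/usr/local
make
sudo make install
# make the dynamic loader aware of libpython without LD_LIBRARY_PATH
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/python37.conf
sudo ldconfig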
A separate complication came from the fact that I still like to use virtualenv and virtualenvwrapper to manage virtual environments.
It turns out there’s a bug in how PyInstaller behaves when running inside one. The issue is with identifying the correct path for distutils, but I found a quick fix to apply in this issue comment. Nothing too complicated.
Producing the final executable was done through the following command, plus a couple of options that I don’t list here because they only specify output and build directories:
pyinstaller --onefile --clean your_application.py
A few other items of note:
- Environment variables - in order to guarantee a correct load, PyInstaller automatically updates the LD_LIBRARY_PATH variable within its process to a temporary path.
- Paths (again) - you may find some issues similar to PyOxidizer when determining the path from which code is executing; solutions for that are available in the article on desktop applications linked earlier.
The LD_LIBRARY_PATH change isn’t an issue for almost all cases, but my application was using subprocess to execute some commands that depend on that path. Since spawned processes inherit the parent environment, you’ll have to adjust that variable and pass the correct one in as part of the env parameter of the subprocess call. The original value is available in a different env var called LD_LIBRARY_PATH_ORIG.
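A short sketch of that adjustment (the command being executed is hypothetical):

import os
import subprocess

env = os.environ.copy()
# restore the loader path that PyInstaller saved before overriding it
orig = env.get("LD_LIBRARY_PATH_ORIG")
if orig:
    env["LD_LIBRARY_PATH"] = orig
else:
    env.pop("LD_LIBRARY_PATH", None)  # fall back to the system default
subprocess.run(["some_native_tool"], env=env, check=True)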
Moving forward
Cython was not built to solve my use case, but Nuitka was. I’m sure with more time I would’ve made the Nuitka compilation process work.
However, PyInstaller solved all my problems again, so it was my chosen solution. I’ve been using it repeatedly in automation for a couple of weeks now without issue.
PyOxidizer shows a lot of promise. Its documentation is very good, even including a comparison to other tools with a short overview of each one, along with their differences. It’s worth a look, if for nothing more than to learn the names of all the different possibilities.