Practicality Beats Purity - Modularity

2018-03-04 python engineering packaging Cristian Medina

Continuing on the Practicality Beats Purity series, today we’re talking about modularity. While written with python in mind, the discussion here applies to any language that’s highly modular and with a large ecosystem.

As is touted frequently, python is quite famous for being a “batteries included” language with a vast ecosystem of modules and packages that provide almost every possible utility or function you’ll ever need. When building large applications, it’s a great idea to make use of this environment and not reinvent the wheel. This makes rapid development and prototyping real easy.

However, you must keep in mind that every new dependency added is one more variable that you have little to no control over. While you may not write the code yourself, there’s still cost incurred in keeping up with the most recent versions of your dependency and watching for security flaws and their respective fixes. It’s also important to pay attention to the size of the community around those dependencies, their interaction with other modules, responsiveness to reported bugs, and the size of supporting documentation both official (like read-the-docs) and unofficial (like stack overflow).

Below we discuss some of those costs.

The overall dependency tree

Adding one python package might add a couple more dependent packages which you now have to manage, but some packages might add dozens! A great example here is the azure python module. If you want to do something with Microsoft’s Azure cloud services and you don’t want to write your own REST API wrappers, you will need to pip install azure. Well, at the end of that install pip says:

Successfully installed PyJWT-1.5.3 adal-0.4.7 asn1crypto-0.24.0 azure-2.0.0
azure-batch-3.0.0 azure-common-1.1.8 azure-datalake-store-0.0.17
azure-graphrbac-0.30.0 azure-keyvault-0.3.7 azure-mgmt-1.0.0
azure-mgmt-authorization-0.30.0 azure-mgmt-batch-4.0.0 azure-mgmt-cdn-0.30.3
azure-mgmt-cognitiveservices-1.0.0 azure-mgmt-compute-1.0.0
azure-mgmt-containerregistry-0.2.1 azure-mgmt-datalake-analytics-0.1.6
azure-mgmt-datalake-nspkg-2.0.0 azure-mgmt-datalake-store-0.1.6
azure-mgmt-devtestlabs-2.0.0 azure-mgmt-dns-1.0.1 azure-mgmt-documentdb-0.1.3
azure-mgmt-iothub-0.2.2 azure-mgmt-keyvault-0.31.0 azure-mgmt-logic-2.1.0
azure-mgmt-monitor-0.2.1 azure-mgmt-network-1.0.0 azure-mgmt-nspkg-2.0.0
azure-mgmt-rdbms-0.1.0 azure-mgmt-redis-4.1.1 azure-mgmt-resource-1.1.0
azure-mgmt-scheduler-1.1.3 azure-mgmt-sql-0.5.3 azure-mgmt-storage-1.0.0
azure-mgmt-trafficmanager-0.30.0 azure-mgmt-web-0.32.0 azure-nspkg-2.0.0
azure-servicebus-0.21.1 azure-servicefabric-5.6.130
azure-servicemanagement-legacy-0.20.6 azure-storage-0.34.3 certifi-2017.11.5
cffi-1.11.2 chardet-3.0.4 cryptography-2.1.4 idna-2.6 isodate-0.6.0
keyring-10.5.1 msrest-0.4.23 msrestazure-0.4.19 oauthlib-2.0.6 pycparser-2.18
python-dateutil-2.6.1 requests-2.18.4 requests-oauthlib-0.8.0 six-1.11.0
urllib3-1.22

That’s 57 packages in total, 56 of them pulled in by the one that you actually wanted to install. Yes, most of these are the azure wrappers and supporting crypto / HTTP request systems, but what is six doing here? Why do I need cffi for a REST wrapper?
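To see where a list like this comes from, you can walk the requirements that each installed package declares. The following is a minimal sketch, assuming Python 3.8 or later (newer than the 3.6 interpreter used elsewhere in this post) so that importlib.metadata is available. The function names are made up for illustration, and the parsing is deliberately naive: it skips optional extras but ignores every other environment marker.

```python
import re
from importlib import metadata

# Matches the distribution name at the start of a requirement string
# such as "requests (>=2.18)" or "six>=1.10; python_version < '3'".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")
_EXTRA = re.compile(r"extra\s*==")


def _normalize(name):
    """Lowercase a name and collapse runs of '-', '_' and '.' into '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def _installed_requires(name):
    """Requirement strings declared by an installed distribution."""
    try:
        return metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []


def dependency_closure(package, requires=_installed_requires):
    """Return every name reachable from `package`, excluding itself.

    `requires` is injectable so the walk can be tried on fake data.
    """
    root = _normalize(package)
    seen = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for requirement in requires(current):
            if _EXTRA.search(requirement):
                continue  # optional extras are not installed by default
            match = _NAME.match(requirement)
            if match is None:
                continue
            dep = _normalize(match.group(1))
            if dep != root and dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Calling sorted(dependency_closure("azure")) in an environment where that package is installed should list the names it pulls in, which makes it easier to spot surprises like six or cffi before they reach production.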

Does this matter to you? Maybe not, but maybe it does and you don’t even know it (as we’re about to find out).

Dependency conflicts

Some modules depend on different versions of packages than the ones required by other modules you may have already installed. Although the pip installer is usually pretty good at determining whether you meet the requirements for the new modules you’re adding, it will go ahead and uninstall the previous version that the existing modules required. Maybe that’s ok, maybe it’s not, but you have to be aware of it happening so that you can test extensively and maybe make your own decisions on which versions to run.

The most interesting example in this arena is requests, a package that’s referenced by almost everyone doing things over HTTP. Now, let’s say you’re still working with cloud environments, and you have a multi-cloud setup, which means now you need Amazon’s boto3 to talk to your resources in AWS. There was a time earlier this year in which boto3 required a different version of requests than the one azure was using. For the installation to succeed, you needed to install them separately, and to guarantee functionality you had to install azure first and then boto3. If you tried to do them both at the same time with pip install boto3 azure, the installation would fail because pip couldn’t find a version of requests that could satisfy both modules.

This was a “simple” example because requests doesn’t generally break functionality across versions, but other modules might not be so backwards compatible. Even then, what happens if there’s a serious security flaw in the current version X of requests which is fixed in version X+1, but one of your modules can’t support it? Are you stuck with the flaw?
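The requests conflict described above comes down to two version ranges that don’t overlap. Here is a toy sketch of that check. The pins used are invented for illustration and are not the real constraints boto3 or azure declared, and the parser only understands plain dotted integer versions. Real tools follow PEP 440, which the packaging library implements.

```python
import operator
import re

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}
_CLAUSE = re.compile(r"^\s*(==|!=|>=|<=|>|<)\s*([0-9]+(?:\.[0-9]+)*)\s*$")


def parse_version(text):
    """Turn '2.18.4' into (2, 18, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def _compare(op, left, right):
    """Compare two version tuples after padding the shorter with zeros."""
    width = max(len(left), len(right))
    left += (0,) * (width - len(left))
    right += (0,) * (width - len(right))
    return _OPS[op](left, right)


def satisfies(version, spec):
    """True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError("unsupported clause: %r" % clause)
        op, bound = match.groups()
        if not _compare(op, parse_version(version), parse_version(bound)):
            return False
    return True


def compatible_versions(available, *specs):
    """Versions from `available` that satisfy all of the given specs."""
    return [v for v in available if all(satisfies(v, s) for s in specs)]


available = ["2.9.1", "2.11.0", "2.14.2", "2.18.4"]
package_a = ">=2.14.2"        # hypothetical pin
package_b = ">=2.9.1,<2.14"   # hypothetical pin

print(compatible_versions(available, package_a))
print(compatible_versions(available, package_b))
print(compatible_versions(available, package_a, package_b))
```

Each pin on its own leaves some versions to choose from, but the last call returns an empty list. That empty intersection is the situation in which an installer has to either fail or break one of the two packages.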

Logging

It’s a common development practice in most, if not all, languages for developers to print messages to the screen in order to keep track of what’s happening in the application. The simplest of the cases is to use the basic print function to do so, but in most larger applications, the use of the built-in logging module is more common. This allows the developer to apply standard formatting across multiple subsystems, as well as include multiple logging levels which are filterable during initial configuration. Awesome! What’s wrong with that?

When you start adding external libraries, you’ll find that each of them has its own logging methodology, some of which is relevant to you and some of which is not. This is especially true if you want to run your own application loggers at the lower DEBUG level. In other words, you likely don’t want to know that requests-oauthlib properly formatted a header for inclusion in the HTTP POST that requests is executing within the azure authentication module, used to log you into Azure so that the azure compute management module can retrieve a list of virtual machines.

This is yet another item that you’ll have to keep in mind when managing dependencies. It’s likely you’ll have to add separate filters for each of their loggers, and with something as complicated as the azure module we discussed previously, it’s not easy to determine which loggers to enable and which ones to disable. It all depends on the nature of your application.
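One common way to apply such a filter is to raise the level on the top-level logger of each dependency and let its children inherit it. The sketch below assumes the dependency names its loggers after its own modules, as boto3 and botocore do. The quiet helper and the myapp logger name are made up for illustration.

```python
import logging

# Our own application wants everything, down to DEBUG.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(name)s %(levelname)s %(message)s",
)


def quiet(*packages, level=logging.WARNING):
    """Raise the logging threshold for each package's logger tree.

    Child loggers that never set a level of their own inherit this one.
    A child that sets its level explicitly is not affected.
    """
    for package in packages:
        logging.getLogger(package).setLevel(level)


quiet("botocore", "boto3")

log = logging.getLogger("myapp")  # hypothetical application logger
log.debug("visible: our own debug output")
logging.getLogger("botocore.auth").debug("hidden: below the WARNING threshold")
```

This keeps the decision in one place, so adding or dropping a dependency means editing a single call rather than hunting for individual loggers.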

A handy way of listing all the loggers is to import your modules and use: logging.Logger.manager.loggerDict. This will give you a dictionary with all the loggers that your modules have currently registered. For example:

Python 3.6.1 (default, Apr  4 2017, 09:40:51)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import boto3, logging
>>> logging.Logger.manager.loggerDict
{'botocore': <Logger botocore (WARNING)>,
'botocore.vendored.requests.packages.urllib3.util.retry': <Logger
botocore.vendored.requests.packages.urllib3.util.retry (WARNING)>,
'botocore.vendored.requests.packages.urllib3.util': <logging.PlaceHolder object
at 0x10a0ac2b0>, 'botocore.vendored.requests.packages.urllib3': <Logger
botocore.vendored.requests.packages.urllib3 (WARNING)>,
'botocore.vendored.requests.packages': <logging.PlaceHolder object at 0x10a0ac320>,
'botocore.vendored.requests': <Logger botocore.vendored.requests (WARNING)>,
'botocore.vendored': <logging.PlaceHolder object at 0x10a0ac390>,
'botocore.vendored.requests.packages.urllib3.connectionpool': <Logger
botocore.vendored.requests.packages.urllib3.connectionpool (WARNING)>,
'botocore.vendored.requests.packages.urllib3.poolmanager': <Logger
botocore.vendored.requests.packages.urllib3.poolmanager (WARNING)>,
'botocore.compat': <Logger botocore.compat (WARNING)>, 'botocore.utils': <Logger
botocore.utils (WARNING)>, 'botocore.credentials': <Logger botocore.credentials
(WARNING)>, 'bcdocs': <Logger bcdocs (WARNING)>, 'botocore.waiter': <Logger
botocore.waiter (WARNING)>, 'botocore.auth': <Logger botocore.auth (WARNING)>,
'botocore.awsrequest': <Logger botocore.awsrequest (WARNING)>, 'botocore.hooks':
<Logger botocore.hooks (WARNING)>, 'botocore.paginate': <Logger botocore.paginate
(WARNING)>, 'botocore.parsers': <Logger botocore.parsers (WARNING)>,
'botocore.response': <Logger botocore.response (WARNING)>, 'botocore.history':
<Logger botocore.history (WARNING)>, 'botocore.endpoint': <Logger botocore.endpoint
(WARNING)>, 'botocore.args': <Logger botocore.args (WARNING)>, 'botocore.client':
<Logger botocore.client (WARNING)>, 'botocore.retryhandler': <Logger
botocore.retryhandler (WARNING)>, 'botocore.handlers': <Logger botocore.handlers
(WARNING)>, 'botocore.loaders': <Logger botocore.loaders (WARNING)>,
'botocore.regions': <Logger botocore.regions (WARNING)>, 'botocore.session': <Logger
botocore.session (WARNING)>, 'boto3.resources.model': <Logger boto3.resources.model
(WARNING)>, 'boto3.resources': <logging.PlaceHolder object at 0x10a60c3c8>, 'boto3':
<Logger boto3 (WARNING)>, 'boto3.resources.action': <Logger boto3.resources.action
(WARNING)>, 'boto3.resources.base': <Logger boto3.resources.base (WARNING)>,
'boto3.resources.collection': <Logger boto3.resources.collection (WARNING)>,
'boto3.resources.factory': <Logger boto3.resources.factory (WARNING)>}

That’s 36 entries from a single import: 32 actual loggers plus 4 placeholders for intermediate package names!
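A dictionary that size is hard to read at a glance, so it can help to group it by top-level package. This is a small sketch built on the same loggerDict attribute. The function name is an assumption, and PlaceHolder entries are skipped because they only mark intermediate names and never emit anything themselves.

```python
import logging
from collections import Counter


def loggers_by_package():
    """Count real Logger objects per top-level package name."""
    counts = Counter()
    # Copy the items first so a logger registered meanwhile can't break the loop.
    for name, entry in list(logging.Logger.manager.loggerDict.items()):
        if isinstance(entry, logging.Logger):  # skip PlaceHolder entries
            counts[name.split(".")[0]] += 1
    return counts


for package, count in loggers_by_package().most_common():
    print("%-20s %d" % (package, count))
```

Applied to the session above, it would report 25 loggers under botocore, 6 under boto3 and 1 under bcdocs, which tells you which top-level names are worth configuring.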

Debugging

In recent work I was using a very famous python framework for producing REST APIs (which one is kind of irrelevant). Like most HTTP service frameworks, this one included a routing system for mapping URLs and arguments to functions. Since it’s highly extensible, we were also trying to avoid reinventing the wheel by installing supporting modules from its vast ecosystem. These included enhanced routing based on “industry standard” (this is in quotes for a reason) specifications defined in YAML, serializers, session managers, database access, validators, etc. This was all great, in principle.

What happened? Well, once we scaled to more than a few endpoints with complicated request arguments and responses, we started running into problems. The routers wouldn’t route to the correct functions and, because of how they worked, they did a little serialization in their returns. That meant both our endpoint functions and the serializers needed massaging in order to accept inputs that would produce the output we wanted, but the validators didn’t agree with what we wanted based on how that “industry standard” spec was written.

Trying to debug a failed request meant going through a 20-deep stack trace of code that we DID NOT own and, because of sparse documentation in some of the modules, we had to resort to browsing through GitHub repos to figure out what was happening before we could finally determine WHAT the bug was. Figuring out how to fix it was a whole other problem, though the issue was usually due to code we wrote in order to massage the data into these individual modules, mostly because our interpretation of how these modules functioned (based on their docs) was incorrect.
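When most of a stack trace belongs to other people’s code, a first step is to trim it down to the frames you own, since that is usually where the fix ends up. The sketch below uses the standard traceback module. The helper names are made up, and the standard library’s json package stands in for the third-party code purely so that the example is self-contained.

```python
import json
import os
import traceback


def frames_where(exc, keep):
    """Return the traceback frames of `exc` whose filename passes `keep`."""
    return [
        frame
        for frame in traceback.extract_tb(exc.__traceback__)
        if keep(frame.filename)
    ]


# Demo assumption: anything inside the json package is "not ours".
vendor_dir = os.path.dirname(json.__file__)


def is_ours(filename):
    return not filename.startswith(vendor_dir)


def handler():
    # Our code hands bad input to the library, which raises deep inside it.
    return json.loads("{not valid json")


try:
    handler()
except ValueError as exc:
    for frame in frames_where(exc, is_ours):
        print("%s:%s in %s" % (frame.filename, frame.lineno, frame.name))
```

In a real application the predicate would more likely test for your project’s source directory, and the full traceback should still be logged somewhere in case the bug really is in the dependency.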

When you’re building simple systems this doesn’t matter too much, but as soon as you start getting into large applications, this gets pretty ridiculous. Our ultimate fix was to throw away all of these modules and their dependency trees, only keeping the base framework. We wrote our own version of what was needed for our custom application. The end result was a lower startup time (fewer packages to load), less code to maintain (yes, the massagers were more complicated than a straightforward implementation of what was needed), complete control over the request pipeline and, best of all, a 300ms reduction in request response times (a delay we had been attributing to infrastructure configuration problems early on).

Summarizing

Pay close attention to your dependencies. Reusing code is fantastic and you should do it wherever possible, but don’t get carried away by the fame of a given module or ecosystem. It’s important to understand that in order to cater to wider audiences, some packages create several layers of abstraction that could be a detriment to your implementation. You have to weigh your requirements, the cost to develop your own, the validation costs, the infrastructure costs and the maintenance costs going into the future. It’s also important to keep reevaluating these as your project progresses and becomes more complex. What worked for you early on, when you had a small codebase and a short list of customers, will not always work when those scale x10 or x100. On the other hand, if you know that you’ll stay forever at a certain target, then don’t complicate your life by writing more code than you have to.
