The Trusted Packaging Index
A proposal for funding PyPI infrastructure and development
A few days ago, I was listening to the latest episode of Talk Python To Me, "Are we failing to fund Python’s core infrastructure?", which had a panel of guests from the Python Software Foundation, PyPI and Read The Docs. As someone who writes open source code, I always have the topic of sustainability floating around in my mind. Being able to work mostly on the things that tickle my brain would definitely be awesome, but even with a fantastically successful project (which I don’t have) it is still extraordinarily difficult to achieve.
I always wondered how organizations like the PSF made it all work, especially with infrastructure and systems that have the level of traffic we see in PyPI. The closest parallel I can draw is to research projects, where a considerable amount of time is dedicated towards finding the right kind of funding.
The problem
There’s a lot of open source software, tooling and infrastructure out there providing a considerable number of capabilities that form essential parts of a larger ecosystem — Python in this particular case — without which many other software projects would be unable to function.
Don’t believe me? Let’s take the NPM craziness from earlier this year as an example, where the owner of a widely used package decided to take it down for a few interesting reasons, in essence breaking a large percentage of all packages, most of which were unaware of their dependency on it.
There was another problem a few years back with NPM SSL certificates — not trying to pick on these guys, it’s just fresh in my mind — where a simple change caused everyone’s installs and builds to fail. The fix was easy but the ecosystem so vast that it had a large impact.
If you listen to the podcast episode mentioned earlier, you’ll realize how critical the PyPI and Read The Docs systems really are. The packaging index alone is ferrying around data in the vicinity of 40TB a month, with an infrastructure cost of $40k/month, donated by Rackspace. If Rackspace decided to stop funding this project, or PyPI were to go down in any way, what would the rest of us do? Yes, there are contingencies, but I don’t like resting one of the most basic Python features on the goodwill of just one donor. No offense to Rackspace, but there are always unforeseen circumstances, and I’m sure they would appreciate some partners.
These folks have been looking at possible sources of donations and income for a while, but there are no silver bullets. Even more so when you consider that some of the typical solutions for these types of projects (like advertisements) could be detrimental to the main use cases of the systems themselves.
If a company could pay for having their packages featured or prioritized in some way, or if you went to PyPI looking for an AirBnB module and found a set of ads that follow you around the internet trying to sell you on a Caribbean vacation, you may find it annoying and avoid visiting the site in the future. Any possible solutions will require careful consideration of their impacts on the users.
Given they are open to suggestions, I wanted to add what I could to the discussion, so below is an idea to help monetize PyPI without the use of advertising.
The Idea
I’ve worked in large corporations for the majority of my career, becoming acutely aware not only of the infrastructure needs of these businesses, but also of the security and compartmentalization requirements that keep trade secrets and intellectual property private and secure, as well as protect the business from ill-intended intruders that want to wreak havoc wherever they can.
In my experience over the many years of provisioning, testing and deploying, a lot of work and effort has gone into a number of package management systems — especially within operating systems — that provide varying degrees of functionality, all with the same purpose of centralizing distribution of software (usually in compiled binary form) across large swaths of systems, along with some implementation of dependency management.
However, it wasn’t until a different episode from Talk Python (one of the earlier ones, I think Episode 23) that I heard of someone using pip for internal deployments. Ever since then, I’ve done my best to push that in my organizations, and boy oh boy has it made my life easier.
Spin up a Docker container with a server (I use the pypiserver package) and, after setting up a few config files in your home directory, just run python setup.py sdist upload -r your_server to push a package and pip install --extra-index-url your_server to install it wherever you’re deploying.
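As a rough illustration of the "few home directory config files" step, here is a minimal sketch that generates the text of a ~/.pypirc registering one internal server, plus the matching pip command line. The alias internal, the user deployer, the package name mypackage and the URL http://localhost:8080 are all placeholders I made up for the example, not values from any real setup.

```python
import configparser
import io


def build_pypirc(server_name, repository_url, username):
    """Return the text of a ~/.pypirc that registers one internal index server."""
    config = configparser.ConfigParser()
    # [distutils] lists the aliases that "upload -r <alias>" can refer to.
    config["distutils"] = {"index-servers": server_name}
    # One section per alias. Only the username is written in this sketch;
    # how the password is supplied is deliberately left out.
    config[server_name] = {"repository": repository_url, "username": username}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def pip_install_args(package, index_url):
    """Build a pip command line that also searches the internal index."""
    return ["pip", "install", "--extra-index-url", index_url, package]


if __name__ == "__main__":
    # Hypothetical values: adjust the alias, URL and user to your own server.
    print(build_pypirc("internal", "http://localhost:8080", "deployer"))
    print(" ".join(pip_install_args("mypackage", "http://localhost:8080/simple/")))
```

Writing the first function's output to ~/.pypirc is what lets -r internal resolve to your server when uploading.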
Of course, this is fairly simple, insecure and limiting. In big business, it’s usually only accepted for smaller segregated teams and less important / non-critical projects. If you can get away with it at all.
Enter what I’m calling the Trusted Package Index. Businesses always have a need for an on-premises, secure, encrypted and highly available distribution mechanism of compiled binaries. Together with setuptools providing various install capabilities that can cover non-Python code just as well, it seems like we could put together a decent product. Something like a Docker Trusted Registry.
If you take warehouse, add LDAP and Active Directory support for authenticating both package uploads and downloads, provide relevant monitoring statistics, backend storage encryption, high availability, a replication mechanism, plus the ability to integrate webhooks from GitHub Enterprise, GitLab or any other source control system that can trigger package rebuilds, I think you could put together a fairly decent appliance. It should have enough appeal for a number of businesses to be willing to pay a modest yearly fee for it, especially if you can find a way to provide a support plan along with it.
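To make the webhook part a little more concrete, here is a minimal sketch of how such an appliance might decide whether an incoming push should trigger a package rebuild. It assumes a GitHub-style webhook, where the request carries an HMAC-SHA256 signature of the body in the form sha256=<hex digest> and the push payload has a ref field. The function names and the rule of rebuilding only on tag pushes are my own illustration, not anything warehouse provides today.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, signature_header):
    """Check a signature header of the form 'sha256=<hex digest>'."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest is a constant-time comparison, so timing does not
    # reveal how much of the signature matched.
    return hmac.compare_digest(expected, signature_header)


def tag_to_rebuild(body):
    """Return the pushed tag name, or None if the push was not a tag."""
    payload = json.loads(body)
    ref = payload.get("ref", "")
    prefix = "refs/tags/"
    return ref[len(prefix):] if ref.startswith(prefix) else None


if __name__ == "__main__":
    # Hypothetical shared secret and payload, for illustration only.
    secret = b"shared-secret"
    body = json.dumps({"ref": "refs/tags/v1.2.0"}).encode()
    header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    if signature_is_valid(secret, body, header):
        print("rebuild requested for tag:", tag_to_rebuild(body))
```

Other source control systems sign or authenticate their webhooks differently, so a real product would need one such check per supported system.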
This is just a suggestion. Developing it will take time and effort that may be hard to find, but not only would it serve as a source of funding, it may also help develop a few infrastructure advancements to distribute network load, like balancing the real PyPI across multiple volunteer sites running replication.
Supporting these systems and infrastructure is not an easy task, so I’d like to encourage anyone who has some extra time to pitch in where they can. If you have any ideas to help solve this, don’t be afraid to make them public by adding them in the comments below.