Managing Python Packages

Python is an interpreted programming language that has become incredibly popular over the last decade or so. Its focus on readability and simplicity has created a language that newcomers to the programming world can pick up quickly, and old pros find to be a feature-rich, modern alternative to older scripting languages like Perl or bash. I decided to continue my series on package management with Python for a few reasons:

  1. The details vary a lot from C/C++.
  2. There are a lot of different tools and pieces to the Python ecosystem; it can be hard to get your head around when you’re just starting with the language.
  3. This month is a time of significant change for Python’s package management story.

Baseline Knowledge: A Typical Python Installation

While Python is a wildly popular language, its internals aren’t taught at universities the same way other languages’ are. So before we get into package management, let’s do a quick high-level overview of a vanilla Python footprint. We can’t say “typical,” because a typical one looks more like this comic from Randall Munroe:

The Python environmental protection agency wants to seal it in a cement chamber, with pictorial messages to future civilizations warning them about the danger of using sudo to install random Python packages.

When you install Python, either from your package manager or downloaded from https://www.python.org, at least the two following essential things will happen:

  1. A binary executable named python, python3.8, or similar is added to your machine.
  2. A baseline set of Python files, expected to be on every computer with Python installed, is copied to your machine. This is referred to as the standard library.

The executable is a program that reads Python source code from *.py files and executes it on your computer.

The standard library gets copied into an easily discovered directory on your computer, like /usr/lib/python3.8 or similar. Each Python module (similar to a namespace in other languages) is either a single *.py file or a folder of *.py files. The standard library’s modules sit at the top of that directory, while third-party packages you install later land in a subdirectory called site-packages.
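If you’re curious where your interpreter actually looks, the standard library can tell you. Here’s a minimal sketch; the paths printed will vary by machine and Python version:

```python
import sys
import sysconfig

# The list of directories the interpreter searches, in order, on import.
print(sys.path)

# Where the standard library lives on this machine.
print(sysconfig.get_path("stdlib"))   # e.g. /usr/lib/python3.8

# Where third-party packages get installed.
print(sysconfig.get_path("purelib"))  # e.g. /usr/lib/python3.8/site-packages
```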

There are a lot more details here, but after typing up this section, it became pretty clear it would need its own blog post.

Using Python with Traditional Package Managers

In last week’s post on traditional package managers, I mentioned that they are just as capable of distributing Python programs as anything else, and it’s true! Most Linux distributions publish guidelines for how to package Python programs and libraries for their platform. Especially for libraries, these guidelines usually require applying a prefix to the package name. For example, running dnf search python- on my Fedora 32 machine brings up a long list of Python libraries.

Package definitions in .rpm and .deb files specify exactly where each new file they’ll add to your computer will live. For that reason, installing Python packages this way will only work if you’re using exactly the Python setup the package builder used. Using Fedora 32, which distributes Python 3.8, but trying to install an .rpm built for Red Hat 8, which uses Python 3.6? Because the Python installations live in different directories (/usr/lib/python3.8 vs. /usr/lib/python3.6) on those platforms, trying to install a package from the wrong OS is a recipe for pain. This just reinforces a key takeaway about traditional package managers: they’re most valuable if you yield all version control to them and work exclusively with the dependencies they provide for your OS.

When running in a homogeneous environment like a fleet of web servers, that’s easily done. But what about when you’re a developer banging out an open-source client tool or library? Announcing that you will limit your audience to just the people who picked one of a few Linux distributions would be a death knell. So what do those people do?

Using PyPI

One can, and frequently does, bypass traditional package managers when working in Python. Instead, it’s common to use the “Package Installer for Python” (pip for short). pip downloads either libraries or applications from a Python repository and places them in the currently active Python environment. The most popular repository by far is an official, public one called the Python Package Index (PyPI, pronounced pie-pee-eye), found at https://pypi.org, though folks can create their own public or private repositories so long as they conform to an agreed-upon format. The big advantage of this approach? All computers with internet access can download whatever version of a dependency they need from PyPI.
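As a quick illustration (the library name and version here are just examples), installing from PyPI looks like:

```
# Grab the newest version pip can find on PyPI.
pip install requests

# Or pin an exact version.
pip install requests==2.24.0
```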

That ubiquity means a library or application developer doesn’t need to sweat figuring out which version of a dependency is the lowest common denominator between Debian, Ubuntu, Arch, Fedora, and a million other operating systems. The developer is liberated from the additional cognitive load and time commitment that comes with maintaining separate SKUs for each target OS. Instead, that work is offloaded to liaisons for each operating system who are already curating an ecosystem of packages.

But without the testing rigor and formality of a big project like Debian or Red Hat, how does pip make sure that the correct versions of dependencies are installed? There’s actually a big change happening on that front at the time of posting. Historically, pip has been an example of how incredibly useful even the most naive possible package manager can be.

Essentially, each Python library and application provides a setup.py file that specifies a compatible range of versions for each dependency. When you install a package, pip evaluates each of its dependencies in turn and immediately fetches the most recent compatible version from PyPI.
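A bare-bones setup.py might look something like this; the package name and dependencies are hypothetical:

```python
from setuptools import setup

setup(
    name="mytool",       # hypothetical package
    version="1.0.0",
    packages=["mytool"],
    # Abstract requirements: ranges rather than exact pins, so
    # consumers of this package have room to resolve their own versions.
    install_requires=[
        "requests>=2.20,<3",
        "click>=7.0",
    ],
)
```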

Doing this alone has a couple of drawbacks. First, you don’t have a reproducible build. Every time your setup.py is run, there’s a chance a new version of a dependency gets pulled down. If that new version is ostensibly compatible but has a bug that breaks you, your users are broken too. In other words, you don’t have the ability to protect your users. So why not have setup.py point at a particular version? Programmatically keeping that list of dependencies up to date is pretty clumsy (yes, I’ve seen it done), and if you’re a library developer it may pin your users to a version that doesn’t work for them.

Second, I already mentioned pip is naive, but until this month it did not consider whether libraries installed before or after the one currently being evaluated have different requirements. That means if you have two or more libraries with a shared dependency that call for different versions, pip will download and install that dependency multiple times. Worst of all, you get whichever version of the shared dependency is called for last, even if it contradicts an earlier version requirement. For example, if one of your libraries asks for version 1.x of a shared dependency and another asks for 2.x, whichever library pip processes last decides the version they both get.

Even before pip introduced a dependency solver, Python users could get around these problems using a much more stripped-down and focused file called requirements.txt. This is passed into pip and acts as a lock file, where each line specifies an exact version of a dependency needed directly by you and/or indirectly as a transitive dependency of another library. This allows for a reproducible build, because only the packages in the requirements.txt are pulled in, and any dependency conflicts can be worked out ahead of time, either manually or with an auxiliary tool. Using the two files in tandem, with an abstract set of requirements in your setup.py and a concrete set of dependencies in a requirements.txt, allows for a reproducible build for your tests and flexibility for library consumers.
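A hypothetical requirements.txt is just a list of exact pins, one per line; you’d feed it to pip with pip install -r requirements.txt. The packages and versions below are purely illustrative:

```
# Direct dependencies, pinned exactly.
requests==2.24.0
click==7.1.2

# Transitive dependencies pulled in by the lines above, also pinned.
certifi==2020.6.20
chardet==3.0.4
idna==2.10
urllib3==1.25.10
```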

However, the story is not complete. Nothing we’ve discussed above would allow for different Python applications to run with different versions of a library they depend on. pip and traditional package managers both pull down one version of a package and place its source code in module directories alongside the standard library. What is a developer to do if they don’t want to be tied to whichever versions other applications are using?

Virtual Environments

When you start python, it reads a couple of environment variables: PYTHONHOME and PYTHONPATH. Using these variables, the interpreter will look in non-default places for the standard library and other modules, respectively. Taking advantage of this, tools like virtualenv, and the more modern venv, create alternate site-packages directories that hold independent sets of Python modules. pip also respects these environment variables and will install dependencies directly into the alternate folders.
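Creating and using one looks roughly like this; the directory name .venv is just a common convention:

```
# Create a virtual environment in a folder named .venv.
python3 -m venv .venv

# Activate it; python and pip now point inside .venv.
source .venv/bin/activate

# This install lands in .venv's own site-packages directory,
# not the system-wide one.
pip install requests
```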

This simple mechanism allows for bulkheads to be created, where each application or project can have its own set of dependencies. This can create a situation where multiple copies of the same version of the same library end up on your machine.

pipenv is a cool project which ties together the pieces we’ve discussed so far. When you create a new project with pipenv, it initializes a new virtual environment and manages the libraries installed there for you. Instead of using a setup.py file to specify an abstract set of dependencies and a requirements.txt file for a concrete set, it uses declarative TOML files: Pipfile and Pipfile.lock. Unlike pip, it has had a dependency solver from day one. For application projects, pipenv is a great choice. However, for libraries, I prefer to stick with the vanilla tools to broaden the audience that can consume them.
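For a sense of what that looks like, here’s a minimal, hypothetical Pipfile; the packages and versions are illustrative:

```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[packages]
requests = ">=2.20,<3"

[dev-packages]
pytest = "*"

[requires]
python_version = "3.8"
```

pipenv resolves these ranges and records the exact versions it picked in Pipfile.lock, playing the same role requirements.txt does in the vanilla workflow.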

What about Anaconda?

Anaconda is a tool suite for data science and machine learning. Among many other things, it provides its own package management solution. Similar to a traditional package manager, the tool used to download packages (conda) reads from a curated set of packages hosted in Anaconda-specific repositories. Unlike the other package managers we’ve talked about, Anaconda is truly cross-platform. Because the project has chosen its scope to be about the problems it wants to solve, not the platforms it’d like to support, conda ships packages to computers running macOS, Windows, and Linux.

Conclusion

Python’s cross-platform nature and popularity have long made it a language that needed a package manager separate and apart from traditional package managers. Its answer, pip, was incredibly naive on its own, but it allowed PyPI to flourish into an awesome repository of packages. It has created an ecosystem where tinkerers and pros alike can find libraries for just about any domain they’re looking for. Traditionally, as you ran into more advanced problems, like managing transitive dependencies, auxiliary tools like pipenv or conda stepped in. However, by adding a dependency solver to pip, Python is improving a first-party tool in a way that may add a sense of maturity to this already incredibly versatile language.