Managing Python Packages
Python is an interpreted programming language that has become incredibly popular over the last decade or so. Its focus on readability and simplicity has created a language that newbies to the programming world can make use of quickly, and old pros find to be a feature-rich, modern alternative to older scripting languages like Perl or Bash. I decided to continue my series on package management with Python for a few reasons:
- The details vary a lot from C/C++.
- There are a lot of different tools and pieces to the Python ecosystem; it can be hard to get your head around when you’re just starting with the language.
- This month is a time of significant change for Python’s package management story.
Baseline Knowledge: A Typical Python Installation
While Python is a wildly popular language, its internals aren't preached about at universities the same way others are. So before we get into package management, let's do a quick high-level overview of a vanilla Python footprint. We can't say typical, because typical is more like this comic from Randall Munroe.
When you install Python, either from your package manager or downloaded from https://www.python.org, at least the two following essential things will happen:
- A binary executable named python3.8, or similar, is added to your machine.
- A baseline set of Python files expected to be on every computer with Python installed are copied to your machine. This is referred to as the standard library.
The executable is a program which reads Python source code from *.py files and executes them on your computer.
The standard library gets copied into an easily discovered directory on your computer, like /usr/lib/python3.8 or similar. All the default Python modules (similar to a namespace in other languages) live in a subdirectory called site-packages, with each Python module getting its own sub-folder filled with its Python source code in *.py files.
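You can ask the interpreter itself where these pieces landed. A minimal sketch using only the standard library (the exact paths it prints will vary by OS and Python version):

```python
import sys
import sysconfig

# Where the interpreter binary itself lives.
print("interpreter:  ", sys.executable)

# Where the standard library was copied to.
paths = sysconfig.get_paths()
print("stdlib:       ", paths["stdlib"])

# Where third-party modules get installed (the site-packages directory).
print("site-packages:", paths["purelib"])
```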
There are a lot more details here, but after typing up this section, it became pretty clear it would need its own blog post.
Using Python with Traditional Package Managers
In last week’s post on traditional package managers, I mentioned that traditional
package managers are just as capable of distributing Python programs as anything else; and it’s true! Most Linux
distributions publish guidelines for how to package
python programs and libraries for their platform. Especially for libraries, these guidelines usually require applying a
prefix to the package name. For example, running dnf search python- on my Fedora 32 machine brings up a long list of packages.
Package definitions in .rpm and .deb files specify exactly where each new file they'll add to your computer will live. For that reason, installing Python packages this way will only work if you use exactly the Python setup that was used by the person building the package. Using Fedora 32, which distributes Python 3.8, but trying to install an .rpm built for Red Hat 8, which uses Python 3.6? Because the Python installations live in different directories (/usr/lib/python3.8 vs. /usr/lib/python3.6) on those platforms, trying to install a package from the wrong OS is a recipe for pain. This just reinforces a key take-away about using traditional package managers: they're most valuable if you yield all version control to them and work exclusively with dependencies they provide for your OS.
When running in a homogeneous environment like a fleet of web servers, that's easily done. But what about when you're a developer banging out an open-source client tool or library? Announcing that you will limit your audience to just people who picked one of a few Linux distributions would be a death knell. So what do those people do?
One can, and frequently does, bypass traditional package managers when working in Python. Instead, it's common to use the "Package Installer for Python" (pip for short).
pip downloads either libraries or applications from a
Python repository and places them in the currently active python environment. The most popular repository by far is an
official, public one called the Python Package Index (PyPI, pronounced pie-pee-eye), found at https://pypi.org. That said, folks can create their own public or private repositories so long as they conform to an agreed-upon format. The big advantage of this approach? All computers with
internet access can download whatever version of a dependency they need from PyPI.
This ubiquity means that a library or application developer doesn't need to sweat figuring out which version of a dependency is the lowest common denominator between Debian, Ubuntu, Arch, Fedora, and a million other operating systems. The developer is liberated from the additional cognitive load or time commitment that comes with maintaining separate SKUs based on target OS. Instead, that work is offloaded to liaisons for each operating system who are already curating an ecosystem of packages.
But without the testing rigor and formality of a big project like Debian or Red Hat, how does
pip make sure that the correct versions of dependencies are installed? There's actually a big change happening on that front at the time of posting. Historically,
pip has been an example of how incredibly useful even the most naive possible package
manager could be.
Essentially, each Python library and application provides a
setup.py file which specifies a compatible range of
versions for each dependency. To install your dependencies, pip evaluates each library in turn and immediately fetches whatever the most recent compatible version is from PyPI.
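A minimal setup.py sketch showing those compatible version ranges; the package name and pins here are hypothetical examples, not a real project:

```python
# setup.py -- a minimal sketch; "demo-app" and its pins are hypothetical.
from setuptools import setup, find_packages

setup(
    name="demo-app",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "requests>=2.20,<3",  # any 2.x release from 2.20 onward
        "click>=7.0",         # 7.0 or newer
    ],
)
```

With ranges like these, pip is free to grab whatever the newest matching release happens to be at install time.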
Doing this alone has a couple of drawbacks. First, you don't have a reproducible build. Every time your install runs, there's a chance a new version of a dependency gets pulled down. If that new version is ostensibly compatible, but
has a bug that breaks you, your users will be broken. In other words, you don’t have the ability to protect your users.
So why not have
setup.py point at a particular version? Programmatically keeping that list of dependencies up-to-date
is pretty clumsy (yes, I’ve seen it done), and if you’re a library developer it may pin your users to a version that
doesn’t work for them.
Second, I already mentioned
pip is naive, but until this
month it did not consider whether libraries
installed before or after the one currently being evaluated have different requirements. That means if you have two or
more libraries with a shared dependency that call for different versions,
pip will download and install that
dependency multiple times. Worst of all, you get whichever version of the shared dependency is called for last, even if
it contradicts a former version requirement.
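To make that failure mode concrete, here's a toy model of a naive installer (all package names hypothetical): requirements are processed in order, and the last one wins.

```python
# A toy model of the historical behavior: requirements are handled in
# order, and a later version request silently overwrites an earlier one.
def naive_install(requirements):
    installed = {}
    for package, version in requirements:
        installed[package] = version  # no check against earlier choices
    return installed

# lib_a wants shared-dep 1.4; lib_b, installed later, wants 2.0.
wanted = [("shared-dep", "1.4"), ("shared-dep", "2.0")]
print(naive_install(wanted))  # {'shared-dep': '2.0'} -- lib_a now breaks
```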
Before pip introduced a dependency solver, Python users could get around these problems using a much more stripped-down and focused file called requirements.txt. This is passed into
pip and acts as a lock file, where each
line specifies an exact version of a dependency needed directly by you and/or indirectly as a transitive dependency from
another library. This allows for a reproducible build, because only the packages in the
requirements.txt are pulled
in, and any dependency conflicts could be worked out ahead of time, either manually or with an auxiliary tool. Using the
two files in tandem by having an abstract set of requirements in your
setup.py and a concrete set of dependencies in a
requirements.txt allows for a reproducible build for your tests and flexibility for library consumers.
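For reference, a hypothetical requirements.txt acting as a lock file; every entry, including transitive dependencies, is pinned to an exact version (these names and versions are illustrative only):

```
# requirements.txt -- direct and transitive dependencies, all pinned
requests==2.24.0
certifi==2020.6.20      # pulled in by requests
chardet==3.0.4          # pulled in by requests
idna==2.10              # pulled in by requests
urllib3==1.25.10        # pulled in by requests
click==7.1.2
```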
However, the story is not complete. Nothing we’ve discussed above would allow for different Python applications to run
with different versions of a library they depend on.
pip and traditional package managers both pull down one version
of a package and place its source code in module directories alongside the standard library. What is a developer to do
if they don’t want to be tied to whichever versions other applications are using?
When you start
python, it reads a couple of environment variables: PYTHONHOME and PYTHONPATH. Using these variables, the interpreter will look in non-default places for the standard library or other modules, respectively.
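A quick sketch of the PYTHONPATH half of that mechanism; the directory name here is made up, and entries are added to the search path whether or not they actually exist:

```python
import os
import subprocess
import sys

# Launch a child interpreter with an extra (hypothetical) directory on
# PYTHONPATH and show that it lands on the module search path.
env = dict(os.environ, PYTHONPATH="/tmp/my-extra-modules")
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env, capture_output=True, text=True,
).stdout
print("/tmp/my-extra-modules" in out)  # True
```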
Taking advantage of this, tools like
virtualenv, and the more modern
venv, create alternate
directories that will hold independent sets of Python modules.
pip also respects these environment variables, and will
install dependencies directly into the alternate folders.
This simple mechanism allows for bulkheads to be created, where each application or project can have its own set of dependencies. This can create a situation where multiple copies of the same version of the same library end up on your machine.
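The same idea drives the standard library's venv module. A minimal sketch of creating an environment programmatically, equivalent to running python3 -m venv on the command line (the directory name is just an example):

```python
import sys
import tempfile
import venv
from pathlib import Path

# Create a throwaway virtual environment; with_pip=False keeps it fast.
target = Path(tempfile.mkdtemp()) / "demo-env"
venv.EnvBuilder(with_pip=False).create(target)

# The environment gets its own interpreter stub and configuration,
# isolated from the system-wide module directories.
bindir = "Scripts" if sys.platform == "win32" else "bin"
print((target / bindir).exists())        # True
print((target / "pyvenv.cfg").exists())  # True
```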
pipenv is a cool project which ties together the pieces we’ve discussed so far.
When you create a new project with
pipenv, it initializes a new virtual environment, and will manage the
libraries installed there for you. Instead of using a setup.py file to specify an abstract set of dependencies and a requirements.txt file for a concrete set of dependencies, it uses a declarative Pipfile together with a generated Pipfile.lock. And unlike pip, it's had a dependency solver from day one. For application projects, pipenv is a great
choice. However, for libraries, I prefer to stick with the vanilla tools to broaden the audience that can consume them.
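For reference, a hypothetical Pipfile; pipenv records the concrete, solved versions in a companion Pipfile.lock that it generates for you:

```
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
requests = ">=2.20,<3"

[dev-packages]
pytest = "*"
```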
What about Anaconda?
Anaconda is a tool suite for data scientists and machine learning. Among many other things, it provides its own package management solution. Similar to a traditional package manager, the tool used to download and install packages (conda) will read from a curated set of packages hosted in Anaconda-specific repositories. Unlike other
package managers we’ve talked about, Anaconda is truly cross-platform. Because the project has chosen its scope to be
about the problems it wants to solve, not the platforms it'd like to support, conda ships packages to computers
running Mac, Windows, and Linux.
Python’s cross-platform nature and popularity has long made it a language that needed a package manager separate and
apart from traditional package managers. Its own answer, pip, was incredibly naive but allowed PyPI to flourish
into an awesome repository of packages. It has created an ecosystem where tinkerers and pros alike can find libraries
for just about any domain they’re looking for. Traditionally, as you ran into more advanced problems, like managing
transitive dependencies, auxiliary tools like
conda stepped in. However, by adding a dependency solver to
pip, Python is improving a first-party tool that may add a sense of maturity to this already incredibly versatile language.