Distributing Large Files with PyPI Packages

This post is part of a collection on Code and Computers.

Sometimes you want to distribute a package on PyPI that has large data files associated with it. However, PyPI has a rather small limit on the size of packages, and while the limit can be raised, only some packages qualify for an increase. This is a guide to your options when you need to distribute large files for PyPI packages.

Data, data, everywhere... via OldBookIllustrations

Just Distribute the Data

At the time of writing, the default size limit on PyPI is 60MB, but that limit applies to the size of the package after compression, not to your raw data. So if your data compresses well you can just upload the package without any special preparation. For example, unidic-lite, a Japanese dictionary package I created, is around 250MB on disk, but compressed it's only 50MB, so I was able to upload it to PyPI as a normal package.

Do note that if you have large data, it may make sense to compress it yourself before uploading it to PyPI anyway, since that keeps the installed package small too. With unidic-lite the dictionary needs to be in uncompressed binary form at runtime, but for posuto, a library with Japanese postal code data, I distribute a compressed JSON file. That keeps the package to 3MB on disk even though the JSON is 75MB unzipped. Python's built-in gzip module makes it easy to decompress files on the fly on any platform.
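Reading a compressed file on the fly takes only a few lines. This is a minimal sketch rather than posuto's actual code; the function names are just for illustration.

```python
import gzip
import json


def save_compressed_json(obj, path):
    """Write an object as gzip-compressed JSON (done once, before packaging)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)


def load_compressed_json(path):
    """Read a gzip-compressed JSON file without unpacking it to disk first."""
    # Mode "rt" makes gzip decompress and decode to text as the file is read.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

The tradeoff is that the whole file gets decompressed every time it's loaded, so this fits data you read once at startup better than data you need random access to.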

Get a Size Limit Increase

There's always the option of requesting an increase to your package size limits from the PyPI maintainers. However, while it never hurts to ask, limit increases are generally only granted for large binaries. The rules about where and how to ask are confusing and have changed over time, but right now the right place to make a request seems to be pypi-support on GitHub.

Host Files Elsewhere

If you don't get a size limit increase and your files don't fit in PyPI's default limit, you'll have to host your files somewhere else. The best thing is to use a proper server to host the files, but for open source projects you might not have anyone willing to foot the bill. In that case you can use GitHub Releases. Once you've created a release you can attach files to it; this feature is mainly intended for binary builds, but anything is fine. The only limit is that a single file can't be over 2GB. Bandwidth is unlimited, though note downloads will be slow.
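Files attached to a release are served from a predictable URL, so fetching them from Python is straightforward. Here's a rough sketch using only the standard library; the owner, repository, tag, and file names you pass in are whatever your project uses, and the helper names are mine.

```python
import shutil
import urllib.request
from pathlib import Path


def release_asset_url(owner, repo, tag, filename):
    """Build the download URL for a file attached to a GitHub release."""
    return f"https://github.com/{owner}/{repo}/releases/download/{tag}/{filename}"


def download_file(url, dest):
    """Stream a URL to disk so a large file never has to fit in memory."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    # Write to a temporary name first so an interrupted download
    # is not mistaken for a complete file.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

For example, `download_file(release_asset_url("example-user", "example-data", "v1.0.0", "dict.zip"), "dict.zip")` would fetch a hypothetical `dict.zip` from the `v1.0.0` release.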

Download During Install

Now that you have your files hosted somewhere, you'll need to decide when to download them. One option is to download files during install via pip. This has the advantage that it doesn't require your user to run any extra commands - after pip install everything will just work.

There are disadvantages to this approach though. First, it means you'll have a lot of code in your setup.py, which can be hard to keep organized. It also means that your package can't be installed offline. You also can't prompt the user for configuration options; you can use environment variables for configuration, though that's unwieldy for anything beyond the most basic options, and awkward even in the simple case.
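To give a sense of what the environment variable route looks like, here's a sketch of the option parsing you'd call from your install hook in setup.py. The variable names (MYPACKAGE_DATA_URL, MYPACKAGE_SKIP_DOWNLOAD) and the default URL are made up for illustration.

```python
import os

# Placeholder URL; a real package would point at its own hosted data.
DEFAULT_DATA_URL = "https://example.com/mypackage/data.zip"

_TRUE = {"1", "true", "yes"}
_FALSE = {"", "0", "false", "no"}


def install_options(environ=None):
    """Read install-time options from (hypothetical) environment variables."""
    environ = os.environ if environ is None else environ
    raw = environ.get("MYPACKAGE_SKIP_DOWNLOAD", "").strip().lower()
    if raw in _TRUE:
        skip = True
    elif raw in _FALSE:
        skip = False
    else:
        # There is no way to ask the user during install, so fail loudly.
        raise ValueError(f"MYPACKAGE_SKIP_DOWNLOAD must be 0 or 1, got {raw!r}")
    return {
        "data_url": environ.get("MYPACKAGE_DATA_URL", DEFAULT_DATA_URL),
        "skip_download": skip,
    }
```

A user would then run something like `MYPACKAGE_SKIP_DOWNLOAD=1 pip install mypackage`. Even with just two options you can see how this gets unwieldy. Also keep in mind that pip doesn't run setup.py when it installs from a wheel, so a hook like this only fires for source distributions.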

Provide a Download Command

As an alternative to downloading data during install, you can provide a command to download the data. This is the option I went with for unidic, a very large Japanese dictionary package, and it has several advantages.

One is that you can give users various options relating to the data. For example, perhaps you have small, medium, and large variants of the data. If you use one PyPI package with a download command, you can let the user choose which one to use.
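A download command with a variant option might look roughly like this. It's a sketch using argparse, not the actual unidic command; the package name, variant URLs, and default data directory are all placeholders.

```python
import argparse
import urllib.request
from pathlib import Path

# Hypothetical variants and URLs; substitute wherever your data is hosted.
VARIANT_URLS = {
    "small": "https://example.com/mypackage/data-small.zip",
    "medium": "https://example.com/mypackage/data-medium.zip",
    "large": "https://example.com/mypackage/data-large.zip",
}
DEFAULT_DATA_DIR = Path.home() / ".mypackage"  # assumed location


def fetch(url, dest):
    """Download url to dest, creating the parent directory if needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))


def build_parser():
    parser = argparse.ArgumentParser(prog="mypackage")
    sub = parser.add_subparsers(dest="command", required=True)
    dl = sub.add_parser("download", help="download the data files")
    dl.add_argument("--variant", choices=sorted(VARIANT_URLS), default="small")
    dl.add_argument("--data-dir", default=str(DEFAULT_DATA_DIR))
    return parser


def main(argv=None, downloader=fetch):
    """Entry point; `downloader` can be swapped out for testing."""
    args = build_parser().parse_args(argv)
    url = VARIANT_URLS[args.variant]
    dest = Path(args.data_dir) / f"data-{args.variant}.zip"
    downloader(url, dest)
    print(f"Downloaded {args.variant} data to {dest}")
    return dest
```

If you call `main()` from your package's `__main__.py`, users can run `python -m mypackage download --variant large`.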

Another is that you can allow users to update the data without updating their PyPI package. This uses a trick I learned from spacy-models - rather than putting a URL for data to download directly in your package code, have your package download a list of URLs that point to the data. This way you can provide new data to existing installs just by updating a metadata file. This is especially useful if your data is updated on a frequent basis but your code isn't. (Some people may be upset that this allows non-reproducible behavior by introducing a dependency on external metadata. While that's true, it's a reasonable tradeoff when you're dealing with large data, which is difficult enough to version already.)
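Here's a sketch of the indirection. The manifest layout (a "latest" key plus a "versions" table) and the manifest URL are my own invention for this example, not the format spacy-models uses.

```python
import json
import urllib.request

# Placeholder; this small file is the only URL baked into the package.
MANIFEST_URL = "https://example.com/mypackage/manifest.json"


def load_manifest(url=MANIFEST_URL):
    """Fetch the small metadata file that lists the available data."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def resolve_data_url(manifest, version="latest"):
    """Look up the download URL for a data version in the manifest."""
    if version == "latest":
        version = manifest["latest"]
    try:
        return manifest["versions"][version]["url"]
    except KeyError:
        raise KeyError(f"no data version {version!r} in manifest") from None
```

With a manifest like `{"latest": "2020.05", "versions": {"2020.05": {"url": "..."}}}`, publishing new data just means adding an entry and bumping "latest". Letting users pass an explicit version instead of "latest" also gives back some of the reproducibility.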

This approach does have the downside that you have to ask your users to run an extra command, but hopefully that isn't too much of a burden. As an alternative, you can do what torchvision does with pretrained models and download the data the first time it's needed; this is pretty easy to use, but I prefer to avoid it since it can trigger unexpected delays if your user doesn't understand when the data will be downloaded.
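If you do go the download-on-first-use route, printing a message before the download starts takes some of the surprise out of it. This is a minimal sketch, not how torchvision implements it; the URL and cache path are placeholders.

```python
import sys
import urllib.request
from pathlib import Path

DATA_URL = "https://example.com/mypackage/data.bin"  # placeholder
DATA_PATH = Path.home() / ".cache" / "mypackage" / "data.bin"  # assumed location


def ensure_data(path=DATA_PATH, url=DATA_URL, download=urllib.request.urlretrieve):
    """Return the local data path, downloading the file only if it is missing."""
    path = Path(path)
    if not path.exists():
        # Tell the user what is happening so the delay is not a mystery.
        print(f"Downloading data from {url} to {path} ...", file=sys.stderr)
        path.parent.mkdir(parents=True, exist_ok=True)
        download(url, str(path))
    return path
```

Any function that needs the data calls `ensure_data()` first, so only the first call pays the download cost.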

That's all for this round. Do you have another approach to distributing large files? If so, tell me about it. Ψ

2020-05-25T14:25:35.347+0900