Joel Kleier's Electric Froth

A nifty way to have a short-term pypi cache

by Joel Kleier on 2018-04-17


Why run a pypi cache?

  1. Insulate your deployments against upstream changes (e.g. forced HTTPS, forced redirects to new service locations)

  2. Be friendlier to upstream, especially during development and large-scale updates

Is this the only way?

Nope, probably not even the best way for most circumstances.

Take a look at devpi, or maybe step it up to JFrog (and the like) if you need something a bit more thorough with more features.

The main benefit of the reverse proxy method, in my opinion, is that it’s just some config added to software I already manage, which makes installation and management that much easier.

So what to do?

I found this gist on GitHub initially, and for the most part it works fine. I wanted the cache hosted on a subpath, though, and some adjustments are needed for the recent move from pypi.python.org to pypi.org.

First, you need to add this to your nginx.conf in the http block:

proxy_cache_path /var/cache/nginx/pypi levels=1:2 keys_zone=pypi:16m inactive=1M max_size=50G;

Adjust as necessary. As written, entries that haven’t been requested for a month are removed (inactive=1M; in nginx time units a capital M means months, while a lowercase m means minutes), and the cache stores at most 50GB on disk. The key zone is set to 16m; the nginx documentation estimates about 8,000 keys per megabyte, so roughly 128,000 cached responses can be tracked at once. I recommend referring to the nginx documentation for more details about proxy_cache_path.
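As a rough illustration of adjusting these values, here is a sketch for a smaller cache. The specific numbers (4m, 7d, 5g) are assumptions picked for the example, not recommendations; the keys-per-megabyte estimate is the one given in the nginx documentation. It would replace the proxy_cache_path line shown earlier rather than sit next to it, since both define the same zone name.

```nginx
# Hypothetical values for a small dev box; tune to your own disk and traffic.
# keys_zone=pypi:4m -> roughly 32,000 keys at about 8,000 keys per megabyte
# inactive=7d       -> evict entries nobody has requested for a week
# max_size=5g       -> cap the on-disk cache at 5 gigabytes
proxy_cache_path /var/cache/nginx/pypi levels=1:2 keys_zone=pypi:4m inactive=7d max_size=5g;
```

The zone name (pypi) is what the proxy_cache directive refers to, so it should stay the same unless you rename it in both places.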

Second, add the upstreams for pypi.org and files.pythonhosted.org:

# having the same upstream server listed twice will force nginx to retry
# connections, and not fail the request immediately. It's a bit hacky, but
# works!
#
# the new https://pypi.org site now splits data between two domains --
# pypi.org is for the package metadata, and files.pythonhosted.org is
# exclusively for the package data itself.
#
upstream pypi {
    server pypi.org:443;
    server pypi.org:443;
    keepalive 16;
}
upstream pythonhosted {
    server files.pythonhosted.org:443;
    server files.pythonhosted.org:443;
    keepalive 16;
}

Third, if you’re like me and want to proxy from a subpath ("/pypi/" in this case), you’ll want a site configuration that looks something like the following:

server {
    listen 80;

    # configure the correct cache to use, set the key, make sure only one
    # request at a time will update the cached object, and add 'updating'
    # to the cases where stale cached objects can be used
    proxy_cache pypi;
    proxy_cache_key $uri;
    proxy_cache_lock on;
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;

    # set the appropriate headers for dealing with pypi.org. HTTP/1.1 plus an
    # empty Connection header is what lets the upstream keepalive pool work, so
    # the client doesn't get to control keep-alive to the upstream. Clearing
    # Accept-Encoding makes the upstream send uncompressed responses, which the
    # sub_filter rules below need in order to rewrite links.
    proxy_http_version 1.1;
    proxy_set_header Host pypi.org;
    proxy_set_header Connection "";
    proxy_set_header Accept-Encoding "";

    # this makes sure to make Location redirects use the subpath we want. Not sure
    # this is strictly necessary for the new https://pypi.org -- there may be
    # another value that should be here.
    proxy_redirect /simple /pypi/simple;

    # the index page for the pypi repository is found at pypi.org/simple -- it's
    # an html document with a link to every package page, and every package page
    # is made up of html links to packages found on files.pythonhosted.org
    location ^~ /pypi/simple {
        # redirect /pypi/simple to /pypi/simple/ -- otherwise upstream pypi.org will redirect
        # anyway, but the redirect will be to pypi.org/simple/
        rewrite ^(.*[^/])$ $1/ permanent;

        # replace urls to pypi.org and files.pythonhosted.org with the url your
        # server is using instead. See http://nginx.org/en/docs/http/ngx_http_sub_module.html#sub_filter
        # for documentation -- these require the 'full' or 'extras' build of nginx
        # in the ubuntu/debian repository
        sub_filter 'https://pypi.org' $scheme://$host/pypi;
        sub_filter 'https://files.pythonhosted.org/packages' $scheme://$host/pypi/packages;
        sub_filter '/simple' /pypi/simple;
        sub_filter_once off;

        # add a header to let us know whether nginx is returning a cached object, or had to retrieve it
        add_header X-Cache2 $upstream_cache_status;

        # the repository index and package pages should be cached for just 5 minutes,
        # as these change often enough that we want fairly regular fresh versions.
        proxy_cache_valid any 5m;

        # and proxy to the pypi.org upstream
        proxy_pass https://pypi/simple;
    }

    # this url is for maintaining a reverse proxy to files.pythonhosted.org --
    # the file hosting service for the new pypi.org
    location ^~ /pypi/packages/ {
        # add a header to let us know whether nginx is returning a cached object, or had to retrieve it
        add_header X-Cache2 $upstream_cache_status;

        # reset the host to correctly deal with files.pythonhosted.org
        proxy_set_header Host files.pythonhosted.org;

        # an individual package should change much less frequently, if at all, than
        # the index pages, so keeping them around for a month is probably fine.
        proxy_cache_valid any 1M;

        # and proxy to the files.pythonhosted.org upstream
        proxy_pass https://pythonhosted/packages/;
    }

    # this can be whatever other configuration you have
    location / {
        root /var/www;
        autoindex on;
    }
}
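One thing the configuration above leaves at its default is SNI. nginx does not send a server name during the TLS handshake with a proxied HTTPS server unless proxy_ssl_server_name is on, and when it is on, the name defaults to the host from proxy_pass, which here is the upstream name (pypi or pythonhosted) rather than the real hostname. Whether pypi.org and files.pythonhosted.org require SNI is not something this setup confirms, so treat the following as a sketch to try if the upstream handshake fails or the wrong certificate comes back:

```nginx
# inside the server block, next to the other proxy_* settings:
# send a server name (SNI) when connecting to the HTTPS upstreams
proxy_ssl_server_name on;

# inside "location ^~ /pypi/simple":
# name to send, since the proxy_pass host is just the upstream name "pypi"
proxy_ssl_name pypi.org;

# inside "location ^~ /pypi/packages/":
# same idea for the file hosting upstream
proxy_ssl_name files.pythonhosted.org;
```

Each proxy_ssl_name sits in its own location because the two upstreams answer to different hostnames, mirroring how the Host header is set per location.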

And that is it! If you’re having problems, I’d suggest turning on debug logging to see which files are coming from the cache and which are not:

error_log /var/log/nginx/debug.log debug;
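Debug-level output is verbose, and it is only available when nginx was built with --with-debug (nginx -V shows the configure arguments). A minimal sketch for limiting debug output to a single client uses debug_connection; the address 192.0.2.10 and the log path below are placeholders, not values from this setup:

```nginx
# main context of nginx.conf: everyone else logs at the usual level
error_log /var/log/nginx/error.log warn;

events {
    # only connections from this (placeholder) address get debug-level logging;
    # keep whatever else your events block already contains
    debug_connection 192.0.2.10;
}
```

Requests from that address are then logged at debug level while other clients stay at warn. The X-Cache2 header set in the site configuration is usually the quicker check: it carries $upstream_cache_status, whose values include HIT, MISS, EXPIRED, STALE, and UPDATING.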