Sunday, August 23, 2015

Dissecting Crates.io: Bare Minimum Mirror

Recently at $job I spent a Hackweek making Rust ready for production deployments. Part of that was figuring out what builds were going to look like. One of the really nice things about developing Rust code is the included package manager, cargo, and all of the crates available at crates.io. One especially nice feature of crates.io, compared to other package repositories I have experience with, is that it never allows a user to overwrite or delete a published version[1]. It does support an action called yank[2], which lets you prevent new projects from depending on a version while keeping it available to projects that already have a Cargo.lock. This is great because I have much less to worry about with regard to repeatable builds when using cargo with crates.io.

Unfortunately we're not going to want to rely on crates.io being up to push code to production. While crates.io has been mostly stable, there have been a few issues with DNS resolution in the past, and I can't rely on that stability when I need to push an emergency bug fix. Also, I might have development and/or build servers that don't have access to the internet to begin with. With other languages I'd just throw up a caching mirror and point my package manager at that. Rust, however, is still pretty young, and I wasn't able to find anything about mirroring crates.io other than a single issue[3] on GitHub. With that, I set out to learn more about how cargo and crates.io interact and what it would take to set up a bare minimum mirror.

Index

The first thing I need to do is figure out how to point cargo at my "mirror", but I don't even know what protocol cargo speaks. The docs on crates.io mention updating the index attribute under the [registry] section of your .cargo/config, but they don't give an example URL or describe what is expected. In the issue I mentioned above it was stated that you'd need to clone the official index[4] to create a mirror. Browsing through the cargo source[5] confirms that cargo expects a git repository. Navigate over to the repo and right away you'll notice that the top level contains a lot of two-letter directories, directories for 1, 2, and 3 letter packages, and a config.json. Sharding files into subdirectories like this is a fairly common strategy for limiting the number of files in any one directory. I'd elaborate more on the layout but there's a fantastic doc comment[6] in the cargo repo that you can refer to.
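To make the layout concrete, here is a minimal Python sketch of how a crate name maps to its file in the index. It is based on my reading of the layout rather than copied from cargo: the function name index_path is made up for illustration, and both the lowercasing step and the 3/<first letter>/ form for three-letter names are assumptions to check against the doc comment.

```python
def index_path(crate_name):
    """Return the relative path of a crate's file inside the index repo."""
    # Assumption: index paths are built from the lowercased crate name.
    name = crate_name.lower()
    if not name:
        raise ValueError("crate name must not be empty")
    if len(name) == 1:
        return f"1/{name}"
    if len(name) == 2:
        return f"2/{name}"
    if len(name) == 3:
        # Assumption: three-letter names get one extra level, the first letter.
        return f"3/{name[0]}/{name}"
    # Everything else: first two letters, then the next two, then the name.
    return f"{name[:2]}/{name[2:4]}/{name}"


print(index_path("sysconf"))  # sy/sc/sysconf
```

The sysconf result matches the path used in this post; the short-name cases are the part worth double-checking.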

An example of a file for the sysconf crate, found under sy/sc/sysconf, looks as follows:

{
    "name": "sysconf",
    "vers": "0.1.0",
    "deps": [
        {
            "name": "libc",
            "req": "*",
            "features": [""],
            "optional": false,
            "default_features": true,
            "target": null,
            "kind": "normal"
        }
    ],
    "cksum": "be72e7128262fbd8dcb2e3eecee6135ed1611e0a9d63feca600b4a19f297eb49",
    "features": {},
    "yanked": false
}

This translates pretty directly to the options available to you in a Cargo.toml file. The files themselves don't appear to be valid JSON documents; each is a newline-delimited list of JSON documents, one line per version (the example above is pretty-printed for readability). Since it's obvious no package data itself is going to be located here I decided to take a look at the config.json file next.
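Because each line is its own JSON document, an index file has to be parsed line by line rather than in one go. A small Python sketch of that, where the function names and the two-line sample are hypothetical and the sample is trimmed to three fields per entry:

```python
import json


def parse_index_file(text):
    """Parse an index file: one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unyanked_versions(entries):
    """Return the version strings that have not been yanked."""
    return [entry["vers"] for entry in entries if not entry.get("yanked", False)]


# Hypothetical two-line file, trimmed to three fields per entry.
sample = (
    '{"name": "example", "vers": "0.1.0", "yanked": false}\n'
    '{"name": "example", "vers": "0.1.1", "yanked": true}\n'
)

entries = parse_index_file(sample)
print([entry["vers"] for entry in entries])  # ['0.1.0', '0.1.1']
print(unyanked_versions(entries))            # ['0.1.0']
```

Passing the whole file to json.loads would fail as soon as it hit the second line, which is why the split comes first.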

Downloading Packages

Jumping into the file you can see it's pretty basic.

{
  "dl": "https://crates.io/api/v1/crates",
  "api": "https://crates.io/"
}

I could dig through more code but my hunch is that I only need to provide dl for a basic mirror. I cloned the index, pointed my .cargo/config at the clone, and checked in the following as my new config.json.

{
  "dl": "http://localhost:8080/api/v1/crates",
  "api": "http://localhost:8080/"
}

I threw up a mitmproxy just to watch the traffic flow through and ran a few cargo commands.

$ mitmproxy -b localhost -p 8080 -R http2https://crates.io/ --setheader="/~q/Host/crates.io"

At first I wasn't seeing anything come through, but then I realized that cargo does a decent amount of caching of package downloads. Once I purged the cache data from ~/.multirust/toolchains/beta/cargo/registry (~/.cargo/registry if you don't use multirust), I was able to see the traffic flow through the proxy. For my example I was able to confirm that only the download URL was being used at all. This means mirroring should be pretty simple since I don't have to provide a bunch of fancy APIs.
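The shape of that download request can be sketched in a few lines of Python. This reflects the traffic observed for this post (the dl value followed by /<name>/<version>/download) rather than anything taken from cargo's source, and download_url is a made-up helper name.

```python
def download_url(dl, name, version):
    """Build the URL a crate is fetched from, given the index's dl value."""
    # Assumption: the request is always <dl>/<name>/<version>/download.
    return f"{dl.rstrip('/')}/{name}/{version}/download"


print(download_url("http://localhost:8080/api/v1/crates", "libc", "0.1.10"))
# http://localhost:8080/api/v1/crates/libc/0.1.10/download
```

Since the path is that predictable, a mirror only needs to serve static files laid out the same way.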

With that, I should have enough information to throw together the world's most basic mirror.

Bare Minimum Mirror

At this point I'm going to get out of the way and let the commands speak for themselves. I'm going to create a new project with a single simple dependency that uses a custom index and download url. Let's start backwards.

File Server

gary@pixel:~/mirror$ mkdir fileserver
gary@pixel:~/mirror$ cd fileserver/
gary@pixel:~/mirror/fileserver$ dl_path="api/v1/crates/libc/0.1.10"
gary@pixel:~/mirror/fileserver$ mkdir -p "$dl_path"
gary@pixel:~/mirror/fileserver$ wget -o /dev/null -O $dl_path/download https://crates.io/$dl_path/download
gary@pixel:~/mirror/fileserver$ tree
.
└── api
    └── v1
        └── crates
            └── libc
                └── 0.1.10
                    └── download

5 directories, 1 file
gary@pixel:~/mirror/fileserver$ python -m SimpleHTTPServer 8080
Serving HTTP on 0.0.0.0 port 8080 ...
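An optional sanity check before serving a file is to compare it against the cksum field from its index entry. As far as I know that field is the hex-encoded SHA-256 of the downloaded crate archive, but treat that as an assumption; the helper names below are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_cksum(path, expected_cksum):
    """Compare a file's digest with the cksum value from the index entry."""
    # Assumption: cksum is the hex SHA-256 of the file served at .../download.
    return sha256_of_file(path) == expected_cksum.lower()
```

For the layout above this would be called as matches_cksum("api/v1/crates/libc/0.1.10/download", cksum), with cksum copied from the libc 0.1.10 line in the index.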

Index

gary@pixel:~/mirror$ git clone git@github.com:rust-lang/crates.io-index.git
Cloning into 'crates.io-index'...
remote: Counting objects: 75586, done.
remote: Compressing objects: 100% (137/137), done.
remote: Total 75586 (delta 67), reused 0 (delta 0), pack-reused 75441
Receiving objects: 100% (75586/75586), 9.73 MiB | 2.26 MiB/s, done.
Resolving deltas: 100% (41211/41211), done.
Checking connectivity... done.
gary@pixel:~/mirror$ cd crates.io-index/
gary@pixel:~/mirror/crates.io-index$ cat << EOF > config.json
> {
>   "dl": "http://localhost:8080/api/v1/crates",
>   "api": "http://localhost:8080/"
> }
> EOF
gary@pixel:~/mirror/crates.io-index$ git commit -a -m "Point at local fileserver"
[master 3089a25] Point at local fileserver
 1 file changed, 2 insertions(+), 2 deletions(-)
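If the heredoc needs to be repeated for several mirrors, the same edit can be scripted. A minimal Python sketch, assuming config.json only carries the dl and api keys shown in this post; mirror_config is a made-up name, and any other keys that might be present are passed through untouched.

```python
import json


def mirror_config(config_text, base_url):
    """Return config.json text with dl and api pointed at base_url."""
    config = json.loads(config_text)
    base = base_url.rstrip("/")
    config["dl"] = f"{base}/api/v1/crates"
    config["api"] = f"{base}/"
    return json.dumps(config, indent=2) + "\n"


original = (
    '{\n'
    '  "dl": "https://crates.io/api/v1/crates",\n'
    '  "api": "https://crates.io/"\n'
    '}\n'
)
print(mirror_config(original, "http://localhost:8080"))
```

The result still has to be written back to config.json and committed, since cargo reads the index from the git repository.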

Project

gary@pixel:~/mirror$ cargo new --bin use-mirror
gary@pixel:~/mirror$ cd use-mirror/
gary@pixel:~/mirror/use-mirror$ mkdir .cargo
gary@pixel:~/mirror/use-mirror$ cat << EOF > .cargo/config
> [registry]
> index = "file:///home/gary/mirror/crates.io-index"
> EOF
gary@pixel:~/mirror/use-mirror$ cat << EOF >> Cargo.toml
> [dependencies]
> libc = "0.1.10"
> EOF
gary@pixel:~/mirror/use-mirror$ cargo build
    Updating registry `file:///home/gary/mirror/crates.io-index`
 Downloading libc v0.1.10 (registry file:///home/gary/mirror/crates.io-index)
   Compiling libc v0.1.10 (registry file:///home/gary/mirror/crates.io-index)
   Compiling use-mirror v0.1.0 (file:///home/gary/mirror/use-mirror)

And back over in the fileserver terminal:

localhost - - [23/Aug/2015 23:16:24] "GET /api/v1/crates/libc/0.1.10/download HTTP/1.1" 200 -

Conclusion

As you can see, it's actually not much work at all to set up a simple mirror. One pain point that I had to tackle was keeping the index up to date while carrying a local commit. For now I'm using fancy git merge strategies to deal with potential future changes to config.json, but I'd love to see better support worked out here over time that doesn't require such hacks.