The new PostgreSQL 17 make dist

When the PostgreSQL project makes a release, the primary artifact is a source code tarball. It represents the output of all the work that went into the creation of the PostgreSQL software up to the point of the release. The source tarball is then used downstream by packagers to make binary packages (or file system images or installation scripts or similar things), or by some users to build the software from source by hand.

Creating a source code tarball is actually quite tricky to do by hand. Of course, you could just run “tar” over a currently checked out source tree. But you need to be careful that the checkout is clean and not locally modified. You need to ensure that all the files that belong in the tarball end up there and no other files. There are certain source control files that you don’t want to include in the tarball. File permissions and file ownership need to be sane. And all this should work consistently across platforms and across time. Fortunately, this has all been scripted and has been pretty reliable over the years.
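
To make those pitfalls concrete, here is a minimal Python sketch of a tarball builder that pins down member order, timestamps, permissions, and ownership. It is only an illustration of what "doing it by hand" involves, not the script PostgreSQL uses; the function name `build_tarball` and the choice to take file contents from an in-memory dictionary are assumptions made for the example.

```python
import io
import tarfile


def build_tarball(files: dict[str, bytes], prefix: str) -> bytes:
    """Pack files into an uncompressed tar archive with normalized metadata.

    `files` maps relative paths to contents; `prefix` is the top-level
    directory inside the archive. Both are hypothetical inputs for this sketch.
    """
    buf = io.BytesIO()
    # Pin the tar format so the bytes do not depend on the library default.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(files):  # stable member order
            data = files[name]
            info = tarfile.TarInfo(name=f"{prefix}/{name}")
            info.size = len(data)
            info.mtime = 0  # fixed timestamp instead of the local file time
            info.mode = 0o644  # sane, uniform permissions
            info.uid = info.gid = 0  # do not leak local ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With every piece of metadata fixed, two calls with the same contents return identical bytes regardless of the order in which the files were supplied. A real script would additionally have to decide which files belong in the archive and confirm that the checkout is unmodified.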

Additionally, a PostgreSQL source code tarball has traditionally included various prebuilt files. These are files that are not actually part of the source code checked into Git, but they would be built as part of a normal compilation. For example, a source code tarball has included prebuilt Bison and Flex files, various .c and .h files generated by Perl scripts, and HTML and man page documentation built from DocBook XML files. The reason for this is a mix of convenience and traditional practice. All these output files are platform-independent and independent of build options. Everyone would get the same ones anyway, so we might as well prebuild them. Also, that way, users of the source tarball won’t need the tools to build these files. For example, you didn’t actually need Perl to build PostgreSQL from a source tarball, because all the files generated from Perl scripts were already built. Also, historically (very historically), PostgreSQL was pushing the limits of Bison and Flex (or various other yacc and lex implementations, when those were still supported), so it was convenient and less error-prone to give everyone the same prebuilt Bison and Flex output files.

This system has two major problems, which led us to get rid of it in PostgreSQL 17.

First, implementing and maintaining this arrangement in the build system is quite tricky. You need to carefully maintain the different states of “clean source code”, “partially built source code”, and “fully built source code”, and the commands to transition between them. (This was make distprep and make all, and then make clean and make maintainer-clean to move the other way.) Making it work with out-of-tree (“vpath”) builds was extremely weird: If you built from a source tarball, the Bison (etc.) output files were in the source directory, but if you built from a Git checkout, the Bison output files were in the build directory, and you needed to support both of these sanely. Finally, the new Meson build system is extremely allergic to writing build output into the source directory.

Some of the historical reasons are also obsolete. It’s not a problem anymore to get a good version of Bison and Flex installed. Everybody can easily get Perl installed nowadays. The documentation build can still be a bit tricky, but it’s generally much easier and more robust than a few decades ago.

Second, a lot more attention is nowadays paid to the software supply chain. There are security and legal reasons for this. When users install software, they want to know where it came from, and they want to be sure that they got the right thing, not some fake version or some version of dubious legal provenance.

The downstream packaging practice has already paid attention to this for many years. Packages or package repositories are cryptographically signed, so you can be sure that what you install came from a trustworthy source. There have also been efforts to make binary builds reproducible, so that you can be sure that the binary files in your binary package are what you’d expect them to be. Some packagers have policies that everything needs to be built from source, so they’d just delete and rebuild the prebuilt files anyway.

At the other end of the software production pipeline, using Git as the source control system gives some integrity guarantees that you are getting the same source code that everybody else is getting. So if I’m looking at commit b18b3a8150dbb150124bd345e000d6dc92f3d6dd and I see that same commit on various public servers, that’s probably the same commit that everybody else is seeing. And I can check what the parents of that commit are and how the code got to that point and so on.

But what we didn’t have until now is a transparent and reproducible way to get from that commit to the release tarball.

The way the tarball creation works is that the person who prepares the release runs make dist on a machine that is specially kept “clean” for that purpose. How can a third party verify this process? The produced tarball was not perfectly reproducible. If you ran make dist yourself, you’d get a similar but not identical tarball. A way to verify whether a tarball was sane was to unpack it and diff the contents against a source directory or an unpacked tarball that you made yourself. But this requires manual judgment. The Bison and Flex files won’t be the same unless you used the identical versions. The built documentation also won’t be perfectly identical. File timestamps will be different. Also, you’d need to check carefully by hand whether any files are missing or extra. This was not fully satisfactory.
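
The manual comparison described above can be sketched in a few lines of Python. This is an illustrative helper, not an existing PostgreSQL tool; the names `content_digests` and `compare_tarballs` are made up for the example. It hashes the contents of each regular file and ignores timestamps and ownership, so it reports only files that are missing, extra, or different in content.

```python
import hashlib
import io
import tarfile


def content_digests(tar_bytes: bytes) -> dict[str, str]:
    """Map each regular file in the archive to the SHA-256 of its contents."""
    digests = {}
    # "r:*" lets tarfile detect gzip/bzip2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def compare_tarballs(reference: bytes, candidate: bytes) -> dict[str, list[str]]:
    """Report how `candidate` differs from `reference`, ignoring metadata."""
    ref, cand = content_digests(reference), content_digests(candidate)
    return {
        "missing": sorted(ref.keys() - cand.keys()),  # only in reference
        "extra": sorted(cand.keys() - ref.keys()),  # only in candidate
        "changed": sorted(n for n in ref.keys() & cand.keys() if ref[n] != cand[n]),
    }
```

Even with such a helper, every entry under "changed" still needs a human to decide whether the difference is harmless (say, a different Bison version) or suspicious, which is exactly the manual judgment that made the old process unsatisfactory.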

(This work overlapped with the discovery of the XZ Utils backdoor, which exploited (among other things) exactly this non-reproducible tarball creation process. But I want to be clear that this is a complete coincidence, and this work was neither done as a response to that nor are there any suspicions that any PostgreSQL tarballs might have been compromised.)

Anyway, with PostgreSQL 17, this is changed. The tarball generation is still invoked by calling make dist, but that internally now calls git archive. git archive packs the files belonging to a given Git commit into a tar (or other) archive in a reproducible and verifiable way. Therefore, if I now run make dist on a given commit (such as a release tag), then I will get the exact same (bit-identical) tarball as the next person. A packager can now trace the tarball back to the Git repository, and in turn an end-user can trace a binary package back to the Git repository as well (assuming reproducible builds, which is an ongoing struggle).
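
Once the tarball is bit-identical, verification no longer needs judgment: comparing checksums is enough. The following Python sketch shows the idea; the function names and the file paths in the usage note are hypothetical, and it assumes you have already produced a local tarball from the release tag and downloaded the published one.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def is_bit_identical(local: Path, published: Path) -> bool:
    """True if both files have the same SHA-256 digest."""
    return sha256_of(local) == sha256_of(published)
```

For example, `is_bit_identical(Path("build/postgresql-17.0.tar.bz2"), Path("downloads/postgresql-17.0.tar.bz2"))` would return True only if the two files match byte for byte (both paths are placeholders).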

(To be clear, this change is only in major version PostgreSQL 17 and (presumably) future major versions. The maintenance releases for older major versions (16 and back) will continue to be published using the old method until they go out of support.)

Getting this to work was also not entirely straightforward. You need to carefully calibrate the git archive options to make sure this works consistently across platforms and local Git configurations, which is why we’re keeping the make dist invocation as a wrapper. Also, you need a new enough version of Git for this (2.38.0 or newer). Currently, the Git version used to produce the release tarballs (on the above-mentioned “clean” box) is too old to create reproducible .tar.gz tarballs, but it will create reproducible .tar.bz2 tarballs. The latter is what most users and packagers use anyway. If you care about this, avoid the .tar.gz for now.
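
If you want to check whether your own Git meets the minimum version stated above before trying to reproduce a tarball, a small Python sketch along these lines could do it. The helper names are invented for this example, and the parsing assumes the usual `git version X.Y.Z` output, possibly followed by a vendor suffix.

```python
import re

MIN_GIT = (2, 38, 0)  # minimum version stated in the text above


def parse_git_version(output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the output of `git --version`."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"cannot find a version number in {output!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def git_is_new_enough(output: str) -> bool:
    """True if the reported Git version is at least MIN_GIT."""
    return parse_git_version(output) >= MIN_GIT
```

The input string would typically come from something like `subprocess.run(["git", "--version"], capture_output=True, text=True, check=True).stdout`.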

(Update 2024-09-25: This has been addressed for the 17.0 final release. The Git version on the “clean” box was upgraded, and the .tar.gz tarballs are also reproducible now.)

I think this is progress, if you care about software supply chain integrity. (It’s also a relief if you care about maintaining the build system.) There are certainly more things that could be done. One thing mentioned above is that reproducible builds don’t work for PostgreSQL in all situations. My understanding is that this needs to be fixed elsewhere, though. Another topic is more traceability about how things get into the Git repository. The make dist change only ensures that once code is in the Git repository, you can trace it from there, ideally all the way to the end user installation. There are, of course, various technical and social processes in the PostgreSQL developer community that monitor the integrity of the source code, but there is nothing currently that checks in a computerized, cryptographic way the origin of what goes into the Git repository. So something like signed commits might be worth looking into in the future in order to improve this further.