The new PostgreSQL 17 make dist
When the PostgreSQL project makes a release, the primary artifact of that is the publication of a source code tarball. That represents the output of all the work that went into the creation of the PostgreSQL software up to the point of the release. The source tarball is then used downstream by packagers to make binary packages (or file system images or installation scripts or similar things), or by some to build the software from source by hand.
Creating a source code tarball is actually quite tricky to do by hand. Of course, you could just run “tar” over a currently checked out source tree. But you need to be careful that the checkout is clean and not locally modified. You need to ensure that all the files that belong in the tarball end up there and no other files. There are certain source control files that you don’t want to include in the tarball. File permissions and file ownership need to be sane. And all this should work consistently across platforms and across time. Fortunately, this has all been scripted and has been pretty reliable over the years.
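To illustrate the kind of care involved, here is a minimal Python sketch of packing files with normalized metadata. It is not the script the PostgreSQL project uses. The function name make_tarball, the fixed mode 0644, the root ownership and the zero timestamp are all assumptions chosen for the example.

```python
import io
import tarfile


def make_tarball(files: dict[str, bytes], prefix: str) -> bytes:
    """Pack files (relative path -> content) into an uncompressed tar.

    Hypothetical helper: ownership, permissions and timestamps are forced
    to fixed values and members are added in sorted order, so the result
    does not depend on the machine or the moment it was created.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for path in sorted(files):          # stable member order
            data = files[path]
            info = tarfile.TarInfo(name=f"{prefix}/{path}")
            info.size = len(data)
            info.mode = 0o644               # sane, fixed permissions
            info.uid = info.gid = 0         # no local user IDs leak in
            info.uname = info.gname = "root"
            info.mtime = 0                  # no local timestamps leak in
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With this kind of normalization, packing the same set of files twice gives the same bytes regardless of the order in which the files were listed.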
Additionally, a PostgreSQL source code tarball has included various prebuilt files. These are files that are not actually part of the source code checked into Git, but they would be built as part of a normal compilation. For example, a source code tarball has included prebuilt Bison and Flex files, various .c and .h files generated by Perl scripts, and HTML and man page documentation built from DocBook XML files. The reason for this is a mix of convenience and traditional practice. All these output files are platform-independent and independent of build options. So everyone will get the same ones anyway, so we might as well prebuild them. Also, that way, users of the source tarball won’t need the tools to build these files. For example, you didn’t actually need Perl to build PostgreSQL from a source tarball, because all the files generated from Perl scripts were already built. Also, historically (very historically), PostgreSQL was pushing the limits of Bison and Flex (or various other yacc and lex implementations, when those were still supported), so it was convenient and less error-prone to give everyone the same prebuilt Bison and Flex output files.
This system had two major problems, which eventually led us to get rid of it in PostgreSQL 17.
First, implementing and maintaining this arrangement in the build system is quite tricky. You need to carefully maintain the different states of “clean source code”, “partially built source code”, and “fully built source code”, and the commands to transition between them. (This was make distprep and make all, and then make clean and make maintainer-clean to move the other way.) Making it work with out-of-tree (“vpath”) builds was extremely weird: If you built from a source tarball, the Bison (etc.) output files were in the source directory, but if you built from a Git checkout, the Bison output files were in the build directory, and you needed to support both of these sanely. Finally, the new Meson build system is extremely allergic to writing build output into the source directory.
Some of the historical reasons are also obsolete. It’s not a problem anymore to get a good version of Bison and Flex installed. Everybody can easily get Perl installed nowadays. The documentation build can still be a bit tricky, but it’s generally much easier and more robust than a few decades ago.
Second, a lot more attention is nowadays paid to the software supply chain. There are security and legal reasons for this. When users install software, they want to know where it came from, and they want to be sure that they got the right thing, not some fake version or some version of dubious legal provenance.
The downstream packaging practice has already paid attention to this for many years. Packages or package repositories are cryptographically signed, so you can be sure that what you install came from a trustworthy source. There have also been efforts to make binary builds reproducible, so that you can be sure that the binary files in your binary package are what you’d expect them to be. Some packagers have policies that everything needs to be built from source, so they’d just delete and rebuild the prebuilt files anyway.
At the other end of the software production pipeline, using Git as the source control system gives some integrity guarantees that you are getting the same source code that everybody else is getting. So if I’m looking at commit b18b3a8150dbb150124bd345e000d6dc92f3d6dd and I see that same commit on various public servers, that’s probably the same commit that everybody else is seeing. And I can check what the parents of that commit are and how the code got to that point and so on.
But what we didn’t have until now is a transparent and reproducible way to get from that commit to the release tarball.
The way the tarball creation worked was that the person preparing the release ran make dist on a machine that is specially kept “clean” for that purpose. How can a third party verify this process? The produced tarball was not perfectly reproducible. If you ran make dist yourself, you’d get a similar but not identical tarball. A way to verify whether a tarball was sane was to unpack it and diff the contents against a source directory or an unpacked tarball that you made yourself. But this requires manual judgment. The Bison and Flex files won’t be the same unless you used the identical versions. The built documentation also won’t be perfectly identical. File timestamps will be different. Also, you’d need to carefully check manually whether any files are missing or whether there are extra ones. This was not fully satisfactory.
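The unpack-and-diff check described above could be sketched roughly as follows in Python. The function name compare_trees and the shape of its report are assumptions for illustration. The sketch only lists what differs; deciding whether a difference is harmless (a different Bison version, say) is still the manual judgment mentioned above.

```python
import filecmp
import os


def compare_trees(left: str, right: str) -> dict[str, list[str]]:
    """Compare two unpacked source trees file by file (hypothetical helper).

    Returns the relative paths that exist only on one side and the paths
    present on both sides whose contents differ byte for byte.
    """
    def listing(root: str) -> set[str]:
        found = set()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                found.add(os.path.relpath(os.path.join(dirpath, name), root))
        return found

    in_left, in_right = listing(left), listing(right)
    differ = sorted(
        path
        for path in in_left & in_right
        # shallow=False forces a content comparison instead of trusting
        # size and modification time
        if not filecmp.cmp(
            os.path.join(left, path), os.path.join(right, path), shallow=False
        )
    )
    return {
        "only_left": sorted(in_left - in_right),
        "only_right": sorted(in_right - in_left),
        "differ": differ,
    }
```

File timestamps are deliberately ignored here, since they differ between independently created tarballs anyway.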
(This work overlapped with the discovery of the XZ Utils backdoor, which exploited (among other things) exactly this non-reproducible tarball creation process. But I want to be clear that this is a complete coincidence: this work was not done as a response to that, nor are there any suspicions that any PostgreSQL tarballs might have been compromised.)
Anyway, with PostgreSQL 17, this is changed. The tarball generation is still invoked by calling make dist, but that internally now calls git archive. git archive packs the files belonging to a given Git commit into a tar (or other) archive in a reproducible and verifiable way. Therefore, if I now run make dist on a given commit (such as a release tag), then I will get the exact same (bit-identical) tarball as the next person. A packager can now trace the tarball back to the Git repository, and in turn an end-user can trace a binary package back to the Git repository as well (assuming reproducible builds, which is an ongoing struggle).
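Since the tarballs are now meant to be bit-identical, checking one comes down to comparing checksums. A minimal Python sketch, assuming you have a downloaded tarball and one you produced yourself from the same commit (the function names are made up for the example):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_tarball(downloaded: str, rebuilt: str) -> bool:
    """True if the two files are bit-identical (as far as SHA-256 can tell)."""
    return sha256_of(downloaded) == sha256_of(rebuilt)
```

Unlike the old unpack-and-diff approach, no manual judgment is involved: the answer is either yes or no.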
(To be clear, this change is only in major version PostgreSQL 17 and (presumably) future major versions. The maintenance releases for older major versions (16 and back) will continue to be published using the old method until they go out of support.)
Getting this to work was also not entirely straightforward. You need to carefully calibrate the git archive options to make sure this works consistently across platforms and local Git configurations, which is why we’re keeping the make dist invocation as a wrapper. Also, you need a new enough version of Git for this (2.38.0 or newer). Currently, the Git version used to produce the release tarballs (on the above-mentioned “clean” box) is too old to create reproducible .tar.gz tarballs, but it will create reproducible .tar.bz2 tarballs. The latter is what most users and packagers use anyway. If you care about this, avoid the .tar.gz for now.
(Update 2024-09-25: This has been addressed for the 17.0 final release. The Git version on the “clean” box was upgraded, and the .tar.gz tarballs are also reproducible now.)
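To give an idea of what calibrating the git archive options can look like, here is a Python sketch that assembles such a command line. The particular choices are assumptions for illustration and not necessarily what the PostgreSQL makefiles do. The sketch overrides core.autocrlf so that a local line-ending setting cannot change file contents, and it defines a tar.tar.bz2.command, because git archive only knows tar.gz out of the box. The function name, the output file naming and the example tag are likewise hypothetical.

```python
def git_archive_command(revision: str, distdir: str, fmt: str = "tar.gz") -> list[str]:
    """Build a git archive command line for a release tarball (hypothetical)."""
    # Neutralize a local setting that would otherwise alter line endings.
    cmd = ["git", "-c", "core.autocrlf=false"]
    if fmt == "tar.bz2":
        # bzip2 is not a built-in archive format; tell Git which filter to use.
        cmd += ["-c", "tar.tar.bz2.command=bzip2 -c"]
    cmd += [
        "archive",
        f"--format={fmt}",
        f"--prefix={distdir}/",      # top-level directory inside the tarball
        "-o", f"{distdir}.{fmt}",
        revision,
    ]
    return cmd


# Example: the command for a tag such as REL_17_0 (not executed here).
print(" ".join(git_archive_command("REL_17_0", "postgresql-17.0")))
```

The resulting list can be passed to subprocess.run(..., check=True) with the working directory set to a Git checkout. Keeping such details in one wrapper is the same reason the make dist target is still there.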
I think this is progress, if you care about software supply chain integrity. (It’s also a relief if you care about maintaining the build system.) There are certainly more things that could be done. One thing mentioned above is that reproducible builds don’t work for PostgreSQL in all situations. My understanding is that this needs to be fixed elsewhere, though. Another topic is more traceability about how things get into the Git repository. The make dist change only ensures that once code is in the Git repository, you can trace it from there, ideally all the way to the end user installation. There are, of course, various technical and social processes in the PostgreSQL developer community that monitor the integrity of the source code, but there is nothing currently that checks in a computerized, cryptographic way the origin of what goes into the Git repository. So something like signed commits might be worth looking into in the future in order to improve this further.