October 12 2025

How we incidentally uncovered a 7-year-old bug in gentoo-ci

Michał Górny (mgorny) • October 12, 2025, 9:14

“Gentoo CI” is the service providing periodic linting for the Gentoo repository. It is a part of the Repository mirror and CI project that I started in 2015. Of course, it all started as a temporary third-party solution, but it persisted, was integrated into Gentoo Infrastructure and grew organically into quite a monstrosity.

It’s imperfect in many ways. In particular, it has only some degree of error recovery and when things go wrong beyond that, it requires a manual fix. Often the “fix” is to stop mirroring a problematic repository. Over time, I’ve started having serious doubts about the project, and proposed sunsetting most of it.

Lately, things have been getting worse. What started as a minor change in the behavior of Git triggered a whole cascade of failures, leading to me finally announcing the deadline for sunsetting the mirroring of third-party repositories, and starting to rip non-critical bits out of it. Interestingly enough, this whole process led me to finally discover the root cause of most of these failures: a bug that has existed since a very early version of the code, but happened to be hidden by the hacky error recovery code. Here’s the story of it.

Repository mirror and CI is basically a bunch of shell scripts with Python helpers run via a cronjob (repo-mirror-ci code). The scripts are responsible for syncing the lot of public Gentoo repositories, generating caches for them, publishing them onto our mirror repositories, and finally running pkgcheck on the Gentoo repository. Most of the “unexpected” error handling is set -e -x, with dumb logging to a file, and mailing on a cronjob failure. Some common errors are handled gracefully though: sync errors, pkgcheck failures and so on.

The whole cascade started when Git was upgraded on the server. The upgrade involved a change in behavior where git checkout -- ${branch} stopped working; you could only specify files after the --. The fix was trivial enough.

However, once the issue was fixed, I started periodically seeing sync failures from the Gentoo repository. The scripts had a very dumb way of handling sync failures: if syncing failed, they removed the local copy entirely and tried again. This generally made sense: say, if upstream renamed the main branch, git pull would fail but a fresh clone would be a cheap fix. However, the Gentoo repository is quite big, and once it got removed due to a sync failure, cloning it afresh from the Gentoo infrastructure failed.
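For illustration, the remove-and-retry behavior can be sketched as follows. This is not the actual repo-mirror-ci code: the function name sync_or_reclone and its two arguments are hypothetical, and git pull --ff-only is an assumed stand-in for whatever sync command the scripts really ran.

```shell
#!/bin/sh
# Hypothetical sketch of the old "delete and clone again" recovery.
# $1 = path of the local checkout, $2 = URL (or path) to clone from.
sync_or_reclone() {
    checkout=$1
    remote=$2
    if git -C "$checkout" pull -q --ff-only 2>/dev/null; then
        echo "synced"
    else
        # Every failure is handled the same way. stderr was discarded
        # above, so the reason for the failure is lost.
        rm -rf "$checkout"
        git clone -q "$remote" "$checkout"
        echo "recloned"
    fi
}
```

Note that a recovery of this shape cannot tell a renamed upstream branch from a broken local checkout: both simply end in a fresh clone.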

So when it failed, I did a quick hack: I cloned the repository manually from GitHub, replaced the remote and put it in place. Problem solved. Except a while later, the same issue surfaced. This time I kept an additional local clone, so I wouldn’t have to fetch it from the server, and added it again. But then, it got removed once more, and this was really getting tedious.

What I assumed then was that the repository was failing to sync due to some temporary problems, either network or Infrastructure related. If that were the case, it really made no sense to remove it and clone afresh. On top of that, since we are sunsetting support for third-party repositories anyway, there is no need for automatic recovery from issues such as branch name changes. So I removed that logic, to have sync fail immediately, without removing the local copy.

Now, this had important consequences. Previously, any failed sync would result in the repository being removed and cloned again, leaving no trace of the original error. On top of that, the logic that stopped the script early when the Gentoo repository failed meant that the actual error wasn’t even saved, leaving me only with the subsequent clone failures.

When the sync failed again (and of course it did), I was able to actually investigate what was wrong. What actually happened was that the repository wasn’t on a branch: the checkout was detached at some commit. Initially, I assumed this was some fluke, perhaps also related to the Git upgrade. I switched manually to master, and that fixed it. Then it broke again. And again.

Until then, I had been mostly dealing with the failures asynchronously: I wasn’t around at the time of the initial failure, and only started working on it after a few failed runs. However, finally the issue resurfaced so fast that I was able to connect the dots. The problem likely happened immediately after gentoo-ci hit an issue, and bisected it! So I started suspecting that there was another issue in the scripts, perhaps another case of missed --, but I couldn’t find anything relevant.

Finally, I started looking at the post-bisect code. What we were doing was calling git rev-parse HEAD prior to bisect, and then using that result in git checkout. This obviously meant that after every bisect, we ended up with a detached tree, i.e. precisely the issue I was seeing. So why didn’t I notice this before?
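To make the difference concrete, here is a minimal sketch (not the actual gentoo-ci code) contrasting two ways of remembering where to return after a bisect. The helper names are hypothetical, and a plain git checkout of the parent commit stands in for the bisect moving HEAD. Recording the branch name with git symbolic-ref --short HEAD is only an assumed fix; the post does not say what the real one looks like.

```shell
#!/bin/sh
# Print the current branch name, or "detached" if HEAD is not on a branch.
head_state() {
    git -C "$1" symbolic-ref -q --short HEAD || echo "detached"
}

# Buggy variant: remember the commit hash, as `git rev-parse HEAD` returns it.
restore_by_hash() {
    start=$(git -C "$1" rev-parse HEAD)
    git -C "$1" checkout -q HEAD~1      # stand-in for bisect moving HEAD
    git -C "$1" checkout -q "$start"    # returns to the commit, not the branch
    head_state "$1"
}

# Assumed fix: remember the branch name instead of the hash.
restore_by_branch() {
    start=$(git -C "$1" symbolic-ref --short HEAD)
    git -C "$1" checkout -q HEAD~1
    git -C "$1" checkout -q "$start"
    head_state "$1"
}
```

Both variants end up on the same commit. Only the second leaves HEAD attached to a branch, which a later git pull needs.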

Of course, because of the sync error handling. Once bisect broke the repository, the next sync failed and the repository got cloned again, and we never noticed anything was wrong. We only started noticing once cloning started failing. So after a few days of confusion and false leads, I finally fixed a bug that had been present for over 7 years in production code, and caused the Gentoo repository to be cloned over and over again whenever any bad commit happened.

July 26 2025

EPYTEST_PLUGINS and other goodies now in Gentoo

Michał Górny (mgorny) • July 26, 2025, 13:29

If you are following the gentoo-dev mailing list, you may have noticed that there’s been a fair number of patches sent for the Python eclasses recently. Most of them have been centered on pytest support. Long story short, I came up with what I believed to be a reasonably good design, and decided it’s time to stop manually repeating all the good practices in every ebuild separately.

In this post, I am going to briefly summarize all the recently added options. As always, they are all also documented in the Gentoo Python Guide.

The unceasing fight against plugin autoloading

The pytest test loader defaults to automatically loading all the plugins installed to the system. While this is usually quite convenient, especially when you’re testing in a virtual environment, it can get quite messy when you’re testing against system packages and end up with lots of different plugins installed. The results can range from slowing tests down to completely breaking the test suite.

Our initial attempts to contain the situation were based on maintaining a list of known-bad plugins and explicitly disabling their autoloading. The list of disabled plugins has gotten quite long by now. It includes both plugins that were known to frequently break tests, and those that frequently resulted in automagic dependencies.

While the opt-out approach allowed us to resolve the worst issues, it only worked when we knew about a particular issue. So naturally we’d miss some rarer issue, and learn only when arch testing workflows were failing, or users reported issues. And of course, we would still be loading loads of unnecessary plugins at the cost of performance.

So, we started disabling autoloading entirely, using the PYTEST_DISABLE_PLUGIN_AUTOLOAD environment variable. At first we only used it when we needed to; however, over time we started using it almost everywhere. After all, we don’t want the test suites to suddenly start failing because of a new pytest plugin installed.

For a long time, I have been hesitant to disable autoloading by default. My main concern was that it’s easy to miss a missing plugin. Say, if you ended up failing to load pytest-asyncio or a similar plugin, all the asynchronous tests would simply be skipped (verbosely, but it’s still easy to miss among the flood of warnings). However, eventually we started treating this warning as an error (and then pytest started doing the same upstream), and I have decided that going opt-in is worth the risk. After all, we were already disabling it all over the place anyway.

EPYTEST_PLUGINS

Disabling plugin autoloading is only the first part of the solution. Once you have disabled autoloading, you need to load the plugins explicitly: it’s not sufficient anymore to add them as test dependencies, you also need to add a bunch of -p switches. And then, you need to keep both the dependencies and the pytest switches in sync. So you’d end up with bits like:

BDEPEND="
  test? (
    dev-python/flaky[${PYTHON_USEDEP}]
    dev-python/pytest-asyncio[${PYTHON_USEDEP}]
    dev-python/pytest-timeout[${PYTHON_USEDEP}]
  )
"

distutils_enable_tests pytest

python_test() {
  local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
  epytest -p asyncio -p flaky -p timeout
}

Not very efficient, right? The idea then is to replace all that with a single EPYTEST_PLUGINS variable:

EPYTEST_PLUGINS=( flaky pytest-{asyncio,timeout} )
distutils_enable_tests pytest

And that’s it! EPYTEST_PLUGINS takes a bunch of Gentoo package names (without category — almost all of them reside in dev-python/, and we can special-case the few that do not), distutils_enable_tests adds the dependencies and epytest (in the default python_test() implementation) disables autoloading and passes the necessary flags.

Now, what’s really cool is that the function will automatically determine the correct argument values! This can be especially important if entry point names change between package versions — and upstreams generally don’t consider this an issue, since autoloading isn’t affected.

Going towards no autoloading by default

Okay, that gives us a nice way of specifying which plugins to load. However, weren’t we talking of disabling autoloading by default?

Well, yes — and the intent is that it’s going to be disabled by default in EAPI 9. However, until then there’s a simple solution we encourage everyone to use: set an empty EPYTEST_PLUGINS. So:

EPYTEST_PLUGINS=()
distutils_enable_tests pytest

…and that’s it. When it’s set to an empty list, autoloading is disabled. When it’s unset, it is enabled for backwards compatibility. And the next pkgcheck release is going to suggest it:
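The distinction between “unset” and “set to an empty list” is worth a closer look, because in bash it takes some care: ${EPYTEST_PLUGINS+x} only inspects element 0 of an array, so it expands to nothing for () as well. The sketch below shows one way to tell the states apart using declare -p. It is an illustration of how such a check could be written, not the eclass code, and the function name plugin_mode is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the state of EPYTEST_PLUGINS.
plugin_mode() {
    if ! declare -p EPYTEST_PLUGINS >/dev/null 2>&1; then
        # Variable was never declared at all.
        echo "unset: autoloading stays enabled"
    elif [[ ${#EPYTEST_PLUGINS[@]} -eq 0 ]]; then
        # Declared, but holds no elements.
        echo "empty: autoloading disabled, no plugins loaded"
    else
        echo "set: autoloading disabled, loading ${EPYTEST_PLUGINS[*]}"
    fi
}
```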

dev-python/a2wsgi
  EPyTestPluginsSuggestion: version 1.10.10: EPYTEST_PLUGINS can be used to control pytest plugins loaded

EPYTEST_PLUGIN* to deal with special cases

While the basic feature is neat, it is not a silver bullet. The approach used is insufficient for some packages, most notably pytest plugins that run pytest subprocesses without appropriate -p options, and expect plugins to be autoloaded there. However, after some more fiddling we arrived at three helpful features:

EPYTEST_PLUGIN_LOAD_VIA_ENV that switches explicit plugin loading from -p arguments to the PYTEST_PLUGINS environment variable. This greatly increases the chance that subprocesses will load the specified plugins as well, though it is more likely to cause issues such as plugins being loaded twice (and therefore is not the default). And as a nicety, the eclass takes care of finding out the correct values, again.
EPYTEST_PLUGIN_AUTOLOAD to reenable autoloading, effectively making EPYTEST_PLUGINS responsible only for adding dependencies. It’s really intended to be used as a last resort, and mostly for future EAPIs when autoloading will be disabled by default.
Additionally, EPYTEST_PLUGINS can accept the name of the package itself (i.e. ${PN}) — in which case it will not add a dependency, but load the just-built plugin.

How useful is that? Compare:

BDEPEND="
  test? (
    dev-python/pytest-datadir[${PYTHON_USEDEP}]
  )
"

distutils_enable_tests pytest

python_test() {
  local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
  local -x PYTEST_PLUGINS=pytest_datadir.plugin,pytest_regressions.plugin
  epytest
}

…and:

EPYTEST_PLUGINS=( "${PN}" pytest-datadir )
EPYTEST_PLUGIN_LOAD_VIA_ENV=1
distutils_enable_tests pytest

Old and new bits: common plugins

The eclass already had some bits related to enabling common plugins. Given that EPYTEST_PLUGINS only takes care of loading plugins, but not passing specific arguments to them, they are still meaningful. Furthermore, we’ve added EPYTEST_RERUNS.

The current list is:

EPYTEST_RERUNS=... that takes a number of reruns and uses pytest-rerunfailures to retry failing tests the specified number of times.
EPYTEST_TIMEOUT=... that takes a number of seconds and uses pytest-timeout to force a timeout if a single test does not complete within the specified time.
EPYTEST_XDIST=1 that enables parallel testing using pytest-xdist, if the user allows multiple test jobs. The number of test jobs can be controlled (by the user) by setting EPYTEST_JOBS with a fallback to inferring from MAKEOPTS (setting to 1 disables the plugin entirely).

The variables automatically add the needed plugin, so they do not need to be repeated in EPYTEST_PLUGINS.
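As a sketch, an ebuild combining all three could look as follows. The numeric values are arbitrary placeholders chosen for illustration, not recommendations, and the variables are set before distutils_enable_tests in the same way as EPYTEST_PLUGINS in the earlier examples.

```shell
# Placeholder values: retry failing tests up to 5 times,
# and abort any single test after 180 seconds.
EPYTEST_RERUNS=5
EPYTEST_TIMEOUT=180
# Run tests in parallel if the user allows multiple test jobs.
EPYTEST_XDIST=1
distutils_enable_tests pytest
```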

JUnit XML output and gpy-junit2deselect

As an extra treat, we ask pytest to generate a JUnit-style XML output for each test run that can be used for machine processing of test results. gpyutils now supplies a gpy-junit2deselect tool that can parse this XML and output a handy EPYTEST_DESELECT for the failing tests:

$ gpy-junit2deselect /tmp/portage/dev-python/aiohttp-3.12.14/temp/pytest-xml/python3.13-QFr.xml
EPYTEST_DESELECT=(
  tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_nonzero_passed
  tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_passed_to_create_connection
  tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_zero_not_passed
)

While it doesn’t replace due diligence, it can help you update long lists of deselects. As a bonus, it automatically collapses deselects to test functions, classes and files when all matching tests fail.

hypothesis-gentoo to deal with health check nightmare

Hypothesis is a popular Python fuzz testing library. Unfortunately, it has one feature that, while useful upstream, is pretty annoying to downstream testers: health checks.

The idea behind health checks is to make sure that fuzz testing remains efficient. For example, Hypothesis is going to fail if the routine used to generate examples is too slow. And as you can guess, “too slow” is more likely to happen on a busy Gentoo system than on dedicated upstream CI. Not to mention some upstreams plain ignore health check failures if they happen rarely.

Given how often this broke for us, we have requested an option to disable Hypothesis health checks long ago. Unfortunately, upstream’s answer can be summarized as: “it’s up to packages using Hypothesis to provide such an option, and you should not be running fuzz testing downstream anyway”. Easy to say.

Well, obviously we are not going to pursue every single package using Hypothesis to add a profile with health checks disabled. We did report health check failures sometimes, and sometimes got no response at all. And skipping these tests is not really an option, given that often there are no other tests for a given function, and even if there are — it’s just going to be a maintenance nightmare.

I’ve finally figured out that we can create a Hypothesis plugin — now hypothesis-gentoo — that provides a dedicated “gentoo” profile with all health checks disabled, and then we can simply use this profile in epytest. And how do we know that Hypothesis is used? Of course we look at EPYTEST_PLUGINS! All pieces fall into place. It’s not 100% foolproof, but health check problems aren’t that common either.

Summary

I have to say that I really like what we achieved here. Over the years, we learned a lot about pytest, and used that knowledge to improve testing in Gentoo. And after repeating the same patterns for years, we have finally replaced them with eclass functions that can largely work out of the box. This is a major step forward.

April 30 2025

Urgent - OSU Open Source Lab needs your help

GentooNews (https://www.gentoo.org/feeds/news.xml) • April 30, 2025, 5:00

Oregon State University’s Open Source Lab (OSL) has been a major supporter of Gentoo Linux and many other software projects for years. It is currently hosting several of our infrastructure servers as well as development machines for exotic architectures, and is critical for Gentoo operation.

Due to drops in sponsor contributions, OSL has been operating at a loss for a while, with the OSU College of Engineering picking up the rest of the bill. Now that university funding has been cut, this is no longer possible, and unless US$ 250,000 can be provided within the next two weeks, OSL will have to shut down. The details can be found in a blog post by Lance Albertson, the director of OSL.

Please, if you value and use Gentoo Linux or any of the other projects that OSL has been supporting, and if you (or the company you work for) are in a position to make funds available, contact the address in the blog post. Obviously, long-term corporate sponsorships would serve best here; for what it’s worth, OSL developers have ended up at almost every big US tech corporation by now. Right now, though, probably everything helps.

February 20 2025

Bootable Gentoo QCOW2 disk images - ready for the cloud!

Gentoo News (GentooNews) • February 20, 2025, 6:00

We are very happy to announce new official downloads on our website and our mirrors: Gentoo for amd64 (x86-64) and arm64 (aarch64), as immediately bootable disk images in qemu’s QCOW2 format! The images, updated weekly, include an EFI boot partition and a fully functional Gentoo installation; either with no network activated but a password-less root login on the console (“no root pw”), or with network activated, all accounts initially locked, but cloud-init running on boot (“cloud-init”). Enjoy, and read on for more!

Questions and answers

How can I quickly test the images?

We recommend using the “no root password” images and qemu system emulation. Both amd64 and arm64 images have all the necessary drivers ready for that. Boot them up, use as login name “root”, and you will immediately get a fully functional Gentoo shell. The set of installed packages is similar to that of an administration or rescue system, with a focus more on network environment and less on exotic hardware. Of course you can emerge whatever you need though, and binary package sources are already configured too.

What settings do I need for qemu?

You need qemu with the target architecture (aarch64 or x86_64) enabled in QEMU_SOFTMMU_TARGETS, and the UEFI firmware.

app-emulation/qemu
sys-firmware/edk2-bin

You should disable the USE flag “pin-upstream-blobs” on qemu and update edk2-bin at least to the 2024 version. Also, since you probably want to use KVM hardware acceleration for the virtualization, make sure that your kernel supports that and that your current user is in the kvm group.

For testing the amd64 (x86-64) images, a command line could look like this, configuring 8G RAM and 4 CPU threads with KVM acceleration:

qemu-system-x86_64 \
        -m 8G -smp 4 -cpu host -accel kvm -vga virtio -smbios type=0,uefi=on \
        -drive if=pflash,unit=0,readonly=on,file=/usr/share/edk2/OvmfX64/OVMF_CODE_4M.qcow2,format=qcow2 \
        -drive file=di-amd64-console.qcow2 &

For testing the arm64 (aarch64) images, a command line could look like this:

qemu-system-aarch64 \
        -machine virt -cpu neoverse-v1 -m 8G -smp 4 -device virtio-gpu-pci -device usb-ehci -device usb-kbd \
        -drive if=pflash,unit=0,readonly=on,file=/usr/share/edk2/ArmVirtQemu-AARCH64/QEMU_EFI.qcow2 \
        -drive file=di-arm64-console.qcow2 &

Please consult the qemu documentation for more details.

Can I install the images onto a real harddisk / SSD?

Sure. Gentoo can do anything. The limitations are:

you need a disk with sector size 512 bytes (otherwise the partition table of the image file will not work), see the “SSZ” value in the following example:

pinacolada ~ # blockdev --report /dev/sdb
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0   4000787030016   /dev/sdb

your machine must be able to boot via UEFI (no legacy boot)
you may have to adapt the configuration yourself to disks, hardware, …

So, this is an expert workflow.

Assuming your disk is /dev/sdb and has a size of at least 20 GByte, you can then use the utility qemu-img to decompress the image onto the raw device. Warning: this obviously overwrites the first 20 GByte of /dev/sdb (and with that the existing boot sector and partition table):

qemu-img convert -O raw di-amd64-console.qcow2 /dev/sdb
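Since the command above is destructive, a pre-flight check of the two stated requirements may be worthwhile. The following is a sketch only: the function name disk_suitable is hypothetical, and it assumes that “20GByte” means 20 GiB (20 × 1024³ bytes), which the text does not specify. The input values would come from blockdev --getss and blockdev --getsize64.

```shell
#!/bin/sh
# Hypothetical pre-flight check before writing the image to a disk.
# $1 = logical sector size in bytes, $2 = device size in bytes.
disk_suitable() {
    min_size=$((20 * 1024 * 1024 * 1024))   # assumed meaning of "20GByte"
    if [ "$1" -ne 512 ]; then
        echo "unsuitable: sector size is $1, need 512"
        return 1
    fi
    if [ "$2" -lt "$min_size" ]; then
        echo "unsuitable: device is smaller than $min_size bytes"
        return 1
    fi
    echo "ok"
}

# Intended use (as root, on the real target device):
#   disk_suitable "$(blockdev --getss /dev/sdb)" "$(blockdev --getsize64 /dev/sdb)"
```

The check only covers sector size and capacity. It cannot tell whether /dev/sdb is really the disk you mean to overwrite.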

Afterwards, you can and should extend the new root partition with xfs_growfs, create an additional swap partition behind it, possibly adapt /etc/fstab and the grub configuration, …

If you are familiar with partitioning and handling disk images, you can surely imagine more workflow variants; you might also find the qemu-nbd tool interesting.

So what are the cloud-init images good for?

Well, for the cloud. Or more precisely, for any environment where a configuration data source for cloud-init is available. If this is already provided for you, the image should work out of the box. If not, well, you can provide the configuration data manually, but be warned that this is a non-trivial task.

Are you planning to support further architectures?

Eventually yes, in particular (EFI) riscv64 and loongarch64.

Are you planning to support legacy boot?

No, since the placement of the bootloader outside the file system complicates things.

How about disks with 4096 byte sectors?

Well… let’s see how much demand this feature finds. If enough people are interested, we should be able to generate an alternative image with a corresponding partition table.

Why XFS as file system?

It has some features that ext4 is sorely missing (reflinks and copy-on-write), but at the same time is rock-solid and reliable.

February 01 2025

Tinderbox shutdown

ago (ago) • February 01, 2025, 7:08

Due to the lack of hardware, the Tinderbox (and CI) service is no longer operational.

I would like to take this opportunity to thank all the people who have always seen the Tinderbox as a valuable resource and who have promptly addressed bugs, significantly improving the quality of the packages we have in Portage as well as the user experience.

January 05 2025

2024 in retrospect & happy new year 2025!

Gentoo News (GentooNews) • January 05, 2025, 6:00

Happy New Year 2025! Once again, a lot has happened over the past months, in Gentoo and otherwise. Our fireworks were a bit early this year with the stabilization of GCC 14 in November, after a huge amount of preparations and bug fixing via the Modern C initiative. A lot of other programming language ecosystems also saw significant improvements. As always here we’re going to revisit all the exciting news from our favourite Linux distribution.

Gentoo in numbers

The number of commits to the main ::gentoo repository has remained at an overall high level in 2024, with a 2.4% increase from 121000 to 123942. The number of commits by external contributors has grown strongly from 10708 to 12812, now across 421 unique external authors.

The importance of GURU, our user-curated repository with a trusted user model, as entry point for potential developers, is clearly increasing as well. We have had 7517 commits in 2024, a strong growth from 5045 in 2023. The number of contributors to GURU has increased a lot as well, from 158 in 2023 to 241 in 2024. Please join us there and help packaging the latest and greatest software. That’s the ideal preparation for becoming a Gentoo developer!

Activity has picked up speed on the Gentoo bugtracker bugs.gentoo.org, where we’ve had 26123 bug reports created in 2024, compared to 24795 in 2023. The number of resolved bugs shows the same trend, with 25946 in 2024 compared to 22779 in 2023!

New developers

In 2024 we have gained two new Gentoo developers. They are in chronological order:

Matt Jolly (kangie): Matt joined us already in February from Brisbane, Australia - now finally pushing his commits himself, after already taking care of, e.g., Chromium for over half a year. In work life a High Performance Computing systems administrator, in his free time he enjoys playing with his animals, restoring retro computing equipment and gaming consoles (or using them), brewing beer, the beach, or the local climbing gym.
Eli Schwartz (eschwartz): In July, we were able to welcome Eli Schwartz from the USA as new Gentoo developer. A bookworm and big fan of Python, and also an upstream maintainer for the Meson Build System, Eli caught the Linux bug back in high school. Quoting him, “asking around for recommendations on distro I was recommended either Arch or Gentoo. Originally I made a mistake ;)” … We’re glad this got fixed now!

Featured changes and news

Let’s now look at the major improvements and news of 2024 in Gentoo.

Distribution-wide Initiatives

SPI associated project: As of March 2024, Gentoo Linux has become an Associated Project of Software in the Public Interest (SPI). SPI is a non-profit corporation founded to act as a fiscal sponsor for organizations that develop open source software and hardware. It provides services such as accepting donations, holding funds and assets, … and qualifies for 501(c)(3) (U.S. non-profit organization) status. This means that all donations made to SPI and its supported projects are tax deductible for donors in the United States. The intent behind becoming an SPI associated project is to gradually wind down operations of the Gentoo Foundation and transfer its assets to SPI.
GCC 14 stabilization: After a huge amount of work to identify and fix bugs and working with upstreams to modernize the overall source code base, see also the Modern C porting initiative, GCC 14 was finally stabilized in November 2024. Same as Clang 16, GCC 14 by default drops support for several long-deprecated and obsolete language constructs, turning decades-long warnings on bad code into fatal errors.
Link time optimization (LTO): Lots of progress has been made supporting LTO all across the Gentoo repository.
64bit time_t for 32bit architectures: Various preparations have begun to keep our 32-bit arches going beyond the year 2038. While the GNU C library is ready for that, the switch to a wider time_t data type is an ABI break between userland programs and libraries and needs to be approached carefully, in particular for us as a source-based distribution. Experimental profiles as well as a migration tool are available by now, and will be announced more widely at some point in 2025.
New 23.0 profiles: A new profile version 23.0, i.e. a collection of presets and configurations, has become the default setting; the old profiles are deprecated and will be removed in June 2025. The 23.0 profiles fix a lot of internal inconsistencies; for the user, they bring more toolchain hardening (specifically, CET on amd64 and non-lazy runtime binding) and optimization (e.g., packed relative relocations where supported) by default.
Expanded binary package coverage: The binary package coverage for amd64 has been expanded a lot, with, e.g., different use-flag combinations, Python support up to version 3.13, and additional large leaf packages beyond stable as for example current GCC snapshots, all for baseline x86-64 and for x86-64-v3. At the moment, the mirrors hold over 60GByte of package data for amd64 alone.
Two additional merchandise stores: We have licensed two additional official merchandise stores, both based in Europe: FreeWear (clothing, mugs, stickers; located in Spain) and BadgeShop (Etsy, Ebay; badges, stickers; located in Romania).
Handbook improvements and editor role: The Gentoo handbook has once again been significantly improved (though there is always still more work to be done). We now have special Gentoo handbook editor roles assigned, which makes the handbook editing effectively much more community friendly. This way, a lot of longstanding issues have been fixed, making installing Gentoo easier for everyone.
Event presence: At the Free and Open Source Software Conference (FrOSCon) 2024, visitors enjoyed a full weekend of hands-on Gentoo workshops. The workshops covered a wide range of topics, from first installation to ebuild maintenance. We also offered mugs, stickers, t-shirts, and of course the famous self-compiled buttons.
Online workshops: Our German support, Gentoo e.V., is grateful to the inspiring speakers of the 6 online workshops in 2024 on various Gentoo topics in German and English. We are looking forward to more exciting events in 2025.
Ban on NLP AI tools: Due to serious concerns with current AI and LLM systems, the Gentoo Council has decided to embrace the value of human contributions and adopt the following motion: “It is expressly forbidden to contribute to Gentoo any content that has been created with the assistance of Natural Language Processing artificial intelligence tools. This motion can be revisited, should a case been made over such a tool that does not pose copyright, ethical and quality concerns.”

Architectures

MIPS and Alpha fully supported again: After the big drive to improve Alpha support last year, now we’ve taken care of MIPS keywording all across the Gentoo repository. Thanks to renewed volunteer interest, both arches have returned to the forefront of Gentoo Linux development, with a consistent dependency tree checked and enforced by our continuous integration system. Up-to-date stage builds and the accompanying binary packages are available for both, in the case of MIPS for all three ABI variants o32, n32, and n64 and for both big and little endian, and in the case of Alpha also with a bootable installation CD.
32bit RISC-V now available: Installation stages for 32bit RISC-V systems (rv32) are now available for download, both using hard-float and soft-float ABI, and both using glibc and musl.
End of IA-64 (Itanium) support: Following the removal of IA-64 (Itanium) support in the Linux kernel and in glibc, we have dropped all ia64 profiles and keywords.

Packages

Slotted Rust: The Rust compiler is now slotted, allowing multiple versions to be installed in parallel. This allows us to finally support packages that have a maximum bounded Rust dependency and don’t compile successfully with a newer Rust (yes, that exists!), or ensure that packages use Rust and LLVM versions that fit together (e.g., firefox or chromium).
Reworked LLVM handling: In conjunction with this, the LLVM ebuilds and eclasses have been reworked so packages can specify which LLVM versions they support and dependencies are generated accordingly. The eclasses now provide much cleaner LLVM installation information to the build systems of packages, and therefore, e.g., also fix support for cross-compilation.
Python: In the meantime the default Python version in Gentoo has reached Python 3.12. Additionally, Python 3.13 is also available as stable, so once again we’re fully up to date with upstream.
Zig rework and slotting: An updated eclass and ebuild framework for the Zig programming language has been committed that hooks into the ZBS or Zig Build System, allows slotting of Zig versions, allows Zig libraries to be depended on, and even provides some experimental cross-compilation support.
Ada support: We finally have Ada support for just about every architecture. Yay!
Slotted Guile: The last but not least language that received the slotting treatment has been Guile, with three new eclasses, such that now Guile 1, 2, and 3 and their reverse dependencies can coexist in a Gentoo installation.
TeX Live 2023 and 2024: Catching up with our backlog, the packaging of TeX Live has been refreshed; TeX Live 2023 is now marked stable and TeX Live 2024 is marked testing.
DTrace 2.0: The famous tracing tool DTrace has come to Gentoo! All required kernel options are already enabled in the newest stable Gentoo distribution kernel; if you are compiling manually, the DTrace ebuild will inform you about required configuration changes. Internally, DTrace 2.0 for Linux builds on the BPF engine of the Linux kernel, so the build installs a gcc that outputs BPF code (which, btw, also is very useful for systemd).
KDE Plasma 6 upgrade: Stable Gentoo Linux has upgraded to the new major version of the KDE community desktop environment, KDE Plasma 6. As of end of 2024, in Gentoo stable we have KDE Gear 24.08.3, KDE Frameworks 6.7.0, and KDE Plasma 6.2.4. As always, Gentoo testing follows the newest upstream releases (and using the KDE overlay you can even install from git sources). In the course of KDE package maintenance we have over the past months and years contributed over 240 upstream backports to KDE’s Qt5PatchCollection.
Microgram Ramdisk: We have added µgRD (or ugrd) as a lightweight initramfs generator alternative to dracut. As a side effect of this our installkernel mechanism has gained support for arbitrary initramfs generators.

Physical and Software Infrastructure

Mailing list archives: archives.gentoo.org, our mailing list archive, is back, now with a backend based on public-inbox. Many thanks to upstream there for being very helpful; we were even able to keep all historical links to archived list e-mails working.
Ampere Altra Max development server: Arm Ltd. and specifically its Works on Arm team has sent us a fast Ampere Altra Max server to support Gentoo development. With 96 Armv8.2+ 64bit cores, 256 GByte of RAM, and 4 TByte NVMe storage, it is now hosted together with some of our other hardware at OSU Open Source Lab.

Finances of the Gentoo Foundation

Income: The Gentoo Foundation took in approximately $20,800 in fiscal year 2024; the dominant part (over 80%) consists of individual cash donations from the community.
Expenses: Our expenses in 2024 were, as split into the usual three categories, operating expenses (for services, fees, …) $7,900, only minor capital expenses (for bought assets), and depreciation expenses (value loss of existing assets) $13,300.
Balance: We have about $105,000 in the bank as of July 1, 2024 (which is when our fiscal year 2024 ends for accounting purposes). The draft financial report for 2024 is available on the Gentoo Wiki.
Transition to SPI: With the move of our accounts to SPI, see above, the web pages for individual cash donations now direct the funds to SPI earmarked for Gentoo, both for one time and recurrent donations. Donors of ongoing recurrent donations will be contacted and asked to re-arrange over the upcoming months.

Thank you!

As every year, we would like to thank all Gentoo developers and all who have submitted contributions for their relentless everyday Gentoo work. If you are interested and would like to help, please join us to make Gentoo even better! As a volunteer project, Gentoo could not exist without its community.

December 29 2024

FOSDEM 2025

GentooNews (https://www.gentoo.org/feeds/news.xml ) • December 29, 2024, 6:00

It’s FOSDEM time again! Join us at Université Libre de Bruxelles, Campus du Solbosch, in Brussels, Belgium. The upcoming FOSDEM 2025 will be held on February 1st and 2nd 2025. Our developers will be happy to greet all open source enthusiasts at our Gentoo stand (exact location still to be announced), which we will share this year with the Gentoo-based Flatcar Container Linux. Of course there’s also the chance to celebrate 25 years of compiling! Visit this year’s wiki page to see who’s coming and for more practical information.

December 20 2024

Poetry(-core), or the ultimate footgun

Michał Górny (mgorny) • December 20, 2024, 14:32

I’ve been complaining about the Poetry project a lot, in particular about its use (or more precisely, the use of poetry-core) as a build system. In fact, it pretty much became a synonym of a footgun for me — and whenever I’m about to package some project using poetry-core, or switching to it, I’ve learned to expect some predictable mistake. I suppose the time has come to note all these pitfalls in a single blog post.

The nightmarish caret operator

One of the first things Poetry teaches us is to pin dependencies, SemVer-style. Well, I’m not complaining. I suppose it’s a reasonable compromise between pinning exact versions (which just asks for dependency conflicts between different packages), and leaving users at the mercy of breaking changes in dependencies. The problem is, Poetry teaches us to treat these pins in a wholesale, one-size-fits-all manner.

What I’m talking about is the (in)famous caret operator. I mean, I suppose it’s quite convenient for the general case of semantic versioning, where e.g. ^1.2.3 is a handy shorthand for >=1.2.3,<2.0.0, and works quite well for the non-exactly-SemVer case of ^0.2.3 for >=0.2.3,<0.3.0. However, the way it is presented as a panacea means that most of the time people use it for all their dependencies, whether it is meaningful there or not.

So some pins are correct, some are too strict and others are too lax. In the end, you get the worst of both worlds: you annoy distro packagers like us who have to keep relaxing your dependencies, and you don’t help users who still get incidental breakage. Some people even use the caret operator for packages that clearly don’t fit it at all. My favorite example is the equivalent of the following dependency:

tzdata = "^2023.3"

This actually suffers from two problems. Firstly, this package clearly uses CalVer rather than SemVer, so pinning to 2023 seems fishy. Secondly, since we are talking about timezone data, there is really no point in pinning at all — on the contrary, you always want to use up-to-date timezone data.
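
To make the effect of the caret concrete, here is a minimal Python sketch that expands simple caret constraints into explicit ranges. It is not Poetry’s own code: the function name expand_caret is made up for illustration, it only handles plain dotted numeric versions (no pre-releases, wildcards or combined constraints), and it assumes the upper bound is obtained by bumping the leftmost non-zero component, which matches the two examples given earlier.

```python
def expand_caret(spec: str) -> str:
    """Expand a caret constraint such as "^1.2.3" into an explicit range.

    Hypothetical helper for illustration only; real constraint parsing
    handles many more cases than plain dotted integers.
    """
    if not spec.startswith("^"):
        raise ValueError(f"not a caret constraint: {spec!r}")
    parts = [int(p) for p in spec[1:].split(".")]

    # Bump the leftmost non-zero component; if every component is zero,
    # bump the last one that was written.
    bump = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = parts[:bump] + [parts[bump] + 1]

    # Pad both bounds to three components so the output is uniform.
    lower = parts + [0] * (3 - len(parts))
    upper = upper + [0] * (3 - len(upper))

    lower_s = ".".join(str(p) for p in lower)
    upper_s = ".".join(str(p) for p in upper)
    return f">={lower_s},<{upper_s}"
```

Under these assumptions, expand_caret("^2023.3") gives >=2023.3.0,<2024.0.0, i.e. the CalVer package from the example above would be locked to releases from a single year.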

The misleading include key

When people want to control which files are included in the source distributions, they resort to the include and exclude keys. And they add “obvious” blocks like the following:

include = [
    "CHANGELOG",
    "README.md",
    "LICENSE",
]

Except that this is entirely wrong! A plain entry in the include key is included in both the source and the binary distribution. Or, to put it more clearly, this code causes the following files to be installed:

/usr/lib/python3.12/site-packages/CHANGELOG
/usr/lib/python3.12/site-packages/LICENSE
/usr/lib/python3.12/site-packages/README.md

What you need to do instead is to annotate every file with the desired format, i.e.:

include = [
    { path = "CHANGELOG", format = "sdist" },
    { path = "README.md", format = "sdist" },
    { path = "LICENSE", format = "sdist" },
]

Yes, this is absolutely confusing and counterintuitive. On top of that, even today the first example in the linked documentation is clearly wrong. And people keep repeating this mistake over and over again — I know because I keep sending pull requests fixing them, and there is no end to them! In fact, I’ve even seen people adding additional entries without the format just below entries that did have it!

Schrödinger’s optional dependency

Poetry has a custom way of declaring optional dependencies. You declare them just like a regular dependency, and add an optional key to it, e.g.:

[tool.poetry.dependencies]
python = "^3.7"
filetype = "^1.0.7"
deprecation = "^2.1.0"
# yaml-plugin extra
"ruamel.yaml" = {version = "^0.16.12", optional = true}

So that last dependency is optional, right? Well, not necessarily! It is not, unless you actually add it to some extra, such as:

[tool.poetry.extras]
yaml-plugin = ["ruamel.yaml"]

And again, this weird behavior leads to real problems. If you declare a dependency as optional, but forget to add it to some extra, Poetry will just silently treat it as a required dependency. And this is really easy to miss, unless you actually look at the generated wheel metadata. A bug about the confusing handling of optional dependencies was filed back in 2020.

Summary

These are the handful of common issues I’ve repeatedly seen happening when people tried to use poetry-core as a build system. Sure, other PEP 517 backends aren’t perfect and have their own issues. For one, setuptools pretty much consists of tons of legacy, buggy code, deprecated bits everyone uses anyway, and is barely kept alive these days. People also fall into pitfalls there.

However, I have never seen any other Python or non-Python build system that would be as counterintuitive and mistake-prone as Poetry is. On top of that, implementing PEP 621 (the standard for pyproject.toml pretty much every other PEP 517 backend follows) took 3 years — and even today, Poetry still defaults to their own, nonstandard configuration format.

Whenever I criticize Poetry, people ask me about the alternatives. For completeness, let me repeat my PEP 517 backend recommendations here:

For pure Python packages: use either flit-core (lightweight, simple, no dependencies), or hatchling (popular and quite powerful, and we have to deal with its disadvantages anyway).
For Python packages with C extensions: meson-python combines the power and correctness of Meson with good Python integration.
For Python packages with Rust extensions: Maturin is the way to go.

November 10 2024

The peculiar world of Gentoo package testing

Michał Górny (mgorny) • November 10, 2024, 14:33

While discussing uv tests with Fedora developers, it occurred to me how different your average Gentoo testing environment is, not only from those used upstream, but also from those used by other Linux distributions. This article is dedicated exactly to that: pointing out how it’s different, what that implies and why I think it’s not a bad thing.

Gentoo as a source-first distro

The first important thing about Gentoo is that it is a source-first distribution. The best way to explain this is to compare it with your average “binary” distribution.

In a “binary” distribution, source and binary packages are somewhat isolated from one another. Developers work with source packages (recipes, specs) and use them to build binary packages, either directly or via automation. Then the binary packages hit repositories. The end users usually do not interface with sources at all, and may well not even be aware that such a thing exists.

In Gentoo, on the other hand, source packages are first-class citizens. All users use source repositories, and can optionally use local or remote binary package repositories. I think the best way of thinking about binary packages is as a form of “cache”.

If the package manager is configured to use binary packages, it attempts to find a package that matches the build parameters — the package version, USE flags, dependencies. If it finds a match, it can use it. If it doesn’t, it just proceeds with building from source. If configured to do so, it may write a binary package as a side effect of that — almost literally cache it. It can also be set to create a binary package without installing it (pre-fill the “cache”). It should hardly surprise anyone at this point that the default local binary packages repository is under the /var/cache tree.
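
The “cache” analogy can be sketched in Python as follows. This is a toy model and not how Portage is implemented: install_package, the dictionary-based package description and the build callback are all made up for illustration, and the real matching logic considers more than the three parameters named above.

```python
def install_package(pkg, binpkg_cache, build_from_source, write_binpkg=False):
    """Install pkg, preferring a matching binary package when one exists."""
    # The cache key mirrors the build parameters: version, USE flags, dependencies.
    key = (pkg["name"], pkg["version"],
           frozenset(pkg["use"]), frozenset(pkg["deps"]))
    if key in binpkg_cache:
        return "binary", binpkg_cache[key]
    image = build_from_source(pkg)
    if write_binpkg:
        binpkg_cache[key] = image  # written as a side effect of the build
    return "source", image


cache = {}
pkg = {"name": "app-misc/demo", "version": "1.0",
       "use": {"ssl"}, "deps": {"dev-libs/openssl"}}
build = lambda p: f"image of {p['name']}-{p['version']}"

first = install_package(pkg, cache, build, write_binpkg=True)
second = install_package(pkg, cache, build)
other = install_package({**pkg, "use": {"ssl", "zstd"}}, cache, build)
print(first[0], second[0], other[0])
```

The first install builds from source and fills the cache, the second one hits it, and the third one misses again because the USE flags differ.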

A side implication of this is that the binary packages provided by Gentoo are a subset of all packages available — and on top of that, only a small number of viable package configurations are covered by the official packages.

The build phases

The source build in Gentoo is split into a few phases. The central phases that are of interest here are largely inspired by how autotools-based packages were built. These are:

src_configure — meant to pass input parameters to the build system, and get it to perform necessary platform checks. Usually involves invoking a configure script, or an equivalent action of a build system such as CMake, Meson or another.
src_compile — meant to execute the bulk of compilation, and leave the artifacts in the build tree. Usually involves invoking a builder such as make or ninja.
src_test — meant to run the test suite, if the user wishes testing to be done. Usually involves invoking the check or test target.
src_install — meant to install the artifacts and other files from the work directory into a staging directory (not the live system). The files can be afterwards transferred to the live system and/or packed into a binary package. Usually involves invoking the install target.

Clearly, it’s very similar to how you’d compile and install software yourself: configure, build, optionally test before installing, and then install.

Of course, this process is not really one-size-fits-all. For example, the modern Python packages no longer even try fitting into it. Instead, we build the wheel in the PEP 517 blackbox manner, and install it to a temporary directory straight in the compile phase. As a result, the test phase is run with a locally-installed package (relying on the logic from virtual environments), and the install phase merely moves files around for the package manager to pick them up.

The implications for testing

The key takeaways of the process are these:

The test phase is run inside the working tree, against the package that was just built but not installed into the live system.
All the package’s build-time dependencies should be installed into the live system.
However, the system may contain any other packages, including packages that could affect the just-built package or its test suite in unpredictable ways.
As a corollary, the live system may or may not contain a copy of the package in question already installed. And if it does, it may be a different version, and/or a different build configuration.

All of these mean trouble. Sometimes random packages will cause the tests to fail as false positives, and sometimes they will also make them wrongly pass or get ignored. Sometimes packages already installed will prevent developers from seeing that they’ve missed some dependency. Often mismatches between installed packages will make reproducing issues hard. On top of that, sometimes an earlier installed copy of the package will leak into the test environment, causing confusing problems.
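
For Python packages, the last problem can be detected by checking where a module was actually imported from. The sketch below is a minimal illustration: imported_from is a hypothetical helper, and the “build tree” is a temporary directory holding a made-up module called demo_just_built_pkg.

```python
import importlib
import pathlib
import sys
import tempfile


def imported_from(module_name: str, root: str) -> bool:
    """Return True if the module was loaded from somewhere below root."""
    module = importlib.import_module(module_name)
    origin = pathlib.Path(module.__file__).resolve()
    return pathlib.Path(root).resolve() in origin.parents


with tempfile.TemporaryDirectory() as build_root:
    # Stand-in for a package that was just built into the work directory.
    pathlib.Path(build_root, "demo_just_built_pkg.py").write_text("VALUE = 1\n")
    sys.path.insert(0, build_root)
    importlib.invalidate_caches()
    try:
        from_build_tree = imported_from("demo_just_built_pkg", build_root)
        stdlib_from_build_tree = imported_from("json", build_root)
    finally:
        sys.path.remove(build_root)

print(from_build_tree, stdlib_from_build_tree)
```

A test suite could make a similar assertion at startup to fail early when a system-wide copy shadows the one under test.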

If there are so many negatives, why do we do it then? Because there is also a very important positive: the packages are being tested as close to the production environment as possible (short of actually installing them, but we want to test before that happens). The presence of a certain package may cause tests to fail as a false positive, but it may also uncover an actual runtime issue, one that would not otherwise be caught until it actually broke production. And I’m not talking theory here. While I don’t have any links handy right now, over and over again we have been hitting real issues: either those that hadn’t been caught by upstream CI setups yet, or those that simply couldn’t have been caught in an idealized test environment.

So yeah, testing stuff this way may be quite a pain, and a source of huge frustration with the constant stream of false positives. But it’s also an important strength that no idealized — not to say “lazy” — test environment can bring. Add to that the fact that a fair number of Gentoo users are actually installing their packages with tests enabled, and you get testing on a huge variety of systems, with different architectures, dependency versions and USE flags, configuration files… and on top of that, a knack for hacking. Yeah, people hate us for finding all these bugs they’d rather not hear about.

November 09 2024

Ready-to-boot, fresh & experimental Gentoo QCOW2 disk images

dilfridge (dilfridge ) • November 09, 2024, 0:46

Recently I've been experimenting with Catalyst, the tool that generates stages and ISO files for Gentoo's Release Engineering team. The first, still very experimental result is now available for download - a bootable hard disk image in QEMU's qcow2 format that immediately drops you into a fully working Gentoo environment.

Feel free to download it and try it out, either this first upload or any future weekly build from the amd64 release file directories. The files are not linked on the www.gentoo.org webserver since I consider them not really finished yet, but instead experimental and under development. You can use a QEMU command line such as, for example:

qemu-system-x86_64 \
-m 8G -smbios type=0,uefi=on -bios /usr/share/edk2-ovmf/OVMF_CODE.fd \
-smp 4 -cpu host -accel kvm -vga virtio -drive file=di.qcow2 &

where the last "file" argument specifies the file that you downloaded, for testing.

The current download initially does not start any network login services such as sshd, but has an empty root password for logging in on the console - this is why I call it a "console" type disk image. Future variants I'm planning include for example a "cloud-init" type, which sets up log-in credentials and further configuration as supplied by a cloud provider.

Cheers and enjoy!

October 23 2024

DTrace 2.0 for Gentoo

GentooNews (https://www.gentoo.org/feeds/news.xml ) • October 23, 2024, 5:00

The real, mythical DTrace comes to Gentoo! Need to dynamically trace your kernel or userspace programs, with rainbows, ponies, and unicorns - and all entirely safely and in production?! Gentoo is now ready for that! Just emerge dev-debug/dtrace and you’re all set. All required kernel options are already enabled in the newest stable Gentoo distribution kernel; if you are compiling manually, the DTrace ebuild will inform you about required configuration changes. Internally, DTrace 2.0 for Linux builds on the BPF engine of the Linux kernel, so don’t be surprised if the awesome cross-compilation features of Gentoo are used to install a gcc that outputs BPF code (which, btw, also comes in very handy for sys-apps/systemd).

Documentation? Sure, there’s lots of it. You can start with our DTrace wiki page, the DTrace for Linux page on GitHub, or the original documentation for Illumos. Enjoy!

October 07 2024

Arm Ltd. provides fast Ampere Altra Max server for Gentoo

GentooNews (https://www.gentoo.org/feeds/news.xml ) • October 07, 2024, 5:00

We’re very happy to announce that Arm Ltd. and specifically its Works on Arm team has sent us a fast Ampere Altra Max server to support Gentoo development. With 96 Armv8.2+ 64bit cores, 256 GByte of RAM, and 4 TByte NVMe storage, it is now hosted together with some of our other hardware at OSU Open Source Lab. The machine will be a clear boost to our future arm64 (aarch64) and arm (32bit) support, via installation stage builds and binary packages, architecture testing of Gentoo packages, as well as our close work with upstream projects such as GCC and glibc. Thank you!

October 04 2024

Testing the safe time64 transition path

Michał Górny (mgorny) • October 04, 2024, 13:54

Recently I’ve been elaborating on the perils of transition to 64-bit time_t, following the debate within Gentoo. Within these deliberations, I have also envisioned potential solutions to ensure that production systems could be migrated safely.

My initial ideas involved treating time64 as a completely new ABI, with a new libdir and forced incompatibility between binaries. This ambitious plan faced two disadvantages. Firstly, it required major modification to various toolchains, and secondly, it raised compatibility concerns between Gentoo (and other distributions that followed this plan) and distributions that switched before or were going to switch without making similar changes. Effectively, it would not only require a lot of effort from us, but also a lot of convincing other people, many of whom probably don’t want to spend any more time on doing extra work for 32-bit architectures. This made me consider alternative ideas.

One of them was to limit the changes to the transition period: use a temporary libt32 library directory to prevent existing programs from breaking while rebuilds were performed, and then simply remove it, and be left with plain lib like other distributions that switched already. In this post, I’d like to elaborate on how I went about testing the feasibility of this solution. Please note that this is not a migration guide: it includes steps that are meant to detect problems with the approach, and are not suitable for production systems.

Preparing to catch time32/time64 mixing

As I’ve explained before, the biggest risk during the transition is accidental mixing of time32 and time64 binaries. In the worst case, it could mean not only breaking programs running on production, but actively creating vulnerabilities via out-of-bounds accesses. Therefore, I believe it is crucial to ensure that no such thing happens throughout the migration.

My first step towards testing the migration process was to create an ABI mixing check that would be injected into executables. I’ve placed the following code into /usr/include/__gentoo_time.h:

#include <stdio.h>
#include <stdlib.h>

__attribute__((weak))
__attribute__((visibility("default")))
struct {
	int time32;
	int time64;
} __gentoo_time_bits;

__attribute__((constructor))
static void __gentoo_time_check() {
#if _TIME_BITS == 64
#error "not now"
	__gentoo_time_bits.time64 = 1;
#else
	__gentoo_time_bits.time32 = 1;
#endif

	if (__gentoo_time_bits.time32 && __gentoo_time_bits.time64) {
		FILE *f;
		fprintf(stderr, "time32 and time64 ABI mixing detected\n");
		/* trigger a sandbox failure for good measure too */
		f = fopen("/time32-time64-mixing", "w");
		if (f)
			fclose(f);
		abort();
	}
}

Then, I have added the following line to /usr/include/time.h, just above __BEGIN_DECLS:

#include <__gentoo_time.h>

Now, this meant that any binary including <time.h>, even indirectly, would get our check. In fact, the check would probably be duplicated a lot, but that’s not really a problem for the test system.

The check itself utilizes a bit of magic. It creates a weak __gentoo_time_bits structure that would be shared between the executable itself and all loaded libraries. Every binary would run the constructor function upon loading, and it would first store its own _TIME_BITS value within the shared structure, and then ensure that no binary set the other value. If that did happen, it would not only cause the program to immediately abort, but also try to trigger a sandbox failure, so the package build would be considered failed even if the build system ignored that particular failure.

However, note the #error in the snippet. This is a temporary hack to block packages that automatically try to use -D_TIME_BITS=64 (e.g. coreutils, grep, man-db), as they would trigger the check prematurely, and as a false positive.

At this point, I did rebuild the whole system, except for glibc, to inject the check into as many time32 binaries as possible:

emerge -ve --exclude=sys-libs/glibc --keep-going=y --jobs=16 @world

A number of packages fail here, because they attempt to force -D_TIME_BITS=64. This is okay, we don’t need perfect coverage, and we definitely don’t want false positives.

Preparing for the transition

The next step is to actually prepare for the transition. The preparation involves two changes, to all packages except for sys-libs/glibc:

Moving all libraries from lib to libt32.
Injecting libt32 directories into RUNPATH of all binaries, executables and libraries alike.

This is done using a tool called time32-prep. It takes care of finding all potential libdirs from ld.so, setting RUNPATH on binaries (and removing any references to plain lib, while at it), and then moving the libraries.
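
To illustrate the RUNPATH part, here is a rough Python sketch of the kind of string rewrite involved. It is an assumption-laden illustration and not the actual logic of time32-prep: the t32_runpath helper, the libdir list and the exact rewrite rule (replace every path component named lib with libt32, then append the libt32 variants of the system libdirs) are made up for this example.

```python
def t32_runpath(old_runpath: str, libdirs: list[str]) -> str:
    """Return a RUNPATH that points at libt32 instead of plain lib."""
    def to_t32(path: str) -> str:
        # Rewrite only path components that are exactly "lib".
        return "/".join("libt32" if part == "lib" else part
                        for part in path.split("/"))

    entries = [to_t32(p) for p in old_runpath.split(":") if p]
    entries += [to_t32(d) for d in libdirs]
    # Drop duplicates while keeping the original order.
    return ":".join(dict.fromkeys(entries))


# Hypothetical libdirs, standing in for what ld.so would report.
print(t32_runpath("/usr/lib/foo:$ORIGIN", ["/lib", "/usr/lib"]))
print(t32_runpath("", ["/lib", "/usr/lib"]))
```

Under these assumptions, a binary with no RUNPATH at all simply gets the libt32 directories, while existing references to plain lib are redirected.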

Rebuilding everything

The next step is to configure the system to compile time64 binaries by default. For a start, I have added the following snippet to make.conf, to easily distinguish packages that were rebuilt:

CHOST="i686-pc-linux-gnut64"
CHOST_x86="i686-pc-linux-gnut64"

I’ve rebuilt the dependencies of GCC using time64 flags explicitly:

CFLAGS="-D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64" emerge -1v sys-apps/sandbox dev-libs/{gmp,mpfr,mpc} sys-libs/zlib app-arch/{xz-utils,zstd}

Rebuilt and switched binutils:

emerge -1v sys-devel/binutils
binutils-config 1

Then, I’ve added a user patch to make GCC default to time64:

--- a/gcc/c-family/c-cppbuiltin.cc
+++ b/gcc/c-family/c-cppbuiltin.cc
@@ -1560,6 +1560,9 @@ c_cpp_builtins (cpp_reader *pfile)
     builtin_define_with_int_value ("_FORTIFY_SOURCE", GENTOO_FORTIFY_SOURCE_LEVEL);
 #endif
 
+  cpp_define (pfile, "_FILE_OFFSET_BITS=64");
+  cpp_define (pfile, "_TIME_BITS=64");
+
   /* Misc.  */
   if (flag_gnu89_inline)
     cpp_define (pfile, "__GNUC_GNU_INLINE__");

And rebuilt GCC itself (without time64 flags):

USE=-sanitize emerge -v sys-devel/gcc
gcc-config 1

Note that I had to disable sanitizers, as they currently fail to build with _TIME_BITS=64. I also had to comment out the __gentoo_time.h include for the time of building GCC.

The final step was to rebuild all packages (except for GCC and glibc) with the new compiler:

emerge -ve --exclude=sys-libs/glibc --exclude=sys-devel/{binutils,gcc} --jobs=16 --keep-going=y @world

The results

Well, I have some bad news — at some point, the rebuilds started failing. However, it seems that all failures I’ve hit during the initial testing can be accounted for as something relatively harmless — Perl and Python extensions.

Long story short, since they are installed into a dedicated directory, they can’t be prevented from ABI mixing via the libt32 hack. However, that’s unlikely to be a real problem. They failed for me, because I’ve made ABI mixing absolutely fatal — but in reality only private parts of the Python API use time_t, and these should not be used by any third-party extensions. And in the end, the issues are resolved by rebuilding in a different order.

Next steps

While this could be considered an important success, we’re still a long way from being ready to go full time64. The time32-prep tool itself has a few TODOs, and definitely needs testing on a more “production-like” system. Then, there are actual problems that the packages are facing on time64 setups (like the GCC build failure in sanitizers), and that need to be fixed before we make things official.

September 28 2024

The perils of transition to 64-bit time_t

Michał Górny (mgorny) • September 28, 2024, 15:44

(please note that there’s a correction at the bottom)

In the Overview of cross-architecture portability problems, I have dedicated a section to the problems resulting from the use of a 32-bit time_t type. This design decision, still affecting Gentoo systems using glibc, means that 32-bit applications will suddenly start failing in horrible ways in 2038: they will be getting a -1 error instead of the current time, they won’t be able to stat() files. In short: complete mayhem will emerge.
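
The 2038 date follows directly from the range of a signed 32-bit integer counting seconds since the Unix epoch. A short Python calculation shows the last representable moment; fits_time32 is a helper name made up for this example.

```python
from datetime import datetime, timedelta, timezone

INT32_MAX = 2**31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fits_time32(moment: datetime) -> bool:
    """Check whether a timestamp is representable as a signed 32-bit time_t."""
    seconds = int((moment - EPOCH).total_seconds())
    return -2**31 <= seconds <= INT32_MAX


# The last second a 32-bit time_t can represent.
last_moment = EPOCH + timedelta(seconds=INT32_MAX)
print(last_moment.isoformat())
print(fits_time32(datetime(2040, 1, 1, tzinfo=timezone.utc)))
```

The first line printed is 2038-01-19T03:14:07+00:00; any later timestamp, such as one in 2040, no longer fits.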

There is a general agreement that the way forward is to change time_t to a 64-bit type. Musl has already switched to that, glibc supports it as an option. A number of other distributions such as Debian have taken the leap and switched. Unfortunately, source-based distributions such as Gentoo don’t have it that easy. So we are still debating the issue and experimenting, trying to figure out a maximally safe upgrade path for our users.

Unfortunately, that’s nowhere near trivial. Above all, we are talking about a breaking ABI change. It’s all-or-nothing. If a library uses time_t in its API, everything linking to it needs to use the same type width. In this post, I’d like to explore the issue in detail — why is it so bad, and what we can do to make it safer.

Going back to Large File Support

Before we get into the time64 change, as I’m going to call it for short, we need to go back in history a bit and consider another similar problem: Large File Support.

Long story short, originally 32-bit architectures specified two important file-related types that were 32 bits wide: off_t used to specify file offsets (signed to support relative offsets) and ino_t used to specify inode numbers. This had two implications: you couldn’t open files larger than 2 GiB, and you couldn’t open files whose inode numbers exceeded the 32-bit unsigned integer range.

To resolve this problem, Large File Support was introduced. It involved replacing these two types with 64-bit variants, and on glibc it is still optional today. In its case, we didn’t take the leap and transition globally. Instead, packages generally started enabling LFS support upstream, also taking care to resolve any ABI breakage in the process. While many packages did that, we shouldn’t consider the problem solved.

The important point here is that time64 support in glibc requires LFS to be used. This makes sense — if we are going to break stuff, we may as well solve both problems.

What ABIs are we talking about?

To put it simply, we have three possible sub-ABIs here:

the original ABI with 32-bit types,
LFS: 64-bit off_t and ino_t, 32-bit time_t,
time64: LFS + 64-bit time_t.

What’s important here is that a single glibc build remains compatible with all three variants. However, libraries that use these types in their API are not.
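As a rough illustration of the three variants, the sketch below classifies a sub-ABI from the sizes of off_t and time_t. The enum and the function name are invented for this example. The combination of 64-bit time_t with 32-bit off_t is treated as invalid, since time64 requires LFS.

```c
#include <stddef.h>

enum sub_abi {
    SUB_ABI_ORIGINAL,  /* 32-bit off_t, 32-bit time_t */
    SUB_ABI_LFS,       /* 64-bit off_t, 32-bit time_t */
    SUB_ABI_TIME64,    /* 64-bit off_t, 64-bit time_t */
    SUB_ABI_INVALID    /* anything else, e.g. time64 without LFS */
};

/* Classify a 32-bit glibc sub-ABI from the sizes (in bytes) of off_t
 * and time_t, as seen by the code being compiled. */
enum sub_abi classify_sub_abi(size_t off_t_size, size_t time_t_size)
{
    if (off_t_size == 4 && time_t_size == 4)
        return SUB_ABI_ORIGINAL;
    if (off_t_size == 8 && time_t_size == 4)
        return SUB_ABI_LFS;
    if (off_t_size == 8 && time_t_size == 8)
        return SUB_ABI_TIME64;
    return SUB_ABI_INVALID;
}
```

In a real program one would call it as classify_sub_abi(sizeof(off_t), sizeof(time_t)) and build with -m32, with or without -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64. On a 64-bit target both types are 8 bytes wide regardless of these macros.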

Today, 32-bit systems roughly use a mix of the first and second ABI — the latter including packages that enabled LFS explicitly. For the future, our goal is to focus on the third option. We are not concerned about providing full-LFS systems with 32-bit time_t.

Why is the ABI change so bad?

Now, the big deal is that we are replacing a 32-bit type with a 64-bit type, in place. Unlike with LFS, glibc does not provide any transitional API that could be used to enable new functions while preserving backwards compatibility — it’s all-or-nothing.

Let’s consider structures. If a structure contains time_t with its natural 32-bit alignment, then there’s no padding for the type to extend into. Inevitably, all the fields that follow will have to shift to make room for the new type. Let’s consider a trivial example:

struct {
    int a;
    time_t b;
    int c;
};

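With 32-bit time_t, the offset of c is 8. With the 64-bit type, it’s 12 on x86 (where 64-bit integers inside structures are only aligned to 4 bytes), or 16 on ABIs that align them to 8 bytes, such as ARM. If you mix binaries using different time_t widths, they are inevitably going to read or write the wrong fields! Or perhaps even read or write out of bounds!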
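The offset arithmetic can be reproduced on any host with a small layout calculator. This is only a sketch: it assumes a 4-byte, 4-byte-aligned int, and it takes the size and alignment of time_t as parameters, because the alignment of 64-bit integers inside structures differs between ABIs (4 bytes on x86, 8 bytes on most others). The function names are invented for the example.

```c
#include <stddef.h>

/* Round offset up to the next multiple of align.
 * align must be a power of two. */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Offset of field c in
 *     struct { int a; time_t b; int c; };
 * assuming a 4-byte int and the given size and alignment of time_t. */
size_t offset_of_c(size_t time_size, size_t time_align)
{
    size_t offset = 4;                       /* a occupies bytes 0..3 */

    offset = align_up(offset, time_align);   /* padding before b, if any */
    offset += time_size;                     /* b itself */
    return align_up(offset, 4);              /* c is 4-byte aligned */
}
```

offset_of_c(4, 4) gives 8 for the time32 layout, while offset_of_c(8, 4) and offset_of_c(8, 8) give 12 and 16 respectively for time64. Either way, c is no longer where a time32 binary expects it.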

Let’s just look at the size of struct stat, as an example of a structure that uses both file and time-related types. On plain 32-bit x86 glibc it’s 88 bytes long. With LFS, it’s 96 bytes long (size and inode number fields are expanded). With LFS + time64, it’s 108 bytes long (three timestamps are expanded).

However, you don’t even need to use structures. After all, we are talking about x86, where function parameters are passed on the stack. If one of the parameters is time_t, then the positions of all parameters on the stack change, and we find ourselves seeing the exact same problem! Consider the following prototype:

extern void foo(int a, time_t b, int c);

Let’s say we’re calling it as foo(1, 2, 3). With 32-bit types, the call looks like the following:

	pushl	$3
	pushl	$2
	pushl	$1
	call	foo@PLT

However, with 64-bit time_t, it changes to:

	pushl	$3
	pushl	$0
	pushl	$2
	pushl	$1
	call	foo@PLT

An additional 32-bit value (zero) is pushed between the “old” b and c. Once again, if we mix both kinds of binaries, they are going to fail to read the parameters correctly!
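The mismatch can be simulated without any assembly. The sketch below models the stack as an array of 32-bit words: one helper lays out the arguments the way the time64 caller above does, and the other reads them back the way a time32 callee would. All names are invented for the example, and it is a model of the x86 convention rather than real calling code.

```c
#include <stdint.h>

/* Arguments of foo() as seen by a time32 callee. */
struct foo_args {
    int32_t a;
    int32_t b;
    int32_t c;
};

/* Lay out foo(a, b, c) as 32-bit stack words the way a time64 caller
 * does: a, low half of b, high half of b, c (lowest address first). */
void push_args_time64(uint32_t stack[4], int32_t a, int64_t b, int32_t c)
{
    stack[0] = (uint32_t) a;
    stack[1] = (uint32_t) ((uint64_t) b & 0xffffffffu);
    stack[2] = (uint32_t) ((uint64_t) b >> 32);
    stack[3] = (uint32_t) c;
}

/* Read the arguments back the way a time32 callee would: it expects
 * three consecutive 32-bit words and knows nothing about the
 * high half of b. */
struct foo_args read_args_time32(const uint32_t *stack)
{
    struct foo_args args;

    args.a = (int32_t) stack[0];
    args.b = (int32_t) stack[1];
    args.c = (int32_t) stack[2];   /* actually the high half of b */
    return args;
}
```

For foo(1, 2, 3), the time32 callee still sees a = 1 and b = 2, but c comes out as 0 instead of 3.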

So yeah, it’s a big deal. And right now, there are no real protections in place to prevent mixing these ABIs. So what you may actually get is runtime breakage, potentially going as far as to create security issues.

You don’t have to take my word for it. You can reproduce it yourself on x86/amd64 easily enough. Let’s take the more likely case of a time32 program linked against a library that has been rebuilt for time64:

$ cat >libfoo.c <<EOF
#include <stdio.h>
#include <time.h>

void foo(int a, time_t b, int *c) {
   printf("a = %d\n", a);
   printf("b = %lld\n", (long long) b);
   printf("%s", ctime(&b));
   printf("c = %d\n", *c);
}
EOF
$ cat >foo.c <<EOF
#include <stddef.h>
#include <time.h>

extern void foo(int a, time_t b, int *c);

int main() {
    int three = 3;
    foo(1, time(NULL), &three);
    return 0;
}
EOF
$ cc -m32 libfoo.c -shared -o libfoo.so
$ cc -m32 foo.c -o foo -Wl,-rpath,. libfoo.so
$ ./foo
a = 1
b = 1727154919
Tue Sep 24 07:15:19 2024
c = 3
$ cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 \
  libfoo.c -shared -o libfoo.so
$ ./foo 
a = 1
b = -34556652301432063
Thu Jul 20 06:16:17 -1095054749
c = 771539841

On top of that, the source-first nature of Gentoo amplifies these problems. An average binary distribution rebuilds all binary packages — and then the user upgrades the system in a single, relatively atomic step. Sure, if someone uses third-party repositories or has locally built programs that link to system libraries, problems can emerge but the process is relatively safe.

On the other hand, in Gentoo we are talking about rebuilding @world while breaking ABI in place. For a start, we are talking about prolonged periods of time between two packages being rebuilt, during which they would actually be mixing incompatible ABIs. Then, there is a fair risk that some rebuild will fail and leave your system half-transitioned with no easy way out. Then, there is a real risk that cyclic dependencies will actually make the rebuild impossible: rebuilding a dependency will break build-time tools, preventing stuff from being rebuilt. It’s a true horror.

What can we do to make it safer?

Our deliberations currently revolve around three ideas that are semi-related, though not necessarily dependent on one another:

Changing the platform tuple (CHOST) for the new ABIs, to clearly distinguish them from the baseline 32-bit ABI.
Changing the libdir for the new ABIs, effectively permitting the rebuilt libraries to be installed independently of the original versions.
Introducing a binary-level ABI distinction that could prevent binaries using different sub-ABIs from being linked to one another.

The subsequent sections will focus on each of these changes in detail. Note that all the values used there are just examples, and not necessarily the strings used in a final solution.

The platform tuple change

The platform tuple (generally referenced through the CHOST variable) identifies the platform targeted by the toolchain. For example, it is used as a part of GCC/binutils install paths, effectively allowing toolchains for multiple targets to be installed simultaneously. In clang, it can be used to switch between supported cross-compilation targets, and can control the defaults to match the specified ABI. In Gentoo, it is also used to uniquely identify ABIs for the purpose of multilib support. Because of that, we require that no two co-installable ABIs share the same tuple.

A tuple consists of four parts, separated by hyphens: architecture, vendor, operating system and libc. Of these, vendor is generally freeform but the other three are restricted to some degree. A few semi-equivalent examples of tuples used for 32-bit x86 platform include:

i386-pc-linux-gnu
i686-pc-linux-gnu
i686-unknown-linux-gnu

Historically, two approaches were used to introduce new ABIs. Either the vendor field was changed, or an additional ABI specification was appended to the libc field. For example, Gentoo historically used two different kinds of tuples for ARM ABIs with hardware floating-point unit:

armv7a-hardfloat-linux-gnueabi
armv7a-unknown-linux-gnueabihf

The former approach was used earlier, to avoid incompatibility problems resulting from altering other tuple fields. However, as these were fixed and upstreams normalized on the latter solution, Gentoo followed suit.

Similarly, the discussion of time64 ABIs resurfaced the same dilemma: should we just “abuse” the vendor field for this, or instead change libc field and fix packages? The main difference is that the former is “cleaner” as a downstream solution limited to Gentoo, while the latter generally opens up discussions about interoperability. Therefore, the options look like:

i686-gentoo_t64-linux-gnu
i686-pc-linux-gnut64
armv7a-gentoo_t64-linux-gnueabihf
armv7a-unknown-linux-gnueabihft64

Fortunately, changing the tuple should not require much patching. The GNU toolchain and GNU build system both ignore everything following “gnu” in the libc field. Clang will require patching — but upstream is likely to accept our patches, and we will want to make patches anyway, as they will permit clang to automatically choose the right ABI based on the tuple.
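To illustrate how either convention could be recognized, here is a minimal sketch that inspects a four-part tuple and reports whether it names a time64 ABI. It relies on the example strings above ("_t64" at the end of the vendor field, or "t64" at the end of the libc field), which are not final values, and the function names are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Returns nonzero if the first len bytes of s end with suffix. */
static int ends_with(const char *s, size_t len, const char *suffix)
{
    size_t slen = strlen(suffix);

    return len >= slen && memcmp(s + len - slen, suffix, slen) == 0;
}

/* Returns nonzero if tuple (arch-vendor-os-libc) marks a time64 ABI
 * using either of the two example conventions:
 *   - vendor field ending in "_t64" (e.g. i686-gentoo_t64-linux-gnu),
 *   - libc field ending in "t64"    (e.g. i686-pc-linux-gnut64).
 * Both suffixes are illustrative values taken from the examples. */
int tuple_is_time64(const char *tuple)
{
    const char *vendor, *vendor_end, *libc;

    vendor = strchr(tuple, '-');
    if (vendor == NULL)
        return 0;
    vendor++;                              /* skip the hyphen */
    vendor_end = strchr(vendor, '-');
    if (vendor_end == NULL)
        return 0;
    libc = strrchr(tuple, '-') + 1;        /* text after the last hyphen */

    if (ends_with(vendor, (size_t) (vendor_end - vendor), "_t64"))
        return 1;
    return ends_with(libc, strlen(libc), "t64");
}
```

A helper along these lines is the kind of logic a compiler driver would need in order to pick the right defaults from the tuple alone.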

The libdir change

The term “libdir” refers to the base name of the library install directory. Having different libdirs, and therefore separate library install directories, makes it possible to build multilib systems, i.e. installing multiple ABI variations of libraries on a single system, and making it possible to run executables for different ABIs. For example, this is what makes it possible to run 32-bit x86 executables on amd64 systems.

The libdir values are generally specified in the ABI. Naturally, the baseline value is plain lib. As a historical convention (since 32-bit architectures were first), usually 32-bit platforms (arm, ppc, x86) use lib, whereas their more modern 64-bit counterparts (amd64, arm64, ppc64) use lib64 — even if a particular architecture never really supported multilib on Gentoo.

Architectures that support multiple ABIs also define different libdirs. For example, the additional x32 ABI on x86 uses libx32. MIPS n32 ABI uses lib32 (with plain lib defining the o32 ABI).

Now, we are considering changing the libdir value for time64 variants of 32-bit ABIs, for example from lib to libt64. This would make it possible to install the rebuilt libraries separately from the old libraries, effectively bringing three advantages:

reducing the risk of time64 executables accidentally linking to time32 libraries,
enabling Portage’s preserved-libs feature to preserve time32 libraries once the respective packages have been rebuilt for time64, and before their reverse dependencies have been rebuilt,
optionally, making it possible to use time32 + time64 multilib profiles, that could be used to preserve compatibility with prebuilt time32 applications linking to system libraries.

In my opinion, the second point is a killer feature. As I’ve mentioned before, we are talking about the kind of migration that would break executables for a prolonged time on production systems, and possibly break build-time tools, preventing the rebuild from proceeding further. By preserving original libraries, we are minimizing the risk of actual breakage, since the existing executables will keep using the time32 libraries until they are rebuilt and linked to the time64 libraries.

The libdir change is definitely going to require some toolchain patching. We may want to also consider special-casing glibc, as the same set of glibc libraries is valid for all of the sub-ABIs we were considering. However, we will probably want a separate ld.so executable, as it would need to load libraries from the correct libdir, and then we will want to set .interp in time64 executables to reference the time64 ld.so.

Note that due to how multilib is designed in Gentoo, proper multilib support for this (i.e. the third point) requires a unique platform tuple for the ABI as well, so that specific aspect is dependent on the tuple change.

Ensuring binary incompatibility

In general, you can’t mix binaries using different ABIs. For example, if you try to link a 64-bit program to a 32-bit library, the linker will object:

$ cc foo.c libfoo.so 
/usr/lib/gcc/x86_64-pc-linux-gnu/14/../../../../x86_64-pc-linux-gnu/bin/ld: libfoo.so: error adding symbols: file in wrong format
collect2: error: ld returned 1 exit status

Similarly, the dynamic loader will refuse to use a 32-bit library with a 64-bit program:

$ ./foo 
./foo: error while loading shared libraries: libfoo.so: wrong ELF class: ELFCLASS32

There are a few mechanisms that are used for this. As demonstrated above, architectures with 32-bit and 64-bit ABIs use two distinct ELF classes (ELFCLASS32 and ELFCLASS64). Additionally, some architectures use different machine identifiers (EM_386 vs. EM_X86_64, EM_PPC vs. EM_PPC64). The x32 ABI on x86 “abuses” this by declaring its binaries as ELFCLASS32 + EM_X86_64 (and therefore distinct from ELFCLASS32 + EM_386 and from ELFCLASS64 + EM_X86_64).
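For reference, both identifiers sit at fixed positions at the start of every ELF file: the class is byte EI_CLASS of e_ident, and e_machine is a 16-bit field at offset 18. The sketch below reads them from a raw header buffer and compares two binaries. It uses the constants from <elf.h>, while the struct and function names are invented for the example. It is a simplification, as real linkers and loaders check more than these two fields.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

struct elf_ident {
    unsigned char elf_class;   /* ELFCLASS32 or ELFCLASS64 */
    uint16_t machine;          /* EM_386, EM_X86_64, ... */
};

/* Extract the ELF class and machine from the first bytes of a file.
 * Returns 0 on success, -1 if the buffer does not start with an
 * ELF header. */
int read_elf_ident(const unsigned char *buf, size_t len,
                   struct elf_ident *out)
{
    if (len < 20)
        return -1;
    if (buf[EI_MAG0] != ELFMAG0 || buf[EI_MAG1] != ELFMAG1 ||
        buf[EI_MAG2] != ELFMAG2 || buf[EI_MAG3] != ELFMAG3)
        return -1;

    out->elf_class = buf[EI_CLASS];
    /* e_machine follows e_ident (16 bytes) and e_type (2 bytes);
     * its byte order is given by EI_DATA. */
    if (buf[EI_DATA] == ELFDATA2MSB)
        out->machine = (uint16_t) ((buf[18] << 8) | buf[19]);
    else
        out->machine = (uint16_t) (buf[18] | (buf[19] << 8));
    return 0;
}

/* Two binaries can only be mixed if both class and machine match. */
int elf_compatible(const struct elf_ident *a, const struct elf_ident *b)
{
    return a->elf_class == b->elf_class && a->machine == b->machine;
}
```

Note that time32 and time64 binaries for the same architecture are indistinguishable at this level, since both fields are identical.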

Both ARM and MIPS use the flags field (it is a bit-field with architecture-specific flags) to distinguish different ABIs (hardfloat vs. softfloat, n32 ABI on MIPS…). Additionally, both feature a dedicated attribute section — and again, the linker refuses to link incompatible object files.

It may be desirable to implement a similar mechanism for time32 and time64 systems. Unfortunately, it’s not a trivial task. It doesn’t seem that there is a reusable generic mechanism that could be used for that. On top of that, we need a solution that would fit a fair number of different architectures. It seems that the most reasonable solution right now would be to add a new ELF note section dedicated to this feature, and implement complete toolchain support for it.

However, whatever we decide to do, we need to take into consideration that the user may want to disable it. Particularly, there is a fair amount of prebuilt software that has no sources available, and it may continue working correctly against system libs, provided it does not call into any API using time_t. The cure of unconditionally preventing them from working might be worse than the disease.

On the bright side, it should be possible to create a non-fatal QA check for this without much hacking, provided that we go with separate libdirs. We can distinguish time64 executables by their .interp section, pointing to the dynamic loader in the appropriate libdir, and then verify that time32 programs will not load any libraries from libt64, and that time64 programs will not load any libraries directly from lib.
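The core of such a check could be as simple as comparing path components. The sketch below assumes the example libdir name libt64 and takes the interpreter path and a library path as plain strings. Extracting them from actual binaries, and any special-casing of glibc, is left out, and only paths of 32-bit libraries are expected as input. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Example libdir name used in this post; not a final value. */
#define TIME64_LIBDIR "libt64"

/* Returns nonzero if path contains a directory component that is
 * exactly equal to component (e.g. "/usr/libt64/libfoo.so"). */
static int path_has_component(const char *path, const char *component)
{
    size_t clen = strlen(component);
    const char *p = path;

    while ((p = strstr(p, component)) != NULL) {
        int starts = (p == path) || (p[-1] == '/');
        int ends = (p[clen] == '/');

        if (starts && ends)
            return 1;
        p++;    /* keep searching after this match */
    }
    return 0;
}

/* Returns 1 if path lives in the time64 libdir, 0 otherwise. */
int is_time64_path(const char *path)
{
    return path_has_component(path, TIME64_LIBDIR);
}

/* Returns nonzero if a program using the given interpreter (.interp)
 * would load a library from the libdir of the other sub-ABI. */
int mixes_sub_abis(const char *interp, const char *library)
{
    return is_time64_path(interp) != is_time64_path(library);
}
```

A QA check built on top of this would only warn rather than fail, in line with the non-fatal approach suggested above.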

What about old prebuilt applications?

So far we were concerned with packages that are built from source. However, there is still a fair number of old applications, usually proprietary, that are available only as prebuilt binaries, particularly for x86 and PowerPC architectures. These packages are going to face two problems: firstly, compatibility issues with system libraries, and secondly, the y2k38 problem itself.

For the compatibility problem, we have a reasonably good solution already. Since we already had to make them work on amd64, we have a multilib layout in place, along with necessary machinery to build multiple library versions. In fact, given that the primary purpose of multilib is compatibility with old software, it’s not even clear if there is much of a point in switching amd64 multilib to use time64 for 32-bit binaries. Either way, we can easily extend our multilib machinery to distinguish the regular abi_x86_32 target from abi_x86_t64 (and we probably should do that anyway), and then create new multilib x86 profiles that would support both ABIs.

The second part is much harder. Obviously, as soon as we’re past the 2038 cutoff date, all 32-bit programs — using system libraries or not — will simply start failing in horrible ways. One possibility is to work with faketime to control the system clock. Another is to run a whole VM that’s moved back in time.

Summary

As 2038 is approaching, 32-bit applications exercising 32-bit time_t are bound to stop working. At this point, it is pretty clear that the only way forward is to rebuild these applications with 64-bit time_t (and while at it, force LFS as well). Unfortunately, that’s not a trivial task since it involves an ABI change, and mixing time32 and time64 programs and libraries can lead to horrible runtime bugs.

While the exact details are still in the making, the proposed changes revolve around three ideas that can be implemented independently to some degree: changing the platform tuple (CHOST), changing libdir and preventing accidental mixing of time32 and time64 binaries.

The tuple change is mostly a more formal way of distinguishing builds for the regular time32 ABI (e.g. i686-pc-linux-gnu) from ones specifically targeting time64 (e.g. i686-pc-linux-gnut64). It should be relatively harmless and easy to carry out, with minimal amount of fixing necessary. For example, clang will need to be updated to accept new tuples.

The libdir change is probably the most important of all, as it permits a breakage-free transition, thanks to Portage’s preserved-libs feature. Long story short, time64 libraries get installed to a new libdir (e.g. libt64), and the original time32 libraries remain in lib until the applications using them are rebuilt. Unfortunately, it’s a bit harder to implement — it requires toolchain changes, and ensuring that all software correctly respects libdir. The extra difficulty is that with this change alone, the dynamic loader won’t ignore time32 libraries if e.g. -Wl,-rpath,/usr/lib is injected somewhere.

The incompatibility part is quite important, but also quite difficult. Ideally, we’d like to stop the linker from trying to accidentally link time32 libraries with time64 programs, and likewise the dynamic loader from trying to load them. Unfortunately, so far we weren’t able to come up with a realistic way of doing that, short of actually making some intrusive changes to the toolchain. On the positive side, writing a QA check to detect accidental mixing at build time shouldn’t be that hard.

Doing all three should enable us to provide a clean and relatively safe transition path for 32-bit Gentoo systems using glibc. However, these only solve problems for packages built from source. Prebuilt 32-bit applications, particularly proprietary software like old games, can’t be helped that way. And even if time64 changes won’t break them via breaking the ABI compatibility with system libraries, then year 2038 will. Unfortunately, there does not seem to be a good solution to that, short of actually running them with faked system time, one way or another.

Of course, all of this is still only a rough draft. A lot may still change, following experiments, discussion and patch submission.

Acknowledgements

I would like to thank the following people for proof-reading and suggestions, and for their overall work towards time64 support in Gentoo: Arsen Arsenović, Andreas K. Hüttel, Sam James and Alexander Monakov.

2024-09-30 correction

Unfortunately, my original ideas were too optimistic. I’ve entirely missed the fact that all libdirs are listed in ld.so.conf, and therefore we cannot rely on hardcoding the libdir path inside ld.so itself. In retrospect, I should have seen that coming — after all, we already adjust these paths for custom LLVM prefix, and that one would require special handling too.

This effectively means that the libdir change probably needs to depend on the binary incompatibility part. Overall, we need to meet three basic goals:

The dynamic loader needs to be able to distinguish time32 and time64 binaries. For time32 programs, it needs to load only time32 libraries; for time64 programs, it needs to load only time64 libraries. In both cases, we need to assume that both kinds of libraries will appear in the search path.
For backwards compatibility, we need to assume that all binaries that do not have an explicit time64 marking are time32.
Therefore, all newly built binaries must carry an explicit time64 marking. This includes binaries built by non-C environments, such as Rust, even if they do not interact with time_t ABI at all. Otherwise, these binaries would forever depend on time32 libraries.
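Expressed as code, the three goals boil down to a very small decision rule. The sketch below is only a model: no such marking exists in ELF today, and the enum and function names are invented.

```c
/* Hypothetical marking carried by a binary. */
enum time_marking {
    MARK_NONE,     /* no marking at all: every existing binary */
    MARK_TIME64    /* explicit time64 marking */
};

/* Second goal: anything without an explicit marking is time32. */
int is_time64_binary(enum time_marking marking)
{
    return marking == MARK_TIME64;
}

/* First goal: the loader may only pair a program with libraries
 * of the same kind. */
int loader_may_load(enum time_marking program, enum time_marking library)
{
    return is_time64_binary(program) == is_time64_binary(library);
}
```

The third goal follows directly: a newly built program that lacks the marking, for instance one produced by a toolchain that knows nothing about it, lands in the MARK_NONE case and can only ever be paired with time32 libraries.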

Meeting all these goals is a lot of effort. None of the hacks we debated so far seem sufficient to achieve that, so we are probably talking about the level of effort on par with patching multiple toolchains for a variety of programming languages. Naturally, this is not something we can carry locally in Gentoo, so it also requires cooperation from multiple parties. All that for architectures that are largely considered legacy, and sometimes not even really supported anymore.

Of course, another problem is whether these other toolchains are actually going to produce correct time64 executables. After all, unless they are specifically adapted to respect _TIME_BITS the way C programs do, they are probably going to hardcode specific time_t width, and break horribly when it changes. However, that’s really an upstream problem to solve, and tangential to the issues we are discussing here.

On top of that, we are talking about a major incompatibility. All binaries that aren’t explicitly marked as time64 are going to use time32 libraries, even if they use time64 ABI. Gentoo won’t be able to run third-party executables unless they are patched to carry the correct marking.

Perhaps a better solution is to set our aims lower. Rather than actually distinguishing time32 and time64 binaries, we could instead inject RPATH into all time64 executables, directly forcing the time64 libdir there. This definitely won’t prevent the dynamic loader from using time32 libraries, but it should help transition without causing major incompatibility concerns.

Alternatively, we could consider the problem the other way around. Rather than changing libdir permanently for time64 libraries, we could change it temporarily for time32 libraries. This would imply injecting RPATH into all existing programs and renaming the libdir. Newly built time64 libraries would be installed back into the old libdir, and newly built time64 programs would lack the RPATH forcing time32 libraries. A clear advantage of this solution is that it would remain entirely compatible with other distributions that have taken the leap already.

As you can see, the situation is developing rapidly. Every day is bringing new challenges, and new ideas on how to overcome them.

The perils of transition to 64-bit time_t

mgorny (mgorny) • September 28, 2024, 15:44

(please note that there’s a correction at the bottom)

In the Overview of cross-architecture portability problems, I have dedicated a section to the problems resulting from use of 32-bit time_t type. This design decision, still affecting Gentoo systems using glibc, means that 32-bit applications will suddenly start failing in horrible ways in 2038: they will be getting -1 error instead of the current time, they won’t be able to stat() files. In one word: complete mayhem will emerge.

There is a general agreement that the way forward is to change time_t to a 64-bit type. Musl has already switched to that, glibc supports it as an option. A number of other distributions such as Debian have taken the leap and switched. Unfortunately, source-based distributions such as Gentoo don’t have it that easy. So we are still debating the issue and experimenting, trying to figure out a maximally safe upgrade path for our users.

Unfortunately, that’s nowhere near trivial. Above all, we are talking about a breaking ABI change. It’s all-or-nothing. If a library uses time_t in its API, everything linking to it needs to use the same type width. In this post, I’d like to explore the issue in detail — why is it so bad, and what we can do to make it safer.

Going back to Large File Support

Before we get into the time64 change, as I’m going to shortly call it, we need to go back in history a bit and consider another similar problem: Large File Support.

Long story short, originally 32-bit architectures specify two important file-related types that were 32 bits wide: off_t used to specify file offsets (signed to support relative offsets) and ino_t used to specify inode numbers. This had two implications: you couldn’t open files larger than 2 GiB, and you couldn’t open files whose inode numbers exceeded 32-bit unsigned integer range.

The important point here is that time64 support in glibc requires LFS to be used. This makes sense — if we are going to break stuff, we may as well solve both problems.

What ABIs are we talking about?

To put it simply, we have three possible sub-ABIs here:

the original ABI with 32-bit types,
LFS: 64-bit off_t and ino_t, 32-bit time_t,
time64: LFS + 64-bit time_t.

What’s important here is that a single glibc build remains compatible with all three variants. However, libraries that use these types in their API are not.

Why the ABI change is so bad?

Let’s consider structures. If a structure contains time_t with its natural 32-bit alignment, then there’s no padding for the type to extend to. Inevitable, all fields will have to shift to make room for the new type. Let’s consider a trivial example:

struct {
    int a;
    time_t b;
    int c;
};

With 32-bit time_t, the offset of c is 8. With the 64-bit type, it’s 16. If you mix binaries using different time_t width, they’re inevitably are going to read or write the wrong fields! Or perhaps even read or write out of bounds!

Let’s just look at the size of struct stat, as an example of structure that uses both file and time-related types. On plain 32-bit x86 glibc it’s 88 byte long. With LFS, it’s 96 byte long (size and inode number fields are expanded). With LFS + time64, it’s 108 byte long (three timestamps are expanded).

However, you don’t even need to use structures. After all, we are talking about x86 where function parameters are passed on stack. If one of the parameters is time_t, then positions of all parameters on stack change, and we find ourselves seeing the exact same problem! Consider the following prototype:

extern void foo(int a, time_t b, int c);

Let’s say we’re calling it as foo(1, 2, 3). With 32-bit types, the call looks like the following:

	pushl	$3
	pushl	$2
	pushl	$1
	call	foo@PLT

However, with 64-bit time_t, it changes to:

	pushl	$3
	pushl	$0
	pushl	$2
	pushl	$1
	call	foo@PLT

An additional 32-bit value (zero) is pushed between the “old” b and c. Once again, if we mix both kinds of binaries, they are going to fail to read the parameters correctly!

$ cat >libfoo.c <<EOF
#include <stdio.h>
#include <time.h>

void foo(int a, time_t b, int *c) {
   printf("a = %d\n", a);
   printf("b = %lld", (long long) b);
   printf("%s", ctime(&b));
   printf("c = %d\n", *c);
}
EOF
$ cat >foo.c <<EOF
#include <stddef.h>
#include <time.h>

extern void foo(int a, time_t b, int *c);

int main() {
    int three = 3;
    foo(1, time(NULL), &three);
    return 0;
}
EOF
$ cc -m32 libfoo.c -shared -o libfoo.so
$ cc -m32 foo.c -o foo -Wl,-rpath,. libfoo.so
$ ./foo
a = 1
b = 1727154919
Tue Sep 24 07:15:19 2024
c = 3
$ cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 \
  libfoo.c -shared -o libfoo.so
$ ./foo 
a = 1
b = -34556652301432063
Thu Jul 20 06:16:17 -1095054749
c = 771539841

On the other hand, in Gentoo we are talking about rebuilding @world while breaking ABI in place. For a start, we are talking around prolonged periods of time between two packages being rebuilt when they would actually be mixing incompatible ABI. Then, there is a fair risk that some rebuild will fail and leave your system half-transitioned with no easy way out. Then, there is a real risk that cyclic dependencies will actually make rebuild impossible — rebuilding a dependency will break build-time tools, preventing stuff from being rebuilt. It’s a true horror.

What can we do to make it safer?

Our deliberations currently revolve about three ideas, that are semi-related, though not inevitably dependent one upon another:

Changing the platform tuple (CHOST) for the new ABIs, to clearly distinguish them from the baseline 32-bit ABI.
Changing the libdir for the new ABIs, effectively permitting the rebuilt libraries to be installed independently of the original versions.
Introducing an binary-level ABI distinction that could prevent binaries using different sub-ABI to be linked to one another.

The subsequent sections will focus on each of these changes in detail. Note that all the values used there are just examples, and not necessarily the strings used in a final solution.

The platform tuple change

The platform tuple (generally referenced through the CHOST variable) identifies the platform targeted by the toolchain. For example, it is used as a part of GCC/binutils install paths, effectively allowing toolchains for multiple targets to be installed simultaneously. In clang, it can be used to switch between supported cross-compilation targets, and can control the defaults to match the specified ABI. In Gentoo, it is also used to uniquely identify ABIs for the purpose of multilib support. Because of that, we require that no two co-installable ABIs share the same tuple.

i386-pc-linux-gnu
i686-pc-linux-gnu
i686-unknown-linux-gnu

armv7a-hardfloat-linux-gnueabi
armv7a-unknown-linux-gnueabihf

i686-gentoo_t64-linux-gnu
i686-pc-linux-gnut64
armv7a-gentoo_t64-linux-gnueabihf
armv7a-unknown-linux-gnueabihft64

The libdir change

The libdir values are generally specified in the ABI. Naturally, the baseline value is plain lib. As a historical convention (since 32-bit architectures were first), usually 32-bit platforms (arm, ppc, x86) use lib, whereas their more modern 64-bit counterparts (amd64, arm64, ppc64) use lib64 — even if a particular architecture never really supported multilib on Gentoo.

Architectures that support multiple ABIs also define different libdirs. For example, the additional x32 ABI on x86 uses libx32. MIPS n32 ABI uses lib32 (with plain lib defining the o32 ABI).

Now, we are considering changing the libdir value for time64 variants of 32-bit ABIs, for example from lib to libt64. This would make it possible to install the rebuilt libraries separately from the old libraries, effectively bringing three advantages:

reducing the risk of time64 executables accidentally linking to time32 libraries,
enabling Portage’s preserved-libs feature to preserve time32 libraries once the respective packages have been rebuilt for time64, and before their reverse dependencies have been rebuilt,
optionally, making it possible to use a time32 + time64 multilib profiles, that could be used to preserve compatibility with prebuilt time32 applications linking to system libraries.

The libdir change is definitely going to require some toolchain patching. We may want to also consider special-casing glibc, as the same set of glibc libraries is valid for all of the sub-ABIs we were considering. However, we will probably want a separate ld.so executable, as it would need to load libraries from the correct libdir, and then we will want to set .interp in time64 executables to reference the time64 ld.so.

Ensuring binary incompatibility

In general, you can’t mix binaries using different ABIs. For example, if you try to link a 64-bit program to a 32-bit library, the linker will object:

$ cc foo.c libfoo.so 
/usr/lib/gcc/x86_64-pc-linux-gnu/14/../../../../x86_64-pc-linux-gnu/bin/ld: libfoo.so: error adding symbols: file in wrong format
collect2: error: ld returned 1 exit status

Similarly, the dynamic loader will refuse to use a 32-bit library with 64-bit program:

$ ./foo 
./foo: error while loading shared libraries: libfoo.so: wrong ELF class: ELFCLASS32

There are a few mechanisms that are used for this. As demonstrated above, architectures with 32-bit and 64-bit ABIs use two distinct ELF classes (ELFCLASS32 and ELFCLASS64). Additionally, some architectures use different machine identifiers (EM_386 vs. EM_X86_64, EM_PPC vs. EM_PPC64). The x32 bit ABI on x86 “abuses” this by declaring its binaries as ELFCLASS32 + EM_X86_64 (and therefore distinct from ELFCLASS32 + EM_386 and from ELFCLASS64 + EM_X86_64).

However, whatever we decide to do, we need to take into consideration that the user may want to disable it. Particularly, there is a fair number of prebuilt software that have no sources available, and it may continue working correctly against system libs, provided it does not call into any API using time_t. The cure of unconditionally preventing them from working might be worse than the disease.

On the bright side, it should be possible to create a non-fatal QA check for this without much hacking, provided that we go with separate libdirs. We can distinguish time64 executables by their .interp section, pointing to the dynamic loader in the appropriate libdir, and then verify that time32 programs will not load any libraries from libt64, and that time64 programs will not load any libraries directly from lib.

What about old prebuilt applications?

For the compatibility problem, we have a reasonably good solution already. Since we already had to make them work on amd64, we have a multilib layout in place, along with necessary machinery to build multiple library versions. In fact, given that the primary purpose of multilib is compatibility with old software, it’s not even clear if there is much of a point in switching amd64 multilib to use time64 for 32-bit binaries. Either way, we can easily extend our multilib machinery to distinguish the regular abi_x86_32 target from abi_x86_t64 (and we probably should do that anyway), and then create new multilib x86 profiles that would support both ABIs.

Summary

As 2038 is approaching, 32-bit applications exercising 32-bit time_t are bound to stop working. At this point, it is pretty clear that the only way forward is to rebuild these applications with 64-bit time_t (and while at it, force LFS as well). Unfortunately, that’s not a trivial task since it involves an ABI change, and mixing time32 and time64 programs and libraries can lead to horrible runtime bugs.

While the exact details are still in the making, the proposed changes revolve around three ideas that can be implemented independently to some degree: changing the platform tuple (CHOST), changing libdir and preventing accidentally mixing time32 and time64 binaries.

The tuple change is mostly a more formal way of distinguishing builds for the regular time32 ABI (e.g. i686-pc-linux-gnu) from ones specifically targeting time64 (e.g. i686-pc-linux-gnut64). It should be relatively harmless and easy to carry out, with minimal amount of fixing necessary. For example, clang will need to be updated to accept new tuples.

The libdir change is probably the most important of all, as it permits a breakage-free transition, thanks to Portage’s preserved-libs feature. Long story short, time64 libraries get installed to a new libdir (e.g. libt64), and the original time32 libraries remain in lib until the applications using them are rebuilt. Unfortunately, it’s a bit harder to implement — it requires toolchain changes, and ensuring that all software correctly respects libdir. The extra difficulty is that with this change alone, the dynamic loader won’t ignore time32 libraries if e.g. -Wl,-rpath,/usr/lib is injected somewhere.

Of course, all of this is still only a rough draft. A lot may still change, following experiments, discussion and patch submission.

Acknowledgements

2024-09-30 correction

Unfortunately, my original ideas were too optimistic. I’ve entirely missed the fact that all libdirs are listed in ld.so.conf, and therefore we cannot rely on hardcoding the libdir path inside ld.so itself. In retrospect, I should have seen that coming — after all, we already adjust these paths for custom LLVM prefix, and that one would require special handling too.

This effectively means that the libdir change probably needs to depend on the binary incompatibility part. Overall, we need to meet three basic goals:

The dynamic loader needs to be able to distinguish time32 and time64 binaries. For time32 programs, it needs to load only time32 libraries; for time64 programs, it needs to load only time64 libraries. In both cases, we need to assume that both kinds of libraries will appear in the path.
For backwards compatibility, we need to assume that all binaries that do not have an explicit time64 marking are time32.
Therefore, all newly built binaries must carry an explicit time64 marking. This includes binaries built by non-C environments, such as Rust, even if they do not interact with time_t ABI at all. Otherwise, these binaries would forever depend on time32 libraries.

Meeting all these goals is a lot of effort. None of the hacks we debated so far seem sufficient to achieve that, so we are probably talking about the level of effort on par with patching multiple toolchains for a variety of programming languages. Naturally, this is not something we can carry locally in Gentoo, so it also requires cooperation from multiple parties. All that for architectures that are largely considered legacy, and sometimes not even really supported anymore.

Of course, another problem is whether these other toolchains are actually going to produce correct time64 executables. After all, unless they are specifically adapted to respect _TIME_BITS the way C programs do, they are probably going to hardcode specific time_t width, and break horribly when it changes. However, that’s really an upstream problem to solve, and tangential to the issues we are discussing here.

As you can see, the situation is developing rapidly. Every day is bringing new challenges, and new ideas on how to overcome them.

September 23 2024

Overview of cross-architecture portability problems

Michał Górny (mgorny) • September 23, 2024, 9:34

Ideally, you’d want your program to work everywhere. Unfortunately, that’s not that simple, even if you’re using high-level “portable” languages such as Python. In this blog post, I’d like to focus on some aspects of cross-architecture problems I’ve seen or heard about during my time in Gentoo. Please note that I don’t mean this to be a comprehensive list of problems — instead, I’m aiming for an interesting read.

What breaks programs on 32-bit systems?

Basic integer type sizes

If you ask anyone what the primary difference between 64-bit and 32-bit architectures is, they will probably answer that it’s register sizes. For many people, register sizes imply differences in basic integer types, and these differences are therefore seen as the primary source of problems on 32-bit architectures when programs are tested on 64-bit architectures only (which is commonly the case nowadays). Actually, it’s not that simple.

Contrary to common expectations, the differences in basic integer types are minimal. Most importantly, your plain int is 32-bit everywhere. The only type that’s actually different is long — it’s 32-bit on 32-bit architectures, and 64-bit on 64-bit architectures. However, people don’t use long all that often in modern programs, so that’s not very likely to cause issues.

Perhaps some people worry about integer sizes because they still foggily remember the issues from porting old 32-bit software to 64-bit architectures. As I’ve mentioned before, int remained 32-bit, but pointers became 64-bit. As a result, if you attempted to cast pointers (or related data) to int, you’d be in trouble (hence we have size_t, ssize_t, ptrdiff_t). Of course, the same thing (i.e. casting pointers to long) done for 64-bit architectures is ugly but won’t technically cause problems on 32-bit architectures.

Note that I’m talking about System V ABI here. Technically, the POSIX and the C standards don’t specify exact integer sizes, and permit a lot more flexibility (the C standard especially — up to having, say, all the types exactly 32-bit).
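If you want to check what the ABI of your own system says, the ctypes module exposes the C type sizes as seen by the running Python interpreter. This is only a quick sketch (basic_type_sizes is a made-up helper name), and the result describes the interpreter's ABI, so a 32-bit Python on a 64-bit kernel will report 32-bit values.

```python
import ctypes


def basic_type_sizes() -> dict[str, int]:
    """Sizes, in bytes, of a few C types in the running interpreter's ABI."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "long long": ctypes.sizeof(ctypes.c_longlong),
        "void *": ctypes.sizeof(ctypes.c_void_p),
        "size_t": ctypes.sizeof(ctypes.c_size_t),
    }


if __name__ == "__main__":
    for name, size in basic_type_sizes().items():
        print(f"{name:10} {size * 8}-bit")
```

On System V platforms, int should come out as 32-bit in both cases, while long, void * and size_t follow the architecture width.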

Address space size

Now, a more likely problem is the address space limitation. Since pointers are 32-bit on 32-bit architectures, a program can address no more than 4 GiB of memory (in reality, somewhat less than that). What’s really important here is that this limits allocated memory, even if it is never actually used.

This can cause curious issues. For example, let’s say that you have a program that allocates a lot of memory, but doesn’t use most of it. If you run this program on a 64-bit system with 2 GiB of total memory, it works just fine. However, if you run it on 32-bit userland with a lot of memory, it fails. And why is that? It’s because the system permitted the program to allocate more memory than it could ever provide — risking an OOM if the program actually tried to use it all; but on the 32-bit architecture, it simply cannot fit all these allocations into 32-bit addresses.

The following sample can trivially demonstrate this:

$ cat > mem-demo.c <<EOF
#include <stdlib.h>
#include <stdio.h>

int main() {
    void *allocs[100];
    int i;
    FILE *urandom = fopen("/dev/urandom", "r");

    for (i = 0; i < 100; ++i) {
        allocs[i] = malloc(1024 * 1024 * 1024);
        if (!allocs[i]) {
            printf("malloc for i = %d failed\n", i);
            return 1;
        }
        fread(allocs[i], 1024, 1, urandom);
    }

    for (i = 0; i < 100; ++i)
        free(allocs[i]);
    fclose(urandom);

    return 0;
}
EOF
$ cc -m64 mem-demo.c -o mem-demo && ./mem-demo
$ cc -m32 mem-demo.c -o mem-demo && ./mem-demo 
malloc for i = 3 failed

The program allocates a grand total of 100 GiB of memory, but uses only the first KiB of each allocation. This works just fine on 64-bit architectures but fails on 32-bit because of failing allocation.

At this point, it’s probably worth noting that we are talking about limitations applicable to a single process. A 32-bit kernel can utilize more than 4 GiB of memory, and therefore multiple processes can use a total of more than 4 GiB. There are also cursed ways of making it possible for a single process to access more than 4 GiB of memory. For example, one could use memfd_create() (or equivalently, files on tmpfs) to create in-memory files that exceed process’ address space, or use IPC to exchange data between multiple processes having separate address spaces (thanks to Arsen Arsenović and David Seifert for their hints on this).

Large File Support

Another problem faced by 32-bit programs is that the file-related types are traditionally 32-bit. This has two implications. The more obvious one is that off_t, the type used to express file sizes and offsets, is a signed 32-bit integer, so you cannot stat() and therefore open files of 2 GiB or larger. The less obvious implication is that ino_t, the type used to express inode numbers, is also 32-bit, so you cannot open files with inode numbers 2^32 and higher. In other words, given a large enough filesystem, you may suddenly be unable to open random files, even if they are smaller than 2 GiB.

Now, this is a problem that can be solved. Modern programs usually define _FILE_OFFSET_BITS=64 and get 64-bit types instead. In fact, musl libc unconditionally provides 64-bit types, rendering this problem a relic of the past — and apparently glibc is planning to switch the default in the future as well.

Here’s a trivial demo:

$ cat > lfs-demo.c <<EOF
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = open("lfs-test", O_RDONLY);

    if (fd == -1) {
        perror("open() failed");
        return 1;
    }

    close(fd);
    return 0;
}
EOF
$ truncate -s 2G lfs-test
$ cc -m64 lfs-demo.c -o lfs-demo && ./lfs-demo
$ cc -m32 lfs-demo.c -o lfs-demo && ./lfs-demo 
open() failed: Value too large for defined data type
$ cc -m32 -D_FILE_OFFSET_BITS=64 lfs-demo.c \
    -o lfs-demo && ./lfs-demo

Unfortunately, while fixing a single package is trivial, a global switch is not. The sizes of off_t and ino_t change, and so effectively does the ABI of any libraries that use these types in the API; i.e. if you rebuild the library without rebuilding the programs using it, they could break in unexpected ways. What you can do is either switch everything simultaneously, or go slowly and change the types via a new API, preserving the old one for compatibility. The latter is unlikely to happen, given there’s very little interest in 32-bit architecture support these days. The former also isn’t free of issues: technically speaking, you may end up introducing incompatibility with prebuilt software that used the 32-bit types, and effectively lose the ability to run some proprietary software entirely.

time_t and the y2k38 problem

The low-level way of representing timestamps in C is through the number of seconds since the so-called epoch. This number is represented in a time_t type, which, as you can probably guess, was a signed 32-bit integer on 32-bit architectures. This means that it can hold positive values up to 2^31 - 1 seconds, which roughly corresponds to 68 years. Since the epoch on POSIX systems was defined as 1970, this means that the type can express timestamps up to 2038.

What does this mean in practice? Programs using 32-bit time_t can’t express dates beyond the cutoff 2038 date. If you try to do arithmetic spanning beyond this date (e.g. “20 years from now”), you get an overflow. stat() is going to fail on files with timestamps beyond that point (though, interestingly, open() works on glibc, so it’s not entirely symmetric with the LFS case). Past the overflow date, you get an error even trying to get the current time — and if your program doesn’t account for the possibility of time() failing, it’s going to be forever stuck 1 second before the epoch, or 1969-12-31 23:59:59. Effectively, it may end up hanging randomly (waiting for some wall clock time to pass), not firing events or seeding a PRNG with a constant.
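The exact cutoff is easy to work out. The sketch below computes the last second a signed 32-bit time_t can express, and what a naive addition lands on if the value wraps around. The wraparound is an assumption for illustration (two's complement behavior; signed overflow is formally undefined in C), and the helper names are made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TIME32_MAX = 2**31 - 1  # largest value of a signed 32-bit time_t


def from_time32(seconds: int) -> datetime:
    """Convert seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def wrap_int32(value: int) -> int:
    """Emulate two's complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(from_time32(TIME32_MAX))                  # last representable second
    print(from_time32(wrap_int32(TIME32_MAX + 1)))  # one second later, wrapped
```

The first line printed is 2038-01-19 03:14:07 UTC; the wrapped value ends up in December 1901.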

Again, modern glibc versions provide a switch. If you define _TIME_BITS=64 (plus LFS flags, as a prerequisite), your program is going to get a 64-bit time_t. Modern versions of musl libc also default to the 64-bit type (since 1.2.0). Unfortunately, switching to the 64-bit type brings the same risks as switching to LFS globally — or perhaps even worse because time_t seems to be more common in library API than file size-related types were.

These solutions only work for software that is built from source, and uses time_t correctly. Converting timestamps to int will cause overflow bugs. File formats with 32-bit timestamp fields are essentially broken. Most importantly, all proprietary software will remain broken and in need of serious workarounds.
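The file format problem can be illustrated without any 32-bit system at hand. The following sketch assumes a hypothetical binary format that stores timestamps in a little endian signed 32-bit field; the function names are invented for the example.

```python
import struct


def pack_timestamp32(ts: int) -> bytes:
    """Store a timestamp in a signed 32-bit little endian field."""
    return struct.pack("<i", ts)


def pack_timestamp64(ts: int) -> bytes:
    """Store a timestamp in a signed 64-bit little endian field."""
    return struct.pack("<q", ts)


if __name__ == "__main__":
    last_ok = 2**31 - 1
    print(pack_timestamp32(last_ok).hex())
    try:
        pack_timestamp32(last_ok + 1)
    except struct.error as exc:
        # The value simply does not fit into the field anymore.
        print("cannot store:", exc)
    print(pack_timestamp64(last_ok + 1).hex())
```

Python at least raises an error here. C code assigning a 64-bit time_t to a 32-bit field will typically truncate the value silently instead.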

Here are some samples demonstrating the problems. Please note that the first sample assumes the system clock is set beyond 2038.

$ cat > time-test.c <<EOF
#include <stdio.h>
#include <time.h>

int main() {
    time_t t = time(NULL);

    if (t != -1) {
        struct tm *dt = gmtime(&t);
        char out[32];

        strftime(out, sizeof(out), "%F %T", dt);
        printf("%s\n", out);
    } else
        perror("time() failed");

    return 0;
}
EOF
$ cc -m64 time-test.c -o time-test && ./time-test
2060-03-04 11:13:02
$ cc -m32 time-test.c -o time-test && ./time-test
time() failed: Value too large for defined data type
$ cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 \
    time-test.c -o time-test && ./time-test
2060-03-04 11:13:32
$ cat > mtime-test.c <<EOF
#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main() {
    struct stat st;
    int fd;

    if (stat("time-data", &st) == 0) {
        char buf[32];
        struct tm *tm = gmtime(&st.st_mtime);
        strftime(buf, sizeof(buf), "%F %T", tm);
        printf("mtime: %s\n", buf);
    } else
        perror("stat() failed");

    fd = open("time-data", O_RDONLY);
    if (fd == -1) {
        perror("open() failed");
        return 1;
    }
    close(fd);

    return 0;
}
EOF
$ touch -t '206001021112' time-data
$ cc -m64 mtime-test.c -o mtime-test && ./mtime-test
mtime: 2060-01-02 10:12:00
$ cc -m32 mtime-test.c -o mtime-test && ./mtime-test
stat() failed: Value too large for defined data type
$ cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 \
    mtime-test.c -o mtime-test && ./mtime-test
mtime: 2060-01-02 10:12:00

Are these problems specific to C?

It is probably worth noting that while portability issues are generally discussed in terms of C, not all of them are specific to C, or to programs directly interacting with C API.

For example, address space limitations affect all programming languages, unless they take special effort to work around them (I’m not aware of any that do). So a Python program will be limited by the 4 GiB of address space the same way C programs are — except that Python programs don’t allocate memory explicitly, so the limit will be rather on memory used than allocated. On the minus side, Python programs will probably be less memory efficient than C programs.

File and time type sizes also sometimes affect programming languages internally. Modern versions of Python are built with Large File Support enabled, so they aren’t limited to 32-bit file sizes and inode numbers. However, they are limited to 32-bit timestamps:

>>> import datetime
>>> datetime.datetime(2060, 1, 1)
datetime.datetime(2060, 1, 1, 0, 0)
>>> datetime.datetime(2060, 1, 1).timestamp()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: timestamp out of range for platform time_t

Other generic issues

Byte order (endianness)

The predominant byte order nowadays is little endian. X86 was always little endian. ARM is bi-endian, but defaults to running little endian (and there was never much incentive to run big endian ARM). PowerPC used to default to big endian, but these days PPC64 systems are mostly running little endian instead.

It’s not that either byte order is superior in some way. It’s just that x86 happened to arbitrarily use that byte order. Given its popularity, a lot of non-portable software has been written that worked correctly on little endian only. Over time, people lost the incentive to run big endian systems and this eventually led to even worse big endian support overall.

The most common issues related to byte order occur when implementing binary data formats, particularly file formats and network protocols. A missing byte order conversion can lead to the program throwing an error or incorrectly reading files written on other platforms, writing incorrect files or failing to communicate with peers on other platforms correctly. In extreme cases, a program that missed some byte order conversions may be unable to read a file it has written before.

Again, byte order problems are not limited to C. For example, the struct module in Python uses explicit byte order, size and alignment modifiers.
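A short sketch of what these modifiers do in practice: the same 32-bit value is serialized with an explicit little endian ("<"), explicit big endian (">") and native ("=") byte order. The read_u32_le helper is a made-up name standing in for a parser of a format that is defined as little endian.

```python
import struct
import sys

VALUE = 0x11223344

little = struct.pack("<I", VALUE)  # explicit little endian
big = struct.pack(">I", VALUE)     # explicit big endian
native = struct.pack("=I", VALUE)  # whatever the host uses


def read_u32_le(data: bytes, offset: int = 0) -> int:
    """Read a 32-bit unsigned integer stored as little endian, on any host."""
    return struct.unpack_from("<I", data, offset)[0]


if __name__ == "__main__":
    print(sys.byteorder, little.hex(), big.hex(), native.hex())
    # Reading little endian data with the wrong byte order "works",
    # but silently yields a different number.
    print(hex(struct.unpack(">I", little)[0]))
```

Code that uses the native byte order for data written to disk or sent over the network passes its tests on the developer's machine and breaks once the data crosses to a host with the other byte order.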

Curiously enough, byte order issues are not limited to low-level data formats either. To give another example, the UTF-16 and UTF-32 encodings also have little endian and big endian variations. When the user does not request a specific byte order, Python uses the host’s byte order and adds a BOM to the string, which is used to detect the correct byte order when decoding.

>>> "foo".encode("UTF-16LE")
b'f\x00o\x00o\x00'
>>> "foo".encode("UTF-16BE")
b'\x00f\x00o\x00o'
>>> "foo".encode("UTF-16")
b'\xff\xfef\x00o\x00o\x00'

char signedness

This is probably one of the most confusing portability problems you may see. Roughly, the problem is that the C standard does not specify the signedness of the char type (unlike int). Some platforms define it as signed, others as unsigned. In fact, the standard goes a step further and defines char as a distinct type from both signed char and unsigned char, rather than an alias to either of them.

For example, the System V ABI for x86 and SPARC specifies that char is signed, whereas for MIPS and PowerPC it is unsigned. Assuming either and doing arithmetic on top of that could lead to surprising results on the other set of platforms. In fact, one of the most confusing cases I’ve seen was with code that was used only for big endian platforms, and therefore worked on PowerPC but not on SPARC (even though it would also fail on x86, if it was used there).

Here is an example inspired by it. The underlying idea is to read a little endian 32-bit unsigned integer from a char array:

$ cat > char-sign.c <<EOF
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main() {
        char buf[] = {0x00, 0x40, 0x80, 0xa0};
        char *p = buf;
        uint32_t val = 0;

        val |= (*p++);
        val |= (*p++) << 8;
        val |= (*p++) << 16;
        val |= (*p++) << 24;

        printf("%08" PRIx32 "\n", val);
}
EOF
$ cc -funsigned-char char-sign.c -o char-sign
$ ./char-sign
a0804000
$ cc -fsigned-char char-sign.c -o char-sign
$ ./char-sign
ff804000

Please note that for the sake of demonstration, the example uses -fsigned-char and -funsigned-char switches to override the default platform signedness. In real code, you’d explicitly use unsigned char instead.
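The same arithmetic can be replayed in Python to see where the extra bits come from. This is only an emulation of the C loop above, with read_u32_le_buggy and read_u32_le being hypothetical names; the "4b" and "4B" struct formats stand in for signed and unsigned char respectively.

```python
import struct

DATA = bytes([0x00, 0x40, 0x80, 0xA0])


def read_u32_le_buggy(data: bytes, signed_char: bool) -> int:
    """Mimic the C loop, treating each byte as signed or unsigned char."""
    fmt = "4b" if signed_char else "4B"
    val = 0
    for i, byte in enumerate(struct.unpack(fmt, data)):
        # A negative byte is sign-extended before the shift, which sets all
        # the bits above it; the mask emulates the conversion to uint32_t.
        val |= (byte << (8 * i)) & 0xFFFFFFFF
    return val


def read_u32_le(data: bytes) -> int:
    """Signedness-independent variant: bytes are always unsigned here."""
    return int.from_bytes(data, "little")


if __name__ == "__main__":
    print(f"{read_u32_le_buggy(DATA, signed_char=False):08x}")
    print(f"{read_u32_le_buggy(DATA, signed_char=True):08x}")
    print(f"{read_u32_le(DATA):08x}")
```

The byte 0x80 becomes -128 when treated as signed, and shifting it left by 16 bits fills the top byte with ones, which then masks whatever the last byte was supposed to contribute.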

Strict alignment

I feel that alignment is not a well-known problem, so perhaps I should start by explaining it a bit. Long story short, alignment is about ensuring that particular types are placed across appropriate memory boundaries. For example, on most platforms 32-bit types are expected to be aligned at 32-bit (= 4 byte) boundaries. In other words, you expect that the type’s memory address would be a multiple of 4 bytes — irrespective of whether it’s on stack or heap, used directly, in an array, a structure or perhaps an array of structures.

Perhaps the simplest way to explain that is to show how the compiler achieves alignment in structures. Please consider the following type:

struct {
    int16_t a;
    int32_t b;
    int16_t c;
}

As you can see, it contains two 2-byte types and one 4-byte type, so that would be a total of 8 bytes, right? Nothing could be more wrong, at least on platforms requiring 32-bit alignment for int32_t. To guarantee that b would be correctly aligned whenever the whole structure is correctly aligned, the compiler needs to move it to an offset being a multiple of 4. Furthermore, to guarantee that if the structure is used in an array, every instance is correctly aligned, it also needs to increase its size to a multiple of 4.

Effectively, the resulting structure resembles the following:

struct {
    int16_t a;
    int16_t _pad1;
    int32_t b;
    int16_t c;
    int16_t _pad2;
}

In fact, you can find some libraries actually defining structures with explicit padding. So you get a padding of 2 + 2 bytes, b at offset 4, and a total size of 12 bytes.
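You can verify this layout without writing any C, since ctypes follows the platform's native alignment rules. The sketch below declares the same structure (Sample and layout are names picked for the example) and reports field offsets and the total size; it assumes a platform where int32_t requires 4-byte alignment.

```python
import ctypes


class Sample(ctypes.Structure):
    # Same field order as in the C structure above.
    _fields_ = [
        ("a", ctypes.c_int16),
        ("b", ctypes.c_int32),
        ("c", ctypes.c_int16),
    ]


def layout(struct_type):
    """Return ({field name: offset}, total size) for a ctypes structure."""
    offsets = {
        name: getattr(struct_type, name).offset
        for name, _ctype in struct_type._fields_
    }
    return offsets, ctypes.sizeof(struct_type)


if __name__ == "__main__":
    offsets, size = layout(Sample)
    payload = sum(ctypes.sizeof(ctype) for _name, ctype in Sample._fields_)
    print(offsets, "size:", size, "payload:", payload)
```

The difference between the reported size and the payload is the padding inserted by the ABI.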

Now, what would happen if the alignment requirements weren’t met? On the majority of platforms, misaligned types are still going to work, usually at a performance penalty. However, on some platforms like SPARC, they will actually cause the program to terminate with a SIGBUS. Consider the following example:

$ cat > align-test.c <<EOF
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main() {
	uint8_t buf[6] = {0, 0, 0, 4, 0, 0};
	int32_t *number = (int32_t *) &buf[2];
	printf("%" PRIi32 "\n", *number);
	return 0;
}
EOF
$ cc align-test.c -o align-test
$ ./align-test
1024

The code is meant to resemble a cheap way of reading data from a file, and then getting a 32-bit integer at offset 2. However, on SPARC this code will not work as expected:

$ ./align-test
Bus error (core dumped)

As you can probably guess, there is a fair number of programs suffering from issues like that simply because they don’t crash on x86, and it’s easy to silence the normal compiler warnings (e.g. by type punning, as used in the example). However, as noted before, this code will not only cause a crash on SPARC; it may also cause a performance penalty everywhere else.

Stack size

As low-level C programmers tend to learn, there are two main kinds of memory available to the program: the heap and the stack. The heap is the main memory area from which explicit allocations are done. The stack is a relatively small area of memory that is given to the program for its immediate use.

The main difference is that the use of heap is controlled: a well-written program allocates as much memory as it needs, and doesn’t access areas outside of that. On the other hand, stack use is “uncontrolled”: programs generally don’t check stack bounds. As you may guess, this means that if a program uses it too much, it’s going to exceed the available stack, i.e. hit a stack overflow, which generally manifests itself as a “weird” segmentation fault.

And how do you actually use a lot of stack memory? In C, local function variables are kept on stack — so the more variables you use, the more stack you fill. Furthermore, some ABIs use stack to pass function parameters and return values — e.g. x86 (but not the newer amd64 or x32 ABIs). But most importantly, stack frames are used to record the function call history — and this means the deeper you call, the larger the stack use.

This is precisely why programmers are cautioned against recursive algorithms — especially if built without protection against deep recursion, they provide a trivial way to cause a stack overflow. And this last problem is not limited to C — recursive function calls in Python also result in recursive function calls in C. Python comes with a default recursion limit to prevent this from happening. However, as we recently found out the hard way, this limit needs to be adjusted across different architectures and compiler configurations, as their stack frame sizes may differ drastically: from a baseline of 8–16 bytes on common architectures such as x86 or ARM, through 112–128 bytes on PPC64, up to 160–176 bytes on s390x and SPARC64.

On top of that, the default thread stack size varies across the standard C libraries. On glibc, it is usually between 2 MiB and 10 MiB, whereas on musl it is 128 KiB. Therefore, in some cases you may actually need to explicitly request a larger stack.
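Both knobs are reachable from Python. The sketch below probes how deep a trivial function can recurse before the interpreter's recursion limit kicks in, and runs that probe in a thread created with an explicitly requested stack size. The helper names and the 16 MiB figure are arbitrary choices for illustration, and threading.stack_size() may raise an exception on platforms that don't support changing the stack size.

```python
import sys
import threading


def probe_depth(n: int = 0) -> int:
    """Recurse until Python's recursion limit is hit and return the depth."""
    try:
        return probe_depth(n + 1)
    except RecursionError:
        return n


def run_with_stack(func, stack_bytes: int):
    """Run func() in a new thread created with the given stack size."""
    result = []
    previous = threading.stack_size()      # 0 means "platform default"
    threading.stack_size(stack_bytes)      # applies to threads created later
    try:
        thread = threading.Thread(target=lambda: result.append(func()))
        thread.start()
        thread.join()
    finally:
        threading.stack_size(previous)     # restore the earlier setting
    return result[0]


if __name__ == "__main__":
    print("recursion limit:", sys.getrecursionlimit())
    print("depth reached:", run_with_stack(probe_depth, 16 * 1024 * 1024))
```

Note that the recursion limit protects only as long as it is low enough for the actual stack: raising it via sys.setrecursionlimit() without also providing a larger stack brings the segmentation fault right back.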

The wondrous world of floating-point types

x87 math

The x86 platform supports two modes of floating-point arithmetic:

The legacy 387 floating-point arithmetic that utilizes 80-bit precision registers (-mfpmath=387).
The more modern SSE arithmetic that operates directly on the 32-bit and 64-bit precision types, while the 80-bit long double type is still handled by x87 instructions (-mfpmath=sse).

The former is the default on 32-bit x86 platforms using the System V ABI, the latter everywhere else. And why does that matter? Because the former may imply performing some computations using the extended 80-bit precision before converting the result back to the original type, effectively implying a smaller rounding error than performing the same computations on the original type directly.

Consider the following example:

$ cat > float-demo.c <<EOF
#include <stdio.h>

__attribute__((noipa))
double fms(double a, double b, double c) {
	return a * b - c;
}

int main() {
	printf("%+.40f\n", fms(1./3, 1./3, 1./9));
	return 0;
}
EOF
$ cc -mfpmath=sse float-demo.c -o float-demo
$ ./float-demo
+0.0000000000000000000000000000000000000000
$ cc -mfpmath=387 float-demo.c -o float-demo
$ ./float-demo
-0.0000000000000000061663998560113064684174

What’s happening here? The program is computing 1/3 * 1/3 - 1/9, which we know should be zero. Except that it isn’t when using x87 FPU instructions. Why?

Normally, this computation is done in two steps. First, the multiplication 1/3 * 1/3 is done. Afterwards, 1/9 is subtracted from the result. In SSE mode, both steps are done directly on the double type. However, in x87 mode the doubles are converted to 80-bit floats first, both computations are done on these and then the result is converted back to double. We can see that looking at the respective assembly fragments:

$ cc -mfpmath=sse float-demo.c -S -o -
[…]
	movsd	-8(%rbp), %xmm0
	mulsd	-16(%rbp), %xmm0
	subsd	-24(%rbp), %xmm0
[…]
$ cc -mfpmath=387 float-demo.c -S -o -
[…]
	fldl	-8(%rbp)
	fmull	-16(%rbp)
	fsubl	-24(%rbp)
	fstpl	-32(%rbp)
[…]

Now, neither ⅓ nor ⅑ can be precisely expressed in the binary system. So 1./3 is actually ⅓ + some error, and 1./9 is ⅑ + another error. It happens that 1./3 * 1./3 after rounding is giving the same value as 1./9, so subtracting one from the other yields zero. However, when computations are done using an intermediate type of higher precision, the squared error from 1./3 * 1./3 is rounded at a higher precision, and is therefore different from the one in 1./9. So counter-intuitively, higher precision here amplifies a rounding error and yields the “incorrect” result!

Of course, this is not that big of a deal — we are talking about 17 decimal places, and user-facing programs will probably round that down to 0. However, this can lead to problems in programs written to expect an exact value — e.g. in test suites.

Gentoo has already switched amd64 multilib profiles to force -mfpmath=sse for 32-bit builds, and it is planning to switch the x86 profiles as well. While this doesn’t solve the underlying issue, it yields more consistent results across different architectures and therefore reduces the risk of our users hitting these bugs. However, this has a surprising downside: some packages actually adapted to expect different results on 32-bit x86, and now fail when SSE arithmetic is used there.

It doesn’t take two architectures to make a rounding problem

Actually, you don’t have to run a program on two different architectures to see rounding problems: different optimization levels, and particularly different CPU instruction sets, can also result in different rounding errors. Let’s try compiling the previous example with and without FMA instructions:

$ cc -mno-fma -O2 float-demo.c -o float-demo
$ ./float-demo
+0.0000000000000000000000000000000000000000
$ cc -mfma -O2 float-demo.c -o float-demo
$ ./float-demo
-0.0000000000000000061679056923619804377437

The first invocation is roughly the same as before. The second one enables use of the FMA instruction set that performs the multiplication and subtraction in one step:

$ cc -mfma -O2 float-demo.c -S -o -
[…]
	vfmsub132sd	%xmm1, %xmm2, %xmm0
[…]

Again, this means that the intermediate value is not rounded to double, and therefore doesn’t carry the same error as 1./9.
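The effect of skipping the intermediate rounding can be reproduced without FMA hardware. In the sketch below, fms_double rounds the product to double before subtracting, while fms_exact uses the fractions module to keep the product exact and rounds only the final result, which is what a fused multiply-subtract does. Both function names are made up for this example, and the first result assumes plain IEEE 754 double arithmetic (i.e. not x87 extended precision).

```python
from fractions import Fraction

A = B = 1.0 / 3
C = 1.0 / 9


def fms_double(a: float, b: float, c: float) -> float:
    """a * b is rounded to double first, then c is subtracted."""
    return a * b - c


def fms_exact(a: float, b: float, c: float) -> float:
    """Exact product and difference, rounded to double only once at the end."""
    # Fraction(float) converts the double's value exactly, with no new error.
    return float(Fraction(a) * Fraction(b) - Fraction(c))


if __name__ == "__main__":
    print(f"{fms_double(A, B, C):+.40f}")
    print(f"{fms_exact(A, B, C):+.40f}")
```

The second function returns a tiny negative number of the same order of magnitude as the FMA output shown above, rather than zero.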

The bottom line is this: never match floating-point computation results exactly, allow for some error. Even if something works for you, it may fail not only for a different architecture, but even for different optimization flags. And counter-intuitively, more precise results may amplify errors and yield intuitively “wrong” values.
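In Python test suites, this usually boils down to using math.isclose() (or an equivalent from your test framework) instead of ==. The sketch below wraps it in a made-up results_match helper; the tolerance values are arbitrary and need to be picked per computation.

```python
import math


def results_match(actual: float, expected: float,
                  rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Compare two floating-point results, allowing for rounding error."""
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)


if __name__ == "__main__":
    print(0.1 + 0.2 == 0.3)                 # exact comparison
    print(results_match(0.1 + 0.2, 0.3))    # tolerant comparison
    # A relative tolerance alone is useless when the expected value is zero,
    # hence the absolute tolerance.
    print(math.isclose(-6.17e-18, 0.0))
    print(results_match(-6.17e-18, 0.0))
```

The absolute tolerance matters for cases like the one above, where the expected result is exactly zero and any relative tolerance scales down to nothing.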

The long double type

As you can probably guess by now, the C standard doesn’t define precisely what float, double and long double types are. Fortunately, it seems that the first two types are uniformly implemented as, respectively, a single-precision (32-bit) and a double-precision (64-bit) IEEE 754 floating point number. However, as far as the third type is concerned, we might find it to be any of:

the same type as double — on architectures such as 32-bit ARM,
the 80-bit x87 extended precision type — on amd64 and x86,
a type implementing double-double arithmetic — i.e. representing the number as a sum of two double values, giving roughly 106-bit precision, e.g. on PowerPC,
the quadruple precision (128-bit) IEEE 754 type — e.g. on SPARC.

Once again, this is primarily a matter of precision, and therefore it only breaks test suites that assume specific precision for the type. To demonstrate the differences in precision, we can use the following sample program:

#include <stdio.h>

int main() {
	printf("%0.40Lf\n", 1.L/3);
	return 0;
}

Running it across different architectures, we’re going to see:

arm64: 0.3333333333333333333333333333333333172839
ppc64: 0.3333333333333333333333333333333292246828
amd64: 0.3333333333333333333423683514373792036167
arm32: 0.3333333333333333148296162562473909929395

Summary

Portability is no trivial matter, that’s clear. What’s perhaps more surprising is that portability problems aren’t limited to C and similar low-level languages — I have shown multiple examples of how they leak into Python.

Perhaps the most common portability issues these days come from 32-bit architectures. Many projects today are tested only on 64-bit systems, and therefore face regressions on 32-bit platforms. Perhaps surprisingly, most of the issues stem not from incorrect type use in C, but rather from platform limitations — available address space, lack of support for large files or large time_t. All of these limitations apply to non-C programs that are built on C runtime as well, and sometimes require non-trivial fixes. Notably, switching to a 64-bit time_t is going to be a major breaking change (and one that I’ll cover in a separate post).

Other issues may be more obscure, and specific to individual architectures. On PPC64 or SPARC, we hit issues related to big endian byte order. On MIPS and PowerPC, we may be surprised by char being unsigned. On SPARC, we’re going to hit crashes if we don’t align types properly. Again, on PPC64 and SPARC we are also more likely to hit stack overflows. And on i386, we may discover problems due to different precision in floating-point computations.

These are just some examples, and they definitely do not exhaust the possible issues. Furthermore, sometimes you may discover a combination of two different problems, furthering your confusion, just like the package that was broken only on big endian systems with signed char.

On the other hand, all these differences provide an interesting opportunity: by testing the package on a bunch of architectures and knowing their characteristics, you can guess what could be wrong with it. Say, if it fails on PPC64 but passes on PPC64LE, you may guess it’s a byte order issue — and then it turns out it was actually a stack overflow, because big endian PPC64 happens to default to ELFv1 ABI that uses slightly larger stack frames. But hey, usually it does help.

Portability is important. The problematic architectures may constitute a tiny portion of your user base. In fact, sometimes I do wonder if some of the programs we’re fixing are actually going to be used by any real user of these architectures, or if we’re merely cargo culting keywords added a long time ago. You may even argue that it’s better for the environment if people discarded these machines rather than kept them burning energy. However, portability makes for good code. What may seem like catering to a tiny minority today may turn out to prevent a major security incident for all your users tomorrow.

Overview of cross-architecture portability problems

Michał Górny (mgorny) • September 23, 2024, 9:34

What breaks programs on 32-bit systems?

Basic integer type sizes

Contrary to common expectations, the differences in basic integer types are minimal. Most importantly, your plain int is 32-bit everywhere. The only type that’s actually different is long — it’s 32-bit on 32-bit architectures, and 64-bit on 64-bit architectures. However, people don’t use long all that often in modern programs, so that’s not very likely to cause issues.

Perhaps some people worry about integer sizes because they still foggily remember the issues from porting old 32-bit software to 64-bit architectures. As I’ve mentioned before, int remained 32-bit, but pointers became 64-bit. As a result, if you attempted to cast pointers (or related data) to int, you’d be in trouble (hence we have size_t, ssize_t, ptrdiff_t). Of course, the opposite direction is more forgiving: code written for 64-bit architectures that casts pointers to long is ugly, but won’t technically cause problems on 32-bit architectures.
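
The type sizes discussed above can also be inspected without writing any C. The following is a minimal Python sketch, assuming the interpreter was built for the same ABI as the rest of the system; the helper name native_sizes is made up for illustration:

```python
import struct


def native_sizes() -> dict:
    """Return the sizes (in bytes) of selected native C types."""
    # "@" requests native sizes; "P" (void *) and "N" (size_t)
    # are only available in native mode.
    return {
        "int": struct.calcsize("@i"),
        "long": struct.calcsize("@l"),
        "void *": struct.calcsize("@P"),
        "size_t": struct.calcsize("@N"),
    }


if __name__ == "__main__":
    for name, size in native_sizes().items():
        print(f"{name}: {size * 8} bits")
```

On a typical 64-bit Linux system this should report a 32-bit int next to a 64-bit long, pointer and size_t, whereas on 32-bit x86 all four are 32-bit.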

Address space size

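A 32-bit process can address at most 4 GiB of virtual memory, and part of that range is already taken by the program itself, its libraries and its stack, so even less is available for allocations. The following sample can trivially demonstrate this: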

$ cat > mem-demo.c <<EOF
#include <stdlib.h>
#include <stdio.h>

int main() {
    void *allocs[100];
    int i, j;
    FILE *urandom = fopen("/dev/urandom", "r");

    for (i = 0; i < 100; ++i) {
        allocs[i] = malloc(1024 * 1024 * 1024);
        if (!allocs[i]) {
            printf("malloc for i = %d failed\n", i);
            return 1;
        }
        fread(allocs[i], 1024, 1, urandom);
    }

    for (i = 0; i < 100; ++i)
        free(allocs[i]);
    fclose(urandom);

    return 0;
}
EOF
$ cc -m64 mem-demo.c -o mem-demo && ./mem-demo
$ cc -m32 mem-demo.c -o mem-demo && ./mem-demo 
malloc for i = 3 failed

At this point, it’s probably worth noting that we are talking about limitations applicable to a single process. A 32-bit kernel can utilize more than 4 GiB of memory, and therefore multiple processes can use a total of more than 4 GiB. There are also cursed ways of making it possible for a single process to access more than 4 GiB of memory. For example, one could use memfd_create() (or equivalently, files on tmpfs) to create in-memory files that exceed the process’s address space, or use IPC to exchange data between multiple processes having separate address spaces (thanks to Arsen Arsenović and David Seifert for their hints on this).

Large File Support

Another problem faced by 32-bit programs is that the file-related types are traditionally 32-bit. This has two implications. The more obvious one is that off_t, the type used to express file sizes and offsets, is a signed 32-bit integer, so you cannot stat(), and therefore cannot open, files of 2 GiB or larger. The less obvious implication is that ino_t, the type used to express inode numbers, is also 32-bit, so you cannot open files with inode numbers 2^32 and higher. In other words, given a large enough filesystem, you may suddenly be unable to open random files, even if they are smaller than 2 GiB.

Now, this is a problem that can be solved. Modern programs usually define _FILE_OFFSET_BITS=64 and get 64-bit types instead. In fact, musl libc unconditionally provides 64-bit types, rendering this problem a relic of the past — and apparently glibc is planning to switch the default in the future as well.

Here’s a trivial demo:

$ cat > lfs-demo.c <<EOF
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = open("lfs-test", O_RDONLY);

    if (fd == -1) {
        perror("open() failed");
        return 1;
    }

    close(fd);
    return 0;
}
EOF
$ truncate -s 2G lfs-test
$ cc -m64 lfs-demo.c -o lfs-demo && ./lfs-demo
$ cc -m32 lfs-demo.c -o lfs-demo && ./lfs-demo 
open() failed: Value too large for defined data type
$ cc -m32 -D_FILE_OFFSET_BITS=64 lfs-demo.c \
    -o lfs-demo && ./lfs-demo
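
The reason why a file of exactly 2 GiB already fails can be shown with a bit of arithmetic. This is only a sketch of the value range of a signed off_t; the helper names off_t_max and fits_in_off_t are hypothetical and not part of any library:

```python
GIB = 1024 ** 3


def off_t_max(bits: int) -> int:
    """Largest value representable in a signed off_t of the given width."""
    return 2 ** (bits - 1) - 1


def fits_in_off_t(size: int, bits: int) -> bool:
    """Check whether a file size can be expressed in a signed off_t."""
    return 0 <= size <= off_t_max(bits)


if __name__ == "__main__":
    # One byte short of 2 GiB is the largest size a 32-bit off_t can hold.
    print(fits_in_off_t(2 * GIB - 1, 32))
    # The 2G file created with truncate above is one byte too large...
    print(fits_in_off_t(2 * GIB, 32))
    # ...but poses no problem once off_t is 64-bit.
    print(fits_in_off_t(2 * GIB, 64))
```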

Unfortunately, while fixing a single package is trivial, a global switch is not. The sizes of off_t and ino_t change, and so effectively does the ABI of any libraries that use these types in the API: if you rebuild the library without rebuilding the programs using it, they could break in unexpected ways. What you can do is either switch everything simultaneously, or go slowly and change the types via a new API, preserving the old one for compatibility. The latter is unlikely to happen, given there’s very little interest in 32-bit architecture support these days. The former also isn’t free of issues. Technically speaking, you may end up introducing incompatibility with prebuilt software that used the 32-bit types, and effectively lose the ability to run some proprietary software entirely.

time_t and the y2k38 problem

The low-level way of representing timestamps in C is through the number of seconds since the so-called epoch. This number is represented in a time_t type, which, as you can probably guess, was a signed 32-bit integer on 32-bit architectures. This means that it can hold positive values up to 2³¹ – 1 seconds, which roughly corresponds to 68 years. Since the epoch on POSIX systems was defined as the start of 1970, this means that the type can express timestamps up to 2038.

What does this mean in practice? Programs using 32-bit time_t can’t express dates beyond the cutoff 2038 date. If you try to do arithmetic spanning beyond this date (e.g. “20 years from now”), you get an overflow. stat() is going to fail on files with timestamps beyond that point (though, interestingly, open() works on glibc, so it’s not entirely symmetric with the LFS case). Past the overflow date, you get an error even trying to get the current time — and if your program doesn’t account for the possibility of time() failing, it’s going to be forever stuck 1 second before the epoch, or 1969-12-31 23:59:59. Effectively, it may end up hanging randomly (waiting for some wall clock time to pass), not firing events or seeding a PRNG with a constant.
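
To put concrete numbers on the cutoff, here is a small Python sketch. It only models the value range of a signed 32-bit time_t; the helper names are made up for illustration, and wrap_int32 emulates what truncating a wider timestamp to a 32-bit integer would do rather than anything libc does on its own:

```python
from datetime import datetime, timedelta, timezone

TIME_T_MAX_32 = 2 ** 31 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_32bit_timestamp() -> datetime:
    """The last moment that a signed 32-bit time_t can express."""
    return EPOCH + timedelta(seconds=TIME_T_MAX_32)


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range (two's complement)."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


def fits_in_time_t32(when: datetime) -> bool:
    """Check whether a timezone-aware datetime fits in a 32-bit time_t."""
    seconds = int((when - EPOCH).total_seconds())
    return -2 ** 31 <= seconds <= TIME_T_MAX_32


if __name__ == "__main__":
    print(last_32bit_timestamp())
    # One second later, a truncated timestamp lands in 1901.
    print(EPOCH + timedelta(seconds=wrap_int32(TIME_T_MAX_32 + 1)))
    # "20 years from now" already fails for dates after early 2018.
    print(fits_in_time_t32(datetime(2024, 9, 23, tzinfo=timezone.utc)
                           + timedelta(days=20 * 365)))
```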

Again, modern glibc versions provide a switch. If you define _TIME_BITS=64 (plus LFS flags, as a prerequisite), your program is going to get a 64-bit time_t. Modern versions of musl libc also default to the 64-bit type (since 1.2.0). Unfortunately, switching to the 64-bit type brings the same risks as switching to LFS globally — or perhaps even worse because time_t seems to be more common in library API than file size-related types were.

These solutions only work for software that is built from source, and uses time_t correctly. Converting timestamps to int will cause overflow bugs. File formats with 32-bit timestamp fields are essentially broken. Most importantly, all proprietary software will remain broken and in need of serious workarounds.

Here are some samples demonstrating the problems. Please note that the first sample assumes the system clock is set beyond 2038.

$ cat > time-test.c <<EOF
#include <stdio.h>
#include <time.h>

int main() {
    time_t t = time(NULL);

    if (t != -1) {
        struct tm *dt = gmtime(&t);
        char out[32];

        strftime(out, sizeof(out), "%F %T", dt);
        printf("%s\n", out);
    } else
        perror("time() failed");

    return 0;
}
EOF
$ cc -m64 time-test.c -o time-test && ./time-test
2060-03-04 11:13:02
$ cc -m32 time-test.c -o time-test && ./time-test
time() failed: Value too large for defined data type
$ cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 \
    time-test.c -o time-test && ./time-test
2060-03-04 11:13:32
$ cat > mtime-test.c <<EOF
#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main() {
    struct stat st;
    int fd;

    if (stat("time-data", &st) == 0) {
        char buf[32];
        struct tm *tm = gmtime(&st.st_mtime);
        strftime(buf, sizeof(buf), "%F %T", tm);
        printf("mtime: %s\n", buf);
    } else
        perror("stat() failed");

    fd = open("time-data", O_RDONLY);
    if (fd == -1) {
        perror("open() failed");
        return 1;
    }
    close(fd);

    return 0;
}
EOF
$ touch -t '206001021112' time-data
$ cc -m64 mtime-test.c -o mtime-test && ./mtime-test
mtime: 2060-01-02 10:12:00
$ cc -m32 mtime-test.c -o mtime-test && ./mtime-test
stat() failed: Value too large for defined data type
$ cc -m32 -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 \
    mtime-test.c -o mtime-test && ./mtime-test
mtime: 2060-01-02 10:12:00

Are these problems specific to C?

It is probably worth noting that while portability issues are generally discussed in terms of C, not all of them are specific to C, or to programs directly interacting with C API. For example, Python running on a platform with a 32-bit time_t hits the same limit:

>>> import datetime
>>> datetime.datetime(2060, 1, 1)
datetime.datetime(2060, 1, 1, 0, 0)
>>> datetime.datetime(2060, 1, 1).timestamp()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: timestamp out of range for platform time_t

Other generic issues

Byte order (endianness)

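Again, byte order problems are not limited to C. For example, the struct module in Python uses explicit byte order, size and alignment modifiers. Text encodings are affected too: UTF-16 comes in little endian and big endian variants, and Python’s plain “UTF-16” codec writes a byte order mark followed by the data in the platform’s native byte order: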

>>> "foo".encode("UTF-16LE")
b'f\x00o\x00o\x00'
>>> "foo".encode("UTF-16BE")
b'\x00f\x00o\x00o'
>>> "foo".encode("UTF-16")
b'\xff\xfef\x00o\x00o\x00'
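
The struct modifiers mentioned above can be sketched in the same way. This is a minimal illustration; VALUE and read_u32_le are arbitrary names chosen for the example:

```python
import struct
import sys

VALUE = 0x12345678

# "<" forces little endian, ">" big endian, "=" the platform's own order.
little = struct.pack("<I", VALUE)
big = struct.pack(">I", VALUE)
native = struct.pack("=I", VALUE)


def read_u32_le(data: bytes) -> int:
    """Read a little endian 32-bit unsigned integer, whatever the host is."""
    return int.from_bytes(data[:4], "little")


if __name__ == "__main__":
    print(little.hex(), big.hex(), native.hex(), sys.byteorder)
    # Reading data with the wrong byte order silently yields garbage.
    print(hex(struct.unpack(">I", little)[0]))
```

Code that spells out the byte order (as with "<" or ">" here) behaves identically on every platform, while code relying on the native order is the kind that passes on amd64 and fails on big endian PPC64 or SPARC.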

char signedness

This is probably one of the most confusing portability problems you may see. Roughly, the problem is that the C standard does not specify the signedness of the char type (unlike int). Some platforms define it as signed, others as unsigned. In fact, the standard goes a step further and defines char as a distinct type from both signed char and unsigned char, rather than an alias to either of them.

For example, the System V ABI for x86 and SPARC specifies that char is signed, whereas for MIPS and PowerPC it is unsigned. Assuming either and doing arithmetic on top of that could lead to surprising results on the other set of platforms. In fact, one of the most confusing cases I’ve seen was with code that was used only for big endian platforms, and therefore worked on PowerPC but not on SPARC (even though it would also fail on x86, if it was used there).

Here is an example inspired by it. The underlying idea is to read a little endian 32-bit unsigned integer from a char array:

$ cat > char-sign.c <<EOF
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main() {
        char buf[] = {0x00, 0x40, 0x80, 0xa0};
        char *p = buf;
        uint32_t val = 0;

        val |= (*p++);
        val |= (*p++) << 8;
        val |= (*p++) << 16;
        val |= (*p++) << 24;

        printf("%08" PRIx32 "\n", val);
}
EOF
$ cc -funsigned-char char-sign.c -o char-sign
$ ./char-sign
a0804000
$ cc -fsigned-char char-sign.c -o char-sign
$ ./char-sign
ff804000

Please note that for the sake of demonstration, the example uses -fsigned-char and -funsigned-char switches to override the default platform signedness. In real code, you’d explicitly use unsigned char instead.
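
To see where the extra ff comes from, the computation can be replayed step by step. The following Python sketch emulates the C arithmetic under the assumption of 32-bit two's complement integers; read_le32 is a hypothetical helper, not a translation of any real package:

```python
def read_le32(data: bytes, signed_char: bool) -> int:
    """Emulate assembling a little endian uint32_t from a char array."""
    val = 0
    for index, byte in enumerate(data[:4]):
        # With signed char, bytes of 0x80 and above become negative
        # once they are promoted to int.
        promoted = byte - 256 if signed_char and byte >= 0x80 else byte
        # A negative value carries sign bits into all the higher bytes;
        # the mask emulates the conversion to uint32_t.
        val |= (promoted << (8 * index)) & 0xFFFFFFFF
    return val


if __name__ == "__main__":
    buf = bytes([0x00, 0x40, 0x80, 0xA0])
    print(f"{read_le32(buf, signed_char=False):08x}")
    print(f"{read_le32(buf, signed_char=True):08x}")
```

The third byte (0x80) is the culprit: as a signed char it is -128, which after shifting by 16 bits becomes 0xff800000 and overwrites the top byte.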

Strict alignment

Perhaps the simplest way to explain alignment is to show how the compiler achieves it in structures. Please consider the following type:

struct {
    int16_t a;
    int32_t b;
    int16_t c;
}

As you can see, it contains two 2-byte types and one 4-byte type, so that would be a total of 8 bytes, right? Nothing could be more wrong, at least on platforms requiring 32-bit alignment for int32_t. To guarantee that b is correctly aligned whenever the whole structure is correctly aligned, the compiler needs to move it to an offset that is a multiple of 4. Furthermore, to guarantee that every instance is correctly aligned when the structure is used in an array, it also needs to increase the structure’s size to a multiple of 4.

Effectively, the resulting structure resembles the following:

struct {
    int16_t a;
    int16_t _pad1;
    int32_t b;
    int16_t c;
    int16_t _pad2;
}
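
The layout can be checked from Python as well, since ctypes follows the platform's C ABI. This is a sketch that assumes a platform where int32_t requires 4-byte alignment; the class name Example is arbitrary:

```python
import ctypes
import struct


class Example(ctypes.Structure):
    """Mirror of the structure above, laid out by the platform's C rules."""
    _fields_ = [
        ("a", ctypes.c_int16),
        ("b", ctypes.c_int32),
        ("c", ctypes.c_int16),
    ]


if __name__ == "__main__":
    for name in ("a", "b", "c"):
        print(name, "at offset", getattr(Example, name).offset)
    print("aligned size:", ctypes.sizeof(Example))
    # "=" in the struct module disables alignment, giving the naive sum.
    print("sum of member sizes:", struct.calcsize("=hih"))
```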

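So you get a padding of 2 + 2 bytes, b at offset 4, and a total size of 12 bytes. In fact, you can find some libraries actually defining structures with explicit padding.

The alignment requirement also applies to memory accessed through pointers. Consider the following program: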

$ cat > align-test.c <<EOF
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main() {
	uint8_t buf[6] = {0, 0, 0, 4, 0, 0};
	int32_t *number = (int32_t *) &buf[2];
	printf("%" PRIi32 "\n", *number);
	return 0;
}
EOF
$ cc align-test.c -o align-test
$ ./align-test
1024

The code is meant to resemble a cheap way of reading data from a file, and then getting a 32-bit integer at offset 2. However, on SPARC this code will not work as expected:

$ ./align-test
Bus error (core dumped)
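
A portable alternative is to copy the bytes out instead of dereferencing a misaligned pointer (in C, typically via memcpy() into a properly aligned variable). The Python sketch below shows the same idea with struct.unpack_from(), which has no alignment requirement when an explicit byte order is given; read_i32 is a hypothetical helper:

```python
import struct

BUF = bytes([0, 0, 0, 4, 0, 0])


def read_i32(data: bytes, offset: int, byte_order: str = "<") -> int:
    """Read a 32-bit signed integer from any offset, aligned or not."""
    # With "<" or ">" the struct module uses standard sizes and performs
    # no alignment, so an odd or otherwise misaligned offset is fine.
    return struct.unpack_from(byte_order + "i", data, offset)[0]


if __name__ == "__main__":
    print(read_i32(BUF, 2))        # little endian interpretation
    print(read_i32(BUF, 2, ">"))   # big endian interpretation
```

Note that the 1024 printed by the C program is itself a little endian result, so spelling out the byte order fixes a second portability problem hiding in the same code.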

Stack size

The wondrous world of floating-point types

x87 math

The x86 platform supports two modes of floating-point arithmetic:

The legacy 387 floating-point arithmetic that utilizes 80-bit precision registers (-mfpmath=387).
The more modern SSE arithmetic that operates directly on 32-bit and 64-bit precision types (-mfpmath=sse); the 80-bit long double type is still handled by x87 instructions.

Consider the following example:

$ cat > float-demo.c <<EOF
#include <stdio.h>

__attribute__((noipa))
double fms(double a, double b, double c) {
	return a * b - c;
}

int main() {
	printf("%+.40f\n", fms(1./3, 1./3, 1./9));
	return 0;
}
EOF
$ cc -mfpmath=sse float-demo.c -o float-demo
$ ./float-demo
+0.0000000000000000000000000000000000000000
$ cc -mfpmath=387 float-demo.c -o float-demo
$ ./float-demo
-0.0000000000000000061663998560113064684174

What’s happening here? The program is computing 1/3 * 1/3 - 1/9, which we know should be zero. Except that it isn’t when using x87 FPU instructions. Why?

Normally, this computation is done in two steps. First, the multiplication 1/3 * 1/3 is done. Afterwards, 1/9 is subtracted from the result. In SSE mode, both steps are done directly on the double type. However, in x87 mode the doubles are converted to 80-bit floats first, both computations are done on these and then the result is converted back to double. We can see that looking at the respective assembly fragments:

$ cc -mfpmath=sse float-demo.c -S -o -
[…]
	movsd	-8(%rbp), %xmm0
	mulsd	-16(%rbp), %xmm0
	subsd	-24(%rbp), %xmm0
[…]
$ cc -mfpmath=387 float-demo.c -S -o -
[…]
	fldl	-8(%rbp)
	fmull	-16(%rbp)
	fsubl	-24(%rbp)
	fstpl	-32(%rbp)
[…]

Now, neither ⅓ nor ⅑ can be precisely expressed in binary system. So 1./3 is actually ⅓ + some error, and 1./9 is ⅑ + another error. It happens that 1./3 * 1./3 after rounding is giving the same value as 1./9 — so subtracting one from the other yields zero. However, when computations are done using an intermediate type of higher precision, the squared error from 1./3 * 1./3 is rounded at a higher precision — and therefore different from the one in 1./9. So counter-intuitively, higher precision here amplifies a rounding error and yields the “incorrect” result!
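
The effect can be reproduced without an x87 FPU by doing the rounding by hand. The following Python sketch uses exact rational arithmetic and rounds after each step to a chosen significand width (53 bits for double, 64 bits for the x87 extended type). It is a simulation under the assumption of round-to-nearest-even, and both helper names are made up:

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to the nearest value with a `bits`-bit significand."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Initial guess for the exponent, corrected by the loops below.
    exp = x.numerator.bit_length() - x.denominator.bit_length() - bits
    scaled = x / Fraction(2) ** exp
    while scaled >= 2 ** bits:
        exp += 1
        scaled /= 2
    while scaled < 2 ** (bits - 1):
        exp -= 1
        scaled *= 2
    # round() on a Fraction rounds half to even, like the FPU default.
    return sign * round(scaled) * Fraction(2) ** exp


def fms(a: float, b: float, c: float, bits: int) -> float:
    """Compute a * b - c, rounding each step to `bits` bits, then to double."""
    product = round_to_bits(Fraction(a) * Fraction(b), bits)
    difference = round_to_bits(product - Fraction(c), bits)
    return float(difference)


if __name__ == "__main__":
    print(f"{fms(1 / 3, 1 / 3, 1 / 9, 53):+.40f}")  # double steps, as with SSE
    print(f"{fms(1 / 3, 1 / 3, 1 / 9, 64):+.40f}")  # 80-bit steps, as with x87
```

With 53 bits the product rounds to exactly the same double as 1./9, so the difference is zero; with 64 bits the product keeps eleven more bits and no longer matches.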

Gentoo has already switched amd64 multilib profiles to force -mfpmath=sse for 32-bit builds, and it is planning to switch the x86 profiles as well. While this doesn’t solve the underlying issue, it yields more consistent results across different architectures and therefore reduces the risk of our users hitting these bugs. However, this has a surprising downside: some packages actually adapted to expect different results on 32-bit x86, and now fail when SSE arithmetic is used there.

It doesn’t take two architectures to make a rounding problem

$ cc -mno-fma -O2 float-demo.c -o float-demo
$ ./float-demo
+0.0000000000000000000000000000000000000000
$ cc -mfma -O2 float-demo.c -o float-demo
$ ./float-demo
-0.0000000000000000061679056923619804377437

The first invocation is roughly the same as before. The second one enables use of the FMA instruction set that performs the multiplication and subtraction in one step:

$ cc -mfma -O2 float-demo.c -S -o -
[…]
	vfmsub132sd	%xmm1, %xmm2, %xmm0
[…]

Again, this means that the intermediate value is not rounded to double, and therefore doesn’t carry the same error as 1./9.
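
The difference between one rounding and two can be sketched with exact rational arithmetic. This is an illustration of the principle rather than of the actual instruction, and the function names are hypothetical:

```python
from fractions import Fraction


def fused_multiply_subtract(a: float, b: float, c: float) -> float:
    """Compute a * b - c exactly and round only once, as an FMA unit does."""
    return float(Fraction(a) * Fraction(b) - Fraction(c))


def separate_multiply_subtract(a: float, b: float, c: float) -> float:
    """Compute a * b - c with the product rounded to double first."""
    product = float(Fraction(a) * Fraction(b))
    return float(Fraction(product) - Fraction(c))


if __name__ == "__main__":
    print(f"{separate_multiply_subtract(1 / 3, 1 / 3, 1 / 9):+.40f}")
    print(f"{fused_multiply_subtract(1 / 3, 1 / 3, 1 / 9):+.40f}")
```

Converting a Fraction to float rounds correctly, so the first function models the two-step computation and the second one models the single rounding at the end.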

The long double type

As you can probably guess by now, the C standard doesn’t define precisely what float, double and long double types are. Fortunately, it seems that the first two types are uniformly implemented as, respectively, a single-precision (32-bit) and a double-precision (64-bit) IEEE 754 floating point number. However, as far as the third type is concerned, we might find it to be any of:

the same type as double — on architectures such as 32-bit ARM,
the 80-bit x87 extended precision type — on amd64 and x86,
a type implementing double-double arithmetic — i.e. representing the number as a sum of two double values, giving roughly 106-bit precision, e.g. on PowerPC,
the quadruple precision (128-bit) IEEE 754 type — e.g. on SPARC.

#include <stdio.h>

int main() {
	printf("%0.40Lf\n", 1.L/3);
	return 0;
}

Running it across different architectures, we’re going to see:

arm64: 0.3333333333333333333333333333333333172839
ppc64: 0.3333333333333333333333333333333292246828
amd64: 0.3333333333333333333423683514373792036167
arm32: 0.3333333333333333148296162562473909929395
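
Three of these results follow directly from the significand width of the respective format, which can be checked with exact arithmetic. The sketch below rounds ⅓ to 53, 64 and 113 bits; the helper names are made up, and the double-double type is deliberately left out because it does not behave like a single significand of fixed width:

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round a positive x to the nearest value with a `bits`-bit significand."""
    exp = x.numerator.bit_length() - x.denominator.bit_length() - bits
    scaled = x / Fraction(2) ** exp
    while scaled >= 2 ** bits:
        exp += 1
        scaled /= 2
    while scaled < 2 ** (bits - 1):
        exp -= 1
        scaled *= 2
    # round() on a Fraction rounds half to even.
    return round(scaled) * Fraction(2) ** exp


def format_fixed(x: Fraction, places: int) -> str:
    """Format a fraction between 0 and 1 with a fixed number of decimals."""
    return "0." + str(round(x * 10 ** places)).zfill(places)


def one_third(bits: int) -> str:
    """Decimal expansion of 1/3 as stored with the given significand width."""
    return format_fixed(round_to_bits(Fraction(1, 3), bits), 40)


if __name__ == "__main__":
    print("113 bits (IEEE quad):    ", one_third(113))
    print(" 64 bits (x87 extended): ", one_third(64))
    print(" 53 bits (IEEE double):  ", one_third(53))
```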

September 11 2024

Much improved MIPS and Alpha support in Gentoo Linux

GentooNews (https://www.gentoo.org/feeds/news.xml) • September 11, 2024, 5:00

Over the last years, MIPS and Alpha support in Gentoo has been slowing down, mostly due to a lack of volunteers keeping these architectures alive. Not anymore however! We’re happy to announce that thanks to renewed volunteer interest both arches have returned to the forefront of Gentoo Linux development, with a consistent dependency tree checked and enforced by our continuous integration system. Up-to-date stage builds and the accompanying binary packages are available for both, in the case of MIPS for all three ABI variants o32, n32, and n64 and for both big and little endian, and in the case of Alpha also with a bootable installation CD.

August 31 2024

KDE Plasma 6 upgrade for stable Gentoo Linux

GentooNews (https://www.gentoo.org/feeds/news.xml) • August 31, 2024, 5:00

Exciting news for stable Gentoo users: It’s time for the upgrade to the new “megaversion” of the KDE community desktop environment, KDE Plasma 6! Together with KDE Gear 24.05.2, where now most of the applications have been ported, and KDE Frameworks 6.5.0, the underlying library architecture, KDE Plasma 6.1.4 will be stabilized over the next days. The base libraries of Qt 6 are already available.

More technical information on the upgrade, which should be fairly seamless, as well as architecture-specific notes can be found in a repository news item. Enjoy!

August 24 2024

“your actual contribution to gentoo project is now pure shit!”

Mike Pagano (mpagano) • August 24, 2024, 15:46

Ah, the life of a package maintainer. As far as controversial figures go, we probably rank somewhere under florist and nowhere near politician. Update software, back-port patches, submit patches upstream, stay on top of critical bugs and all of this in a Linux Distribution that has seen a decline in popularity. How much hate could I possibly stir up?

Apparently, for one person, quite a lot. Living a pretty reserved life, I have never before experienced a real or implied threat. Note that I do drive on American roads, so I know people have expressed displeasure with my driving at points in the past, but nothing beyond normal, and nothing that I can recall short of a middle finger or two.

The exchange below is with an individual who apparently has a concerning sense of entitlement about the kind of work guarantees he receives from no-cost software maintained by a volunteer who has never received, and still does not receive, remuneration of any kind.

Stay safe everyone.

Note: The only editing I did was to fix the flow or add a comment to make it easier to read since this person likes to top post.

On Friday, July 26, 2024 at 12:43:26 PM GMT+2, Max Dubois
makemehappy@rocketmail.com wrote:

Hello,

According with this bug in bugzilla:

219061 – Memory leaks on vmalloc crash every 32 bit kernel after a
commit in 6.6.24 branch
bugzilla.kernel.org/show_bug.cgi?id=219061

Evey kernel.org (pure X86 platform) is serious bugged after 6.6.23,
also Gentoo (my preferred distro) has the bug so you should,
eventually after try the bug yourself, mark 6.6.23 in Green, becouse
all the others listed in the gentoo kernel-source page got the bug
(and obviously also the kernel-bin packages too).

The bug is a memory leak that produce vmalloc errors on machines
using highmem (>1024 MB) and this like explained in the bugzilla
will crash very fast a running machine destroying bowser tabs,
preventing for opening apps, terminals and so on).

To reproduce the bug is very easy:

build, if you don’t always have it, an x86 virtual machine and
configure it with 4 GB of ram, Virtualbox or VMware is the same,

Boot it with any kernel (gentoo or kernel.org or every kernel) over
6.6.23 (the last working). The machine will boot fine and it seems
to work as expected. Open a terminal and run a logging program (I
like metalog) and then start to use it to run apps, open a firefox
browser, some other terminals. Open some tabs on browser and look at
the logs. In minutes you will get messages like this and others in
the log, probably some kernel oops too:

Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 24576
failed: use vmalloc= to increase size

Jul 24 17:04:37 debian1232vm kernel: vmap allocation for size 20480
failed: use vmalloc= to increase size

Jul 24 17:04:42 debian1232vm kernel: alloc_vmap_area: 104 callbacks
suppressed

Jul 24 17:04:42 debian1232vm kernel: vmap allocation for size 20480
failed: use vmalloc= to increase size

The running kernel s a brand new 32 bit 6.10.1 downloaded from
kernel.org and compiled.

Increasing the vmallOn Friday, July 26, 2024 at 12:43:26 PM GMT+2, Max Dubois
makemehappy@rocketmail.com wrote:
forums.gentoo.org/viewtopic-t-1169951.html
forums.gentoo.org/viewtopic-t-1169951.html

X86 is no more widely used then if we say we maintain compatibility
with x86 systems this bug has to be fixed and the Gentoo kernel x86
shouldn’t have anything called stable after 6.6.23.

Regards from sunny italy to a fellow “paisano” like you probably are
♦

Max

PS: grateful for any feedback for my e-mail to you and for any
eventually fix.

Note from Mike: So this was right before a business trip, and I’m not a great flyer, so I was mentally focused on getting through that trip. Got to fly home in a Tropical Storm. Yay.

Second email from my new fan:

Il giorno 23 ago 2024, alle ore 15:27, Mike Pagano
mpagano@gentoo.org ha scritto:

On 8/22/24 18:41, Max Dubois wrote:

Hello mr. Pagano,

After quietly one month of no reply and no action, I can see you
ignored my previous mail.

That is not a good service for the Gentoo community, you still serve
to Gentoo 32 bit kernel like all after 6.6.23 (gentoo or not gentoo)
without even a comment. And make it even greeen!

There is a bug and that bug not allow any use of the 32 bit system
when > 1 GB memory is installed in the system and this bug is
recognised by the kernel.org developers (if not they probably had
closed the ticket in the kernel.org bug section, don’t you think
so?).

Now, I’m aware 32 bit systems are not used anymore, then it is not
serious from Gentoo and in this case from a good paisano like you
are, do nothing to inform people related to the gentoo-sources
branch you maintain.

This is not a good service to the community and I’m sad this is done
by an italian like you! ♦

As a true italian and good friend of many paisani like you, I think
you should act in some way to inform Gentoo user base about this
problem. Obviously do your tests before, you should if you still
didn’t them silently.

I know, you are a busy man, then it is simply not serious to act
like this.

BTW, I’ve been also waiting for your reply to my previous e-mail
then you ignored my concise and very precise mail to you. Many guys
in the kernel.org kernel list were interested and contacted me, but
you, mr. “not interested” Pagago.

Mr. Pagano, people in wonderful Campania, the region where your
blood come from, don’t act like this! You are probably from small
Frignano, I visited it and it is a wonderful little village, and
Caserta and his reggia is so fantastic (I hope your visited it,
lotsa real americans visit it, so someone from USA – with roots
there – should come to visit!) and I also had a girlfriend years ago
from there (southern italian girls are the best and the prettiest
all around).

Back in subject, please do something for this problem, don’t fool
gentoo users and gentoo tree.

Ciao paisano!

PS: I want escalate the problem if you don’t want to take any action
and act silently. Gentoo users don’t deserve a maintainer not
pointing out if not solving problems in the package they maintain.

My Reply

On 8/23/24 15:27, Mike Pagano wrote:
We do not hold up stabilization for bugs that impact such a small niche of users.
If we accommodated all of these kinds of requests, no kernel would ever be stable.
Good luck with your issue. In the future, keep your emails to me technical and exclude
references to my nationality, real or imagined.

Mike Pagano

Note from Mike: This has been true for the nearly 17 years I have been maintaining the Kernel in Gentoo. Sometimes people have hardware failures, sneak in a proprietary driver, who knows. But unless it impacts a large subset of people, we don’t hold up stabilization. Plus, this particular stabilization was for a root exploit.

On Fri, Aug 23, 2024 at 06:59:45PM +0200, Max Dubois wrote:

You like it technical fellow Michele? Here it is!

I forgot this and this is valid for kernel.org guys too.

I wrote it in the bug notes too when someone asked me to fix this!

First of all I’m not a developer, you can call me an advanced user,
second for someone that always have a developer machine with a local
copy of github kernel.org is a lot simple to bisecting the kernel
compared to me I don’t need such a blob on my machines!!!

Thankx to me, you guys all knows the bug happen between 6.6.23
(working) and 6.6.24 (not working).

You guys patch the kernel all the time so it isn’t complicated at all
bisecting the kernel to find the culprit modification bug that
introduced the problem.

You, dear Michele, maintain gentoo sources, you should have all the
tools around to do that and serve the community!

Inviato da iPhone

Another reply….
Il giorno 23 ago 2024, alle ore 18:39, Max Dubois
makemehappy@rocketmail.com ha scritto:
You should proud to be italian, mr. Pagano!
And I bet you also speak some broccolino and you should proud of that
too…

New York, New Jersey that is the broccolino nation and we, from the
real thing, we love you all… ♦ and i’m sorry you guys, your
ancestors, been forcing to left such a fantastic place like Italy for
such a shitty place, horrible weather, no history, poor quality life
New York, New Jersey allways offered, not talking about how this places
are adter covid panthomine♦

Tou should move in California if you can ♦

And yes this bug just impact a small percentual of users then it is
just becouse just few people are 32 bit now! This doesn’t mean that not
all the 32 bit aren’t buggy for ALL the 32 bit users and it still seems
incredible to me you ignore that and act if the problem is not there.

It is not a not working driver, it is the WHOLE system, all the 32 bir
linux systems, real or virtual, crashing after boot, in minutes!!!

Ciao fratello Michele, stammi bene!

MD (from a beach in the Pontine Islands named Ponza)

[1]945 Isola Di Ponza Stock Photos, High-Res Pictures, and Images
[2]gettyimages.it
[3]

My last Reply

On 8/23/24 19:26, Mike Pagano wrote:

Do not contact me any further

Date: Fri, 23 Aug 2024 20:57:04 +0200
From: Max Dubois makemehappy@rocketmail.com

Lol

You are sooo conceited! I saw a picture of you and you look exactly like some good men fron the area of naples!!! You could be a great pizzaiolo or a great mafia man, choose you if you prefer to be around the shitty jersey or the fantastic costiera amalfitana ♦ (pizza in your area even if they call it ITALIANA is pure shit) you look perfect for a new soprano serie and believe me all this is a big compliment to you!!!

Ciao michelino, alla prossima!

PS: your actual contribution to gentoo project is now pure shit! You shouldn’t mark green buggy kernel (everything over 6.6.23), you are completeky not honest with the community and even with yourself. And a broccolino like you shouls behave better also professionaly! You should let your gentoo-soirces commitment becouse you fail.

You should proud to be italian, mr. Pagano!
And I bet you also speak some broccolino and you should proud of that
too…

New York, New Jersey that is the broccolino nation and we, from the
real thing, we love you all… and i’m sorry you guys, your
ancestors, been forcing to left such a fantastic place like Italy for
such a shitty place, horrible weather, no history, poor quality life
New York, New Jersey allways offered, not talking about how this places
are adter covid panthomine

Tou should move in California if you can

And yes this bug just impact a small percentual of users then it is
just becouse just few people are 32 bit now! This doesn’t mean that not
all the 32 bit aren’t buggy for ALL the 32 bit users and it still seems
incredible to me you ignore that and act if the problem is not there.

It is not a not working driver, it is the WHOLE system, all the 32 bir
linux systems, real or virtual, crashing after boot, in minutes!!!

Ciao fratello Michele, stammi bene!

MD (from a beach in the Pontine Islands named Ponza)

My last Reply

On 8/23/24 19:26, Mike Pagano wrote:

Do not contact me any further

Date: Fri, 23 Aug 2024 20:57:04 +0200
From: Max Dubois makemehappy@rocketmail.com

Lol

You are sooo conceited! I saw a picture of you and you look exactly like some good men fron the area of naples!!! You could be a great pizzaiolo or a great mafia man, choose you if you prefer to be around the shitty jersey or the fantastic costiera amalfitana (pizza in your area even if they call it ITALIANA is pure shit) you look perfect for a new soprano serie and believe me all this is a big compliment to you!!!

Ciao michelino, alla prossima!

August 20 2024

Gentoo: profiles and keywords rather than releases

Michał Górny (mgorny) • August 20, 2024, 18:44

Different distributions have different approaches to releases. For example, Debian simultaneously maintains multiple releases (branches). The “stable” branch is recommended for production use, “testing” for more recent software versions. Every two years or so, the branches “shift” (i.e. the previous “testing” becomes the new “stable”, and so on) and users are asked to upgrade to the next release.

Fedora releases aren’t really branched like Debian. Instead, they make a new release (with potentially major changes for an upgrade) every half a year, and maintain old releases for 13 months. You generally start with the newest release, and periodically upgrade.

Arch Linux follows a rolling release model instead. There is just one branch that all Arch users use, and releases are made periodically only for the purpose of installation media. Major upgrades are done in-place (and I have to say, they don’t always go well).

Now, Gentoo is something of a hybrid, as it combines the best of both worlds. It is a rolling release distribution with a single shared repository that is available to all users. However, within this repository we use a keywording system to provide a choice between stable and testing packages, to facilitate both production and development systems (with some extra flexibility), and versioned profiles to tackle major lock-step upgrades.

Architectures

Before we enter any details, we need to clarify what an architecture (though I suppose platform might be a better term) is in Gentoo. In Gentoo, architectures provide a coarse (and rather arbitrary) way of classifying different supported processor families.

For example, the amd64 architecture indicates 64-bit x86 processors (also called x86-64) running 64-bit userland, while x86 indicates 32-bit userland for x86 processors (both 32-bit and 64-bit in capability). Similarly, 64-bit AArch64 (ARMv8) userland is covered by arm64, while the 32-bit userland on all ARM architecture versions is covered by arm. This is best seen in the ARM stage downloads, where a single architecture is split into subarchitectures.

For some architectures, the split is even coarser. For example, mips and riscv (at least for the moment) cover both 32-bit and 64-bit variations of the architecture. ppc64 covers both big-endian and little-endian (PPC64LE) variations — and the default big-endian variation tends to cause more issues with software.

Why does the split matter? Primarily because architectures define keywords, and keywords indicate whether the package works. A coarser split means that a single keyword may be used to cover a wide variety of platforms, not all of which work equally well. But more on that further on.

By the way, I’ve mentioned “platforms” earlier. Why? Because besides the usual architectures, we are using names such as amd64-linux and x64-macos for Prefix — i.e. running Gentoo inside another operating system (or Linux distribution). Historically, we also had a Gentoo/FreeBSD variation.

Profiles

The simplest way of thinking of profiles would be as different Gentoo configurations. Gentoo provides a number of profiles for every supported architecture. Profiles serve multiple purposes.

The most obvious purpose is providing suitable defaults for different, well, profiles of Gentoo usage. So we have base profiles that are better suited for headless systems, and desktop profiles that are optimized for desktop use. Within desktop profiles, we have subprofiles for GNOME and Plasma desktops. We have base profiles for OpenRC, and subprofiles for systemd; base profiles for the GNU toolchain and subprofiles for the LLVM toolchain. Of course, these merely control defaults: you aren’t actually required to use a specific subprofile to use the relevant software; you can adjust your configuration directly instead. However, using a well-matched profile makes things easier, and increases the chances of finding Gentoo binary packages that match your setup.

But there’s more to profiles than that. Profiles also control non-trivial system configuration aspects that cannot be easily changed. We have separate profiles for systems that have undergone the “/usr merge”, and for systems that haven’t — and you can’t switch between the two without actually migrating your system first. On some architectures we have profiles with and without multilib; this is e.g. necessary to run 32-bit executables on amd64. On ARM, separate profiles are provided for different architecture versions. The implication of all that is that profiles also control which packages can actually be installed on a system. You can’t install 32-bit software on an amd64 non-multilib system, or packages requiring newer ARM instructions on a system using a profile for older processors.

Finally, profiles are versioned to carry out major changes in Gentoo. This is akin to how Debian or Fedora do releases. When we introduce major changes that require some kind of migration, we do that via a new profile version. Users are provided with upgrade instructions, and are asked to migrate their systems. And we do support both old and new profiles for some time. To list two examples:

17.1 amd64 profiles changed the multilib layout from using lib64 + lib32 (+ a compatibility lib symlink) to lib64 + lib.
23.0 profiles featured hardening- and optimization-related toolchain changes.

Every available profile has one of three stability levels: stable, dev or exp. As you can guess, “stable” profiles are the ones that are currently considered safe to use on production systems. “Dev” profiles should be good too, but they’re not as well tested yet. Then, “exp” profiles come with no guarantees, not even of dependency graph integrity (to be explained further on).

Keywords

While profiles can control which packages can be installed to some degree, keywords are at the core of that. Keywords are specified for every package version separately (inside the ebuild), and are specified (or not) for every architecture.

A keyword can effectively have one of four states:

stable (e.g. amd64), indicating that the package should be good to be used on production;
testing (often called ~arch, e.g. ~amd64), indicating that the package should work, but we don’t give strong guarantees;
unkeyworded (i.e. no keyword for the given architecture is present), usually indicating that the package has not been tested yet;
disabled (e.g. -amd64), indicating that the package can’t work on the given architecture. This is rarely used, usually for prebuilt software.
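
As a rough illustration of the four states above, here is a minimal Python sketch that classifies one architecture against an ebuild-style KEYWORDS string. The function name `keyword_state` and the sample string are made up for this example, and wildcard tokens such as `-*` are deliberately ignored, so this is a simplified model rather than what a package manager actually does.

```python
def keyword_state(keywords: str, arch: str) -> str:
    """Classify `arch` against a space-separated KEYWORDS string.

    Simplified model: wildcard tokens (e.g. "-*") are not interpreted.
    """
    tokens = keywords.split()
    if arch in tokens:
        return "stable"
    if f"~{arch}" in tokens:
        return "testing"
    if f"-{arch}" in tokens:
        return "disabled"
    return "unkeyworded"


# Hypothetical KEYWORDS value, for illustration only.
sample = "amd64 ~arm64 -x86"
for arch in ("amd64", "arm64", "x86", "riscv"):
    print(arch, "->", keyword_state(sample, arch))
```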

Now, the key point is that users have control over which keywords their package manager accepts. If you’re running a production system, you may want to set it to accept stable keywords only, in which case only stable packages will normally be allowed to be installed, and your packages will only be upgraded once the next version is marked stable. Or you may set your system to accept both stable and testing keywords, and help us test them.

Of course, this is not just a binary global switch. At the cost of increased risk and reduced chances of getting support, you can adjust allowed keywords for packages, and run a mix of stable and testing. Or you can install some packages that have no keywords at all, including live packages built straight from a VCS repository. Or you can even set your system to follow keywords for another architecture. The sky is the limit!

Note that not all Gentoo architectures currently use stable keywords. There are so-called “pure ~arch arches” that use testing keywords only. Examples of such architectures are alpha, loong and riscv.

Bad terminology: stable and stable

It’s time for a short intermezzo: as you may have noticed, we have used the term “stable” twice already: one time for profiles, and the other time for the keywords. Combined with the fact that not all architectures actually use stable keywords, this can get really confusing. Unfortunately, it’s a historical legacy that we have to live with.

So to clarify. A stable profile is a profile that should be good to use on production systems. A stable package (i.e. a package [version] with stable keywords) is a package version that should be good to use on production systems.

However, the two aren’t necessarily linked. You can use a dev or even exp profile, but only accept stable keywords, and the other way around. Furthermore, architectures that don’t use stable keywords at all do have stable profiles.

Visibility and dependency graph integrity

Equipped with all that information, now we can introduce the concept of package visibility. Long story short, a package (version) is visible if it is installable on a given system. The primary reasons why a package couldn’t be installed are insufficient keywords, or an explicit mask. Let’s consider these cases in detail.

As I’ve mentioned earlier, a particular system can be configured to accept either stable, or both stable and testing keywords. Therefore, on a system set to accept stable keywords, only packages featuring stable keywords can be visible (the remaining packages are masked by “missing keyword”). On a system set to accept both stable and testing keywords, all packages featuring either stable or testing keywords can be visible.
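
The keyword side of visibility described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the helper name `keyword_visible` and its `accept_testing` switch are invented for the example, and per-package overrides, masks and wildcard tokens are left out.

```python
def keyword_visible(keywords: str, arch: str, accept_testing: bool) -> bool:
    """Return True if a package version passes the keyword check.

    keywords       -- space-separated KEYWORDS string of the package version
    arch           -- architecture of the system, e.g. "amd64"
    accept_testing -- False: stable keywords only; True: stable and testing
    """
    tokens = set(keywords.split())
    if arch in tokens:
        return True  # a stable keyword is accepted on both kinds of systems
    return accept_testing and f"~{arch}" in tokens


# Hypothetical package versions, for illustration only.
print(keyword_visible("amd64 ~arm64", "amd64", accept_testing=False))
print(keyword_visible("~amd64", "amd64", accept_testing=False))
print(keyword_visible("~amd64", "amd64", accept_testing=True))
```

On a stable-only system the second call reports the package as not visible, which corresponds to the “missing keyword” mask mentioned above.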

Additionally, packages can be explicitly masked either globally in the repository, or in profiles. These masks are used for a variety of reasons: when a particular package is incompatible with the configuration of a given profile (say, 32-bit packages on a non-multilib 64-bit profile), when it turns out to be broken or when we believe that it needs more testing before we let users install it (even on testing-keyword systems).

The considerations of package visibility here are limited to the package itself. However, in order for the package to be installable, all its dependencies need to be installable as well. For packages with stable keywords, this means that all their dependencies (including optional dependencies conditional on USE flags that can be enabled on a stable system) have a matching version with stable keywords as well. Conversely, for packages with testing keywords, this means that all dependencies need to have either stable or testing keywords. Furthermore, said dependency versions must not be masked on any profile on which the package in question is visible.
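
To make the dependency rule above more concrete, here is a minimal Python sketch of such a check for a single architecture. Everything in it is an assumption made for illustration: the package names, the `repo` and `deps` structures, and the function names are hypothetical, and the model ignores masks, USE-conditional dependencies and version constraints that a real check has to handle.

```python
def visible(keywords: str, arch: str, accept_testing: bool) -> bool:
    """Keyword-only visibility check (masks are not modelled)."""
    tokens = set(keywords.split())
    return arch in tokens or (accept_testing and f"~{arch}" in tokens)


def broken_dependencies(repo, deps, arch, accept_testing):
    """Return (package, version, dependency) triples that break integrity."""
    broken = []
    for (pkg, ver), dep_names in deps.items():
        if not visible(repo[pkg][ver], arch, accept_testing):
            continue  # this version is not installable at this level anyway
        for dep in dep_names:
            candidates = repo.get(dep, {}).values()
            if not any(visible(kw, arch, accept_testing) for kw in candidates):
                broken.append((pkg, ver, dep))
    return broken


# Hypothetical mini-repository: package -> version -> KEYWORDS
repo = {
    "app-misc/foo": {"1.0": "amd64", "1.1": "~amd64"},
    "dev-libs/bar": {"2.0": "~amd64"},
}
# Hypothetical dependencies of each package version (names only)
deps = {
    ("app-misc/foo", "1.0"): ["dev-libs/bar"],
    ("app-misc/foo", "1.1"): ["dev-libs/bar"],
}

print(broken_dependencies(repo, deps, "amd64", accept_testing=False))
print(broken_dependencies(repo, deps, "amd64", accept_testing=True))
```

In this made-up data the stable foo-1.0 depends on a package that only has a testing keyword, so the stable-only check flags it, while the stable-plus-testing check finds nothing to report.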

This is precisely what dependency graph integrity checks are all about. They are performed for all profiles that are either stable or dev (i.e. exp profiles are excluded, and don’t guarantee integrity), for all package versions with stable or testing keywords, and for each of these kinds of keywords separately. And when integrity is not maintained, we get automated reports about it, and the deployment pipeline is blocked, so ideally users don’t have to experience the problem firsthand.
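
The set of combinations that get checked can be pictured with a short Python sketch. It assumes a listing in the style of the repository’s profiles.desc file (architecture, profile path, stability level per line); the entries below are made up and do not reflect the real repository, and the function name is hypothetical.

```python
def checked_combinations(profiles_desc: str):
    """List (arch, profile, keyword kind) triples subject to integrity checks.

    Only stable and dev profiles are included; exp profiles are skipped.
    """
    combos = []
    for line in profiles_desc.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        arch, path, status = line.split()
        if status in ("stable", "dev"):
            for kind in ("stable", "testing"):
                combos.append((arch, path, kind))
    return combos


# Made-up entries in a profiles.desc-like layout, for illustration only.
sample = """
# arch  profile            status
amd64   example/profile-a  stable
arm64   example/profile-b  dev
mips    example/profile-c  exp
"""

for combo in checked_combinations(sample):
    print(combo)
```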

The life of a keyword

Now that we have all the fundamental ideas covered, we can start discussing how packages get their keywords in the first place.

The default state for a keyword is “unspecified”. For a package to gain a testing keyword, it needs to be tested on the architecture in question. This can either be done by a developer directly, or via a keywording request filed on Gentoo Bugzilla, which will be processed by an arch tester. Usually, only the newest version of the package is handled, but in special circumstances testing keywords can be added to older versions as well (e.g. when required to satisfy a dependency). Any dependencies that are lacking a matching keyword need to be tested as well.

And what happens if the package does not pass testing? Ideally, we file a bug upstream and get it fixed. But realistically, we can’t always manage that. Sometimes the bug remains open for quite some time, waiting for someone to take action or for a new release that might happen to start working. Sometimes we decide that keywording a particular package at the time is not worth the effort; if it is required as an optional dependency of something else, we instead mask the relevant USE flags in the profiles corresponding to the given architecture. In extreme cases, we may actually add a negative -arch keyword, to indicate that the package can’t work on the given architecture. However, this is really rare and we generally do it only as a hint if people spend their time trying to keyword it over and over again.

Once a package gains a testing keyword, it “sticks”. Whenever a new version is added, all the keywords from the previous version are copied into it, and stable keywords are lowered into testing keywords. This is done even though the developer only tested it on one of the architectures. Packages generally lose testing keywords only if we either have a justified suspicion that they have stopped working, or if they gained new dependencies that are lacking the keywords in question. Most of the time, we request readding the testing keywords (rekeywording) immediately afterwards.
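
The “sticky” behavior on a version bump can be sketched as a small transformation of the KEYWORDS string. This is only an illustration: the function name is invented, and carrying negative keywords over unchanged is an assumption of this sketch rather than something stated above.

```python
def keywords_for_new_version(previous: str) -> str:
    """Derive KEYWORDS for a new version from the previous version's.

    Stable keywords are lowered to testing; testing keywords are kept.
    Negative keywords are copied as-is (an assumption of this sketch).
    """
    result = []
    for token in previous.split():
        if token.startswith(("~", "-")):
            result.append(token)        # already testing, or disabled
        else:
            result.append(f"~{token}")  # stable -> testing
    return " ".join(result)


# Hypothetical KEYWORDS of the previous version.
print(keywords_for_new_version("amd64 ~arm64 -x86"))
```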

Now, stable requests follow a stricter routine. The maintainer must decide that a particular package version is ready to become stable first. A rule of thumb is that it’s been in testing for a month, and no major regressions have been reported. However, the exact details differ. For example, some projects make separate “stable branch” and “testing branch” releases, and we mark only the former stable. And when vulnerabilities are found in software, we tend to proceed with adding stable keywords to the fixed versions immediately.

Then, a stabilization request is filed, and the package is tested on every architecture before the respective stable keyword is added. Testing is generally done on a system that is set to accept only stable keywords, therefore it may provide a slightly different environment than the original testing done when the package was added. Note that there is an exception to that rule: if we believe that particular packages are unlikely to exhibit different behavior across different architectures, we do ALLARCHES stabilization and add all the requested stable keywords after testing on one system.

Unlike with testing keywords, stable keywords need to be added to every version separately. When a new package version is added, all stable keywords in it are replaced by the corresponding testing keywords.

This process pretty much explains the difference between the guarantees given by testing and stable keywords. The testing keywords indicate that some version of the package has been tested on the given architecture at some point, and that we have good reasons to believe that it still works. The stable keywords indicate that this particular version has been tested on a system running stable keywords, and therefore it is less likely to turn out broken. Unfortunately, whether it actually is free of bugs is largely dependent on the quality of test suites, dependencies and so on. So yeah, it’s a mess.

The cost of keywords

I suppose that from a user’s perspective it would be best if all packages that work on a given architecture had keywords for it; and ideally, all versions suitable for it would have stable keywords on all relevant architectures. However, every keyword comes with a cost. And that’s not only the cost of actual testing, but also a long-term maintenance cost.

For the most important architectures, Gentoo developers have access to one or more dedicated machines. These machines are used for various purposes: arch testing (i.e. processing keywording and stabilization requests, usually semi-automated), building stage archives, building binary packages, and last but not least: providing development environments that are needed to debug and fix bugs. For other architectures, we are entirely dependent on volunteers doing the testing (a few prominent volunteers worthy of the highest praise, I must add).

The cost incurred by testing keywords is comparatively small, but contrary to what you might think, it’s not a one-time cost. Once a package gains a testing keyword, we generally want to keep it going forward. This means that if it gains new dependencies, we’re going to have to retest it, along with its new dependencies. However, that’s the easy part.

The hard part is that stuff can actually break over time. The package itself can start exhibiting test failures, or stop working entirely. Its new dependencies may turn out to be broken on the architecture in question. In these cases, it’s not just the cost of testing, but also of actually reporting bugs, and possibly debugging and writing patches when upstream authors don’t have access to the relevant hardware (and/or don’t care). Sometimes you even learn that the author never intended to support a given architecture, and is unwilling to accept well-written patches.

And if it turns out that it really isn’t feasible to keep the keyword going forward anymore, sometimes removing it may also turn out to be a lot of effort — especially if multiple packages depending on this one have been keyworded as well.

Of course, the cost for stable keywords is much higher. After all, it’s no longer a case of one-time testing, but we actually have to test every single version that’s going stable. This is somewhat amortized by ALLARCHES packages that need to be tested on a single architecture only (and therefore usually are tested on one of the “fast” architectures), but still it’s a lot. On top of that, frequent testing is more likely to reveal problems, and therefore require immediate fixes. This is actually a good thing, but also a future cost to consider. And removing keywords from packages that used to be stable is likely to have greater impact than from those that never were.

Struggling architectures

All the costs considered, it shouldn’t come as a surprise that we sometimes find ourselves struggling with some of the less popular architectures. We may have limited access to hardware, the hardware itself may not be very performant, the hardware and the operating system may be susceptible to breakage. So if we keyword too much, then the arch teams can no longer keep up, the queue grows long, and requests aren’t handled in a timely manner. In the extreme case, we may lose the last machine for a given architecture and become stuck, unable to go forward. These are all things to consider.

For these reasons, we periodically discuss the state of architectures in Gentoo. If we determine that some of them are finding it hard to cope, we look for solutions. Of course, one possibility to weigh is getting more hardware, but that’s not always justified, or even possible. Sometimes we need to actually reduce the workload.

For architectures that use stable keywords, the obvious possibility is to reduce the number of packages using them — i.e. destabilize packages. Ordinarily, the best targets for this effort would be packages that are old, particularly problematic or unpopular, as they can reduce our effective maintenance cost while minimizing the potential discomfort to users. However, we might need to go deeper than that. In extreme cases, we can go as far as to reduce the stable package set to core system packages. At some point, this kind of reduction forces users to run a mixed stable-testing keyword system, but that at least permits them to limit risk of regressions in the most important packages.

If even that is insufficient, there are more options at our disposal. We can look into removing keywords entirely from packages, particularly packages that require further rekeywording work. We can decide to remove stable keywords from an architecture entirely. In the worst case, we can decide to mark all profiles exp, effectively abandoning dependency graph integrity (at this point, some dependencies may start missing keywords and packages may not be trivially installable), or we can decide to remove the support for a given architecture entirely.

Summary

Gentoo uses a combined profile and keyword system to facilitate user needs on top of a single ebuild repository. This is in contrast with many other distributions that use multiple repositories, make releases, and sometimes maintain multiple release branches simultaneously. In fact, some distributions actually split into multiple versions to facilitate different user profiles. Gentoo does all that in a single, coherent product with rolling releases and profile upgrade paths.

The system of keywords is aimed at providing good user experience while keeping the maintenance affordable. On most of the supported architectures, we provide stable keywords to help keep production systems on reasonably tested software. Before packages become stable, we offer them to more adventurous users via testing keywords. Gentoo also offers great flexibility: users can mix stable and testing keywords freely (though at the risk of hitting unexpected issues), or run experimental packages that aren’t ready to get testing keywords yet.

Unfortunately, there are limits to how much support for various architectures we can provide. We are largely reliant on either having appropriate machines available, or volunteers with the hardware to test stuff for us, not to mention developers having skills and energy to debug and fix architecture-specific problems. Sometimes this turns out to be insufficient to cope with all the work, and we need to give up on some of the architecture support.

Still, I think the system works pretty well here, and it is one of Gentoo’s strong suits. Sure, it occasionally needs a push here and there, or a policy change, but it’s been one of Gentoo’s foundations for years, and it doesn’t look as if it’s going to be replaced anytime soon.

Gentoo: profiles and keywords rather than releases

mgorny (mgorny ) • August 20, 2024, 18:44

Now, Gentoo is something of a hybrid, as it combines the best of both worlds. It is a rolling release distribution with a single shared repository that is available to all users. However, within this repository we use a keywording system to provide a choice between stable and testing packages, to facilitate both production and development systems (with some extra flexibility), and versioned profiles to tackle major lock-step upgrades.

Architectures

Before we enter any details, we need to clarify what an architecture (though I suppose platform might be a better term) is in Gentoo. In Gentoo, architectures provide a coarse (and rather arbitrary) way of classifying different supported processor families.

For example, the amd64 architecture is indicates 64-bit x86 processors (also called x86-64) running 64-bit userland, while x86 indicates 32-bit userland for x86 processors (both 32-bit and 64-bit in capability). Similarly, 64-bit AArch64 (ARMv8) userland is covered by arm64, while the 32-bit userland on all ARM architecture versions is covered by the arm. This is best seen in the ARM stage downloads — a single architecture is split into subarchitectures there.

For some architectures, the split is even coarser. For example, mips and riscv (at least for the moment) cover both 32-bit and 64-bit variations of the architecture. ppc64 covers both big-endian and little-endian (PPC64LE) variations — and the default big-endian variation tends to cause more issues with software.

By the way, I’ve mentioned “platforms” earlier. Why? Because besides the usual architectures, we are using names such as amd64-linux and x64-macos for Prefix — i.e. running Gentoo inside another operating system (or Linux distribution). Historically, we also had a Gentoo/FreeBSD variation.

Profiles

The simplest way of thinking of profiles would be as different Gentoo configurations. Gentoo provides a number of profiles for every supported architecture. Profiles serve multiple purposes.

17.1 amd64 profiles changed the multilib layout from using lib64 + lib32 (+ a compatibility lib symlink) to lib64 + lib.
23.0 profiles featured hardening- and optimization-related toolchain changes.

Every available profile has one of three stability levels: stable, dev or exp. As you can guess, “stable” profiles are the ones that are currently considered safe to use on production systems. “Dev” profiles should be good too, but they’re not as well tested yet. Then, “exp” profiles come with no guarantees, not even of dependency graph integrity (to be explained further on).

Keywords

While profiles can control which packages can be installed to some degree, keywords are at the core of that. Keywords are specified for every package version separately (inside the ebuild), and are specified (or not) for every architecture.

A keyword can effectively have one of four states:

stable (e.g. amd64), indicating that the package should be good to be used on production;
testing (often called ~arch, e.g. ~amd64), indicating that the package should work, but we don’t give strong guarantees;
unkeyworded (i.e. no keyword for given architecture is present), usually indicating that the package has not been tested yet;
disabled (e.g. -amd64), indicating that the package can’t work on given architecture. This is rarely used, usually for prebuilt software.

Note that not all Gentoo architectures use stable keywords at a time. There are so called “pure ~arch arches” that use testing keywords only. An examples of such architectures are alpha, loong and riscv.

Bad terminology: stable and stable

So to clarify. A stable profile is a profile that should be good to use on production systems. A stable package (i.e. a package [version] with stable keywords) is a package version that should be good to use on production systems.

However, the two aren’t necessarily linked. You can use a dev or even exp profile, but only accept stable keywords, and the other way around. Furthermore, architectures that don’t use stable keywords at all, do have stable profiles.

Visibility and dependency graph integrity

Equipped with all that information, now we can introduce the concept of package visibility. Long story short, a package (version) is visible if it is installable on a given system. The primary reasons why a package couldn’t be installed are insufficient keywords, or an explicit mask. Let’s consider these cases in detail.

This is precisely what dependency graph integrity checks are all about. They are performed for all profiles that are either stable or dev (i.e. exp profiles are excluded, and don’t guarantee integrity), for all package versions with stable or testing keywords — and for each of these kind of keywords separately. And when integrity is not maintained, we get automated reports about it, and deployment pipeline is blocked, so ideally users don’t have to experience the problem firsthand.

The life of a keyword

Now that we have all the fundamental ideas covered, we can start discussing how packages get their keywords in the first place.

The default state for a keyword is “unspecified”. For a package to gain a testing keyword, it needs to be tested on the architecture in question. This can either be done by a developer directly, or via a keywording request filed on Gentoo Bugzilla, that will be processed by an arch tester. Usually, only the newest version of the package is handled, but in special circumstances testing keywords can be added to older versions as well (e.g. when required to satisfy a dependency). Any dependencies that are lacking a matching keyword need to be tested as well.

And what does happen if the package does not pass testing? Ideally, we file a bug upstream and get it fixed. But realistically, we can’t always manage that. Sometimes the bug remains open for quite some time, waiting for someone to take action or for a new release that might happen to start working. Sometimes we decide that keywording a particular package at the time is not worth the effort — and if it is required as an optional dependency of something else, we instead mask the relevant USE flags in the profiles corresponding to the given architecture. In extreme cases, we may actually add a negative -arch flag, to indicate that the package can’t work on given architecture. However, this is really rare and we generally do it only as a hint if people spend their time trying to keyword it over and over again.

Then, a stabilization request is filed, and the package is tested on every architecture before the respective stable keyword is added. Testing is generally done on a system that is set to accept only stable keywords, therefore it may provide a slightly different environment than the original testing done when the package was added. Note that there is an exception to that rule: if we believe that particular packages are unlikely to exhibit different behavior across different architectures, we do ALLARCHES stabilization and add all the requested stable keywords after testing on one system.
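The difference between regular and ALLARCHES stabilization can be sketched as follows. This is only an illustration under simplifying assumptions: `stabilize` is a hypothetical function, keywords are modeled as a plain list of tokens, and a single successful test is taken to be enough in the ALLARCHES case.

```python
def stabilize(keywords, requested, tested, allarches=False):
    """Return the keyword list after processing a stabilization request.

    keywords:  current tokens, e.g. ["~amd64", "~arm64"]
    requested: architectures for which a stable keyword was requested
    tested:    architectures on which testing has passed
    """
    if allarches and tested:
        # One successful test covers every requested architecture.
        promote = set(requested)
    else:
        # Otherwise only the architectures actually tested are promoted.
        promote = set(requested) & set(tested)

    result = []
    for kw in keywords:
        if kw.startswith("~") and kw[1:] in promote:
            result.append(kw[1:])  # testing keyword becomes stable
        else:
            result.append(kw)
    return result
```

For a package with `~amd64 ~arm64 ~x86`, a request for amd64 and arm64 that was tested on amd64 only promotes amd64 in the regular case, but promotes both amd64 and arm64 when the package is marked ALLARCHES.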

The cost of keywords

Of course, the cost for stable keywords is much higher. After all, it’s no longer a case of one-time testing, but we actually have to test every single version that’s going stable. This is somewhat amortized by ALLARCHES packages that need to be tested on a single architecture only (and therefore usually are tested on one of the “fast” architectures), but still it’s a lot. On top of that, frequent testing is more likely to reveal problems, and therefore require immediate fixes. This is actually a good thing, but also a future cost to consider. And removing keywords from packages that used to be stable is likely to have greater impact than from those that never were.

Struggling architectures

For architectures that use stable keywords, the obvious possibility is to reduce the number of packages using them, i.e. destabilize packages. Ordinarily, the best targets for this effort would be packages that are old, particularly problematic or unpopular, as they can reduce our effective maintenance cost while minimizing the potential discomfort to users. However, we might need to go deeper than that. In extreme cases, we can go as far as to reduce the stable package set to core system packages. At some point, this kind of reduction forces users to run a mixed stable-testing keyword system, but that at least permits them to limit the risk of regressions in the most important packages.

If even that is insufficient, there are more options at our disposal. We can look into removing keywords entirely from packages, particularly packages that require further rekeywording work. We can decide to remove stable keywords from an architecture entirely. In the worst case, we can decide to mark all profiles exp, effectively abandoning dependency graph integrity (at this point, some dependencies may start missing keywords and packages may not be trivially installable), or we can decide to remove the support for a given architecture entirely.

Summary

August 14 2024

Gentoo Linux drops IA-64 (Itanium) support

GentooNews (https://www.gentoo.org/feeds/news.xml ) • August 14, 2024, 5:00

Following the removal of IA-64 (Itanium) support in the Linux kernel and glibc, and subsequent discussions on our mailing list, as well as a vote by the Gentoo Council, Gentoo will discontinue all ia64 profiles and keywords. The primary reason for this decision is the inability of the Gentoo IA-64 team to support this architecture without kernel support, glibc support, and a functional development box (or even a well-established emulator). In addition, there have been only very few users interested in this type of hardware.

As also announced in a news item, in one month, i.e. in the first half of September 2024, all ia64 profiles will be removed, all ia64 keywords will be dropped from all packages, and all IA-64 related Gentoo bugs will be closed.

Welcome to Planet Gentoo, an aggregation of Gentoo-related weblog articles written by Gentoo developers. For a broader range of topics, you might be interested in Gentoo Universe.

