
Contributors:
. Aaron W. Swenson
. Agostino Sarubbo
. Alexey Shvetsov
. Alexis Ballier
. Alexys Jacob
. Alice Ferrazzi
. Andreas K. Hüttel
. Anthony Basile
. Arun Raghavan
. Bernard Cafarelli
. Christian Ruppert
. Chí-Thanh Christopher Nguyễn
. Denis Dupeyron
. Detlev Casanova
. Diego E. Pettenò
. Domen Kožar
. Doug Goldstein
. Eray Aslan
. Fabio Erculiani
. Gentoo Haskell Herd
. Gentoo Miniconf 2016
. Gentoo Monthly Newsletter
. Gentoo News
. Gilles Dartiguelongue
. Greg KH
. Göktürk Yüksek
. Hanno Böck
. Hans de Graaff
. Ian Whyman
. Jason A. Donenfeld
. Jeffrey Gardner
. Joachim Bartosik
. Johannes Huber
. Jonathan Callen
. Jorge Manuel B. S. Vicetto
. Kristian Fiskerstrand
. Lance Albertson
. Liam McLoughlin
. Luca Barbato
. Marek Szuba
. Mart Raudsepp
. Matt Turner
. Matthew Thode
. Michael Palimaka
. Michal Hrusecky
. Michał Górny
. Mike Doty
. Mike Gilbert
. Mike Pagano
. Nathan Zachary
. Pacho Ramos
. Patrick Kursawe
. Patrick Lauer
. Patrick McLean
. Paweł Hajdan, Jr.
. Piotr Jaroszyński
. Rafael G. Martins
. Remi Cardona
. Richard Freeman
. Robin Johnson
. Ryan Hill
. Sean Amoss
. Sebastian Pipping
. Sergei Trofimovich
. Steev Klimaszewski
. Stratos Psomadakis
. Sven Vermeulen
. Sven Wegener
. Tom Wijsman
. Yury German
. Zack Medico

Last updated:
April 20, 2018, 12:07 UTC

Disclaimer:
Views expressed in the content published here do not necessarily represent the views of Gentoo Linux or the Gentoo Foundation.


Bugs? Comments? Suggestions? Contact us!

Powered by:
Planet Venus

Welcome to Gentoo Universe, an aggregation of weblog articles on all topics written by Gentoo developers. For a more refined aggregation of Gentoo-related topics only, you might be interested in Planet Gentoo.

April 18, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)
Updates on Silicon Labs CP2110 (April 18, 2018, 23:04 UTC)

One month ago I started the yak shave of supporting the Silicon Labs CP2110 with a fully open-source stack that I can even re-use for glucometerutils.

The first step was deciding how to implement this. While the device itself supports quite a wide range of interfaces, including a GPIO one, I decided that, since the serial interface is the only one I can practically test and use, I would at least start with just that. So you’ll probably see the first output as a module for pyserial that implements access to CP2110 devices.

The second step was to find an easy way to test this in a more generic way. Thankfully, Martin Holzhauer, who commented on the original post, linked to an adapter by MakerSpot that uses that chip (the link to the product was lost in the migration to WordPress, sigh), which I then ordered and received a number of weeks later, since it had to come from the US and clear customs through Amazon.

All of this was the easy part; the next part was actually implementing enough of the protocol described in the specification to be able to send and receive data. That also made it clear that, despite the protocol being documented, it’s not as obvious as it might sound. For instance, the specification says that reports 0x01 to 0x3F are used to send and receive data, but it does not say why there are so many reports… it turns out they are actually used to specify the length of the buffer: if you send two bytes, you’ll have to use the 0x02 report, for ten bytes 0x0A, and so on, up to the maximum of 63 bytes as 0x3F. This became very clear when I tried sending a long string and the output was impossible to decode.
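
To make the length-encoding concrete, here is a minimal sketch — my reading of the specification, not Silicon Labs code — of how an outgoing payload would map onto those reports:

def chunk_into_reports(data, max_payload=63):
    # The report ID itself encodes the payload length (0x01-0x3F),
    # so each report carries at most 63 data bytes.
    for offset in range(0, len(data), max_payload):
        chunk = data[offset:offset + max_payload]
        yield bytes([len(chunk)]) + chunk  # report ID == number of bytes

# A 70-byte payload becomes report 0x3F followed by report 0x07.
for report in chunk_into_reports(b'a' * 70):
    print(hex(report[0]), len(report) - 1)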

Speaking of decoding, my original intention was to just loop together the CP2110 device with a CH341 I bought a few years ago, and have them pass data between each other to validate that they both work. Somehow this plan failed: I can get data from the CH341 into the CP2110 and it decodes fine (using picocom for the CH341, and Silicon Labs’ own binary for the CP2110), but I can’t seem to get the CH341 to pick up the data sent through the CP2110. I thought it was a bad adapter, but then I connected the output to my Saleae Logic16 and it showed the data fine, so… no idea.

The current status is:

  • I know the CH341 sends out a good signal;
  • I know the CP2110 can receive a good signal from the CH341, with the Silicon Labs software;
  • I know the CP2110 can send a good signal to the Saleae Logic16, both with the Silicon Labs software and my tiny script;
  • I can’t get the CH341 to receive data from the CP2110.

Right now the state is still very much up in the air, and since I’ll be travelling quite a bit without a chance to bring the devices with me, there probably won’t be any news about this for another month or two.

Oh, and before I forget: Rich Felker gave me another interesting idea. CUSE (Character Devices in User Space) is a kernel-supported way to “emulate” in user space devices that would usually be implemented in the kernel. And that would be another perfect application for this: if you just need to use a CP2110 as an adapter for something that needs to speak to a serial port, you can have a userspace daemon that implements CUSE and provides a ttyUSB-compatible device, without short-circuiting the HID and USB-serial subsystems.

Zack Medico a.k.a. zmedico (homepage, bugs)

In portage-2.3.30, portage’s python API provides an asyncio event loop policy via a DefaultEventLoopPolicy class. For example, here’s a little program that uses portage’s DefaultEventLoopPolicy to do the same thing as emerge --regen, using an async_iter_completed function to implement the --jobs and --load-average options:

#!/usr/bin/env python

from __future__ import print_function

import argparse
import functools
import multiprocessing
import operator

import portage
from portage.util.futures.iter_completed import (
    async_iter_completed,
)
from portage.util.futures.unix_events import (
    DefaultEventLoopPolicy,
)


def handle_result(cpv, future):
    # Collect the completed aux_get result into a metadata dict
    # keyed by the standard auxdb keys.
    metadata = dict(zip(portage.auxdbkeys, future.result()))
    print(cpv)
    for k, v in sorted(metadata.items(),
        key=operator.itemgetter(0)):
        if v:
            print('\t{}: {}'.format(k, v))
    print()


def future_generator(repo_location, loop=None):
    # Yield one async_aux_get future per package version in the repo,
    # each with handle_result attached as a completion callback.

    portdb = portage.portdb

    for cp in portdb.cp_all(trees=[repo_location]):
        for cpv in portdb.cp_list(cp, mytree=repo_location):
            future = portdb.async_aux_get(
                cpv,
                portage.auxdbkeys,
                mytree=repo_location,
                loop=loop,
            )

            future.add_done_callback(
                functools.partial(handle_result, cpv))

            yield future


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--repo',
        action='store',
        default='gentoo',
    )
    parser.add_argument(
        '--jobs',
        action='store',
        type=int,
        default=multiprocessing.cpu_count(),
    )
    parser.add_argument(
        '--load-average',
        action='store',
        type=float,
        default=multiprocessing.cpu_count(),
    )
    args = parser.parse_args()

    try:
        repo_location = portage.settings.repositories.\
            get_location_for_name(args.repo)
    except KeyError:
        parser.error('unknown repo: {}\navailable repos: {}'.\
            format(args.repo, ' '.join(sorted(
            repo.name for repo in
            portage.settings.repositories))))

    policy = DefaultEventLoopPolicy()
    loop = policy.get_event_loop()

    # async_iter_completed schedules the futures while honoring the
    # --jobs and --load-average limits, yielding sets of completed ones.
    for future_done_set in async_iter_completed(
        future_generator(repo_location, loop=loop),
        max_jobs=args.jobs,
        max_load=args.load_average,
        loop=loop):
        loop.run_until_complete(future_done_set)


if __name__ == '__main__':
    main()

April 15, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)
A review of the Curve debit card (April 15, 2018, 23:04 UTC)

Somehow, I end up spending a significant amount of my time thinking, testing and playing with financial services, both old school banks and fintech startups.

One of the most recent ones I have been playing with is Curve. The premise of the service is to allow you to use a single card for all transactions, having the ability to then charge any card underneath it as it is convenient. This was a curious enough idea, so I asked the friend who was telling me about it to give me his referral code to sign up. If you want to sign up, my code is BG2G3.

Signing up and getting the card is quite easy, even though they have (or had when I signed up) a “waitlist” — so after you sign up it takes a few days before you can actually order the card and get it in your hands. They suggest you get other people to sign up as well to move up the waitlist, but that didn’t seem to be a requirement for me. The card arrived, as I recall, no more than two days after they said they shipped it, probably because it was coming from London itself, and that’s all you need to receive to get started.

So how does this all work? You need to connect your existing cards to Curve, and verify them to be able to charge them — verification can be either through a 3Dsecure/Verified by Visa login, or through the usual charge-code-reverse dance that Google, PayPal and the others all use. Once you connect the cards, and select the currently-charged card, you can start using the Curve card to pay, and it acts as a “proxy” for the other card, charging it for the same amount, with some caveats.

The main advantage my friend suggested for this setup is that if you have a corporate card (I do), you can add that one to Curve too, and rely on it to avoid the payback process at work if you make a mistake paying for something. As this happened to me a few times before, mostly from selecting the wrong payment profile in apps such as Uber or Hailo, or going over the daily allowance for meals as I changed plans, it sounded interesting enough. This can work both by making sure to select the corporate card before making the purchase (for instance by defaulting to it during a work trip), or by “turning back time” on an already charged transaction. Cool.

I also had hoped that the card could be attached to Google Pay, but that’s not the case. Nor do they implement their own NFC payment application, which is a bit disappointing.

Besides the “turn back time” feature, the app also has some additional features, such as integration with the accounting software Xero, including the ability to attach a receipt image to an expense (if this existed for Concur, I’d be a real believer, but until then it’s not really that useful to me), and to receive email “receipts” (more like credit card slips) for purchases made with a certain card (not sure why that’s not just a global setting, but meh).

Not much else is available in the app to make it particularly useful or interesting to me, honestly. There’s some category system for expenses, very similar to the one for Revolut, but that’s about it.

On the more practical side of things, Curve does not apply any surcharges as long as the transaction is in the same currency as the card, and that includes the transactions on which you turned back time. This is handy if you don’t know what currency you’ll be charged in, though that does not really happen often.

What I found particularly useful is that the card itself looks like a proper “British” card — with my apartment as the verified address on it. But then I can charge one of my cards in Euro, US Dollars, or Revolut itself… although I could just charge Revolut in those cases. The main difference between the two approaches is that I can put the Euro purchases onto a Euro credit card, instead of a debit one… except that the only Euro credit card I have left also has my apartment as its verifiable billing address, so… I’d say I’m not the target audience for this feature.

For “foreign transactions” (that is, where the charged currency and the card currency disagree), Curve charges a 1% foreign transaction fee. This is pointless for me thanks to Revolut, but it still is convenient if you only have accounts with traditional banks, particularly in the UK where most banks apply a 3% foreign transaction fee instead.

In addition to the free card, they also offer a £50 (a year, I guess — it’s not clear!) “black” card that offers 1% cashback at selected retailers. You can actually get 90 days of cashback for three retailers of your choice on the free card as well, but to be honest, American Express is much more widely accepted in chains, and its rewards are better. I ended up choosing to do the cashback with TfL, Costa (because they don’t take Amex contactless anyway), and Sainsbury’s, just to pick something I use.

In all of this, I have to say I fail to see where the business makes money. Okay, financial services are not my area of expertise, but if you’re just proxying payments, without even taking deposits (the way Revolut does), and not charging additional foreign transaction fees, and even giving cashback… where’s the profit?

I guess there is some money to be made by profiling users and selling the data to advertisers — but between GDPR and the fact that most consumers don’t like the idea of being made into products with no “kick back”, that seems a hard sell. I guess if it were me I would be charging 1% on the “turn back time” feature, but that might make the whole point of the service moot. I don’t know.

At the end of the day, I’m also not sure how useful this card is going to be for me day to day. The ability to have a single entry in those systems that are used “promiscuously” for business and personal use sounds good, but even in that case, it means giving up the advantages of having a credit card, and particularly a rewards card like my Amex. So, yeah, I don’t really see much use for it myself.

It also complicates things when it comes to risk engines for fraud protection: your actual bank will see all the transactions as coming from a single vendor, with the minimum amount of information attached to them. This will likely defeat all the fraud checks by the bank, leaving you to rely on Curve’s implementation of fraud checks — and I have no idea how those work, since I have not yet managed to trip them.

Also, as far as I can tell, Curve (like Revolut) does not implement 3DSecure (the “second factor” authentication used by a number of merchants to validate e-commerce transactions), making it less secure than any of the other cards I have — a cloned/stolen card can only be disabled after the fact, and replaced. Revolut at least allows me to separate the physical card from my e-commerce transactions, which is neat (and now even supports one-time credit card numbers).

There is also another thing that is worth considering, that shows the different point of views (and threat models) of the two services: Curve becomes your single card (single point of failure, too) for all your activities: it makes your life easy by making sure you only ever need to use one card, even if you have multiple bank accounts in multiple countries, and you can switch between them at the tap of a finger. Revolut on the other hand allows you to give each merchant its own credit card number (Premium accounts get unlimited virtual cards) — or even have “burner” cards that change numbers after use.

All in all, I guess it depends what you want to achieve. Between the two, I think I’ll mostly stick with Revolut, and my usage of Curve will taper off once the 90-day cashback offer is done — although it’s still nice to have for the few websites that gave me trouble with Revolut, as long as I’m charging my Revolut account underneath, and the charge is in Sterling.

If you do want to sign up, feel free to use BG2G3 as the referral code, it would give a £5 credit for both you and me, under circumstances that are not quite clear to me, but who knows.

April 14, 2018
Sergei Trofimovich a.k.a. slyfox (homepage, bugs)
crossdev and GNU Hurd (April 14, 2018, 00:00 UTC)


TL;DR: crossdev is a tool to generate a cross-compiler for you in Gentoo, and with some hacks (see below) you can even cross-compile to Hurd!

The FOSDEM 2018 conference happened recently and a lot of cool talks took place there. The full list counts 689 events!

Hurd’s PCI arbiter was a nice one. I had never actually tried Hurd before and thought I’d give it a try in a VM.

Debian already provides a full Hurd installer (installation manual) and I picked it. Hurd works surprisingly well for such an understaffed project! The installation process is very simple: it’s a typical Debian CD which asks you for a few details about the final system (same as for Linux) and you get your OS booted.

Hurd has a ton of Debian software already built and working (like 80% of the whole repo). Even GHC is ported there. While I was at it, I grabbed all the tiny GHC patches related to Hurd from Debian and pushed them upstream.

Now plain ./configure && make && make test just works.

Hurd supports only 32-bit x86 CPUs and does not support SMP (only one CPU is available). That makes building heavyweight stuff (like GHC) in a virtual machine a slow process.

To speed things up a bit I decided to build a cross-compiler from Gentoo Linux to Hurd with the help of crossdev. What does it take to support a bootstrap like that? Let’s see!

To get an idea of how to cross-compile to another OS, let’s check how the typical linux-to-linux case looks.

Normally aiming gcc at another linux-glibc target takes the following steps:

- install cross-binutils
- install system headers (kernel headers and glibc headers)
- install minimal gcc without glibc support (not able to link final executables yet)
- install complete glibc (gcc will need crt.o files)
- install full gcc (able to link final binaries for C and C++)

In Gentoo, crossdev does all of the above automatically by running emerge a few times for you. I wrote a more up-to-date crossdev README to describe a few details of what happens when you run crossdev -t <target>.

The hurd-glibc case is not fundamentally different from the linux-glibc one. Only a few packages need to change their names, namely:

- install cross-binutils
- install gnumach-headers (kernel headers part 1)
- [NEW] install cross-mig tool (Mach Interface Generator, a flavour of IDL compiler)
- install hurd-headers and glibc-headers (kernel headers part 2 and libc headers)
- install minimal gcc without libc support (not able to link final executables yet)
- install complete libc (gcc will need crt.o files)
- install full gcc (able to link final binaries for C and C++)

The only change from linux is the cross-mig tool. I’ve collected the needed ebuilds in the gentoo-hurd overlay.

Here is how one gets a Hurd cross-compiler with today’s crossdev-99999999:

git clone https://github.com/trofi/gentoo-hurd.git
HURD_OVERLAY=$(pwd)/gentoo-hurd
CROSS_OVERLAY=$(portageq get_repo_path / crossdev)
TARGET_TUPLE=i586-pc-gnu
# this will fail around glibc, it's ok, we'll take over manually from there
crossdev --l 9999 -t ${TARGET_TUPLE}
ln -s "${HURD_OVERLAY}"/sys-kernel/gnumach ${CROSS_OVERLAY}/cross-${TARGET_TUPLE}/
ln -s "${HURD_OVERLAY}"/dev-util/mig ${CROSS_OVERLAY}/cross-${TARGET_TUPLE}/
ln -s "${HURD_OVERLAY}"/sys-kernel/hurd ${CROSS_OVERLAY}/cross-${TARGET_TUPLE}/
emerge -C cross-${TARGET_TUPLE}/linux-headers
ACCEPT_KEYWORDS='**' USE=headers-only emerge -v1 cross-${TARGET_TUPLE}/gnumach
ACCEPT_KEYWORDS='**' USE=headers-only emerge -v1 cross-${TARGET_TUPLE}/mig
ACCEPT_KEYWORDS='**' USE=headers-only emerge -v1 cross-${TARGET_TUPLE}/hurd
ACCEPT_KEYWORDS='**' USE=headers-only emerge -v1 cross-${TARGET_TUPLE}/glibc
USE="-*" emerge -v1 cross-${TARGET_TUPLE}/gcc
ACCEPT_KEYWORDS='**' USE=-headers-only emerge -v1 cross-${TARGET_TUPLE}/glibc
USE="-sanitize" emerge -v1 cross-${TARGET_TUPLE}/gcc

Done!

A few things to note here:

  • the gentoo-hurd overlay is used for the new gnumach, mig and hurd packages
  • symlinks to the new packages are created manually (crossdev fix pending; wait for the packages to get into ::gentoo)
  • linux-headers is uninstalled, as our target is not linux (crossdev fix pending)
  • ACCEPT_KEYWORDS='**' is used for many packages (final keywords still need deciding, maybe x86-hurd)
  • all crossdev phases are run manually
  • only the glibc git version works, as the changes were merged upstream very recently
  • USE="sanitize" is disabled in the final gcc because it’s broken for hurd

Now you can go to /usr/${TARGET_TUPLE}/etc/portage/ and tweak the defaults for ELIBC, KERNEL and other things.

Basic sanity check for a toolchain:

$ i586-pc-gnu-gcc main.c -o main
$ file main
main: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld.so, for GNU/Hurd 0.0.0, with debug_info, not stripped

Copying the binary to the target Hurd VM and running it there also works as expected.

I use crossdev -t x86_64-HEAD-linux-gnu to have GHC built against HEAD in parallel to the system’s GHC. Let’s use that for a more heavyweight test and build a GHC cross-compiler to Hurd:

$ EXTRA_ECONF=--with-ghc=x86_64-HEAD-linux-gnu-ghc emerge -v1 cross-i586-pc-gnu/ghc --quiet-build=n

This fails with:

rts/posix/Signals.c:398:28: error:
     note: each undeclared identifier is reported only once for each function it appears in
    |
398 |         action.sa_flags |= SA_SIGINFO;
    |                            ^

Which hints at the lack of SA_SIGINFO support in upstream glibc.git. Debian has an out-of-tree tg-hurdsig-SA_SIGINFO.diff patch to provide these defines (at least it’s not our local toolchain breakage). The outcome is positive: we got very far into cross-compiling and hit real portability issues. Woohoo!

Final words

As long as the underlying toolchains are not too complicated, building cross-compilers in Gentoo is trivial. The next tiny step is to cross-build the Hurd kernel itself and run it in qemu. The ebuilds in gentoo-hurd are not yet ready for that, but tweaking them should be easy.

Have fun!


April 11, 2018
Andreas K. Hüttel a.k.a. dilfridge (homepage, bugs)

Today's news is that we have submitted a manuscript for publication, describing Lab::Measurement and with it our approach towards fast, flexible, and platform-independent measuring with Perl! The manuscript mainly focuses on the new, Moose-based class hierarchy. We have uploaded it to arXiv as well; here is the (for now) full bibliographic information of the preprint:

 "Lab::Measurement - a portable and extensible framework for controlling lab equipment and conducting measurements"
S. Reinhardt, C. Butschkow, S. Geissler, A. Dirnaichner, F. Olbrich, C. Lane, D. Schröer, and A. K. Hüttel
submitted for publication; arXiv:1804.03321 (PDF, HTML, BibTeX entry)
If you're using Lab::Measurement in your lab, and this results in some nice publication, then we'd be very grateful for a citation of our work - for now the preprint, and later hopefully the accepted version.

Hanno Böck a.k.a. hanno (homepage, bugs)

Snallygaster

A few days ago I figured out that several blogs operated by T-Mobile Austria had a Git repository exposed which included their WordPress configuration file. Due to the fact that a phpMyAdmin installation was also accessible, this would have allowed me to change or delete their database and subsequently take over their blogs.

Git Repositories, Private Keys, Core Dumps

Last year I discovered that the German postal service exposed a database with 200,000 addresses on their web page, because it was simply named dump.sql (which is the default filename for database exports in the documentation example of mysql). An Australian online pharmacy exposed a database under the filename xaa, which is the default output filename of the "split" tool on Unix systems.

It also turns out that plenty of people store their private keys for TLS certificates on their servers - or their SSH keys. Crashing web applications can leave behind coredumps that may expose application memory.

For a while now I have been interested in this class of surprisingly trivial vulnerabilities: people leave files accessible on their web servers that shouldn't be public. I've given talks about this at a couple of conferences (recordings available from Bornhack, SEC-T, Driving IT). I scanned for these issues with a Python script that I extended with more and more such checks.

Scan your Web Pages with snallygaster

It's taken a bit longer than intended, but I finally released it: it's called snallygaster and is available on GitHub and PyPI.

Apart from many checks for secret files, it also contains some checks for related issues, like checking for invalid src references which can lead to domain takeover vulnerabilities, for the Optionsbleed vulnerability which I discovered during this work, and for a couple of other vulnerabilities I found interesting and easily testable.

Some may ask why I wrote my own tool instead of extending an existing project. I thought about it, but I didn't really find any existing free software vulnerability scanner that was suitable. The tool that comes closest is probably Nikto, but testing it I felt it comes with a lot of checks - thus it's slow - and few results. I wanted a tool with a relatively high impact that doesn't take forever to run. Another commonly mentioned free vulnerability scanner is OpenVAS - a fork of Nessus from back when that was free software - but I always found it very annoying to use and overengineered. It's not a tool you can "just run". So I ended up creating my own tool.

A Dragon Legend in US Maryland

Finally, you may wonder what the name means. The Snallygaster is a dragon that, according to some legends, was seen in Maryland and other parts of the US. Why that name? There's no particular reason; I just searched for a suitable name and thought a mythical creature might make a good one. So I searched Wikipedia for potential names and checked for name collisions. This one had none and also sounded funny and interesting enough.

I hope snallygaster turns out to be useful for administrators and pentesters and helps expose this class of trivial, but often powerful, vulnerabilities. Obviously I welcome new ideas for further tests that could be added to snallygaster.

April 09, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)

TL;DR summary: be very careful if you use a .eu domain as your point of contact for anything. If you’re thinking of registering a .eu domain to use as your primary domain, just don’t.


I have forecasted a rant when I pointed out I changed domain with my move to WordPress.

I registered flameeyes.eu nearly ten years ago; part of the reason was that flameeyes.com was (at the time) parked by a domain squatter, and part was that I have been a strong supporter of the European Union.

In those ten years I started using the domain not just for my website, but as my primary contact email. It’s listed as my contact address everywhere; I have all kinds of financial, commercial and personal services attached to that email. It’s effectively impossible for me to ever untangle from it, even if I spent the next four weeks doing nothing but amending registrations — some services just don’t allow you to ever change email address, and many require you to contact support and spend time talking with a person to get the email updated on the account.

And now, because I moved to the United Kingdom, which decided to leave the Union, the Commission threatens to prevent me from keeping my domain. It may sound obvious, since EURid says:

A website with a .eu or .ею domain name extension tells your customers that you are a legal entity based in the EU, Iceland, Liechtenstein or Norway and are therefore, subject to EU law and other relevant trading standards.

But at the same time it now produces a terrible collision of two worlds: the technical and the political. The idea that any entity in control of a .eu domain is, by requirement, operating under EU law sounds good on paper… until you hit this corner case where a country leaves the Union — and now either you water down that promise, eroding trust in the domain by not upholding it, or you end up with domain takeovers, eroding trust in the domain on technical merit.

Most of the important details are already explained in a seemingly unrelated blog post by Hanno Böck: Abandoned Domain Takeover as a Web Security Risk. If EURid forbids renewal of .eu domains for entities that are no longer considered part of the EU, a whole lot of domains will effectively be “up for grabs”. Some may currently be used as CDN aliases to load resources on other websites; those would be the worst, as they would allow whoever takes control of the domains to inject content into sites that should otherwise be secure.

But even more important for companies that used their .eu domain as their primary point of contact: think of any PO, or invoice, or request for information, that would be sent to a company email address — and now think of a malicious actor getting access to those communications! This is not just a risk that I (and any other European supporter who happens to live in the UK; I’m sure I’m not alone) face as a single individual — it’s a possibly unlimited number of scams that people would be subjected to, as it would be trivial to pass for a company once its domain is taken over!

As you can see from the title, I think this particular move is also going to hit European supporters the most. Not just those individuals (like me!) who wanted to signal how they feel part of something bigger than their country of birth, but also the UK companies that, I expect, used a .eu domain specifically to declare themselves open to European customers — as otherwise, between pricing in Sterling and a .co.uk domain, it would always feel like buying “foreign goods”. Now those companies, that believed in Europe, find themselves in the weakest of positions.

Speaking of individuals, when I read the news I did a double-take, and had to check the rules for .eu domains again. At first I assumed that something was clearly wrong: I’m a European Union citizen, surely I will be able to keep my domain, no matter where I live! Unfortunately, that’s not the case:

In this first step the Registrant must verify whether it meets the General
Eligibility Criteria, whereby it must be:
(i) an undertaking having its registered office, central administration or
principal place of business within the European Union, Norway, Iceland
or Liechtenstein, or
(ii) an organisation established within the European Union, Norway, Iceland
or Liechtenstein without prejudice to the application of national law, or
(iii) a natural person resident within the European Union, Norway, Iceland or
Liechtenstein.

If you are a European Union citizen, but you don’t want your digital life to ever be held hostage by the Commission or your country’s government playing games with it, do not use a .eu domain. Simple as that. EURid does not care about the well-being of their registrants.

If you’re a European company, do think twice about whether you want to risk that a change in government for the country you’re registered in would expose you, your suppliers, and your customers to a wild west of overtaken domains.

Effectively, what EURid has signalled with this is that they care so little about the technical hurdles of their customers that I would suggest against anyone relying on a .eu domain at all. Register one as a defence against scammers, but don’t do business on it, as it’s less stable than certain microstate domains, or even the more trendy and modern gTLDs.

I’ll call this an own goal. I still trust the European Union, and the Commission, to have the interests of the many in mind. But the way they tried to bind a legislative domain to the .eu TLD was brittle at best to begin with, and now there’s no way out of this that does not ruin someone’s day and erode trust in that very same domain.

It’s also important to note that most of the bigger companies, those that I hear a lot of European politicians complain about, would have no problem with something like this: just create a fully-owned subsidiary somewhere in Europe, say for instance Slovakia, and have it hold onto the domain. And have it just forward onto a gTLD to do business on, so you don’t even give the impression of counting on that layer of legislative trust.

Given the scary damage that would be caused by losing control over my email address of ten years, I’m honestly considering looking for a similar loophole. The cost of establishing an LLC in another country, firmly within EU boundaries, is not pocket money, but it’s still chump change compared to the amount of damage (financial, reputational, relational, etc.) that losing the domain would cause — it would be a good investment.

April 08, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)
WordPress, really? (April 08, 2018, 12:05 UTC)

If you’re reading this blog post, particularly directly on my website, you probably noticed that it’s running on WordPress and that it’s on a new domain, no longer referencing my pride in Europe, after ten years of using it as my domain. Wow that’s a long time!

I had two reasons for the domain change: the first is that I didn’t want to keep the full chain of redirects from extremely old links onto whichever new blogging platform I selected. And the second is that it made it significantly easier to set up a WordPress.com copy of the blog while I tweaked and set it up, rather than messing with the domain all at once. The second one will come with a separate rant very soon, as it’s related to the worrying statement from the European Commission regarding the usage of dot-EU domains in the future. But as I said, that’s a separate rant.

I have had a few people surprised when I was talking over Twitter about the issues I faced on the migration. I want to give some more context on why I went this way.

As you may remember, last year I complained about Hugo – to the point that a lot of the referrers to this blog still come from the Hacker News thread about that – and I started looking for alternatives. And when I looked at WordPress I found that setting it up properly would take me forever, so I kept my mouth shut and doubled down on Hugo.

Except, because of the way it was set up, it meant not having an easy way to write blog posts, or correct them, from a computer that is not my normal Linux laptop with the SSH token and everything else. That was too much of a pain to keep working with. While Hector and others suggested flows that involved Git-based web editors, it all felt too Rube Goldberg to me… and since moving to London my time is significantly more limited than before, so I could either spend time setting everything up, or work on writing more content, which can hopefully be more useful.

I ended up deciding to pay for the personal tier of WordPress.com services, since I don’t care about monetization of this content, and even the few affiliate links I’ve been using with Amazon are not really that useful at the end of the day, so I gave up on setting up OneLink and the likes here. It also turned out that Amazon’s image-and-text links (which use JavaScript and iframes) are not supported by WordPress.com even with the higher tiers, so those were deleted too.

Nobody seems to have published an easy migration guide from Hugo to WordPress, as most of the search queries produced results for the other direction. I will spend some time later trying to refine the janky template I used and possibly release it. I also want to release the tool I wrote to “annotate” the generated WXR file with the Disqus archive… oh yes, the new blog has all the comments of the old one, and does not rely on Disqus, as I promised.

On the other hand, a few things did get lost in the transition: while the Jetpack plugin gives you the ability to write posts in Markdown (otherwise I wouldn’t even have considered WordPress), it doesn’t seem like the importer knows how to import Markdown content at all. So all the old posts have been pre-rendered — a shame, but honestly it doesn’t happen very often that I need to go through old posts. Particularly now that I merged the content from all my older blogs into Hugo first, and now into this one massive blog.

Hopefully expect more posts from me very soon now, and not just rants (although probably just mostly rants).

And as a closing aside, if you’re curious about the picture in the header, I have once again used one of my own. This one was taken at the maat in Lisbon. The white balance on this shot was totally off, but I liked the result. And if you’re visiting Lisbon and you’re an electronics or industrial geek you definitely have to visit the maat!

April 03, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
py3status v3.8 (April 03, 2018, 12:06 UTC)

Another long awaited release has come true thanks to our community!

The changelog is so huge that I had to open an issue and cry for help to make it happen… thanks again @lasers for stepping up once again 🙂

Highlights

  • gevent support (-g option) to switch from threads scheduling to greenlets and reduce resources consumption
  • environment variables support in i3status.conf to keep sensitive information out of your config
  • modules can now leverage a persistent data store
  • hundreds of improvements for various modules
  • we now have an official debian package
  • we reached 500 stars on github #vanity

Milestone 3.9

  • try to release a version faster than every 4 months (j/k) 😉

The next release will focus on bugs and modules improvements / standardization.

Thanks contributors!

This release is their work, thanks a lot guys!

  • alex o’neill
  • anubiann00b
  • cypher1
  • daniel foerster
  • daniel schaefer
  • girst
  • igor grebenkov
  • james curtis
  • lasers
  • maxim baz
  • nollain
  • raspbeguy
  • regnat
  • robert ricci
  • sébastien delafond
  • themistokle benetatos
  • tobes
  • woland

March 30, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)

I recently dockerized a small Django application. I built the Dockerfile in a way that the resulting image allows running the container as if it were plain manage.py, e.g. so that besides docker-compose up I could also do:

# For a psql session into the database:
docker-compose run <service_name> dbshell

# Or, to run the test suite:
docker-compose run <service_name> test

To make that work, I made this Docker entrypoint script:

#! /bin/bash
# Copyright (C) 2018 Sebastian Pipping <sebastian@pipping.org>
# Licensed under CC0 1.0 Public Domain Dedication.
# https://creativecommons.org/publicdomain/zero/1.0/

set -e
set -u

RUN() {
    ( PS4='# ' && set -x && "$@" )
}

RUN wait-for-it "${POSTGRES_HOST}:${POSTGRES_PORT}" -t 30

cd /app

if [[ $# -gt 0 ]]; then
    RUN ./manage.py "$@"
else
    RUN ./manage.py makemigrations
    RUN ./manage.py migrate
    RUN ./manage.py createcustomsuperuser  # self-made

    RUN ./manage.py runserver 0.0.0.0:${APP_PORT}
fi

Management command createcustomsuperuser is something simple that I built myself for this very purpose: create a superuser, support scripting, accept passwords as bad as “password” or “demo” without complaints, and be okay if the user already exists with the same credentials (idempotency). I uploaded createcustomsuperuser.py as a Gist to GitHub as it’s a few more lines.
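
For a rough idea of what goes into such a command, here is a minimal sketch of an idempotent, script-friendly variant — this is not the actual Gist, and the environment variable names are assumptions for illustration:

# management/commands/createcustomsuperuser.py -- hypothetical sketch
import os

from django.contrib.auth import get_user_model
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'Create a superuser non-interactively; succeed if it already exists.'

    def handle(self, *args, **options):
        User = get_user_model()
        username = os.environ.get('DJANGO_SU_NAME', 'admin')         # assumed env var
        password = os.environ.get('DJANGO_SU_PASSWORD', 'password')  # assumed env var
        user, _ = User.objects.get_or_create(username=username)
        # set_password bypasses validators, so weak demo passwords are accepted
        user.set_password(password)
        user.is_staff = user.is_superuser = True
        user.save()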

Back to the entrypoint script. For the RUN ./manage.py "$@" part to work, in the Dockerfile both ENTRYPOINT and CMD need to use the [..] syntax, e.g.:

ENTRYPOINT ["/app/docker-entrypoint.sh"]
CMD []

For more details on ENTRYPOINT quirks like that, I recommend John Zaccone’s well-written article “ENTRYPOINT vs CMD: Back to Basics”.

March 20, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)

In the comments to my latest post on the Silicon Labs CP2110, the first comment got me more than a bit upset, because it was effectively trying to mansplain to me how a serial adapter (or more properly a USB-to-UART adapter) works. Then I realized there’s one thing I can do better than complain, and that is providing even more information for the next person who might need it. Because I wish I knew half of what I know now back when I tried to write the driver for the CH341.

So first of all, what are we talking about? UART is a very wide definition for any interface that implements serial communication between a host and a device. The term “serial port” probably brings different ideas to mind depending on the background of a given person, whether it is mice and modems connected to PCs, or servers’ serial terminals, or programming interfaces for microcontrollers. For the most part, people in the “consumer world” think of serial as RS-232, but people who have experience with complex automation systems, whether home, industrial, or vehicle automation, have RS-485 as their main reference. None of that actually matters here, since these standards mostly deal with electrical and mechanical details.

As physical serial ports stopped appearing on computers many years ago, most users moved to USB adapters. These adapters are all different from each other, and that’s why there are around 40 KSLOC of serial adapter drivers in the Linux kernel (according to David’s SLOCCount). And that’s without counting the remaining 1.5 KSLOC for implementing CDC ACM, which is the supposedly-standard approach to serial adapters.

Usually the adapters are placed either directly on the “gadget” that needs to be connected, which then exposes a USB connector, or on a cable used to connect to it, in which case the device usually has a TRS or similar connector. TRS-based serial cables appear to have become more and more popular thanks to osmocom, as they are relatively inexpensive to build, both as cables and as connectors onto custom boards.

Serial interface endpoints in operating systems (/dev/tty{S,USB,ACM}* on Linux, COM* on Windows, and so on) do not only transfer data between host and device, but also provide configuration of parameters such as transmission rate and “symbol shape” — you may or may not have heard references to something like “9600n8”, which is a common way to express the transmission protocol of a serial interface: 9600 symbols per second (“baud rate”), no parity, 8 bits per symbol. You can call these “out-of-band” parameters, as they are transmitted to the UART interface, but not to the device itself, and they are the crux of the matter of interacting with these USB-to-UART adapters.
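
As a concrete reference, here is how those same out-of-band parameters would be expressed through pyserial — the device node is just a placeholder:

import serial

# Open a port configured as "9600n8"; /dev/ttyUSB0 is a placeholder node.
port = serial.Serial(
    '/dev/ttyUSB0',
    baudrate=9600,                 # 9600 symbols per second
    bytesize=serial.EIGHTBITS,     # 8 bits per symbol
    parity=serial.PARITY_NONE,     # no parity
    stopbits=serial.STOPBITS_ONE,
)
port.write(b'hello')  # only the payload reaches the device; the
                      # parameters above stop at the adapter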

I already wrote notes about USB sniffing, so I won’t go too much into detail there, but most of the time when you’re trying to figure out what the control software sends to a device, you start by taking a USB trace, which gives you a list of USB Request Blocks (effectively, transmission packets), and you get to figure out what’s going on there.

For those devices that use USB-to-UART adapters and actually use the OS-provided serial interface (that is, COM* under Windows, where most of the control software has to run), you could use specialised software to intercept only the communication on that interface… but I don’t know of any such modern software, while there are at least a few well-defined interfaces to intercept USB communication. And that would not work for software that accesses the USB adapter directly from userspace, which is always the case for the Silicon Labs CP2110, but is also the case for some of the FTDI devices.

To be fair, for those devices that use TRS connectors, I have actually considered just intercepting the serial protocol using the Saleae Logic Pro, but besides being overkill, that would cover just a tiny fraction of the devices — the more modern ones just include the USB-to-UART chip straight onto the device, which is also the case for the meter using the CP2110 I referenced earlier.

Within the request blocks you’ll have not just the serial communication, but also all the related out-of-band information, which is usually terminated on the adapter/controller rather than being forwarded onto the device. The amount of information changes widely between adapters. Out of those I have had direct experience with, I found one (TI3420) that requires a full firmware upload before it starts working, which means recording everything from the moment you plug in the device produces a lot more noise than you would expect. But most of those I dealt with had very simple interfaces, using Control transfers for out-of-band configuration, and Bulk or Interrupt1 transfers for transmitting the actual serial data.

With these simpler interfaces, my “analysis” scripts (if you allow me the term, I don’t think they are that complicated) can produce a “chatter” file quite easily by ignoring the whole out-of-band configuration. Then I can analyse those chatter files to figure out the device’s actual protocol, and for the most part it’s a matter of trying between one and five combinations of transmission protocol to find the right one to speak to the device — in glucometerutils I have two drivers using 9600n8 and two drivers using 38400n8. In some cases, such as the TI3420 one, I actually had to figure out the configuration packet (thanks to the Linux kernel driver and the datasheet) to discover that it was using 19200n8 instead.

But again, for those, the “decoding” is just a matter of filtering away part of the transmission to keep the useful parts. For others it’s not as easy.

0029 <<<< 00000000: 30 12                                             0.
0031 <<<< 00000000: 05 00                                             ..
0033 <<<< 00000000: 2A 03                                             *.
0035 <<<< 00000000: 42 00                                             B.
0037 <<<< 00000000: 61 00                                             a.
0039 <<<< 00000000: 79 00                                             y.
0041 <<<< 00000000: 65 00                                             e.
0043 <<<< 00000000: 72 00                                             r.

This is an excerpt from the chatter file of a session with my Contour glucometer. What happens here is that instead of buffering the transmission and sending a single request block with a whole string, the adapter (FTDI FT232RL) sends short bursts, probably to reduce latency and keep a more accurate serial protocol (which is important for devices that need accurate timing, for instance some in-chip programming interfaces). This would also be easy to recompose, except it also comes with

0927 <<<< 00000000: 01 60                                             .`
0929 <<<< 00000000: 01 60                                             .`
0931 <<<< 00000000: 01 60                                             .`

which I’m somewhat sceptical come from the device itself. I have not paid enough attention yet to figure out from the kernel driver whether this data is marked as coming from the device, or is some kind of keepalive or synchronisation primitive of the adapter.

In the case of the CP2110, the first session I captured starts with:

0003 <<<< 00000000: 46 0A 02                                          F..
0004 >>>> 00000000: 41 01                                             A.
0006 >>>> 00000000: 50 00 00 4B 00 00 00 03  00                       P..K.....
0008 >>>> 00000000: 01 51                                             .Q
0010 >>>> 00000000: 01 22                                             ."
0012 >>>> 00000000: 01 00                                             ..
0014 >>>> 00000000: 01 00                                             ..
0016 >>>> 00000000: 01 00                                             ..
0018 >>>> 00000000: 01 00                                             ..

and I can definitely tell you that the first three URBs are not sent to the device at all. That’s because HID (the higher-level protocol that the CP2110 uses on top of USB) uses the first byte of the block to identify the “report” it sends or receives. Checking these against AN434 gives me a hint of what’s going on:

  • report 0x46 is “Get Version Information” — the CP2110 always returns 0x0A as the first byte, followed by a device version, which is unspecified; probably only used to confirm that the device is right, and possibly for debugging purposes;
  • report 0x41 is “Get/Set UART Enabled” — 0x01 just means “turn on the UART”;
  • report 0x50 is “Get/Set UART Config” — and this is a bit more complex to parse: the first four bytes (0x00004b00) define the baud rate, which is 19200 symbols per second; then follows one byte for parity (0x00, no parity), one for flow control (0x00, no flow control), one for the number of data bits (0x03, 8 bits per symbol), and finally one for the stop bit (0x00, short stop bit); that’s a long way to say that this is configured as 19200n8 (see the parsing sketch after this list);
  • report 0x01 is the actual data transfer, which means the transmission to the device starts with 0x51 0x22 0x00 0x00 0x00 0x00.
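
Here is a minimal sketch of decoding that 0x50 report in Python, following the field layout just described — the lookup tables only cover the values that appear in this capture, and the data-bits mapping (0x03 meaning 8 bits) is my reading of AN434:

import struct

PARITY = {0x00: 'n'}      # only "no parity" appears in this capture
DATA_BITS = {0x03: 8}     # 0x03 -> 8 bits per symbol, per the reading above

def parse_uart_config(report):
    # 1-byte report ID, 4-byte big-endian baud rate, then one byte each
    # for parity, flow control, data bits and stop bits.
    report_id, baud, parity, flow, data_bits, stop = \
        struct.unpack('>BIBBBB', report)
    assert report_id == 0x50
    return '%d%s%d' % (baud, PARITY[parity], DATA_BITS[data_bits])

print(parse_uart_config(bytes.fromhex('5000004b0000000300')))  # 19200n8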

This means that I need a smarter analysis script that understands this protocol (which may be as simple as just ignoring anything that does not use report 0x01) to figure out what the control software is sending.

And at the same time, it needs code that knows how to “talk serial” to this device. Usually the out-of-band configuration is done by a kernel driver: you ioctl() the serial device to the transmission protocol you need, and the driver sends the right request block to the USB endpoint. But in the case of the CP2110, there’s no kernel driver implementing this, at least per Silicon Labs’ design: since HID devices are usually exposed to userland, and in particular to non-privileged applications, sending and receiving the reports can be done directly from the apps. So indeed there is no COM* device exposed on Windows, even with the drivers installed.

Could someone (me?) write a Linux kernel driver that exposes the CP2110 as a serial, rather than HID, device? Sure. It would require fiddling around with the HID subsystem a bit to have it ignore the original device, and that means it would probably break any application built with Silicon Labs’ own development kit, unless someone has a suggestion on how to have both interfaces available at the same time; but it would allow accessing those devices without special userland code. Still, I think I’ll stick with the idea of providing a Free and Open Source implementation of the protocol, for Python. And maybe add support for it to pyserial to make it easier for me to use it.


  1. All these terms make more sense if you have at least a bit of knowledge of how USB works behind the scenes, but I don’t want to delve too much into that.
    [return]

March 18, 2018
Andreas K. Hüttel a.k.a. dilfridge (homepage, bugs)

https://www.labmeasurement.de/images/2018-03-simon-berlin.jpg
The 2018 spring meeting of the DPG condensed matter physics section in Berlin is over, and we've all listened to interesting talks and seen exciting physics. And we've also presented the Lab::Measurement poster! Here's the photo proof of Simon explaining our software... click on the image or the link for a larger version!

Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)
Yak Shaving: Silicon Labs CP2110 and Linux (March 18, 2018, 00:06 UTC)

One of my favourite pastimes in the past years has been reverse engineering glucometers for the sake of writing a utility package to export their data. Sometimes, in the quest of just getting data out of a meter, I end up embarking on yak shaves that are particularly bothersome, as they are useful only for me and no one else.

One of these yak shaves might be more useful to others, but that remains to be seen. I got my hands on a new meter, which I will review later on. This meter has software for Windows to download the readings, so it’s a good target for reverse engineering. What surprised me, though, was that once I connected the device to my Linux laptop, it came up as an HID device, described as a “USB HID to UART adapter”: the device uses a CP2110 adapter chip by Silicon Labs, and it’s the first time I saw this particular chip (or even class of chip) in my life.

Effectively, this device piggybacks on the HID interface, which allows vendor-specified protocols to be implemented in user space without needing in-kernel drivers. I’m not sure if I should be impressed by the cleverness or disgusted by the workaround. In either case, it means that you end up with a stacked protocol design: the glucometer protocol itself is serial-based, implemented on top of a serial-like software interface, which converts it to the CP2110 protocol, which is encapsulated into HID packets, which are then sent over USB…

The good thing is that, as the datasheet reports, the protocol is available: “Open access to interface specification”. And indeed, in the download page for the device there’s a big archive of just-about-everything, including a number of precompiled binary libraries and a bunch of documents, among which figures AN434, which describes the full interface of the device. Source code is also available, but having spot-checked it, it appears to have no license specification and as such is to be considered proprietary, and possibly virulent.

So now I’m warming up to the idea of doing a bit more of yak shaving and for once trying not to just help myself. I need to understand this protocol for two purposes: one is obviously having the ability to communicate with the meter that uses that chip; the other is being able to understand what the software is telling the device and vice-versa.

This means I need to have generators for the host side, but parsers for both. Luckily, construct should make that part relatively painless, and make it very easy to write (if not maintain, given the amount of API breakages) such a parser/generator library. And of course this has to be in Python because that’s the language my utility is written in.
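As a taste of what that could look like, here is a hedged sketch of the AN434 “Get/Set UART Config” report (the 0x50 report covered in the previous post above) expressed with construct — the field names are my own, not from any released library:

from construct import Byte, Const, Int32ub, Struct

# Hypothetical declaration of the 0x50 report; the same Struct can both
# parse captured reports and build new ones to send.
UART_CONFIG = Struct(
    'report_id' / Const(b'\x50'),
    'baud_rate' / Int32ub,
    'parity' / Byte,
    'flow_control' / Byte,
    'data_bits' / Byte,
    'stop_bits' / Byte,
)

print(UART_CONFIG.parse(bytes.fromhex('5000004b0000000300')).baud_rate)
# -> 19200
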

The other thing that I realized as I was toying with the idea of writing this is that, done right, it can be used together with facedancer, to implement the gadget side purely in Python. Which sounds like a fun project for those of us into that kind of thing.

But since this time this is going to be something more widely useful, and not restricted to my glucometer work, I’m now looking to release it using a different process, as that would allow me to respond to issues and code reviews from my office as well as during the (relatively little) spare time I have at home. So expect this to take quite a bit longer to be released.

At the end of the day, what I hope to have is an Apache 2 licensed Python library that can parse both host-to-controller and controller-to-host packets, and also implement it well enough on the client side (based on the hidapi library, likely) so that I can just import the module and use it for a new driver. Bonus points if I can use this to implement a test fake framework for the glucometer tests.

In all of this, I want to make sure to thank Silicon Labs for releasing the specification of the protocol. It’s not always that you can just google up the device name to find the relevant protocol documentation, and even when you do it’s hard to figure out if it’s enough to implement a driver. The fact that this is possible surprised me pleasantly. On the other hand I wish they actually released their code with a license attached, and possibly a widely-usable one such as MIT or Apache 2, to allow users to use the code directly. But I can see why that wouldn’t be particularly high in their requirements.

Let’s just hope this time around I can do something for even more people.

March 17, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)
Holy cow! Larry the cow Gentoo tattoo (March 17, 2018, 14:53 UTC)

Probably not new but was new to me: Just ran into this Larry the Cow tattoo online:

Larry the Gender Challenged Cow

March 11, 2018
Greg KH a.k.a. gregkh (homepage, bugs)

As many people know, last week there was a court hearing in the Geniatech vs. McHardy case. This was a case brought before the German court OLG Cologne, claiming a license violation of the Linux kernel in Geniatech devices.

Harald Welte has written up a wonderful summary of the hearing, I strongly recommend that everyone go read that first.

In Harald’s summary, he refers to an affidavit that I provided to the court. Because the case was withdrawn by McHardy, my affidavit was not entered into the public record. I had always assumed that my affidavit would be made public, and since I have had a number of people ask me about what it contained, I figured it was good to just publish it for everyone to be able to see it.

There are some minor edits from what was exactly submitted to the court, such as the side-by-side German translation of the English text and some reformatting around footnotes in the text, because I don’t know how to do that directly here, and they really were not all that relevant for anyone who reads this blog. Exhibit A is also not reproduced, as it’s just a huge list of all of the kernel releases for which I felt there was no evidence of any contribution by Patrick McHardy.

AFFIDAVIT

I, the undersigned, Greg Kroah-Hartman,
declare in lieu of an oath and in the
knowledge that a wrong declaration in
lieu of an oath is punishable, to be
submitted before the Court:

I. With regard to me personally:

1. I have been an active contributor to
   the Linux Kernel since 1999.

2. Since February 1, 2012 I have been a
   Linux Foundation Fellow.  I am currently
   one of five Linux Foundation Fellows
   devoted to full time maintenance and
   advancement of Linux. In particular, I am
   the current Linux stable Kernel maintainer
   and manage the stable Kernel releases. I
   am also the maintainer for a variety of
   different subsystems that include USB,
   staging, driver core, tty, and sysfs,
   among others.

3. I have been a member of the Linux
   Technical Advisory Board since 2005.

4. I have authored two books on Linux Kernel
   development including Linux Kernel in a
   Nutshell (2006) and Linux Device Drivers
   (co-authored Third Edition in 2009.)

5. I have been a contributing editor to Linux
   Journal from 2003 - 2006.

6. I am a co-author of every Linux Kernel
   Development Report. The first report was
   based on my Ottawa Linux Symposium keynote
   in 2006, and the report has been published
   every few years since then. I have been
   one of the co-authors on all of them. This
   report includes a periodic in-depth
   analysis of who is currently contributing
   to Linux. Because of this work, I have an
   in-depth knowledge of the various records
   of contributions that have been maintained
   over the course of the Linux Kernel
   project.

   For many years, Linus Torvalds compiled a
   list of contributors to the Linux kernel
   with each release. There are also usenet
   and email records of contributions made
   prior to 2005. In April of 2005, Linus
   Torvalds created a program now known as
   “Git” which is a version control system
   for tracking changes in computer files and
   coordinating work on those files among
   multiple people. Every Git directory on
   every computer contains an accurate
   repository with complete history and full
   version tracking abilities.  Every Git
   directory captures the identity of
   contributors.  Development of the Linux
   kernel has been tracked and managed using
   Git since April of 2005.

   One of the findings in the report is that
   since the 2.6.11 release in 2005, a total
   of 15,637 developers have contributed to
   the Linux Kernel.

7. I have been an advisor on the Cregit
   project and compared its results to other
   methods that have been used to identify
   contributors and contributions to the
   Linux Kernel, such as a tool known as “git
   blame” that is used by developers to
   identify contributions to a git repository
   such as the repositories used by the Linux
   Kernel project.

8. I have been shown documents related to
   court actions by Patrick McHardy to
   enforce copyright claims regarding the
   Linux Kernel. I have heard many people
   familiar with the court actions discuss
   the cases and the threats of injunction
   McHardy leverages to obtain financial
   settlements. I have not otherwise been
   involved in any of the previous court
   actions.

II. With regard to the facts:

1. The Linux Kernel project started in 1991
   with a release of code authored entirely
   by Linus Torvalds (who is also currently a
   Linux Foundation Fellow).  Since that time
   there have been a variety of ways in which
   contributions and contributors to the
   Linux Kernel have been tracked and
   identified. I am familiar with these
   records.

2. The first record of any contribution
   explicitly attributed to Patrick McHardy
   to the Linux kernel is April 23, 2002.
   McHardy’s last contribution to the Linux
   Kernel was made on November 24, 2015.

3. The Linux Kernel 2.5.12 was released by
   Linus Torvalds on April 30, 2002.

4. After review of the relevant records, I
   conclude that there is no evidence in the
   records that the Kernel community relies
   upon to identify contributions and
   contributors that Patrick McHardy made any
   code contributions to versions of the
   Linux Kernel earlier than 2.4.18 and
   2.5.12. Attached as Exhibit A is a list of
   Kernel releases which have no evidence in
   the relevant records of any contribution
   by Patrick McHardy.

March 03, 2018
Sven Vermeulen a.k.a. swift (homepage, bugs)
Automating compliance checks (March 03, 2018, 12:20 UTC)

With the configuration baseline for a technical service fully described (see the first, second and third post in this series), it is time to consider the validation of the settings in an automated manner. The preferred method for this is to use the Open Vulnerability and Assessment Language (OVAL), which is nowadays managed by the Center for Internet Security, abbreviated as CISecurity. Previously, OVAL was maintained and managed by Mitre under NIST supervision, and Google searches will often still point to the old sites. However, the documentation is now maintained in CISecurity's GitHub repositories.

But I digress...

Read-only compliance validation

One of the main ideas with OVAL is to have a language (XML-based) that represents state information (what something should be) which can be verified in a read-only fashion. Even more, from an operational perspective, it is very important that compliance checks do not alter anything, but only report.

Within its design, OVAL engineering has considered how to properly manage huge sets of assessment rules, and how to document this in an unambiguous manner. In the previous blog posts, ambiguity was resolved through writing style, and not much through actual, enforced definitions.

OVAL enforces this. You can't write a generic or ambiguous rule in OVAL. It is very specific, but that also means that it is daunting to implement the first few times. I've written many OVAL sets, and I still struggle with it (although that's because I don't do it enough in a short time-frame, and need to reread my own documentation regularly).

The capability to perform read-only validation with OVAL leads to a number of possible use cases. The 5.10 specification lists several of them. Basically, it boils down to:

  • vulnerability discovery (is a system vulnerable or not)
  • patch management (is the system patched accordingly or not)
  • configuration management (are the settings according to the rules or not)
  • inventory management (detect what is installed on the system, or what the system's assets are)
  • malware and threat indicator detection (detect if a system has been compromised or particular malware is active)
  • policy enforcement (verify if a client system adheres to particular rules before it is granted access to a network)
  • change tracking (regularly validating the state of a system and keeping track of changes)
  • security information management (centralizing the results of an entire organization or environment and running standard analytics on them)

In this blog post series, I'm focusing on configuration management.

OVAL structure

Although the OVAL standard (just like the XCCDF standard actually) entails a number of major components, I'm going to focus on the OVAL definitions. Be aware though that the results of an OVAL scan are also in a standardized format, as are the results of XCCDF scans for instance.

OVAL definitions have 4 to 5 blocks in them:

  • the definition itself, which describes what is being validated and how. It refers to one or more tests that are to be executed or validated for the definition result to be calculated
  • the test or tests, which are referred to by the definition. In each test, there is at least a reference to an object (what is being tested) and optionally to a state (what should the object look like)
  • the object, which is a unique representation of a resource or resources on the system (a file, a process, a mount point, a kernel parameter, etc.). Object definitions can refer to multiple resources, depending on the definition.
  • the state, which is a sort-of value mapping or validation that needs to be applied to an object to see if it is configured correctly
  • the variable, an optional definition which is what it sounds like: a variable that substitutes an abstract definition with an actual one, allowing to write more reusable tests

Let's get an example going, but without the XML structure, so in human language. We want to define that the Kerberos definition on a Linux system should allow forwardable tickets by default. This is accomplished by ensuring that, inside the /etc/krb5.conf file (which is an INI-style configuration file), the value of the forwardable key inside the [libdefaults] section is set to true.
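
In other words, the fragment of /etc/krb5.conf that we want to validate would look like this:

[libdefaults]
    forwardable = true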

In OVAL, the definition itself will document the above in human readable text, assign it a unique ID (like oval:com.example.oval:def:1) and mark it as being a definition for configuration validation (compliance). Then, it defines the criteria that need to be checked in order to properly validate if the rule is applicable or not. These criteria include validation if the OVAL statement is actually being run on a Linux system (as it makes no sense to run it against a Cisco router) which is Kerberos enabled, and then the criteria of the file check itself. Each criteria links to a test.

The test of the file itself links to an object and a state. There are a number of ways how we can check for this specific case. One is that the object is the forwardable key in the [libdefaults] section of the /etc/krb5.conf file, and the state is the value true. In this case, the state will point to those two entries (through their unique IDs) and define that the object must exist, and all matches must have a matching state. The "all matches" here is not that important, because there will generally only be one such definition in the /etc/krb5.conf file.

Note however that a different approach to the test can be declared as well. We could state that the object is the [libdefaults] section inside the /etc/krb5.conf file, and the state is the value true for the forwardable key. In this case, the test declares that multiple objects must exist, and (at least) one must match the state.

As you can see, the OVAL language tries to map rules to unambiguous definitions. So, what does this look like in OVAL XML?

The OVAL XML structure

The full example contains a few more entries than those we declare next, in order to be complete. The most important definitions though are documented below.

Let's start with the definition. As stated, it will refer to tests that need to match for the definition to be valid.

<definitions>
  <definition id="oval:com.example.oval:def:1" version="1" class="compliance">
    <metadata>
      <title>libdefaults.forwardable in /etc/krb5.conf must be set to true</title>
      <affected family="unix">
        <platform>Red Hat Enterprise Linux 7</platform>
      </affected>
      <description>
        By default, tickets obtained from the Kerberos environment must be forwardable.
      </description>
    </metadata>
    <criteria operator="AND">
      <criterion test_ref="oval:com.example.oval:tst:1" comment="Red Hat Enterprise Linux is installed"/>
      <criterion test_ref="oval:com.example.oval:tst:2" comment="/etc/krb5.conf's libdefaults.forwardable is set to true"/>
    </criteria>
  </definition>
</definitions>

The first thing to keep in mind is the (weird) identification structure. Just like with XCCDF, it is not sufficient to have your own id convention. You need to start an id with oval: followed by the reverse domain definition (here com.example.oval), followed by the type (def for definition) and a sequence number.

Also, take a look at the criteria. Here, two tests need to be compliant (hence the AND operator). However, more complex operations can be done as well. It is even allowed to nest multiple criteria, and to refer to previous definitions, like so (taken from the ssg-rhel6-oval.xml file):

<criteria comment="package hal removed or service haldaemon is not configured to start" operator="OR">
  <extend_definition comment="hal removed" definition_ref="oval:ssg:def:211"/>
  <criteria operator="AND" comment="service haldaemon is not configured to start">
    <criterion comment="haldaemon runlevel 0" test_ref="oval:ssg:tst:212"/>
    <criterion comment="haldaemon runlevel 1" test_ref="oval:ssg:tst:213"/>
    <criterion comment="haldaemon runlevel 2" test_ref="oval:ssg:tst:214"/>
    <criterion comment="haldaemon runlevel 3" test_ref="oval:ssg:tst:215"/>
    <criterion comment="haldaemon runlevel 4" test_ref="oval:ssg:tst:216"/>
    <criterion comment="haldaemon runlevel 5" test_ref="oval:ssg:tst:217"/>
    <criterion comment="haldaemon runlevel 6" test_ref="oval:ssg:tst:218"/>
  </criteria>
</criteria>

Next, let's look at the tests.

<tests>
  <unix:file_test id="oval:com.example.oval:tst:1" version="1" check_existence="all_exist" check="all" comment="/etc/redhat-release exists">
    <unix:object object_ref="oval:com.example.oval:obj:1" />
  </unix:file_test>
  <ind:textfilecontent54_test id="oval:com.example.oval:tst:2" check="all" check_existence="all_exist" version="1" comment="The value of forwardable in /etc/krb5.conf">
    <ind:object object_ref="oval:com.example.oval:obj:2" />
    <ind:state state_ref="oval:com.example.oval:ste:2" />
  </ind:textfilecontent54_test>
</tests>

There are two tests defined here. The first test just checks if /etc/redhat-release exists. If not, then the test will fail and the definition itself will evaluate to false (as in, not compliant). This isn't actually a proper definition, because you would want the test to simply not run when it is on a different platform, but for the sake of example and simplicity, let's keep it as is.

The second test will check for the value of the forwardable key in /etc/krb5.conf. To do so, it refers to an object and a state. The test states that all objects must exist (check_existence="all_exist") and that all objects must match the state (check="all").

The object definition looks like so:

<objects>
  <unix:file_object id="oval:com.example.oval:obj:1" comment="The /etc/redhat-release file" version="1">
    <unix:filepath>/etc/redhat-release</unix:filepath>
  </unix:file_object>
  <ind:textfilecontent54_object id="oval:com.example.oval:obj:2" comment="The forwardable key" version="1">
    <ind:filepath>/etc/krb5.conf</ind:filepath>
    <ind:pattern operation="pattern match">^\s*forwardable\s*=\s*((true|false))\w*</ind:pattern>
    <ind:instance datatype="int" operation="equals">1</ind:instance>
  </ind:textfilecontent54_object>
</objects>

The first object is a simple file reference. The second is a text file content object. More specifically, it matches the line inside /etc/krb5.conf which has forwardable = true or forwardable = false in it. A subexpression is captured on it, so that we can refer to that subexpression as part of the test.

The state looks like so:

<states>
  <ind:textfilecontent54_state id="oval:com.example.oval:ste:2" version="1">
    <ind:subexpression datatype="string">true</ind:subexpression>
  </ind:textfilecontent54_state>
</states>

This state refers to the subexpression, and wants it to be true.

Testing the checks with Open-SCAP

The Open-SCAP tool is able to test OVAL statements directly. For instance, with the above definition in a file called oval.xml:

~$ oscap oval eval --results oval-results.xml oval.xml
Definition oval:com.example.oval:def:1: true
Evaluation done.

The output of the command shows that the definition was evaluated successfully. If you want more information, open up the oval-results.xml file which contains all the details about the test. This results file is also very useful while developing OVAL as it shows the entire result of objects, tests and so forth.

For instance, the /etc/redhat-release file was only checked to see if it exists, but the results file shows what other parameters can be verified with it as well:

<unix-sys:file_item id="1233781" status="exists">
  <unix-sys:filepath>/etc/redhat-release</unix-sys:filepath>
  <unix-sys:path>/etc</unix-sys:path>
  <unix-sys:filename>redhat-release</unix-sys:filename>
  <unix-sys:type>regular</unix-sys:type>
  <unix-sys:group_id datatype="int">0</unix-sys:group_id>
  <unix-sys:user_id datatype="int">0</unix-sys:user_id>
  <unix-sys:a_time datatype="int">1515186666</unix-sys:a_time>
  <unix-sys:c_time datatype="int">1514927465</unix-sys:c_time>
  <unix-sys:m_time datatype="int">1498674992</unix-sys:m_time>
  <unix-sys:size datatype="int">52</unix-sys:size>
  <unix-sys:suid datatype="boolean">false</unix-sys:suid>
  <unix-sys:sgid datatype="boolean">false</unix-sys:sgid>
  <unix-sys:sticky datatype="boolean">false</unix-sys:sticky>
  <unix-sys:uread datatype="boolean">true</unix-sys:uread>
  <unix-sys:uwrite datatype="boolean">true</unix-sys:uwrite>
  <unix-sys:uexec datatype="boolean">false</unix-sys:uexec>
  <unix-sys:gread datatype="boolean">true</unix-sys:gread>
  <unix-sys:gwrite datatype="boolean">false</unix-sys:gwrite>
  <unix-sys:gexec datatype="boolean">false</unix-sys:gexec>
  <unix-sys:oread datatype="boolean">true</unix-sys:oread>
  <unix-sys:owrite datatype="boolean">false</unix-sys:owrite>
  <unix-sys:oexec datatype="boolean">false</unix-sys:oexec>
  <unix-sys:has_extended_acl datatype="boolean">false</unix-sys:has_extended_acl>
</unix-sys:file_item>

Now, this is just at the OVAL level. The final step is to link it in the XCCDF file.

Referring to OVAL in XCCDF

The XCCDF Rule entry allows for a check element, which refers to an automated check for compliance.

For instance, the above rule could be referred to like so:

<Rule id="xccdf_com.example_rule_krb5-forwardable-true">
  <title>Enable forwardable tickets on RHEL systems</title>
  ...
  <check system="http://oval.mitre.org/XMLSchema/oval-definitions-5">
    <check-content-ref href="oval.xml" name="oval:com.example.oval:def:1" />
  </check>
</Rule>

With this set in the Rule, Open-SCAP can validate it while checking the configuration baseline:

~$ oscap xccdf eval --oval-results --results xccdf-results.xml xccdf.xml
...
Title   Enable forwardable kerberos tickets in krb5.conf libdefaults
Rule    xccdf_com.example_rule_krb5-forwardable-tickets
Ident   RHEL7-01007
Result  pass

A huge advantage here is that, alongside the detailed results of the run, there is also more human-readable output, as it shows the title of the Rule being checked.

The detailed capabilities of OVAL

In the above example I've used two checks: a file validation (against /etc/redhat-release) and a file content one (against /etc/krb5.conf). However, OVAL supports many more checks, and it also has constraints that you need to be aware of.

In the OVAL Project github account, the Language repository keeps track of the current documentation. By browsing through it, you'll notice that the OVAL capabilities are structured based on the target technology that you can check. Right now, this is AIX, Android, Apple iOS, Cisco ASA, Cisco CatOS, VMWare ESX, FreeBSD, HP-UX, Cisco iOS and iOS-XE, Juniper JunOS, Linux, MacOS, NETCONF, Cisco PIX, Microsoft SharePoint, Unix (generic), Microsoft Windows, and independent.

The independent one contains tests and support for resources that are often reusable across different platforms (as long as your OVAL and XCCDF supporting tools can run them on those platforms). A few notable supporting tests are:

  • filehash58_test, which can check for a number of common hashes (such as SHA-512 and MD5). This is useful when you want to make sure that a particular (binary or otherwise) file is available on the system. In enterprises, this could be useful for license files, or specific library files.
  • textfilecontent54_test, which can check the content of a file, with support for regular expressions.
  • xmlfilecontent_test, which is a specialized test toward XML files.

Keep in mind though that, as we have seen above, INI files specifically have no specialization available. It would be nice if CISecurity would develop support for common textual data formats, such as CSV (although that one is easily interpretable with the existing ones), JSON, YAML and INI.

The unix one contains tests specific to Unix and Unix-like operating systems (so yes, it is also useful for Linux), and together with the linux one a wide range of configurations can be checked. This includes support for generic extended attributes (fileextendedattribute_test) as well as SELinux specific rules (selinuxboolean_test and selinuxsecuritycontext_test), network interface settings (interface_test), runtime processes (process58_test), kernel parameters (sysctl_test), installed software tests (such as rpminfo_test for RHEL and other RPM enabled operating systems) and more.

March 01, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)

When the system-wide pip turns out to be too old (e.g. for lacking support for pip check), one may end up trying to update pip using a command like:

sudo pip install --upgrade pip

That’s likely to end up with this message:

Not uninstalling pip at /usr/lib/python2.7/dist-packages, owned by OS

That non-error and the confusion that easily happens right after are why I’m writing this post.

So let’s look at the whole thing in a bit more context on a shell, a Debian jessie one in this case:

# cat /etc/debian_version 
8.10

# pip install --upgrade pip ; echo $?
Downloading/unpacking pip from https://pypi.python.org/packages/b6[..]44
  Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Installing collected packages: pip
  Found existing installation: pip 1.5.6
    Not uninstalling pip at /usr/lib/python2.7/dist-packages, owned by OS
Successfully installed pip
Cleaning up...
0

# pip --version
pip 1.5.6 from /usr/lib/python2.7/dist-packages (python 2.7)

Now the interesting part is that it looks like pip was not updated. That impression is false: the latest pip has been installed successfully (to /usr/local/bin). One of two things is going on here:

a) Unexpected PATH resolution order

You have /usr/bin/ before /usr/local/bin/ in $PATH, e.g. as root on Debian jessie, so the new pip has no chance of winning the race of path resolution for pip. For example:

# sed 's,:,\n,g' <<<"$PATH"
/bin
/sbin
/usr/bin
/usr/sbin
/usr/local/bin
/usr/local/sbin
/opt/bin
/usr/lib/llvm/5/bin
/usr/lib/llvm/4/bin

b) Location hashing at shell level

Your shell has hashed the old location of pip (as Bash would do) and “hides” the new version from you in the current shell session. To see that in action, we utilize the Bash builtins type and hash:

# type pip
pip is hashed (/usr/bin/pip)

# pip --version
pip 1.5.6 from /usr/lib/python2.7/dist-packages (python 2.7)

# hash -d pip

# type pip
pip is /usr/local/bin/pip

# pip --version
pip 9.0.1 from /usr/local/lib/python2.7/dist-packages (python 2.7)

So in either case you can run a recent pip from /usr/local/bin/pip right after pip install --upgrade pip; there is no need to resort to get-pip.py or the like, in fact.

This “crazy” vulnerability in LibreOffice only came to my attention recently:

LibreOffice < 6.0.1 – ‘=WEBSERVICE’ Remote Arbitrary File Disclosure (exploit-db.com)

Please make sure your peers update in time.

February 28, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
Evaluating ScyllaDB for production 2/2 (February 28, 2018, 10:32 UTC)

In my previous blog post, I shared 7 lessons on our experience in evaluating Scylla for production.

Those lessons were focused on the setup and execution of the POC, and I promised a follow-up blog post with the technical details and lessons learned from the POC. Here it is!

Before you read on, be mindful that our POC was set up to test workloads and workflows, not to benchmark technologies. So even if the Scylla figures are great, they have not been the main drivers of the actual conclusion of the POC.

Business context

As a data driven company working in the Marketing and Advertising industry, we help our clients make sense of multiple sources of data to build and improve their relationship with their customers and prospects.

Dealing with multiple sources of data is nothing new, but their volume has dramatically changed during the past decade. I will spare you the Big-Data-means-nothing term and the technical challenges that come with it, as you have already heard enough of it.

Still, it is clear that our line of business is tied to our capacity to mix and correlate a massive amount of different types of events (data sources/types) coming from various sources, which all have their own identifiers (think primary keys):

  • Web navigation tracking: identifier is a cookie that’s tied to the tracking domain (we have our own)
  • CRM databases: usually the email address or an internal account ID serve as an identifier
  • Partners’ digital platform: identifier is also a cookie tied to their tracking domain

To try to make things simple, let’s take a concrete example:

You work for UNICEF and want to optimize their banner ads budget by targeting the donors of their last fundraising campaign.

  • Your reference user database is composed of the donors who registered with their email address on the last campaign: main identifier is the email address.
  • To buy web display ads, you use an Ad Exchange partner such as AppNexus or DoubleClick (Google). From their point of view, users are seen as cookie IDs which are tied to their own domain.

So you basically need to be able to translate an email address to a cookie ID for every partner you work with.

Use case: ID matching tables

We operate and maintain huge ID matching tables for every partner and a great deal of our time is spent translating those IDs from one to another. In SQL terms, we are basically doing JOINs between a dataset and those ID matching tables.

  • You select your reference population
  • You JOIN it with the corresponding ID matching table
  • You get a matched population that your partner can recognize and interact with

Those ID matching tables have a pretty high read AND write throughput because they’re updated and queried all the time.

Usual figures are JOINs between a 10+ million row dataset and 1.5+ billion row ID matching tables.

The reference query basically looks like this:

SELECT count(m.partnerid)
FROM population_10M_rows AS p JOIN partner_id_match_400M_rows AS m
ON p.id = m.id

Current implementations

We operate a lambda architecture where we handle real time ID matching using MongoDB and batch ones using Hive (Apache Hadoop).

The first downside to note is that it requires us to maintain two copies of every ID matching table. We also couldn’t choose one over the other, because neither MongoDB nor Hive can sustain the read/write (lookup/update) ratio while staying within the low latencies that we need.

This is an operational burden and requires quite a lot of engineering to ensure data consistency between the different data stores.

Production hardware overview:

  • MongoDB is running on a 15 nodes (5 shards) cluster
    • 64GB RAM, 2 sockets, RAID10 SAS spinning disks, 10Gbps dual NIC
  • Hive is running on 50+ YARN NodeManager instances
    • 128GB RAM, 2 sockets, JBOD SAS spinning disks, 10Gbps dual NIC

Target implementation

The key question is simple: is there a technology out there that can sustain our ID matching table workloads while maintaining consistently low upsert/write and lookup/read latencies?

Having one technology to handle both use cases would allow:

  • Simpler data consistency
  • Operational simplicity and efficiency
  • Reduced costs

POC hardware overview:

So we decided to find out if Scylla could be that technology. For this, we used three decommissioned machines that we had in the basement of our Paris office.

  • 2 DELL R510
    • 19GB RAM, 2 socket 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
  • 1 DELL R710
    • 19GB RAM, 2 socket 4 cores, RAID0 SAS spinning disks, 1Gbps NIC

I know, these are not glamorous machines and they are even inconsistent in specs, but we still set up a 3 node Scylla cluster running Gentoo Linux with them.

Our take? If those three lousy machines can challenge or beat the production machines on our current workloads, then Scylla can seriously be considered for production.

Step 1: Validate a schema model

Once the POC document was complete and the ScyllaDB team understood what we were trying to do, we started iterating on the schema model using a query based modeling strategy.

So we wrote down and rated the questions that our model(s) should answer; they included questions like:

  • What are all our cookie IDs associated to the given partner ID ?
  • What are all the cookie IDs associated to the given partner ID over the last N months ?
  • What is the last cookie ID/date for the given partner ID ?
  • What is the last date we have seen the given cookie ID / partner ID couple ?

As you can imagine, the reverse questions are also to be answered so ID translations can be done both ways (ouch!).

Prototyping

It is no news that I’m a Python addict, so I did all my prototyping using Python and the cassandra-driver.

I ended up with a test-driven data modelling strategy built on pytest. I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.
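
As an illustration, one of those tests could look like the sketch below; the contact point and the expected values are hypothetical, but the keyspace and table match the schema shown further down:

import pytest
from cassandra.cluster import Cluster

@pytest.fixture(scope="session")
def session():
    # one session for the whole test run, against the POC cluster
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("test_keyspace")
    yield session
    cluster.shutdown()

def test_last_cookie_id_for_partnerid(session):
    # "What is the last cookie ID/date for the given partner ID?"
    # The DESC clustering order on date makes LIMIT 1 return the latest row.
    row = session.execute(
        "SELECT id, date FROM ids_by_partnerid WHERE partnerid = %s LIMIT 1",
        ("some-partner-id",),
    ).one()
    assert row is not None
    assert row.id == "expected-cookie-id"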

Schema

In our case, we ended up with three denormalized tables to answer all the questions we had. To answer the first three questions above, you could use the schema below:

CREATE TABLE IF NOT EXISTS ids_by_partnerid(
 partnerid text,
 id text,
 date timestamp,
 PRIMARY KEY ((partnerid), date, id)
 )
 WITH CLUSTERING ORDER BY (date DESC)

Note on clustering key ordering

One important learning I got in the process of validating the model is about the internals of Cassandra’s file format, which resulted in the choice of a descending order (DESC) on the date clustering key, as you can see above.

If your main querying use case is to look for the latest value in a history-like table design like ours, then make sure to change the default ASC order of your clustering key to DESC. This will ensure that the latest values (rows) are stored at the beginning of the sstable file, effectively reducing the read latency when the row is not in cache!

Let me quote Glauber Costa’s detailed explanation on this:

Basically in Cassandra’s file format, the index points to an entire partition (for very large partitions there is a hack to avoid that, but the logic is mostly the same). So if you want to read the first row, that’s easy you get the index to the partition and read the first row. If you want to read the last row, then you get the index to the partition and do a linear scan to the next.

This is the kind of learning you can only get from experts like Glauber and that can justify the whole POC on its own!

Step 2: Set up scylla-grafana-monitoring

As I said before, make sure to set up and run the scylla-grafana-monitoring project before running your test workloads. This easy to run solution will be of great help in understanding the performance of your cluster and in tuning your workload for optimal performance.

If you can, also discuss with the ScyllaDB team to allow them to access the Grafana dashboard. This will be very valuable, since they know where to look better than we usually do… I gained a lot of understanding thanks to this!

Note on scrape interval

I advise you to lower the Prometheus scrape interval to have a shorter and finer sampling of your metrics. This will allow your dashboard to be more reactive when you start your test workloads.

For this, change the prometheus/prometheus.yml file like this:

scrape_interval: 2s # Scrape targets every 2 seconds (5s default)
scrape_timeout: 1s # Timeout before trying to scrape a target again (4s default)

Test your monitoring

Before going any further, I strongly advise you to run a stress test on your POC cluster using the cassandra-stress tool and share the results and their monitoring graphs with the ScyllaDB team.

This will give you a common understanding of the general performances of your cluster as well as help in outlining any obvious misconfiguration or hardware problem.

Key graphs to look at

There are a lot of interesting graphs so I’d like to share the ones that I have been mainly looking at. Remember that depending on your test workloads, some other graphs may be more relevant for you.

  • number of open connections

You’ll want to see a steady and high enough number of open connections, which will prove that your clients are pushed to their maximum (at the time of testing this graph was not on Grafana and you had to add it yourself).

  • cache hits / misses

Depending on your reference dataset, you’ll obviously see that cache hits and misses will have a direct correlation with disk I/O and overall performances. Running your test workloads multiple times should trigger higher cache hits if your RAM is big enough.

  • per shard/node distribution

The Requests Served per shard graph should display a nicely distributed load between your shards and nodes so that you’re sure that you’re getting the best out of your cluster.

The same is true for almost every other “per shard/node” graph: you’re looking for evenly distributed load.

  • sstable reads

Directly linked with your disk performances, you’ll be trying to make sure that you have almost no queued sstable reads.

Step 3: Get your reference data and metrics

We obviously need to have some reference metrics on our current production stack so we can compare them with the results on our POC Scylla cluster.

Whether you choose to use your current production machines or set up a similar stack on the side to run your test workloads is up to you. We chose to run the vast majority of our tests on our current production machines to be as close to our real workloads as possible.

Prepare a reference dataset

During your work on the POC document, you should have detailed the usual data cardinality and volume you work with. Use this information to set up a reference dataset that you can use on all of the platforms that you plan to compare.

In our case, we chose a 10 Million reference dataset that we JOINed with a 400+ Million extract of an ID matching table. Those volumes seemed easy enough to work with and allowed some nice ratio for memory bound workloads.

Measure on your current stack

Then it’s time to load this reference dataset on your current platforms.

  • If you run a MongoDB cluster like we do, make sure to shard and index the dataset just like you do on the production collections.
  • On Hive, make sure to respect the storage file format of your current implementations as well as their partitioning.

If you chose to run your test workloads on your production machines, make sure to run them multiple times and at different hours of the day and night so you can correlate the measures with the load on the cluster at the time of the tests.

Reference metrics

For the sake of simplicity I’ll focus on the Hive-only batch workloads. I performed a count on the JOIN of the dataset and the ID matching table using Spark 2 and then I also ran the JOIN using a simple Hive query through Beeline.

I gave the following definitions of the reference load:

  • IDLE: YARN available containers and free resources are optimal, parallelism is very limited
  • NORMAL: YARN sustains some casual load, parallelism exists but we are not bound by anything still
  • HIGH: YARN has pending containers, parallelism is high and applications have to wait for containers before executing

There’s always an error margin on the results you get, and I found that there were no significant enough differences between the results using Spark 2 and Beeline, so I stuck with a single set of results:

  • IDLE: 2 minutes, 15 seconds
  • NORMAL: 4 minutes
  • HIGH: 15 minutes

Step 4: Get Scylla in the mix

It’s finally time to do your best to break Scylla, or at least to push it to its limits on your hardware… But most importantly, you’ll be looking to understand what those limits are depending on your test workloads, as well as outlining all the tuning that you will be required to do on the client side to reach those limits.

Speaking about the results, we will have to differentiate two cases:

  1. The Scylla cluster is fresh and its cache is empty (cold start): performance is mostly Disk I/O bound
  2. The Scylla cluster has been running some test workload already and its cache is hot: performance is mostly Memory bound with some Disk I/O depending on the size of your RAM

Spark 2 / Scala test workload

Here I used Scala (yes, I did) and DataStax’s spark-cassandra-connector so I could use the magic joinWithCassandraTable function.

  • spark-cassandra-connector-2.0.1-s_2.11.jar
  • Java 7

I had to stick with the 2.0.1 version of the spark-cassandra-connector because newer versions (2.0.5 at the time of testing) were performing badly for no apparent reason. The ScyllaDB team couldn’t help on this.

You can interact with your test workload using the spark2-shell:

spark2-shell --jars jars/commons-beanutils_commons-beanutils-1.9.3.jar,jars/com.twitter_jsr166e-1.1.0.jar,jars/io.netty_netty-all-4.0.33.Final.jar,jars/org.joda_joda-convert-1.2.jar,jars/commons-collections_commons-collections-3.2.2.jar,jars/joda-time_joda-time-2.3.jar,jars/org.scala-lang_scala-reflect-2.11.8.jar,jars/spark-cassandra-connector-2.0.1-s_2.11.jar

Then use the following Scala imports:

// main connector import
import com.datastax.spark.connector._

// the joinWithCassandraTable failed without this (dunno why, I'm no Scala guy)
import com.datastax.spark.connector.writer._
implicit val rowWriter = SqlRowWriter.Factory

Finally I could run my test workload to select the data from Hive and JOIN it with Scylla easily:

val df_population = spark.sql("SELECT id FROM population_10M_rows")
val join_rdd = df_population.rdd.repartitionByCassandraReplica("test_keyspace", "partner_id_match_400M_rows").joinWithCassandraTable("test_keyspace", "partner_id_match_400M_rows")
val joined_count = join_rdd.count()

Notes on tuning spark-cassandra-connector

I experienced pretty crappy performance at first. Thanks to the easy Grafana monitoring, I could see that Scylla was not the bottleneck at all, and that I instead had trouble getting some real load onto it.

So I engaged in a thorough tuning of the spark-cassandra-connector with the help of Glauber… and it was pretty painful, but we finally made it and got the best parameters to bring the load on the Scylla cluster close to 100% when running the test workloads.

This tuning was done in the spark-defaults.conf file:

  • have a fixed set of executors and boost their overhead memory

This will increase test results reliability by making sure you always have a reserved number of available workers at your disposal.

spark.dynamicAllocation.enabled=false
spark.executor.instances=30
spark.yarn.executor.memoryOverhead=1024
  • set the split size to 1MB

The default is 8MB, but Scylla uses a split size of 1MB, so you’ll see a great boost in performance and stability by setting this to the matching number.

spark.cassandra.input.split.size_in_mb=1
  • align driver timeouts with server timeouts

It is advised to make sure that your read request timeouts are the same on the driver and the server, so you do not get stalled states waiting for a timeout to happen on one end. You can do the same with write timeouts if your test workloads are write intensive.

/etc/scylla/scylla.yaml

read_request_timeout_in_ms: 150000

spark-defaults.conf

spark.cassandra.connection.timeout_ms=150000
spark.cassandra.read.timeout_ms=150000

// optional if you want to fail / retry faster for HA scenarios
spark.cassandra.connection.reconnection_delay_ms.max=5000
spark.cassandra.connection.reconnection_delay_ms.min=1000
spark.cassandra.query.retry.count=100
  • adjust your reads per second rate

Last but surely not least, you will need to experiment to find the best value for this setting yourself, since it has a direct impact on the load on your Scylla cluster. You will be looking at pushing your POC cluster to almost 100% load.

spark.cassandra.input.reads_per_sec=6666

As I said before, I could only get this to work perfectly using the 2.0.1 version of the spark-cassandra-connector driver. But then it worked very well and with great speed.

Spark 2 results

Once tuned, the best results I was able to reach on this hardware are listed below. It’s interesting to see that, with spinning disks, the cold start result can compete with the results of a heavily loaded Hadoop cluster where pending containers and parallelism are knocking down its performance.

  • hot cache: 2min
  • cold cache: 12min

Wow! Those three refurbished machines can compete with our current production machines and implementations, they can even match an idle Hive cluster of a medium size!

Python test workload

I couldn’t conclude on a Scala/Spark 2 only test workload. So I obviously went back to my language of choice, Python, only to discover to my disappointment that there is no joinWithCassandraTable equivalent available in pyspark.

I tried some projects claiming otherwise, with no success, until I changed my mind and decided that I probably didn’t need Spark 2 at all. So I went on the crazy quest of beating Spark 2’s performance using a pure Python implementation.

This basically means that instead of having a JOIN-like helper, I had to do a massive amount of single “id -> partnerid” lookups. Simple but greatly inefficient, you say? Really?

When I broke down the pieces, I was left with the following steps to implement and optimize:

  • Load the 10M rows worth of population data from Hive
  • For every row, lookup the corresponding partnerid in the ID matching table from Scylla
  • Count the resulting number of matches

The main problem in competing with Spark 2 is that it is a distributed framework, and Python by itself is not. So you can’t possibly imagine outperforming Spark 2 with a single machine.

However, let’s remember that Spark 2 is shipped and run on executors using YARN, so we are firing up JVMs and dispatching containers all the time. This is a quite expensive process that we have a chance to avoid using Python!

So what I needed was a distributed computation framework that would allow me to load data in a partitioned way and run the lookups on all the partitions in parallel before merging the results. In Python, this framework exists and is named Dask!

You will obviously need to deploy a dask topology (that’s easy and well documented) so you have a comparable number of dask workers to Spark 2 executors (30 in my case).

The corresponding Python code samples are here.

Hive + Scylla results

Reading the population IDs from Hive, the workload can be split and executed concurrently on multiple dask workers; a condensed sketch follows the list below.

  • read the 10M population rows from Hive in a partitioned manner
  • for each partition (slice of 10M), query Scylla to lookup the possibly matching partnerid
  • create a dataframe from the resulting matches
  • gather back all the dataframes and merge them
  • count the number of matches
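
A condensed sketch of those steps, with hypothetical addresses and paths, and with simple one-by-one lookups instead of the concurrent ones discussed below, could look like this:

import dask.bag as db
from dask.distributed import Client
from cassandra.cluster import Cluster

def lookup_partition(ids):
    # one Scylla session per partition, running on whichever dask worker got it
    cluster = Cluster(["scylla-node1"])
    session = cluster.connect("test_keyspace")
    matches = []
    for i in ids:
        row = session.execute(
            "SELECT partnerid FROM partner_id_match_400M_rows WHERE id = %s", (i,)
        ).one()
        if row:
            matches.append(row.partnerid)
    cluster.shutdown()
    return matches

client = Client("dask-scheduler:8786")  # hypothetical scheduler address

# read the population in a partitioned manner, look up each partition in
# parallel on the workers, then count the matches
population = db.read_text("hdfs:///data/population_10M_rows/*.csv").map(str.strip)
print(population.map_partitions(lookup_partition).count().compute())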

The results showed that it is possible to compete with Spark 2 with Dask:

  • hot cache: 2min (rounded up)
  • cold cache: 6min

Interestingly, those almost two minutes can be broken down like this:

  • distributed read data from Hive: 50s
  • distributed lookup from Scylla: 60s
  • merge + count: 10s

This meant that if I could cut down the reading of data from Hive I could go even faster!

Parquet + Scylla results

Going further on my previous remark I decided to get rid of Hive and put the 10M rows population data in a parquet file instead. I ended up trying to find out the most efficient way to read and load a parquet file from HDFS.

My conclusion so far is that you can’t beat the amazing libhdfs3 + pyarrow combo. It is faster to load everything on a single machine than to load from Hive on multiple ones!

The results showed that I could almost get rid of a whole minute in the total process, effectively and easily beating Spark 2!

  • hot cache: 1min 5s
  • cold cache: 5min

Notes on the Python cassandra-driver

Tests using Python showed robust queries, experiencing far fewer failures than the spark-cassandra-connector, especially during the cold start scenario.

  • The usage of execute_concurrent() provides a clean and linear interface to submit a large number of queries while providing some level of concurrency control (see the sketch after this list)
  • Increasing the concurrency parameter from 100 to 512 provided additional throughput, but increasing it further brought no benefit
  • Protocol version 4 forbids hand tuning the number of connections and requests, in favour of some sort of auto configuration. All attempts to hand-tune it (by lowering the protocol version to 2) failed to achieve higher throughput
  • Installing libev on the system allows the cassandra-driver to use it to handle concurrency instead of asyncore, with a somewhat lower load footprint on the worker node but no noticeable change in throughput
  • When reading a parquet file stored on HDFS, the hdfs3 + pyarrow combo provides an insane speed (less than 10s to fully load 10M rows of a single column)
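
For reference, a minimal sketch of the execute_concurrent_with_args() pattern used for the lookups could look like this (the contact point is hypothetical; keyspace and table are the ones used throughout this post):

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["scylla-node1"])
session = cluster.connect("test_keyspace")

# prepared statement: one lookup per population id
lookup = session.prepare(
    "SELECT partnerid FROM partner_id_match_400M_rows WHERE id = ?"
)

ids = [("some-cookie-id",), ("another-cookie-id",)]  # one bind tuple per query
results = execute_concurrent_with_args(session, lookup, ids, concurrency=512)

matches = []
for success, rows in results:
    # rows is a ResultSet on success, an Exception otherwise
    if not success:
        continue
    row = rows.one()
    if row:
        matches.append(row.partnerid)
print(len(matches))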

Step 5: Play with High Availability

I was quite disappointed and surprised by the lack of maturity of the Cassandra community on this critical topic. Maybe the main reason is that the cassandra-driver allows for too many levels of configuration and strategies.

I wrote this simple bash script to allow me to simulate node failures. Then I could play with handling those failures and retries in the Python client code.

#!/bin/bash

# flush any leftover rules from a previous run
iptables -t filter -X
iptables -t filter -F

# drop all Scylla-related traffic (CQL, Thrift, metrics, REST API, inter-node)
ip="0.0.0.0/0"
for port in 9042 9160 9180 10000 7000; do
	iptables -t filter -A INPUT -p tcp --dport ${port} -s ${ip} -j DROP
	iptables -t filter -A OUTPUT -p tcp --sport ${port} -d ${ip} -j DROP
done

# display the rule counters every second until interrupted with Ctrl+C
while true; do
	trap break INT
	clear
	iptables -t filter -vnL
	sleep 1
done

# restore connectivity once interrupted
iptables -t filter -X
iptables -t filter -F
iptables -t filter -vnL

This topic is worth going into in more detail in a dedicated blog post, which I shall write later on, with code samples.

Concluding the evaluation

I’m happy to say that Scylla passed our production evaluation and will soon go live on our infrastructure!

As I said at the beginning of this post, the conclusion of the evaluation has not been driven by the good figures we got out of our test workloads. Those are no benchmarks and never pretended to be, but we could still prove that performance was solid enough not to be a blocker to the adoption of Scylla.

Instead we decided on the following points of interest (in no particular order):

  • data consistency
  • production reliability
  • datacenter awareness
  • ease of operation
  • infrastructure rationalisation
  • developer friendliness
  • costs

On the side, I tried Scylla on two other use cases, which will prove interesting to follow up on later in order to displace MongoDB again…

Moving to production

Since our relationship was great, we also decided to partner with ScyllaDB and support them by subscribing to their enterprise offerings. They also agreed to support us running Gentoo Linux!

We are starting with a three-node heavy-duty cluster:

  • DELL R640
    • dual socket 2,6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3,2 TB

I’m eager to see ScyllaDB building up and will continue to help with my modest contributions. Thanks again to the ScyllaDB team for their patience and support during the POC!

February 25, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)
Fantasyland: in the world of IPv6 only networks (February 25, 2018, 21:04 UTC)

It seems to be the time of the year when geeks think that IPv6 is perfect, ready to be used, and the best thing after sliced bread (or canned energy drinks). Over on Twitter, someone pointed out to me that FontAwesome (which is used by the Hugo theme I’m using) is not accessible over an IPv6-only network, and as such the design of the site is broken. I’ll leave aside my comments on FontAwesome because they are not relevant to the rant at hand.

You may remember I called IPv6-only networks unrealistic two years ago, and I called IPv6 itself a geeks’ wet dream last year. You should then not be surprised to find me calling this Fantasyland a year later.

First of all, I want to make perfectly clear that I’m not advocating that IPv6 deployment should stop or slow down. I really wish it were actually faster, for purely selfish reasons I’ll get to later. Unfortunately I suffered a setback when I moved to London, as Hyperoptic does not have IPv6 deployed, at least in my building, yet. But they provide a great service, for a reasonable price, so I have no intention to switch to something like A&A just to get good IPv6 connectivity right now.

$ host hyperoptic.com
hyperoptic.com has address 52.210.255.19
hyperoptic.com has address 52.213.148.25
hyperoptic.com mail is handled by 0 hyperoptic-com.mail.eo.outlook.com.

$ host www.hyperoptic.com
www.hyperoptic.com has address 52.210.255.19
www.hyperoptic.com has address 52.213.148.25

$ host www.virginmedia.com
www.virginmedia.com has address 213.105.9.24

$ host www.bt.co.uk
www.bt.co.uk is an alias for www.bt.com.
www.bt.com has address 193.113.9.162
Host www.bt.com not found: 2(SERVFAIL)

$ host www.sky.com
www.sky.com is an alias for www.sky.com.edgekey.net.
www.sky.com.edgekey.net is an alias for e1264.g.akamaiedge.net.
e1264.g.akamaiedge.net has address 23.214.120.203

$ host www.aaisp.net.uk
www.aaisp.net.uk is an alias for www.aa.net.uk.
www.aa.net.uk has address 81.187.30.68
www.aa.net.uk has address 81.187.30.65
www.aa.net.uk has IPv6 address 2001:8b0:0:30::65
www.aa.net.uk has IPv6 address 2001:8b0:0:30::68

I’ll get back to this later.

IPv6 is great for complex backend systems: each host gets its own uniquely-addressable IP, so you don’t have to bother with jumphosts, proxycommands, and so on and so forth. Depending on the complexity of your backend, you can containerize single applications and then have a single address per application. It’s a gorgeous thing. But as you move towards user facing frontends, things get less interesting. You cannot get rid of IPv4 on the serving side of any service, because most of your visitors are likely reaching you over IPv4, and that’s unlikely to change for quite a while longer still.

Of course the IPv4 address exhaustion is a real problem, and it’s hitting ISPs all over the world right now. Mobile providers already started deploying networks that only provide users with IPv6 addresses, and then use NAT64 to allow them to connect to the rest of the world. This is not particularly different from using an old-school IPv4 carrier-grade NAT (CGN), which is a requirement of DS-Lite, but I’m told it can get better performance and cost less to maintain. It also has the advantage of reducing the number of different network stacks that need to be involved.

And in general, having to deal with CGN and NAT64 adds extra work, latency, and in general bad performance to a network, which is why gamers, as an example, tend to prefer having a single-stack network, one way or the other.

$ host store.steampowered.com
store.steampowered.com has address 23.214.51.115

$ host www.gog.com
www.gog.com is an alias for gog.com.edgekey.net.
gog.com.edgekey.net is an alias for e11072.g.akamaiedge.net.
e11072.g.akamaiedge.net has address 2.19.61.131

$ host my.playstation.com
my.playstation.com is an alias for my.playstation.com.edgekey.net.
my.playstation.com.edgekey.net is an alias for e14413.g.akamaiedge.net.
e14413.g.akamaiedge.net has address 23.214.116.40

$ host www.xbox.com
www.xbox.com is an alias for www.xbox.com.akadns.net.
www.xbox.com.akadns.net is an alias for wildcard.xbox.com.edgekey.net.
wildcard.xbox.com.edgekey.net is an alias for e1822.dspb.akamaiedge.net.
e1822.dspb.akamaiedge.net has address 184.28.57.89
e1822.dspb.akamaiedge.net has IPv6 address 2a02:26f0:a1:29e::71e
e1822.dspb.akamaiedge.net has IPv6 address 2a02:26f0:a1:280::71e

$ host www.origin.com
www.origin.com is an alias for ea7.com.edgekey.net.
ea7.com.edgekey.net is an alias for e4894.e12.akamaiedge.net.
e4894.e12.akamaiedge.net has address 2.16.57.118

But multiple other options started spawning around, trying to tackle the address exhaustion problem faster than the deployment of IPv6 is happening. As I already noted above, backend systems, where the end-to-end is under the control of a single entity, are perfect soil for IPv6: there’s no need to allocate real IP addresses to these, even when they have to talk over the proper Internet (with proper encryption and access control, goes without saying). So we won’t see more allocations like Xerox’s or Ford’s of a whole /8 for backend systems.

$ host www.xerox.com
www.xerox.com is an alias for www.xerox.com.edgekey.net.
www.xerox.com.edgekey.net is an alias for e1142.b.akamaiedge.net.
e1142.b.akamaiedge.net has address 23.214.97.123

$ host www.ford.com
www.ford.com is an alias for www.ford.com.edgekey.net.
www.ford.com.edgekey.net is an alias for e4213.x.akamaiedge.net.
e4213.x.akamaiedge.net has address 104.123.94.235

$ host www.xkcd.com
www.xkcd.com is an alias for xkcd.com.
xkcd.com has address 151.101.0.67
xkcd.com has address 151.101.64.67
xkcd.com has address 151.101.128.67
xkcd.com has address 151.101.192.67
xkcd.com has IPv6 address 2a04:4e42::67
xkcd.com has IPv6 address 2a04:4e42:200::67
xkcd.com has IPv6 address 2a04:4e42:400::67
xkcd.com has IPv6 address 2a04:4e42:600::67
xkcd.com mail is handled by 10 ASPMX.L.GOOGLE.com.
xkcd.com mail is handled by 20 ALT2.ASPMX.L.GOOGLE.com.
xkcd.com mail is handled by 30 ASPMX3.GOOGLEMAIL.com.
xkcd.com mail is handled by 30 ASPMX5.GOOGLEMAIL.com.
xkcd.com mail is handled by 30 ASPMX4.GOOGLEMAIL.com.
xkcd.com mail is handled by 30 ASPMX2.GOOGLEMAIL.com.
xkcd.com mail is handled by 20 ALT1.ASPMX.L.GOOGLE.com.

Another technique that slowed down the exhaustion is SNI. This TLS feature allows sharing the same socket among applications with different certificates. Similarly to HTTP virtual hosts, which are now what just about everyone uses, SNI allows the same HTTP server instance to deliver secure connections for multiple websites that do not share their certificate. This may sound totally unrelated to IPv6, but before SNI became widely usable (it’s still not supported by very old Android devices, and Windows XP, but both of those are vastly considered irrelevant in 2018), if you needed to provide different certificates, you needed different sockets, and thus different IP addresses. It would not be uncommon for a company to lease a /28 and point it all at the same frontend system just to deliver per-host certificates — one of my old customers did exactly that, until XP became too old to support, after which they declared it so, and migrated all their webapps behind a single IP address with SNI.
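
To make the mechanism concrete, here is a minimal sketch using Python’s standard library; the server_hostname argument is what ends up in the SNI extension of the handshake, and is how a single address can hand out the right certificate for each site:

import socket
import ssl

ctx = ssl.create_default_context()

# the hostname passed here is sent as the SNI extension during the handshake,
# letting the server pick the matching certificate before any HTTP is spoken
with socket.create_connection(("www.xbox.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="www.xbox.com") as tls:
        print(tls.getpeercert()["subject"])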

Does this mean we should stop caring about the exhaustion? Of course not! But if you are a small(ish) company and you need to focus your efforts on modernizing infrastructure, I would not expect you to focus on IPv6 deployment on the frontends. I would rather hope that you’d prioritize TLS (HTTPS) implementation instead, since I would rather not have malware (including but not limited to “coin” miners) executed on my computer while I read the news! And that is not simple either.

$ host www.bbc.co.uk
www.bbc.co.uk is an alias for www.bbc.net.uk.
www.bbc.net.uk has address 212.58.246.94
www.bbc.net.uk has address 212.58.244.70

$ host www.theguardian.com  
www.theguardian.com is an alias for guardian.map.fastly.net.
guardian.map.fastly.net has address 151.101.1.111
guardian.map.fastly.net has address 151.101.65.111
guardian.map.fastly.net has address 151.101.129.111
guardian.map.fastly.net has address 151.101.193.111

$ host www.independent.ie
www.independent.ie has address 54.230.14.45
www.independent.ie has address 54.230.14.191
www.independent.ie has address 54.230.14.196
www.independent.ie has address 54.230.14.112
www.independent.ie has address 54.230.14.173
www.independent.ie has address 54.230.14.224
www.independent.ie has address 54.230.14.242
www.independent.ie has address 54.230.14.38

Okay, I know these snippets are getting old, and I’m probably beating a dead horse. But what I’m trying to drive home here is that there is very little to gain from supporting IPv6 on frontends today, unless you are an enthusiast or a technology company yourself. I work for a company that believes in it and provides tools, data, and its own services over IPv6, but that is one company. And in full disclosure, I have no involvement in this particular field whatsoever.

In all of the examples above, which are of course neither complete nor statistically meaningful, you can see that there are a few interesting exceptions. In the gaming world, XBox appears to have IPv6 frontends enabled, which is not surprising when you remember that Microsoft even developed one of the first tunnelling protocols to kickstart the adoption of IPv6. And of course XKCD, being run by a technologist and technology enthusiast, couldn’t possibly ignore IPv6; but that’s not what the average user needs from their Internet connection.

Of course, your average user spends a lot of time on platforms created and maintained by technology companies, and Facebook is another big player in the IPv6 landscape, so it has been available over IPv6 for a long while (though the same cannot be said of Twitter). But at the same time, users need their connection to access their bank…

$ host www.chase.com
www.chase.com is an alias for wwwbcchase.gslb.bankone.com.
wwwbcchase.gslb.bankone.com has address 159.53.42.11

$ host www.ulsterbankanytimebanking.ie
www.ulsterbankanytimebanking.ie has address 155.136.22.57

$ host www.barclays.co.uk
www.barclays.co.uk has address 157.83.96.72

$ host www.tescobank.com
www.tescobank.com has address 107.162.133.159

$ host www.metrobank.co.uk
www.metrobank.co.uk has address 94.136.40.82

$ host www.finecobank.com
www.finecobank.com has address 193.193.183.189

$ host www.unicredit.it
www.unicredit.it is an alias for www.unicredit.it-new.gtm.unicreditgroup.eu.
www.unicredit.it-new.gtm.unicreditgroup.eu has address 213.134.65.14

$ host www.aib.ie
www.aib.ie has address 194.69.198.194

to pay their bills…

$ host www.mybills.ie
www.mybills.ie has address 194.125.152.178

$ host www.airtricity.ie
www.airtricity.ie has address 89.185.129.219

$ host www.bordgaisenergy.ie
www.bordgaisenergy.ie has address 212.78.236.235

$ host www.thameswater.co.uk
www.thameswater.co.uk is an alias for aerotwprd.trafficmanager.net.
aerotwprd.trafficmanager.net is an alias for twsecondary.westeurope.cloudapp.azure.com.
twsecondary.westeurope.cloudapp.azure.com has address 52.174.108.182

$ host www.edfenergy.com
www.edfenergy.com has address 162.13.111.217

$ host www.veritasenergia.it
www.veritasenergia.it is an alias for veritasenergia.it.
veritasenergia.it has address 80.86.159.101
veritasenergia.it mail is handled by 10 mail.ascopiave.it.
veritasenergia.it mail is handled by 30 mail3.ascotlc.it.

$ host www.enel.it
www.enel.it is an alias for bdzkx.x.incapdns.net.
bdzkx.x.incapdns.net has address 149.126.74.63

to do shopping…

$ host www.paypal.com
www.paypal.com is an alias for geo.paypal.com.akadns.net.
geo.paypal.com.akadns.net is an alias for hotspot-www.paypal.com.akadns.net.
hotspot-www.paypal.com.akadns.net is an alias for wlb.paypal.com.akadns.net.
wlb.paypal.com.akadns.net is an alias for www.paypal.com.edgekey.net.
www.paypal.com.edgekey.net is an alias for e3694.a.akamaiedge.net.
e3694.a.akamaiedge.net has address 2.19.62.129

$ host www.amazon.com
www.amazon.com is an alias for www.cdn.amazon.com.
www.cdn.amazon.com is an alias for d3ag4hukkh62yn.cloudfront.net.
d3ag4hukkh62yn.cloudfront.net has address 54.230.93.25

$ host www.ebay.com 
www.ebay.com is an alias for slot9428.ebay.com.edgekey.net.
slot9428.ebay.com.edgekey.net is an alias for e9428.b.akamaiedge.net.
e9428.b.akamaiedge.net has address 23.195.141.13

$ host www.marksandspencer.com
www.marksandspencer.com is an alias for prod.mands.com.edgekey.net.
prod.mands.com.edgekey.net is an alias for e2341.x.akamaiedge.net.
e2341.x.akamaiedge.net has address 23.43.77.99

$ host www.tesco.com
www.tesco.com is an alias for www.tesco.com.edgekey.net.
www.tesco.com.edgekey.net is an alias for e2008.x.akamaiedge.net.
e2008.x.akamaiedge.net has address 104.123.91.150

to organize fun with friends…

$ host www.opentable.com
www.opentable.com is an alias for ev-www.opentable.com.edgekey.net.
ev-www.opentable.com.edgekey.net is an alias for e9171.x.akamaiedge.net.
e9171.x.akamaiedge.net has address 84.53.157.26

$ host www.just-eat.co.uk
www.just-eat.co.uk is an alias for 72urm.x.incapdns.net.
72urm.x.incapdns.net has address 149.126.74.216

$ host www.airbnb.com
www.airbnb.com is an alias for cdx.muscache.com.
cdx.muscache.com is an alias for 2-01-57ab-0001.cdx.cedexis.net.
2-01-57ab-0001.cdx.cedexis.net is an alias for evsan.airbnb.com.edgekey.net.
evsan.airbnb.com.edgekey.net is an alias for e864.b.akamaiedge.net.
e864.b.akamaiedge.net has address 173.222.129.25

$ host www.odeon.co.uk
www.odeon.co.uk has address 194.77.82.23

and so on and so forth.

This means that for an average user, an IPv6-only network is not feasible at all, and I think the idea that it is a concept worth validating on real users is dangerous.

What it does not mean is that we should just ignore IPv6 altogether. Instead we should make sure to prioritize it accordingly. It is 2018 and IoT devices are vastly insecure, so the idea of having a publicly-addressable IP for each of the devices in your home is not just uninteresting, but actively frightening to me. And for the companies that need the adoption, I would hope that the priority right now would be proper security, instead of adding an extra layer that would create more unknowns in their stack (because, as is worth noting again since I have had this discussion too, it is not just the network that needs to support IPv6, it is the full application!). And if that means that non-performance-critical backends are not going to be available over IPv6 this century, so be it.

One remark that I’m sure will arrive from at least some readers is that a significant part of the examples given here appear to be hosted on Akamai’s content delivery network, which, as we can tell from XBox’s website, supports IPv6 frontends. “It’s just a button to press, and you get IPv6, it’s not difficult, they are slackers!” is the follow-up I expect. For anyone who has worked in the field long enough, this deserves a facepalm.

The fact that your frontend can receive IPv6 connections does not mean that your backends can cope with it. Whether it is for session validation, fraud detection, or just market analysis, lots of systems need to be able to tell which IP address a connection came from. If your backend can’t cope with IPv6 addresses, your experience may vary between being unable to buy services and receiving useless security alerts. It’s a full stack world.

February 23, 2018
Michal Hrusecky a.k.a. miska (homepage, bugs)
Honeypot as a service (February 23, 2018, 15:00 UTC)

I’m currently working at CZ.NIC, the Czech domain registry, on project Turris, which produces awesome open source WiFi (or WiFi-free) routers. For those we developed quite a few interesting features. One of them is a honeypot that you don’t run on your own hardware (what if somebody managed to escape?); instead, you basically perform a man-in-the-middle on the attacker and forward them to the honeypot we run behind many firewalls. We have had this option on our routers for quite some time. But because plenty of people around the world found the idea really interesting and wanted to join, this part of the project was separated out, has its own team of developers and maintainers, and you can now join with your own server as well! And to make it super easy, packages are already available in Tumbleweed, and also in the security repo where they are built for Leap as well.

How do you get started, how does it work, and what do you get when you join? The first step is to register on the HaaS website. There you can also find an explanation of what HaaS actually is. When you log in, you can create a new computer and generate a token for it. Once you have a token, it’s time to set up the software on your server.

The second step is obviously to install the software. Given that you are using the cool Linux distribution openSUSE Tumbleweed, it is pretty easy: just zypper in haas-proxy.

The last step is configuration. You need to either disable your real ssh daemon or move it to a different port. You can do so easily in /etc/ssh/sshd_config: look for the Port option and change it from 22 to some other fancy number. Don’t forget to open that port on the firewall as well. After calling systemctl restart sshd you should be able to ssh in on the new port, and your port 22 should be free.
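
For example, assuming you pick 2222 as the new port (the number here is just an illustration), the relevant line in /etc/ssh/sshd_config would end up reading:

Port 2222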

Now, do you still remember the token you generated on the HaaS website? You need to enter it into /etc/haas-proxy, in the TOKEN option. And that’s all: call systemctl enable haas-proxy and systemctl start haas-proxy, and the trap is set; all you need to do is wait for your victims to fall in.

Once they do (if you have a public IPv4 address you should have plenty after just a day), you can go to the HaaS website again and browse through the logs of trapped visitors, or even view statistics such as which country attacks you the most!

So enjoy the hunt and let’s trap a lot of bad guys 🙂 By the way, anonymized data from those honeypot sessions is later available for download, and CZ.NIC has security researchers from its CSIRT team working on it. So you are having fun, not compromising your own security, and helping the world all at once: a win, win, win situation 🙂

February 21, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)

It feels like most of what I end up writing nowadays is my misadventures across a wide range of financial service companies. But here we go (I promise I’ll go back to writing about reverse engineering Really Soon Now™).

The last post on this topic was my rant about how Fineco lacks some basic tools needed to use it as a sole, or primary, bank account in the UK. Hopefully they will address this soon, and a sane bank will be available in this country, but for now I had to find alternatives.

Since the various Fintech companies also don’t provide the features I needed, I found myself having to find a “high street bank”. As my experience up to this point with both Barclays and NatWest was not particularly positive, I decided to look for a different option, and having been a mostly-happy customer of Tesco Bank for nearly four years, I decided to give their UK service a try.

At first it appeared to have an online sign-up flow that looked sweet for this kind of problem… except that at the end of it, they told me to wait for a letter requesting paperwork from me. The request turned out to be for proof of identity (which needs to be certified) and proof of address (which needs to be an original). The letter and form, I could swear, are the same ones they sent me when I applied for the Irish credit card, except the information is now correct (in Ireland, the Garda will not certify a passport copy, though it appears UK police forces will).

Let’s ignore the fact that by mailing me at that address, Tesco Bank provided their own proof of address, and let’s focus instead on the fact that they do not accept online print-outs, despite almost every service (and, as I found out now, themselves) defaulting to paperless bills and statements. I actually have had a number of bills mailed to me, including from Hounslow Council, so I had a wide range of choices of what to provide them. But as it turns out, I like a challenge and having some fun with corner cases (particularly as I had already solved the immediate need for a bank account by the time I looked into this, but that’s a story for another day).

Here is a part of the story I have not told yet. When I moved to the UK I expected to have to close every account I still had in Ireland, both because Ulster Bank Private is a bloody expensive service, and because, at least in Italy, I was told I was not entitled to keep credit cards open after leaving the country. So as soon as I was in working order over here, I switched all the billing over to Revolut. Unfortunately I couldn’t do that for at least three services (Online.net, Vodafone Italy and Wind/3 Italy), in two cases because they insist they do not accept anything but Italian cards, while somehow still accepting Tesco Ireland cards.

While trying to figure out an interim solution I found out that Tesco Bank has no problem with me still having the “Irish” credit card, and they even allowed me to change the address (and phone number) on file to my new London one. We hit a snag regarding the SEPA direct debit, but once I pointed out that they were suggesting a breach of the SEPA directives, all was good, and indeed the card is now debited from the EUR Fineco account.

This also means I get that card’s statements at my London address. So of course I ended up sending, to Tesco Bank, as proof of address… a Tesco Bank Ireland credit card statement. As a way of saying “Do you feel silly enough, now?” to whoever had to manually verify my address and send the paperwork back to me. It turns out it worked just fine, and I did not even get a passive-aggressive note about it.

Now let’s put the registration aside and take a look at the services provided. Because if I have to rant, I would at least like to rant with some information that lets others make up their own minds.

First off, as I said, the first part of the registration is online, after which they get in touch with you for the proofs they need. It’s very nice that during the whole time they “keep in touch” by SMS: they remind you to send the paperwork back, they tell you that the account was opened before you receive the snail mail, and so on.

I got a lot of correspondence from Tesco Bank: in addition to the request for proofs, and the proofs being mailed back, I received a notification about the account being opened, the debit card PIN, and a “temporary access number” to sign up online. The debit card arrived separately, via a signature-required delivery. This is a first for me in the UK, as most other cards just got sent through normal mail, except for Fineco, which used FedEx and let me receive the card directly at the office, despite that not matching the proof of address I had sent them.

When signing up for online banking, they ask you for an 8-digit security code, a long(er) password, and a selection of verbal questions and answers, which are the usual terrible security (so, as usual, I answered them at random and noted down what I told them the answers were). They allow you to choose your username, but they suggest it stay the email address on file.

Logging in for the first time from a different computer is extremely awkward: it starts with two digits of the security code, followed by SMS second-factor authentication, followed by the password (the whole password, not a subset thereof, so you can easily use a password manager for this one), all through different forms. The same happens for the Mobile Banking application (which is at least linked directly from their website, and very easy to install). The mobile banking login appears to work fairly reliably (and you’ll see in the next post why I call this out explicitly).

I set up the rent standing order on this account, and it was a straightforward and painless process: it is the same as a one-time transaction, except for ticking an “I want to repeat this every month” checkbox. All in all, it looks to me like a saner UI than Barclays’, and proper enough for my needs. I will of course report back if I find anything particularly different over time.

Arun Raghavan a.k.a. ford_prefect (homepage, bugs)
Applicative Functors for Fun and Parsing (February 21, 2018, 07:24 UTC)

PSA: This post has a bunch of Haskell code, but I’m going to try to make it more broadly accessible. Let’s see how that goes.

I’ve been proceeding apace with my 3rd year in Abhinav’s Haskell classes at Nilenso, and we just got done with the section on Applicative Functors. I’m at the point where I finally “get” it, so I thought I’d document the process, and maybe capture my a-ha moment with Applicatives.

I should point out that the ideas and approach in this post are all based on Abhinav’s class material (and I’ve found them really effective in understanding the underlying concepts). Many thanks are due to him, and any lack of clarity you find ahead is in my own understanding.

Functors and Applicatives

Functors represent a type or a context on which we can meaningfully apply (map) a function. The Functor typeclass is pretty straightforward:

class Functor f where
  fmap :: (a -> b) -> f a -> f b

Easy enough. fmap takes a function that transforms something of type a to type b and a value of type a in a context f. It produces a value of type b in the same context.

The Applicative typeclass adds two things to Functor. First, it gives us a means of putting things inside a context (also called lifting). Second, it gives us a means of applying a function that is itself within a context.

class Functor f => Applicative f where
  pure :: a -> f a
  (<*>) :: f (a -> b) -> f a -> f b

We can see pure lifts a given value into a context. The apply function (<*>) intuitively looks like fmap, with the difference that the function is within a context. This becomes key when we remember that Haskell functions are curried (and can thus be partially applied). This would then allow us to write something like:

maybeAdd :: Maybe Int -> Maybe Int -> Maybe Int
maybeAdd ma mb = pure (+) <*> ma <*> mb

This function takes two numbers in the Maybe context (that is, they either exist, or are Nothing), and adds them. The result will be the sum if both numbers exist, or Nothing if either or both do not.

Go ahead and convince yourself that it is painful to express this generically with just fmap.
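
To see the pain concretely, here is a minimal sketch of the same function written with fmap alone: the partially applied (+) ends up trapped inside the Maybe context, and we have to unwrap it by hand.

maybeAdd' :: Maybe Int -> Maybe Int -> Maybe Int
maybeAdd' ma mb =
  case fmap (+) ma of        -- fmap leaves us with a Maybe (Int -> Int)...
       Just f  -> fmap f mb  -- ...which we have to pattern-match back out
       Nothing -> Nothing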

Parsers

There are many ways of looking at what a parser is. Let’s work with one definition: A parser,

  • Takes some input
  • Converts some or all of it into something else if it can
  • Returns whatever input was not used in the conversion

How do we represent something that converts something to something else? It’s a function, of course. Let’s write that down as a type:

newtype Parser i o = Parser (i -> (Maybe o, i))

This more or less directly maps to what we just said. A Parser is a data type which has two type parameters — an input type and an output type. It contains a function that takes one argument of the input type, and produces a tuple of Maybe the output type (signifying if parsing succeeded) and the rest of the input.

We can name the field runParser, so it becomes easier to get a hold of the function inside our Parser type:

newtype Parser i o = Parser { runParser :: i -> (Maybe o, i) }

Parser combinators

The “rest” part is important because we would like to be able to chain small parsers together to make bigger parsers. We do this using “parser combinators”: functions that take one or more parsers and return a more complex parser formed by combining them in some way. We’ll see some of those ways as we go along.

Parser instances

Before we proceed, let’s define Functor and Applicative instances for our Parser type.

instance Functor (Parser i) where
  fmap f p = Parser $ \input ->
    let (mo, i) = runParser p input
    in (f <$> mo, i)

The intuition here is clear: if I have a parser that takes some input and provides some output, fmapping a function on that parser translates to applying that function to the output of the parser.

instance Applicative (Parser i) where
  pure x = Parser $ \input -> (Just x, input)

  pf <*> po = Parser $ \input ->
    case runParser pf input of
         (Just f, rest) -> case runParser po rest of
                                (Just o, rest') -> (Just (f o), rest')
                                (Nothing, _)    -> (Nothing, input)
         (Nothing, _)   -> (Nothing, input)

The Applicative instance is a bit more involved than Functor. What we’re doing first is “running” the first parser which gives us the function we want to apply (remember that this is a curried function, so rather than parsing out a function, we are most likely parsing out a value and creating a function with that). If we succeed, then we run the second parser to get a value to apply the function to. If this is also successful, we apply the function to the value, and return the result within the parser context (i.e. the result, and the rest of the input).

Implementing some parsers

Now let’s take our new data type and instances for a spin. Before we write a real parser, let’s write a helper function. A common theme while parsing a string is matching a single character against a predicate: for example, “is this character alphabetic?”, or “is this character a semicolon?”. We write a function that takes a predicate and returns the corresponding parser:

satisfy :: (Char -> Bool) -> Parser String Char
satisfy p = Parser $ \input ->
  case input of
       (c:cs) | p c -> (Just c, cs)
       _            -> (Nothing, input)

Now let’s try to make a parser that takes a string and, if it finds an ASCII digit character, provides the corresponding integer value. We have a function from the Data.Char module to match ASCII digit characters, isDigit. We also have a function to take a digit character and give us an integer, digitToInt. Putting this together with satisfy above:

import Data.Char (digitToInt, isDigit)

digit :: Parser String Int
digit = digitToInt <$> satisfy isDigit

And that’s it! Note how we used our higher-order satisfy function to match an ASCII digit character, and the Functor instance to apply digitToInt to the result of that parser (reminder: <$> is just the infix form of writing fmap, so this is the same as fmap digitToInt (satisfy isDigit)).
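
Following the definitions above, here is what I would expect running this parser in the REPL to look like (results worked out by hand, not captured output):

*Json> runParser digit "123"
(Just 1,"23")
*Json> runParser digit "abc"
(Nothing,"abc")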

Another example — a character parser, which succeeds if the next character in the input is a specific character we choose.

char :: Char -> Parser String Char
char x = satisfy (x ==)

Once again, the satisfy function makes this a breeze. I must say I’m pleased with the conciseness of this.

Finally, let’s combine character parsers to create a word parser — a parser that succeeds if the input is a given word.

word :: String -> Parser String String
word ""     = Parser $ \input -> (Just "", input)
word (c:cs) = (:) <$> char c <*> word cs

A match on an empty word always succeeds. For anything else, we just break down the parser to a character parser of the first character and a recursive call to the word parser for the rest. Again, note the use of the Functor and Applicative instance. Let’s look at the type signature of the (:) (list cons) function, which prepends an element to a list:

(:) :: a -> [a] -> [a]

The function takes two arguments — a single element of type a, and a list of elements of type a. If we expand the types some more, we’ll see that the first argument we give it is a Parser String Char and the second is a Parser String [Char] (String is just an alias for [Char]).

In this way we are able to take the basic list prepend function and use it to construct a list of characters within the Parser context. (a-ha!?)
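
Again working it out by hand from the definitions: a full match consumes the word, while a failure restores the whole input.

*Json> runParser (word "null") "nullable"
(Just "null","able")
*Json> runParser (word "null") "nil"
(Nothing,"nil")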

JSON

JSON is a relatively simple format to parse, and makes for a good example for building a parser. The JSON website has a couple of good depictions of the JSON language grammar front and center.

So that defines our parser problem then — we want to read a string input, and convert it into some sort of in-memory representation of the JSON value. Let’s see what that would look like in Haskell.

data JsonValue = JsonString String
               | JsonNumber JsonNum
               | JsonObject [(String, JsonValue)]
               | JsonArray [JsonValue]
               | JsonBool Bool
               | JsonNull

-- We represent a number as an infinite precision
-- floating point number with a base 10 exponent
data JsonNum = JsonNum { negative :: Bool
                       , signif   :: Integer
                       , expo     :: Integer
                       }

The JSON specification does not really tell us what type to use for numbers. We could just use a Double, but to make things interesting, we represent it as an arbitrary precision floating point number.

Note that the JsonArray and JsonObject constructors are recursive, as they should be — a JSON array is an array of JSON values, and a JSON object is a mapping from string keys to JSON values.

Parsing JSON

We now have the pieces we need to start parsing JSON. Let’s start with the easy bits.

null

To parse a null we literally just look for the word “null”.

jsonNull :: Parser String JsonValue
jsonNull = word "null" $> JsonNull

The $> operator is a flipped shortcut for fmap . const — it evaluates the argument on the left, and then fmaps the argument on the right onto it. If the word "null" parser is successful (Just "null"), we’ll fmap the JsonValue representing null to replace the string "null" (i.e. we’ll get a (Just JsonNull, <rest of the input>)).

true and false

First a quick detour:

instance Alternative (Parser i) where
  empty = Parser $ \input -> (Nothing, input)
  p1 <|> p2 = Parser $ \input ->
      case runParser p1 input of
           (Nothing, _) -> case runParser p2 input of
                                (Nothing, _) -> (Nothing, input)
                                justValue    -> justValue
           justValue    -> justValue

The Alternative instance is easy to follow once you understand Applicative. We define an empty parser that matches nothing. Then we define the alternative operator (<|>) as we might intuitively imagine.

We run the parser given as the first argument first, if it succeeds we are done. If it fails, we run the second parser on the whole input again, if it succeeds, we return that value. If both fail, we return Nothing.

Parsing true and false with this in our belt looks like:

jsonBool :: Parser String JsonValue
jsonBool =  (word "true" $> JsonBool True)
        <|> (word "false" $> JsonBool False)

We are easily able to express the idea of trying to parse the string “true”, and if that fails, trying again for the string “false”. If either matches, we have a boolean value; if not, Nothing. Again, nice and concise.
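
Assuming the Show instance from the complete code (the REPL session near the end of this post renders JSON values as their literal forms), I would expect:

*Json> runParser jsonBool "falsehood"
(Just false,"hood")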

String

This is only slightly more complex. We need a couple of helper functions first:

hexDigit :: Parser String Int
hexDigit = digitToInt <$> satisfy isHexDigit

digitsToNumber :: Int -> [Int] -> Integer
digitsToNumber base digits = foldl (\num d -> num * fromIntegral base + fromIntegral d) 0 digits

hexDigit is easy to follow. It just matches anything from 0-9 and a-f or A-F.

digitsToNumber is a pure function that takes a list of digits, and interprets it as a number in the given base. We do some jumping through hoops with fromIntegral to take Int digits (mapping to a normal word-sized integer) and produce an Integer (arbitrary sized integer).
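
To make the fold concrete, here is a hand-worked example in base 16:

-- digitsToNumber 16 [1, 10, 15]
--   = ((0 * 16 + 1) * 16 + 10) * 16 + 15
--   = 431 (i.e. 0x1AF)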

Now follow along one line at a time:

jsonString :: Parser String String
jsonString = (char '"' *> many jsonChar <* char '"')
  where
    jsonChar =  satisfy (\c -> not (c == '\"' || c == '\\' || isControl c))
            <|> word "\\\"" $> '"'
            <|> word "\\\\" $> '\\'
            <|> word "\\/"  $> '/'
            <|> word "\\b"  $> '\b'
            <|> word "\\f"  $> '\f'
            <|> word "\\n"  $> '\n'
            <|> word "\\r"  $> '\r'
            <|> word "\\t"  $> '\t'
            <|> chr . fromIntegral . digitsToNumber 16 <$> (word "\\u" *> replicateM 4 hexDigit)

A string is a sequence of valid JSON characters, surrounded by quotes. The *> and <* operators allow us to chain parsers whose output we wish to discard (since the quotes are not part of the actual string itself). The many function comes from the Alternative typeclass. It represents zero or more repetitions of a parser; in our case, it matches zero or more occurrences of jsonChar.

So what does jsonChar do? Following the definition of a character in the JSON spec, first we try to match something that is not a quote ("), a backslash (\) or a control character. If that doesn’t match, we try to match the various escape characters that the specification mentions.

Finally, if we get a \u followed by 4 hexadecimal characters, we put them in a list (replicateM 4 hexDigit chains 4 hexDigit parsers and provides the output as a list), convert that list into a base 16 integer (digitsToNumber), and then convert that to a Unicode character (chr).
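
To trace the escape handling by hand, consider the input text \u0041 (written "\\u0041" in a Haskell string literal):

-- word "\\u" consumes the leading \u
-- replicateM 4 hexDigit parses [0, 0, 4, 1]
-- digitsToNumber 16 [0, 0, 4, 1] gives 65
-- chr 65 is 'A'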

The order of chaining these parsers does matter for performance. The first parser in our <|> chain is the one that is most likely (most characters are not escaped). This follows from our definition of the Alternative instance. We run the first parser, then the second, and so on. We want this to succeed as early as possible so we don’t run more parsers than necessary.

Arrays

Arrays and objects have something in common — they have items which are separated by some value (commas for array values, commas for each key-value pair in an object, and colons separating keys and values). Let’s just factor this commonality out:

sepBy :: Parser i v -> Parser i s -> Parser i [v]
sepBy v s = (:) <$> v <*> many (s *> v) 
         <|> pure []

We take a parser for our values (v), and a parser for our separator (s). We try to parse one or more v separated by s, or just return an empty list in the parser context if there are none.
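
Worked out by hand from the definition, sepBy should behave like this, including the empty case handled by pure []:

*Json> runParser (digit `sepBy` char ',') "1,2,3"
(Just [1,2,3],"")
*Json> runParser (digit `sepBy` char ',') ""
(Just [],"")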

Now we write our JSON array parser as:

jsonArray :: Parser String JsonValue
jsonArray = JsonArray <$> (char '[' *> (json `sepBy` char ',') <* char ']')

Nice, that’s really succinct. But wait! What is json?

Putting it all together

We know that arrays contain JSON values. And we know how to parse some JSON values. Let’s try to put those together for our recursive definition:

json :: Parser String JsonValue
json =  jsonNull
    <|> jsonBool
    <|> jsonString
    <|> jsonArray
--  <|> jsonNumber
--  <|> jsonObject

And that’s it!

The JSON object and number parsers follow the same pattern. So far we’ve ignored spaces in the input, but those can be consumed and ignored easily enough based on what we’ve learned.
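
As a sketch of what that could look like (the helper name spaces is my choice, assuming isSpace from Data.Char and the many from our Alternative instance):

import Data.Char (isSpace)

-- Match zero or more whitespace characters; callers can discard
-- the result by sequencing with *> and <*, as we did with the quotes.
spaces :: Parser String String
spaces = many (satisfy isSpace)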

You can find the complete code for this exercise on Github.

Some examples of what this looks like in the REPL:

*Json> runParser json "null"
(Just null,"")

*Json> runParser json "true"
(Just true,"")

*Json> runParser json "[null,true,\"hello!\"]"
(Just [null, true, "hello!" ],"")

Concluding thoughts

If you’ve made it this far, thank you! I realise this is long and somewhat dense, but I am very excited by how elegantly Haskell allows us to express these ideas, using fundamental aspects of its type(class) system.

A nice real world example of how you might use this is the optparse-applicative package which uses these ideas to greatly simplify the otherwise dreary task of parsing command line arguments.

I hope this post generates at least some of the excitement in you that it has in me. Feel free to leave your comments and thoughts below.

February 19, 2018
Jason A. Donenfeld a.k.a. zx2c4 (homepage, bugs)
WireGuard in Google Summer of Code (February 19, 2018, 14:55 UTC)

WireGuard is participating in Google Summer of Code 2018. If you're a student — bachelors, masters, PhD, or otherwise — who would like to be funded this summer for writing interesting kernel code, studying cryptography, building networks, making mobile apps, contributing to the larger open source ecosystem, doing web development, writing documentation, or working on a wide variety of interesting problems, then this may be appealing. You'll be mentored by world-class experts, and the summer will certainly boost your skills. Details are on this page — simply contact the WireGuard team to get a proposal into the pipeline.

Alice Ferrazzi a.k.a. alicef (homepage, bugs)

Gentoo has been accepted as a Google Summer of Code 2018 mentoring organization

Gentoo has been accepted into the Google Summer of Code 2018!

Contacts:
If you want to help in any way with Gentoo GSoC 2018, please add yourself here [1] as soon as possible; it is important! Someone from the Gentoo GSoC team will then contact you as soon as possible. We are always searching for people willing to help with Gentoo GSoC 2018!

Students:
If you are a student and want to spend your summer on Gentoo projects, having fun writing code, you can start discussing ideas in the #gentoo-soc IRC Freenode channel [2].
You can find some example ideas here [3], but your proposal could of course also be something different.
Just remember that we strongly recommend that you work with a potential mentor to develop your idea before proposing it formally. Don’t waste any time, because there is typically some polishing which needs to occur before the deadline (March 28th).

I can assure you that Gentoo GSoC is really fun and a great experience for improving yourself. It is also useful for making yourself known in the Gentoo community. Contributing to the Gentoo GSoC will be contributing to Gentoo, a bleeding edge distribution, considered by many to be for experts and with near-unlimited adaptability.

If you require any further information, please do not hesitate to contact us [4] or check the Gentoo GSoC 2018 wiki page [5].

[1] https://wiki.gentoo.org/wiki/Google_Summer_of_Code/2018/Mentors
[2] http://webchat.freenode.net/?channels=gentoo-soc
[3] https://wiki.gentoo.org/wiki/Google_Summer_of_Code/2018/Ideas
[4] soc-mentors@gentoo.org
[5] https://wiki.gentoo.org/wiki/Google_Summer_of_Code/2018

Gentoo accepted into Google Summer of Code 2018 (February 19, 2018, 00:00 UTC)

Students who want to spend their summer having fun and writing code can do so now for Gentoo. Gentoo has been accepted as a mentoring organization for this year’s Google Summer of Code.

The GSoC is an excellent opportunity for gaining real-world experience in software design and making one’s self known in the broader open source community. It also looks great on a resume.

Initial project ideas can be found here, although new project ideas are welcome. For new projects time is of the essence: there is typically some idea-polishing which must occur before the March 27th deadline. Because of this it is strongly recommended that students refine new project ideas with a mentor before proposing the idea formally.

GSoC students are encouraged to begin discussing ideas in the #gentoo-soc IRC channel on the Freenode network.

Further information can be found on the Gentoo GSoC 2018 wiki page. Those with unanswered questions should not hesitate to contact the Summer of Code mentors via the mailing list.

February 14, 2018
Luca Barbato a.k.a. lu_zero (homepage, bugs)
Rust-av: Rust and Multimedia (February 14, 2018, 19:44 UTC)

Recently I presented my new project at Fosdem and since I was afraid of not having enough time for the questions I trimmed the content to the bare minimum. This blog post should add some more details.

What is it?

Rust-av aims to be a complete multimedia toolkit written in rust.

Rust is a quite promising language that aims to offer high execution speed while granting a number of guarantees about code behavior that you cannot have in C, C++, Java and so on.

Its zero-cost abstractions, coupled with the fact that the compiler actively prevents you from committing a large class of mistakes related to memory access, seem a perfect match for implementing a multimedia toolkit that is easy to use, fast enough and trustworthy.

Why something completely new?

Since rust code can be intermixed with C code, an evolutionary approach of replacing small components of a larger project little by little is perfectly feasible, and it is what we are currently trying to do with vlc.

But rust is not just good for writing inner routines that are fast and correct; its trait system is also quite useful for building a more expressive API.

Most multimedia concepts are pretty simple at the high level (e.g. a frame is just a picture or some sound with a timestamp), but come with an excruciating amount of quirks and details that require your toolkit either to make choices for you or to make you face a large amount of complexity.

That leads to APIs that are either easy but quite inflexible (and opinionated), or APIs providing all the flexibility but forcing the user to learn a lot of information in order to achieve what the simpler API would let you implement in a handful of lines of code.

I wanted to leverage Rust to write the low level implementations with fewer bugs and, at the same time, try to provide a better API on top of them.

Why now?

Since 2016 I have kept bouncing ideas around with Kostya and Geoffroy, but between my work duties and other projects I couldn’t devote enough time to it. Thanks to the Mozilla Open Source Support initiative, which awarded me enough to develop it full time, the project now has some components published already, and more will follow during the next months.

Philosophy

I’m trying to leverage the experience I have from contributing to vlc and libav: keep what is working well, and try not to make the same mistakes.

Ease of use

I want the whole toolkit to be useful to a wide audience. Developers often fight against a library in order to undo what is happening under the hood, or end up vendoring some part of it since they need only a tiny subset of all the features that are provided.

Rust makes it quite natural to split large projects into independent components (called crates), and it is already quite common to have meta-crates re-exporting many smaller crates to provide some uniform access.

The rust-av code, as opposed to the rather monolithic approach taken in Libav, can be reused with the granularity of the bare codec or format:

  • Integrating it in a foreign toolkit won’t require undoing what the common utility code does.
  • Even when using it through the higher level layers, rust-av won’t force the developer to bring in any unrelated dependencies.
  • On the other hand users that enjoy a fully integrated and all-encompassing solution can simply depend on the meta-crates and get the support for everything.

Speed

Multimedia playback boils down to efficiently doing complex computation so that an arbitrarily large amount of data can be rendered within a fraction of a second; real time multimedia streaming requires compressing an equally large amount of data in the same time.

Speed in multimedia is important.

Rust provides high level idiomatic constructs that surprisingly lead to pretty decent runtime speed. The stdsimd effort and the seamless C ABI support make it easier to leverage the SIMD instructions provided by recent CPU architectures.

Trustworthy

Traditionally the most effective way to write fast multimedia code has been pairing C and assembly. Sadly the combination makes it quite easy to overlook corner cases and introduce all kinds of memory hazards (use-after-free, out-of-bounds reads and writes, NULL dereferences…).

Rust effectively prevents a good deal of those issues at compile time. Since its abstractions usually do not cause slowdowns it is possible to write code that is, arguably, less misleading and as fast.

Structure

The toolkit is composed of multiple, loosely coupled, crates. They can be grouped by level of abstraction.

Essential

av-data: Used by nearly all the other crates, it provides basic data types and a minimal amount of functionality on top of them. It mainly provides the following structs:

  • Frame: it binds together a time reference and a buffer, representing either a video picture or some audio samples.
  • Packet: it binds together a time reference and a buffer, containing compressed data.
  • Value: Simple key value type abstraction, used to pass arbitrary data to the configuration functions.

Core

They provide the basic abstractions (traits) implemented by specific sets of components.

  • av-format: It provides a set of traits to implement muxers and demuxers, and a utility Context to bridge the normal rust I/O Write and Read traits and the actual muxers and demuxers.
  • av-codec: It provides a set of traits to implement encoders and decoders, and a utility Context that wraps them.

Utility

They provide building blocks that may be used to implement actual codecs and formats.

  • av-bitstream: Utility crate to write and read bits and bytes
  • av-audio: Audio-specific utilities
  • av-video: Video-specific utilities

Implementation

Actual implementations of codecs and formats; they can be used directly or through the utility Contexts.

The direct usage is suggested only if you are integrating it in larger frameworks that already implement, possibly in different ways, the integration code provided by the Context (e.g. binding it together with the I/O for the formats or internal queues for the codecs).

Higher-level

They provide higher level Contexts to playback or encode data through a simplified interface:

  • av-player reads bytes through a provided Read and outputs decoded Frames. Under the hood it probes the data, allocates and configures a Demuxer and a Decoder for each stream of interest.
  • av-encoder consumes Frames and outputs encoded and muxed data through a Write output. It automatically sets up the encoders and the muxer.

Meta-crates

They ease the use in bulk of everything provided by rust-av.

There are 4 crates providing a list of specific components: av-demuxers, av-muxers, av-decoders and av-encoders; and 2 grouping them by type: av-formats and av-codecs.

Their use is suggested when you’d like to support every format and codec available.

So far

All the development happens on the github organization and so far the initial Core and Essential crates are ready to be used.

There is a nom-based matroska demuxer in working condition and some non-native wrappers providing implementations for some decoders and encoders.

Thanks to est31 we have native vorbis support.

I’m working on a native implementation of opus and soon I’ll move to a video codec.

There is a tiny player called avp and an encoder tool (named ave) will appear once the matroska muxer is complete.

What’s missing in rust-av

API-wise, right now rust-av provides only simple decoding and encoding, muxing and demuxing. There are already enough wrapped codecs to let people play with the library and, hopefully, help in polishing it.

For each crate I’m trying to prepare some easy tasks so people willing to contribute to the project can start from them; all help is welcome!

What’s missing in rust

So far my experience with rust has been quite positive, but there are a number of features that are missing or that could be addressed.

  • SIMD support is shaping up nicely and it is coming soon.
  • The natural fallback, going down to assembly, is available since rust supports the C ABI; inline assembly support, on the other hand, seems to still be pending some discussion before it reaches stable.
  • Arbitrarily aligned allocation is a MUST in order to support hardware acceleration, and SIMD usually works better with aligned buffers.
  • I’d love to have const generics now; luckily, associated constants with traits allow some workarounds that let you specialize by constants (and result in neat speedups).
  • I think that focusing a little more on array/slice support would lead to the best gains, since right now there isn’t an equivalent to collect() to fill arrays in an idiomatic way, and in multimedia large lookup tables are pretty much a staple.

In closing

Rust and multimedia seem a really good match; in my experience, aside from a number of missing features, the language seems quite good for the purpose.

Once I have more native implementations complete, I will have better means to evaluate the speed difference from writing the same code in C.

Sebastian Pipping a.k.a. sping (homepage, bugs)
I love free software… and Gentoo does! #ilovefs (February 14, 2018, 14:25 UTC)

Some people care if software is free of cost or if it has the best features, above everything else. I don’t. I care that I can legally inspect its inner workings, modify and share modified versions. That’s why I happily avoid macOS, Windows, Skype, Photoshop.

I ran into these two pieces involving Gentoo in the Gallery of Free Software lovers and would like to share them with you:

Images are licensed under CC BY-SA 4.0 (with attribution going to Free Software Foundation Europe) as confirmed by Max Mehl.

February 12, 2018
Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)

I don’t really like the idea of having to write about proprietary software here, but I only found terrible alternative suggestions on the web, so I thought I would at least try to write this down in the hope of keeping people from falling for very bad advice.

The problem: after updating my BIOS, BitLocker asks for the key at boot, and PIN login for Windows 10 (Microsoft Account) fails, requiring me to log in with the full Microsoft account password. Trying to re-enable the PIN fails with the error message “Sorry, there was a problem signing you in”.

The amount of misleading information I found on the Internet is astonishing, including a phrase from what appeared to be a Microsoft support person, stating «The operating system should always go with the BIOS. If the BIOS is freshly updated, then the OS has to be fresh as well.» Facepalms all over the place.

The solution (for me): go back in the BIOS and re-enable the TPM (“Security Module”).

Some background is needed. The 2017 Gamestation I’m using nowadays is built around an MSI X299 SLI PLUS with a plug-in TPM, which is a requirement for using BitLocker (and if you think that makes it very safe, think again).

I had just updated the firmware of the motherboard (which at this point we all still call “BIOS” despite it being clearly “UEFI” based), and it turns out that MSI just silently drops most of the customizations to the BIOS settings after an update. In particular this disabled a few required settings, including the TPM itself (and Secure Boot; I wonder if Matthew Garrett would have some words about this particular board’s implementation of it).

I see reports on this for MSI and Gigabyte boards alike, so I can assume that Gigabyte does the same, and requires re-enabling the TPM in the settings when updating the BIOS version.

I would probably say that MSI’s firmware engineering does not quite fully convince me. It’s not just the dropping of all the settings on update (which I still find a bad thing to do to users), but also the fact that one of the settings is explicitly marked as “crypto miner mode”; I’m sure looking forward to the time after the bubble bursts, so that we don’t have to pay a premium for decent hardware just because someone thinks they can make money from nothing. Oh well.

February 10, 2018
Matthew Thode a.k.a. prometheanfire (homepage, bugs)
Native ZFS encryption for your rootfs (February 10, 2018, 06:00 UTC)

Disclaimer

I'm not responsible if you ruin your system; this guide functions as documentation for future me. Remember to back up your data.

Why do this instead of luks

I wanted to remove a layer from the file-to-disk stack: before, it was ZFS -> LUKS -> disk; now it's ZFS -> disk.

Prework

I just got a new laptop and wanted to simply migrate the data; luckily the old laptop was using ZFS as well, so the data could be sent and received through native ZFS means.

The actual setup

Set up your root pool with the encryption key; it will be inherited by all child datasets, and no child dataset will be allowed to be unencrypted.

In my case the pool name was slaanesh-zp00, so I ran the following to create the fresh pool.

zpool create -O encryption=on -O keyformat=passphrase zfstest /dev/zvol/slaanesh-zp00/zfstest

After that just go on and create your datasets as normal, transfer old data as needed (it'll be encrypted as it's written). See https://wiki.gentoo.org/wiki/ZFS for a good general guide on setting up your datasets.
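
For example (the dataset name here is purely illustrative), a child dataset created with the stock command inherits the encryption settings from the pool automatically:

# zfs create zfstest/data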

decrypting at boot

If you are using dracut it should just work. No changes to what you pass on the kernel command line are needed. The code is upstream in https://github.com/zfsonlinux/zfs/blob/master/contrib/dracut/90zfs/zfs-load-key.sh.in

notes

Make sure you install from git master; there was a disk format change for encrypted datasets that went in just a week or so ago.

Diego E. Pettenò a.k.a. flameeyes (homepage, bugs)
UK Banking, Fineco is not enough (February 10, 2018, 00:03 UTC)

You may remember that the last time I blogged about UK banking, I had just dismissed Barclays in favour of Fineco, the Italian investment bank branch of UniCredit. This seemed a good move, both because people spoke very well of Fineco in my circles, at least in Italy, and because the sign-up flow was remarkably smooth.

I found out almost right away that something was not quite perfect for the UK market, in particular because there was (and is) no way to set up a standing order, which is the standard way to pay rent in the UK (and Ireland, for what it’s worth). But it seemed a minor thing to worry about, as the rest of the bank’s features looked compelling (the ability to spend up to £10k in a single transaction, by requesting an explicit lift of the card limits with SMS authentication, just to name one).

Unfortunately, a couple of months later I know for sure it is not possible to use Fineco as a primary account in the UK at all. There are two problems: the first is very much a problem for anyone, the second a problem for my particular situation. I’ll start with the first one: direct debit support.

The direct debit system, for those not used to it in Europe, is one where you give a creditor (usually a utility service, such as your ISP or power company) your account details (Sort Code and Account Number in the case of the UK), and they will tell the bank to give them money at a certain time of the month. And it is the way Jeremy Clarkson lost £200, ten years ago. There is a nearly identical system in the rest of the EU, called SEPA Direct Debit (with SDD Core, which deals with B2C, business-to-consumer debits, being the more commonly known variant).

After I opened the Fineco account, I checked Payments UK’s Sort Code Checker to see which features were enabled for it (30-02-48). Then, as well as at the time of writing, it said «Bacs Direct Debits can be set up on this sort code.» So I had no qualms about closing my Barclays account and moving all the money into the newly created account. All of my utilities were more than happy to switch, except for Thames Water, which refused to let me set up the debit online. Turns out they were the only ones with a clue.

Indeed, in January, when the first debit was supposed to land, instead of seeing the debit on the statement, I saw a BACS credit of the same amount. I contacted my ISP (Hyperoptic, whose customer support is awesome) to verify whether something had failed on their side, but they didn’t see anything amiss. When even Netflix showed up the same way, and both of the transactions showed an “entry reversal” of the same amount, I knew something was off with the bank and contacted them, originally to no avail.

Indeed, a few more debits showed up the same way, so I was waiting for the shoe to drop, which it did at the end of January, when Fineco sent me an email (or actually a ticket, it even has a ticket number!) telling me that they had processed the debits as a one-off, but asking me to cancel them because they would not do so again. This was professional of them, particularly as this way it did not hit my credit score at all, but it is still a major pain in the neck.

My first hope was to be able to just use Revolut to pay the direct debits, since they are all relatively low amounts, which fit my usual auto top-up strategy. When you look at the Revolut information page with your account details for GBP, the text says explicitly «Use this personal UK current account to get salary and to pay bills», which gave me hope, and indeed Payments UK’s checker also confirmed that it supposedly accepts Bacs Direct Debits. But when I checked with the in-app chat support, I was told that no, Revolut does not support direct debits, which makes that phrase extremely misleading. At least TransferWise explicitly denies supporting Direct Debit in the sort code checker, kudos to them.

The next problem with Fineco is not actually their fault, but it is still due to them not having the “full features” of a UK high street bank. I got contacted by Dexters, the real estate company that, among other things, manages my apartment and collects my rent. While they told me the money had arrived all good when I moved to Fineco (I asked them explicitly), they sent me a scary and threatening email (after failing to reach me on the phone, as I would have been charged excessively high roaming fees to answer an unknown number)… because £12 was missing from my payment. The email exchange wasn’t particularly productive (I sent them a photo of the payment confirmation; they told me «[they] received a large sum of money[sic] however it is £12.00 that is outstanding on the account.»). So I called them on Monday, and they managed to confirm that it was actually £6 missing in December, and another £6 missing in January.

Throwing this around with colleagues, and then discussing with a more reasonable person from Dexters on the phone, we came to figure out that Barclays (the bank used by Dexters to receive the rent) is charging them £6 to receive these transfers because they are “international”; despite the transfers being indeed local, it appears Barclays applies that fee to any transfer received over the SWIFT network rather than through the Faster Payments system used by most other UK banks. I didn’t want to keep arguing with Dexters over the fact that it’s their bank charging them the fee, so I paid the extra £12 and decided to switch the rent payment over to the new account as well. I really dislike Barclays.

I’ll post later this month on the following attempts with other bank accounts. For now I have decided that I’ll keep getting my salary into Fineco, and keep a running balance on the “high street” account for the direct debits and the rent. Right now, for my only GBP credit card (American Express), I still pay off the balance from Fineco via a debit card payment anyway, because the credit limit they gave me is quite low for my usual spending, particularly now that I can actually charge that card when booking flights on BA without having to spend extra money in fees.

February 06, 2018
Gentoo at FOSDEM 2018 (February 06, 2018, 19:49 UTC)

Gentoo Linux participated with a stand during this year's FOSDEM 2018, as has been the case for the past several years. Three Gentoo developers had talks this year: Haubi was back with a Gentoo-related talk on Unix? Windows? Gentoo! - POSIX? Win32? Native Portability to the max!, dilfridge talked about Perl in the Physics Lab … Continue reading "Gentoo at FOSDEM 2018"

February 05, 2018
Greg KH a.k.a. gregkh (homepage, bugs)
Linux Kernel Release Model (February 05, 2018, 17:13 UTC)

Note

This post is based on a whitepaper I wrote at the beginning of 2016 to be used to help many different companies understand the Linux kernel release model and encourage them to start taking the LTS stable updates more often. I then used it as a basis of a presentation I gave at the Linux Recipes conference in September 2017 which can be seen here.

With the recent craziness of Meltdown and Spectre, I’ve seen lots of totally incorrect things written about how Linux is released and how we handle security patches, so I figured it is time to dust off the text, update it in a few places, and publish it here for everyone to benefit from.

I would like to thank the reviewers who helped shape the original whitepaper, which has helped many companies understand that they need to stop “cherry picking” random patches into their device kernels. Without their help, this post would be a total mess. All problems and mistakes in here are, of course, all mine. If you notice any, or have any questions about this, please let me know.

Overview

This post describes how the Linux kernel development model works, what a long term supported kernel is, how the kernel developers approach security bugs, and why all systems that use Linux should be using all of the stable releases and not attempting to pick and choose random patches.

Linux Kernel development model

The Linux kernel is the largest collaborative software project ever. In 2017, over 4,300 different developers from over 530 different companies contributed to the project. There were 5 different releases in 2017, with each release containing between 12,000 and 14,500 different changes. On average, 8.5 changes are accepted into the Linux kernel every hour, every hour of the day. A non-scientific study (i.e. Greg’s mailbox) shows that each change needs to be submitted 2-3 times before it is accepted into the kernel source tree due to the rigorous review and testing process that all kernel changes are put through, so the engineering effort happening is much larger than the 8.5 changes per hour.

At the end of 2017 the size of the Linux kernel was just over 61 thousand files consisting of 25 million lines of code, build scripts, and documentation (kernel release 4.14). The Linux kernel contains the code for all of the different chip architectures and hardware drivers that it supports. Because of this, an individual system only runs a fraction of the whole codebase. An average laptop uses around 2 million lines of kernel code from 5 thousand files to function properly, while the Pixel phone uses 3.2 million lines of kernel code from 6 thousand files due to the increased complexity of a SoC.

Kernel release model

With the release of the 2.6 kernel in December of 2003, the kernel developer community switched from the previous model of having a separate development and stable kernel branch, and moved to a “stable only” branch model. A new release happened every 2 to 3 months, and that release was declared “stable” and recommended for all users to run. This change in development model was due to the very long release cycle prior to the 2.6 kernel (almost 3 years), and the struggle to maintain two different branches of the codebase at the same time.

The numbering of the kernel releases started out being 2.6.x, where x was an incrementing number that changed on every release. The value of the number has no meaning, other than it is newer than the previous kernel release. In July 2011, Linus Torvalds changed the version number to 3.x after the 2.6.39 kernel was released. This was done because the higher numbers were starting to cause confusion among users, and because Greg Kroah-Hartman, the stable kernel maintainer, was getting tired of the large numbers and bribed Linus with a fine bottle of Japanese whisky.

The change to the 3.x numbering series did not mean anything other than a change of the major release number, and this happened again in April 2015 with the movement from the 3.19 release to the 4.0 release number. It is not remembered if any whisky exchanged hands when this happened. At the current kernel release rate, the number will change to 5.x sometime in 2018.

Stable kernel releases

The Linux kernel stable release model started in 2005, when the existing development model of the kernel (a new release every 2-3 months) was determined to not be meeting the needs of most users. Users wanted bugfixes that were made during those 2-3 months, and the Linux distributions were getting tired of trying to keep their kernels up to date without any feedback from the kernel community. Trying to keep individual kernels secure and with the latest bugfixes was a large and confusing effort by lots of different individuals.

Because of this, the stable kernel releases were started. These releases are based directly on Linus’s releases, and are released every week or so, depending on various external factors (time of year, available patches, maintainer workload, etc.)

The numbering of the stable releases starts with the number of the kernel release, and an additional number is added to the end of it.

For example, the 4.9 kernel is released by Linus, and then the stable kernel releases based on this kernel are numbered 4.9.1, 4.9.2, 4.9.3, and so on. This sequence is usually shortened with the number “4.9.y” when referring to a stable kernel release tree. Each stable kernel release tree is maintained by a single kernel developer, who is responsible for picking the needed patches for the release, and doing the review/release process. Where these changes are found is described below.

Stable kernels are maintained for as long as the current development cycle is happening. After Linus releases a new kernel, the previous stable kernel release tree is stopped and users must move to the newer released kernel.

Long-Term Stable kernels

After a year of this new stable release process, it was determined that many different users of Linux wanted a kernel to be supported for longer than just a few months. Because of this, the Long Term Supported (LTS) kernel release came about. The first LTS kernel was 2.6.16, released in 2006. Since then, a new LTS kernel has been picked once a year. That kernel will be maintained by the kernel community for at least 2 years. See the next section for how a kernel is chosen to be a LTS release.

Currently the LTS kernels are the 4.4.y, 4.9.y, and 4.14.y releases, and a new kernel is released, on average, once a week. Along with these three kernel releases, a few older kernels are still being maintained by some kernel developers at a slower release cycle due to the needs of some users and distributions.

Information about all long-term stable kernels, who is in charge of them, and how long they will be maintained, can be found on the kernel.org release page.

LTS kernel releases average 9-10 patches accepted per day, while the normal stable kernel releases contain 10-15 patches per day. The number of patches fluctuates per release given the current time of the corresponding development kernel release, and other external variables. The older a LTS kernel is, the fewer patches are applicable to it, because many recent bugfixes are not relevant to older kernels. However, the older a kernel is, the harder it is to backport the changes that need to be applied, due to the changes in the codebase. So while there might be a lower number of overall patches being applied, the effort involved in maintaining a LTS kernel is greater than maintaining the normal stable kernel.

Choosing the LTS kernel

The method of picking which kernel the LTS release will be, and who will maintain it, has changed over the years from a semi-random method to something that is hopefully more reliable.

Originally it was merely based on what kernel the stable maintainer’s employer was using for their product (2.6.16.y and 2.6.27.y) in order to make the effort of maintaining that kernel easier. Other distribution maintainers saw the benefit of this model and got together and colluded to get their companies to all release a product based on the same kernel version without realizing it (2.6.32.y). After that was very successful, and allowed developers to share work across companies, those companies decided to not do that anymore, so future LTS kernels were picked on an individual distribution’s needs and maintained by different developers (3.0.y, 3.2.y, 3.12.y, 3.16.y, and 3.18.y) creating more work and confusion for everyone involved.

This ad-hoc method of catering to only specific Linux distributions was not beneficial to the millions of devices that used Linux in an embedded system and were not based on a traditional Linux distribution. Because of this, Greg Kroah-Hartman decided that the choice of the LTS kernel needed to change to a method in which companies can plan on using the LTS kernel in their products. The rule became “one kernel will be picked each year, and will be maintained for two years.” With that rule, the 3.4.y, 3.10.y, and 3.14.y kernels were picked.

Due to a large number of different LTS kernels being released all in the same year, causing lots of confusion for vendors and users, the rule of no new LTS kernels being based on an individual distribution’s needs was created. This was agreed upon at the annual Linux kernel summit and started with the 4.1.y LTS choice.

During this process, the LTS kernel would only be announced after the release happened, making it hard for companies to plan ahead of time what to use in their new product, causing lots of guessing and misinformation to be spread around. This was done on purpose as previously, when companies and kernel developers knew ahead of time what the next LTS kernel was going to be, they relaxed their normal stringent review process and allowed lots of untested code to be merged (2.6.32.y). The fallout of that mess took many months to unwind and stabilize the kernel to a proper level.

The kernel community discussed this issue at its annual meeting and decided to mark the 4.4.y kernel as a LTS kernel release, much to the surprise of everyone involved, with the goal that the next LTS kernel would be planned ahead of time to be based on the last kernel release of 2016 in order to provide enough time for companies to release products based on it in the next holiday season (2017). This is how the 4.9.y and 4.14.y kernels were picked as the LTS kernel releases.

This process seems to have worked out well, without many problems being reported against the 4.9.y tree, despite it containing over 16,000 changes, making it the largest kernel to ever be released.

Future LTS kernels should be planned based on this release cycle (the last kernel of the year). This should allow SoC vendors to plan ahead on their development cycle to not release new chipsets based on older, and soon to be obsolete, LTS kernel versions.

Stable kernel patch rules

The rules for what can be added to a stable kernel release have remained almost identical for the past 12 years. The full list of the rules for patches to be accepted into a stable kernel release can be found in the Documentation/process/stable_kernel_rules.rst kernel file and are summarized here. A stable kernel change:

  • must be obviously correct and tested.
  • must not be bigger than 100 lines.
  • must fix only one thing.
  • must fix something that has been reported to be an issue.
  • can be a new device id or quirk for hardware, but must not add major new functionality.
  • must already be merged into Linus’s tree.
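
In practice, a developer nominates a mainline patch for the stable trees by adding a tag to the sign-off area of the commit message, as described in that same documentation file. A sketch of what such a tag block looks like (the commit id and names here are invented for illustration):

Fixes: 1234567890ab ("subsys: commit that introduced the bug")
Cc: <stable@vger.kernel.org> # 4.9.x and newer
Signed-off-by: A. Developer <adev@example.com>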

The last rule, “a change must be in Linus’s tree”, prevents the kernel community from losing fixes. The community never wants a fix to go into a stable kernel release that is not already in Linus’s tree so that anyone who upgrades should never see a regression. This prevents many problems that other projects who maintain a stable and development branch can have.

Kernel Updates

The Linux kernel community has promised its userbase that no upgrade will ever break anything that is currently working in a previous release. That promise was made in 2007 at the annual Kernel developer summit in Cambridge, England, and still holds true today. Regressions do happen, but those are the highest priority bugs and are either quickly fixed, or the change that caused the regression is quickly reverted from the Linux kernel tree.

This promise holds true for both the incremental stable kernel updates, as well as the larger “major” updates that happen every three months.

The kernel community can only make this promise for the code that is merged into the Linux kernel tree. Any code that is merged into a device’s kernel that is not in the kernel.org releases is unknown and interactions with it can never be planned for, or even considered. Devices based on Linux that have large patchsets can have major issues when updating to newer kernels, because of the huge number of changes between each release. SoC patchsets are especially known to have issues with updating to newer kernels due to their large size and heavy modification of architecture specific, and sometimes core, kernel code.

Most SoC vendors do want to get their code merged upstream before their chips are released, but the reality of project-planning cycles and ultimately the business priorities of these companies prevent them from dedicating sufficient resources to the task. This, combined with the historical difficulty of pushing updates to embedded devices, results in almost all of them being stuck on a specific kernel release for the entire lifespan of the device.

Because of the large out-of-tree patchsets, most SoC vendors are starting to standardize on using the LTS releases for their devices. This allows devices to receive bug and security updates directly from the Linux kernel community, without having to rely on the SoC vendor’s backporting efforts, which traditionally are very slow to respond to problems.

It is encouraging to see that the Android project has standardized on the LTS kernels as a “minimum kernel version requirement”. Hopefully that will allow the SoC vendors to continue to update their device kernels in order to provide more secure devices for their users.

Security

When doing kernel releases, the Linux kernel community almost never declares specific changes as “security fixes”. This is due to the basic problem of the difficulty in determining if a bugfix is a security fix or not at the time of creation. Also, many bugfixes are only determined to be security related after much time has passed, so to keep users from getting a false sense of security by not taking patches, the kernel community strongly recommends always taking all bugfixes that are released.

Linus summarized the reasoning behind this behavior in an email to the Linux Kernel mailing list in 2008:

On Wed, 16 Jul 2008, pageexec@freemail.hu wrote:
>
> you should check out the last few -stable releases then and see how
> the announcement doesn't ever mention the word 'security' while fixing
> security bugs

Umm. What part of "they are just normal bugs" did you have issues with?

I expressly told you that security bugs should not be marked as such,
because bugs are bugs.

> in other words, it's all the more reason to have the commit say it's
> fixing a security issue.

No.

> > I'm just saying that why mark things, when the marking have no meaning?
> > People who believe in them are just _wrong_.
>
> what is wrong in particular?

You have two cases:

 - people think the marking is somehow trustworthy.

   People are WRONG, and are misled by the partial markings, thinking that
   unmarked bugfixes are "less important". They aren't.

 - People don't think it matters

   People are right, and the marking is pointless.

In either case it's just stupid to mark them. I don't want to do it,
because I don't want to perpetuate the myth of "security fixes" as a
separate thing from "plain regular bug fixes".

They're all fixes. They're all important. As are new features, for that
matter.

> when you know that you're about to commit a patch that fixes a security
> bug, why is it wrong to say so in the commit?

It's pointless and wrong because it makes people think that other bugs
aren't potential security fixes.

What was unclear about that?

    Linus

This email can be found here, and the whole thread is recommended reading for anyone who is curious about this topic.

When security problems are reported to the kernel community, they are fixed as soon as possible and pushed out publicly to the development tree and the stable releases. As described above, the changes are almost never described as a “security fix”, but rather look like any other bugfix for the kernel. This is done to allow affected parties the ability to update their systems before the reporter of the problem announces it.

Linus describes this method of development in the same email thread:

On Wed, 16 Jul 2008, pageexec@freemail.hu wrote:
>
> we went through this and you yourself said that security bugs are *not*
> treated as normal bugs because you do omit relevant information from such
> commits

Actually, we disagree on one fundamental thing. We disagree on
that single word: "relevant".

I do not think it's helpful _or_ relevant to explicitly point out how to
tigger a bug. It's very helpful and relevant when we're trying to chase
the bug down, but once it is fixed, it becomes irrelevant.

You think that explicitly pointing something out as a security issue is
really important, so you think it's always "relevant". And I take mostly
the opposite view. I think pointing it out is actually likely to be
counter-productive.

For example, the way I prefer to work is to have people send me and the
kernel list a patch for a fix, and then in the very next email send (in
private) an example exploit of the problem to the security mailing list
(and that one goes to the private security list just because we don't want
all the people at universities rushing in to test it). THAT is how things
should work.

Should I document the exploit in the commit message? Hell no. It's
private for a reason, even if it's real information. It was real
information for the developers to explain why a patch is needed, but once
explained, it shouldn't be spread around unnecessarily.

    Linus

Full details of how security bugs can be reported to the kernel community in order to get them resolved and fixed as soon as possible can be found in the kernel file Documentation/admin-guide/security-bugs.rst.

Because security bugs are not announced to the public by the kernel team, CVE numbers for Linux kernel-related issues are usually released weeks, months, and sometimes years after the fix was merged into the stable and development branches, if at all.

Keeping a secure system

When deploying a device that uses Linux, it is strongly recommended that all LTS kernel updates be taken by the manufacturer and pushed out to their users after proper testing shows the update works well. As was described above, it is not wise to try to pick and choose various patches from the LTS releases because:

  • The releases have been reviewed by the kernel developers as a whole, not in individual parts
  • It is hard, if not impossible, to determine which patches fix “security” issues and which do not. Almost every LTS release contains at least one known security fix, and many more that are not yet known to be security-related.
  • If testing shows a problem, the kernel developer community will react quickly to resolve the issue. If you wait months or years to do an update, the kernel developer community will not be able to even remember what the updates were given the long delay.
  • Changes to parts of the kernel that you do not build/run are fine and can cause no problems to your system. Trying to filter out only the changes you run will result in a kernel tree that is impossible to merge correctly with future upstream releases.

Note, this author has audited many SoC kernel trees that attempt to cherry-pick random patches from the upstream LTS releases. In every case, severe security fixes have been ignored and not applied.

As proof of this, I demoed at the Kernel Recipes talk referenced above how trivial it was to crash all of the latest flagship Android phones on the market with a tiny userspace program. The fix for this issue was released 6 months prior in the LTS kernel that the devices were based on; however, none of the devices had upgraded or fixed their kernels for this problem. As of this writing (5 months later) only two devices have fixed their kernel and are no longer vulnerable to that specific bug.

February 02, 2018
Sergei Trofimovich a.k.a. slyfox (homepage, bugs)
Linker script weird tricks or EFI on ia64 (February 02, 2018, 00:00 UTC)

One of the Gentoo bugs was about the Gentoo install CD not being able to boot Itanium boxes.

The bug

The symptom is EFI loader error:

Loading.: Gentoo Linux
ImageAddress: pointer is outside of image
ImageAddress: pointer is outside of image
LoadPe: Section 1 was not loaded
Load of Gentoo Linux failed: Not Found
Paused - press any key to continue

which looks like a simple error if you have the hardware and source code of the loader that prints the error.

I had neither.

My plan was to extend the ski emulator to be able to handle early EFI stuff. I started reading docs on early Itanium boot life: on SALs (system abstraction layers), PALs (platform abstraction layers), monarch processors, rendezvous protocols to handle MCAs (machine check aborts), and the expected system state when hand-off to EFI happens.

But the SAL spec touches EFI only at the SAL->EFI interface boundary. I dreaded opening the EFI spec before finishing the SAL paper.

How ia64 actually boots

But recently I got access to a real management processor (sometimes also called MP, BMC (baseboard management controller), or iLO (integrated lights-out)) on an rx3600 machine, and I gave up on extending ski for a while.

The very first thing I attempted was to get to the serial console of the operating system from iLO. It happened to be not as straightforward as passing console=ttyS0 to the kernel. I ended up learning a bit about the EFI shell, ia64-specific serial port numbering, and elilo.

Eventually I was able to boot an arbitrary kernel directly from the EFI shell and get to the kernel’s console! I needed that as a fallback in case I screwed up the default boot loader, the boot loader config, or the default boot options, as I don’t have physical access to the ia64 machine.

Here is an example of booting from CD and controlling it from iLO virtual serial console:

fs0:\efi\boot\bootia64.efi -i gentoo.igz gentoo initrd=gentoo.igz root=/dev/ram0 init=/linuxrc dokeymap looptype=squashfs loop=/image.squashfs cdroot console=ttyS1,115200n8

Here fs0 is a FAT file system (on CD, but it could be on HDD) with EFI applications. bootia64.efi is a sys-boot/elilo application in disguise. console=ttyS1,115200n8 matches the EFI setup of the second serial console.

After a successful boot from CD of a known good old kernel (from 2009), I was not afraid of messing with the main HDD install or even upgrading to brand new kernels. I even managed to squash another ia64-specific kernel and GCC bug! I’ll try to write another post about that.

This poking also helped me understand the boot process in more detail. ia64 boots in the following way:

  1. SAL: when machine powers on it executes SAL (and PAL) code to initialize System and Processors. This code is not easily updateable and usually stays the same across OS updates.
  2. EFI: Then SAL hands control to EFI (also not easily updateable) where I can interactively pick boot device or even get EFI shell to run EFI applications (programs for EFI environment).
  3. Bootloader (EFI application): Normally EFI is set up to start boot loader after a few seconds. It’s up to the user (finally!) to provide a bootloader implementation. Gentoo ia64 handbook suggests sys-boot/elilo package.
  4. OS kernel: bootloader finds OS kernel, loads it and hands control off to kernel.

Searching for the clues

Our problem happened at stage 3, where EFI failed to load the elilo application.

Gentoo builds .iso images automatically and continuously. Here you can find ia64 isos. Older disks are available on actual builder machines.

As elilo used to work on images from 2008 and even 2016 (!), I went through a few autobuilt ISO images, collected a few working and non-working samples, and started comparing them.

I extracted elilo.efi files from 3 disks:

  • elilo-2014-works.efi: good, picked from machine I have access to
  • elilo-2016-works.efi: good, picked from 2016 install CD
  • elilo-2018-does-not-work.efi: bad, freshly built on modern software

Normally I would start by running readelf -a on each executable and diffing for suspicious changes. The files, however, are not ELFs:

$ file *.efi
elilo-2014-works.efi:         PE32+ executable (EFI application) Intel Itanium (stripped to external PDB), for MS Windows
elilo-2016-works.efi:         PE32 executable (EFI application) Intel Itanium (stripped to external PDB), for MS Windows
elilo-2018-does-not-work.efi: PE32+ executable (EFI application) Intel Itanium (stripped to external PDB), for MS Windows

One of them is not even PE32+ but still happens to boot.

Binutils has a more generic readelf -a equivalent: objdump -x. Comparing the two good files:

$ objdump -x elilo-2014-works.efi > 2014.good
$ objdump -x elilo-2016-works.efi > 2016.good
--- 2014.good 2018-01-27 23:34:10.118197637 +0000
+++ 2016.good 2018-01-27 23:34:23.590191456 +0000
@@ -2,2 +2,2 @@
-elilo-2014-works.efi: file format pei-ia64
-elilo-2014-works.efi
+elilo-2016-works.efi: file format pei-ia64
+elilo-2016-works.efi
@@ -6 +6 @@
-start address 0x0000000000043a20
+start address 0x000000000003a6a0
@@ -14,2 +14,2 @@
-Time/Date Tue Jun 24 22:05:17 2014
-Magic 020b (PE32+)
+Time/Date Mon Jan 9 21:18:46 2006
+Magic 010b (PE32)
@@ -17,3 +17,3 @@
-MinorLinkerVersion 23
-SizeOfCode 00036e00
-SizeOfInitializedData 00020800
+MinorLinkerVersion 56
+SizeOfCode 0002e000
+SizeOfInitializedData 00028a00
@@ -21 +21 @@
-AddressOfEntryPoint 0000000000043a20
+AddressOfEntryPoint 000000000003a6a0
@@ -34,2 +34,2 @@
-SizeOfHeaders 000002c0
-CheckSum 00067705
+SizeOfHeaders 00000400
+CheckSum 00069054

There is a lot of oddness going on here: the file on the 2016 live CD is actually from 2006, and it’s actually older than the file from 2014. It has a different PE type and as a result a different file alignment. Thus I discarded elilo-2016-works.efi as too old.

Comparing bad/good:

$ objdump -x elilo-2014-works.efi > 2014.good
$ objdump -x elilo-2018-does-not-work.efi > 2018.bad
$ diff -U0 2014.good 2018.bad
--- 2014.good 2018-01-27 23:42:58.355002114 +0000
+++ 2018.bad 2018-01-27 23:43:02.042000991 +0000
@@ -2,2 +2,2 @@
-elilo-2014-works.efi: file format pei-ia64
-elilo-2014-works.efi
+elilo-2018-does-not-work.efi: file format pei-ia64
+elilo-2018-does-not-work.efi
@@ -6 +6 @@
-start address 0x0000000000043a20
+start address 0x0000000000046d80
@@ -14 +14 @@
-Time/Date Tue Jun 24 22:05:17 2014
+Time/Date Thu Jan 1 01:00:00 1970
@@ -17,3 +17,3 @@
-MinorLinkerVersion 23
-SizeOfCode 00036e00
-SizeOfInitializedData 00020800
+MinorLinkerVersion 29
+SizeOfCode 0003a200
+SizeOfInitializedData 00020e00
@@ -21,2 +21,2 @@
-AddressOfEntryPoint 0000000000043a20
-BaseOfCode 0000000000001000
+AddressOfEntryPoint 0000000000046d80
+BaseOfCode 0000000000000000
@@ -33 +33 @@
-SizeOfImage 0005c000
+SizeOfImage 0005f000
@@ -35 +35 @@
-CheckSum 00067705
+CheckSum 0005f6a3
@@ -51 +51 @@
-Entry 5 0000000000058000 0000000c Base Relocation Directory [.reloc]
+Entry 5 000000000005b000 0000000c Base Relocation Directory [.reloc]
@@ -66,3 +66,3 @@
-Virtual Address: 00043a20 Chunk size 12 (0xc) Number of fixups 2
- reloc 0 offset 0 [43a20] DIR64
- reloc 1 offset 8 [43a28] DIR64
+Virtual Address: 00046d80 Chunk size 12 (0xc) Number of fixups 2
+ reloc 0 offset 0 [46d80] DIR64
+ reloc 1 offset 8 [46d88] DIR64
...
@@ -87,571 +87,585 @@
-[ 0](sec 3)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x0000000000006a04 edd30_guid
-[ 1](sec 2)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x00000000000001f8 done_fixups
...
-[570](sec 2)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000110 Optind
+[ 0](sec 3)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x0000000000006ccc edd30_guid
+[ 1](sec 2)(fl 0x00)(ty 0)(scl 3) (nx 0) 0x0000000000000208 done_fixups
...
+[584](sec 2)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000110 Optind

This looks better. A few notable things:

  • Time/Date field is not initialized for the bad file: sane date vs. all zeros
  • BaseOfCode looks uninitialized: 0x1000 vs. 0x0000
  • MinorLinkerVersion says good file was built with binutils-2.23, bad file was built with binutils-2.29
  • File has only 2 relocations: Number of fixups 2. We could check them manually!
  • And a lot of symbols: 570 vs. 584

I tried to build elilo with binutils-2.25: it produced the same binary as elilo-2018-does-not-work.efi.

My only clue was that BaseOfCode is zero. It felt like something that used to reside in the first page, before the code section, no longer does. What could it be?

GNU binutils linker scripts

Time to look at how the decision is made about what to put into the first page at link time!

The build process of elilo.efi is truly unusual. Let’s run emerge -1 sys-boot/elilo and check which commands are executed to yield it:

# emerge -1 sys-boot/elilo
...
make -j1 ... ARCH=ia64
...
ia64-unknown-linux-gnu-gcc \
-I. -I. -I/usr/include/efi -I/usr/include/efi/ia64 -I/usr/include/efi/protocol -I./efi110 \
-O2 -fno-stack-protector -fno-strict-aliasing -fpic -fshort-wchar \
-Wall \
-DENABLE_MACHINE_SPECIFIC_NETCONFIG -DCONFIG_LOCALFS -DCONFIG_NETFS -DCONFIG_CHOOSER_SIMPLE \
-DCONFIG_CHOOSER_TEXTMENU \
-frename-registers -mfixed-range=f32-f127 \
-DCONFIG_ia64 \
-c glue_netfs.c -o glue_netfs.o
ia64-unknown-linux-gnu-ld \
-nostdlib -znocombreloc \
-T /usr/lib/elf_ia64_efi.lds \
-shared -Bsymbolic \
-L/usr/lib -L/usr/lib \
/usr/lib/crt0-efi-ia64.o elilo.o getopt.o strops.o loader.o fileops.o util.o vars.o alloc.o \
chooser.o config.o initrd.o alternate.o bootparams.o gunzip.o console.o fs/fs.o choosers/choosers.o \
devschemes/devschemes.o ia64/sysdeps.o glue_localfs.o glue_netfs.o \
\
-o elilo.so \
\
-lefi -lgnuefi \
/usr/lib/gcc/ia64-unknown-linux-gnu/7.2.0/libgcc.a
ia64-unknown-linux-gnu-ld: warning: creating a DT_TEXTREL in a shared object.
objcopy -j .text -j .sdata -j .data -j .dynamic -j .dynsym -j .rel \
-j .rela -j .reloc --target=efi-app-ia64 elilo.so elilo.efi
>>> Source compiled.

Here we see the following steps:

  • With the exception of -frename-registers -mfixed-range=f32-f127, the process of building an EFI application is almost like building a normal shared library.
  • Non-standard /usr/lib/elf_ia64_efi.lds linker script is used.
  • objcopy --target=efi-app-ia64 is used to create a PE32+ file out of ELF64. Very dark magic.

Luckily gnu-efi and elilo have superb documentation (10 pages). All the obscure corners are explained in every detail. Part 2: Inner Workings is a short description of how every dynamic linker works, with a light touch of ELF -> PE32+ conversion. I wish I had seen this doc years ago :)

Let’s look at the elf_ia64_efi.lds linker script to get a full understanding of where every byte comes from when elilo.so is being linked (sourceforge viewer):

OUTPUT_FORMAT("elf64-ia64-little")
OUTPUT_ARCH(ia64)
ENTRY(_start_plabel)
SECTIONS
{
. = 0;
ImageBase = .;
.hash : { *(.hash) }/* this MUST come first! */
. = ALIGN(4096);
.text :
{
_text = .;
*(.text)
*(.text.*)
*(.gnu.linkonce.t.*)
. = ALIGN(16);
}
_etext = .;
_text_size = . - _text;
. = ALIGN(4096);
__gp = ALIGN (8) + 0x200000;
.sdata :
{
_data = .;
*(.got.plt)
*(.got)
*(.srodata)
*(.sdata)
*(.sbss)
*(.scommon)
}
. = ALIGN(4096);
.data :
{
*(.rodata*)
*(.ctors)
*(.data*)
*(.gnu.linkonce.d*)
*(.plabel)/* data whose relocs we want to ignore */
/* the EFI loader doesn't seem to like a .bss section, so we stick
it all into .data: */
*(.dynbss)
*(.bss)
*(COMMON)
}
.note.gnu.build-id : { *(.note.gnu.build-id) }
. = ALIGN(4096);
.dynamic : { *(.dynamic) }
. = ALIGN(4096);
.rela :
{
*(.rela.text)
*(.rela.data*)
*(.rela.sdata)
*(.rela.got)
*(.rela.gnu.linkonce.d*)
*(.rela.stab)
*(.rela.ctors)
}
_edata = .;
_data_size = . - _etext;
. = ALIGN(4096);
.reloc :/* This is the PECOFF .reloc section! */
{
*(.reloc)
}
. = ALIGN(4096);
.dynsym : { *(.dynsym) }
. = ALIGN(4096);
.dynstr : { *(.dynstr) }
/DISCARD/ :
{
*(.rela.plabel)
*(.rela.reloc)
*(.IA_64.unwind*)
*(.IA64.unwind*)
}
}

Even though the file is very big, it’s easy to read. A linker script defines symbols (as symbol = expression, where . (dot) means “current address”) and output sections (as .output-section : { expressions }) in terms of input sections.

Here is what the linker script tries to achieve:

  • . = 0; sets the load address to 0. It does not really matter: at load time the module will be relocated anyway.
  • ImageBase = . means that a new symbol ImageBase is created and pointed at the current address: the very start of the whole output, as nothing has been collected yet.
  • .hash : { *(.hash) } means to collect all DT_HASH input sections into the output DT_HASH section. DT_HASH defines a mandatory section for fast symbol lookup; the loader uses it to resolve a symbol name to its type and offset in the file.
  • . = ALIGN(4096) forces the linker to pad the output (with zeros) up to a 4K boundary (EFI defines the page size as 4K).
  • Only then comes the .text : … section. BaseOfCode is the very first byte of the .text section.
  • Other things go below.

GNU hash mystery

Later, objcopy is used to produce the final (PE32+) binary by copying the whitelisted sections passed via -j:

objcopy -j .text -j .sdata -j .data -j .dynamic -j .dynsym -j .rel \
        -j .rela -j .reloc --target=efi-app-ia64 elilo.so elilo.efi

While objcopy does not copy the .hash section into the final binary, its mere presence in the elilo.so file changes the .text offset: the linker already allocated space for it in elilo.so and resolved other relocations taking that offset into account.

So why did the offset disappear? Simple! Because Gentoo has not generated .hash sections since 2014! .gnu.hash (DT_GNU_HASH) is used instead. DT_GNU_HASH was added to binutils/glibc around 2006 as an optional mechanism to speed up dynamic linking and dynamic loading.

But the linker script does not deal with .gnu.hash sections!

It’s easy to mimic handling of both section types:

--- a/gnuefi/elf_ia64_efi.lds
+++ b/gnuefi/elf_ia64_efi.lds
@@ -7,2 +7,3 @@ SECTIONS
ImageBase = .;
.hash : { *(.hash) } /* this MUST come first! */
+ .gnu.hash : { *(.gnu.hash) }

This fix alone was enough to restore elilo.efi! A few other architectures did not handle it either. See full upstream fix.

Breakage mechanics

But why does it matter? What does it mean to drop the .hash section completely? The PE format does not have a .hash equivalent.

Let’s inspect what actually changes in the final elilo.efi before and after the patch:

$ objdump -x elilo.efi.no.gnu.hash > elilo.efi.no.gnu.hash.od
$ objdump -x elilo.efi.gnu.hash > elilo.efi.gnu.hash.od
--- elilo.efi.no.gnu.hash.od 2018-01-29 23:05:25.776000000 +0000
+++ elilo.efi.gnu.hash.od 2018-01-29 23:05:31.700000000 +0000
@@ -2,2 +2,2 @@
-elilo.efi.no.gnu.hash: file format pei-ia64
-elilo.efi.no.gnu.hash
+elilo.efi.gnu.hash: file format pei-ia64
+elilo.efi.gnu.hash
@@ -6 +6 @@
-start address 0x0000000000046d80
+start address 0x0000000000047d80
@@ -21,2 +21,2 @@
-AddressOfEntryPoint 0000000000046d80
-BaseOfCode 0000000000000000
+AddressOfEntryPoint 0000000000047d80
+BaseOfCode 0000000000001000
@@ -33 +33 @@
-SizeOfImage 0005f000
+SizeOfImage 00060000
...
@@ -633,39 +633,38 @@
-[546](sec 1)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000000 ImageBase
-[547](sec 3)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000007560 Udp4ServiceBindingProtocol
...
-[584](sec 2)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000110 Optind
+[546](sec 3)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000007560 Udp4ServiceBindingProtocol
...
+[583](sec 2)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000110 Optind
...

Here the internal ImageBase symbol became an external symbol! This means EFI would have to resolve ImageBase at application startup. That’s why we saw a loader error (as opposed to elilo.efi runtime errors or kernel boot errors).

The ELF spec says shared objects and dynamic executables are required to have a .hash (or .gnu.hash) section to be valid executables, but the linker script did not maintain this requirement and all hell broke loose.

Perhaps the linker could be tweaked to report a warning when the symbol table is missing from the output file. But as you can see, linker scripts are much more powerful than just implementing a fixed output format.

Fun details

gnu-efi has a lot of other jewels!

For example, the entry point has to point to an ia64 function descriptor (FDESCR). An FDESCR is a pair of pointers: a pointer to the code section (the actual entry point) and the value of the gp register (the global pointer, the base pointer used by PIC code).

These two pointers are both absolute addresses, but an EFI application needs to be relocatable (loadable at different addresses).

The entry point FDESCR therefore needs to be relocated by the EFI loader.

How would you inject an ia64 relocation for two 64-bit pointers into the PE32+ format? gnu-efi does a very crazy thing (even more crazy than relying on objcopy to Just Work): it injects a PE32+ relocation directly into the ia64 ELF code! That’s how it does the trick (the snippet below is the very tail of the crt0-efi-ia64.S file):

// PE32+ wants a PLABEL, not the code address of the entry point:
.align 16
.global _start_plabel
.section .plabel, "a"
_start_plabel:
data8 _start
data8 __gp
// hand-craft a .reloc section for the plabel:
#define IMAGE_REL_BASED_DIR64 10
.section .reloc, "a"
data4 _start_plabel // Page RVA
data4 12 // Block Size (2*4+2*2)
data2 (IMAGE_REL_BASED_DIR64<<12) + 0 // reloc for plabel's entry point
data2 (IMAGE_REL_BASED_DIR64<<12) + 8 // reloc for plabel's global pointer

This code generates two sections:

  • .plabel with the not-yet-relocated _start and __gp pointers (this crafts the FDESCR itself)
  • .reloc with synthesised raw bytes that look like a PE32+ relocation (decoded in the sketch after this list). Its format is slightly more complicated and is defined by the PE32+ spec:
    • [4 bytes] 32-bit offset to some blob where relocations are to be applied, in our case it’s _start_plabel
    • [4 bytes] full size of this relocation entry
    • [2 bytes * N relocations] list of tuples: (4 bits for relocation type, 12-bit offset to data to relocate)
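
To make the layout concrete, here is a small Python sketch (my own illustration, not part of gnu-efi) that packs and then decodes a relocation block shaped exactly like the hand-crafted one above:

import struct

# Pack a block like elilo's: page RVA of _start_plabel, block size 12,
# then two IMAGE_REL_BASED_DIR64 (type 10) entries at offsets 0 and 8.
blob = struct.pack("<IIHH", 0x46D80, 12, (10 << 12) | 0, (10 << 12) | 8)

rva, size = struct.unpack_from("<II", blob, 0)
print("page RVA %#x, block size %d" % (rva, size))
for off in range(8, size, 2):
    (entry,) = struct.unpack_from("<H", blob, off)
    rtype, roff = entry >> 12, entry & 0xFFF
    print("  type %d (DIR64), fixup at %#x" % (rtype, rva + roff))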

Here we have two relocations that add ImageBase: one to _start and one to __gp. And it’s precisely these two relocations that the EFI loader reported as invalid:

ImageAddress: pointer is outside of image
ImageAddress: pointer is outside of image
LoadPe: Section 1 was not loaded

Before the .gnu.hash fix, ImageBase (ImageAddress in EFI terminology) was indeed pointing somewhere else.

How about searching the internets for the source of this EFI loader error? TianoCore has one hit, in commented-out code:

//...
Base = EfiLdrPeCoffImageAddress (Image, (UINTN)Section->VirtualAddress);
End = EfiLdrPeCoffImageAddress (Image, (UINTN)(Section->VirtualAddress + Section->Misc.VirtualSize));
if (EFI_ERROR(Status) || !Base || !End) {
// DEBUG((D_LOAD|D_ERROR, "LoadPe: Section %d was not loaded\n", Index));
PrintHeader ('L');
return EFI_LOAD_ERROR;
}
//...

Not very useful but still fun :)

Parting words

Itanium was the first system EFI targeted, before EFI was morphed into UEFI. Surprisingly, I had managed to ignore (U)EFI-based boot on modern machines until now, and this bug was my first experience dealing with it. And it was not too bad! :)

I found out a few things along the way:

  • ia64 EFI is a rich interface for the OS and for iLO to control the machine remotely
  • Linker scripts can do a lot more than I realized:
    • merge sections
    • discard sections
    • inject symbols
    • handle alignments
    • rename sections
  • With certain care objcopy can convert binaries across different object file formats. In this case ELF64 to PE32+
  • Assembler allows you to inject absolutely arbitrary sections into object files. Just make sure you handle those in your linker script.
  • The modern GNU toolchain can still breathe life into decade-old hardware and teach us a trick or two.
  • objdump -x is a cool equivalent of readelf -a.

Have fun!

January 30, 2018
FOSDEM is near (January 30, 2018, 00:00 UTC)

Excitement is building with FOSDEM 2018 only a few days away. There are now 14 current developers and one former developer planning to attend, along with many from the Gentoo community.

This year one Gentoo-related talk, titled Unix? Windows? Gentoo!, will be given by Michael Haubenwallner (haubi) of the Gentoo Prefix project.

Two more non-Gentoo related talks will be given by current Gentoo developers:

If you attend, don’t miss out on your opportunity to build the Gentoo web-of-trust by bringing a valid governmental ID, a printed key fingerprint, and the various UIDs you want certified. See you at FOSDEM!

January 24, 2018
Sven Vermeulen a.k.a. swift (homepage, bugs)
Documenting a rule (January 24, 2018, 19:40 UTC)

In the first post I talked about why configuration documentation is important. In the second post I looked into a good structure for configuration documentation of a technological service, and ended with an XCCDF template in which this documentation can be structured.

The next step is to document the rules themselves, i.e. the actual content of a configuration baseline.

January 23, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
Evaluating ScyllaDB for production 1/2 (January 23, 2018, 17:31 UTC)

I have recently been conducting a quite deep evaluation of ScyllaDB to find out if we could benefit from this database in some of our intensive and latency critical data streams and jobs.

I’ll try to share this great experience within two posts:

  1. The first one (the one you’re reading) will walk through how to prepare yourself for a successful Proof Of Concept based evaluation with the help of the ScyllaDB team.
  2. The second post will cover the technical aspects and details of the POC I’ve conducted with the various approaches I’ve followed to find the most optimal solution.

But let’s start with how I got into this in the first place…


Selecting ScyllaDB

I got interested in ScyllaDB because of its philosophy and engagement, and I quickly got into it by becoming a modest contributor and its Gentoo Linux packager (not in portage yet).

Of course, I didn’t pick an interest in that technology by chance:

We’ve been using MongoDB in (mass) production at work for a very, very long time now. I can easily say we were early MongoDB adopters. But it’s no secret that MongoDB is not suited to every use case, and the Hadoop stack has come on very strong in our data centers since then, with a predominance of Hive for the heavy-duty and data-hungry workflows.

One thing I was never satisfied with in MongoDB was its primary/secondary architecture, which makes you lose write throughput and is even more horrible when you want to set up what they call a “cluster”, which is in fact some mediocre abstraction they add on top of replica sets. To say the least, it is inefficient and cumbersome to operate and maintain.

So I obviously had Cassandra on my radar for a long time, but I was pushed back by its Java stack, heap size and silly tuning… Also, coming from the versatile MongoDB world, Cassandra’s CQL limitations looked dreadful at that time…

The day I found myself on ScyllaDB’s webpage and read their promises, I was sure to be challenging our current use cases with this interesting sea monster.


Setting up a POC with the people at ScyllaDB

Through my contributions around packaging ScyllaDB for Gentoo Linux, I got to know a bit about the people behind the technology. They got interested in why I was packaging it in the first place, and when I explained my not-so-secret goal of challenging our production data workflows using Scylla, they told me that they would love to help!

I was a bit surprised at first, because this was the first time I had ever seen real engagement by the people behind a technology in someone else’s POC.

Their pitch is simple: they will help (for free) anyone conducting a serious POC to make sure that the outcome, and the comprehension behind it, is the best possible. This is very mature reasoning to me, because it is easy to make false assumptions and conclude badly when testing a technology you don’t know, even more so when your use cases are complex and your expectations are very high, like ours.

Still, to my current knowledge, they’re the only ones in the data industry to have this kind of logic in place since the start. So I wanted to take this chance to thank them again for this!

The POC includes:

  • no bullshit, simple tech-to-tech relationship
  • a private slack channel with multiple ScyllaDB’s engineers
  • video calls to introduce ourselves and discuss our progress later on
  • help in schema design and logic
  • fast answers to every question you have
  • detailed explanations on the internals of the technology
  • hardware sizing help and validation
  • funny comments and French jokes (ok, not suitable for everyone)

Lessons for a successful POC

As I said before, you’ve got to be serious in your approach to make sure your POC will be efficient and will lead to an unbiased and fair conclusion.

This is a list of the main things I consider important to have prepared before you start.

Have some background

Make sure to read some literature to have the key concepts and words in mind before you go. It is even more important if, like me, you do not come from the Cassandra world.

I found that the Cassandra: The Definitive Guide book from O’Reilly is a great read. Also, make sure to go through ScyllaDB’s documentation.

Work with a shared reference document

Make sure you share with the ScyllaDB guys a clear and detailed document explaining exactly what you’re trying to achieve and how you are doing it today (if you plan on migrating like we did).

I made a Google document for this because it felt the easiest. This document will be updated as you go and will serve as a reference for everyone participating in the POC.

This shared reference document is very important, so if you don’t know how to construct it or what to put in it, here is how I structured it:

  1. Who’s participating at <your company>
    • photo + name + speciality
  2. Who’s participating at ScyllaDB
  3. POC hardware
    • if you have your own bare metal machines you want to run your POC on, give every detail about their number and specs
    • if not, explain how you plan to set up and run your scylla cluster
  4. Reference infrastructure
    • give every detail on the technologies and on the hardware of the servers that are currently responsible for running your workflows
    • explain your clusters and their speciality
  5. Use case #1 : <name>
    • Context
      • give context about your use case by explaining it without tech words, think from the business / user point of view
    • Current implementations
      • that’s where you get technical
      • technology names and where they come into play in your current stack
      • insightful data volumes and cardinality
      • current schema models
    • Workload related to this use case
      • queries per second per data source / type
      • peak hours or no peak hours?
      • criticality
    • Questions we want to answer to
      • remember, the NoSQL world is led by query-based-modeling schema design logic; cassandra/scylla is no exception
      • write down the real questions you want your data model(s) to be able to answer (see the sketch after this list)
      • group them and rate them by importance
    • Validated models
      • this one comes during the POC when you have settled on the data models
      • write them down, and explain them or relate them to the questions they answer
      • copy/paste some code showcasing how to work with them
    • Code examples
      • depending on the complexity of your use case, you may have multiple constraints or ways to compare your current implementation with your POC
      • try to explain what you test and copy/paste the best code you came up with to validate each point
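
To make the query-based-modeling point concrete, here is a minimal sketch (all names are invented, and it assumes the python-cassandra driver and an existing poc keyspace) where the question “what are the latest events for a given device?” dictates the table layout:

import uuid
from cassandra.cluster import Cluster

session = Cluster(['scylla1']).connect('poc')
# The partition key and clustering order come straight from the question:
# partition by device, newest events first.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_device (
        device_id uuid,
        ts timestamp,
        payload text,
        PRIMARY KEY (device_id, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
device_id = uuid.uuid4()  # placeholder device
rows = session.execute(
    "SELECT ts, payload FROM events_by_device WHERE device_id = %s LIMIT 10",
    (device_id,))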

Have monitoring in place

ScyllaDB provides a monitoring platform based on Docker, Prometheus and Grafana that is efficient and simple to set up. I strongly recommend that you set it up, as it provides valuable insights almost immediately, and on an ongoing basis.

Also, you should strive to give the ScyllaDB guys access to your monitoring, if that’s possible for you. They will provide you with a fixed IP which you can authorize to access your Grafana dashboards, so they can have a look at the performance of your POC cluster as you go. You’ll learn a great deal about ScyllaDB’s internals by sharing with them.

Know when to stop

The main trap in a POC is to work without boundaries. Since you’re looking for the best of what you can get out of a technology, you’ll be tempted to refine indefinitely.

So it is good to have at least an idea of the minimal figures you’d like to reach to be satisfied with your tests. You can always push a bit further, but not for too long!

Plan some high availability tests

Even if you first came to ScyllaDB for its speed, make sure to test its high availability capabilities based on your experience.

Most importantly, make sure you test it within your code base and guidelines. How will your code react to and handle a failure, partial or total? I was very surprised and saddened to discover so little literature on the subject in the Cassandra community.

POC != production

Remember that even when everything is right on paper, production load will have its share of surprises and unexpected behaviours. So keep a good deal of flexibility in your design and your capacity planning to absorb them.

Make time

Our POC lasted almost 5 months instead of the estimated 3, mostly because of my agenda’s unwillingness to cooperate…

As you can imagine, this interruption was not always optimal, for either me or the ScyllaDB guys, but they were kind enough not to complain about it. So depending on how thorough you plan to be, make sure you set aside time to match your degree of demands. The reference document is also helpful for getting back up to speed.


Feedback for the ScyllaDB guys

Here are the main points I noted during the POC that the guys from ScyllaDB could improve on.

They are subjective, of course, but it’s important to give feedback, so here it goes. I’m fully aware that everyone is trying to improve, so I’m not pointing any fingers at all.

I shared those comments already with them and they acknowledged them very well.

More video meetings on start

When starting the POC, try to have some pre-scheduled video meetings to set it in motion right away. This will provide a good pace, as well as make sure that everyone is on the same page.

Make a POC kick starter questionnaire

Having a minimal plan to follow, with some key points to set up just like the ones I explained before, would help. Maybe also a minimal questionnaire to make sure that the key aspects and figures have been given some thought from the start. This will raise awareness of the real questions the POC aims to answer.

To put it more simply: some minimal formalism helps to check off the key aspects and questions.

Develop a higher client driver expertise

This one was the most painful for me, and is likely to be painful for anyone who, like me, does not come from the Cassandra world.

Finding good and strong code examples and guidelines on the client side was hard, and that’s where I felt the most alone. This was not pleasant, because a technology is ultimately validated through its usage, which means on the client side.

Most of my tests used Python and the python-cassandra driver, so I had tons of questions about it with no sticking answers. The same thing went for the spark-cassandra-connector when using Scala, where some key (undocumented) configuration options can change the shape of your results drastically (more details in the next post).

High Availability guidelines and examples

This one still strikes me as the most awkward thing about the Cassandra community. I literally struggled to find clear and detailed explanations about how to handle failure more or less gracefully with the python driver (or any other driver).

This is kind of a disappointment to me for a technology that positions itself as highly available… I’ll get into more details about it in the next post.
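
For what it’s worth, here is the kind of minimal sketch I would have loved to find (assuming the python driver 3.x; the host names, DC name and table are the invented ones from above), combining the main knobs for surviving node failures:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import (DCAwareRoundRobinPolicy,
                                DowngradingConsistencyRetryPolicy)

# Prefer the local DC, and trade consistency for availability on failure
# instead of raising an exception to the caller.
profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='DC1'),
    retry_policy=DowngradingConsistencyRetryPolicy(),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    request_timeout=2.0,
)
cluster = Cluster(['scylla1', 'scylla2'],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect('poc')
# Dead hosts are routed around by the load-balancing policy.
rows = session.execute('SELECT * FROM events_by_device LIMIT 10')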

A clearer sizing documentation

Even if there will never be a magic formula, some rules of thumb exist for sizing your hardware for ScyllaDB. They should be written down more clearly, maybe in dedicated documentation (the sizing guide is labeled as an admin guide at the time of writing).

Some examples:

  • RAM per core? What is a core? How does it relate to a shard?
  • Maximal disk / RAM ratio?
  • Multiple SSDs vs one NVMe?
  • Hardware RAID vs software RAID? Is a RAID controller needed at all?

Maybe even provide a complete bare-metal example from two different vendors such as DELL and HP.

What’s next?

In the next post, I’ll get into more details on the POC itself and the technical learnings we found along the way. This will lead to the final conclusion and the next move we committed ourselves to.

Marek Szuba a.k.a. marecki (homepage, bugs)
Randomness in virtual machines (January 23, 2018, 15:47 UTC)

I always felt that entropy available to the operating system must be affected by running said operating system in a virtual environment – after all, unpredictable phenomena used to feed the entropy pool are commonly based on hardware and in a VM most hardware either is simulated or has the hypervisor mediate access to it. While looking for something tangentially related to the subject, I have recently stumbled upon a paper commissioned by the German Federal Office for Information Security which covers this subject, with particular emphasis on entropy sources used by the standard Linux random-number generator (i.e. what feeds /dev/random and /dev/urandom), in extreme detail:

https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Publikationen/Studien/ZufallinVMS/Randomness-in-VMs.pdf?__blob=publicationFile&v=3

Personally I have found this paper very interesting, but since not everyone can stomach a 142-page technical document, here are some of the main points made by its authors regarding entropy.

  1. As a reminder – the RNG in recent versions of Linux uses five sources of random noise, three of which do not require dedicated hardware (emulated or otherwise): Human-Interface Devices, rotational block devices, and interrupts.
  2. Running in a virtual machine does not affect entropy sources implemented purely in hardware or purely in software. However, it does affect hybrid sources – which in the case of Linux essentially means all of the default ones.
  3. Surprisingly enough, virtualisation seems to have no negative effect on the quality of delivered entropy. This is at least in part due to additional CPU execution-time jitter introduced by the hypervisor compensating for increased predictability of certain emulated devices.
  4. On the other hand, virtualisation can strongly affect the quantity of produced entropy. More on this below.
  5. A low quantity of available entropy means, among other things, that it takes virtual machines visibly longer to bring /dev/urandom to a usable state. This is a problem if /dev/urandom is used by services started at boot time, because they can end up initialised with low-quality random data.

Why exactly is the quantity of entropy low in virtual machines? The problem is that in a lot of configurations, only the last of the three standard noise sources will be active. On the one hand, even physical servers tend to be fairly inactive on the HID front. On the other, the block-device source does nothing unless directly backed by a rotational device – which has been becoming less and less likely, especially when we talk about large cloud providers who, chances are, hold your persistent storage on distributed networked file systems which are miles away from actual hard drives. This leaves interrupts as the only available noise source. Now take a machine configured this way and have it run a VPN endpoint, a HTTPS server, a DNSSEC-enabled DNS server… You get the idea.
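
You can watch this starvation for yourself on any Linux guest; a trivial check (the paths are the standard Linux procfs entries, values are in bits):

#!/usr/bin/env python3
# Print the size of the kernel's input entropy pool and how much of it
# is currently available.
for name in ("poolsize", "entropy_avail"):
    with open("/proc/sys/kernel/random/" + name) as f:
        print(name, f.read().strip())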

But wait, it gets worse. Chances are many devices in your VM, especially ones like network cards which are expected to be particularly active and therefore the main source of interrupts, are in fact paravirtualised rather than fully emulated. Will such devices still generate enough interrupts? That depends on the underlying hypervisor, or to be precise on the paravirtualisation framework it uses. The BSI paper discusses this for KVM/QEMU, Oracle VirtualBox, Microsoft Hyper-V, and VMWare ESXi:

  • the former two use the VirtIO framework, which is integrated with the Linux interrupt-handling code;
  • VMWare drivers trigger the interrupt handler similarly to physical devices;
  • Hyper-V drivers use a dedicated communication channel called VMBus which does not invoke the LRNG interrupt handler. This means it is entirely feasible for a Linux Hyper-V guest to have all noise sources disabled, and re-enabling the interrupt one comes at the cost of a performance loss caused by using emulated rather than paravirtualised devices.

All in all, the paper in question has surprised me with the news of unchanged quality of entropy in VMs and confirmed my suspicions regarding its quantity. It also briefly mentions (which is how I ended up finding it) the way I typically work around this problem, i.e. by employing VirtIO-RNG – a paravirtualised hardware RNG in KVM/VirtualBox guests which interfaces with a source of entropy (typically /dev/random, although other options are possible too) on the host. Combine that with haveged on the host, sprinkle a bit of rate limiting on top (I typically set it to 1 kB/s per guest, even though in theory haveged is capable of acquiring entropy at a rate several orders of magnitude higher) and chances are you will never have to worry about a lack of entropy in your virtual machines again.

January 19, 2018
Greg KH a.k.a. gregkh (homepage, bugs)

I keep getting a lot of private emails about my previous post about the latest status of the Linux kernel patches to resolve both the Meltdown and Spectre issues.

These questions all seem to break down into two different categories: “What is the state of the Spectre kernel patches?” and “Is my machine vulnerable?”

State of the kernel patches

As always, lwn.net covers the technical details about the latest state of the kernel patches to resolve the Spectre issues, so please go read that to find out that type of information.

And yes, it is behind a paywall for a few more weeks. You should be buying a subscription to get this type of thing!

Is my machine vulnerable?

For this question, there’s now a very simple answer: you can check it yourself.

Just run the following command at a terminal window to determine what the state of your machine is:

$ grep . /sys/devices/system/cpu/vulnerabilities/*

On my laptop, right now, this shows:

$ grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable: Minimal generic ASM retpoline

This shows that my kernel is properly mitigating the Meltdown problem by implementing PTI (Page Table Isolation), and that my system is still vulnerable to Spectre variant 1 and is trying really hard to resolve variant 2, but is not quite there (because I did not build my kernel with a compiler that properly supports the retpoline feature).

If your kernel does not have that sysfs directory or files, then obviously there is a problem and you need to upgrade your kernel!

Some “enterprise” distributions did not backport the changes for this reporting, so if you are running one of those types of kernels, go bug the vendor to fix that, you really want a unified way of knowing the state of your system.

Note that right now, these files are only valid for the x86-64 based kernels, all other processor types will show something like “Not affected”. As everyone knows, that is not true for the Spectre issues, except for very old CPUs, so it’s a huge hint that your kernel really is not up to date yet. Give it a few months for all other processor types to catch up and implement the correct kernel hooks to properly report this.
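
If you would rather check this programmatically than eyeball the grep output, here is a small sketch of my own (not from this post) that reads the same sysfs files and complains when they are missing:

#!/usr/bin/env python3
# My own sketch, not from the post: report mitigation status from the
# sysfs files discussed above, and complain if they are absent.
import glob, os, sys

files = sorted(glob.glob('/sys/devices/system/cpu/vulnerabilities/*'))
if not files:
    sys.exit('no vulnerabilities directory: your kernel needs an upgrade!')
for path in files:
    with open(path) as f:
        print(os.path.basename(path) + ': ' + f.read().strip())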

And yes, I need to go find a working microcode update to fix my laptop’s CPU so that it is not vulnerable against Spectre…

January 18, 2018
Matthew Thode a.k.a. prometheanfire (homepage, bugs)
Undervolting your CPU for fun and profit (January 18, 2018, 06:00 UTC)

Disclaimer

I'm not responsible if you ruin your system; this guide functions as documentation for future me. While doing this wrong should just cause MCEs or system resets, it might also be able to ruin things in a more permanent way.

Why do this

It generally lowers temperatures, and if you hit thermal throttling (as happens with my 5th Generation X1 Carbon) you may throttle less often or at a higher frequency.

Prework

Repaste first: it's generally easier to do and can result in better gains (and it's all about them gains). For instance, my Skull Canyon NUC was thermally resetting itself when stress testing. Repasting lowered the max temps by 20°C.

Undervolting - The How

I based my info on https://github.com/mihic/linux-intel-undervolt, which seems to work on my Intel Kaby Lake based laptop.

The values are written to the MSR registers using the wrmsr binary from msr-tools.

The following python3 snippet is how I got the values to write.

# for a -110mv offset I run the following
format(0xFFE00000 & ((round(-110 * 1.024) & 0xFFF) << 21), '08x')  # -> 'f1e00000'

What you are actually writing is laid out as follows, with the plane index being 0, 1 or 2 for cpu, gpu or cache respectively.

constant | plane index | constant | write (1) / read (0) | offset
   80000 |           0 |        1 |                    1 | F1E00000

So we end up with 0x80000011F1E00000 to write to MSR register 0x150.
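
To make the bit layout above reproducible, here is a small helper of my own (the function name is made up; only the encoding comes from the table and snippet above):

# My own sketch of the encoding described above, not from the original guide.
def undervolt_msr_value(plane, mv_offset, write=True):
    # plane: 0 = cpu core, 1 = gpu, 2 = cache; mv_offset e.g. -110
    offset = 0xFFE00000 & ((round(mv_offset * 1.024) & 0xFFF) << 21)
    return (0x80000 << 44) | (plane << 40) | (1 << 36) | (int(write) << 32) | offset

# Reproduces the values used below:
assert undervolt_msr_value(0, -110) == 0x80000011F1E00000  # cpu core
assert undervolt_msr_value(2, -110) == 0x80000211F1E00000  # cpu cache
assert undervolt_msr_value(1, -90)  == 0x80000111F4800000  # gpu core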

Undervolting - The Integration

I made a script called undervolt.sh in /opt/bin/, where I place my custom system scripts (make sure it's executable).

#!/usr/bin/env sh
/usr/sbin/wrmsr 0x150 0x80000011F1E00000  # cpu core  -110
/usr/sbin/wrmsr 0x150 0x80000211F1E00000  # cpu cache -110
/usr/sbin/wrmsr 0x150 0x80000111F4800000  # gpu core  -90

I then made a custom systemd unit in /etc/systemd/system called undervolt.service with the following content.

[Unit]
Description=Undervolt Service

[Service]
ExecStart=/opt/bin/undervolt.sh

[Install]
WantedBy=multi-user.target

I then ran systemctl enable undervolt.service so I'll get my settings on boot.

The following script was also placed in /usr/lib/systemd/system-sleep/undervolt.sh to get the settings after recovering from sleep, as they are lost at that point.

#!/bin/sh
if [ "${1}" == "post" ]; then
  /opt/bin/undervolt.sh
fi

That's it, everything after this point is what I went through in testing.

Testing for stability

I stress tested with mprime in stress mode for the CPU and glxmark or gputest for the GPU. I only recorded the results for the CPU, as that's what I care about.

No changes:

  • 15-17 W package TDP
  • 15 W core TDP
  • 3.06-3.22 GHz

50 mV undervolt:

  • 15-16 W package TDP
  • 14-15 W core TDP
  • 3.22-3.43 GHz

70 mV undervolt:

  • 15-16 W package TDP
  • 14 W core TDP
  • 3.22-3.43 GHz

90 mV undervolt:

  • 15 W package TDP
  • 13-14 W core TDP
  • 3.32-3.54 GHz

100 mV undervolt:

  • 15 W package TDP
  • 13 W core TDP
  • 3.42-3.67 GHz

110 mV undervolt:

  • 15 W package TDP
  • 13 W core TDP
  • 3.42-3.67 GHz

(I started getting graphics artifacts here, so I switched the GPU undervolt to 100 mV at this point.)

115 mV undervolt:

  • 15 W package TDP
  • 13 W core TDP
  • 3.48-3.72 GHz

115 mV undervolt with repaste:

  • 15-16 W package TDP
  • 14-15 W core TDP
  • 3.63-3.81 GHz

120 mV undervolt with repaste:

  • 15-16 W package TDP
  • 14-15 W core TDP
  • 3.63-3.81 GHz

I decided on a 110 mV CPU and 90 mV GPU undervolt for stability; with proper ventilation I get about 3.7-3.8 GHz out of a max of 3.9 GHz.

Other notes: undervolting made the CPU's max speed less spiky.

January 17, 2018
Sven Vermeulen a.k.a. swift (homepage, bugs)
Structuring a configuration baseline (January 17, 2018, 08:10 UTC)

A good configuration baseline has a readable structure that allows all stakeholders to quickly see if the baseline is complete, as well as find a particular setting regardless of the technology. In this blog post, I'll cover a possible structure of the baseline which attempts to be sufficiently complete and technology agnostic.

If you haven't read the blog post on documenting configuration changes, it might be a good idea to do so as it declares the scope of configuration baselines and why I think XCCDF is a good match for this.

January 08, 2018
Andreas K. Hüttel a.k.a. dilfridge (homepage, bugs)
FOSDEM 2018 talk: Perl in the Physics Lab (January 08, 2018, 17:07 UTC)

FOSDEM 2018, the "Free and Open Source Developers' European Meeting", takes place 3-4 February at the Université libre de Bruxelles, Campus Solbosch, Brussels - and our measurement control software Lab::Measurement will be presented there in the Perl devroom! As all of FOSDEM, the talk will also be streamed live and archived; more details on this will follow later. Here's the abstract:

Perl in the Physics Lab
Andreas K. Hüttel
    Track: Perl Programming Languages devroom
    Room: K.4.601
    Day: Sunday
    Start: 11:00
    End: 11:40 
Let's visit our university lab. We work on low-temperature nanophysics and transport spectroscopy, typically measuring current through experimental chip structures. That involves cooling and temperature control, dc voltage sources, multimeters, high-frequency sources, superconducting magnets, and a lot more fun equipment. A desktop computer controls the experiment and records and evaluates data.

Some people (like me) want to use Linux, some want to use Windows. Not everyone knows Perl, not everyone has equal programming skills, not everyone knows equally much about the measurement hardware. I'm going to present our solution for this, Lab::Measurement. We implement a layer structure of Perl modules, all the way from the hardware access and the implementation of device-specific command sets to high level measurement control with live plotting and metadata tracking. Current work focuses on a port of the entire stack to Moose, to simplify and improve the code.

FOSDEM

January 07, 2018
Sven Vermeulen a.k.a. swift (homepage, bugs)
Documenting configuration changes (January 07, 2018, 20:20 UTC)

IT teams are continuously under pressure to set up and maintain infrastructure services quickly, efficiently and securely. As an infrastructure architect, my main concerns are related to the manageability of these services and the secure setup. And within those realms, a properly documented configuration setup is in my opinion very crucial.

In this blog post series, I'm going to look into using the Extensible Configuration Checklist Description Format (XCCDF) as the way to document these. This first post is an introduction to XCCDF functionally, and what I position it for.

January 06, 2018
Greg KH a.k.a. gregkh (homepage, bugs)
Meltdown and Spectre Linux kernel status (January 06, 2018, 12:36 UTC)

By now, everyone knows that something “big” just got announced regarding computer security. Heck, when the Daily Mail does a report on it, you know something is bad…

Anyway, I’m not going to go into the details about the problems being reported, other than to point you at the wonderfully written Project Zero paper on the issues involved here. They should just give out the 2018 Pwnie award right now, it’s that amazingly good.

If you do want technical details for how we are resolving those issues in the kernel, see the always awesome lwn.net writeup for the details.

Also, here’s a good summary of lots of other postings that includes announcements from various vendors.

As for how this was all handled by the companies involved, well this could be described as a textbook example of how NOT to interact with the Linux kernel community properly. The people and companies involved know what happened, and I’m sure it will all come out eventually, but right now we need to focus on fixing the issues involved, and not pointing blame, no matter how much we want to.

What you can do right now

If your Linux systems are running a normal Linux distribution, go update your kernel. They should all have the updates in them already. And then keep updating them over the next few weeks, we are still working out lots of corner case bugs given that the testing involved here is complex given the huge variety of systems and workloads this affects. If your distro does not have kernel updates, then I strongly suggest changing distros right now.

However there are lots of systems out there that are not running “normal” Linux distributions for various reasons (rumor has it that it is way more than the “traditional” corporate distros). They rely on the LTS kernel updates, or the normal stable kernel updates, or they are in-house franken-kernels. For those people here’s the status of what is going on regarding all of this mess in the upstream kernels you can use.

Meltdown – x86

Right now, Linus’s kernel tree contains all of the fixes we currently know about to handle the Meltdown vulnerability for the x86 architecture. Go enable the CONFIG_PAGE_TABLE_ISOLATION kernel build option, and rebuild and reboot and all should be fine.

However, Linus’s tree is currently at 4.15-rc6 + some outstanding patches. 4.15-rc7 should be out tomorrow, with those outstanding patches to resolve some issues, but most people do not run a -rc kernel in a “normal” environment.

Because of this, the x86 kernel developers have done a wonderful job in their development of the page table isolation code, so much so that the backport to the latest stable kernel, 4.14, has been almost trivial for me to do. This means that the latest 4.14 release (4.14.12 at this moment in time) is what you should be running. 4.14.13 will be out in a few more days, with some additional fixes in it that are needed for some systems that have boot-time problems with 4.14.12 (it's an obvious problem: if it does not boot, just add the patches now queued up).

I would personally like to thank Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, Peter Zijlstra, Josh Poimboeuf, Juergen Gross, and Linus Torvalds for all of the work they have done in getting these fixes developed and merged upstream in a form that was so easy for me to consume to allow the stable releases to work properly. Without that effort, I don’t even want to think about what would have happened.

For the older long term stable (LTS) kernels, I have leaned heavily on the wonderful work of Hugh Dickins, Dave Hansen, Jiri Kosina and Borislav Petkov to bring the same functionality to the 4.4 and 4.9 stable kernel trees. I also had immense help from Guenter Roeck, Kees Cook, Jamie Iles, and many others in tracking down nasty bugs and missing patches. I want to also call out David Woodhouse, Eduardo Valentin, Laura Abbott, and Rik van Riel for their help with the backporting and integration as well, their help was essential in numerous tricky places.

These LTS kernels also have the CONFIG_PAGE_TABLE_ISOLATION build option that should be enabled to get complete protection.
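
If you want to verify that your running kernel was actually built with that option, here is a small sketch of my own (assuming CONFIG_IKCONFIG_PROC is enabled so that /proc/config.gz exists; otherwise look at /boot/config-$(uname -r) instead):

#!/usr/bin/env python3
# My own sketch: check whether the running kernel was built with
# CONFIG_PAGE_TABLE_ISOLATION. Assumes /proc/config.gz is available
# (CONFIG_IKCONFIG_PROC); otherwise check /boot/config-$(uname -r).
import gzip

with gzip.open('/proc/config.gz', 'rt') as f:
    flags = [line.strip() for line in f if 'PAGE_TABLE_ISOLATION' in line]
print('\n'.join(flags) or 'option not present in this kernel config')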

As this backport is very different from the mainline version that is in 4.14 and 4.15, there are different bugs happening; right now we know of some VDSO issues that are getting worked on, and some odd virtual machine setups are reporting strange errors, but those are the minority at the moment and should not stop you from upgrading right now. If you do run into problems with these releases, please let us know on the stable kernel mailing list.

If you rely on any other kernel tree other than 4.4, 4.9, or 4.14 right now, and you do not have a distribution supporting you, you are out of luck. The lack of patches to resolve the Meltdown problem is minor compared to the hundreds of other known exploits and bugs that your kernel version currently contains. You need to worry about that more than anything else at this moment, and get your systems up to date first.

Also, go yell at the people who forced you to run an obsoleted and insecure kernel version, they are the ones that need to learn that doing so is a totally reckless act.

Meltdown – ARM64

Right now the ARM64 set of patches for the Meltdown issue are not merged into Linus’s tree. They are staged and ready to be merged into 4.16-rc1 once 4.15 is released in a few weeks. Because these patches are not in a released kernel from Linus yet, I can not backport them into the stable kernel releases (hey, we have rules for a reason…)

Due to them not being in a released kernel, if you rely on ARM64 for your systems (i.e. Android), I point you at the Android Common Kernel tree. All of the ARM64 fixes have been merged into the 3.18, 4.4, and 4.9 branches as of this point in time.

I would strongly recommend just tracking those branches, as more fixes get added over time due to testing and as things catch up with what gets merged into the upstream kernel releases, especially since I do not know when these patches will land in the stable and LTS kernel releases.

For the 4.4 and 4.9 LTS kernels, odds are these patches will never get merged into them, due to the large number of prerequisite patches required. All of those prerequisite patches have been long merged and tested in the android-common kernels, so I think it is a better idea to just rely on those kernel branches instead of the LTS release for ARM systems at this point in time.

Also note, I merge all of the LTS kernel updates into those branches usually within a day or so of being released, so you should be following those branches no matter what, to ensure your ARM systems are up to date and secure.

Spectre

Now things get “interesting”…

Again, if you are running a distro kernel, you might be covered as some of the distros have merged various patches into them that they claim mitigate most of the problems here. I suggest updating and testing for yourself to see if you are worried about this attack vector.

For upstream, well, the status is there is no fixes merged into any upstream tree for these types of issues yet. There are numerous patches floating around on the different mailing lists that are proposing solutions for how to resolve them, but they are under heavy development, some of the patch series do not even build or apply to any known trees, the series conflict with each other, and it’s a general mess.

This is due to the fact that the Spectre issues were the last to be addressed by the kernel developers. All of us were working on the Meltdown issue, and we had no real information on exactly what the Spectre problem was at all, and what patches were floating around were in even worse shape than what has been publicly posted.

Because of all of this, it is going to take us in the kernel community a few weeks to resolve these issues and get them merged upstream. The fixes are coming in to various subsystems all over the kernel, and will be collected and released in the stable kernel updates as they are merged, so again, you are best off just staying up to date with either your distribution’s kernel releases, or the LTS and stable kernel releases.

It’s not the best news, I know, but it’s reality. If it’s any consolation, it does not seem that any other operating system has full solutions for these issues either, the whole industry is in the same boat right now, and we just need to wait and let the developers solve the problem as quickly as they can.

The proposed solutions are not trivial, but some of them are amazingly good. The Retpoline post from Paul Turner is an example of some of the new concepts being created to help resolve these issues. This is going to be an area of lots of research over the next years to come up with ways to mitigate the potential problems involved in hardware that wants to try to predict the future before it happens.

Other arches

Right now, I have not seen patches for any other architectures than x86 and arm64. There are rumors of patches floating around in some of the enterprise distributions for some of the other processor types, and hopefully they will surface in the weeks to come to get merged properly upstream. I have no idea when that will happen; if you are dependent on a specific architecture, I suggest asking on the arch-specific mailing list about this to get a straight answer.

Conclusion

Again, update your kernels, don’t delay, and don’t stop. The updates to resolve these problems will be continuing to come for a long period of time. Also, there are still lots of other bugs and security issues being resolved in the stable and LTS kernel releases that are totally independent of these types of issues, so keeping up to date is always a good idea.

Right now, there are a lot of very overworked, grumpy, sleepless, and just generally pissed off kernel developers working as hard as they can to resolve these issues that they themselves did not cause at all. Please be considerate of their situation right now. They need all the love and support and free supply of their favorite beverage that we can provide them to ensure that we all end up with fixed systems as soon as possible.

January 03, 2018
FOSDEM 2018 (January 03, 2018, 00:00 UTC)

FOSDEM 2018 logo

Put on your cow bells and follow the herd of Gentoo developers to Université libre de Bruxelles in Brussels, Belgium. This year FOSDEM 2018 will be held on February 3rd and 4th.

Our developers will be ready to candidly greet all open source enthusiasts at the Gentoo stand in building K. Visit this year’s wiki page to see which developer will be running the stand during the different visitation time slots. So far seven developers have specified their attendance, with most likely more on the way!

Unlike past years, Gentoo will not be hosting a keysigning party, however participants are encouraged to exchange and verify OpenPGP key data to continue building the Gentoo Web of Trust. See the wiki article for more details.

January 01, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)

Escaping Docker container using waitid() – CVE-2017-5123 (twistlock.com)

December 28, 2017
Michael Palimaka a.k.a. kensington (homepage, bugs)
Helper scripts for CMake-based ebuilds (December 28, 2017, 14:13 UTC)

I recently pushed two new scripts to the KDE overlay to help with CMake-based ebuild creation and maintenance.

cmake_dep_check inspects all CMakeLists.txt files under the current directory and maps find_package calls to Gentoo packages.

cmake_ebuild_check builds on cmake_dep_check and produces a list of CMake dependencies that are not present in the ebuild (and optionally vice versa).
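
The core idea behind the first script is simple enough to sketch. The following is a toy approximation of my own, not the actual script (the mapping to Gentoo packages would be an additional lookup table on top):

#!/usr/bin/env python3
# Toy approximation (mine, not the actual script): list find_package()
# calls in all CMakeLists.txt files under the current directory.
import pathlib, re

pattern = re.compile(r'find_package\s*\(\s*([A-Za-z0-9_]+)', re.IGNORECASE)
for path in pathlib.Path('.').rglob('CMakeLists.txt'):
    for name in pattern.findall(path.read_text(errors='replace')):
        print('%s: %s' % (path, name))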

If you find them useful or run into any issues, feel free to drop me a line.

December 23, 2017
Sergei Trofimovich a.k.a. slyfox (homepage, bugs)
GCC bit shift handling on ia64 final part (December 23, 2017, 00:00 UTC)

In my previous post I tracked down a gcc RTL optimisation bug around the instruction combiner but did not get to the very bottom of it. The gcc bug report has advanced a bit and a few things got clearer.

To restate the problem:


GCC bit shift handling on ia64 (December 23, 2017, 00:00 UTC)

Over the past week I’ve spent quite some time poking at toolchain packages in Gentoo and fixing a fresh batch of bugs. The bugs popped up after the recent introduction of gcc profiles that enable PIE (position independent executables) and SSP (stack smash protected executables) by default in Gentoo. One would imagine SSP is more invasive as it changes the on-stack layout of things. But somehow PIE yields bugs in every direction you throw it at:

  • [not really from last week] i386 static binaries SIGSEGV at startup: bug (TL;DR: calling vdso-implemented SYSENTER syscall entry point needs TLS to be already initialised in glibc. In this case brk() syscall was called before TLS was initialised.)
  • ia64’s glibc miscompiles librt.so: bug (Tl;DR: pie-by-default gcc tricked glibc into believing binutils supports IFUNC on ia64. It does not, exactly as in sparc bug 7 years ago.)
  • sparc’s glibc produces broken static binaries on pie-by-default gcc: bug. (TL;DR: C startup files for sparc were PIC-unfriendly)
  • powerpc’s gcc-7.2.0 and older produces broken binaries: bug (TL;DR: two bugs: binutils generates relocations to non-existent symbols, gcc hardcoded wrong C startup files to be used)
  • mips64 gcc fails to produce glibc that binutils can link without assert() failures (TODO: debug and fix)
  • m68k glibc fails to compile (TODO: debug and fix)
  • sh4 gcc fails to compile by crashing with stack smash detection (TODO: debug and fix)
  • more I forgot about

These are only toolchain bugs, and I have not yet tried things on arm. I’m sure other packages will uncover more interesting corner cases once we fix the toolchain.

I won’t write about any of the above here but it’s a nice reference point if you want to backport something to older toolchains to get PIE/SSP work better. The bugs might be a good reference point on how to debug that kind of bugs.

Today’s story will be about gcc code generation bug on ia64 that has nothing to do with PIE or SSP.

Bug

On #gentoo-ia64 Jason Duerstock asked if openssl fails tests on modern gcc-7.2.0. It used to pass on gcc-6.4.0 for him.

Having successfully dealt with the IFUNC bug recently I was confident things were mostly working in ia64 land and gave it a try.

openssl DES tests were indeed failing, claiming that the encryption/decryption roundtrip did not yield the original data, with errors like:

des_ede3_cbc_encrypt decrypt error ...

What does DES test do?

The test lives at https://github.com/openssl/openssl/blob/OpenSSL_1_0_2-stable/crypto/des/destest.c#L421 and looks like this (simplified):

printf("Doing cbcm\n");
DES_set_key_checked(&cbc_key, &ks);
DES_set_key_checked(&cbc2_key, &ks2);
DES_set_key_checked(&cbc3_key, &ks3);
i = strlen((char *)cbc_data) + 1;
DES_ede3_cbcm_encrypt(cbc_data, cbc_out, 16L, &ks, &ks2, &ks3, &iv3, &iv2, DES_ENCRYPT);
DES_ede3_cbcm_encrypt(&cbc_data[16], &cbc_out[16], i - 16, &ks, &ks2, &ks3, &iv3, &iv2, DES_ENCRYPT);
DES_ede3_cbcm_encrypt(cbc_out, cbc_in, i, &ks, &ks2, &ks3, &iv3, &iv2, DES_DECRYPT);
if (memcmp(cbc_in, cbc_data, strlen((char *)cbc_data) + 1) != 0) {
    printf("des_ede3_cbcm_encrypt decrypt error\n");
    // ...
}

Test encrypts a few bytes of data in two takes (1: encrypt 16 bytes, 2: encrypt rest) and then decrypts the result in one go. It works on static data. Very straightforward.

Fixing variables

My first suspect was a gcc change, as 7.2.0 fails and 6.4.0 works, so I tried to build openssl with optimisations completely disabled:

# CFLAGS=-O1 FEATURES=test emerge -1 =dev-libs/openssl-1.0.2n
...
des_ede3_cbc_encrypt decrypt error ...

# CFLAGS=-O0 FEATURES=test emerge -1 =dev-libs/openssl-1.0.2n
...
OK!

Hah! At least -O0 seems to work. It means it will be easier to cross-check which optimisation pass affects code generation and find out if it’s an openssl bug or something else.

Setting up A/B test

Debugging cryptographic algorithms like DES is very easy from this standpoint because they don’t need any external state: no services running, no files created, input data is not randomized. They are a pure function of the input bit stream(s).

I unpacked single openssl source tree and started building it in two directories: one with -O0 optimisations, another with -O1:

~/openssl/openssl-1.0.2n-O0/
    `openssl-1.0.2n-.ia64/
        `crypto/des/destest.c (file-1)
        ...
~/openssl/openssl-1.0.2n-O1/
    `openssl-1.0.2n-.ia64/
        `crypto/des/destest.c (symlink to file-1)
        ...

Then I started sprinkling printf() statements in destest.c and other crypto/des/ files, then ran destest and diffed the text stdouts to find where exactly the difference first appears.

Relatively quickly I nailed it down to the following trace:

unsigned int u;
...
# define D_ENCRYPT(LL,R,S) {\
LOAD_DATA_tmp(R,S,u,t,E0,E1); \
t=ROTATE(t,4); \
LL^=\
DES_SPtrans[0][(u>> 2L)&0x3f]^ \
DES_SPtrans[2][(u>>10L)&0x3f]^ \
DES_SPtrans[4][(u>>18L)&0x3f]^ \
DES_SPtrans[6][(u>>26L)&0x3f]^ \
DES_SPtrans[1][(t>> 2L)&0x3f]^ \
DES_SPtrans[3][(t>>10L)&0x3f]^ \
DES_SPtrans[5][(t>>18L)&0x3f]^ \
DES_SPtrans[7][(t>>26L)&0x3f]; }

See an error? Me neither.

Minimizing test

printf() debugging suggested DES_SPtrans[6][(u>>26L)&0x3f] returns different data in the -O0 and -O1 cases. Namely, the following expression changed:

  • -O0: (u>>26L)&0x3f yielded 0x33
  • -O1: (u>>26L)&0x3f yielded 0x173

Note how it’s logically infeasible to get anything more than 0x3f from that expression. And yet here we are with our 0x173 value.

I spent some time deleting lines one by one from all the macros as long as the result kept producing the diff. Removing lines is safe in most of DES code because all the test does is flipping a few bits and rotating them within a single unsigned int u local variable.

After a while I came up with the following minimal reproducer:

#include <stdio.h>

typedef unsigned int u32;

u32 bug (u32 * result) __attribute__((noinline));
u32 bug (u32 * result)
{
    // non-static and volatile to inhibit constant propagation
    volatile u32 ss = 0xFFFFffff;
    volatile u32 d = 0xEEEEeeee;
    u32 tt = d & 0x00800000;
    u32 r = tt << 8;
    // rotate
    r = (r >> 31)
      | (r << 1);
    u32 u = r ^ ss;
    u32 off = u >> 1;
    // seemingly unrelated but bug-triggering side-effect
    *result = tt;
    return off;
}

int main() {
    u32 l;
    u32 off = bug(&l);
    printf("off>>: %08x\n", off);
    return 0;
}
$ gcc -O0 a.c -o a-O0 && ./a-O0 > o0
$ gcc -O1 a.c -o a-O1 && ./a-O1 > o1
$ diff -U0 o0 o1

-off>>: 7fffffff
+off>>: ffffffff

The test itself is very straightforward: it does only 32-bit arithmetic on unsigned int values and prints the result. This is a very fragile test: if you try to remove seemingly unrelated code like *result = tt; the bug will disappear.

What gcc actually does

I tried to look at the assembly code. If I could spot an obvious problem I could inspect the intermediate gcc steps to get an idea which pass precisely introduces the wrong code. Despite this being Itanium, the code is not that complicated (I added detailed comments):

Dump of assembler code for function bug:
mov r14=r12 ; r12: register holding stack pointer
; r14 = r12 (new temporary variable)
mov r15=-1 ; r15=0xFFFFffff ('ss' variable)
st4.rel [r14]=r15,4 ; store 'ss' at [r14], then r14 += 4
movl r15=0xffffffffeeeeeeee ; r15=0xEEEEeeee ('d' variable)
st4.rel [r14]=r15 ; write 'd' on stack address 'r14'
ld4.acq r15=[r14] ; and quickly read 'd' back into 'r15' :)
movl r14=0x800000 ; r14=0x00800000
and r15=r14,r15 ; u32 tt = d & 0x00800000;
; doing u32 r = tt << 8;
dep.z r14=r15,8,24 ; "deposit" 24 bits from r15 into r14
; starting at offset 8 and zeroing the rest.
; Or in pictures (with offsets):
; r15 = 0xAABBCCDD11223344
; r14 = 0x0000000022334400
ld4.acq r8=[r12] ; read 'ss' into r8
st4 [r32]=r15 ; *result = tt
; // rotate
; r = (r >> 31)
; | (r << 1);
mix4.r r14=r14,r14 ; This one is tricky: mix duplicates the lower 32
; bits into the lower and upper 32 bits of r14.
; Or in pictures:
; r14 = 0xAABBCCDDF1223344 ->
; r14 = 0xF1223344F1223344
shr.u r14=r14,31 ; And shift right for 31 bit (with zero padding)
; r14 = 0x00000001E2446688
; Note how '1' is in a position of 33-th bit
xor r8=r8,r14 ; u32 u = r^ss;
extr.u r8=r8,1,32 ; u32 off = u >> 1
; "extract" 32 bits at offset 1 from r8 and put them to r8
; Or in pictures:
; r8 = 0x0000000100000000 -> (note how all lower 32 bits are 0)
; r8 = 0x0000000080000000
br.ret.sptk.many b0 ; return r8 as a computation result

TL;DR: extr.u r8=r8,1,32 extracts 32 bits at offset 1 from the 64-bit register r8 and puts them back into r8. The problem is that for u32 off = u >> 1 to work correctly it should extract only 31 bits, not 32.

If I patch the assembly file to contain extr.u r8=r8,1,31 the sample works correctly.
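
The difference between the two encodings is easy to demonstrate outside the compiler. Here is a sketch of mine (the register value is what u = r^ss looks like in the 64-bit register for the reproducer's inputs):

# extr.u semantics modelled in Python (my sketch, not gcc code):
def extr_u(reg, pos, length):
    # extract `length` bits of `reg` starting at bit `pos`, zero-extended
    return (reg >> pos) & ((1 << length) - 1)

r8 = 0x1FFFFFFFE               # u = r^ss as it sits in the 64-bit register
print(hex(extr_u(r8, 1, 31)))  # 0x7fffffff -- the correct u32 'u >> 1'
print(hex(extr_u(r8, 1, 32)))  # 0xffffffff -- what the miscompiled code returns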

At this point I shared my sample with Jason and filed a gcc bug as it was clear that the compiler does something very unexpected. Jason pulled in James and they produced a working gcc patch for me while I was having dinner(!) :)

gcc passes

What surprised me is the fact that gcc recognised the “rotate” pattern and used a clever mix4.r/shr.u trick to achieve the bit rotation effect.
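
The trick itself is easy to model. This is my own illustration of the transformation (not gcc internals): duplicating the lower 32 bits into both halves of a 64-bit register turns a 32-bit rotate into a plain right shift.

# Modelling the mix4.r/shr.u rotate trick (my illustration):
def rotl32(x, n):
    # the pattern gcc recognised in 'r = (r >> 31) | (r << 1)' (n == 1)
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

x = 0x80000000
doubled = (x << 32) | x  # mix4.r r14=r14,r14: lower 32 bits in both halves
assert (doubled >> (32 - 1)) & 0xFFFFFFFF == rotl32(x, 1)  # shr.u by 31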

Internally gcc has a few frameworks to perform optimisations:

  • high-level, (relatively) target-independent “tree” optimisations (working on the GIMPLE representation)
  • low-level, target-specific “register” optimisations (working on the RTL representation)

GIMPLE passes run before RTL passes. Let’s check how many passes are run on our small sample by using -fdump-tree-all and -fdump-rtl-all:

$ ia64-unknown-linux-gnu-gcc -O1 -fdump-tree-all -fdump-rtl-all -S a.c && ls -1 | nl
1   a.c
2   a.c.001t.tu
3   a.c.002t.class
4   a.c.003t.original
5   a.c.004t.gimple
6   a.c.006t.omplower
7   a.c.007t.lower
8   a.c.010t.eh
9   a.c.011t.cfg
10  a.c.012t.ompexp
11  a.c.019t.fixup_cfg1
12  a.c.020t.ssa
13  a.c.022t.nothrow
14  a.c.027t.fixup_cfg3
15  a.c.028t.inline_param1
16  a.c.029t.einline
17  a.c.030t.early_optimizations
18  a.c.031t.objsz1
19  a.c.032t.ccp1
20  a.c.033t.forwprop1
21  a.c.034t.ethread
22  a.c.035t.esra
23  a.c.036t.ealias
24  a.c.037t.fre1
25  a.c.039t.mergephi1
26  a.c.040t.dse1
27  a.c.041t.cddce1
28  a.c.046t.profile_estimate
29  a.c.047t.local-pure-const1
30  a.c.049t.release_ssa
31  a.c.050t.inline_param2
32  a.c.086t.fixup_cfg4
33  a.c.091t.ccp2
34  a.c.094t.backprop
35  a.c.095t.phiprop
36  a.c.096t.forwprop2
37  a.c.097t.objsz2
38  a.c.098t.alias
39  a.c.099t.retslot
40  a.c.100t.fre3
41  a.c.101t.mergephi2
42  a.c.105t.dce2
43  a.c.106t.stdarg
44  a.c.107t.cdce
45  a.c.108t.cselim
46  a.c.109t.copyprop1
47  a.c.110t.ifcombine
48  a.c.111t.mergephi3
49  a.c.112t.phiopt1
50  a.c.114t.ch2
51  a.c.115t.cplxlower1
52  a.c.116t.sra
53  a.c.118t.dom2
54  a.c.120t.phicprop1
55  a.c.121t.dse2
56  a.c.122t.reassoc1
57  a.c.123t.dce3
58  a.c.124t.forwprop3
59  a.c.125t.phiopt2
60  a.c.126t.ccp3
61  a.c.127t.sincos
62  a.c.129t.laddress
63  a.c.130t.lim2
64  a.c.131t.crited1
65  a.c.134t.sink
66  a.c.138t.dce4
67  a.c.139t.fix_loops
68  a.c.167t.no_loop
69  a.c.170t.veclower21
70  a.c.172t.printf-return-value2
71  a.c.173t.reassoc2
72  a.c.174t.slsr
73  a.c.178t.dom3
74  a.c.182t.phicprop2
75  a.c.183t.dse3
76  a.c.184t.cddce3
77  a.c.185t.forwprop4
78  a.c.186t.phiopt3
79  a.c.187t.fab1
80  a.c.191t.dce7
81  a.c.192t.crited2
82  a.c.194t.uncprop1
83  a.c.195t.local-pure-const2
84  a.c.226t.nrv
85  a.c.227t.optimized
86  a.c.229r.expand
87  a.c.230r.vregs
88  a.c.231r.into_cfglayout
89  a.c.232r.jump
90  a.c.233r.subreg1
91  a.c.234r.dfinit
92  a.c.235r.cse1
93  a.c.236r.fwprop1
94  a.c.243r.ce1
95  a.c.244r.reginfo
96  a.c.245r.loop2
97  a.c.246r.loop2_init
98  a.c.247r.loop2_invariant
99  a.c.249r.loop2_doloop
100 a.c.250r.loop2_done
101 a.c.254r.dse1
102 a.c.255r.fwprop2
103 a.c.256r.auto_inc_dec
104 a.c.257r.init-regs
105 a.c.259r.combine
106 a.c.260r.ce2
107 a.c.262r.outof_cfglayout
108 a.c.263r.split1
109 a.c.264r.subreg2
110 a.c.267r.asmcons
111 a.c.271r.ira
112 a.c.272r.reload
113 a.c.273r.postreload
114 a.c.275r.split2
115 a.c.279r.pro_and_epilogue
116 a.c.280r.dse2
117 a.c.282r.jump2
118 a.c.286r.ce3
119 a.c.288r.cprop_hardreg
120 a.c.289r.rtl_dce
121 a.c.290r.bbro
122 a.c.296r.alignments
123 a.c.298r.mach
124 a.c.299r.barriers
125 a.c.303r.shorten
126 a.c.304r.nothrow
127 a.c.306r.final
128 a.c.307r.dfinish
129 a.c.308t.statistics
130 a.s

128 passes! Quite a few. The t letter after the pass number means a “tree” pass, r an “rtl” pass. Note how all “tree” passes precede “rtl” ones.

The last “tree” pass outputs the following:

;; cat a.c.227t.optimized
;; Function bug (bug, funcdef_no=23, decl_uid=2078, cgraph_uid=23, symbol_order=23)

__attribute__((noinline))
bug (u32 * result)
{
  u32 off;
  u32 u;
  u32 r;
  u32 tt;
  volatile u32 d;
  volatile u32 ss;
  unsigned int d.0_1;
  unsigned int ss.1_2;

  <bb 2> [100.00%]:
  ss ={v} 4294967295;
  d ={v} 4008636142;
  d.0_1 ={v} d;
  tt_6 = d.0_1 & 8388608;
  r_7 = tt_6 << 8;
  r_8 = r_7 r>> 31;
  ss.1_2 ={v} ss;
  u_9 = ss.1_2 ^ r_8;
  off_10 = u_9 >> 1;
  *result_11(D) = tt_6;
  return off_10;

}



;; Function main (main, funcdef_no=24, decl_uid=2088, cgraph_uid=24, symbol_order=24) (executed once)

main ()
{
  u32 off;
  u32 l;

  <bb 2> [100.00%]:
  off_3 = bug (&l);
  __printf_chk (1, "off>>: %08x\n", off_3);
  l ={v} {CLOBBER};
  return 0;

}

The code does not differ much from what was originally written because I specifically tried to write something that won’t trigger any high-level transformations.

RTL dumps are even more verbose. I’ll show an incomplete snippet of the bug() function from the a.c.229r.expand file:

;;
;; Full RTL generated for this function:
;;
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v/f:DI 347 [ result ])
(reg:DI 112 in0 [ result ])) "a.c":5 -1
(nil))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:SI 348)
(const_int -1 [0xffffffffffffffff])) "a.c":7 -1
(nil))
(insn 7 6 8 2 (set (mem/v/c:SI (reg/f:DI 335 virtual-stack-vars) [1 ss+0 S4 A128])
(reg:SI 348)) "a.c":7 -1
(nil))
(insn 8 7 9 2 (set (reg:DI 349)
(reg/f:DI 335 virtual-stack-vars)) "a.c":8 -1
(nil))
(insn 9 8 10 2 (set (reg/f:DI 350)
(plus:DI (reg/f:DI 335 virtual-stack-vars)
(const_int 4 [0x4]))) "a.c":8 -1
(nil))
(insn 10 9 11 2 (set (reg:SI 351)
(const_int -286331154 [0xffffffffeeeeeeee])) "a.c":8 -1
(nil))
(insn 11 10 12 2 (set (mem/v/c:SI (reg/f:DI 350) [1 d+0 S4 A32])
(reg:SI 351)) "a.c":8 -1
(nil))
(insn 12 11 13 2 (set (reg:DI 352)
(reg/f:DI 335 virtual-stack-vars)) "a.c":9 -1
(nil))
(insn 13 12 14 2 (set (reg/f:DI 353)
(plus:DI (reg/f:DI 335 virtual-stack-vars)
(const_int 4 [0x4]))) "a.c":9 -1
(nil))
(insn 14 13 15 2 (set (reg:SI 340 [ d.0_1 ])
(mem/v/c:SI (reg/f:DI 353) [1 d+0 S4 A32])) "a.c":9 -1
(nil))
(insn 15 14 16 2 (set (reg:DI 355)
(const_int 8388608 [0x800000])) "a.c":9 -1
(nil))
(insn 16 15 17 2 (set (reg:DI 354)
(and:DI (subreg:DI (reg:SI 340 [ d.0_1 ]) 0)
(reg:DI 355))) "a.c":9 -1
(nil))
(insn 17 16 18 2 (set (reg/v:SI 342 [ tt ])
(subreg:SI (reg:DI 354) 0)) "a.c":9 -1
(nil))
(insn 18 17 19 2 (set (reg/v:SI 343 [ r ])
(ashift:SI (reg/v:SI 342 [ tt ])
(const_int 8 [0x8]))) "a.c":10 -1
(nil))
(insn 19 18 20 2 (set (reg:SI 341 [ ss.1_2 ])
(mem/v/c:SI (reg/f:DI 335 virtual-stack-vars) [1 ss+0 S4 A128])) "a.c":16 -1
(nil))
(insn 20 19 21 2 (set (mem:SI (reg/v/f:DI 347 [ result ]) [1 *result_11(D)+0 S4 A32])
(reg/v:SI 342 [ tt ])) "a.c":20 -1
(nil))
(insn 21 20 22 2 (set (reg:SI 357 [ r ])
(rotate:SI (reg/v:SI 343 [ r ])
(const_int 1 [0x1]))) "a.c":13 -1
(nil))
(insn 22 21 23 2 (set (reg:DI 358)
(xor:DI (subreg:DI (reg:SI 357 [ r ]) 0)
(subreg:DI (reg:SI 341 [ ss.1_2 ]) 0))) "a.c":16 -1
(nil))
(insn 23 22 24 2 (set (reg:DI 359)
(zero_extract:DI (reg:DI 358)
(const_int 31 [0x1f])
(const_int 1 [0x1]))) "a.c":17 -1
(nil))
(insn 24 23 25 2 (set (subreg:DI (reg:SI 356 [ off ]) 0)
(reg:DI 359)) "a.c":17 -1
(nil))
(insn 25 24 29 2 (set (reg/v:SI 346 [ <retval> ])
(reg:SI 356 [ off ])) "a.c":21 -1
(nil))
(insn 29 25 30 2 (set (reg/i:SI 8 r8)
(reg/v:SI 346 [ <retval> ])) "a.c":22 -1
(nil))
(insn 30 29 0 2 (use (reg/i:SI 8 r8)) "a.c":22 -1
(nil))

The above is the RTL representation of our bug() function. S-expressions look like machine instructions, but not quite.

For example, the following snippet (a single S-expression) introduces a new virtual register 351 which should receive the literal value 0xeeeeeeee. SI means SImode, or 32-bit signed integer. More on modes is here.

(insn 10 9 11 2 (set (reg:SI 351)
(const_int -286331154 [0xffffffffeeeeeeee])) "a.c":8 -1
(nil))

Or just r351 = 0xeeeeeeee :) Note that it also mentions source file line numbers, which is useful when mapping RTL logs back to source files (and I guess gcc uses the same information to emit dwarf debugging sections).

Another example of RTL instruction:

(insn 22 21 23 2 (set (reg:DI 358)
(xor:DI (subreg:DI (reg:SI 357 [ r ]) 0)
(subreg:DI (reg:SI 341 [ ss.1_2 ]) 0))) "a.c":16 -1
(nil))

Here we see an RTL equivalent of the u_9 = ss.1_2 ^ r_8; GIMPLE code. There is more subtlety here: the xor itself operates on 64-bit integers (DI) that contain 32-bit values in their lower 32 bits (subregs), which I don’t really understand.

The upstream bug attempts to decide which assumption is violated in the generic RTL optimisation pass (or the backend implementation). At the time of writing this post a few candidate patches had been posted to address the bug.

gcc RTL optimisations

I was interested in the optimisation that converted extr.u r8=r8,1,31 (valid) to extr.u r8=r8,1,32 (invalid).

It’s GIMPLE representation is:

// ...
off_10 = u_9 >> 1;
*result_11(D) = tt_6;
return off_10;

Let’s try to find RTL representation of this construct in the very first RTL dump (a.c.229r.expand file):

(insn 23 22 24 2 (set (reg:DI 359)
(zero_extract:DI (reg:DI 358)
(const_int 31 [0x1f])
(const_int 1 [0x1]))) "a.c":17 -1
(nil))
;; ...
(insn 24 23 25 2 (set (subreg:DI (reg:SI 356 [ off ]) 0)
(reg:DI 359)) "a.c":17 -1
(nil))
(insn 25 24 29 2 (set (reg/v:SI 346 [ <retval> ])
(reg:SI 356 [ off ])) "a.c":21 -1
(nil))
(insn 29 25 30 2 (set (reg/i:SI 8 r8)
(reg/v:SI 346 [ <retval> ])) "a.c":22 -1
(nil))
(insn 30 29 0 2 (use (reg/i:SI 8 r8)) "a.c":22 -1
(nil))

Or in a slightly more concise form:

reg359 = zero_extract(reg358, offset=1, length=31)
reg356 = (u32)reg359;
reg346 = (u32)reg356;
reg8   = (u32)reg346;

Very straightforward (and still correct). zero_extract is an RTL virtual operation that will have to be expressed as a real instruction at some point.

In the absence of other optimisations this is done with the following rule (at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/ia64/ia64.md;h=b7cd52ba366ba3d63e98479df5f4be44ffd17ca6;hb=HEAD#l1381)

(define_insn "extzv"
[(set (match_operand:DI 0 "gr_register_operand" "=r")
(zero_extract:DI (match_operand:DI 1 "gr_register_operand" "r")
(match_operand:DI 2 "extr_len_operand" "n")
(match_operand:DI 3 "shift_count_operand" "M")))]
""
"extr.u %0 = %1, %3, %2"
[(set_attr "itanium_class" "ishf")])

In the unoptimised case we see all the intermediate values being stored to the stack and loaded back.

;; gcc -O0 a.c
...
extr.u r15 = r15, 1, 31
;;
st4 [r14] = r15
mov r14 = r2
;;
ld8 r14 = [r14]
adds r16 = -32, r2
;;
ld4 r15 = [r16]
;;
st4 [r14] = r15
adds r14 = -20, r2
;;
ld4 r14 = [r14]
;;
mov r8 = r14
.restore sp
mov r12 = r2
br.ret.sptk.many b0

To get rid of all the needless operations RTL applies an extensive list of optimisations. Each RTL dump contains a summary of the pass’s effect on the file. Let’s look at a.c.232r.jump:

Deleted 2 trivially dead insns
3 basic blocks, 2 edges.

Around the a.c.257r.init-regs pass our RTL representation shrinks to the following:

(insn 23 22 24 2 (set (reg:DI 359)
(zero_extract:DI (reg:DI 358)
(const_int 31 [0x1f])
(const_int 1 [0x1]))) "a.c":17 159 {extzv}
(expr_list:REG_DEAD (reg:DI 358)
(nil)))
(insn 24 23 29 2 (set (subreg:DI (reg:SI 356 [ off ]) 0)
(reg:DI 359)) "a.c":17 6 {movdi_internal}
(expr_list:REG_DEAD (reg:DI 359)
(nil)))
(insn 29 24 30 2 (set (reg/i:SI 8 r8)
(reg:SI 356 [ off ])) "a.c":22 5 {movsi_internal}
(expr_list:REG_DEAD (reg:SI 356 [ off ])
(nil)))
(insn 30 29 0 2 (use (reg/i:SI 8 r8)) "a.c":22 -1
(nil))

Or in a slightly more concise form:

reg359 = zero_extract(reg358, offset=1, length=31)
reg356 = (u32)reg359;
reg8   = (u32)reg356;

Note how reg346 = (u32)reg356; has already gone away.

The most magic happened in the single a.c.259r.combine pass. Here is its report:

;; Function bug (bug, funcdef_no=23, decl_uid=2078, cgraph_uid=23, symbol_order=23)

starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
insn_cost 2: 4
insn_cost 6: 4
insn_cost 32: 4
insn_cost 7: 4
insn_cost 10: 4
insn_cost 11: 4
insn_cost 14: 4
insn_cost 15: 4
insn_cost 16: 4
insn_cost 18: 4
insn_cost 19: 4
insn_cost 20: 4
insn_cost 21: 4
insn_cost 22: 4
insn_cost 23: 4
insn_cost 24: 4
insn_cost 29: 4
insn_cost 30: 0
allowing combination of insns 2 and 20
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 2.
modifying insn i3    20: [in0:DI]=r354:DI#0
      REG_DEAD in0:DI
      REG_DEAD r354:DI
deferring rescan insn with uid = 20.
allowing combination of insns 23 and 24
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 23.
modifying insn i3    24: r356:SI#0=r358:DI 0>>0x1
  REG_DEAD r358:DI
deferring rescan insn with uid = 24.
allowing combination of insns 24 and 29
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 24.
modifying insn i3    29: r8:DI=zero_extract(r358:DI,0x20,0x1)
      REG_DEAD r358:DI
deferring rescan insn with uid = 29.
starting the processing of deferred insns
rescanning insn with uid = 20.
rescanning insn with uid = 29.
ending the processing of deferred insns

The essential output relevant to our instruction is:

allowing combination of insns 23 and 24
allowing combination of insns 24 and 29

This combines three instructions:

reg359 = zero_extract(reg358, offset=1, length=31)
reg356 = (u32)reg359;
reg8   = (u32)reg356;

Into one:

reg8 = zero_extract(reg358, offset=1, length=32)

The problem is in the combiner, which decided the high bit is zero anyway (a false assumption) and applied zero_extract to the full 32 bits. Why exactly it decided the upper 32 bits are zero (they are not) is not clear to me :) It happens somewhere in https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/combine.c;h=31e6a4f68254fab551300252688a52d8c3dcaaa4;hb=HEAD#l2628 where some if() statements take about a page. Hopefully gcc RTL and ia64 maintainers will help and guide us here.

Parting words

  • Having good robust tests in packages is always great, as it helps gain confidence that a package is not completely broken on new target platforms (or toolchain versions).
  • printf() is still a good enough tool to trace compiler bugs :)
  • -fdump-tree-all and -fdump-rtl-all are useful gcc flags to get the idea why optimisations fire (or don’t).
  • gcc performs very unobvious optimisations!

Have fun!


December 22, 2017
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
Talks page (December 22, 2017, 13:39 UTC)

I finally decided to put up a page on this website referencing the talks I’ve been giving over the last few years.

You’ll see that I’m quite obsessed with fault tolerance and scalability designs… I guess I can’t deny it any more 🙂