Welcome to Planet Gentoo, an aggregation of Gentoo-related weblog articles written by Gentoo developers. For a broader range of topics, you might be interested in Gentoo Universe.

Disclaimer:
Views expressed in the content published here do not necessarily represent the views of Gentoo Linux or the Gentoo Foundation.
   
March 20, 2019
Install Gentoo in less than one minute (March 20, 2019, 18:35 UTC)

I’m pretty sure that the title of this post will catch your attention…and/or maybe your curiosity.

Well..this is something I’m doing since years…and since did not cost too much to make it in a public and usable state, I decided to share my work, to help some people to avoid waste of time and to avoid to be angry when your cloud provider does not offer the gentoo image.

So what are the goals of this project?

  1. Install gentoo on cloud providers that do not offer a Gentoo image (e.g Hetzner)
  2. Install gentoo everywhere in few seconds.

To do a fast installation, we need a stage4….but what is exactly a stage4? In this case the stage4 is composed by the official gentoo stage3 plus grub, some more utilities and some file already configured.

So since the stage4 has already everything to complete the installation, we just need to make some replacement (fstab, grub and so on), install grub on the disk………..and…..it’s done (by the auto-installer script)!

At this point I’d expect some people to say….”yeah…it’s so simply and logical…why I didn’t think about that” – Well, I guess that every gentoo user didn’t discover that just after the first installation…so you don’t need to blame yourself 🙂

The technical details are covered by the README in the gentoo-stage4 git repository

As said in the README:

  • If you have any request, feel free to contact me
  • A star on the project will give me the idea of the usage and then the effort to put here.

So what’s more? Just a screenshot of the script in action 🙂

# Gentoo hetzner cloud
# Gentoo stage4
# Gentoo cloud

February 22, 2019
Thomas Raschbacher a.k.a. lordvan (homepage, bugs)
Postgresql major version upgrade (gentoo) (February 22, 2019, 10:37 UTC)

Just did an upgrade from postgres 10.x to 11.x on a test machine..

The guide on the Gentoo Wiki is pretty good, but a few things I forgot at first:

First off when initializing the new cluster with "emerge --config =dev-db/postgresql-11.1" making sure the DB init options are the same as the old cluster. They are stored in /etc/conf.d/postgresql-XX.Y so just make sure PG_INITDB_OPTS collation ,.. match - if not delete the new cluster and re-run emerge --config ;)

The second thing was pg_hba.conf: make sure to re-add extra user/db/connection permissions again (in my case I ran diff and then just copied the old config file as the only difference was the extra permissions I had added)

The third thing was postgresql.conf: here I forgot to make sure listen_addresses and port are the same as in the old config (I did not copy this one as there are a lot more differences here. -- and of course check the rest of the config file too (diff is your friend ;) )

other than that pg_upgrade worked really well for me and it is now up and running agian.

February 20, 2019
Michał Górny a.k.a. mgorny (homepage, bugs)

Traditionally, OpenPGP revocation certificates are used as a last resort. You are expected to generate one for your primary key and keep it in a secure location. If you ever lose the secret portion of the key and are unable to revoke it any other way, you import the revocation certificate and submit the updated key to keyservers. However, there is another interesting use for revocation certificates — revoking shared organization keys.

Let’s take Gentoo, for example. We are using a few keys needed to perform automated signatures on servers. For this reason, the key is especially exposed to attacks and we want to be able to revoke it quickly if the need arises. Now, we really do not want to have every single Infra member hold a copy of the secret primary key. However, we can give Infra members revocation certificates instead. This way, they maintain the possibility of revoking the key without unnecessarily increasing its exposure.

The problem with traditional revocation certificates is that they are supported for the purpose of revoking the primary key only. In our security model, the primary key is well protected, compared to subkeys that are totally exposed. Therefore, it is superfluous to revoke the complete key when only a subkey is compromised. To resolve this limitation, gen-revoke tool was created that can create exported revocation signatures for both the primary key and subkeys.

Technical background

The OpenPGP key (v4, as defined by RFC 4880) consists of a primary key, one or more UIDs and zero or more subkeys. Each of those keys and UIDs can include zero or more signature packets. Those packets bind information to the specific key or UID, and their authenticity is confirmed by a signature made using the secret portion of a primary key.

Signatures made by the key’s owner are called self-signatures. The most basic form of them serve as a binding between the primary key and its subkeys and UIDs. Since both those classes of objects are created independently of the primary key, self-signatures are necessary to distinguish authentic subkeys and UIDs created by the key owner from potential fakes. Appropriately, GnuPG will only accept subkeys and UIDs that have valid self-signature.

One specific type of signatures are revocation signatures. Those signatures indicate that the relevant key, subkey or UID has been revoked. If a revocation signature is found, it takes precedence over any other kinds of signatures and prevents the revoked object from being further used.

Key updates are means of distributing new data associated with the key. What’s important is that during an update the key is not replaced by a new one. Instead, GnuPG collects all the new data (subkeys, UIDs, signatures) and adds it to the local copy of the key. The validity of this data is verified against appropriate signatures. Appropriately, anyone can submit a key update to the keyserver, provided that the new data includes valid signatures. Similarly to local GnuPG instance, the keyserver is going to update its copy of the key rather than replacing it.

Revocation certificates specifically make use of this property. Technically, a revocation certificate is simply an exported form of a revocation signature, signed using the owner’s primary key. As long as it’s not on the key (i.e. GnuPG does not see it), it does not do anything. When it’s imported, GnuPG adds it to the key. Further submissions and exports include it, effectively distributing it to all copies of the key.

gen-revoke builds on this idea. It creates and exports revocation signatures for the primary key and subkeys. Due to implementation limitations (and for better compatibility), rather than exporting the signature alone it exports a minimal copy of the relevant key. This copy can be imported just like any other key export, and it causes the revocation signature to be added to the key. Afterwards, it can be exported and distributed just like a revocation done directly on the key.

Usage

To use the script, you need to have the secret portion of the primary key available, and public encryption keys for all the people who are supposed to obtain a copy of the revocation signatures (recipients).

The script takes at least two parameters: an identifier of the key for which revocation signatures should be created, followed by one or more e-mail addresses of signature recipients. It creates revocation signatures both for the primary key and for all valid subkeys, for all the people specified.

The signatures are written into the current directory as key exports and are encrypted to each specified person. They should be distributed afterwards, and kept securely by all the individuals. If a need to revoke either a subkey or the primary key arises, the first person available can decrypt the signature, import it and send the resulting key to keyservers.

Additionally, each signature includes a comment specifying the person it was created for. This comment will afterwards be displayed by GnuPG if one of the revocation signatures is imported. This provides a clear audit trace as to who revoked the key.

Security considerations

Each of the revocation signatures can be used by an attacker to disable the key in question. The signatures are protected through encryption. Therefore, the system is vulnerable to the key of a single signature owner being compromised.

However, this is considerably safer than the equivalent option of distributing the secret portion of the primary key. In the latter case, the attacker would be able to completely compromise the key and use it for malicious purposes; in the former, it is only capable of revoking the key and therefore causing some frustration. Furthermore, the revocation comment helps identifying the compromised user.

The tradeoff between reliability and security can be adjusted by changing the number of revocation signature holders.

January 31, 2019
Michał Górny a.k.a. mgorny (homepage, bugs)

This article describes the UI deficiency of Evolution mail client that extrapolates the trust of one of OpenPGP key UIDs into the key itself, and reports it along with the (potentially untrusted) primary UID. This creates the possibility of tricking the user into trusting a phished mail via adding a forged UID to a key that has a previously trusted UID.

Continue reading

January 29, 2019
Michał Górny a.k.a. mgorny (homepage, bugs)
Identity with OpenPGP trust model (January 29, 2019, 13:50 UTC)

Let’s say you want to send a confidential message to me, and possibly receive a reply. Through employing asymmetric encryption, you can prevent a third party from reading its contents, even if it can intercept the ciphertext. Through signatures, you can verify the authenticity of the message, and therefore detect any possible tampering. But for all this to work, you need to be able to verify the authenticity of the public keys first. In other words, we need to be able to prevent the aforementioned third party — possibly capable of intercepting your communications and publishing a forged key with my credentials on it — from tricking you into using the wrong key.

This renders key authenticity the fundamental problem of asymmetric cryptography. But before we start discussing how key certification is implemented, we need to cover another fundamental issue — identity. After all, who am I — who is the person you are writing to? Are you writing to a person you’ve met? Or to a specific Gentoo developer? Author of some project? Before you can distinguish my authentic key from a forged key, you need to be able to clearly distinguish me from an impostor.

Forms of identity

Identity via e-mail address

If your primary goal is to communicate with the owner of the particular e-mail address, it seems obvious to associate the identity with the owner of the e-mail address. However, how in reality would you distinguish a ‘rightful owner’ of the e-mail address from a cracker who managed to obtain access to it, or to intercept your network communications and inject forged mails?

The truth is, the best you can certify is that the owner of a particular key is able to read and/or send mails from a particular e-mail address, at a particular point in time. Then, if you can certify the same for a long enough period of time, you may reasonably assume the address is continuously used by the same identity (which may qualify as a legitimate owner or a cracker with a lot of patience).

Of course, all this relies on your trust in mail infrastructure not being compromised.

Identity via personal data

A stronger protection against crackers may be provided by associating the identity with personal data, as confirmed by government-issued documents. In case of OpenPGP, this is just the real name; X.509 certificates also provide fields for street address, phone number, etc.

The use of real names seems to be based on two assumptions: that your real name is reasonable well-known (e.g. it can be established with little risk of being replaced by a third party), and that the attacker does not wish to disclose his own name. Besides that, using real names meets with some additional criticism.

Firstly, requiring one to use his real name may be considered an invasion on privacy. Most notably, some people wish not to disclose or use their real names, and this effectively prevents them from ever being certified.

Secondly, real names are not unique. After all, the naming systems developed from the necessity of distinguishing individuals in comparatively small groups, and they simply don’t scale to the size of the Internet. Therefore, name collisions are entirely possible and we are relying on sheer luck that the attacker wouldn’t happen to have the same name as you do.

Thirdly and most importantly, verifying identity documents is non-trivial and untrained individuals are likely to fall victim of mediocre quality fakes. After all, we’re talking about people who hopefully read some article on verifying a particular kind of document but have no experience recognizing forgery, no specialized hardware (I suppose most of you don’t carry a magnifying glass and a UV light on yourself) and who may lack skills in comparing signatures or photographs (not to mention some people have really old photographs in documents). Some countries don’t even issue any official documentation for document verification in English!

Finally, even besides the point of forged documents, this relies on trust in administration.

Identity via photographs

This one I’m mentioning merely for completeness. OpenPGP keys allow adding a photo as one of your UIDs. However, this is rather rarely used (out of the keys my GnuPG fetched so far, less than 10% have photographs). The concerns are similar as for personal data: it assumes that others are reliably able to know how you look like, and that they are capable of reliably comparing faces.

Online identity

An interesting concept is to use your public online activity to prove your identity — such as websites or social media. This is generally based on cross-referencing multiple resources with cryptographically proven publishing access, and assuming that an attacker would not be able to compromise all of them simultaneously.

A form of this concept is utilized by keybase.io. This service builds trust in user profiles via cryptographically cross-linking your profiles on some external sites and/or your websites. Furthermore, it actively encourages other users to verify those external proofs as well.

This identity model entirely relies on trust in network infrastructure and external sites. The likeliness of it being compromised is reduced by (potentially) relying on multiple independent sites.

Web of Trust model

Most of time, you won’t be able to directly verify the identity of everyone you’d like to communicate with. This creates a necessity of obtaining indirect proof of authenticity, and the model normally used for that purpose in OpenPGP is the Web of Trust. I won’t be getting into the fine details — you can find them e.g. in the GNU Privacy Handbook. For our purposes, it suffices to say that in WoT the authenticity of keys you haven’t verified may be assessed by people whose keys you trust already, or people they know, with a limited level of recursion.

The more key holders you can trust, the more keys you can have verified indirectly and the more likely it is that your future recipient will be in that group. Or that you will be able to get someone from across the world into your WoT by meeting someone residing much closer to yourself. Therefore, you’d naturally want the WoT to grow fast and include more individuals. You’d want to preach OpenPGP onto non-crypto-aware people. However, this comes with inherent danger: can you really trust that they will properly verify the identity of the keys they sign?

I believe this is the most fundamental issue with WoT model: for it to work outside of small specialized circles, it has to include more and more individuals across the world. But this growth inevitable makes it easier for a malicious third party to find people that can be tricked into certifying keys with forged identities.

Conclusion

The fundamental problem in OpenPGP usage is finding the correct key and verifying its authenticity. This becomes especially complex given there is no single clear way of determining one’s identity in the Internet. Normally, OpenPGP uses a combination of real name and e-mail address, optionally combined with a photograph. However, all of them have their weaknesses.

Direct identity verification for all recipients is non-practical, and therefore requires indirect certification solutions. While the WoT model used by OpenPGP attempts to avoid centralized trust specific to PKI, it is not clear whether it’s practically manageable. On one hand, it requires trusting more people in order to improve coverage; on the other, it makes it more vulnerable to fraud.

Given all the above, the trust-via-online-presence concept may be of some interest. Most importantly, it establishes a closer relationship between the identity you actually need and the identity you verify — e.g. you want to mail the person being an open source developer, author of some specific projects rather than arbitrary person with a common enough name. However, this concept is not established broadly yet.

January 26, 2019
Michał Górny a.k.a. mgorny (homepage, bugs)

This article shortly explains the historical git weakness regarding handling commits with multiple OpenPGP signatures in git older than v2.20. The method of creating such commits is presented, and the results of using them are described and analyzed.

Continue reading

January 20, 2019
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
py3status v3.16 (January 20, 2019, 21:10 UTC)

Two py3status versions in less than a month? That’s the holidays effect but not only!

Our community has been busy discussing our way forward to 4.0 (see below) and organization so it was time I wrote a bit about that.

Community

A new collaborator

First of all we have the great pleasure and honor to welcome Maxim Baz @maximbaz as a new collaborator on the project!

His engagement, numerous contributions and insightful reviews to py3status has made him a well known community member, not to mention his IRC support 🙂

Once again, thank you for being there Maxim!

Zen of py3status

As a result of an interesting discussion, we worked on defining better how to contribute to py3status as well as a set of guidelines we agree on to get the project moving on smoothly.

Here is born the zen of py3status which extends the philosophy from the user point of view to the contributor point of view!

This allowed us to handle the numerous open pull requests and get their number down to 5 at the time of writing this post!

Even our dear @lasers don’t have any open PR anymore 🙂

3.15 + 3.16 versions

Our magic @lasers has worked a lot on general modules options as well as adding support for i3-gaps added features such as border coloring and fine tuning.

Also interesting is the work of Thiago Kenji Okada @m45t3r around NixOS packaging of py3status. Thanks a lot for this work and for sharing Thiago!

I also liked the question of Andreas Lundblad @aioobe asking if we could have a feature allowing to display a custom graphical output, such as a small PNG or anything upon clicking on the i3bar, you might be interested in following up the i3 issue he opened.

Make sure to read the amazing changelog for details, a lot of modules have been enhanced!

Highlights

  • You can now set a background, border colors and their urgent counterparts on a global scale or per module
  • CI now checks for black format on modules, so now all the code base obey the black format style!
  • All HTTP requests based modules now have a standard way to define HTTP timeout as well as the same 10 seconds default timeout
  • py3-cmd now allows sending click events with modifiers
  • The py3status -n / –interval command line argument has been removed as it was obsolete. We will ignore it if you have set it up, but better remove it to be clean
  • You can specify your own i3status binary path using the new -u, –i3status command line argument thanks to @Dettorer and @lasers
  • Since Yahoo! decided to retire its public & free weather API, the weather_yahoo module has been removed

New modules

  • new conky module: display conky system monitoring (#1664), by lasers
  • new module emerge_status: display information about running gentoo emerge (#1275), by AnwariasEu
  • new module hueshift: change your screen color temperature (#1142), by lasers
  • new module mega_sync: to check for MEGA service synchronization (#1458), by Maxim Baz
  • new module speedtest: to check your internet bandwidth (#1435), by cyrinux
  • new module usbguard: control usbguard from your bar (#1376), by cyrinux
  • new module velib_metropole: display velib metropole stations and (e)bikes (#1515), by cyrinux

A word on 4.0

Do you wonder what’s gonna be in the 4.0 release?
Do you have ideas that you’d like to share?
Do you have dreams that you’d love to become true?

Then make sure to read and participate in the open RFC on 4.0 version!

Development has not started yet; we really want to hear from you.

Thank you contributors!

There would be no py3status release without our amazing contributors, so thank you guys!

  • AnwariasEu
  • cyrinux
  • Dettorer
  • ecks
  • flyingapfopenguin
  • girst
  • Jack Doan
  • justin j lin
  • Keith Hughitt
  • L0ric0
  • lasers
  • Maxim Baz
  • oceyral
  • Simon Legner
  • sridhars
  • Thiago Kenji Okada
  • Thomas F. Duellmann
  • Till Backhaus

January 09, 2019
FOSDEM 2019 (January 09, 2019, 00:00 UTC)

FOSDEM logo

It’s FOSDEM time again! Join us at Université libre de Bruxelles, Campus du Solbosch, in Brussels, Belgium. This year’s FOSDEM 2019 will be held on February 2nd and 3rd.

Our developers will be happy to greet all open source enthusiasts at our Gentoo stand in building K. Visit this year’s wiki page to see who’s coming. So far eight developers have specified their attendance, with most likely many more on the way!

December 06, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
Scylla Summit 2018 write-up (December 06, 2018, 22:53 UTC)

It’s been almost one month since I had the chance to attend and speak at Scylla Summit 2018 so I’m relieved to finally publish a short write-up on the key things I wanted to share about this wonderful event!

Make Scylla boring

This statement of Glauber Costa sums up what looked to me to be the main driver of the engineering efforts put into Scylla lately: making it work so consistently well on any kind of workload that it’s boring to operate 🙂

I will follow up on this statement to highlight the things I heard and (hopefully) understood during the summit. I hope you’ll find it insightful.

Reduced operational efforts

The thread-per-core and queues design still has a lot of possibilities to be leveraged.

The recent addition of RPC streaming capabilities to seastar allows a drastic reduction in the time it takes the cluster to grow or shrink (data rebalancing / resynchronization).

Incremental compaction is also very promising as this background process is one of the most expensive there is in the database’s design.

I was happy to hear that scylla-manager will soon be made available and free to use with basic features while retaining more advanced ones for enterprise version (like backup/restore).
I also noticed that the current version was not supporting SSL enabled clusters to store its configuration. So I directly asked Michał for it and I’m glad that it will be released on version 1.3.1.

Performant multi-tenancy

Why choose between real-time OLTP & analytics OLAP workloads?

The goal here is to be able to run both on the same cluster by giving users the ability to assign “SLA” shares to ROLES. That’s basically like pools on Hadoop at a much finer grain since it will create dedicated queues that will be weighted by their share.

Having one queue per usage and full accounting will allow to limit resources efficiently and users to have their say on their latency SLAs.

But Scylla also has a lot to do in the background to run smoothly. So while this design pattern was already applied to tamper compactions, a lot of work has also been done on automatic flow control and back pressure.

For instance, Materialized Views are updated asynchronously which means that while we can interact and put a lot of pressure on the table its based on (called the Main Table), we could overwhelm the background work that’s needed to keep MVs View Tables in sync. To mitigate this, a smart back pressure approach was developed and will throttle the clients to make sure that Scylla can manage to do everything at the best performance the hardware allows!

I was happy to hear that work on tiered storage is also planned to better optimize disk space costs for certain workloads.

Last but not least, columnar storage optimized for time series and analytics workloads are also something the developers are looking at.

Latency is expensive

If you care for latency, you might be happy to hear that a new polling API (named IOCB_CMD_POLL) has been contributed by Christoph Hellwig and Avi Kivity to the 4.19 Linux kernel which avoids context switching I/O by using a shared ring between kernel and userspace. Scylla will be using it by default if the kernel supports it.

The iotune utility has been upgraded since 2.3 to generate an enhanced I/O configuration.

Also, persistent (disk backed) in-memory tables are getting ready and are very promising for latency sensitive workloads!

A word on drivers

ScyllaDB has been relying on the Datastax drivers since the start. While it’s a good thing for the whole community, it’s important to note that the shard-per-CPU approach on data that Scylla is using is not known and leveraged by the current drivers.

Discussions took place and it seems that Datastax will not allow the protocol to evolve so that drivers could discover if the connected cluster is shard aware or not and then use this information to be more clever in which write/read path to use.

So for now ScyllaDB has been forking and developing their shard aware drivers for Java and Go (no Python yet… I was disappointed).

Kubernetes & containers

The ScyllaDB guys of course couldn’t avoid the Kubernetes frenzy so Moreno Garcia gave a lot of feedback and tips on how to operate Scylla on docker with minimal performance degradation.

Kubernetes has been designed for stateless applications, not stateful ones and Docker does some automatic magic that have rather big performance hits on Scylla. You will basically have to play with affinities to dedicate one Scylla instance to run on one server with a “retain” reclaim policy.

Remember that the official Scylla docker image runs with dev-mode enabled by default which turns off all performance checks on start. So start by disabling that and look at all the tips and literature that Moreno has put online!

Scylla 3.0

A lot has been written on it already so I will just be short on things that important to understand in my point of view.

  • Materialized Views do back fill the whole data set
    • this job is done by the view building process
    • you can watch its progress in the system_distributed.view_build_status table
  • Secondary Indexes are Materialized Views under the hood
    • it’s like a reverse pointer to the primary key of the Main Table
    • so if you read the whole row by selecting on the indexed column, two reads will be issued under the hood: one on the indexed MV view table to get the primary key and one on the main table to get the rest of the columns
    • so if your workload is mostly interested by the whole row, you’re better off creating a complete MV to read from than using a SI
    • this is even more true if you plan to do range scans as this double query could lead you to read from multiple nodes instead of one
  • Range scan is way more performant
    • ALLOW FILTERING finally allows a great flexibility by providing server-side filtering!

Random notes

Support for LWT (lightweight transactions) will be relying on a future implementation of the Raft consensus algorithm inside Scylla. This work will also benefits Materialized Views consistency. Duarte Nunes will be the one working on this and I envy him very much!

Support for search workloads is high in the ScyllaDB devs priorities so we should definitely hear about it in the coming months.

Support for “mc” sstables (new generation format) is done and will reduce storage requirements thanks to metadata / data compression. Migration will be transparent because Scylla can read previous formats as well so it will upgrade your sstables as it compacts them.

ScyllaDB developers have not settled on how to best implement CDC. I hope they do rather soon because it is crucial in their ability to integrate well with Kafka!

Materialized Views, Secondary Indexes and filtering will benefit from the work on partition key and indexes intersections to avoid server side filtering on the coordinator. That’s an important optimization to come!

Last but not least, I’ve had the pleasure to discuss with Takuya Asada who is the packager of Scylla for RedHat/CentOS & Debian/Ubuntu. We discussed Gentoo Linux packaging requirements as well as the recent and promising work on a relocatable package. We will collaborate more closely in the future!

November 25, 2018
Michał Górny a.k.a. mgorny (homepage, bugs)
Portability of tar features (November 25, 2018, 14:26 UTC)

The tar format is one of the oldest archive formats in use. It comes as no surprise that it is ugly — built as layers of hacks on the older format versions to overcome their limitations. However, given the POSIX standarization in late 80s and the popularity of GNU tar, you would expect the interoperability problems to be mostly resolved nowadays.

This article is directly inspired by my proof-of-concept work on new binary package format for Gentoo. My original proposal used volume label to provide user- and file(1)-friendly way of distinguish our binary packages. While it is a GNU tar extension, it falls within POSIX ustar implementation-defined file format and you would expect that non-compliant implementations would extract it as regular files. What I did not anticipate is that some implementation reject the whole archive instead.

This naturally raised more questions on how portable various tar formats actually are. To verify that, I have decided to analyze the standards for possible incompatibility dangers and build a suite of test inputs that could be used to check how various implementations cope with that. This article describes those points and provides test results for a number of implementations.

Please note that this article is focused merely on read-wise format compatibility. In other words, it establishes how tar files should be written in order to achieve best probability that it will be read correctly afterwards. It does not investigate what formats the listed tools can write and whether they can correctly create archives using specific features.

Continue reading

November 10, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
py3status v3.14 (November 10, 2018, 21:08 UTC)

I’m happy to announce this release as it contains some very interesting developments in the project. This release was focused on core changes.

IMPORTANT notice

There are now two optional dependencies to py3status:

  • gevent
    • will monkey patch the code to make it concurrent
    • the main benefit is to use an asynchronous loop instead of threads
  • pyudev
    • will enable a udev monitor if a module asks for it (only xrandr so far)
    • the benefit is described below

To install them all using pip, simply do:

pip install py3status[all]

Modules can now react/refresh on udev events

When pyudev is available, py3status will allow modules to subscribe and react to udev events!

The xrandr module uses this feature by default which allows the module to instantly refresh when you plug in or off a secondary monitor. This also allows to stop running the xrandr command in the background and saves a lot of CPU!

Highlights

  • py3status core uses black formatter
  • fix default i3status.conf detection
    • add ~/.config/i3 as a default config directory, closes #1548
    • add .config/i3/py3status in default user modules include directories
  • add markup (pango) support for modules (#1408), by @MikaYuoadas
  • py3: notify_user module name in the title (#1556), by @lasers
  • print module information to sdtout instead of stderr (#1565), by @robertnf
  • battery_level module: default to using sys instead of acpi (#1562), by @eddie-dunn
  • imap module: fix output formatting issue (#1559), by @girst

Thank you contributors!

  • eddie-dunn
  • girst
  • MikaYuoadas
  • robertnf
  • lasers
  • maximbaz
  • tobes

October 04, 2018

CLIP OS logo ANSSI, the National Cybersecurity Agency of France, has released the sources of CLIP OS, that aims to build a hardened, multi-level operating system, based on the Linux kernel and a lot of free and open source software. We are happy to hear that it is based on Gentoo Hardened!

September 28, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
py3status v3.13 (September 28, 2018, 11:56 UTC)

I am once again lagging behind the release blog posts but this one is an important one.

I’m proud to announce that our long time contributor @lasers has become an official collaborator of the py3status project!

Dear @lasers, your amazing energy and overwhelming ideas have served our little community for a while. I’m sure we’ll have a great way forward as we learn to work together with @tobes 🙂 Thank you again very much for everything you do!

This release is as much dedicated to you as it is yours 🙂

IMPORTANT notice

After this release, py3status coding style CI will enforce the ‘black‘ formatter style.

Highlights

Needless to say that the changelog is huge, as usual, here is a very condensed view:

  • documentation updates, especially on the formatter (thanks @L0ric0)
  • py3 storage: use $XDG_CACHE_HOME or ~/.cache
  • formatter: multiple variable and feature fixes and enhancements
  • better config parser
  • new modules: lm_sensors, loadavg, mail, nvidia_smi, sql, timewarrior, wanda_the_fish

Thank you contributors!

  • lasers
  • tobes
  • maximbaz
  • cyrinux
  • Lorenz Steinert @L0ric0
  • wojtex
  • horgix
  • su8
  • Maikel Punie

September 27, 2018
Michał Górny a.k.a. mgorny (homepage, bugs)
New copyright policy explained (September 27, 2018, 06:47 UTC)

On 2018-09-15 meeting, the Trustees have given the final stamp of approval to the new Gentoo copyright policy outlined in GLEP 76. This policy is the result of work that has been slowly progressing since 2005, and that has taken considerable speed by the end of 2017. It is a major step forward from the status quo that has been used since the forming of Gentoo Foundation, and that mostly has been inherited from earlier Gentoo Technologies.

The policy aims to cover all copyright-related aspects, bringing Gentoo in line with the practices used in many other large open source projects. Most notably, it introduces a concept of Gentoo Certificate of Origin that requires all contributors to confirm that they are entitled to submit their contributions to Gentoo, and corrects the copyright attribution policy to be viable under more jurisdictions.

This article aims to shortly reiterate over the most important points in the new copyright policy, and provide a detailed guide on following it in Q&A form.

Continue reading

September 15, 2018
Michał Górny a.k.a. mgorny (homepage, bugs)

With Qt5 gaining support for high-DPI displays, and applications starting to exercise that support, it’s easy for applications to suddenly become unusable with some screens. For example, my old Samsung TV reported itself as 7″ screen. While this used not to really matter with websites forcing you to force the resolution of 96 DPI, the high-DPI applications started scaling themselves to occupy most of my screen, with elements becoming really huge (and ugly, apparently due to some poor scaling).

It turns out that it is really hard to find a solution for this. Most of the guides and tips are focused either on proprietary drivers or on getting custom resolutions. The DisplaySize specification in xorg.conf apparently did not change anything either. Finally, I was able to resolve the issue by overriding the EDID data for my screen. This guide explains how I did it.

Step 1: dump EDID data

Firstly, you need to get the EDID data from your monitor. Supposedly read-edid tool could be used for this purpose but it did not work for me. With only a little bit more effort, you can get it e.g. from xrandr:

$ xrandr --verbose
[...]
HDMI-0 connected primary 1920x1080+0+0 (0x57) normal (normal left inverted right x axis y axis) 708mm x 398mm
[...]
  EDID:
    00ffffffffffff004c2dfb0400000000
    2f120103804728780aee91a3544c9926
    0f5054bdef80714f8100814081809500
    950fb300a940023a801871382d40582c
    4500c48e2100001e662150b051001b30
    40703600c48e2100001e000000fd0018
    4b1a5117000a2020202020200000000a
    0053414d53554e470a20202020200143
    020323f14b901f041305140312202122
    2309070783010000e2000f67030c0010
    00b82d011d007251d01e206e285500c4
    8e2100001e011d00bc52d01e20b82855
    40c48e2100001e011d8018711c162058
    2c2500c48e2100009e011d80d0721c16
    20102c2580c48e2100009e0000000000
    00000000000000000000000000000029
[...]

If you have multiple displays connected, make sure to use the EDID for the one you’re overriding. Copy the hexdump and convert it to a binary blob. You can do this by passing it through xxd -p -r (installed by vim).

Step 2: fix screen dimensions

Once you have the EDID blob ready, you need to update the screen dimensions inside it. Initially, I did it using hex editor which involved finding all the occurrences, updating them (and manually encoding into the weird split-integers) and correcting the checksums. Then, I’ve written edid-fixdim so you wouldn’t have to repeat that experience.

First, use --get option to verify that your EDID is supported correctly:

$ edid-fixdim -g edid.bin
EDID structure: 71 cm x 40 cm
Detailed timing desc: 708 mm x 398 mm
Detailed timing desc: 708 mm x 398 mm
CEA EDID found
Detailed timing desc: 708 mm x 398 mm
Detailed timing desc: 708 mm x 398 mm
Detailed timing desc: 708 mm x 398 mm
Detailed timing desc: 708 mm x 398 mm

So your EDID consists of basic EDID structure, followed by one extension block. The screen dimensions are stored in 7 different blocks you’d have to update, and referenced in two checksums. The tool will take care of updating it all for you, so just pass the correct dimensions to --set:

$ edid-fixdim -s 1600x900 edid.bin
EDID structure updated to 160 cm x 90 cm
Detailed timing desc updated to 1600 mm x 900 mm
Detailed timing desc updated to 1600 mm x 900 mm
CEA EDID found
Detailed timing desc updated to 1600 mm x 900 mm
Detailed timing desc updated to 1600 mm x 900 mm
Detailed timing desc updated to 1600 mm x 900 mm
Detailed timing desc updated to 1600 mm x 900 mm

Afterwards, you can use --get again to verify that the changes were made correctly.

Step 3: overriding EDID data

Now it’s just the matter of putting the override in motion. First, make sure to enable CONFIG_DRM_LOAD_EDID_FIRMWARE in your kernel:

Device Drivers  --->
  Graphics support  --->
    Direct Rendering Manager (XFree86 4.1.0 and higher DRI support)  --->
      [*] Allow to specify an EDID data set instead of probing for it

Then, determine the correct connector name. You can find it in dmesg output:

$ dmesg | grep -C 1 Connector
[   15.192088] [drm] ib test on ring 5 succeeded
[   15.193461] [drm] Radeon Display Connectors
[   15.193524] [drm] Connector 0:
[   15.193580] [drm]   HDMI-A-1
--
[   15.193800] [drm]     DFP1: INTERNAL_UNIPHY1
[   15.193857] [drm] Connector 1:
[   15.193911] [drm]   DVI-I-1
--
[   15.194210] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[   15.194267] [drm] Connector 2:
[   15.194322] [drm]   VGA-1

Copy the new EDID blob into location of your choice inside /lib/firmware:

$ mkdir /lib/firmware/edid
$ cp edid.bin /lib/firmware/edid/samsung.bin

Finally, add the override to your kernel command-line:

drm.edid_firmware=HDMI-A-1:edid/samsung.bin

If everything went fine, xrandr should report correct screen dimensions after next reboot, and dmesg should report that EDID override has been loaded:

$ dmesg | grep EDID
[   15.549063] [drm] Got external EDID base block and 1 extension from "edid/samsung.bin" for connector "HDMI-A-1"

If it didn't, check dmesg for error messages.

September 07, 2018
Gentoo congratulates our GSoC participants (September 07, 2018, 00:00 UTC)

GSOC logo Gentoo would like to congratulate Gibix and JSteward for finishing and passing Google’s Summer of Code for the 2018 calendar year. Gibix contributed by enhancing Rust (programming language) support within Gentoo. JSteward contributed by making a full Gentoo GNU/Linux distribution, managed by Portage, run on devices which use the original Android-customized kernel.

The final reports of their projects can be reviewed on their personal blogs:

August 24, 2018
Michał Górny a.k.a. mgorny (homepage, bugs)

I have recently worked on enabling 2-step authentication via SSH on the Gentoo developer machine. I have selected google-authenticator-libpam amongst different available implementations as it seemed the best maintained and having all the necessary features, including a friendly tool for users to configure it. However, its design has a weakness: it stores the secret unprotected in user’s home directory.

This means that if an attacker manages to gain at least temporary access to the filesystem with user’s privileges — through a malicious process, vulnerability or simply because someone left the computer unattended for a minute — he can trivially read the secret and therefore clone the token source without leaving a trace. It would completely defeat the purpose of the second step, and the user may not even notice until the attacker makes real use of the stolen secret.

In order to protect against this, I’ve created google-authenticator-wrappers (as upstream decided to ignore the problem). This package provides a rather trivial setuid wrapper that manages a write-only, authentication-protected secret store for the PAM module. Additionally, it comes with a test program (so you can test the OTP setup without jumping through the hoops or risking losing access) and friendly wrappers for the default setup, as used on Gentoo Infra.

The recommended setup (as utilized by sys-auth/google-authenticator-wrappers package) is to use a dedicated user for the password store. In this scenario, the users are unable to read their secrets, and all secret operations (including authentication via the PAM module) are done using an unprivileged user. Furthermore, any operation regarding the configuration (either updating it or removing the second step) require regular PAM authentication (e.g. typing your own password).

This is consistent with e.g. how shadow operates (users can’t read their passwords, nor update them without authenticating first), how most sites using 2-factor authentication operate (again, users can’t read their secrets) and follows the RFC 6238 recommendation (that keys […] SHOULD be protected against unauthorized access and usage). It solves the aforementioned issue by preventing user-privileged processes from reading the secrets and recovery codes. Furthermore, it prevents the attacker with this particular level of access from disabling 2-step authentication, changing the secret or even weakening the configuration.

August 17, 2018
Luca Barbato a.k.a. lu_zero (homepage, bugs)
Gentoo on Integricloud (August 17, 2018, 22:44 UTC)

Integricloud gave me access to their infrastructure to track some issues on ppc64 and ppc64le.

Since some of the issues are related to the compilers, I obviously installed Gentoo on it and in the process I started to fix some issues with catalyst to get a working install media, but that’s for another blogpost.

Today I’m just giving a walk-through on how to get a ppc64le (and ppc64 soon) VM up and running.

Preparation

Read this and get your install media available to your instance.

Install Media

I’m using the Gentoo installcd I’m currently refining.

Booting

You have to append console=hvc0 to your boot command, the boot process might figure it out for you on newer install medias (I still have to send patches to update livecd-tools)

Network configuration

You have to manually setup the network.
You can use ifconfig and route or ip as you like, refer to your instance setup for the parameters.

ifconfig enp0s0 ${ip}/16
route add -net default gw ${gw}
echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip a add ${ip}/16 dev enp0s0
ip l set enp0s0 up
ip r add default via ${gw}
echo "nameserver 8.8.8.8" > /etc/resolv.conf

Disk Setup

OpenFirmware seems to like gpt much better:

parted /dev/sda mklabel gpt

You may use fdisk to create:
– a PowerPC PrEP boot partition of 8M
– root partition with the remaining space

Device     Start      End  Sectors Size Type
/dev/sda1   2048    18431    16384   8M PowerPC PReP boot
/dev/sda2  18432 33554654 33536223  16G Linux filesystem

I’m using btrfs and zstd-compress /usr/portage and /usr/src/.

mkfs.btrfs /dev/sda2

Initial setup

It is pretty much the usual.

mount /dev/sda2 /mnt/gentoo
cd /mnt/gentoo
wget https://dev.gentoo.org/~mattst88/ppc-stages/stage3-ppc64le-20180810.tar.xz
tar -xpf stage3-ppc64le-20180810.tar.xz
mount -o bind /dev dev
mount -t devpts devpts dev/pts
mount -t proc proc proc
mount -t sysfs sys sys
cp /etc/resolv.conf etc
chroot .

You just have to emerge grub and gentoo-sources, I diverge from the defconfig by making btrfs builtin.

My /etc/portage/make.conf:

CFLAGS="-O3 -mcpu=power9 -pipe"
# WARNING: Changing your CHOST is not something that should be done lightly.
# Please consult https://wiki.gentoo.org/wiki/Changing_the_CHOST_variable beforee
 changing.
CHOST="powerpc64le-unknown-linux-gnu"

# NOTE: This stage was built with the bindist Use flag enabled
PORTDIR="/usr/portage"
DISTDIR="/usr/portage/distfiles"
PKGDIR="/usr/portage/packages"

USE="ibm altivec vsx"

# This sets the language of build output to English.
# Please keep this setting intact when reporting bugs.
LC_MESSAGES=C
ACCEPT_KEYWORDS=~ppc64

MAKEOPTS="-j4 -l6"
EMERGE_DEFAULT_OPTS="--jobs 10 --load-average 6 "

My minimal set of packages I need before booting:

emerge grub gentoo-sources vim btrfs-progs openssh

NOTE: You want to emerge again openssh and make sure bindist is not in your USE.

Kernel & Bootloader

cd /usr/src/linux
make defconfig
make menuconfig # I want btrfs builtin so I can avoid a initrd
make -j 10 all && make install && make modules_install
grub-install /dev/sda1
grub-mkconfig -o /boot/grub/grub.cfg

NOTE: make sure you pass /dev/sda1 otherwise grub will happily assume OpenFirmware knows about btrfs and just point it to your directory.
That’s not the case unfortunately.

Networking

I’m using netifrc and I’m using the eth0-naming-convention.

touch /etc/udev/rules.d/80-net-name-slot.rules
ln -sf /etc/init.d/net.{lo,eth0}
echo -e "config_eth0=\"${ip}/16\"\nroutes_eth0="default via ${gw}\"\ndns_servers_eth0=\"8.8.8.8\"" > /etc/conf.d/net

Password and SSH

Even if the mticlient is quite nice, you would rather use ssh as much as you could.

passwd 
rc-update add sshd default

Finishing touches

Right now sysvinit does not add the hvc0 console as it should due to a profile quirk, for now check /etc/inittab and in case add:

echo 'hvc0:2345:respawn:/sbin/agetty -L 9600 hvc0' >> /etc/inittab

Add your user and add your ssh key and you are ready to use your new system!

August 15, 2018
Michał Górny a.k.a. mgorny (homepage, bugs)
new* helpers can read from stdin (August 15, 2018, 09:21 UTC)

Did you know that new* helpers can read from stdin? Well, now you know! So instead of writing to a temporary file you can install your inline text straight to the destination:

src_install() {
  # old code
  cat <<-EOF >"${T}"/mywrapper || die
    #!/bin/sh
    exec do-something --with-some-argument
  EOF
  dobin "${T}"/mywrapper

  # replacement
  newbin - mywrapper <<-EOF
    #!/bin/sh
    exec do-something --with-some-argument
  EOF
}

August 13, 2018
Michał Górny a.k.a. mgorny (homepage, bugs)

The recent efforts on improving the security of different areas of Gentoo have brought some arguments. Some time ago one of the developers has considered whether he would withstand physical violence if an attacker would use it in order to compromise Gentoo. A few days later another developer has suggested that an attacker could pay Gentoo developers to compromise the distribution. Is this a real threat to Gentoo? Are we all doomed?

Before I answer this question, let me make an important presumption. Gentoo is a community-driven open source project. As such, it has certain inherent weaknesses and there is no way around them without changing what Gentoo fundamentally is. Those weaknesses are common to all projects of the same nature.

Gentoo could indeed be compromised if developers are subject to the threat of violence to themselves or their families. As for money, I don’t want to insult anyone and I don’t think it really matters. The fact is, Gentoo is vulnerable to any adversary resourceful enough, and there are certainly both easier and cheaper ways than the two mentioned. For example, the adversary could get a new developer recruited, or simply trick one of the existing developers into compromising the distribution. It just takes one developer out of ~150.

As I said, there is no way around that without making major changes to the organizational structure of Gentoo. Those changes would probably do more harm to Gentoo than good. We can just admit that we can’t fully protect Gentoo from focused attack of a resourceful adversary, and all we can do is to limit the potential damage, detect it quickly and counteract the best we can. However, in reality random probes and script kiddie attacks that focus on trivial technical vulnerabilities are more likely, and that’s what the security efforts end up focusing on.

August 12, 2018

Pwnies logo

Congratulations to security researcher and Gentoo developer Hanno Böck and his co-authors Juraj Somorovsky and Craig Young for winning one of this year’s coveted Pwnie awards!

The award is for their work on the Return Of Bleichenbacher’s Oracle Threat or ROBOT vulnerability, which at the time of discovery affected such illustrious sites as Facebook and Paypal. Technical details can be found in the full paper published at the Cryptology ePrint Archive.

FroSCon logo

As last year, there will be a Gentoo booth again at the upcoming FrOSCon “Free and Open Source Conference” in St. Augustin near Bonn! Visitors can meet Gentoo developers to ask any question, get Gentoo swag, and prepare, configure, and compile their own Gentoo buttons.

The conference is 25th and 26th of August 2018, and there is no entry fee. See you there!

July 19, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)

This quick article is a wrap up for reference on how to connect to ScyllaDB using Spark 2 when authentication and SSL are enforced for the clients on the Scylla cluster.

We encountered multiple problems, even more since we distribute our workload using a YARN cluster so that our worker nodes should have everything they need to connect properly to Scylla.

We found very little help online so I hope it will serve anyone facing similar issues (that’s also why I copy/pasted them here).

The authentication part is easy going by itself and was not the source of our problems, SSL on the client side was.

Environment

  • (py)spark: 2.1.0.cloudera2
  • spark-cassandra-connector: datastax:spark-cassandra-connector: 2.0.1-s_2.11
  • python: 3.5.5
  • java: 1.8.0_144
  • scylladb: 2.1.5

SSL cipher setup

The Datastax spark cassandra driver uses default the TLS_RSA_WITH_AES_256_CBC_SHA cipher that the JVM does not support by default. This raises the following error when connecting to Scylla:

18/07/18 13:13:41 WARN channel.ChannelInitializer: Failed to initialize a channel. Closing: [id: 0x8d6f78a7]
java.lang.IllegalArgumentException: Cannot support TLS_RSA_WITH_AES_256_CBC_SHA with currently installed providers

According to the ssl documentation we have two ciphers available:

  1. TLS_RSA_WITH_AES_256_CBC_SHA
  2. TLS_RSA_WITH_AES_128_CBC_SHA

We can get get rid of the error by lowering the cipher to TLS_RSA_WITH_AES_128_CBC_SHA using the following configuration:

.config("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA")\

However, this is not really a good solution and instead we’d be inclined to use the TLS_RSA_WITH_AES_256_CBC_SHA version. For this we need to follow this Datastax’s procedure.

Then we need to deploy the JCE security jars on our all client nodes, if using YARN like us this means that you have to deploy these jars to all your NodeManager nodes.

For example by hand:

# unzip jce_policy-8.zip
# cp UnlimitedJCEPolicyJDK8/*.jar /opt/oracle-jdk-bin-1.8.0.144/jre/lib/security/

Java trust store

When connecting, the clients need to be able to validate the Scylla cluster’s self-signed CA. This is done by setting up a trustStore JKS file and providing it to the spark connector configuration (note that you protect this file with a password).

keyStore vs trustStore

In SSL handshake purpose of trustStore is to verify credentials and purpose of keyStore is to provide credentials. keyStore in Java stores private key and certificates corresponding to the public keys and is required if you are a SSL Server or SSL requires client authentication. TrustStore stores certificates from third parties or your own self-signed certificates, your application identify and validates them using this trustStore.

The spark-cassandra-connector documentation has two options to handle keyStore and trustStore.

When we did not use the trustStore option, we would get some obscure error when connecting to Scylla:

com.datastax.driver.core.exceptions.TransportException: [node/1.1.1.1:9042] Channel has been closed

When enabling DEBUG logging, we get a clearer error which indicated a failure in validating the SSL certificate provided by the Scylla server node:

Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

setting up the trustStore JKS

You need to have the self-signed CA public certificate file, then issue the following command:

# keytool -importcert -file /usr/local/share/ca-certificates/MY_SELF_SIGNED_CA.crt -keystore COMPANY_TRUSTSTORE.jks -noprompt
Enter keystore password:  
Re-enter new password: 
Certificate was added to keystore

using the trustStore

Now you need to configure spark to use the trustStore like this:

.config("spark.cassandra.connection.ssl.trustStore.password", "PASSWORD")\
.config("spark.cassandra.connection.ssl.trustStore.path", "COMPANY_TRUSTSTORE.jks")\

Spark SSL configuration example

This wraps up the SSL connection configuration used for spark.

This example uses pyspark2 and reads a table in Scylla from a YARN cluster:

$ pyspark2 --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 --files COMPANY_TRUSTSTORE.jks

>>> spark = SparkSession.builder.appName("scylla_app")\
.config("spark.cassandra.auth.password", "test")\
.config("spark.cassandra.auth.username", "test")\
.config("spark.cassandra.connection.host", "node1,node2,node3")\
.config("spark.cassandra.connection.ssl.clientAuth.enabled", True)\
.config("spark.cassandra.connection.ssl.enabled", True)\
.config("spark.cassandra.connection.ssl.trustStore.password", "PASSWORD")\
.config("spark.cassandra.connection.ssl.trustStore.path", "COMPANY_TRUSTSTORE.jks")\
.config("spark.cassandra.input.split.size_in_mb", 1)\
.config("spark.yarn.queue", "scylla_queue").getOrCreate()

>>> df = spark.read.format("org.apache.spark.sql.cassandra").options(table="my_table", keyspace="test").load()
>>> df.show()

July 06, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
A botspot story (July 06, 2018, 14:50 UTC)

I felt like sharing a recent story that allowed us identify a bot in a haystack thanks to Scylla.

 

The scenario

While working on loading up 2B+ of rows into Scylla from Hive (using Spark), we noticed a strange behaviour in the performances of one of our nodes:

 

So we started wondering why that server in blue was having those peaks of load and was clearly diverging from the two others… As we obviously expect the three nodes to behave the same, there were two options on the table:

  1. hardware problem on the node
  2. bad data distribution (bad schema design? consistent hash problem?)

We shared this with our pals from ScyllaDB and started working on finding out what was going on.

The investigation

Hardware?

Hardware problem was pretty quickly evicted, nothing showed up on the monitoring and on the kernel logs. I/O queues and throughput were good:

Data distribution?

Avi Kivity (ScyllaDB’s CTO) quickly got the feeling that something was wrong with the data distribution and that we could be facing a hotspot situation. He quickly nailed it down to shard 44 thanks to the scylla-grafana-monitoring platform.

Data is distributed between shards that are stored on nodes (consistent hash ring). This distribution is done by hashing the primary key of your data which dictates the shard it belongs to (and thus the node(s) where the shard is stored).

If one of your keys is over represented in your original data set, then the shard it belongs to can be overly populated and the related node overloaded. This is called a hotspot situation.

tracing queries

The first step was to trace queries in Scylla to try to get deeper into the hotspot analysis. So we enabled tracing using the following formula to get about 1 trace per second in the system_traces namespace.

tracing probability = 1 / expected requests per second throughput

In our case, we were doing between 90K req/s and 150K req/s so we settled for 100K req/s to be safe and enabled tracing on our nodes like this:

# nodetool settraceprobability 0.00001

Turns out tracing didn’t help very much in our case because the traces do not include the query parameters in Scylla 2.1, it is becoming available in the soon to be released 2.2 version.

NOTE: traces expire on the tables, make sure your TRUNCATE the events and sessions tables while iterating. Else you will have to wait for the next gc_grace_period (10 days by default) before they are actually removed. If you do not do that and generate millions of traces like we did, querying the mentioned tables will likely time out because of the “tombstoned” rows even if there is no trace inside any more.

looking at cfhistograms

Glauber Costa was also helping on the case and got us looking at the cfhistograms of the tables we were pushing data to. That proved to be clearly highlighting a hotspot problem:

histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                             (micros)          (micros)           (bytes)                  
50%             0,00              6,00              0,00               258                 2
75%             0,00              6,00              0,00               535                 5
95%             0,00              8,00              0,00              1916                24
98%             0,00             11,72              0,00              3311                50
99%             0,00             28,46              0,00              5722                72
Min             0,00              2,00              0,00               104                 0
Max             0,00          45359,00              0,00          14530764            182785

What this basically means is that 99% percentile of our partitions are small (5KB) while the biggest is 14MB! That’s a huge difference and clearly shows that we have a hotspot on a partition somewhere.

So now we know for sure that we have an over represented key in our data set, but what key is it and why?

The culprit

So we looked at the cardinality of our data set keys which are SHA256 hashes and found out that indeed we had one with more than 1M occurrences while the second highest one was around 100K!…

Now that we had the main culprit hash, we turned to our data streaming pipeline to figure out what kind of event was generating the data associated to the given SHA256 hash… and surprise! It was a client’s quality assurance bot that was constantly browsing their own website with legitimate behaviour and identity credentials associated to it.

So we modified our pipeline to detect this bot and discard its events so that it stops polluting our databases with fake data. Then we cleaned up the million of events worth of mess and traces we stored about the bot.

The aftermath

Finally, we cleared out the data in Scylla and tried again from scratch. Needless to say that the curves got way better and are exactly what we should expect from a well balanced cluster:

Thanks a lot to the ScyllaDB team for their thorough help and high spirited support!

I’ll quote them conclude this quick blog post:

July 03, 2018
Thomas Raschbacher a.k.a. lordvan (homepage, bugs)

Turns out I should read things like the gentoo wiki upgrade guide *first* to avoid issues ..

After installing php 7.1 to replace the (quite old) php 5.6 I did check php.ini ,.. but forgot to check for compiled modules and setting PHP_TARGETS .. then wondered why I just got Internal Server Error messages..

Thanks to the php team for writing this nice guide to remind people like me of what to do:

https://wiki.gentoo.org/wiki/PHP/Upgrading_to_PHP_7.1

As always the Gentoo Wiki is a great source of information, and I like to use it as a reminder of the things needed when installing/upgrading,.. ;)

June 29, 2018
My comments on the Gentoo Github hack (June 29, 2018, 16:00 UTC)

Several news outlets are reporting on the takeover of the Gentoo GitHub organization that was announced recently. Today 28 June at approximately 20:20 UTC unknown individuals have gained control of the Github Gentoo organization, and modified the content of repositories as well as pages there. We are still working to determine the exact extent and … Continue reading "My comments on the Gentoo Github hack"

June 28, 2018

2018-07-04 14:00 UTC

We believe this incident is now resolved. Please see the incident report for details about the incident, its impact, and resolution.

2018-06-29 15:15 UTC

The community raised questions about the provenance of Gentoo packages. Gentoo development is performed on hardware run by the Gentoo Infrastructure team (not github). The Gentoo hardware was unaffected by this incident. Users using the default Gentoo mirroring infrastructure should not be affected.

If you are still concerned about provenance or are unsure what solution you are using, please consult https://wiki.gentoo.org/wiki/Project:Portage/Repository_Verification. This will instruct you on how to verify your repository.

2018-06-29 06:45 UTC

The gentoo GitHub organization remains temporarily locked down by GitHub support, pending fixes to pull-request content.

For ongoing status, please see the Gentoo infra-status incident page.

For later followup, please see the Gentoo Wiki page for GitHub 2018-06-28. An incident post-mortem will follow on the wiki.

June 03, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)

Repology is monitoring package repositories across Linux distributions. By now, Atom feeds of per-maintainer outdated packages that I was waiting for have been implemented.

So I subscribed to my own Gentoo feed using net-mail/rss2email and now Repology notifies me via e-mail of new upstream releases that other Linux distros have packaged that I still need to bump in Gentoo. In my case, it brought an update of dev-vcs/svn2git to my attention that I would have missed (or heard about later), otherwise.

Based on this comment, Repology may soon do release detection upstream similar to what euscan does, as well.

April 21, 2018
Rafael G. Martins a.k.a. rafaelmartins (homepage, bugs)
Updates (April 21, 2018, 14:35 UTC)

Since I don't write anything here for almost 2 years, I think it is time for some short updates:

  • I left RedHat and moved to Berlin, Germany, in March 2017.
  • The series of posts about balde was stopped. The first post got a lot of Hacker News attention, and I will come back with it as soon as I can implement the required changes in the framework. Not going to happen very soon, though.
  • I've been spending most of my free time with flight simulation. You can expect a few related posts soon.
  • I left the Gentoo GSoC administration this year.
  • blogc is the only project that is currently getting some frequent attention from me, as I use it for most of my websites. Check it out! ;-)

That's all for now.

April 18, 2018
Zack Medico a.k.a. zmedico (homepage, bugs)

In portage-2.3.30, portage’s python API provides an asyncio event loop policy via a DefaultEventLoopPolicy class. For example, here’s a little program that uses portage’s DefaultEventLoopPolicy to do the same thing as emerge --regen, using an async_iter_completed function to implement the --jobs and --load-average options:

#!/usr/bin/env python

from __future__ import print_function

import argparse
import functools
import multiprocessing
import operator

import portage
from portage.util.futures.iter_completed import (
    async_iter_completed,
)
from portage.util.futures.unix_events import (
    DefaultEventLoopPolicy,
)


def handle_result(cpv, future):
    metadata = dict(zip(portage.auxdbkeys, future.result()))
    print(cpv)
    for k, v in sorted(metadata.items(),
        key=operator.itemgetter(0)):
        if v:
            print('\t{}: {}'.format(k, v))
    print()


def future_generator(repo_location, loop=None):

    portdb = portage.portdb

    for cp in portdb.cp_all(trees=[repo_location]):
        for cpv in portdb.cp_list(cp, mytree=repo_location):
            future = portdb.async_aux_get(
                cpv,
                portage.auxdbkeys,
                mytree=repo_location,
                loop=loop,
            )

            future.add_done_callback(
                functools.partial(handle_result, cpv))

            yield future


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--repo',
        action='store',
        default='gentoo',
    )
    parser.add_argument(
        '--jobs',
        action='store',
        type=int,
        default=multiprocessing.cpu_count(),
    )
    parser.add_argument(
        '--load-average',
        action='store',
        type=float,
        default=multiprocessing.cpu_count(),
    )
    args = parser.parse_args()

    try:
        repo_location = portage.settings.repositories.\
            get_location_for_name(args.repo)
    except KeyError:
        parser.error('unknown repo: {}\navailable repos: {}'.\
            format(args.repo, ' '.join(sorted(
            repo.name for repo in
            portage.settings.repositories))))

    policy = DefaultEventLoopPolicy()
    loop = policy.get_event_loop()

    try:
        for future_done_set in async_iter_completed(
            future_generator(repo_location, loop=loop),
            max_jobs=args.jobs,
            max_load=args.load_average,
            loop=loop):
            loop.run_until_complete(future_done_set)
    finally:
        loop.close()



if __name__ == '__main__':
    main()

April 03, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
py3status v3.8 (April 03, 2018, 12:06 UTC)

Another long awaited release has come true thanks to our community!

The changelog is so huge that I had to open an issue and cry for help to make it happen… thanks again @lasers for stepping up once again 🙂

Highlights

  • gevent support (-g option) to switch from threads scheduling to greenlets and reduce resources consumption
  • environment variables support in i3status.conf to remove sensible information from your config
  • modules can now leverage a persistent data store
  • hundreds of improvements for various modules
  • we now have an official debian package
  • we reached 500 stars on github #vanity

Milestone 3.9

  • try to release a version faster than every 4 months (j/k) 😉

The next release will focus on bugs and modules improvements / standardization.

Thanks contributors!

This release is their work, thanks a lot guys!

  • alex o’neill
  • anubiann00b
  • cypher1
  • daniel foerster
  • daniel schaefer
  • girst
  • igor grebenkov
  • james curtis
  • lasers
  • maxim baz
  • nollain
  • raspbeguy
  • regnat
  • robert ricci
  • sébastien delafond
  • themistokle benetatos
  • tobes
  • woland

April 02, 2018
Thomas Raschbacher a.k.a. lordvan (homepage, bugs)

Thanks to the enlightenment devs for fixing this ;) no lock screen sucks :D

https://www.enlightenment.org/news/e0.22.3_release

also it is in my Gentoo dev overlay as of now.

So a while ago I cleaned out my dev overlay and added dev-libs/efl-1.20.7 and x11-wm/enlightenment-0.22.1 (and 0.22.2)

Works for me at the moment (except the screen (un-)lock) but not sure if that has to do with my box. any testers welcome

Here's the link: https://gitweb.gentoo.org/dev/lordvan.git/

Oh and I added it to layman's repo list again, so gentoo users can easily just "layman -a lordvan" to test it.

On a side note: 0.22.1 gave me trouble with a 2nd screen plugged in, which seems fixed in 0.22.2, but that has (pam related) problems with the lock screen ..

March 17, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)
Holy cow! Larry the cow Gentoo tattoo (March 17, 2018, 14:53 UTC)

Probably not new but was new to me: Just ran into this Larry the Cow tattoo online: http://www.geekytattoos.com/larry-the-gender-challenged-cow/

March 01, 2018
Thomas Raschbacher a.k.a. lordvan (homepage, bugs)

Just a quick post about how to run UCS (Core Edition in my case) with KVM on gentoo.

First off I go with the assumption that

  • KVM is working (kernel, config,..)
  • qemu is installed (+ init scripts)
  • bridge networking is set up and working

If any of the above are not yet set up: https://wiki.gentoo.org/wiki/QEMU

First download the Virtualbox Image from https://www.univention.de/download/ .

Further for the kvm name I use ucs-dc

next we convert the image to qcow2:

qemu-img convert -f vmdk -O qcow2 UCS-DC/UCS-DC-virtualbox-disk1.vmdk  UCS-DC_disk1.qcow2

create your init script link:

cd /etc/init.d; ln -s qemu kvm.ucs-dc

Then in /etc/conf.d copy qemu.conf.example to kvm.ucs-dc

Check / change the following:

  1. change the MACADDR (it includes a command line to generate one) -- The reason this is first is, that if you forget you might spend hours - like me -  trying to find out why your network is not working ..
  2. QEMU_TYPE="x86_64"
  3. NIC_TYPE=br
  4. point DISKIMAGE=  to your qcow2 file
  5. ENABLE_KVM=1 (believe me disabling kvm is noticeable)
  6. adjust MEMORY (I set it to 2GB for the DC) and SMP (i set that to 2)
  7. FOREGROUND="vnc=:<port>" - so you can connect to your console using VNC
  8. check the other stuff if it applies to you (OTHER_ARGS is quite useful for example to also add a CD/usb emulation of a rescue disk,..

run it with

/etc/init.d/kvm.ucs-dc start

connect with your favourite VNC client and set up your UCS Server.

One thing I did on the fileserver instance (I run 3 UCS kvms at the moment - DC, Backup-DC and File Server):

I created a LVM Volume for the file share on the Host, and mapped it to the KVM - here's the config line:

OTHER_ARGS="-drive format=raw,file=/dev/mapper/<your volume devide>,if=virtio,aio=native,cache.direct=on"

works great for me, and I will also add another one for other shares later I think. but this way if i really have any VM problems my files are just on the lvm device and i can get to it easily (also there are lvm snapshots,.. that could be useful eventually)

Tryton setup & config (March 01, 2018, 08:10 UTC)

Because I keep forgetting stuff I need to do (or the order) here a very quick overview:

Install trytond, modules + deps (on gentoo add the tryton overlay and just emerge)

If you don'T use sqlite create a user (and database) for tryton.

Gentoo Init scripts use /etc/conf.d/trytond (here's mine):

# Location of the configuration file
CONFIG=/etc/tryton/trytond.conf
# Location of the logging configuration file
LOGCONF=/etc/tryton/logging.conf
# The database names to load (space separated)
DATABASES=tryton

since it took me a while to find a working logging.conf example here's my working one:

[formatters]
keys=simple

[handlers]
keys=rotate,console

[loggers]
keys=root

[formatter_simple]
format=%(asctime)s] %(levelname)s:%(name)s:%(message)s
datefmt=%a %b %d %H:%M:%S %Y

[handler_rotate]
class=handlers.TimedRotatingFileHandler
args=('/var/log/trytond/trytond.log', 'D', 1, 120)
formatter=simple

[handler_console]
class=StreamHandler
formatter=simple
args=(sys.stdout,)

[logger_root]
level=INFO
handlers=rotate,console

(Not going into details here, if you want to know more there are plenty of resources online)

As for config I went and got an example online (from open Suse) and modified it:

# /etc/tryton/trytond.conf - Configuration file for Tryton Server (trytond)
#
# This file contains the most common settings for trytond (Defaults
# are commented).
# For more information read
# /usr/share/doc/trytond-<version>/

[database]
# Database related settings

# The URI to connect to the SQL database (following RFC-3986)
# uri = database://username:password@host:port/
# (Internal default: sqlite:// (i.e. a local SQLite database))
#
# PostgreSQL via Unix domain sockets
# (e.g. PostgreSQL database running on the same machine (localhost))
#uri = postgresql://tryton:tryton@/
#
#Default setting for a local postgres database
#uri = postgresql:///

#
# PostgreSQL via TCP/IP
# (e.g. connecting to a PostgreSQL database running on a remote machine or
# by means of md5 authentication. Needs PostgreSQL to be configured to accept
# those connections (pg_hba.conf).)
#uri = postgresql://tryton:tryton@localhost:5432/
uri = postgresql://tryton:mypassword@localhost:5432/

# The path to the directory where the Tryton Server stores files.
# The server must have write permissions to this directory.
# (Internal default: /var/lib/trytond)
path = /var/lib/tryton

# Shall available databases be listed in the client?
#list = True

# The number of retries of the Tryton Server when there are errors
# in a request to the database
#retry = 5

# The primary language, that is used to store entries in translatable
# fields into the database.
#language = en_US
language = de_AT

[ssl]
# SSL settings
# Activation of SSL for all available protocols.
# Uncomment the following settings for key and certificate
# to enable SSL.

# The path to the private key
#privatekey = /etc/ssl/private/ssl-cert-snakeoil.key

# The path to the certificate
#certificate = /etc/ssl/certs/ssl-cert-snakeoil.pem

[jsonrpc]
# Settings for the JSON-RPC network interface

# The IP/host and port number of the interface
# (Internal default: localhost:8000)
#
# Listen on all interfaces (IPv4)

listen = 0.0.0.0:8000

#
# Listen on all interfaces (IPv4 and IPv6)
#listen = [::]:8000

# The hostname for this interface
#hostname =

# The root path to retrieve data for GET requests
#data = jsondata

[xmlrpc]
# Settings for the XML-RPC network interface

# The IP/host and port number of the interface
#listen = localhost:8069

[webdav]
# Settings for the WebDAV network interface

# The IP/host and port number of the interface
#listen = localhost:8080
listen = 0.0.0.0:8080

[session]
# Session settings

# The time (in seconds) until an inactive session expires
timeout = 3600

# The server administration password used by the client for
# the execution of database management tasks. It is encrypted
# using using the Unix crypt(3) routine. A password can be
# generated using the following command line (on one line):
# $ python -c 'import getpass,crypt,random,string; \
# print crypt.crypt(getpass.getpass(), \
# "".join(random.sample(string.ascii_letters + string.digits, 8)))'
# Example password with 'admin'
#super_pwd = jkUbZGvFNeugk
super_pwd = <your pwd>


[email]
# Mail settings

# The URI to connect to the SMTP server.
# Available protocols are:
# - smtp: simple SMTP
# - smtp+tls: SMTP with STARTTLS
# - smtps: SMTP with SSL
#uri = smtp://localhost:25
uri = smtp://localhost:25

# The From address used by the Tryton Server to send emails.
from = tryton@<your-domain.tld>

[report]
# Report settings

# Unoconv parameters for connection to the unoconv service.
#unoconv = pipe,name=trytond;urp;StarOffice.ComponentContext

# Module settings
#
# Some modules are reading configuration parameters from this
# configuration file. These settings only apply when those modules
# are installed.
#
#[ldap_authentication]
# The URI to connect to the LDAP server.
#uri = ldap://host:port/dn?attributes?scope?filter?extensions
# A basic default URL could look like
#uri = ldap://localhost:389/

[web]
# Path for the web-frontend
#root = /usr/lib/node-modules/tryton-sao
listen = 0.0.0.0:8000
root = /usr/share/sao

Set up the database tables, modules, superuser

trytond-admin -c /etc/tryton/trytond.conf -d tryton --all

Should you forget to set your superuser password (or need to change it later):

trytond-admin -c /etc/tryton/trytond.conf -d tryton -p

It's now time to connect a client to it and enable & configure the modules (make sure to finish the basic configuration (including accounts,..) otherwise you have to either restart, or know what exactly needs to be set up accounting wise !

  • configure user(s)
  • enable account_eu (config and setup take a while)
  • set up company
    • create a party for it
    • assign currency and timezone
  • set up chart of accounts from the template (only do this manually if you really know what you - and tryton - needs !!)
    • choose company & pick the template (e.g. "Minimaler Kontenplan" (if using german) )
      • set the defaults (only one per option usually)
  • after applying the above activate and configure whatever else you need (sale, timesheet, ...)

during this you can watch trytond.log to see what happens behind the scenes (e.g. country module takes a while,..)

How to add languages:

  • Administration -> Localization -> Languages -> add the language and set it to
    active and translatable 

If you install new modules or languages run trytond-admin ... --all again (see above) 

February 28, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
Evaluating ScyllaDB for production 2/2 (February 28, 2018, 10:32 UTC)

In my previous blog post, I shared 7 lessons on our experience in evaluating Scylla for production.

Those lessons were focused on the setup and execution of the POC and I promised a more technical blog post with technical details and lessons learned from the POC, here it is!

Before you read on, be mindful that our POC was set up to test workloads and workflows, not to benchmark technologies. So even if the Scylla figures are great, they have not been the main drivers of the actual conclusion of the POC.

Business context

As a data driven company working in the Marketing and Advertising industry, we help our clients make sense of multiple sources of data to build and improve their relationship with their customers and prospects.

Dealing with multiple sources of data is nothing new but their volume has dramatically changed during the past decade. I will spare you with the Big-Data-means-nothing term and the technical challenges that comes with it as you already heard enough of it.

Still, it is clear that our line of business is tied to our capacity at mixing and correlating a massive amount of different types of events (data sources/types) coming from various sources which all have their own identifiers (think primary keys):

  • Web navigation tracking: identifier is a cookie that’s tied to the tracking domain (we have our own)
  • CRM databases: usually the email address or an internal account ID serve as an identifier
  • Partners’ digital platform: identifier is also a cookie tied to their tracking domain

To try to make things simple, let’s take a concrete example:

You work for UNICEF and want to optimize their banner ads budget by targeting the donors of their last fundraising campaign.

  • Your reference user database is composed of the donors who registered with their email address on the last campaign: main identifier is the email address.
  • To buy web display ads, you use an Ad Exchange partner such as AppNexus or DoubleClick (Google). From their point of view, users are seen as cookie IDs which are tied to their own domain.

So you basically need to be able to translate an email address to a cookie ID for every partner you work with.

Use case: ID matching tables

We operate and maintain huge ID matching tables for every partner and a great deal of our time is spent translating those IDs from one to another. In SQL terms, we are basically doing JOINs between a dataset and those ID matching tables.

  • You select your reference population
  • You JOIN it with the corresponding ID matching table
  • You get a matched population that your partner can recognize and interact with

Those ID matching tables have a pretty high read AND write throughput because they’re updated and queried all the time.

Usual figures are JOINs between a 10+ Million dataset and 1.5+ Billion ID matching tables.

The reference query basically looks like this:

SELECT count(m.partnerid)
FROM population_10M_rows AS p JOIN partner_id_match_400M_rows AS m
ON p.id = m.id

 Current implementations

We operate a lambda architecture where we handle real time ID matching using MongoDB and batch ones using Hive (Apache Hadoop).

The first downside to note is that it requires us to maintain two copies of every ID matching table. We also couldn’t choose one over the other because neither MongoDB nor Hive can sustain both the read/write lookup/update ratio while performing within the low latencies that we need.

This is an operational burden and requires quite a bunch of engineering to ensure data consistency between different data stores.

Production hardware overview:

  • MongoDB is running on a 15 nodes (5 shards) cluster
    • 64GB RAM, 2 sockets, RAID10 SAS spinning disks, 10Gbps dual NIC
  • Hive is running on 50+ YARN NodeManager instances
    • 128GB RAM, 2 sockets, JBOD SAS spinning disks, 10Gbps dual NIC

Target implementation

The key question is simple: is there a technology out there that can sustain our ID matching tables workloads while maintaining consistently low upsert/write and lookup/read latencies?

Having one technology to handle both use cases would allow:

  • Simpler data consistency
  • Operational simplicity and efficiency
  • Reduced costs

POC hardware overview:

So we decided to find out if Scylla could be that technology. For this, we used three decommissioned machines that we had in the basement of our Paris office.

  • 2 DELL R510
    • 19GB RAM, 2 socket 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
  • 1 DELL R710
    • 19GB RAM, 2 socket 4 cores, RAID0 SAS spinning disks, 1Gbps NIC

I know, these are not glamorous machines and they are even inconsistent in specs, but we still set up a 3 node Scylla cluster running Gentoo Linux with them.

Our take? If those three lousy machines can challenge or beat the production machines on our current workloads, then Scylla can seriously be considered for production.

Step 1: Validate a schema model

Once the POC document was complete and the ScyllaDB team understood what we were trying to do, we started iterating on the schema model using a query based modeling strategy.

So we wrote down and rated the questions that our model(s) should answer to, they included stuff like:

  • What are all our cookie IDs associated to the given partner ID ?
  • What are all the cookie IDs associated to the given partner ID over the last N months ?
  • What is the last cookie ID/date for the given partner ID ?
  • What is the last date we have seen the given cookie ID / partner ID couple ?

As you can imagine, the reverse questions are also to be answered so ID translations can be done both ways (ouch!).

Prototyping

This is no news that I’m a Python addict so I did all my prototyping using Python and the cassandra-driver.

I ended up using a test-driven data modelling strategy using pytest. I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.

Schema

In our case, we ended up with three denormalized tables to answer all the questions we had. To answer the first three questions above, you could use the schema below:

CREATE TABLE IF NOT EXISTS ids_by_partnerid(
 partnerid text,
 id text,
 date timestamp,
 PRIMARY KEY ((partnerid), date, id)
 )
 WITH CLUSTERING ORDER BY (date DESC)

Note on clustering key ordering

One important learning I got in the process of validating the model is about the internals of Cassandra’s file format that resulted in the choice of using a descending order DESC on the date clustering key as you can see above.

If your main use case of querying is to look for the latest value of an history-like table design like ours, then make sure to change the default ASC order of your clustering key to DESC. This will ensure that the latest values (rows) are stored at the beginning of the sstable file effectively reducing the read latency when the row is not in cache!

Let me quote Glauber Costa’s detailed explanation on this:

Basically in Cassandra’s file format, the index points to an entire partition (for very large partitions there is a hack to avoid that, but the logic is mostly the same). So if you want to read the first row, that’s easy you get the index to the partition and read the first row. If you want to read the last row, then you get the index to the partition and do a linear scan to the next.

This is the kind of learning you can only get from experts like Glauber and that can justify the whole POC on its own!

Step 2: Set up scylla-grafana-monitoring

As I said before, make sure to set up and run the scylla-grafana-monitoring project before running your test workloads. This easy to run solution will be of great help to understand the performance of your cluster and to tune your workload for optimal performances.

If you can, also discuss with the ScyllaDB team to allow them to access the Grafana dashboard. This will be very valuable since they know where to look better than we usually do… I gained a lot of understandings thanks to this!

Note on scrape interval

I advise you to lower the Prometheus scrape interval to have a shorter and finer sampling of your metrics. This will allow your dashboard to be more reactive when you start your test workloads.

For this, change the prometheus/prometheus.yml file like this:

scrape_interval: 2s # Scrape targets every 2 seconds (5s default)
scrape_timeout: 1s # Timeout before trying to scrape a target again (4s default)

Test your monitoring

Before going any further, I strongly advise you to run a stress test on your POC cluster using the cassandra-stress tool and share the results and their monitoring graphs with the ScyllaDB team.

This will give you a common understanding of the general performances of your cluster as well as help in outlining any obvious misconfiguration or hardware problem.

Key graphs to look at

There are a lot of interesting graphs so I’d like to share the ones that I have been mainly looking at. Remember that depending on your test workloads, some other graphs may be more relevant for you.

  • number of open connections

You’ll want to see a steady and high enough number of open connections which will prove that your clients are pushed at their maximum (at the time of testing this graph was not on Grafana and you had to add it yourself)

  • cache hits / misses

Depending on your reference dataset, you’ll obviously see that cache hits and misses will have a direct correlation with disk I/O and overall performances. Running your test workloads multiple times should trigger higher cache hits if your RAM is big enough.

  • per shard/node distribution

The Requests Served per shard graph should display a nicely distributed load between your shards and nodes so that you’re sure that you’re getting the best out of your cluster.

The same is true for almost every other “per shard/node” graph: you’re looking for evenly distributed load.

  • sstable reads

Directly linked with your disk performances, you’ll be trying to make sure that you have almost no queued sstable reads.

Step 3: Get your reference data and metrics

We obviously need to have some reference metrics on our current production stack so we can compare them with the results on our POC Scylla cluster.

Whether you choose to use your current production machines or set up a similar stack on the side to run your test workloads is up to you. We chose to run the vast majority of our tests on our current production machines to be as close to our real workloads as possible.

Prepare a reference dataset

During your work on the POC document, you should have detailed the usual data cardinality and volume you work with. Use this information to set up a reference dataset that you can use on all of the platforms that you plan to compare.

In our case, we chose a 10 Million reference dataset that we JOINed with a 400+ Million extract of an ID matching table. Those volumes seemed easy enough to work with and allowed some nice ratio for memory bound workloads.

Measure on your current stack

Then it’s time to load this reference datasets on your current platforms.

  • If you run a MongoDB cluster like we do, make sure to shard and index the dataset just like you do on the production collections.
  • On Hive, make sure to respect the storage file format of your current implementations as well as their partitioning.

If you chose to run your test workloads on your production machines, make sure to run them multiple times and at different hours of the day and night so you can correlate the measures with the load on the cluster at the time of the tests.

Reference metrics

For the sake of simplicity I’ll focus on the Hive-only batch workloads. I performed a count on the JOIN of the dataset and the ID matching table using Spark 2 and then I also ran the JOIN using a simple Hive query through Beeline.

I gave the following definitions on the reference load:

  • IDLE: YARN available containers and free resources are optimal, parallelism is very limited
  • NORMAL: YARN sustains some casual load, parallelism exists but we are not bound by anything still
  • HIGH: YARN has pending containers, parallelism is high and applications have to wait for containers before executing

There’s always an error margin on the results you get and I found that there was not significant enough differences between the results using Spark 2 and Beeline so I stuck with a simple set of results:

  • IDLE: 2 minutes, 15 seconds
  • NORMAL: 4 minutes
  • HIGH: 15 minutes

Step 4: Get Scylla in the mix

It’s finally time to do your best to break Scylla or at least to push it to its limits on your hardware… But most importantly, you’ll be looking to understand what those limits are depending on your test workloads as well as outlining out all the required tuning that you will be required to do on the client side to reach those limits.

Speaking about the results, we will have to differentiate two cases:

  1. The Scylla cluster is fresh and its cache is empty (cold start): performance is mostly Disk I/O bound
  2. The Scylla cluster has been running some test workload already and its cache is hot: performance is mostly Memory bound with some Disk I/O depending on the size of your RAM

Spark 2 / Scala test workload

Here I used Scala (yes, I did) and DataStax’s spark-cassandra-connector so I could use the magic joinWithCassandraTable function.

  • spark-cassandra-connector-2.0.1-s_2.11.jar
  • Java 7

I had to stick with the 2.0.1 version of the spark-cassandra-connector because newer version (2.0.5 at the time of testing) were performing bad with no apparent reason. The ScyllaDB team couldn’t help on this.

You can interact with your test workload using the spark2-shell:

spark2-shell --jars jars/commons-beanutils_commons-beanutils-1.9.3.jar,jars/com.twitter_jsr166e-1.1.0.jar,jars/io.netty_netty-all-4.0.33.Final.jar,jars/org.joda_joda-convert-1.2.jar,jars/commons-collections_commons-collections-3.2.2.jar,jars/joda-time_joda-time-2.3.jar,jars/org.scala-lang_scala-reflect-2.11.8.jar,jars/spark-cassandra-connector-2.0.1-s_2.11.jar

Then use the following Scala imports:

// main connector import
import com.datastax.spark.connector._

// the joinWithCassandraTable failed without this (dunno why, I'm no Scala guy)
import com.datastax.spark.connector.writer._
implicit val rowWriter = SqlRowWriter.Factory

Finally I could run my test workload to select the data from Hive and JOIN it with Scylla easily:

val df_population = spark.sql("SELECT id FROM population_10M_rows")
val join_rdd = df_population.rdd.repartitionByCassandraReplica("test_keyspace", "partner_id_match_400M_rows").joinWithCassandraTable("test_keyspace", "partner_id_match_400M_rows")
val joined_count = join_rdd.count()

Notes on tuning spark-cassandra-connector

I experienced pretty crappy performances at first. Thanks to the easy Grafana monitoring, I could see that Scylla was not being the bottleneck at all and that I instead had trouble getting some real load on it.

So I engaged in a thorough tuning of the spark-cassandra-connector with the help of Glauber… and it was pretty painful but we finally made it and got the best parameters to get the load on the Scylla cluster close to 100% when running the test workloads.

This tuning was done in the spark-defaults.conf file:

  • have a fixed set of executors and boost their overhead memory

This will increase test results reliability by making sure you always have a reserved number of available workers at your disposal.

spark.dynamicAllocation.enabled=false
spark.executor.instances=30
spark.yarn.executor.memoryOverhead=1024
  • set the split size to 1MB

Default is 8MB but Scylla uses a split size of 1MB so you’ll see a great boost of performance and stability by setting this setting to the right number.

spark.cassandra.input.split.size_in_mb=1
  • align driver timeouts with server timeouts

It is advised to make sure that your read request timeouts are the same on the driver and the server so you do not get stalled states waiting for a timeout to happen on one hand. You can do the same with write timeouts if your test workloads are write intensive.

/etc/scylla/scylla.yaml

read_request_timeout_in_ms: 150000

spark-defaults.conf

spark.cassandra.connection.timeout_ms=150000
spark.cassandra.read.timeout_ms=150000

// optional if you want to fail / retry faster for HA scenarios
spark.cassandra.connection.reconnection_delay_ms.max=5000
spark.cassandra.connection.reconnection_delay_ms.min=1000
spark.cassandra.query.retry.count=100
  • adjust your reads per second rate

Last but surely not least, this setting you will need to try and find out the best value for yourself since it has a direct impact on the load on your Scylla cluster. You will be looking at pushing your POC cluster to almost 100% load.

spark.cassandra.input.reads_per_sec=6666

As I said before, I could only get this to work perfectly using the 2.0.1 version of the spark-cassandra-connector driver. But then it worked very well and with great speed.

Spark 2 results

Once tuned, the best results I was able to reach on this hardware are listed below. It’s interesting to see that with spinning disks, the cold start result can compete with the results of a heavily loaded Hadoop cluster where pending containers and parallelism are knocking down its performances.

  • hot cache: 2min
  • cold cache: 12min

Wow! Those three refurbished machines can compete with our current production machines and implementations, they can even match an idle Hive cluster of a medium size!

Python test workload

I couldn’t conclude on a Scala/Spark 2 only test workload. So I obviously went back to my language of choice Python only to discover at my disappointment that there is no joinWithCassandraTable equivalent available on pyspark

I tried with some projects claiming otherwise with no success until I changed my mind and decided that I probably didn’t need Spark 2 at all. So I went into the crazy quest of beating Spark 2 performances using a pure Python implementation.

This basically means that instead of having a JOIN like helper, I had to do a massive amount of single “id -> partnerid” lookups. Simple but greatly inefficient you say? Really?

When I broke down the pieces, I was left with the following steps to implement and optimize:

  • Load the 10M rows worth of population data from Hive
  • For every row, lookup the corresponding partnerid in the ID matching table from Scylla
  • Count the resulting number of matches

The main problem to compete with Spark 2 is that it is a distributed framework and Python by itself is not. So you can’t possibly imagine outperforming Spark 2 with your single machine.

However, let’s remember that Spark 2 is shipped and ran on executors using YARN so we are firing up JVMs and dispatching containers all the time. This is a quite expensive process that we have a chance to avoid using Python!

So what I needed was a distributed computation framework that would allow to load data in a partitioned way and run the lookups on all the partitions in parallel before merging the results. In Python, this framework exists and is named Dask!

You will obviously need to have to deploy a dask topology (that’s easy and well documented) to have a comparable number of dask workers than of Spark 2 executors (30 in my case) .

The corresponding Python code samples are here.

Hive + Scylla results

Reading the population id’s from Hive, the workload can be split and executed concurrently on multiple dask workers.

  • read the 10M population rows from Hive in a partitioned manner
  • for each partition (slice of 10M), query Scylla to lookup the possibly matching partnerid
  • create a dataframe from the resulting matches
  • gather back all the dataframes and merge them
  • count the number of matches

The results showed that it is possible to compete with Spark 2 with Dask:

  • hot cache: 2min (rounded up)
  • cold cache: 6min

Interestingly, those almost two minutes can be broken down like this:

  • distributed read data from Hive: 50s
  • distributed lookup from Scylla: 60s
  • merge + count: 10s

This meant that if I could cut down the reading of data from Hive I could go even faster!

Parquet + Scylla results

Going further on my previous remark I decided to get rid of Hive and put the 10M rows population data in a parquet file instead. I ended up trying to find out the most efficient way to read and load a parquet file from HDFS.

My conclusion so far is that you can’t be the amazing libhdfs3 + pyarrow combo. It is faster to load everything on a single machine than loading from Hive on multiple ones!

The results showed that I could almost get rid of a whole minute in the total process, effectively and easily beating Spark 2!

  • hot cache: 1min 5s
  • cold cache: 5min

Notes on the Python cassandra-driver

Tests using Python showed robust queries experiencing far less failures than the spark-cassandra-connector, even more during the cold start scenario.

  • The usage of execute_concurrent() provides a clean and linear interface to submit a large number of queries while providing some level of concurrency control
  • Increasing the concurrency parameter from 100 to 512 provided additional throughput, but increasing it more looked useless
  • Protocol version 4 forbids the tuning of connection requests / number to some sort of auto configuration. All tentative to hand tune it (by lowering protocol version to 2) failed to achieve higher throughput
  • Installation of libev on the system allows the cassandra-driver to use it to handle concurrency instead of asyncore with a somewhat lower load footprint on the worker node but no noticeable change on the throughput
  • When reading a parquet file stored on HDFS, the hdfs3 + pyarrow combo provides an insane speed (less than 10s to fully load 10M rows of a single column)

Step 5: Play with High Availability

I was quite disappointed and surprised by the lack of maturity of the Cassandra community on this critical topic. Maybe the main reason is that the cassandra-driver allows for too many levels of configuration and strategies.

I wrote this simple bash script to allow me to simulate node failures. Then I could play with handling those failures and retries on the Python client code.

#!/bin/bash

iptables -t filter -X
iptables -t filter -F

ip="0.0.0.0/0"
for port in 9042 9160 9180 10000 7000; do
	iptables -t filter -A INPUT -p tcp --dport ${port} -s ${ip} -j DROP
	iptables -t filter -A OUTPUT -p tcp --sport ${port} -d ${ip} -j DROP
done

while true; do
	trap break INT
	clear
	iptables -t filter -vnL
	sleep 1
done

iptables -t filter -X
iptables -t filter -F
iptables -t filter -vnL

This topic is worth going in more details on a dedicated blog post that I shall write later on while providing code samples.

Concluding the evaluation

I’m happy to say that Scylla passed our production evaluation and will soon go live on our infrastructure!

As I said at the beginning of this post, the conclusion of the evaluation has not been driven by the good figures we got out of our test workloads. Those are no benchmarks and never pretended to be but we could still prove that performances were solid enough to not be a blocker in the adoption of Scylla.

Instead we decided on the following points of interest (in no particular order):

  • data consistency
  • production reliability
  • datacenter awareness
  • ease of operation
  • infrastructure rationalisation
  • developer friendliness
  • costs

On the side, I tried Scylla on two other different use cases which proved interesting to follow later on to displace MongoDB again…

Moving to production

Since our relationship was great we also decided to partner with ScyllaDB and support them by subscribing to their enterprise offerings. They also accepted to support us using Gentoo Linux!

We are starting with a three nodes heavy duty cluster:

  • DELL R640
    • dual socket 2,6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3,2 TB

I’m eager to see ScyllaDB building up and will continue to help with my modest contributions. Thanks again to the ScyllaDB team for their patience and support during the POC!

February 27, 2018
Thomas Raschbacher a.k.a. lordvan (homepage, bugs)

So .. since this page/blog is running on Mezzanine I thought I'd share what I had to do to get it to work.

First off I did this on Gentoo, but in general stuff should apply to most other distributions anyway.

Versions I used:

  • Apache 2.4.27
  • python 3.6.3
  • mod_wsgi 4.5.13
  • postgresql 9.4

Don't forget to enable apache loading mod_wsgi (on Gentoo add "-D WSGI " to APACHE2_OPTS in /etc/conf.d/apache).

I am running Mezzanine on it's own virtualhost.

For the rest of this I go with the assumption that the above is installed and configured correctly.

Quick & dirty Mezzanine install:

python3.6 -m venv myenv
source myenv/bin/activate
pip install mezzanine south psycopg2

Of course replace psycopg2 with whatever Database driver you intend to use.

Here is what I installed (pip freeze output):

beautifulsoup4==4.6.0
bleach==2.1.2
certifi==2018.1.18
chardet==3.0.4
Django==1.10.8
django-contrib-comments==1.8.0
filebrowser-safe==0.4.7
future==0.16.0
grappelli-safe==0.4.7
html5lib==1.0.1
idna==2.6
Mezzanine==4.2.3
oauthlib==2.0.6
Pillow==5.0.0
psycopg2==2.7.4
pytz==2018.3
requests==2.18.4
requests-oauthlib==0.8.0
six==1.11.0
South==1.0.2
tzlocal==1.5.1
urllib3==1.22
webencodings==0.5.1

Then create your project:

mezzanine-project <projectname>

Then edit your <project>/local_settings.py to use the correct database settings - and don't forget to set ALLOWED_HOSTS (like I did at first).

chmod +x <project>/manage.py
./<project>/manage.py createdb

This should create the DB, superuser, .. then I ran

./<project>/manage.py collectstatic

you can test it with

./<project>/manage.py runserver 

If you need to change the ip/port:

./<project>/manage.py runserver <ip>:<port>

Make sure stuff works ok, set up what you want to setup first or do your development.

For deploying with mod_wsgi .. here's a snippet from my apache config (I run it on https *only* and just redirect http to the https version):

<VirtualHost <your_ip>:443>
ServerName your.domain.name
ServerAdmin your_email@your_domain
ErrorLog /path/to/your/logs/your.domain.name_error.log
CustomLog /path/to/your/logs/your.domain.name_access.log combined

LogLevel Info

SSLCertificateFile /path/to/your/sslcerts/cert.pem
SSLCertificateKeyFile /path/to/your/sslcerts/privkey.pem
SSLCertificateChainFile /path/to/your/sslcerts/fullchain.pem

WSGIDaemonProcess lvmezz home=/path/to/your/MezzanineInstallMezzanine/myenv processes=1 threads=15 display-name=[wsgi-lvmezz]httpd python-path=/path/to/your/MezzanineInstallMezzanine/mezzproject:/path/to/your/MezzanineInstallMezzanine/myenv/lib64/python3.6/site-packages

WSGIProcessGroup lvmezz
WSGIApplicationGroup %{GLOBAL}

WSGIScriptAlias / /path/to/your/MezzanineInstallMezzanine/mezzproject/apache.wsgi
Alias /static /path/to/your/MezzanineInstallMezzanine/mezzproject/static
Alias /robots.txt /path/to/your/MezzanineInstallMezzanine/htdocs_static/robots.txt
Alias /favicon.ico /path/to/your/MezzanineInstallMezzanine/htdocs_static/favicon.ico

<Directory /path/to/your/MezzanineInstallMezzanine/mezzproject>
  Options -Indexes +FollowSymLinks +MultiViews
  php_flag engine off
  <IfModule mod_authz_host.c>
    Require all granted
  </IfModule>
</Directory>

<Directory /path/to/your/MezzanineInstallMezzanine/mezzproject/static>
   Options -Indexes +FollowSymLinks +MultiViews -ExecCGI
   php_flag engine off
   RemoveHandler .cgi .php .php3 .php4 .phtml .pl .py .pyc .pyo
   AllowOverride None
   <IfModule mod_authz_host.c>
      Require all granted
   </IfModule>
</Directory>

</VirtualHost>

Should all be pretty self explainatory (maybe I'll elaborate at a later point, but I don't have that much time now and I'd rather get it finished).

Here's the apache.wsgi file:

from __future__ import unicode_literals
import os, sys, site

site.addsitedir('/path/to/your/MezzanineInstall/myenv/lib64/python3.6/site-packages')
activate_this = os.path.expanduser('/path/to/your/MezzanineInstall/myenv/bin/activate_this.py')
exec(open(activate_this, 'r').read(), dict(__file__=activate_this))

PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(PROJECT_ROOT, ".."))
settings_module = "%s.settings" % PROJECT_ROOT.split(os.sep)[-1]
os.environ["DJANGO_SETTINGS_MODULE"] = settings_module

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

I found activate_this.py at https://github.com/pypa/virtualenv/blob/master/virtualenv_embedded/activate_this.py (since with python3 execfile wasn't really working for me):

"""By using execfile(this_file, dict(__file__=this_file)) you will
activate this virtualenv environment.
This can be used when you must use an existing Python interpreter, not
the virtualenv bin/python
"""

try:
    __file__
except NameError:
    raise AssertionError(
        "You must run this like execfile('path/to/activate_this.py', dict(__file__='path/to/activate_this.py'))")
import sys
import os

old_os_path = os.environ.get('PATH', '')
os.environ['PATH'] = os.path.dirname(os.path.abspath(__file__)) + os.pathsep + old_os_path
base = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if sys.platform == 'win32':
    site_packages = os.path.join(base, 'Lib', 'site-packages')
else:
    site_packages = os.path.join(base, 'lib', 'python%s' % sys.version[:3], 'site-packages')
prev_sys_path = list(sys.path)
import site
site.addsitedir(site_packages)
sys.real_prefix = sys.prefix
sys.prefix = base
# Move the added items to the front of the path:
new_sys_path = []
for item in list(sys.path):
    if item not in prev_sys_path:
        new_sys_path.append(item)
        sys.path.remove(item)
sys.path[:0] = new_sys_path

that would be the Apache + mod_wsgi config (make sure to replace the python version number if you don'T use 3.6)

Make sure that apache has the correct permissions to all the files too btw ;)

February 19, 2018
Jason A. Donenfeld a.k.a. zx2c4 (homepage, bugs)
WireGuard in Google Summer of Code (February 19, 2018, 14:55 UTC)

WireGuard is participating in Google Summer of Code 2018. If you're a student — bachelors, masters, PhD, or otherwise — who would like to be funded this summer for writing interesting kernel code, studying cryptography, building networks, making mobile apps, contributing to the larger open source ecosystem, doing web development, writing documentation, or working on a wide variety of interesting problems, then this may be appealing. You'll be mentored by world-class experts, and the summer will certainly boost your skills. Details are on this page — simply contact the WireGuard team to get a proposal into the pipeline.

Gentoo accepted into Google Summer of Code 2018 (February 19, 2018, 00:00 UTC)

Students who want to spend their summer having fun and writing code can do so now for Gentoo. Gentoo has been accepted as a mentoring organization for this year’s Google Summer of Code.

The GSoC is an excellent opportunity for gaining real-world experience in software design and making one’s self known in the broader open source community. It also looks great on a resume.

Initial project ideas can be found here, although new projects ideas are welcome. For new projects time is of the essence: there is typically some idea-polishing which must occur before the March 27th deadline. Because of this it is strongly recommended that students refine new project ideas with a mentor before proposing the idea formally.

GSoC students are encouraged to begin discussing ideas in the #gentoo-soc IRC channel on the Freenode network.

Further information can be found on the Gentoo GSoC 2018 wiki page. Those with unanswered questions should not hesitate to contact the Summer of Code mentors via the mailing list.

February 14, 2018
Sebastian Pipping a.k.a. sping (homepage, bugs)
I love free software... and Gentoo does! #ilovefs (February 14, 2018, 14:25 UTC)

Some people care if software is free of cost or if it has the best features, above everything else. I don't. I care that I can legally inspect its inner workings, modify and share modified versions. That's why I happily avoid macOS, Windows, Skype, Photoshop. I ran into these two pieces involving Gentoo in the Gallery of Free Software lovers and would like to share them with you:

Images are licensed under CC BY-SA 4.0 (with attribution going to Free Software Foundation Europe) as confirmed by Max Mehl.

February 10, 2018
Matthew Thode a.k.a. prometheanfire (homepage, bugs)
Native ZFS encryption for your rootfs (February 10, 2018, 06:00 UTC)

Disclaimer

I'm not responsible if you ruin your system, this guide functions as documentation for future me. Remember to back up your data.

Why do this instead of luks

I wanted to remove a layer from the File to Disk layering, before it was ZFS -> LUKS -> disk, now it's ZFS -> disk.

Prework

I just got a new laptop and wanted to just migrate the data, luckily the old laptop was using ZFS as well, so the data could be sent/received though native ZFS means.

The actual setup

Set up your root pool with the encryption key, it will be inherited by all child datasets, no child datasets will be allowed to be unencrypted.

In my case the pool name was slaanesh-zp00, so I ran the following to create the fresh pool.

zpool create -O encryption=on -O keyformat=passphrase zfstest /dev/zvol/slaanesh-zp00/zfstest

After that just go on and create your datasets as normal, transfer old data as needed (it'll be encrypted as it's written). See https://wiki.gentoo.org/wiki/ZFS for a good general guide on setting up your datasets.

decrypting at boot

If you are using dracut it should just work. No changes to what you pass on the kernel command line are needed. The code is upstream in https://github.com/zfsonlinux/zfs/blob/master/contrib/dracut/90zfs/zfs-load-key.sh.in

notes

Make sure you install from git master, there was a disk format change for encrypted datasets that just went in a week or so ago.

February 06, 2018
Gentoo at FOSDEM 2018 (February 06, 2018, 19:49 UTC)

Gentoo Linux participated with a stand during this year's FOSDEM 2018, as has been the case for the past several years. Three Gentoo developers had talks this year, Haubi was back with a Gentoo-related talk on Unix? Windows? Gentoo! - POSIX? Win32? Native Portability to the max!, dilfridge talked about Perl in the Physics Lab … Continue reading "Gentoo at FOSDEM 2018"

January 30, 2018
FOSDEM is near (January 30, 2018, 00:00 UTC)

Excitement is building with FOSDEM 2018 only a few days away. There are now 14 current and one former developer in planned attendance, along with many from the Gentoo community.

This year one Gentoo related talk titled Unix? Windows? Gentoo! will be given by Michael Haubenwallner (haubi ) of the Gentoo Prefix project.

Two more non-Gentoo related talks will be given by current Gentoo developers:

If you attend don’t miss out on your opportunity to build the Gentoo web-of-trust by bringing a valid governmental ID, a printed key fingerprint, and various UIDs you want certified. See you at FOSDEM!

January 23, 2018
Alexys Jacob a.k.a. ultrabug (homepage, bugs)
Evaluating ScyllaDB for production 1/2 (January 23, 2018, 17:31 UTC)

I have recently been conducting a quite deep evaluation of ScyllaDB to find out if we could benefit from this database in some of our intensive and latency critical data streams and jobs.

I’ll try to share this great experience within two posts:

  1. The first one (you’re reading) will walk through how to prepare yourself for a successful Proof Of Concept based evaluation with the help of the ScyllaDB team.
  2. The second post will cover the technical aspects and details of the POC I’ve conducted with the various approaches I’ve followed to find the most optimal solution.

But let’s start with how I got into this in the first place…


Selecting ScyllaDB

I got interested in ScyllaDB because of its philosophy and engagement and I quickly got into it by being a modest contributor and its Gentoo Linux packager (not in portage yet).

Of course, I didn’t pick an interest in that technology by chance:

We’ve been using MongoDB in (mass) production at work for a very very long time now. I can easily say we were early MongoDB adopters. But there’s no wisdom in saying that MongoDB is not suited for every use case and the Hadoop stack has come very strong in our data centers since then, with a predominance of Hive for the heavy duty and data hungry workflows.

One thing I was never satisfied with MongoDB was its primary/secondary architecture which makes you lose write throughput and is even more horrible when you want to set up what they call a “cluster” which is in fact some mediocre abstraction they add on top of replica-sets. To say the least, it is inefficient and cumbersome to operate and maintain.

So I obviously had Cassandra on my radar for a long time, but I was pushed back by its Java stack, heap size and silly tuning… Also, coming from the versatile MongoDB world, Cassandra’s CQL limitations looked dreadful at that time…

The day I found myself on ScyllaDB’s webpage and read their promises, I was sure to be challenging our current use cases with this interesting sea monster.


Setting up a POC with the people at ScyllaDB

Through my contributions around my packaging of ScyllaDB for Gentoo Linux, I got to know a bit about the people behind the technology. They got interested in why I was packaging this in the first place and when I explained my not-so-secret goal of challenging our production data workflows using Scylla, they told me that they would love to help!

I was a bit surprised at first because this was the first time I ever saw a real engagement of the people behind a technology into someone else’s POC.

Their pitch is simple, they will help (for free) anyone conducting a serious POC to make sure that the outcome and the comprehension behind it is the best possible. It is a very mature reasoning to me because it is easy to make false assumptions and conclude badly when testing a technology you don’t know, even more when your use cases are complex and your expectations are very high like us.

Still, to my current knowledge, they’re the only ones in the data industry to have this kind of logic in place since the start. So I wanted to take this chance to thank them again for this!

The POC includes:

  • no bullshit, simple tech-to-tech relationship
  • a private slack channel with multiple ScyllaDB’s engineers
  • video calls to introduce ourselves and discuss our progress later on
  • help in schema design and logic
  • fast answers to every question you have
  • detailed explanations on the internals of the technology
  • hardware sizing help and validation
  • funny comments and French jokes (ok, not suitable for everyone)

 

 

 

 

 

 

 

 

 


Lessons for a successful POC

As I said before, you’ve got to be serious in your approach to make sure your POC will be efficient and will lead to an unbiased and fair conclusion.

This is a list of the main things I consider important to have prepared before you start.

Have some background

Make sure to read some literature to have the key concepts and words in mind before you go. It is even more important if like me you do not come from the Cassandra world.

I found that the Cassandra: The Definitive Guide book at O’Reilly is a great read. Also, make sure to go around ScyllaDB’s documentation.

Work with a shared reference document

Make sure you share with the ScyllaDB guys a clear and detailed document explaining exactly what you’re trying to achieve and how you are doing it today (if you plan on migrating like we did).

I made a google document for this because it felt the easiest. This document will be updated as you go and will serve as a reference for everyone participating in the POC.

This shared reference document is very important, so if you don’t know how to construct it or what to put in it, here is how I structured it:

  1. Who’s participating at <your company>
    • photo + name + speciality
  2. Who’s participating at ScyllaDB
  3. POC hardware
    • if you have your own bare metal machines you want to run your POC on, give every detail about their number and specs
    • if not, explain how you plan to setup and run your scylla cluster
  4. Reference infrastructure
    • give every details on the technologies and on the hardware of the servers that are currently responsible for running your workflows
    • explain your clusters and their speciality
  5. Use case #1 : <name>
    • Context
      • give context about your use case by explaining it without tech words, think from the business / user point of view
    • Current implementations
      • that’s where you get technical
      • technology names and where they come into play in your current stack
      • insightful data volumes and cardinality
      • current schema models
    • Workload related to this use case
      • queries per second per data source / type
      • peek hours or no peek hours?
      • criticality
    • Questions we want to answer to
      • remember, the NoSQL world is lead by query-based-modeling schema design logic, cassandra/scylla is no exception
      • write down the real questions you want your data model(s) to be able to answer to
      • group them and rate them by importance
    • Validated models
      • this one comes during the POC when you have settled on the data models
      • write them down, explain them or relate them to the questions they answer to
      • copy/paste some code showcasing how to work with them
    • Code examples
      • depending on the complexity of your use case, you may have multiple constraints or ways to compare your current implementation with your POC
      • try to explain what you test and copy/paste the best code you came up with to validate each point

Have monitoring in place

ScyllaDB provides a monitoring platform based on Docker, Prometheus and Grafana that is efficient and simple to set up. I strongly recommend that you set it up, as it provides valuable insights almost immediately, and on an ongoing basis.

Also you should strive to give access to your monitoring to the ScyllaDB guys, if that’s possible for you. They will provide with a fixed IP which you can authorize to access your grafana dashboards so they can have a look at the performances of your POC cluster as you go. You’ll learn a great deal about ScyllaDB’s internals by sharing with them.

Know when to stop

The main trap in a POC is to work without boundaries. Since you’re looking for the best of what you can get out of a technology, you’ll get tempted to refine indefinitely.

So this is good to have at least an idea on the minimal figures you’d like to reach to get satisfied with your tests. You can always push a bit further but not for too long!

Plan some high availability tests

Even if you first came to ScyllaDB for its speed, make sure to test its high availability capabilities based on your experience.

Most importantly, make sure you test it within your code base and guidelines. How will your code react and handle a failure, partial and total? I was very surprised and saddened to discover so little literature on the subject in the Cassandra community.

POC != production

Remember that even when everything is right on paper, production load will have its share of surprises and unexpected behaviours. So keep a good deal of flexibility in your design and your capacity planning to absorb them.

Make time

Our POC lasted almost 5 months instead of estimated 3, mostly because of my agenda’s unwillingness to cooperate…

As you can imagine this interruption was not always optimal, for either me or the ScyllaDB guys, but they were kind not to complain about it. So depending on how thorough you plan to be, make sure you make time matching your degree of demands. The reference document is also helpful to get back to speed.


Feedback for the ScyllaDB guys

Here are the main points I noted during the POC that the guys from ScyllaDB could improve on.

They are subjective of course but it’s important to give feedback so here it goes. I’m fully aware that everyone is trying to improve, so I’m not pointing any fingers at all.

I shared those comments already with them and they acknowledged them very well.

More video meetings on start

When starting the POC, try to have some pre-scheduled video meetings to set it right in motion. This will provide a good pace as well as making sure that everyone is on the same page.

Make a POC kick starter questionnaire

Having a minimal plan to follow with some key points to set up just like the ones I explained before would help. Maybe also a minimal questionnaire to make sure that the key aspects and figures have been given some thought since the start. This will raise awareness on the real answers the POC aims to answer.

To put it simpler: some minimal formalism helps to check out the key aspects and questions.

Develop a higher client driver expertise

This one was the most painful to me, and is likely to be painful for anyone who, like me, is not coming from the Cassandra world.

Finding good and strong code examples and guidelines on the client side was hard and that’s where I felt the most alone. This was not pleasant because a technology is definitely validated through its usage which means on the client side.

Most of my tests were using python and the python-cassandra driver so I had tons of questions about it with no sticking answers. Same thing went with the spark-cassandra-connector when using scala where some key configuration options (not documented) can change the shape of your results drastically (more details on the next post).

High Availability guidelines and examples

This one still strikes me as the most awkward on the Cassandra community. I literally struggled with finding clear and detailed explanations about how to handle failure more or less gracefully with the python driver (or any other driver).

This is kind of a disappointment to me for a technology that position itself as highly available… I’ll get into more details about it on the next post.

A clearer sizing documentation

Even if there will never be a magic formula, there are some rules of thumb that exist for sizing your hardware for ScyllaDB. They should be written down more clearly in a maybe dedicated documentation (sizing guide is labeled as admin guide at time of writing).

Some examples:

  • RAM per core ? what is a core ? relation to shard ?
  • Disk / RAM maximal ratio ?
  • Multiple SSDs vs one NMVe ?
  • Hardware RAID vs software RAID ? need a RAID controller at all ?

Maybe even provide a bare metal complete example from two different vendors such as DELL and HP.

What’s next?

In the next post, I’ll get into more details on the POC itself and the technical learnings we found along the way. This will lead to the final conclusion and the next move we engaged ourselves with.

Marek Szuba a.k.a. marecki (homepage, bugs)
Randomness in virtual machines (January 23, 2018, 15:47 UTC)

I always felt that entropy available to the operating system must be affected by running said operating system in a virtual environment – after all, unpredictable phenomena used to feed the entropy pool are commonly based on hardware and in a VM most hardware either is simulated or has the hypervisor mediate access to it. While looking for something tangentially related to the subject, I have recently stumbled upon a paper commissioned by the German Federal Office for Information Security which covers this subject, with particular emphasis on entropy sources used by the standard Linux random-number generator (i.e. what feeds /dev/random and /dev/urandom), in extreme detail:

https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Publikationen/Studien/ZufallinVMS/Randomness-in-VMs.pdf?__blob=publicationFile&v=3

Personally I have found this paper very interesting but since not everyone can stomach a 142-page technical document, here are some of the main points made by its authors regarding entropy.

  1. As a reminder – the RNG in recent versions of Linux uses five sources of random noise, three of which do not require dedicated hardware (emulated or otherwise): Human-Interface Devices, rotational block devices, and interrupts.
  2. Running in a virtual machine does not affect entropy sources implemented purely in hardware or purely in software. However, it does affect hybrid sources – which in case of Linux essentially means all of the default ones.
  3. Surprisingly enough, virtualisation seems to have no negative effect on the quality of delivered entropy. This is at least in part due to additional CPU execution-time jitter introduced by the hypervisor compensating for increased predictability of certain emulated devices.
  4. On the other hand, virtualisation can strongly affect the quantity of produced entropy. More on this below.
  5. Low quantity of available entropy means among other things that it takes virtual machines visibly longer to bring /dev/urandom to usable state. This is a problem if /dev/urandom is used by services started at boot time because they can be initialised using low-quality random data.

Why exactly is the quantity of entropy low in virtual machines? The problem is that in a lot of configurations, only the last of the three standard noise sources will be active. On the one hand, even physical servers tend to be fairly inactive on the HID front. On the other, the block-device source does nothing unless directly backed by a rotational device – which has been becoming less and less likely, especially when we talk about large cloud providers who, chances are, hold your persistent storage on distributed networked file systems which are miles away from actual hard drives. This leaves interrupts as the only available noise source. Now take a machine configured this way and have it run a VPN endpoint, a HTTPS server, a DNSSEC-enabled DNS server… You get the idea.

But wait, it gets worse. Chances are many devices in your VM, especially ones like network cards which are expected to be particularly active and therefore the main source of interrupts, are in fact paravirtualised rather than fully emulated. Will such devices still generate enough interrupts? That depends on the underlying hypervisor, or to be precise on the paravirtualisation framework it uses. The BSI paper discusses this for KVM/QEMU, Oracle VirtualBox, Microsoft Hyper-V, and VMWare ESXi:

  • the former two use the the VirtIO framework, which is integrated with the Linux interrupt-handling code;
  • VMWare drivers trigger the interrupt handler similarly to physical devices;
  • Hyper-V drivers use a dedicated communication channel called VMBus which does not invoke the LRNG interrupt handler. This means it is entirely feasible for a Linux Hyper-V guest to have all noise sources disabled, and reenabling the interrupt one comes at a cost of performance loss caused by the use of emulated rather than paravirtualised devices.

 

All in all, the paper in question has surprised me with the news of unchanged quality of entropy in VMs and confirmed my suspicions regarding its quantity. It also briefly mentions (which is how I have ended up finding it) the way I typically work around this problem, i.e. by employing VirtIO-RNG – a paravirtualised hardware RNG in the KVM/VirtualBox guests which which interfaces with a source of entropy (typically /dev/random, although other options are possible too) on the host. Combine that with haveged on the host, sprinkle a bit of rate limiting on top (I typically set it to 1 kB/s per guest, even though in theory haveged is capable of acquiring entropy at a rate several orders of magnitude higher) and chances are you will never have to worry about lack of entropy in your virtual machines again.

January 18, 2018
Matthew Thode a.k.a. prometheanfire (homepage, bugs)
Undervolting your CPU for fun and profit (January 18, 2018, 06:00 UTC)

Disclaimer

I'm not responsible if you ruin your system, this guide functions as documentation for future me. While this should just cause MCEs or system resets if done wrong it might also be able to ruin things in a more permanent way.

Why do this

It lowers temps generally and if you hit thermal throttling (as happens with my 5th Generation X1 Carbon) you may thermally throttle less often or at a higher frequency.

Prework

Repaste first, it's generally easier to do and can result in better gains (and it's all about them gains). For instance, my Skull Canyon NUC was resetting itself thermally when stress testing. Repasting lowered the max temps by 20°C.

Undervolting - The How

I based my info on https://github.com/mihic/linux-intel-undervolt which seems to work on my Intel Kaby Lake based laptop.

Using the MSR registers to write the values via msr-tools wrmsr binary.

The following python3 snippet is how I got the values to write.

# for a -110mv offset I run the following
format(0xFFE00000&( (round(-110*1.024)&0xFFF) <<21), '08x')

What you are actually writing is actually as follows, with the plane index being for cpu, gpu or cache for 0, 1 or 2 respectively.

constant plane index constant write/read offset
80000 0 1 1 F1E00000

So we end up with 0x80000011F1E00000 to write to MSR register 0x150.

Undervolting - The Integration

I made a script in /opt/bin/, where I place my custom system scripts called undervolt.sh (make sure it's executable).

#!/usr/bin/env sh
/usr/sbin/wrmsr 0x150 0x80000011F1E00000  # cpu core  -110
/usr/sbin/wrmsr 0x150 0x80000211F1E00000  # cpu cache -110
/usr/sbin/wrmsr 0x150 0x80000111F4800000  # gpu core  -90

I then made a custom systemd unit in /etc/systemd/system called undervolt.service with the following content.

[Unit]
Description=Undervolt Service

[Service]
ExecStart=/opt/bin/undervolt.sh

[Install]
WantedBy=multi-user.target

I then systemctl enable undervolt.service so I'll get my settings on boot.

The following script was also placed in /usr/lib/systemd/system-sleep/undervolt.sh to get the settings after recovering from sleep, as they are lost at that point.

#!/bin/sh
if [ "${1}" == "post" ]; then
  /opt/bin/undervolt.sh
fi

That's it, everything after this point is what I went through in testing.

Testing for stability

I stress tested with mprime in stress mode for the CPU and glxmark or gputest for the GPU. I only recorded the results for the CPU, as that's what I care about.

No changes:

  • 15-17W package tdp
  • 15W core tdp
  • 3.06-3.22 Ghz

50mv Undervolt:

  • 15-16W package tdp
  • 14-15W core tdp
  • 3.22-3.43 Ghz

70mv undervolt:

  • 15-16W package tdp
  • 14W core tdp
  • 3.22-3.43 Ghz

90mv undervolt:

  • 15W package tdp
  • 13-14W core tdp
  • 3.32-3.54 Ghz

100mv undervolt:

  • 15W package tdp
  • 13W core tdp
  • 3.42-3.67 Ghz

110mv undervolt:

  • 15W package tdp
  • 13W core tdp
  • 3.42-3.67 Ghz

started getting gfx artifacts, switched the gfx undervolt to 100 here

115mv undervolt:

  • 15W package tdp
  • 13W core tdp
  • 3.48-3.72 Ghz

115mv undervolt with repaste:

  • 15-16W package tdp
  • 14-15W core tdp
  • 3.63-3.81 Ghz

120mv undervolt with repaste:

  • 15-16W package tdp
  • 14-15W core tdp
  • 3.63-3.81 Ghz

I decided on 110mv cpu and 90mv gpu undervolt for stability, with proper ventilation I get about 3.7-3.8 Ghz out of a max of 3.9 Ghz.

Other notes: Undervolting made the cpu max speed less spiky.

January 08, 2018
Andreas K. Hüttel a.k.a. dilfridge (homepage, bugs)
FOSDEM 2018 talk: Perl in the Physics Lab (January 08, 2018, 17:07 UTC)

FOSDEM 2018, the "Free and Open Source Developers' European Meeting", takes place 3-4 February at Universite Libre de Bruxelles, Campus Solbosch, Brussels - and our measurement control software Lab::Measurement will be presented there in the Perl devrooom! As all of FOSDEM, the talk will also be streamed live and archived; more details on this follow later. Here's the abstract:

Perl in the Physics Lab
Andreas K. Hüttel
    Track: Perl Programming Languages devroom
    Room: K.4.601
    Day: Sunday
    Start: 11:00
    End: 11:40 
Let's visit our university lab. We work on low-temperature nanophysics and transport spectroscopy, typically measuring current through experimental chip structures. That involves cooling and temperature control, dc voltage sources, multimeters, high-frequency sources, superconducting magnets, and a lot more fun equipment. A desktop computer controls the experiment and records and evaluates data.

Some people (like me) want to use Linux, some want to use Windows. Not everyone knows Perl, not everyone has equal programming skills, not everyone knows equally much about the measurement hardware. I'm going to present our solution for this, Lab::Measurement. We implement a layer structure of Perl modules, all the way from the hardware access and the implementation of device-specific command sets to high level measurement control with live plotting and metadata tracking. Current work focuses on a port of the entire stack to Moose, to simplify and improve the code.

FOSDEM

January 03, 2018
FOSDEM 2018 (January 03, 2018, 00:00 UTC)

FOSDEM 2018 logo

Put on your cow bells and follow the herd of Gentoo developers to Université libre de Bruxelles in Brussels, Belgium. This year FOSDEM 2018 will be held on February 3rd and 4th.

Our developers will be ready to candidly greet all open source enthusiasts at the Gentoo stand in building K. Visit this year’s wiki page to see which developer will be running the stand during the different visitation time slots. So far seven developers have specified their attendance, with most-likely more on the way!

Unlike past years, Gentoo will not be hosting a keysigning party, however participants are encouraged to exchange and verify OpenPGP key data to continue building the Gentoo Web of Trust. See the wiki article for more details.

December 28, 2017
Michael Palimaka a.k.a. kensington (homepage, bugs)
Helper scripts for CMake-based ebuilds (December 28, 2017, 14:13 UTC)

I recently pushed two new scripts to the KDE overlay to help with CMake-based ebuild creation and maintenance.

cmake_dep_check inspects all CMakeLists.txt files under the current directory and maps find_package calls to Gentoo packages.

cmake_ebuild_check builds on cmake_dep_check and produces a list of CMake dependencies that are not present in the ebuild (and optionally vice versa).

If you find it useful or run into any issues feel free to drop me a line.