Readiness protocol problems with Unix dæmons

One of the notions that some Unix and Linux dæmon management subsystems incorporate is that of service readiness. Such service management subsystems incorporate a notion of distinct "started"/"spawned" and "ready"/"running" service states, and a notion of one service depending from other services being both active and ready. A dæmon may be active simply by dint of there being a process running, but it is not ready until it has completed its initialization, opened the server ends of whatever client-server mechanisms it uses, and is about to enter its main request processing loop. (For some servers — such as, for example, RabbitMQ when it has a lot of persistent messages saved on disc — this difference is a period of time that can be measured in minutes rather than milliseconds.) Of course, only the service program itself can determine exactly when this point is.

Not starting dependent services until a depended-from service is ready, along with early socket opening, address the thundering herd problem of parallelized client-server startup. In the thundering herd model, clients are simply started and restarted blindly by the service manager until they "stick"; the clients abending repeatedly with errors caused by an unreachable server until the server is both up and ready. Early socket opening changes the abend into the client blocking, attempting to send a request to or attempting to read the response from the server over the socket, until the server processes socket messages. Waiting for services to notify that they are ready assists with the client-server protocols that are not socket based, or with servers that cannot be persuaded to do early socket opening.
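The effect of early socket opening can be sketched in a few lines. This is a minimal illustration and not any particular service manager's implementation: the function name, the socket path, and the request and response strings are all invented for the example, and one process plays all three roles (manager, client, and server) in sequence. The point it shows is that once the listening socket exists, a client can connect and send before the server has accepted anything, with the kernel queueing the request.

```python
import os
import socket
import tempfile

def demo_early_socket_opening():
    # Hypothetical socket location, invented for this sketch.
    path = os.path.join(tempfile.mkdtemp(), "service.sock")

    # Service manager role: open the listening socket before the server is ready.
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(8)

    # Client role: the server has not called accept() yet, but connecting and
    # sending succeed anyway; the kernel queues the connection and the data.
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)
    client.sendall(b"request\n")

    # Server role: only now "ready", it picks up the queued request.
    conn, _ = listener.accept()
    request = conn.recv(64)
    conn.sendall(b"response\n")
    reply = client.recv(64)

    for s in (conn, client, listener):
        s.close()
    os.unlink(path)
    return request, reply
```

A client that tried to read the response before the server reached accept() would simply block, which is the behaviour described above, rather than abending.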

The notion of service readiness is expressed in upstart via an expect stanza in the service's job file, and in systemd via the Type parameter in the service's service unit. Each of these selects from a number of readiness protocols.

Implemented readiness protocols
service management subsystem                  The service is considered ready …
upstart        systemd       s6
-------------  ------------  ---------------  ---------------------------------
no expect      Type=simple   default          … as soon as it starts.
expect fork    Type=forking  N/A              … after the main process fork()s a child and then exit()s.
expect stop    N/A           N/A              … after the main process raise()s a SIGSTOP signal.
expect daemon  N/A           N/A              … after the main process fork()s a child, that child fork()s a child, and then both parent and first child exit().
N/A            Type=oneshot  N/A              … after the main process exit()s.
N/A            Type=dbus     N/A              … after a named server, given in the service unit, appears on the Desktop Bus.
N/A            Type=notify   N/A              … after a READY=1 text message is sent over the notification message socket.
N/A            N/A           notification-fd  … after a newline is sent over the notification file descriptor.
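
As an illustration of one of these protocols, here is a minimal sketch of the expect stop handshake. It is not upstart's code: the function names are invented for the example, the "service" is just a forked child of the same program, and the step where the manager sends SIGCONT to resume the service comes from upstart's own documentation rather than from the table above.

```python
import os
import signal

def service_main():
    # ... real initialization would happen here ...
    os.kill(os.getpid(), signal.SIGSTOP)  # raise SIGSTOP: "I am ready"
    # Execution resumes here once the manager sends SIGCONT.
    os._exit(0)

def manager_start_and_wait():
    """Return (became_ready, exited_cleanly) for the spawned service."""
    pid = os.fork()
    if pid == 0:
        service_main()
    # WUNTRACED makes waitpid() report a stopped child, not only a dead one.
    _, status = os.waitpid(pid, os.WUNTRACED)
    if not os.WIFSTOPPED(status):
        return False, False  # the service died before signalling readiness
    os.kill(pid, signal.SIGCONT)  # let the now-ready service carry on
    _, status = os.waitpid(pid, 0)
    return True, os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

A service that never raises SIGSTOP leaves a manager of this kind waiting indefinitely, which is one shape that a protocol mismatch can take.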

Protocol mismatch

The upstart Cookbook warns about a few of the problems of readiness protocol mismatch, where one readiness protocol is specified for the service in its service management configuration, but the service program itself actually employs another. There are a number of them, applying not only to upstart but the other service managers as well, and they fall into four broad categories:

upstart's failure modes generally involve losing track of the actual state, the example in the Cookbook showing the case where it has marked a service as "stopped" when the forked children are actually running and the service is active. systemd's failure modes generally involve terminating the forked children and rendering a successfully activated service inactive. Asking why systemd is terminating a successfully started service, with nothing apparently wrong in the service's own logs, after the default wait-for-readiness timeout of 90 seconds is a common support question that can be found in many discussion and Q&A fora.

No-one speaks the forking protocol.

Practically no-one speaks the forking (single or multiple) readiness protocol in the wild. This is for two reasons.

The first reason is simple: the protocol is highly specific. It doesn't cover just any old pattern of fork()ing child processes. Witness:

The second reason is more subtle: dæmon programs are not signifying readiness when they fork() child and exit() parent. They are attempting to do something else, which is nothing to do with readiness. The forking readiness protocol is an opportunistic re-use of widespread existing behaviour, but that behaviour isn't actually right for such re-use.

Indeed, it isn't actually right per se. What they are doing is trying to let system administrators start dæmons the 1980s way, where a system administrator could log in to an interactive session and simply start a dæmon by running its program from the interactive shell. The fork()ing is part of a notion known as "dæmonization". It's widely believed, and commonly implemented, which is why it was opportunistically re-used. However, it is a widely held fallacy, common because a lot of books and other documents that just repeated the received wisdom of the 1980s perpetuated the fallacy, and because AT&T Unix System 5 rc was based upon this fallacy. It does not and cannot work safely, cleanly, and securely in systems since the 1980s, and system administrators have three decades' worth of war stories that they tell about its failing. Dæmons simply should not vainly try to "dæmonize": something that the upstart Cookbook has been recommending since 2011, something that people using daemontools family toolsets have been recommending since the late 1990s, and something that IBM has been recommending since the early 1990s (when AIX's System Resource Controller came along).

Opportunistically re-using this ill-founded behaviour as a readiness protocol conflicts with its actual intent and implementation. If one wants to "dæmonize", so that a system administrator gets the shell prompt back, one does it early so that the system administrator doesn't wait around whilst the dæmon gets on with things asynchronously. And one finds that most programs that (singly or multiply) fork() child and exit() parent do so long before they have finished initialization and the service is actually ready to serve clients. Indeed, in many programs this is actually done first, before any initialization. This is because doing otherwise would lead to partially initialized resources that would then need to be cleaned up in the parent process; and to problems where linked-in software libraries might have done things as part of their initialization like spawning internal threads that the main program isn't even aware of, and that won't be carried into the forked child process, thus confusing the software library whose thread it is and leading to deadlocks, faults, and failures. It's actually problematic program design to fork() after all initialization when the program is finally ready.
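
The gap between "parent has exited" and "service is ready" can be made visible with a small sketch. This is an illustration under stated assumptions, not a model of any real dæmon: the function name is invented, a sleep stands in for initialization, and the pipe exists purely as instrumentation so that the example can observe when the child really becomes ready. The original process also does not actually exit; it merely records the moment at which it would have done so.

```python
import os
import time

def fork_then_initialize(init_seconds):
    """Return how long real readiness lags behind the parent's exit point."""
    r, w = os.pipe()  # instrumentation only, not part of any protocol
    pid = os.fork()
    if pid == 0:
        os.close(r)
        time.sleep(init_seconds)  # stand-in for real initialization
        os.write(w, b"\n")        # the service is only ready now
        os._exit(0)
    os.close(w)
    # This is where a "dæmonizing" parent would exit(), and where a service
    # manager using the forking protocol would declare the service ready.
    parent_exit_time = time.monotonic()
    os.read(r, 1)  # blocks until the child has finished initializing
    ready_time = time.monotonic()
    os.close(r)
    os.waitpid(pid, 0)
    return ready_time - parent_exit_time
```

With a dæmon whose initialization takes minutes, the returned gap is the window during which dependent services would be started against a server that cannot yet serve them.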

At the time of the Debian systemd packaging hoo-hah several people opined that for best results dæmon programs should be altered to employ one of the protocols that actually is a readiness protocol at base, rather than relying upon this faulty reinterpretation of the "dæmonization" mechanism. This led to an analysis of the various other mechanisms.

Several incompatible protocols with low adoption

There is a wide choice of non-forking readiness protocols, some proposed, some implemented in service managers.

None of these are compatible with one another, the two closest being Ian Jackson's (unimplemented) proposal and Laurent Bercot's proposal that is being implemented in s6. The only translation layer currently in existence is Laurent Bercot's sdnotify-wrapper which translates from the s6 protocol to the systemd text message (Type=notify) protocol.

There is also a fairly low adoption rate in the wild, in actual services that are supposed to be speaking these protocols, for even the implemented ones.

The most widely adopted is the Desktop Bus service readiness protocol, implemented in systemd as Type=dbus and proposed by James Hunt (alongside several other protocols, notice) for upstart. The upstart people rejected the idea on the grounds that "there are none that implement this correctly" and "because services don't actually do this in a non-racy manner". Sadly, just like in the case of the upstart people's critique of System 5 rc the specifics of these vague generalizations, explaining where the claimed races are, are once again left entirely to the reader. Unlike in that case, here there are not numerous better critiques from other people to refer to instead, with no-one else making the same claim about Desktop Bus readiness notification.

Adoption limited by deliberate crippling of servers that nominally have adopted the protocols

In some cases, deliberate crippling has had the result of limiting adoption.

The system notification protocol is provided by the systemd authors in a library that programs can link to. However, on non-Linux platforms the library compiles to empty functions. The motivation for this on the parts of the systemd authors is clear: to avoid the charges levelled by detractors that "extra systemd code is now in my programs when this isn't even a systemd operating system", enabling them to point out that the "extra systemd code" is a function that returns zero and does nothing else.

Choosing to design based upon such charges has, however, led to the situation where a systemd-compatible system, that speaks the systemd notification protocol on non-systemd Linux operating systems, is a non-starter as an idea. (There's nothing inherent in a client sending a text message down a socket that limits it to systemd Linux operating systems.) The server programs that supposedly can speak the protocol, because their developers have done the recommended thing and used the systemd-author-supplied library, actually do not; and readiness notification the systemd way fails to work.

Ironically, servers that have rolled their own client code for the systemd protocol, such as collectd's notify_systemd() and Cameron T Norman's notify_socket(), enable adoption of the protocol where the systemd-author-supplied libraries do not.
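
To show how little such client code needs to do, here is a minimal sketch of a hand-rolled notifier. It is not the code of collectd or of any other project named above; the function name is invented for the example. It assumes the conventions given in systemd's own documentation, which the text above does not spell out: the socket address arrives in the NOTIFY_SOCKET environment variable, a leading "@" denotes a Linux abstract-namespace socket, and the message is sent as a single datagram.

```python
import os
import socket

def notify_ready(env=os.environ):
    """Send READY=1 to the notification socket; return False if none is set."""
    path = env.get("NOTIFY_SOCKET")
    if not path or path[0] not in ("/", "@"):
        return False  # not running under a manager that speaks this protocol
    if path[0] == "@":
        # Abstract-namespace address: the "@" stands for a leading NUL byte.
        address = b"\0" + path[1:].encode()
    else:
        address = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", address)
    return True
```

Nothing in this depends upon the service manager at the other end of the socket being systemd.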

Security of the service manager

A lot is made of the relative simplicity of implementing the various protocols in the programs for the managed services. Not much is thought about the problems of the manager-side implementation, or of general IPC security good practices.

The service manager is a trusted program that runs with superuser privileges and no security restrictions. It does so because the task of spawning a service involves applying security restrictions and switching to unprivileged accounts, passing through various kinds of one way doors, in a multiplicity of combinations peculiar to individual services (or groups of services). A readiness notification protocol is a client-server mechanism where the service programs are the clients, and the service manager is the server. Moreover, if the clients were trusted, the security restrictions under which they run wouldn't exist in the first place. All of the generally accepted wisdom about client-server interactions between servers that are not security restricted and clients that are untrusted thus applies.

One such piece of wisdom is not trusting client-supplied data. Clients are potentially compromised or malicious, and can supply erroneous or outright incorrectly structured data. They can send requests to servers that aren't expecting them; they can send large amounts of data for overflowing buffers; they can attempt to hijack existing client sessions; they can do all sorts of things. Service managers have to take this into account, and it affects protocol design.

Avoiding requests from clients other than the intended ones influences the design of the systemd, s6, and Ian Jackson readiness protocols.

Restricting the clients doesn't address the whole concern, however. A second piece of wisdom is that servers must guard against the vagaries of the inter-process communication mechanism and the simple acts of client-server I/O. Servers must be careful that clients cannot trigger things like SIGPIPE in the server, at the read end of any pipes that it uses. They must be careful to ensure that clients cannot employ tarpit attacks, where they provide requests very slowly or pause mid-request, to block the service manager or to starve other clients of I/O channels and resources. Even just reading client messages simply in order to discard them requires that a server be written carefully to avoid buffer overflow attacks against the read buffer.
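
A minimal sketch of the kind of care involved on the receiving side of a datagram-based protocol follows. It is not taken from any service manager: the function name and the 256-byte cap are invented for the example, and the MSG_DONTWAIT and MSG_TRUNC flags are assumed to be available, as they are on Linux. The receive never blocks, reads into a fixed bound, and discards anything that did not fit rather than trying to make sense of a truncated message.

```python
import socket

MAX_NOTIFICATION = 256  # hypothetical upper bound on an acceptable message

def receive_notification(sock, limit=MAX_NOTIFICATION):
    """Return one pending datagram, or None if none is pending or it is oversized."""
    try:
        data, _ancillary, flags, _sender = sock.recvmsg(
            limit, 0, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None  # nothing pending; a silent client cannot stall the manager
    if flags & socket.MSG_TRUNC:
        return None  # larger than the bound: consumed from the queue and dropped
    return data
```

In Python the fixed bound is a resource limit rather than a memory safety measure; in a manager written in C the same bound is what stands between the client and the read buffer.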

This is part of the motivation for the s6 and Jackson protocols not letting clients that are "strangers" open the client end of the socket at all. It narrows the field of unrecognized clients, whose requests have to be read, from the entire system to just those processes that inherited the open file descriptor. In addition, the s6 protocol is a stream protocol, not a packetized one, and a service manager speaking it can read client requests byte by byte, with no potential for buffer overflows from doing so.
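
Both sides of a newline-on-a-file-descriptor protocol of this kind can be sketched briefly. This is an illustration and not s6's code: the function names are invented, the timeout is an assumption added for the example, and a real service manager would fold the read into its event loop rather than loop on one descriptor. The manager side reads a single byte at a time, ignores everything that is not a newline, and gives up at a deadline however slowly the client dribbles data.

```python
import os
import select
import time

def service_notify_ready(fd):
    # Service side: one newline on the inherited descriptor, then close it.
    os.write(fd, b"\n")
    os.close(fd)

def manager_poll_ready(fd, timeout):
    """True if a newline arrived, False on EOF without one, None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        readable, _, _ = select.select([fd], [], [], remaining)
        if not readable:
            return None
        byte = os.read(fd, 1)  # one byte at a time: no buffer to overflow
        if byte == b"":
            return False  # the client closed without ever becoming ready
        if byte == b"\n":
            return True
        # Any other byte is ignored and never interpreted.
```

Because only a process holding the write end of the pipe can reach this reader at all, the set of clients is limited to whatever inherited that descriptor.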

There's yet more for the service manager to guard against: Even the recognized clients could be compromised. After all, they are themselves servers (the whole point of readiness protocols being to notify the system when a server is fully initialized and ready to serve) who themselves have to not trust client-supplied data.

Which brings us to a third piece of wisdom: Avoid parsing, famously one of the qmail security principles. A readiness notification protocol is, by its very nature, a program-to-program protocol. Marshalling it into and out of a human-readable format, particularly an only loosely structured one, is pointless, inefficient, and the introduction of extra risk. The risk is that a compromised or malicious client can take advantage of quirks or outright bugs in that marshalling system, with improper quoting, malformed numbers and strings, unexpected zeroes, negative numbers, large numbers, long strings, incorrect data types, injected commands in strings, unexpected character encodings, NULs, and other such things. That risk, moreover, centres inside the code of a privileged service manager process, which is at the very least running with unrestricted security access, and sometimes running as the distinguished process #1 whose abend can have dire consequences (because the kernel abends the whole system in sympathy).


© Copyright 2015 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.