Originally Posted by TobiSGD
Yeah, looked at that article and stopped reading at Apparently, this person has never heard of testing changes on developer systems before rolling them into production, which lets me question their credibility.
It doesn't just happen in testing.
Due to the random scheduling, it can happen at any time - and not happen during testing.
ANYTHING can be a change - extra interrupts can slow a process down ... and expose another dependency failure.
Sometimes it works... Sometimes it doesn't.
That is why people keep sticking restarts of some services into rc.local - it tends to make boot more reliable. If a service did start, the second attempt to start it gets canceled; if it didn't start, then it is more likely to get started this time.
Of course, when that fails too, people start sticking sleeps into rc.local to try to make it work.
The very BASIC problem is due to the nature of network analysis.
The more complex the dependency network, the more likely adding a single node to it will cause the network to collapse.
This was learned back in the mid-1970s and early 1980s with PERT charting for project management - it doesn't scale well.
The next problem is that the network IS NOT a simple network (just "before" and "after" edges would be the base network). There are conditional sub-networks that make it more complex ("wants", "requires", ...), forming yet another network layer... And that generates the need for multiple "targets" that do nothing but create sub-nets (this reduces the size of the list of dependencies, but can also make the network more confusing).
So adding ONE new service could cause a number of previously untested services to also get started...
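A toy example of how fragile the ordering is. The "units" below are hypothetical, and the ordering is plain Kahn's topological sort rather than systemd's actual algorithm, but it shows the failure mode: one added unit, with one extra ordering edge, and suddenly NOTHING can be ordered at all:

```c
#define N 4  /* units 0..N-1, e.g. network, syslog, database, new-unit */

/* adj[i][j] = 1 means unit i must start before unit j (made-up edges) */
static int adj[N][N];

/* Kahn's topological sort: returns how many units could be ordered.
   Anything less than N means the leftover units sit in a dependency
   cycle - and everything downstream of the cycle is stuck too. */
static int toposort(int order[N]) {
    int indeg[N] = {0}, count = 0;
    int queue[N], head = 0, tail = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (adj[i][j]) indeg[j]++;
    for (int i = 0; i < N; i++)
        if (indeg[i] == 0) queue[tail++] = i;   /* nothing blocks these */
    while (head < tail) {
        int u = queue[head++];
        order[count++] = u;                     /* u can start now */
        for (int j = 0; j < N; j++)
            if (adj[u][j] && --indeg[j] == 0)   /* last blocker gone */
                queue[tail++] = j;
    }
    return count;
}

/* Usage:
 *   adj[0][1] = adj[1][2] = adj[0][2] = 1;   -- orders all 4 units
 *   adj[2][3] = adj[3][0] = 1;               -- one new unit closes a
 *      cycle 0 -> 2 -> 3 -> 0, and toposort() now orders 0 units:
 *      even unit 1, not part of the cycle, waits forever on unit 0. */
```

Note that the collateral damage is the point: the unit that was never touched (unit 1 above) is the one the admin gets paged about.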
The next problem is that for "reliability" all services need to be modified to tell systemd when they are "ready". The problem here is that services started by other services introduce additional problems: NetworkManager is my favorite bad example. NetworkManager has to tell systemd when the network is ready... but NetworkManager isn't always in control - that is up to the DHCP client. So now the DHCP client has to tell NetworkManager when it is done... so that NetworkManager can tell systemd that the network is ready...
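For the curious, that "tell systemd" step is the sd_notify(3) protocol, and it is simple enough to sketch without libsystemd: systemd puts the path of its notification socket in $NOTIFY_SOCKET, and the service sends it a datagram of "key=value" lines. A minimal hand-rolled version (error handling trimmed - a real service would just link libsystemd and call sd_notify()):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hand-rolled equivalent of sd_notify(0, "READY=1").
   Returns 0 on success, -1 if not running under systemd or on error. */
int notify_ready(void) {
    const char *path = getenv("NOTIFY_SOCKET");
    if (!path || !*path)
        return -1;                     /* not supervised by systemd */
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_un sa;
    memset(&sa, 0, sizeof sa);
    sa.sun_family = AF_UNIX;
    strncpy(sa.sun_path, path, sizeof sa.sun_path - 1);
    const char msg[] = "READY=1";      /* the sd_notify(3) wire format */
    ssize_t n = sendto(fd, msg, sizeof msg - 1, 0,
                       (struct sockaddr *)&sa, sizeof sa);
    close(fd);
    return n == (ssize_t)(sizeof msg - 1) ? 0 : -1;
}
```

And note what this sketch can't fix: notify_ready() only helps if the service actually KNOWS it is ready - which is exactly what NetworkManager doesn't know until the DHCP client tells it.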
Which works - sort of (it is why "NetworkManager-wait-online" exists). Most places have two or three levels of network "ready" state:
1) administrative access (needed by admins to fix things) - this one MUST be up,
2) service networks (must be ready for service access such as remote databases) - this one CAN be up; if it isn't, admins can still connect to find/fix things. If both are up, then all services SHOULD be up and admins can verify proper operation... or find out why the public access network isn't working,
3) public access - must be up for general use.
Can NetworkManager handle this? Nope. All networks are either up or down. The only workaround is to take some of the networks OUT of NetworkManager's control... or dump NetworkManager entirely.
And this doesn't address the problems of a cluster when dependencies are external to the system... (though DHCP is a small example of this, but what about remote database access?).
BTW, there is no race condition for socket connections - if one exists, it is a bug in the service in the first place. Service startup is supposed to:
1) process configuration files, report any errors,
2) initialize the network (up through the listen system call) and report any errors, THEN
3) become a daemon.
At the point the "listen" system call completes, the service is ready to accept connections. After the fork, the child starts accepting incoming connections (which the kernel has been queuing), and the parent can close its copy of the socket (the child still has it) and exit normally - that exit is the event that signals "ready". Systemd breaks this, as there is no point where the service can be inherently identified as "ready"... unless it is modified to TELL systemd it is ready... Thus the need to tell systemd about "forking" services... which again defeats the purpose of systemd, as these services can't be monitored by systemd.
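That queuing behavior is easy to verify: once listen() returns, the kernel completes and queues incoming connections even though nobody has called accept() yet. A minimal loopback sketch (error handling mostly omitted):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if a client could connect BEFORE the server ever called
   accept() - i.e. the kernel queued the connection behind listen(). */
int connect_before_accept(void) {
    /* step 2 of the startup sequence: socket / bind / listen */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                    /* let the kernel pick a port */
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0) return 0;
    if (listen(srv, 16) < 0) return 0;    /* <-- the readiness point */
    socklen_t len = sizeof addr;
    getsockname(srv, (struct sockaddr *)&addr, &len);

    /* connect from a "client" while nothing is accepting yet */
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    int ok = connect(cli, (struct sockaddr *)&addr, sizeof addr) == 0;

    /* the queued connection can be accepted any time later */
    int conn = ok ? accept(srv, NULL, NULL) : -1;
    ok = ok && conn >= 0;
    if (conn >= 0) close(conn);
    close(cli);
    close(srv);
    return ok;
}
```

The connect() succeeds because the kernel finishes the TCP handshake itself and parks the connection in the listen backlog - which is exactly why "parent exits after listen()" is a race-free readiness signal.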
In this way, systemd takes over more and more of the formerly independent projects.
As you can see, systemd is not my favorite init.