How to diagnose systemd start job problems?

pepperslq · 03-23-2022, 08:41 AM

Every 10 boots or so, in the bootup messages, there is: "A startup job is running on /dev/sda4", followed by a timer. Then the timer runs out at 1m30s, and it dumps into emergency mode. The only fix I found is to reboot 3 or 4 times, then the message doesn't appear for a while.

/dev/sda4 is /home. I assume the start job is a fsck, but I can't figure out exactly what is happening, or which "services" are running at startup and why it's hanging. And why does it go away after a few reboots?

Anyone know how to diagnose this? I've been searching, but the systemd documentation and nomenclature are beyond confusing.

I tried to make the start job timeout longer, but editing the timeout in /etc/systemd/user.conf had no effect. Is there another place to set the timeout (if that's the problem)?

pan64 · 03-23-2022, 02:46 PM

Did you check the logs (in /var/log, and also dmesg/journalctl)? Did you try to run fsck manually? Did you check the services?
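A rough sketch of those checks, assuming /home is on /dev/sda4 as described in the first post. check_home_fs is just a made-up helper name, and fsck -n assumes a checker that understands -n (e2fsck does; the thread doesn't say which filesystem is in use):

```shell
# Hypothetical helper: basic health checks for the /home filesystem.
# $1: device node, defaults to /dev/sda4 (the device named in this thread).
check_home_fs() {
    dev="${1:-/dev/sda4}"

    # Units that failed during the current boot.
    systemctl --failed --no-pager

    # State of the mount unit systemd generates for /home.
    systemctl status --no-pager home.mount

    # Check without changing anything: -n is handed to the filesystem
    # checker, and for e2fsck it means "open read-only, answer no".
    # Run as root, preferably while /home is not mounted.
    fsck -n "$dev"
}
```

Run it as root with `check_home_fs /dev/sda4`, ideally from emergency mode or a live system so /home is not mounted during the check.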

tomwest · 04-22-2022, 03:54 AM

Running systemd-analyze blame might show what the "startup job" is doing during boot. I've had the same message, with the same one-and-a-half-minute delay, when powering off. I found that was usually because the network hadn't quite finished some action. Once I started closing the browser and making sure no network activity was in progress, the message and the wait stopped recurring. In my case, though, it only happened at poweroff, never at boot.
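Something along these lines could be used to look at boot timing. boot_timing is a made-up name, and home.mount is the unit name systemd derives from the /home mount point. Note that these commands describe the boot they are run in, so they are most useful right after a slow boot:

```shell
# Hypothetical helper: summarise where boot time went.
boot_timing() {
    # Total time spent in firmware, loader, kernel and userspace.
    systemd-analyze time

    # Units ordered by how long they took to initialise, slowest first.
    systemd-analyze blame | head -n 20

    # The chain of units that the /home mount had to wait for.
    systemd-analyze critical-chain home.mount
}
```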

jailbait · 04-22-2022, 12:19 PM

Quote:

Originally Posted by pepperslq

Every 10 boots or so, in the bootup messages, there is: "A startup job is running on /dev/sda4", followed by a timer. Then the timer runs out at 1m30s, and it dumps into emergency mode. The only fix I found is to reboot 3 or 4 times, then the message doesn't appear for a while.

/dev/sda4 is /home. I assume the start job is a fsck, but I can't figure out exactly what is happening, or which "services" are running at startup and why it's hanging. And why does it go away after a few reboots?

I had a similar problem once when I rearranged my partition tree making it more complex.

The problem might be in /etc/fstab. One of the fields in /etc/fstab is the pass number, which controls the order in which partitions are checked by fsck at boot, before they are mounted. The root device should be 1. Other partitions can be set to 0 to disable checking, 2 to check after the root partition, or 3 to check after the 2 partitions, etc. If the pass numbers are set incorrectly then it is often a matter of timing as to whether the partitions are checked by fsck in the correct order or not. If a partition is checked too early fsck may not be able to find that partition.
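To see which pass numbers are actually in use, a small sketch like this could help. list_passno is a made-up helper name; it only reads the file and prints the device, the mount point and the sixth field:

```shell
# Hypothetical helper: print device, mount point and fsck pass number
# (6th field) for every non-comment entry in an fstab file.
# $1: path to the file, defaults to /etc/fstab.
# A missing 6th field is printed as 0, which is how fstab(5) treats it.
list_passno() {
    awk '!/^[[:space:]]*#/ && NF >= 2 {
        pass = (NF >= 6) ? $6 : 0
        printf "%s %s %s\n", $1, $2, pass
    }' "${1:-/etc/fstab}"
}
```

Calling `list_passno` with no argument reads /etc/fstab. A /home entry showing 0 would mean no fsck is scheduled for it at all.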

tomwest · 04-22-2022, 09:07 PM

jailbait wrote:

Quote:

The problem might be in /etc/fstab. One of the fields in /etc/fstab is the pass number, which controls the order in which partitions are checked by fsck at boot, before they are mounted. The root device should be 1. Other partitions can be set to 0 to disable checking, 2 to check after the root partition, or 3 to check after the 2 partitions, etc. If the pass numbers are set incorrectly then it is often a matter of timing as to whether the partitions are checked by fsck in the correct order or not. If a partition is checked too early fsck may not be able to find that partition.

That's an interesting theory. A few thoughts come to mind. Firstly, we don't know if it's fsck that is holding up the boot. If it is, this "hold up" would likely not occur if the user has only one disk drive, unless there was some anomalous numbering in the sixth field of fstab. The info from the fstab man page is:

Quote:

Filesystems within a drive will be checked sequentially, but filesystems on different drives will be checked at the same time to utilize parallelism available in the hardware.

On one of my machines the fsck is very brief, well under a second, as shown in the output of systemd-analyze blame:

Code:

755ms systemd-fsck@dev-disk-by\x2duuid-6d2620b\x2d3a4\x2d4ec\x2dab4\x2d97e7300ef0.service
752ms colord.service

If /etc/fstab is stable, and the machine has a single disk, and the problem is intermittent, how likely is it that fsck is the issue? I think more information is needed.

jailbait · 04-23-2022, 10:01 AM

Quote:

Originally Posted by tomwest

I think more information is needed.

My theory could be checked by pepperslq posting /etc/fstab and explaining which filesystems are meant to be mounted on which directories.

ondoho · 04-24-2022, 02:29 AM

First of all, pertinent information will have been logged, and you can use journalctl and/or systemctl to retrieve it.

'journalctl -b' is everything for the current boot.
'journalctl -b -1' is everything for the previous boot.
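A sketch of how that output could be narrowed down. boot_log is a made-up name, and home.mount and dev-sda4.device are the unit names systemd derives from /home and /dev/sda4. Logs from earlier boots are only available if the journal is stored persistently (that is, /var/log/journal exists):

```shell
# Hypothetical helper: show the interesting parts of one boot's journal.
# $1: boot offset as understood by journalctl -b
#     (0 = current boot, -1 = previous boot), defaults to 0.
boot_log() {
    boot="${1:-0}"

    # Everything logged at priority "warning" or worse.
    journalctl -b "$boot" -p warning --no-pager

    # Messages from the fsck, mount and device units involved in /home.
    journalctl -b "$boot" --no-pager \
        -u 'systemd-fsck@*' -u home.mount -u dev-sda4.device
}
```

After a boot that dropped to emergency mode and a reboot, `boot_log -1` would show the failed boot. `journalctl --list-boots` lists the boots the journal knows about.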

You need to know what exactly has been happening there.

Quote:

Originally Posted by pepperslq

I tried to make the start job timeout longer, but editing the timeout in /etc/systemd/user.conf had no effect. Is there another place to set the timeout (if that's the problem)?

I don't think the timeout is the problem.
Increasing the timeout will do just that (if it doesn't you have a different problem to solve) - it won't solve the underlying problem.
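As for where the timeout lives: /etc/systemd/user.conf only configures the per-user systemd instances, which is why editing it changed nothing at boot. The system-wide default is DefaultTimeoutStartSec= in /etc/systemd/system.conf, and a single fstab entry can be given its own device wait with the x-systemd.device-timeout= mount option. A rough sketch, where show_default_timeout is a made-up name and the UUID and filesystem type in the comment are placeholders, since the thread doesn't give them:

```shell
# Hypothetical helper: print the system manager's default start timeout.
# This is the value set by DefaultTimeoutStartSec= in
# /etc/systemd/system.conf (not user.conf).
show_default_timeout() {
    systemctl show --property=DefaultTimeoutStartUSec
}

# Per-mount alternative: add x-systemd.device-timeout= to the options
# field of the /home line in /etc/fstab, for example (placeholder UUID
# and filesystem type, not taken from this thread):
#
#   UUID=<uuid-of-sda4>  /home  ext4  defaults,x-systemd.device-timeout=3min  0  2
```

Either change only lengthens the wait before emergency mode; the logs are still needed to find out why the device is late.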