
jr_bob_dobbs 11-04-2021 09:16 PM

Bedrock on a Void system: Worker freezes for a while
 
Several weeks ago I set up a Void Linux install. Some programs, such as the Worker file manager, I had to compile because they are not in the Void repos. This is routine: no problems.

Today I installed Bedrock on Void. Rebooted. No obvious problems. Being cautious, I decided to not add any strata right away, just take a few days, on and off, to run things and see how well it all works. Sound, network, xorg, all seem fine.

I also have a partition with Devuan on it and a partition which is currently used as /home on the Devuan OS. On the Bedrock OS, that same partition is mounted as /other_home. Those partitions are ext4.

From within Worker on Bedrock, I navigated to /other_home to pick a video to play and ... Worker froze for a good half-minute. It hasn't done this before, ever.

In a terminal, I ran top and, as expected, saw that Worker's CPU usage was uncharacteristically high: it was the topmost process, using the most CPU. Unexpectedly, the second-highest CPU-using process was ... etcfs.

My understanding is that etcfs is part of Bedrock, and is unique to Bedrock?

Worker had two directories open: one was the small home directory of my Void non-root user, and the other was the directory on /other_home. Worker was not touching /etc. The only unusual thing is that my user's home directory on /other_home contains roughly twelve thousand files. Normally, Worker loading that directory gives one to two seconds of pause.

I'm wondering what is going on. Can anyone clue me in, please? Thank you.

p.s. Intel 64-bit

p.p.s. After the long freeze, both Worker and etcfs returned to their usual near-zero CPU usage.

ParadigmComplex 11-05-2021 06:14 AM

Quote:

Originally Posted by jr_bob_dobbs (Post 6298521)
From within Worker on Bedrock, I navigated to /other_home to pick a video to play and ... Worker froze for a good half-minute. It hasn't done this before, ever.

Was this a one-off or is it consistently reproducible? If the former, no idea. If the latter, run it through strace to see what it was doing.

Quote:

Originally Posted by jr_bob_dobbs (Post 6298521)
In a terminal, I ran top and, as expected, saw that Worker's CPU usage was uncharacteristically high: it was the topmost process, using the most CPU. Unexpectedly, the second-highest CPU-using process was ... etcfs.

My understanding is that etcfs is part of Bedrock, and is unique to Bedrock?

Yes.

Userland programs can't operate on files directly; instead, they ask the kernel to do it on their behalf through the virtual filesystem layer. This is how permissions work: the kernel can check whether the program has sufficient permissions and, if not, refuse the requested operation. On Bedrock, requests to /etc get redirected through a Bedrock program called etcfs, which does the work on the calling program's behalf.
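
In case it helps to see the mechanism concretely: etcfs is a FUSE filesystem, and below is a minimal sketch of the same pattern. This is not etcfs's actual code, just a toy filesystem assuming libfuse 3. Every stat() a program issues under the mount point is bounced from the kernel into this process's getattr handler and back, which is where the overhead comes from and why the work shows up as this process's CPU time:

Code:

/* toy_fs.c - toy FUSE filesystem; NOT etcfs, just the general pattern.
   Build (assuming libfuse 3 is installed):
   gcc toy_fs.c -o toy_fs $(pkg-config --cflags --libs fuse3) */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* The kernel forwards every stat()/lstat() on the mount point here. */
static int toy_getattr(const char *path, struct stat *st,
                       struct fuse_file_info *fi)
{
        (void) fi;
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
                st->st_mode = S_IFDIR | 0755;
                st->st_nlink = 2;
                return 0;
        }
        return -ENOENT;
}

static const struct fuse_operations toy_ops = {
        .getattr = toy_getattr,
};

int main(int argc, char *argv[])
{
        /* e.g. ./toy_fs /tmp/mnt; then `stat /tmp/mnt` hits toy_getattr */
        return fuse_main(argc, argv, &toy_ops, NULL);
}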

This adds a tiny bit of overhead to each request into /etc. In practice this is negligible for a reasonable number of requests: on the order of a hundredth to a thousandth of a second per request. However, if something makes hundreds or thousands of requests to /etc, the overhead adds up to something noticeable.

On a typical system, CPU usage tracking software such as top understands this and attributes both a userland program's own CPU usage and the kernel CPU time spent on its behalf to that program. However, such software isn't aware of the redirection and tends to attribute work etcfs does on behalf of another program to etcfs itself.

Reproduce the issue, watching top before, during, and after the event. If etcfs CPU usage is consistently higher than expected, something else is probably spamming /etc and this is unrelated to Worker. If both Worker and etcfs spike at the same time and quiet down at the same time, then it's likely Worker that's spamming /etc.

Quote:

Originally Posted by jr_bob_dobbs (Post 6298521)
Worker was not touching /etc.

Did you actually check for this - strace logs, working through the source code - or are you just guessing here?

If you're just guessing, try actually checking. Again, I recommend strace for this. If you're unfamiliar with strace, try running Worker with:

Code:

strace -tvf -s999 -o /tmp/log <command to start Worker>
Do whatever you need to do to make Worker reproduce the issue, then stop Worker immediately after the freeze ends. If you send me the resulting log at /tmp/log, I can help you work through it. Note that strace logging slows programs down; the freeze will take even longer to complete than it did before.

It's possible to get logging from etcfs' side, and I can walk you through that if you insist, but it's kind of a pain. I'm completely rewriting etcfs for 0.8.0 Naga, which should make getting quick statistics about processes that spam etcfs much easier, but that won't be ready for a while.

jr_bob_dobbs 11-05-2021 09:25 PM

It's reproducible. Navigating away from that directory and back, or even just clicking "reload", will cause it to happen. Each time, both Worker and etcfs show high CPU usage. Both settle down after that directory finishes loading.

I ran strace on Worker. That was a very good idea. Thank you.

Oh my. Using strace was most revealing. I am very surprised at the results. The log is 39 MB, so I won't be pasting it here in its entirety. However, this excerpt should clarify matters considerably:
Code:

6731  12:51:37 geteuid()                = 1000
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 geteuid()                = 1000
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 geteuid()                = 1000
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0
6731  12:51:37 stat("/etc/localtime", {st_dev=makedev(0, 0x1b), st_ino=4465, st_mode=S_IFREG|0444, st_nlink=3, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=3536, st_atime=1636048524 /* 2021-11-04T13:55:24.366086749-0400 */, st_atime_nsec=366086749, st_mtime=1633310589 /* 2021-10-03T21:23:09-0400 */, st_mtime_nsec=0, st_ctime=1633717395 /* 2021-10-08T14:23:15.032507724-0400 */, st_ctime_nsec=32507724}) = 0

It's that, repeated over and over, for almost all of the 39 MB. Worker is accessing /etc/localtime again and again, presumably once for every file it finds in the current directory.

Code:

$ cat /tmp/log | grep 'stat("/etc/localtime' | wc
  53985 1544450 21682206
$ ls /other_home/bob | wc
  12548  12611  260724
$

Nearly 54,000 accesses of /etc/localtime by Worker in 30-ish seconds, which averages out to about half a millisecond per call. Yeesh!

ParadigmComplex 11-06-2021 06:59 AM

While I have some ideas to improve the performance of etcfs (including lessening the overhead of accessing /etc/localtime) in 0.8.x to a degree, I don't foresee a road where I can get it to the point where it can sustain that kind of workload without noticeable slowdown.

However, I'm doubtful Worker is similarly constrained. I suspect its developers could refactor things to avoid this, such as fetching the time once and caching it for the duration of whatever loop is occurring here; see the sketch below. Moreover, I wouldn't be surprised if doing so improved performance even on non-Bedrock systems with normal /etc/localtime access overhead, although less significantly. Consider reporting your finding of excessive /etc/localtime queries (presumably caused by some current-time query) to the Worker devs.
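
As a rough illustration of that caching idea - hypothetical code, not Worker's actual source - here is a per-file loop with the timezone setup hoisted out. This assumes glibc, where localtime() re-examines /etc/localtime on every call when TZ is unset, while localtime_r() does not:

Code:

/* list_mtimes.c - hypothetical directory scan, NOT Worker's code */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static void list_mtimes(const char *dir)
{
        tzset();  /* read /etc/localtime once for the whole scan */

        DIR *d = opendir(dir);
        if (d == NULL)
                return;

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
                char path[4096];
                struct stat st;
                snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
                if (stat(path, &st) != 0)
                        continue;

                /* localtime_r() does not implicitly recheck /etc/localtime
                   (on glibc), so the loop makes no further /etc requests */
                struct tm tm;
                localtime_r(&st.st_mtime, &tm);

                char when[32];
                strftime(when, sizeof(when), "%Y-%m-%d %H:%M", &tm);
                printf("%s  %s\n", when, e->d_name);
        }
        closedir(d);
}

int main(int argc, char *argv[])
{
        list_mtimes(argc > 1 ? argv[1] : ".");
        return 0;
}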

jr_bob_dobbs 11-15-2021 07:57 AM

That makes sense. Thank you.

jr_bob_dobbs 11-28-2021 06:16 PM

So I contacted the creator of Worker, and after some questions and testing he made a surprisingly small change to Worker that got it loading large directories on Bedrock without the freeze. The localtime() function re-reads /etc/localtime every time it is called, whereas localtime_r() does not. That's a new one on me. Yay!
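
For anyone who wants to see the difference for themselves, here's a tiny demo (my own hypothetical snippet, assuming glibc with TZ unset; other libcs may differ). Run it under strace and grep for localtime: the localtime() loop produces one stat("/etc/localtime") per iteration, while the localtime_r() loop produces none:

Code:

/* tzdemo.c - build: gcc tzdemo.c -o tzdemo
   watch syscalls: strace ./tzdemo 2>&1 | grep localtime */
#include <stdio.h>
#include <time.h>

int main(void)
{
        time_t now = time(NULL);
        struct tm tm;

        /* glibc: with TZ unset, each call stats /etc/localtime */
        for (int i = 0; i < 5; i++)
                localtime(&now);

        /* glibc: no /etc/localtime access here; the timezone data is
           already initialized and localtime_r() doesn't recheck it */
        for (int i = 0; i < 5; i++)
                localtime_r(&now, &tm);

        printf("done\n");
        return 0;
}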

ParadigmComplex 12-07-2021 05:56 PM

Happy to hear it :)

