[SOLVED] sync() hangs in kernel but I/O is working

Skaperen · 07-05-2013, 10:50 AM

I am trying out -current, but have not done so since it was released with the 3.8.13 kernel (been overly busy for a month). Now I just tried out the latest with the 3.9.7 kernel and hit the following bug ... the sync() syscall hangs. This is first seen when I run a script that updates the fresh install with lots of my software, and tweaks a few /etc files. There are no disk errors.

Here's the strange part. Even while the sync command is hung in a sync() syscall, I can do I/O, even in large amounts. And the writes are successful (after reboot I see the data that was written), as well as read. So it's not just writing into RAM buffer space, and the underlying driver and hardware are not hung, and are fully functional. This would seem to be a kernel buffer handling bug.

I re-installed just to be at a fresh state, so my updating would not be an impact. So this re-install is freshly booted and and I try the sync command. No problem, it exits quickly. So I try writing some data to disk with dd so I can manage how much I write. With ever increasing amounts, I finally find that sync is slowing down around 128 MB. At 1 GB sync takes about 6 seconds. At 4 GB sync takes about 24 seconds. At 64 GB it took 396 seconds (close to the expected 384 seconds). This machine has 64 GB of RAM.

These are timings that seem about right for the time to perform the disk I/O. So it seems the the sync() syscall has changed so it actually waits for the I/O to complete. But this does not explain the long term hang after my script added about 400 MB of files. And it does not explain why I can write to the disk at full speed while sync() has the process blocked. Either something is NOT getting written from the buffer for some reason, OR after this kind of data update, the sync() call does not come back.

The command "echo s > /proc/sysrq-trigger" does not hang. But I cannot tell if it really is writing buffers to disk.

Anyway, 3.9.7 is an unusable kernel.

NeoMetal · 07-05-2013, 11:02 AM

I think sync() in linux does wait for all buffers to write before returning. After your reinstall and dd test, did your update script still hang in that call?

Skaperen · 07-05-2013, 05:36 PM

Yes, the update did hang again, at sync(), as did a sync executed manually. You are right that it does wait. But it never wakes up and finishes. I can still do more writes and they actually get written to the device. But any attempt to call sync() won't return.

Update: sometimes sync() hangs and sometimes not. When it does, then every attempt will hang. It's like maybe it is hanging on something. But it isn't the hard drives.

Celyr · 07-05-2013, 07:05 PM

I think that if you have a constant flow of data sync may appear to hang but it's not the case, when flow will be over and buffer will be cleared sync will exit.

Skaperen · 07-05-2013, 11:59 PM

Quote:

Originally Posted by Celyr

I think that if you have a constant flow of data sync may appear to hang but it's not the case, when flow will be over and buffer will be cleared sync will exit.

Not the case here. The sync() remained blocked well after the data flow was done. Only about 400MB of data was written by processes (mostly one rsync step). That much data can be written to the hard drive in about 3 to 4 seconds. What would sync be writing after that if it was doing any? After I noticed the update process had been hung for a few minutes, I logged in on that machine. I was able to do writing to disk that ran at full speed, so there was no competing writing going on. Yet the original sync command was stuck in 'D' state. And in that login session if I did another sync command, that, too, would hang on the sync() syscall (did it in strace and saw it happen).

FYI: that writing I did why the sync() was hung actually did get written to disk. So sync() being hung did not block writes going to disk (and it was not hung hardware).

One thing I've noticed in more tests is that it does not always do this. I repeated the process several times and it turns out, after a reboot, sync() does this hang about 30% of the time. But once sync() does hang, all sync() calls hang.

BTW: on normal systems, when there is a constant flow of data, calling sync() never hangs longer than it takes to write all of what is in the RAM buffer at the time of the call. That CAN be a while when there is a lot of RAM. But it eventually finishes. So apparently new writing after sync() is called does not block it from returning. That's how it should be.

Skaperen · 07-07-2013, 12:46 AM

After more googling, I see that this symptom does get mentioned for earlier kernels under a number of different bugs and descriptions that all seem to boil down to jdb2 constantly holding a drive to update the journal every 5 to 10 seconds. So I looked at my processes, and notice jdb2 is in fact hung in 'D' state constantly, though the drive is not running constantly. A couple more kernel threads are also running related to this. So it seems sync() can't get everything to finish very well possibly because of this journaling problem.

Skaperen · 07-07-2013, 10:18 PM

Workaround: mount all filesystems with -o sync.

Skaperen · 07-21-2013, 06:39 PM

The problem still exists in kernel 3.9.10.

guanx · 07-21-2013, 09:36 PM

Got similar (not identical) problems. Could you try if disabling pre-emption and/or dynamic tick helps?

I turned off pre-emption in the kernel config and this occurrence of problem changed from often to rare. When I further dosabled dynamic ticks and used the classic HZ=100 I never saw this problem again. This problem is however hard to reproduce. So I cannot be sure if my suggestion works.

Skaperen · 07-22-2013, 09:03 AM

Can you tell me where these suggestions are configure? I normally use the "make oldconfig" or "make config" ways to configure the kernel. But I don't recall seeing these kinds of configs. Is there a specific symbol name to look for?

BTW, this happens about 1 out of 4 times I run my mods script against a fresh install. I have imaged a fresh install so I can go back to that point faster to re-try. But, after this is done, I have not seen this happen on the modified system. The kernel was not changed. But I suspect I'm just not hitting it as hard. Or maybe the problem happens when a few hundred MBs get transferred in real soon after the kernel comes up (e.g. a conflict with some internal thing going on right after boot/init). (Just thinking out loud)

I've also tried doing the mods to a fresh install system w/o the big payload of loading the /home files which takes the amount of files down to just 4.5 MB. The problem still happens at no less frequency. There is one "unusual" step done in the mods scripts and that is adding 30 new users. These are added by running the groupadd, useradd, chown, and chmod commands. This causes /etc/group, /etc/passwd, and /etc/shadow to be rewritten 30 times. In theory that should mean just a bunch of pages in the filesystem that got dirty and then deallocated, or some pages logically written twice. I don't know how ext4 handles that, but I would think it just rewrites the underlying absolute page again, and has extra dirty pages. But if it tries to remove pages from cache that get deallocated after writing, maybe that gets the paging code confused.

guanx · 07-22-2013, 09:16 AM

For your kernel (Linux 3.9)

PREEMPT_NONE = y: Processor type and features -> Preemption Model -> No Forced Preemption
NO_HZ = n: General setup -> Timers subsystem -> Tickless System
HZ_100 = y: Processor type and features -> Timer frequency -> 100 HZ

It's different but similar for Linux 3.10.

Skaperen · 07-23-2013, 10:59 PM

I have narrowed down the problem. Of the many things happening when I run my update, one of them is adding 30 new users. I ran tests with this script component disabled (I only need to remove the execute bit from the file to disable it). The sync problems no longer happen with it disabled. I then modified the script to do a sync just before it adds the 30 users, and just after it added them. The one just before completes instantly. The one just after hangs. The only things that happened are the updates of /etc/group, /etc/passwd, and /etc/shadow done by the groupadd and useradd commands. I believe those commands create a modified copy of the files with temporary names in the /etc directory, link the old files to a backup name, then rename the new files over the old files, leaving the old files under the backup names. The 2nd time this is done for user 2, that linking of the backup name deletes the old file. The 3rd time, it is newly created files (for user 1) getting deleted. So for 30 new users, 3 files are created and soon deleted, 28 times.

This should work fine. But I have read that ext4 has issues with things like this where changes like that are not handled in order (e.g. if I power off and start back up, the changes now visible on disk are not necessarily an exact point in time in the past). I'm wondering if that nature of ext4 could impact how the sync() actions are handled.

Skaperen · 07-24-2013, 03:15 PM

It is definitely an ext4 issue. Substituting with ext2 does not encounter the problem.

One way to create the issue is to have a lot of files somewhere on an ext4 filesystem. Mount that filesystem just now. Do a recursive chown command to change the owner of a bunch of files (I did it for 88000 files). Do the "time sync" command. Wait for it to get done (could be a very long time). If you do it again under the same mounting of the filesystem, it won't happen. I'm not sure why this aspect of the problem is so. It also is not happening on newly created files. It only happens when it is the first chown after a mount, at least for those particular files.