kernel process hungs for more than 120 seconds...

barani_pdy · 04-06-2018, 07:25 AM

Hi,

In our of deployment(vm deployment on ESXi 5.5x hypervisor)the following are observed. Request you expertise views on this to understand it.

[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff88075fc79cc0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff88075fc79cc0

[2441049.121000] INFO: task sadc:11031 blocked for more than 120 seconds.
[2441049.121000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2441049.121000] sadc D ffff88073f028100 0 11031 1 0x00000004
[2441049.121000] ffff880351263de8 0000000000000086 ffff880351263db0 ffff880351263dac
[2441049.121000] 0000000000000000 00000000000007fe 000000000000d9b8 ffff88016961c360
[2441049.121000] 000000019178904d 0000000400000000 ffff8805e59d6080 ffff88016961c080
[2441049.121000] Call Trace:
[2441049.121000] [<ffffffff815b3b28>] __schedule+0x3c/0x55
[2441049.121000] [<ffffffff815b3c43>] schedule+0x10/0x22
[2441049.121000] [<ffffffff8119b8ff>] log_wait_commit+0xc5/0x11b
[2441049.121000] [<ffffffff810662ef>] ? autoremove_wake_function+0x0/0x34
[2441049.121000] [<ffffffff8119b258>] ? spin_unlock+0x9/0xb
[2441049.121000] [<ffffffff8114c57b>] ext3_sync_file+0x9f/0xd4
[2441049.121000] [<ffffffff81111c05>] vfs_fsync_range+0x7d/0xad
[2441049.121000] [<ffffffff81111c92>] vfs_fsync+0x18/0x1a
[2441049.121000] [<ffffffff81111cc3>] do_fsync+0x2f/0x44
[2441049.121000] [<ffffffff81111ce6>] sys_fdatasync+0xe/0x12
[2441049.121000] [<ffffffff8100bc42>] system_call_fastpath+0x16/0x1b

[2792409.478000] INFO: task lvm:2528 blocked for more than 120 seconds.
[2792409.478000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2792409.478000] lvm D 0000000000000008 0 2528 2525 0x00000000
[2792409.478000] ffff8805d422fc08 0000000000000082 ffff880500000000 ffff88075de4c0c0
[2792409.478000] 0000000000000000 0000000000000400 ffff880624761770 ffff88075de4c3a0
[2792409.478000] ffff8806247617b0 0000000000000001 ffff880c24d83830 ffff88075de4c0c0
[2792409.478000] Call Trace:
[2792409.478000] [<ffffffff81072a34>] ? spin_unlock+0x9/0xb
[2792409.478000] [<ffffffff815b44c7>] ? __mutex_unlock_slowpath+0x3a/0x3e
[2792409.478000] [<ffffffff815b3b28>] __schedule+0x3c/0x55
[2792409.478000] [<ffffffff815b3c43>] schedule+0x10/0x22
[2792409.478000] [<ffffffff815b3d76>] schedule_timeout+0x22/0xbb
[2792409.478000] [<ffffffff815b5393>] ? _raw_spin_unlock_irqrestore+0x32/0x3d
[2792409.478000] [<ffffffff815b512a>] ? _raw_spin_lock_irq+0x1d/0x36
[2792409.478000] [<ffffffff815b3124>] wait_for_common+0xb1/0x11f
[2792409.478000] [<ffffffff810406d6>] ? default_wake_function+0x0/0x11
[2792409.478000] [<ffffffff81051e59>] ? local_bh_enable_ip+0x55/0x60
[2792409.478000] [<ffffffff815b321c>] wait_for_completion+0x18/0x1a
[2792409.478000] [<ffffffff8110e323>] sync_inodes_sb+0x8a/0x162
[2792409.478000] [<ffffffff81111e45>] __sync_filesystem+0x4c/0x75
[2792409.478000] [<ffffffff81112007>] sync_filesystem+0x47/0x4c
[2792409.478000] [<ffffffff8111889a>] fsync_bdev+0x23/0x43
[2792409.478000] [<ffffffff81238ce8>] blkdev_ioctl+0x1ca/0x692
[2792409.478000] [<ffffffff81117ba3>] block_ioctl+0x32/0x36
[2792409.478000] [<ffffffff810ff6f6>] vfs_ioctl+0x2d/0xa1
[2792409.478000] [<ffffffff810ffc5e>] do_vfs_ioctl+0x47d/0x4c3
[2792409.478000] [<ffffffff810ffd51>] sys_ioctl+0xad/0xd2
[2792409.478000] [<ffffffff8100bc42>] system_call_fastpath+0x16/0x1b

smallpond · 04-06-2018, 09:30 AM

I/O to disk sdb blocked for over 120s. lvm and sadc are victim processes that were prevented from running as a result. You need to figure out which disk is sdb and why I/O isn't proceeding to fix the underlying problem.

sundialsvcs · 04-06-2018, 10:28 AM

Quote:

Originally Posted by smallpond

I/O to disk sdb blocked for over 120s. lvm and sadc are victim processes that were prevented from running as a result. You need to figure out which disk is sdb and why I/O isn't proceeding to fix the underlying problem.

You probably have a serious disk hardware or controller issue on your hands!

(But, and I am quite serious ... "check the obvious things first." Press on every cable to be sure it's firmly seated. Check the power cords, UPS boxes (are they on?), fuses. Press down on every controller card to see if it's firmly seated. And so on.)

You should be able to run s.m.a.r.t. internal-diagnostics on the drive – if it's still running – to see what it reports about its own internal status. (These facilities are built-in to the drive hardware and can also therefore be used with Windows, AIX, and so on.)

anotherlinuxuser · 04-07-2018, 11:47 PM

Is /dev/sdb an iSCSI, or other network based disk? It is not uncommon to use a network disk of some type with ESXi. If sdb is a network device, check the NAS/SANs device and make sure it not disconnecting for some reason, or being held up by another server that may be sharing the device, as that can cause the "blocked for more than 120 seconds" message at the ESXi server.

Other possibilities are:
A VM or other process is writing a very large file to sdb that takes more than 120 seconds, causing other processes to wait.

If a VM is using a physical disk partition for its disk, as opposed to a image file, and that VM is doing lots of disk writes.

The blocked for more 120 seconds message is not necessarily an error, as the blocked operation will continue once it is no longer blocked, athough it may appear forzen while it is blocked.

barani_pdy · 04-09-2018, 06:59 AM

Quote:

Originally Posted by anotherlinuxuser

Is /dev/sdb an iSCSI, or other network based disk? It is not uncommon to use a network disk of some type with ESXi. If sdb is a network device, check the NAS/SANs device and make sure it not disconnecting for some reason, or being held up by another server that may be sharing the device, as that can cause the "blocked for more than 120 seconds" message at the ESXi server.

Other possibilities are:
A VM or other process is writing a very large file to sdb that takes more than 120 seconds, causing other processes to wait.

If a VM is using a physical disk partition for its disk, as opposed to a image file, and that VM is doing lots of disk writes.

The blocked for more 120 seconds message is not necessarily an error, as the blocked operation will continue once it is no longer blocked, athough it may appear forzen while it is blocked.

Hi,

It is not a network based disk. It is a disk plugged into the server running ESXi hypervisor, where the disk are emulated to VM as a file. Also the deployment has only one node. And like understand what made the kernel process to block for more than 120 seconds.

also anyidea on the following, can it be the cause ?
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff88075fc79cc0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff88075fc79cc0

The impact we are seeing is the applications are not responding, believe it is because of process being blocked for more than 120 seconds.

Regards,

barani_pdy · 04-12-2018, 07:31 AM

Hi,

Request input for the above query.

Regards,

TB0ne · 04-12-2018, 07:43 AM

Quote:

Originally Posted by barani_pdy

Hi,
It is not a network based disk. It is a disk plugged into the server running ESXi hypervisor, where the disk are emulated to VM as a file. Also the deployment has only one node. And like understand what made the kernel process to block for more than 120 seconds. also anyidea on the following, can it be the cause ?

Code:

[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
 [2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
 [2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff88075fc79cc0
 [2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff88075fc79cc0

The impact we are seeing is the applications are not responding, believe it is because of process being blocked for more than 120 seconds.

...and, bumping your own thread and not providing ANY further info

Quote:

Originally Posted by barani_pdy

Request input for the above query.

We volunteer our time here; bumping your own thread and providing no additional information is fairly rude. You have been told clearly that you need to figure out where/what that sdb device is. Have you done ANY of the several troubleshooting steps given to you thus far? What were the results? What version/distro of Linux??

And since you appear to need a quick solution, you're using VMWare ESXi....which is a commercial, PAY-FOR product. So have you contacted VMWare support???

anotherlinuxuser · 04-12-2018, 09:55 PM

The problem is that 2 processes (sadc and lvm) are trying to write something to disk 'sdb'. These 2 processes are being blocked by a another process that is writing to disk, but because the process doing the blocking is not blocked itself, it does not produce warnings or errors. In other words, the process that is hogging the disk is doing its thing without any errors, so you are left having to figure out what process is hogging the disk some other way.

For example, a VM doing back up to a local VM disk image file will be writing LOTS of data as quickly as possible, so that VM will receive much more disk time than other processes that occasionally write a few bytes to disk. These other processes can end up waiting more than 120 seconds to do a write, and produce the warnings you are getting. Given enough time, and provided programs don't time out while waiting, everything will get written to disk and tasks will carry on, but of course, waiting for things for long periods of time is not optimal.

Let's breakdown what you have here. These are the most informative lines, with brief explanations.

[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
A 'task' that is trying to use disk sdb was aborted, but the task could not be aborted. A blocked process cannot be aborted until it is unblocked, so the abort failed.
The 'task' could be anything that writes anything to disk. On VM servers, most of disk writing is done by VMs, but any program or command that produces disk write will wait for a long time to complete the disk write, because not completing a committed write will cause file system issues.
The abort can come from a 'kill' command, a CTRL-C at the CLI, or another program terminating a child process, like a hypervisor that is told to force a VM to stop. Unfortunately, we cannot tell what the 'task' is.

The first blocked task's name and PID# are shown in this line.
[2441049.121000] INFO: task sadc:11031 blocked for more than 120 seconds.
Task name: sadc, PID#: 11031
This PID# can be found in the process list (see 'ps --help') but only while this instance of the process is still running. If the process becomes unblocked and finishes, or it dies for any reason, or the system is rebooted, the process will no longer exist, so you will lose the opportunity to examine the full command line.

Further down we that PID 11031 was trying to commit changes to disk
[2441049.121000] [<ffffffff8119b8ff>] log_wait_commit+0xc5/0x11b

The LVM task (PID 2528), likely for /dev/sdb, is also blocked:
[2792409.478000] INFO: task lvm:2528 blocked for more than 120 seconds.
This is not unusual, as tasks writing to a Logical Volume need to go through a LVM process.

Use the 'iotop' and 'top' commands to see what process(es) are using the disk and cpu resources, and, if possible, stop suspected blocking processes without killing the processes. Trying to force a process to die may just make things worse.

As you stop disk busy programs, use 'top' and 'ps' to see if the blocked PIDs complete and disappear from the process list.
If they do, then the last process you stopped was doing the blocking. What you do from there depends on what you find.