LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 04-06-2018, 07:25 AM   #1
barani_pdy
LQ Newbie
 
Registered: Jan 2018
Posts: 7

Rep: Reputation: Disabled
kernel process hungs for more than 120 seconds...


Hi,

In our of deployment(vm deployment on ESXi 5.5x hypervisor)the following are observed. Request you expertise views on this to understand it.

[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff88075fc79cc0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff88075fc79cc0

[2441049.121000] INFO: task sadc:11031 blocked for more than 120 seconds.
[2441049.121000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2441049.121000] sadc D ffff88073f028100 0 11031 1 0x00000004
[2441049.121000] ffff880351263de8 0000000000000086 ffff880351263db0 ffff880351263dac
[2441049.121000] 0000000000000000 00000000000007fe 000000000000d9b8 ffff88016961c360
[2441049.121000] 000000019178904d 0000000400000000 ffff8805e59d6080 ffff88016961c080
[2441049.121000] Call Trace:
[2441049.121000] [<ffffffff815b3b28>] __schedule+0x3c/0x55
[2441049.121000] [<ffffffff815b3c43>] schedule+0x10/0x22
[2441049.121000] [<ffffffff8119b8ff>] log_wait_commit+0xc5/0x11b
[2441049.121000] [<ffffffff810662ef>] ? autoremove_wake_function+0x0/0x34
[2441049.121000] [<ffffffff8119b258>] ? spin_unlock+0x9/0xb
[2441049.121000] [<ffffffff8114c57b>] ext3_sync_file+0x9f/0xd4
[2441049.121000] [<ffffffff81111c05>] vfs_fsync_range+0x7d/0xad
[2441049.121000] [<ffffffff81111c92>] vfs_fsync+0x18/0x1a
[2441049.121000] [<ffffffff81111cc3>] do_fsync+0x2f/0x44
[2441049.121000] [<ffffffff81111ce6>] sys_fdatasync+0xe/0x12
[2441049.121000] [<ffffffff8100bc42>] system_call_fastpath+0x16/0x1b

[2792409.478000] INFO: task lvm:2528 blocked for more than 120 seconds.
[2792409.478000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2792409.478000] lvm D 0000000000000008 0 2528 2525 0x00000000
[2792409.478000] ffff8805d422fc08 0000000000000082 ffff880500000000 ffff88075de4c0c0
[2792409.478000] 0000000000000000 0000000000000400 ffff880624761770 ffff88075de4c3a0
[2792409.478000] ffff8806247617b0 0000000000000001 ffff880c24d83830 ffff88075de4c0c0
[2792409.478000] Call Trace:
[2792409.478000] [<ffffffff81072a34>] ? spin_unlock+0x9/0xb
[2792409.478000] [<ffffffff815b44c7>] ? __mutex_unlock_slowpath+0x3a/0x3e
[2792409.478000] [<ffffffff815b3b28>] __schedule+0x3c/0x55
[2792409.478000] [<ffffffff815b3c43>] schedule+0x10/0x22
[2792409.478000] [<ffffffff815b3d76>] schedule_timeout+0x22/0xbb
[2792409.478000] [<ffffffff815b5393>] ? _raw_spin_unlock_irqrestore+0x32/0x3d
[2792409.478000] [<ffffffff815b512a>] ? _raw_spin_lock_irq+0x1d/0x36
[2792409.478000] [<ffffffff815b3124>] wait_for_common+0xb1/0x11f
[2792409.478000] [<ffffffff810406d6>] ? default_wake_function+0x0/0x11
[2792409.478000] [<ffffffff81051e59>] ? local_bh_enable_ip+0x55/0x60
[2792409.478000] [<ffffffff815b321c>] wait_for_completion+0x18/0x1a
[2792409.478000] [<ffffffff8110e323>] sync_inodes_sb+0x8a/0x162
[2792409.478000] [<ffffffff81111e45>] __sync_filesystem+0x4c/0x75
[2792409.478000] [<ffffffff81112007>] sync_filesystem+0x47/0x4c
[2792409.478000] [<ffffffff8111889a>] fsync_bdev+0x23/0x43
[2792409.478000] [<ffffffff81238ce8>] blkdev_ioctl+0x1ca/0x692
[2792409.478000] [<ffffffff81117ba3>] block_ioctl+0x32/0x36
[2792409.478000] [<ffffffff810ff6f6>] vfs_ioctl+0x2d/0xa1
[2792409.478000] [<ffffffff810ffc5e>] do_vfs_ioctl+0x47d/0x4c3
[2792409.478000] [<ffffffff810ffd51>] sys_ioctl+0xad/0xd2
[2792409.478000] [<ffffffff8100bc42>] system_call_fastpath+0x16/0x1b
 
Old 04-06-2018, 09:30 AM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,137

Rep: Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263
I/O to disk sdb blocked for over 120s. lvm and sadc are victim processes that were prevented from running as a result. You need to figure out which disk is sdb and why I/O isn't proceeding to fix the underlying problem.
 
Old 04-06-2018, 10:28 AM   #3
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,632
Blog Entries: 4

Rep: Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931Reputation: 3931
Quote:
Originally Posted by smallpond View Post
I/O to disk sdb blocked for over 120s. lvm and sadc are victim processes that were prevented from running as a result. You need to figure out which disk is sdb and why I/O isn't proceeding to fix the underlying problem.
You probably have a serious disk hardware or controller issue on your hands!

(But, and I am quite serious ... "check the obvious things first." Press on every cable to be sure it's firmly seated. Check the power cords, UPS boxes (are they on?), fuses. Press down on every controller card to see if it's firmly seated. And so on.)

You should be able to run s.m.a.r.t. internal-diagnostics on the drive – if it's still running – to see what it reports about its own internal status. (These facilities are built-in to the drive hardware and can also therefore be used with Windows, AIX, and so on.)

Last edited by sundialsvcs; 04-06-2018 at 10:31 AM.
 
Old 04-07-2018, 11:47 PM   #4
anotherlinuxuser
Member
 
Registered: Jan 2007
Location: Alberta Canada
Distribution: Fedora/Redhat/CentOS
Posts: 70

Rep: Reputation: 19
Is /dev/sdb an iSCSI, or other network based disk? It is not uncommon to use a network disk of some type with ESXi. If sdb is a network device, check the NAS/SANs device and make sure it not disconnecting for some reason, or being held up by another server that may be sharing the device, as that can cause the "blocked for more than 120 seconds" message at the ESXi server.

Other possibilities are:
A VM or other process is writing a very large file to sdb that takes more than 120 seconds, causing other processes to wait.

If a VM is using a physical disk partition for its disk, as opposed to a image file, and that VM is doing lots of disk writes.

The blocked for more 120 seconds message is not necessarily an error, as the blocked operation will continue once it is no longer blocked, athough it may appear forzen while it is blocked.

Last edited by anotherlinuxuser; 04-08-2018 at 12:03 AM.
 
Old 04-09-2018, 06:59 AM   #5
barani_pdy
LQ Newbie
 
Registered: Jan 2018
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by anotherlinuxuser View Post
Is /dev/sdb an iSCSI, or other network based disk? It is not uncommon to use a network disk of some type with ESXi. If sdb is a network device, check the NAS/SANs device and make sure it not disconnecting for some reason, or being held up by another server that may be sharing the device, as that can cause the "blocked for more than 120 seconds" message at the ESXi server.

Other possibilities are:
A VM or other process is writing a very large file to sdb that takes more than 120 seconds, causing other processes to wait.

If a VM is using a physical disk partition for its disk, as opposed to a image file, and that VM is doing lots of disk writes.

The blocked for more 120 seconds message is not necessarily an error, as the blocked operation will continue once it is no longer blocked, athough it may appear forzen while it is blocked.
Hi,

It is not a network based disk. It is a disk plugged into the server running ESXi hypervisor, where the disk are emulated to VM as a file. Also the deployment has only one node. And like understand what made the kernel process to block for more than 120 seconds.

also anyidea on the following, can it be the cause ?
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff88075fc79cc0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff88075fc79cc0


The impact we are seeing is the applications are not responding, believe it is because of process being blocked for more than 120 seconds.

Regards,
 
Old 04-12-2018, 07:31 AM   #6
barani_pdy
LQ Newbie
 
Registered: Jan 2018
Posts: 7

Original Poster
Rep: Reputation: Disabled
Hi,

Request input for the above query.

Regards,
 
Old 04-12-2018, 07:43 AM   #7
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,604

Rep: Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960Reputation: 7960
Quote:
Originally Posted by barani_pdy View Post
Hi,
It is not a network based disk. It is a disk plugged into the server running ESXi hypervisor, where the disk are emulated to VM as a file. Also the deployment has only one node. And like understand what made the kernel process to block for more than 120 seconds. also anyidea on the following, can it be the cause ?
Code:
[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
 [2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
 [2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff88075fc79cc0
 [2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff88075fc79cc0
The impact we are seeing is the applications are not responding, believe it is because of process being blocked for more than 120 seconds.
...and, bumping your own thread and not providing ANY further info
Quote:
Originally Posted by barani_pdy
Request input for the above query.
We volunteer our time here; bumping your own thread and providing no additional information is fairly rude. You have been told clearly that you need to figure out where/what that sdb device is. Have you done ANY of the several troubleshooting steps given to you thus far? What were the results? What version/distro of Linux??

And since you appear to need a quick solution, you're using VMWare ESXi....which is a commercial, PAY-FOR product. So have you contacted VMWare support???
 
Old 04-12-2018, 09:55 PM   #8
anotherlinuxuser
Member
 
Registered: Jan 2007
Location: Alberta Canada
Distribution: Fedora/Redhat/CentOS
Posts: 70

Rep: Reputation: 19
The problem is that 2 processes (sadc and lvm) are trying to write something to disk 'sdb'. These 2 processes are being blocked by a another process that is writing to disk, but because the process doing the blocking is not blocked itself, it does not produce warnings or errors. In other words, the process that is hogging the disk is doing its thing without any errors, so you are left having to figure out what process is hogging the disk some other way.

For example, a VM doing back up to a local VM disk image file will be writing LOTS of data as quickly as possible, so that VM will receive much more disk time than other processes that occasionally write a few bytes to disk. These other processes can end up waiting more than 120 seconds to do a write, and produce the warnings you are getting. Given enough time, and provided programs don't time out while waiting, everything will get written to disk and tasks will carry on, but of course, waiting for things for long periods of time is not optimal.

Let's breakdown what you have here. These are the most informative lines, with brief explanations.

[2441143.217000] sd 2:0:1:0: [sdb] task abort on host 2, ffff880c01e0c3c0
[2441143.217000] sd 2:0:1:0: [sdb] Failed to abort cmd ffff880c01e0c3c0
A 'task' that is trying to use disk sdb was aborted, but the task could not be aborted. A blocked process cannot be aborted until it is unblocked, so the abort failed.
The 'task' could be anything that writes anything to disk. On VM servers, most of disk writing is done by VMs, but any program or command that produces disk write will wait for a long time to complete the disk write, because not completing a committed write will cause file system issues.
The abort can come from a 'kill' command, a CTRL-C at the CLI, or another program terminating a child process, like a hypervisor that is told to force a VM to stop. Unfortunately, we cannot tell what the 'task' is.

The first blocked task's name and PID# are shown in this line.
[2441049.121000] INFO: task sadc:11031 blocked for more than 120 seconds.
Task name: sadc, PID#: 11031
This PID# can be found in the process list (see 'ps --help') but only while this instance of the process is still running. If the process becomes unblocked and finishes, or it dies for any reason, or the system is rebooted, the process will no longer exist, so you will lose the opportunity to examine the full command line.

Further down we that PID 11031 was trying to commit changes to disk
[2441049.121000] [<ffffffff8119b8ff>] log_wait_commit+0xc5/0x11b

The LVM task (PID 2528), likely for /dev/sdb, is also blocked:
[2792409.478000] INFO: task lvm:2528 blocked for more than 120 seconds.
This is not unusual, as tasks writing to a Logical Volume need to go through a LVM process.


Use the 'iotop' and 'top' commands to see what process(es) are using the disk and cpu resources, and, if possible, stop suspected blocking processes without killing the processes. Trying to force a process to die may just make things worse.

As you stop disk busy programs, use 'top' and 'ps' to see if the blocked PIDs complete and disappear from the process list.
If they do, then the last process you stopped was doing the blocking. What you do from there depends on what you find.

Last edited by anotherlinuxuser; 04-12-2018 at 10:10 PM. Reason: Spelling corrections
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Rpi3 - kworker blocked for more than 120 seconds mralk3 Slackware - ARM 3 06-23-2018 01:05 PM
Task blocked for more than 120 seconds errors and crashes Red Squirrel Linux - Server 17 04-01-2014 01:12 PM
kernel: INFO: task gzip:28803 blocked for more than 120 seconds nallanisai Red Hat 3 01-16-2013 06:48 AM
RHEL VM - blocked for more than 120 seconds rajaniyer123 Linux - Virtualization and Cloud 0 07-13-2012 09:35 AM
INFO: <application>blocked for more than 120 seconds alok.rhct Linux - Hardware 1 04-08-2009 01:35 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 12:47 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration