"Task blocked for more than 120 seconds" errors and crashes
I see tons of these in my dmesg, and they're causing my VirtualBox VMs to crash with disk I/O errors. Only the Linux guests are affected. What would cause this?
Code:
INFO: task tar:1865 blocked for more than 120 seconds.
This started randomly; I haven't changed anything. Here's the output of uname -a in case it helps:
Code:
Linux borg.loc 2.6.27.25-78.2.56.fc9.x86_64 #1 SMP Thu Jun 18 12:24:37 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
I hope I can find a solution to this; it's my main home server and I can't afford for it to go down. I'm also not finding much on Google, other than that it isn't a distro-specific issue, since people on newer versions of Fedora, Ubuntu, etc. have it as well.
More info that may be useful: the OS drive is a single drive (can't recall the brand); the data drive is an MD RAID 5 made of five 1 TB Hitachi drives. When I recently added two new drives, I went through many RMAs until I finally got two that weren't failing out of the box. Could these errors be disk related? I do have one drive with a lot of CRC SMART errors, but it was like that from the start; I had given up on all the RMAs and said screw it. Should I replace that drive? |
If you are accumulating CRC errors, you should replace the drive. Also, make sure you have TLER (time-limited error recovery) enabled on the drives:
Code:
smartctl -l scterc /dev/sda
Code:
smartctl -l scterc,70,70 /dev/sda
The first command reads the drive's current SCT ERC setting; the second sets the read and write recovery timeouts to 7.0 seconds (the values are in tenths of a second).
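If the array has several members, a small loop saves typing. A sketch, assuming the members are /dev/sda through /dev/sde (adjust the device names to your setup):
Code:
for d in /dev/sd[a-e]; do
    echo "== $d =="
    smartctl -l scterc "$d"          # query the current ERC setting
    smartctl -l scterc,70,70 "$d"    # set read/write recovery to 7.0 seconds
done
Keep in mind that many consumer drives forget the ERC setting on a power cycle, so you may need to re-run this at boot. |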
I just get "invalid argument" when I try that command.
Code:
[root@borg ~]# smartctl -l scterc /dev/sdb
Another thing I get a lot is this error:
Code:
CIFS VFS: Unexpected lookup error -112
Nothing at all on Google for that. |
smartctl 5.41 supports that option, so you may just need a newer smartmontools. The CIFS message is a timeout accessing data on an SMB/CIFS network share; error -112 is EHOSTDOWN ("Host is down"), meaning the server hosting the share stopped responding.
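For what it's worth, you can decode the raw errno number yourself; a quick perl one-liner (perl should already be on a Fedora box):
Code:
perl -e '$! = 112; print "$!\n"'    # prints "Host is down" (EHOSTDOWN)
So the client lost contact with the server side of that share.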
|
Also, should I do an fsck on the file system? Could it maybe be a file system issue? It seems to depend on how much activity there is: with no VMs running I went a whole day without it; with VMs running, it started within maybe five hours.
|
It happens when the drive tries to write to a physically bad sector: a spare sector has to be allocated and the data remapped by the drive.
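If that is what's happening, the drive's SMART attributes will usually show it. A sketch of the counters worth watching (attribute names as reported by typical ATA drives):
Code:
# Non-zero or growing values here point at a drive that is remapping sectors
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'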
|
Hmm, so that would probably cause the RAID to block for a bit, which would then trigger the message, right?
Now here's a weird one:
Code:
EXT3-fs error (device sdh1): ext3_find_entry: reading directory #2 offset 0
As for the bad sectors, is there a way I can check for them? Since the drives are in a RAID, I can't fsck them individually (it just says invalid file system). I want to confirm which drive is failing so I know which one to replace. I wish that when drives fail they would just drop out, end of story; this half-dead state is what makes it so irritating.
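Edit: some digging suggests the md layer can verify the members against each other without taking the array down; a sketch I plan to try (assuming the array is /dev/md0):
Code:
echo check > /sys/block/md0/md/sync_action   # start a background consistency check
cat /proc/mdstat                             # watch its progress
cat /sys/block/md0/md/mismatch_cnt           # non-zero after the check = inconsistencies found
A long SMART self-test per member (smartctl -t long on each drive, results via smartctl -l selftest) should also work on the raw drives even while they're in the array. |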
Did you unplug a USB or similar device?
|
I do have an eSATA drive I use for backups, but it would have been unplugged the night before, after the backup completed. Would it be normal for the error to be delayed? I've also never seen big errors like that from a drive being unplugged.
|
I ran a full fsck on the md device yesterday before bed as well; forgot to mention that. It found some errors, which it corrected (I forget what, something about blocks being incorrectly reported). Even with that error I posted, the VMs did not crap out. So MAYBE the fsck solved this, but it's still early to tell. I started up ALL my VMs (8 of them), so I'll see how it goes tonight when the backup jobs run. So far nothing funny in dmesg.
I will also be replacing that drive with the CRC errors regardless. I just ordered a 1 TB WD Black, so it should hopefully be here by the end of next week. I'm not buying Hitachi anymore; I've had too much bad luck with the 2010-2011 drives.
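For my own notes, the swap itself should be a three-step mdadm operation; a sketch assuming the flaky member is /dev/sdc1 in /dev/md0 (I'll substitute my real device names):
Code:
mdadm /dev/md0 --fail /dev/sdc1      # mark the member as failed
mdadm /dev/md0 --remove /dev/sdc1    # pull it out of the array
# physically swap the disk, partition it to match, then:
mdadm /dev/md0 --add /dev/sdc1       # add the new member; the rebuild starts automatically
cat /proc/mdstat                     # watch the resync
|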
Just an FYI: this may not be a disk issue. I'm trying to resolve the same issue on Debian, and there are a number of reports pointing to the CFS CPU scheduler. The majority of them seem to involve virtualization of some sort.
If I find a solution I'll post back. |
I see these error messages pop up everywhere: on my server at home, on customers' systems, across distros. And every time, the system becomes extremely slow, soon goes beyond usable, and you can't even shut it down anymore. The load average climbs to obscene heights; values like 20 are not unknown.
What is causing this? The tasks that get blocked are totally random; it could be anything. Where is this kernel error message documented? How can it be avoided? At first I also thought it was disk related, but that doesn't make sense, as the systems in question all have different disks and arrays: some RAID, some single disks, etc. And sometimes even kernel tasks themselves get blocked.
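The closest thing to documentation I have found is the kernel's hung-task detector (khungtaskd): it reports any task stuck in uninterruptible D state for longer than a threshold. A sketch of the relevant knobs; note that tuning them only changes the reporting, not whatever is actually blocking:
Code:
cat /proc/sys/kernel/hung_task_timeout_secs          # current threshold, 120 by default
echo 300 > /proc/sys/kernel/hung_task_timeout_secs   # report later; echo 0 disables the message
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'        # tasks currently in D state and where they sleep
|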
Hi,
anything new on this topic? I am also experiencing these issues with random tasks (see the quote). At first I also assumed a disk error and even installed a new disk. Since my system contains only RAIDs, I then struggled to figure out which RAID element to replace, since all disks reported healthy status after long smartctl checks. I did an "apt-get update" in the hope that this problem had been solved by a patch, but no more than 36 hours later my system experienced the same error. Happy for any updates. Best regards, Linuxnomo
------
For information: below is an extract of dmesg and then ps -eaf. As I had only one remote ssh session running when the system crashed, these two (dmesg, ps -eaf) were the only commands I could run. The ^C I issued after ps -eaf stopped working didn't help either; I was not able to revive any connection to the system after that. Quote:
|
Hello,
I have a similar problem with my Xen guest. Kernel version: 2.6.32-5-686-bigmem, Linux Debian 6.0. Many entries like this one: Quote:
|
I have the exact same issue; the load average hit 3000+. I was wondering if anyone else has NFS shares mounted? I've narrowed it down to either a problem with the drives or an NFS issue, although it could be something else.
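If anyone wants to compare notes, here's an easy way to list what is NFS-mounted and with which options (nfsstat ships with nfs-utils/nfs-common):
Code:
nfsstat -m           # mounted NFS filesystems plus their mount options
mount -t nfs,nfs4    # the same information from the mount table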
This link seems promising, as it explains everything rather nicely; however, I still believe the problem is related to NFS: blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/
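For reference, the fix that post suggests (as far as I can tell) is to shrink the writeback cache so a huge backlog of dirty pages cannot pile up and stall writers for minutes. A sketch of trying it at runtime, with values that are a starting point rather than gospel:
Code:
sysctl -w vm.dirty_ratio=10              # force writeback much earlier (often 40 by default on older kernels)
sysctl -w vm.dirty_background_ratio=5    # background flushing kicks in sooner too (often 10 by default)
Add the same keys to /etc/sysctl.conf to make them stick across reboots. |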
Quote:
the log shows stuff like this:
Jul 10 20:00:39 oravms kernel: INFO: task nfsd:2415 blocked for more than 120 seconds.
Jul 10 20:00:39 oravms kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 10 20:00:39 oravms kernel: nfsd          D 0000000000000000     0  2415      2 0x00000000
If anybody knows a solution, please! regards, raj
|
Interesting that this has just cropped up for us.
We are running a VMware server with about six VMs, each running Ubuntu 12.04. We had no problems until last week, when I restarted two of the systems (for whatever reason, on logging in they told us a restart was required). Ever since then, we get these Java errors followed by a call trace:
Code:
INFO: task java:1120 blocked for more than 120 seconds.
Apr  1 10:21:56 osbuild4 kernel: [ 2156.765270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  1 10:21:56 osbuild4 kernel: [ 2156.765393] java            D c1010847     0  1120   1110 0x00000000
Apr  1 10:21:56 osbuild4 kernel: [ 2156.765394] ee1b5e24 00000086 ee1b5dd0 c1010847 0000000b f7508ca0 c1930e00 c1930e00
Apr  1 10:21:56 osbuild4 kernel: [ 2156.765398] f2c40326 000001d1 f7987e00 f6804bc0 f6b53f20 00000000 ee1b5e5c c15a5d69
Apr  1 10:21:56 osbuild4 kernel: [ 2156.765401] ffffffec 00000000 ee1b5dfc c15a7f3d 67776bb0 00000000 f6804bc0 ed2c2400 |