Disk utilization 100%
I'm having an odd issue with SAN storage.
We have an EMC Clariion CX4 array connected to an HP server over Fibre Channel.
The HP box runs ESXi 6.0, and a VM on that box uses raw-mapped LUNs.
The host runs for several days, then it essentially stops, with disk utilization going to 100%.
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.01 3.12 0.00 96.87
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 129.00 0.00 0.00 100.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 131.03 0.00 0.00 100.02
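The pattern in the output above (requests sitting in the queue, zero reads or writes completing, 100% utilization) can be checked for mechanically. Below is a minimal Python sketch, assuming the column order matches the iostat header shown above. The names `parse_device_line` and `is_stalled` are made up for illustration and are not part of any tool.

```python
# Column order assumed to match the iostat header shown above.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rMB/s", "wMB/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]


def parse_device_line(line):
    """Split one iostat device line into (device name, {column: value})."""
    fields = line.split()
    name = fields[0]
    values = [float(v) for v in fields[1:]]
    if len(values) != len(COLUMNS):
        raise ValueError("unexpected column count: %r" % line)
    return name, dict(zip(COLUMNS, values))


def is_stalled(stats):
    """True when requests are queued but no reads or writes are completing."""
    return stats["avgqu-sz"] > 0 and stats["r/s"] == 0 and stats["w/s"] == 0


if __name__ == "__main__":
    # The two device lines from the post.
    sample = [
        "sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 129.00 0.00 0.00 100.00",
        "dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 131.03 0.00 0.00 100.02",
    ]
    for line in sample:
        name, stats = parse_device_line(line)
        print(name, "stalled" if is_stalled(stats) else "ok")
```

Something like this could be fed from a periodic iostat run to catch the moment the queue stops draining, which might help line the stall up with events on the ESXi or array side.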
It doesn't recover on its own (I have waited up to 2 days, on multiple occasions), and the only way to clear it is to reboot the VM (CentOS 6.6). The OS remains responsive, but disk-related commands such as "sync" and "pvs" all hang and cannot be killed, even with kill -9 as root.
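Commands that ignore kill -9 like this are normally sitting in uninterruptible sleep (state D). As a rough way to list them, here is a small Python sketch that reads /proc/<pid>/stat. The function names are hypothetical, and it only looks at processes, not individual threads under /proc/<pid>/task.

```python
import glob


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')' rather than on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks():
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (IOError, OSError, ValueError):
            continue  # process exited between listing and reading
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Run during a stall, this should show the hung sync and pvs processes along with any kernel threads stuck on the same I/O.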
The only thing the logs show is multiple blocks similar to this:
Dec 31 00:50:05 lx1 kernel: INFO: task events/27:542 blocked for more than 120 seconds.
Dec 31 00:50:05 lx1 kernel: Not tainted 2.6.32-573.8.1.el6.x86_64 #1
Dec 31 00:50:05 lx1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 31 00:50:05 lx1 kernel: events/27 D 000000000000001b 0 542 2 0x00000000
Dec 31 00:50:05 lx1 kernel: ffff880fe8caba50 0000000000000046 ffff880fe8caba18 ffff880fe8caba14
Dec 31 00:50:05 lx1 kernel: ffff880fe4d60148 ffff880fffe84600 0000d46b9a707ae8 ffff881037a159c0
Dec 31 00:50:05 lx1 kernel: 0000000000000400 000000010de949d3 ffff880fe8c9fad8 ffff880fe8cabfd8
Dec 31 00:50:05 lx1 kernel: Call Trace:
Dec 31 00:50:05 lx1 kernel: [<ffffffff81538d43>] io_schedule+0x73/0xc0
Dec 31 00:50:05 lx1 kernel: [<ffffffff81276038>] get_request_wait+0x108/0x1d0
Dec 31 00:50:05 lx1 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
Dec 31 00:50:05 lx1 kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
Dec 31 00:50:05 lx1 kernel: [<ffffffff81276756>] blk_get_request+0x46/0xa0
Dec 31 00:50:05 lx1 kernel: [<ffffffff813943a8>] scsi_execute+0x48/0x180
Dec 31 00:50:05 lx1 kernel: [<ffffffffa004db4a>] spi_execute+0xaa/0x130 [scsi_transport_spi]
Dec 31 00:50:05 lx1 kernel: [<ffffffff81537f5a>] ? printk+0x41/0x47
Dec 31 00:50:05 lx1 kernel: [<ffffffffa004df1f>] spi_dv_device_compare_inquiry+0x7f/0x120 [scsi_transport_spi]
Dec 31 00:50:05 lx1 kernel: [<ffffffffa004e12e>] spi_dv_device+0x16e/0x7b0 [scsi_transport_spi]
Dec 31 00:50:05 lx1 kernel: [<ffffffff8129232a>] ? kobject_get+0x1a/0x30
Dec 31 00:50:05 lx1 kernel: [<ffffffffa0086fc4>] mptspi_dv_device+0xb4/0x1b0 [mptspi]
Dec 31 00:50:05 lx1 kernel: [<ffffffffa00871ab>] mptspi_dv_renegotiate_work+0xeb/0x120 [mptspi]
Dec 31 00:50:05 lx1 kernel: [<ffffffffa00870c0>] ? mptspi_dv_renegotiate_work+0x0/0x120 [mptspi]
Dec 31 00:50:05 lx1 kernel: [<ffffffff8109a780>] worker_thread+0x170/0x2a0
Dec 31 00:50:05 lx1 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
Dec 31 00:50:05 lx1 kernel: [<ffffffff8109a610>] ? worker_thread+0x0/0x2a0
Dec 31 00:50:05 lx1 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Dec 31 00:50:05 lx1 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Dec 31 00:50:05 lx1 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Dec 31 00:50:05 lx1 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Last edited by usao; 12-31-2015 at 08:51 AM.