LinuxQuestions.org

midorikawa 03-19-2013 09:46 PM

QEMU-KVM locking on hard disk access
 
I have a Gentoo KVM host running 2 Gentoo guests. They've been working for 2 years without issue. Recently, one guest started locking up without any error message, kernel panic, or other logged failure when I started MySQL. Shortly after, it started locking up on any disk access. When I loaded a backup into the other guest and ran MySQL on it, it started locking up too. Unable to produce any software fault log, and unable to reproduce this behavior on the host, I tried swapping the motherboard out of desperation, thinking that something was corrupting the disk. I should have known that wasn't the cause, but oh well.

Both virtual machines lock at this line in the boot process:

Code:

[    1.135647] EXT3-fs (sda1): error: couldn't mount because of unsupported optional features (240)
[    1.144455] EXT4-fs (sda1): couldn't mount as ext2 due to feature incompatibilities
[    1.160868] EXT4-fs (sda1): INFO: recovery required on readonly filesystem
[    1.163865] EXT4-fs (sda1): write access will be enabled during recovery

Booting either VM to a livecd and trying to mount, fsck, or otherwise do anything with the drives causes the VM to lock as well. I tried blowing away / on one VM and doing a full reinstall with no luck. Immediately after, it resumed locking on boot, even with a clean filesystem.

At first, I thought it was power related, and decided to slam the host with as much CPU use as possible, with no failures. I also ran memtest86 with no failures reported. I've run emerge -e world on the host in case something had gotten broken, as well as revdep-rebuild.

I've been running these VMs with the following manual kvm lines:

Code:

/usr/bin/qemu-kvm -cpu host -append 'root=/dev/sda1' -net nic,macaddr=b2:29:7a:b9:2a:c1 \
-kernel /boot/kvm/vmlinuz -smp 2 -net tap,ifname=tap0,script=no,downscript=no \
-net nic,macaddr=b2:29:7a:b9:2a:c3 -net tap,ifname=tap4,script=no,downscript=no \
-hda /web/web.img -hdb /dev/md3 -m 2048 -daemonize

Code:

/usr/bin/qemu-kvm -cpu host -append 'root=/dev/sda1' \
-net nic,macaddr=52:54:00:12:34:57 -kernel /boot/mysql/vmlinuz \
-net tap,ifname=tap1,script=no,downscript=no -hda /web/mysql.img \
-m 512 -daemonize

I then tried importing them into libvirt, with the same result.

Thanks for your help in advance.

dyasny 03-20-2013 12:47 AM

Have you considered moving to virtio?
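
A minimal sketch of what that could look like for the mysql guest, assuming the guest kernel has the virtio block, net and PCI drivers built in (so the disk shows up as /dev/vda rather than /dev/sda). The virtio_cmdline helper is hypothetical and only prints the command so it can be reviewed before running:

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a virtio variant of the
# mysql guest command line from the first post.
# Assumption: the guest kernel supports virtio_blk, virtio_net and virtio_pci,
# so the root disk becomes /dev/vda1 instead of /dev/sda1.
virtio_cmdline() {
    img="$1"      # disk image path, e.g. /web/mysql.img
    kernel="$2"   # guest kernel stored on the host, e.g. /boot/mysql/vmlinuz
    tap="$3"      # existing tap interface, e.g. tap1
    mac="$4"      # guest MAC address
    printf '%s\n' "/usr/bin/qemu-kvm -cpu host -m 512 -daemonize \
-kernel $kernel -append 'root=/dev/vda1' \
-drive file=$img,if=virtio \
-net nic,model=virtio,macaddr=$mac \
-net tap,ifname=$tap,script=no,downscript=no"
}

virtio_cmdline /web/mysql.img /boot/mysql/vmlinuz tap1 52:54:00:12:34:57
```

The only changes from the original line are -drive with if=virtio in place of -hda, model=virtio on the NIC, and root=/dev/vda1 in the kernel arguments.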

midorikawa 03-20-2013 01:37 AM

Quote:

Originally Posted by dyasny (Post 4914928)
Have you considered moving to virtio?

No, I haven't. I'll have to have a look tomorrow and report back when I can implement and test.

jefro 03-21-2013 02:36 PM

I wonder if moving the VMs to a cleanly installed new host would show any improvement. At first it would seem to be an issue with the guests, but if both fail it leads me to consider the host or host hardware. I'd bet on host hardware at this point.

midorikawa 03-21-2013 04:00 PM

Quote:

Originally Posted by jefro (Post 4916151)
I wonder if moving the VMs to a cleanly installed new host would show any improvement. At first it would seem to be an issue with the guests, but if both fail it leads me to consider the host or host hardware. I'd bet on host hardware at this point.

What's strange about this is that the host has no issues whatsoever. The host is on 2 320GB SATA drives in md RAID1, and the RAID shows as clean. The other strange thing is that it wasn't both guests at first. At first it was just one, and it only spread to the other when I started up MySQL there. Shutting down all VMs and pushing resource consumption on the host higher than with MySQL running in either guest does nothing.
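
For reference, the "RAID shows as clean" check can be scripted roughly like this. A sketch only: md_degraded is a made-up helper name, and it simply looks for an underscore in the [UU]-style member status field of /proc/mdstat, which marks a missing or failed member:

```shell
#!/bin/sh
# Print md arrays whose member-status field (for example [UU]) contains "_".
# Usage: md_degraded [file]   (defaults to /proc/mdstat)
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember which array we are in
        {
            for (i = 1; i <= NF; i++)        # find the [UU] / [U_] style field
                if ($i ~ /^\[[U_]+\]$/ && index($i, "_") > 0)
                    print name, $i           # one line per degraded array
        }
    ' "${1:-/proc/mdstat}"
}
```

Running md_degraded with no arguments reads /proc/mdstat and prints nothing when every array has all of its members up.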

I'd really rather not reinstall unless I was sure this is the cause.

I also just tried using virtio, and booting from an ISO, and still no luck. It locked up here:

Code:

livecd ~ # mount /dev/vda1 /mnt/gentoo

and here:

Code:

livecd ~ # fsck.ext4 -fy /dev/vda1
e2fsck 1.42 (29-Nov-2011)
/: recovering journal

It doesn't matter whether I attach the ISO via virtio or not, the result is the same.

I just ran strace with -ff against the QEMU process while it was locked up, and got the following output:

Code:

[pid 13453] rt_sigaction(SIGALRM, NULL, {0x7ff23af35dc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7ff238876460}, 8) = 0
[pid 13453] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(3, 0x7fff5ce802b0, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29541303}}, NULL) = 0
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 29433214}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29067851}}, NULL) = 0
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [4])
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(4, "\1\0\0\0\0\0\0\0", 512) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [3])
[pid 13453] read(5, 0x7fff5ce80340, 16) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] read(3, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128) = 128
[pid 13453] rt_sigaction(SIGALRM, NULL, {0x7ff23af35dc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7ff238876460}, 8) = 0
[pid 13453] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(3, 0x7fff5ce802b0, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29563778}}, NULL) = 0
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 29465375}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29115804}}, NULL) = 0
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [4])
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(4, "\1\0\0\0\0\0\0\0", 512) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [3])
[pid 13453] read(5, 0x7fff5ce80340, 16) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] read(3, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128) = 128
[pid 13453] rt_sigaction(SIGALRM, NULL, {0x7ff23af35dc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7ff238876460}, 8) = 0
[pid 13453] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(3, 0x7fff5ce802b0, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29581552}}, NULL) = 0
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 29243760}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29098059}}, NULL) = 0
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [4])
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(4, "\1\0\0\0\0\0\0\0", 512) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [3])
[pid 13453] read(5, 0x7fff5ce80340, 16) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] read(3, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128) = 128
[pid 13453] rt_sigaction(SIGALRM, NULL, {0x7ff23af35dc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7ff238876460}, 8) = 0
[pid 13453] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(3, 0x7fff5ce802b0, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29278736}}, NULL) = 0
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 29188725}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 28789433}}, NULL) = 0
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [4])
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(4, "\1\0\0\0\0\0\0\0", 512) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [3])
[pid 13453] read(5, 0x7fff5ce80340, 16) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] read(3, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128) = 128
[pid 13453] rt_sigaction(SIGALRM, NULL, {0x7ff23af35dc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7ff238876460}, 8) = 0
[pid 13453] write(4, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(3, 0x7fff5ce802b0, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13453] select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29696573}}, NULL) = 0
[pid 13453] timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 29380885}}) = 0
[pid 13453] timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 29029681}}, NULL) = 0
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [4])
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] read(4, "\1\0\0\0\0\0\0\0", 512) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL) = 1 (in [5])
[pid 13453] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
[pid 13453] select(11, [3 4 5 8 9 10], [], [], NULL^CProcess 13453 detached
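
Since the trace is the same loop over and over, a per-syscall tally is easier to read than the raw dump. A rough sketch, assuming the output was saved to a file (strace.out is a hypothetical name, and syscall_tally is a made-up helper):

```shell
#!/bin/sh
# Count syscalls by name in saved strace output, most frequent first.
# Usage: syscall_tally [file...]   (reads stdin when no file is given)
syscall_tally() {
    awk '
        {
            line = $0
            sub(/^\[pid +[0-9]+\] /, "", line)    # drop the "[pid N] " prefix
            if (match(line, /^[a-z_0-9]+\(/))     # syscall name followed by "("
                count[substr(line, 1, RLENGTH - 1)]++
        }
        END { for (name in count) print count[name], name }
    ' "$@" | sort -rn
}
```

It would be called as syscall_tally strace.out, and prints one "count name" line per syscall.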

While I'm at it, lsof dumps the following:

Code:

COMMAND    PID USER  FD  TYPE DEVICE    SIZE/OFF    NODE NAME
qemu-syst 13453 root  cwd    DIR  9,127        4096      2 /
qemu-syst 13453 root  rtd    DIR  9,127        4096      2 /
qemu-syst 13453 root  txt    REG  9,127      4920872 2419871 /usr/bin/qemu-system-x86_64
qemu-syst 13453 root  mem    REG  9,127        14592 2326682 /lib64/libdl-2.15.so
qemu-syst 13453 root  mem    REG  9,127      1909952 1350217 /usr/lib64/libcrypto.so.1.0.0
qemu-syst 13453 root  mem    REG  9,127      431760 1350220 /usr/lib64/libssl.so.1.0.0
qemu-syst 13453 root  mem    REG  9,127      1732824 2326238 /lib64/libc-2.15.so
qemu-syst 13453 root  mem    REG  9,127      135074 2312695 /lib64/libpthread-2.15.so
qemu-syst 13453 root  mem    REG  9,127        88440 1353096 /lib64/libz.so.1.2.7
qemu-syst 13453 root  mem    REG  9,127      1009848 2325424 /lib64/libm-2.15.so
qemu-syst 13453 root  mem    REG  9,127      588432 1340465 /usr/lib64/libpixman-1.so.0.28.0
qemu-syst 13453 root  mem    REG  9,127        5280 1369378 /lib64/libaio.so.1.0.1
qemu-syst 13453 root  mem    REG  9,127        71904 2087180 /usr/lib64/libseccomp.so.1.0.1
qemu-syst 13453 root  mem    REG  9,127      264704 1361335 /usr/lib64/libjpeg.so.8.0.2
qemu-syst 13453 root  mem    REG  9,127      174928 2268981 /usr/lib64/libpng15.so.15.13.0
qemu-syst 13453 root  mem    REG  9,127        18728 1375234 /lib64/libuuid.so.1.3.0
qemu-syst 13453 root  mem    REG  9,127      921272 1358418 /usr/lib64/libasound.so.2.0.0
qemu-syst 13453 root  mem    REG  9,127      345968 1350297 /lib64/libncurses.so.5.9
qemu-syst 13453 root  mem    REG  9,127      360456 1393330 /usr/lib64/libcurl.so.4.3.0
qemu-syst 13453 root  mem    REG  9,127        10456 2326765 /lib64/libutil-2.15.so
qemu-syst 13453 root  mem    REG  9,127      1192888 1352994 /usr/lib64/libglib-2.0.so.0.3200.4
qemu-syst 13453 root  mem    REG  9,127        35656 2326731 /lib64/librt-2.15.so
qemu-syst 13453 root  mem    REG  9,127      144816 2326771 /lib64/ld-2.15.so
qemu-syst 13453 root  mem    REG    0,9                3826 anon_inode:kvm-vcpu (stat: No such file or directory)
qemu-syst 13453 root    0u  CHR  136,4          0t0      7 /dev/pts/4
qemu-syst 13453 root    1u  CHR  136,4          0t0      7 /dev/pts/4
qemu-syst 13453 root    2u  CHR  136,4          0t0      7 /dev/pts/4
qemu-syst 13453 root    3u  0000    0,9            0    3826 anon_inode
qemu-syst 13453 root    4u  0000    0,9            0    3826 anon_inode
qemu-syst 13453 root    5u  0000    0,9            0    3826 anon_inode
qemu-syst 13453 root    6u  CHR 10,232          0t0    1256 /dev/kvm
qemu-syst 13453 root    7u  0000    0,9            0    3826 anon_inode
qemu-syst 13453 root    8u  CHR 10,200          0t0      45 /dev/net/tun
qemu-syst 13453 root    9u  CHR 10,200          0t0      45 /dev/net/tun
qemu-syst 13453 root  10u  0000    0,9            0    3826 anon_inode
qemu-syst 13453 root  11r  REG  9,127    179519488  791369 /root/install-amd64-minimal-20130110.iso
qemu-syst 13453 root  12u  REG    9,2 137679273984      12 /web/web.img
qemu-syst 13453 root  13u  BLK    9,3  0x4af000000    4638 /dev/md3
qemu-syst 13453 root  14u  0000    0,9            0    3826 anon_inode
qemu-syst 13453 root  15u  0000    0,9            0    3826 anon_inode

I do see that there's a "No such file or directory" on anon_inode:kvm-vcpu, but I'm entirely unsure what this is or whether it's the cause of my problems, and Google doesn't show much useful info. At least, not that I've found.

jefro 03-22-2013 02:43 PM

Run Memtest on the system for a day or so.

midorikawa 03-23-2013 02:06 AM

Quote:

Originally Posted by jefro (Post 4916695)
Run Memtest on the system for a day or so.

I'd run memtest on it for several hours with no errors detected previously. Regardless, I've used more RAM than the guests use with the host OS with no ill effects.

jefro 03-25-2013 03:21 PM

"RAM than the guests use with the host OS with no ill effects."

Do you mean you assigned more RAM than is available?

midorikawa 03-25-2013 04:41 PM

Quote:

Originally Posted by jefro (Post 4918632)
"RAM than the guests use with the host OS with no ill effects."

Do you mean you assigned more RAM than is available?

My apologies. I meant the guests use less RAM than I used to test with on the host. The guests use about 2.5GB of RAM out of 8GB total. On the host, I used enough to push the box into swap without issue. Plus, memtest86 completed without error.

jefro 03-25-2013 09:15 PM

OK, well, not sure which way to go.

I feel it is the hardware. Without any evidence of what the cause is, you need to run more aggressive tests.

"I also just tried using virtio, and booting from an ISO, and still no luck. It locked up here:" In a very real sense I guess it could be components related to the VM.

midorikawa 03-25-2013 09:21 PM

Quote:

Originally Posted by jefro (Post 4918851)
OK, well, not sure which way to go.

I feel it is the hardware. Without any evidence of what the cause is, you need to run more aggressive tests.

"I also just tried using virtio, and booting from an ISO, and still no luck. It locked up here:" In a very real sense I guess it could be components related to the VM.

Is there anything that could cause RAM to pass memtest, but still actually fail? Or do you think it could be the CPU? I've replaced the mobo already, so it's for sure not the SATA controller. What more aggressive tests could I run? I don't have the cash to just replace everything.

jefro 03-26-2013 02:25 PM

RAM and any associated device could fail in seconds or days. A long time ago, companies used to use hot/cold chambers and run diags on computers to try to weed out failures. Causes of failures include poor connections, as in cold solder joints or any connector. Maybe the most common is PN junctions in components. Damage from time, heat and ESD, as well as poor production, could cause any of the few million gates to fail.

We don't really have any way to do full diags under hot and cold conditions. When one runs memtest, it may or may not exercise all the components in your system. It also may need to run for days to see if any failure happens. There are a few memtest variants out there, and any one of them may or may not be the best way to test.

It might be worth it to just reseat all components and try live again.

I assume you have server RAM such as ECC. You might have to disable that and run memtest again.

midorikawa 03-26-2013 02:40 PM

Quote:

Originally Posted by jefro (Post 4919379)
I assume you have server RAM such as ECC. You might have to disable that and run memtest again.

Sadly, this server was built on a budget, and uses desktop components. I try not to skimp on things like RAM, CPU, etc, and instead skimp on areas like the case. I was afraid you'd say RAM, as that's the only thing that makes sense to me still, too. I get paid in a few days, so I'll probably be purchasing replacements then. Thanks for your help. :-)

jefro 03-26-2013 07:28 PM

I'd at least install a new clean OS on it and test this again.

Don't throw money at it. Any part could be the cause. If you have some old parts, consider swapping them in, for example swapping out the PSU.

If you want to throw money at it, then get a full new system.