04-28-2017, 10:02 AM   #1
arcadiosincero (LQ Newbie)
KVM virtual disk cache mode safety concerns.


I have set up a small Linux KVM-based VM server farm. I have 4 machines that host the VMs. The VM hard drive image files are accessed by the VM host machines via NFS. The NFS server is a two-node high-availability cluster based on Heartbeat and DRBD, with each node having 8GB of memory. All network connectivity is gigabit Ethernet. All machines are running Debian Linux 8, and all machines are plugged into a UPS.

Everything seems to work well enough. Live migration of VMs among the 4 VM host machines works great. Testing fail-over of the NFS server also works, with only a slight momentary pause visible in the VMs while it happens.

I am a little disappointed by the disk I/O performance I am seeing in the VMs, though. Since I don't have a basis for comparison, the performance I am seeing could very well be normal for my kind of setup. I am using qcow2 files for the hard drive images, attached via VirtIO, and the cache mode I am using is "none".
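In case it helps, this is roughly what each disk looks like; the paths, names, and sizes below are just placeholders, but the relevant bits are the qcow2 format, the virtio bus, and cache='none':

Code:
# Create the qcow2 image on the NFS-mounted image directory (path/size are placeholders)
qemu-img create -f qcow2 /var/lib/libvirt/images/testvm.qcow2 40G

# The resulting disk definition in the guest XML (virsh dumpxml testvm) looks like:
#   <disk type='file' device='disk'>
#     <driver name='qemu' type='qcow2' cache='none'/>
#     <source file='/var/lib/libvirt/images/testvm.qcow2'/>
#     <target dev='vda' bus='virtio'/>
#   </disk>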

One of the first things I tried playing around with were the NFS mount options. The options I am using are rsize=32768, wsize=32768, and timeo=60. I found that using 32768 for rsize/wsize actually gave me better throughput than the default of 1048576, which I thought was odd; I determined this with a simple dd read/write test, though. I am also using a timeo value of 60 so that the NFS client will realize sooner that a fail-over has occurred at the NFS server.
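For reference, the mount line and the kind of dd test I mean look roughly like this (the server name, export, and paths are placeholders):

Code:
# fstab-style NFS mount with the options mentioned above (names/paths are placeholders)
nfs-server:/export/vmimages  /var/lib/libvirt/images  nfs  rsize=32768,wsize=32768,timeo=60  0  0

# Simple write test (1 GiB of zeroes, flushed before dd reports a rate)
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fsync

# Simple read test (drop caches first so the read actually goes through the I/O path)
echo 3 > /proc/sys/vm/drop_caches
dd if=/tmp/ddtest of=/dev/null bs=1M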

The next thing I tried fiddling with was the cache mode. The reason I used "none" in the first place is that KVM complained that it is not safe to do a live migration of the VM unless the cache mode is set to "none". I tried the KVM default of "writethrough" and saw that write performance was way worse than with "none".

I then tried "writeback" and noticed that the performance was a lot better than both "writethrough" and "none". It seemed to me that "writeback" mode should be the way to go. However, after doing some Googling I read that "writeback" isn't safe and shouldn't be used for production servers.
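For anyone who wants to repeat the comparison, this is roughly how I checked and switched the cache mode (the guest name is a placeholder); the guest has to be shut down and started again for the change to take effect:

Code:
# Check the current cache setting of a guest's disks ("testvm" is a placeholder)
virsh dumpxml testvm | grep "driver name"

# To change it, edit the disk's <driver> element with "virsh edit testvm", e.g.:
#   <driver name='qemu' type='qcow2' cache='writeback'/>
# then shut the guest down and start it again.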

If I'm understanding correctly, when a block is written to disk from the guest, the block is first sent to the host's page cache. In "writeback" mode, the host will then "lie" to the guest and say that the block was successfully written to disk immediately, even if the block is still sitting in the host's page cache. I guess this is why they say "writeback" mode is not safe: if the host crashes before the cache is flushed, that data is lost.

I'm wondering just how dangerous "writeback" mode is for production servers with my particular setup. Everything is on UPS, so the most likely reason for a host to crash would be hardware failure, which is probably the least likely thing to happen. Even if hardware failure were imminent, I would probably have ample opportunity to do a proper VM shutdown before it happens.

Also, I am wondering if there is much difference between "none" and "writeback" with my setup, since I am using NFS. Obviously there is a difference, since "writeback" yielded better throughput numbers, but I am wondering more with respect to safety. According to this diagram:

http://www.ilsistemista.net/index.ph...2.html?start=2

in "writeback" mode, blocks get sent to the host's cache before being sent to the physical disk cache while in "none" mode, blocks get sent directly to the physical disk cache. However, in my case there is no physical disk cache; only an NFS client cache. And since I am using the default NFS mount option of "async", the blocks will sit in the NFS client cache until it is filled or until it is told to flush the cache via an invocation of sync or something. Am I correct in concluding that there isn't much difference with respect to data safety between "none" and "writeback" with this setup?

Sorry for the long-winded post. To summarize what I am asking:

1. Is "writeback" cache mode really that bad with my setup?
2. Am I right that there isn't a difference in data safety between the "writeback" and "none" cache modes with my setup?

Thanks!

Last edited by arcadiosincero; 05-04-2017 at 09:53 AM.
 
04-30-2017, 03:21 AM   #2
syg00 (LQ Veteran)
1. Probably not
2. Probably.

My question to you is "why qcow2?". You have enough software layers getting in the way of your I/O; I would have thought raw was a must.
 
05-04-2017, 09:23 AM   #3
Slax-Dude (Member)
Your bottleneck is the network, not disk I/O.
Whether you use qcow2, raw, or even a native HD partition on the cluster is irrelevant if all disk I/O is pumped to the VM hosts over gigabit Ethernet.

For comparison purposes, make a qcow2 image on the VM host itself and build a VM with it.
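Something along these lines would do; the path, size, and guest name are just examples:

Code:
# Create a test image on the VM host's local disk instead of the NFS mount
qemu-img create -f qcow2 /var/local/localtest.qcow2 20G

# Attach it to an existing guest as a second disk and benchmark it against the NFS-backed one
virsh attach-disk testvm /var/local/localtest.qcow2 vdb \
  --driver qemu --subdriver qcow2 --targetbus virtio --persistent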
 
05-04-2017, 09:49 AM   #4
arcadiosincero (LQ Newbie, Original Poster)
Quote:
Originally Posted by Slax-Dude
Your bottleneck is the network, not disk I/O.
Whether you use qcow2, raw, or even a native HD partition on the cluster is irrelevant if all disk I/O is pumped to the VM hosts over gigabit Ethernet.

For comparison purposes, make a qcow2 image on the VM host itself and build a VM with it.

Yep, I realize the network is the primary bottleneck here. I guess I should've used a better thread title. When I started writing the post, my brain was still thinking "we're having performance issues with KVM virtual disks", not realizing that I already had a solution to the issue. The real question was whether the solution I came up with (using writeback cache mode versus none) is safe to use in my particular setup. More specifically, how confident can I be that I won't have data corruption under ordinary circumstances?

What I mean by "ordinary" would include, for example, a power outage that happens to take out all the UPSes too. Extraordinary would be something like an earthquake that takes down the entire building the VM farm lives in; in that case, there's really nothing I can reasonably do to prevent data corruption. (I do have an off-site backup system in place, so on the off chance that event does occur, I'm ready.)

I have since edited the thread title to reflect the true intent of the thread.
 
05-04-2017, 10:14 AM   #5
arcadiosincero (LQ Newbie, Original Poster)
Quote:
Originally Posted by syg00
1. Probably not
2. Probably.

My question to you is "why qcow2?". You have enough software layers getting in the way of your I/O; I would have thought raw was a must.
This might sound like a silly reason, but I want people to be able to provision a new virtual machine with as much point-and-click convenience as possible. You see, I won't be the only one making VMs on the VM farm. We're a smallish group of software developers, and I've been designated the resident IT guy. qcow2 is the default when provisioning a VM via virt-manager. I could just say "oh BTW, make sure you select 'raw'", but chances are nobody is going to remember that. It'll just be click, click, click all the way through.

Also, I have encountered a rather strange bug when trying to provision a new VM with a raw disk through virt-manager 1.4.0. As I mentioned in my first post, the VM disk images live on an NFS server. When virt-manager tells libvirtd to go ahead and allocate the raw disk on the NFS share, the NFS client goes haywire: performance grinds to a standstill, and doing an "ls -l" on the NFS share from a command prompt causes it to just sit there and do nothing. I see no messages on either the NFS server or the box the NFS client is on. It's the oddest thing, and I can't even begin to speculate how it is happening. If I allocate the raw disk using qemu-img from the command line, it all works fine. But like I said, I want new VM provisioning to be as easy as possible, so allocating a raw disk with qemu-img from the command line first isn't really an option.
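For the record, doing it by hand is just this (the path and size are examples), which is what makes the virt-manager behaviour so odd:

Code:
# Allocating the raw image by hand on the NFS share works fine
qemu-img create -f raw /var/lib/libvirt/images/newvm.img 40G

# (add -o preallocation=full if the space should be reserved up front)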

This bug, however, is a topic for another thread. And it won't even be an issue if everybody just sticks to qcow2, which is what will probably happen anyway, no matter what I say.
 
05-05-2017, 06:34 AM   #6
Slax-Dude (Member)
Quote:
Originally Posted by arcadiosincero
I could just say "oh BTW, make sure you select 'raw'", but chances are nobody is going to remember that. It'll just be click, click, click all the way through. --8<--snip-->8-- everybody just sticks to qcow2, which is what will probably happen anyway, no matter what I say.
Like I said: it is irrelevant if you are exporting disk images over the network.
Whatever the overhead of using qcow2 is, it is nothing compared to the delay the network adds to disk I/O.

Regarding cache: the Proxmox site has an easy-to-grasp explanation of KVM disk cache modes.

IMHO, you would be better served by iSCSI than NFS, although upping the network speed (add another gigabit NIC and use bonding) would yield a far more noticeable increase in performance.
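On Debian 8, bonding is just the ifenslave package plus something like this in /etc/network/interfaces (interface names, addresses, and the bond mode are examples; 802.3ad needs LACP support on the switch):

Code:
# /etc/network/interfaces (requires the ifenslave package)
auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    bond-slaves eth1 eth2
    bond-mode 802.3ad
    bond-miimon 100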
 
05-12-2017, 08:01 AM   #7
arcadiosincero (LQ Newbie, Original Poster)
Quote:
Originally Posted by Slax-Dude
Like I said: it is irrelevant if you are exporting disk images over the network.
Whatever the overhead of using qcow2 is, it is nothing compared to the delay the network adds to disk I/O.

Regarding cache: the Proxmox site has an easy-to-grasp explanation of KVM disk cache modes.

IMHO, you would be better served by iSCSI than NFS, although upping the network speed (add another gigabit NIC and use bonding) would yield a far more noticeable increase in performance.

Oddly, I didn't think of adding multiple gigabit NICs and using bonding to increase bandwidth. I had only considered getting 10Gbps cards and a new switch, and since that's currently out of our budget, I quickly dismissed the idea.

I just put in an order for these:

https://www.amazon.com/dp/B000P0NX3G...=I9ZEOEOA4IO1Y

and this:

https://www.amazon.com/dp/B00I5W5EGA...TZJIHIUZ&psc=1

Total cost for the upgrade is about $400. Not too bad at all. Hopefully I'll see some significant improvement with these.
 
  

