Is there a way to stop Linux from choking up and often crashing if disk I/O is slow?
Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
Distribution: Mint 20.1 on workstation, Debian 11 on servers
Posts: 1,336
Original Poster
Rep:
Quote:
Originally Posted by jpollard
One advantage iSCSI has over NFS for the host providing the virtual disk is that it would bypass the two level I/O. In the NFS mode, the VM making an I/O request to its virtual disk first gets translated to a host reference - then the host translates that to an NFS reference... then the NFS server translates that to a disk reference.
With iSCSI, the VM would make an I/O request to the virtual device, which then gets sent to the iSCSI target, which then translates it to a disk block and a disk reference.
This would eliminate the VM host from a lot of excess work - including buffer management which can add latency to the usual NFS delays from both the file server and the VM host.
Note: the iSCSI target does not have to be a hardware unit - it CAN be, but it isn't required.
That's what I'm thinking too, iSCSI would probably take out a lot of overhead. For actual file access I can just set up a file server VM with a large virtual disk, then do NFS and SMB for that. This would also have the advantage of being able to eventually get iSCSI cards so I can put the OS on the SAN too for physical servers. Fewer parts that can fail.
You don't need "iSCSI" cards. It is all software passing SCSI commands over a network connection. The targeted host then interprets the SCSI commands - which COULD just pass them to a dedicated disk, but usually interprets them to access a disk file. The VM would use an iSCSI driver to intercept the commands - and encapsulate them to send to a server over the network.
Quote:
Originally Posted by jpollard
Actually you can.
VMFS is a distributed shared filesystem. You can get that with gluster.
An iscsi target is more aimed at giving the VM an appearance of a dedicated device. Suitable for a root disk.
Yeah, but VMFS is proprietary, is it not? If I want to set up multiple Linux HOSTS to use KVM or another VM solution (not VMware) and I want them to be able to access the same iSCSI targets, what file system would I use on the HOSTS? That's what I'm asking. Or would Gluster take care of that? Is that a file system on its own?
Ex: I go on one of the VM hosts and set up an iSCSI target, which will be like a raw hard drive. I need to know what file system I would format it with so that I can set up that same target on another host and see the same files, without risk of corruption. Not all file systems will work this way.
Quote:
Originally Posted by jpollard
You don't need "iSCSI" cards. It is all software passing SCSI commands over a network connection. The targeted host then interprets the SCSI commands - which COULD just pass them to a dedicated disk, but usually interprets them to access a disk file. The VM would use an iSCSI driver to intercept the commands - and encapsulate them to send to a server over the network.
I was talking about physical servers. If I wanted to, I could put an iSCSI card in one and have it boot off a target, rather than put a hard drive in the server. It would eliminate a point of failure. Those are ridiculously expensive though, so I probably would not bother... For VMs, the VM hosts would use a software iSCSI initiator. I've managed SANs before, just never in Linux/open source, but I want to set up my environment that way if it means better performance. I kinda treat my file server as a SAN anyway, so there's really no point in the overhead of NFS when I can do block storage.
Been trying to find info on setting up an HA iSCSI environment in Linux and there is little to no documentation out there, so I think I will scrap that idea for now. I'd rather not try to completely overhaul my environment live anyway; I'll wait until I decide to actually get more hardware to do HA.
I just want to know what I can do to make my existing setup choke less. What files do I have to edit, what do I have to put in them, etc.? For example, how do I disable the caching that was suggested? Where do I go for that?
The problem isn't caching - but that COULD introduce problems with multiple updates to a file from different places...
The problem appears to be timeouts, which is why I indicated a number of options for NFS mounts to change the timeouts...
One last NFS option (and I don't like it, as it makes things harder to shut down) is to use the "hard" option. This causes NFS clients to hang while an NFS server reboots - and if it never comes back, you can't easily shut down the client, as it is locked in an uninterruptible wait for the server...
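The mount options being discussed might look like this in /etc/fstab (a sketch only; the server name, export path, and mount point are hypothetical, and note that timeo is in tenths of a second):

```shell
# "soft" makes the client return an I/O error after retrans retries instead
# of hanging forever; timeo=600 means 60 seconds per try, so this gives up
# after roughly five minutes of an unresponsive server.
server:/export  /mnt/data  nfs  soft,timeo=600,retrans=5,_netdev  0 0

# The "hard" variant discussed above: the client blocks until the server
# returns - it survives server reboots but makes client shutdown painful.
#server:/export  /mnt/data  nfs  hard,timeo=600,_netdev  0 0
```

The trade-off is exactly the one described above: "soft" can surface I/O errors to applications during a long stall, while "hard" never errors out but can wedge the client.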
Linux has a command top which shows which process is taking up lots of CPU and memory resources.
Use the top command and kill the process which is unnecessarily taking up lots of CPU and memory resources.
For a file server, top will only report itself... NFS is done within the kernel.
And within this particular context, I think it will show sufficient idle time...
I BELIEVE (not having proof) that the sum of latencies involved with the I/O is causing the problem, not necessarily a lack of CPU time. It may be an overloaded network... or an overloaded disk... and neither is examined by top. Might try "iotop" instead.
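A quick way to check whether the box is I/O-bound rather than CPU-starved (the awk one-liner only assumes a Linux /proc; iotop needs root and its own package):

```shell
# iowait is the 6th field of the "cpu" summary line in /proc/stat; if it
# grows quickly relative to the other counters, processes are stalled on disk.
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat

# Per-process detail (root required, iotop package):
#   iotop -oPa    # -o only procs doing I/O, -P per-process, -a accumulated totals
```

Sampling /proc/stat twice a few seconds apart and comparing the iowait deltas is more telling than a single reading.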
NFS doesn't have a timeout. If a read takes an hour, everything waits and runs fine after it completes.
However the disk block device driver in your VM does have a timeout. You can change it from the default, usually 30 or 60 seconds, to 5 minutes by doing the command below:
Code:
echo 300 >/sys/class/block/sda/device/timeout
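That echo only lasts until the device node goes away; to make the change stick across reboots, a udev rule is one option (a sketch, assuming the disks all appear as sd* devices):

```shell
# /etc/udev/rules.d/99-disk-timeout.rules
# Reapplies the 300 s SCSI command timeout each time the kernel creates a
# whole-disk sd device, so it survives reboots unlike the one-shot echo.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="300"
```

Run `udevadm control --reload` (or reboot) for the rule to take effect.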
NFS does indeed have a timeout - unless you mount "hard", which introduces management problems on the clients. The timeouts can also cause total system hangs when multiple systems depend on a single export... as one client can lock the entire tree. And if that client then enters a LONG timeout cycle, other clients will gradually back up behind that lock.
Now changing the VM device driver timeout would be an interesting modification. I hadn't considered that.
And that thought brings up another thought...
Has the possibility of using an NFS mounted root filesystem been considered?
This would remove the VM drivers from the loop and allow direct NFS handling of the root filesystem between the VM and the server. It is "close" to the way iSCSI would interact with the server, by not having to work through the VM host, which would then have to work through NFS.
PS:
There would be a couple of advantages provided:
1. shared space with the file server where unused storage by one VM would be available to another...
2. Possible sharing of /usr among all VMs (assuming all are at the same level)
3. Possibly easier updating? I haven't done this in a long time, but when I was doing it, only the file server needed updating - as updating it would update the /usr filesystem (presumably shared). If the NFS /usr is separate, only one temporary host would need updating, and that one would update the shared /usr for all. The only things the root filesystem would have that HAVE to be separate are /etc and /var (assuming /tmp is mounted as a tmpfs mount).
Alternatively (and likely simpler) would be to have the root filesystem (with /usr combined) separate for each VM. That takes up more disk space, though, as there would be no shared binaries. The unused space would still be shared.
One way to view this model is that the VMs are all treated as diskless clients of a file server.
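For that diskless-client model, the moving parts might look like this (a sketch only; the server address, export paths, and NFS version are hypothetical, and the VM's kernel/initramfs must have NFS-root support built in):

```shell
# On the file server, /etc/exports - one root tree per VM:
#   /srv/vmroots/vm1  192.168.1.0/24(rw,no_root_squash,sync)

# On the VM, kernel command line - mount the root filesystem over NFS
# instead of from a virtual disk:
#   root=/dev/nfs nfsroot=192.168.1.10:/srv/vmroots/vm1,vers=3 ip=dhcp
```

no_root_squash is needed here because the VM's root user has to own files in its own root tree; that is exactly why each VM should get its own export rather than sharing one writable tree.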
I've used top and iotop, and backup jobs will naturally cause a lot of usage. I don't want to stop those, I just don't want the system to choke up because there's a lot of activity. Torrents seem to cause a lot of activity too, due to dealing with lots of small writes. It's one thing if access is slower because of increased I/O; I just don't want the systems to crash or have issues and end up generating tons of errors, which is what happens now. For that timeout command, which system do I put that on, the ESX hosts? The file server? Or each VM? Guessing those changes are not persistent, so I'd have to set it in my startup script too?
Also figured this might help, this is what my exports file looks like:
I've used top and iotop, and backup jobs will naturally cause a lot of usage. I don't want to stop those, I just don't want the system to choke up because there's a lot of activity. Torrents seem to cause a lot of activity too, due to dealing with lots of small writes. It's one thing if access is slower because of increased I/O; I just don't want the systems to crash or have issues and end up generating tons of errors, which is what happens now. For that timeout command, which system do I put that on, the ESX hosts? The file server? Or each VM?
EACH VM.
Quote:
Guessing those changes are not persistent so I'd have to set it in my startup script too?
Since you have a RH-based kit, no. There is /etc/sysctl.d (and see the man page on sysctl) that handles that.