SAN, NAS and IO Scheduling
Hi,
I was looking for some insights into IO scheduling (Oracle Database) on Red Hat with SAN/NAS at the back end. |
Before that, you might need to know the difference here:
http://planet.admon.org/2009/09/a-co...io-schedulers/ |
I thought Oracle used O_DIRECT ... (not that I use Oracle)
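As a rough illustration of what O_DIRECT implies for anything issuing its own IO (this is a generic sketch, not Oracle's actual code; the file, sizes and fallback are made up): opening with O_DIRECT bypasses the OS page cache, so the caller must supply block-aligned buffers.

```python
import mmap
import os
import tempfile

# Sketch only: O_DIRECT bypasses the page cache, so reads must use buffers
# aligned to the device's logical block size (often 512 B or 4 KiB).
# An anonymous mmap returns page-aligned memory, which satisfies that.
ALIGN = 4096

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * ALIGN)
    path = f.name

flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
try:
    fd = os.open(path, flags)
except OSError:
    # Some filesystems (e.g. tmpfs) reject O_DIRECT; fall back to cached IO.
    fd = os.open(path, os.O_RDONLY)

buf = mmap.mmap(-1, ALIGN)       # page-aligned anonymous buffer
nread = os.readv(fd, [buf])      # read directly into the aligned buffer
os.close(fd)
os.unlink(path)
```

The alignment requirement is also why a DBMS doing O_DIRECT IO keeps its own buffer cache instead of relying on the OS one.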
|
Hi,
thank you for your responses. I was looking for some more detail, though: IO scheduling in the context of IO controllers, bearing in mind that we usually have multipathing. If one group fires a read request and it gets split across multiple channels (the same goes for writes), how do you really measure latency? Typically you have a parent process running multiple worker threads, and those worker threads may have been assigned to different queues/IO channels. I do not think there is anything reliable as of now that gives proper feedback on this kind of IO, and I was looking for something in a typical DBMS context where you have different read and write threads.
I was trying different schedulers and trying to work out WHY a given scheduler was good - a 360-degree view of everything that affects IO, such as:
- disk kind: SATA/SCSI
- disk block size: 4 KB etc.
- OS buffering
- DBMS buffering
- IO controller buffering
- journaling in use, and whether that really affects reads/writes
- queue depth
- queue interval / service time
- channel bandwidth
- anything else I am missing
I need some more insight into how to really benchmark any scheme: noop/as/cfq/deadline |
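On the "how do you really measure latency" point, one low-tech approach is to record per-request service times and look at percentiles rather than an average, since queueing across channels shows up in the tail. A hypothetical sketch (block size, file size and request count are arbitrary, and this measures a local file, not a multipath device):

```python
import os
import random
import tempfile
import time

BLOCK = 4096     # assumed request size
NBLOCKS = 256    # test file = 1 MiB

# Create a scratch file to read against.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NBLOCKS))
    path = f.name

# Time each random read individually instead of averaging over the run.
lat_ms = []
fd = os.open(path, os.O_RDONLY)
for _ in range(200):
    off = random.randrange(NBLOCKS) * BLOCK
    t0 = time.perf_counter()
    os.pread(fd, BLOCK, off)
    lat_ms.append((time.perf_counter() - t0) * 1000)
os.close(fd)
os.unlink(path)

# Report percentiles: p99 exposes tail latency that an average hides.
lat_ms.sort()
p50 = lat_ms[len(lat_ms) // 2]
p99 = lat_ms[int(len(lat_ms) * 0.99) - 1]
```

Running the same loop under each scheduler (noop/as/cfq/deadline) and comparing p99 rather than mean throughput is one way to make "WHY was this scheduler good" concrete.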
Have a read of this - reasonably old. I had seen a more expansive paper (this is almost an executive overview), but I can't find it now.
With outboard (hardware) caching controller(s) I would have expected NOOP to be right up there - especially if using O_DIRECT. Maybe try fio as a benchmarking tool - article here. |
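To follow up on the fio suggestion, a job file along these lines can pit a sequential reader against random writers, roughly in the spirit of the tests quoted later in this thread (the directory, sizes and depths here are placeholders to adjust):

```ini
; Hypothetical fio job: one sequential reader vs several random writers.
[global]
directory=/mnt/testvol   ; point at the filesystem under test
size=256m
runtime=30
time_based
direct=1                 ; O_DIRECT, as a DBMS would typically use
ioengine=libaio
iodepth=16

[seq-reader]
rw=read
bs=128k

[rand-writers]
rw=randwrite
bs=4k
numjobs=4
```

Run it with `fio jobfile.fio`, switch the scheduler between runs via /sys/block/<dev>/queue/scheduler, and compare the bandwidth and latency lines fio prints per job.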
I read it when I was googling around.
But it doesn't address a few more questions. I read some kernel.org emails regarding IO scheduling - Vivek Goyal had some good material on this, on multipath IO and IO controllers, and related things. I will paste an excerpt and the link itself shortly. |
[EXCERPT]
http://people.redhat.com/~vgoyal/io-...ller-v10.patch

Fairness at logical device level vs at physical device level
------------------------------------------------------------
IO scheduler based controller has the limitation that it works only with the bottom-most devices in the IO stack where the IO scheduler is attached. For example, assume a user has created a logical device lv0 using three underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2 in two groups doing IO on lv0, and that the weights of the groups are in the ratio of 2:1, so T1 should get double the BW of T2 on the lv0 device.

        T1    T2
          \   /
           lv0
         / | \
      sda sdb sdc

Now resource control will take place only on devices sda, sdb and sdc and not at the lv0 level. So if IO from the two tasks is relatively uniformly distributed across the disks, then T1 and T2 will see a throughput ratio in proportion to the weights specified. But if IO from T1 and T2 is going to different disks and there is no contention, then at the higher level they will both see the same BW.

Here a second level controller can produce better fairness numbers at the logical device, but most likely at reduced overall throughput of the system, because it will try to control IO even if there is no contention at the physical level, possibly leaving disks unused in the system. Hence the question of how important it is to control bandwidth at higher level logical devices also. The actual contention for resources is at the leaf block device, so it probably makes sense to do any kind of control there and not at the intermediate devices. Secondly, it probably also means better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets its fair share. A second level controller will find it tricky to anticipate. Either it will not have any anticipation logic, in which case it will not provide fairness to single readers in a group (as dm-ioband does), or, if it starts anticipating, we run into strange situations where the second level controller is anticipating on one queue/group while the underlying IO scheduler might be anticipating on something else.

Need of device mapper tools
---------------------------
A device mapper based solution will require creation of an ioband device on each physical/logical device one wants to control. So it requires usage of device mapper tools even for people who are not using device mapper. At the same time, creation of an ioband device on each partition in the system to control the IO can be cumbersome and overwhelming if the system has lots of disks and partitions within.

IMHO, an IO scheduler based IO controller is a reasonable approach to solve the problem of group bandwidth control, and can do hierarchical IO scheduling more tightly and efficiently. But I am all ears to alternative approaches and suggestions on how things can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
===========
A 7200 RPM SATA drive with a queue depth of 31. Ext3 filesystem. I am mostly running fio jobs which have been limited to 30 second runs, then monitoring throughput and latency.

Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random writers to see the effect on random reader BW and max latencies.

http://people.redhat.com/~vgoyal/io-...ller-v10.patch |
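The fairness point in that excerpt can be made concrete with a toy model (pure arithmetic for illustration; the bandwidth figure and group names are made up, and this is not the actual controller logic): with 2:1 weights, T1 and T2 split a disk 2:1 only when they contend on the same disk, while on separate disks each sees full bandwidth.

```python
# Toy model of the proportional-weight behaviour described in the excerpt.
DISK_BW = 100.0  # MB/s per physical disk (hypothetical)

def share(weights_on_disk):
    """Split one disk's bandwidth among groups in proportion to weight."""
    total = sum(weights_on_disk.values())
    return {g: DISK_BW * w / total for g, w in weights_on_disk.items()}

# Case 1: T1 (weight 2) and T2 (weight 1) contend on the same disk,
# so they see bandwidth in the 2:1 ratio the weights promise.
same_disk = share({"T1": 2, "T2": 1})

# Case 2: the stripe happens to send T1's IO to sda and T2's to sdb:
# no contention, so each group sees a full disk regardless of weight.
separate_disks = {**share({"T1": 2}), **share({"T2": 1})}
```

This is exactly why control at the leaf devices gives the expected ratios only under contention, which is the trade-off the patch description is weighing.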
That describes a very specific situation - do you even use cgroups? Do you create different PVs on the same (physical) disk and assign them to different LVs? ...
As I suggested, fio might be the way to go - you can set the test up as you desire. If it's good enough for the guy writing those patches, it should work for you. |
Sure.
Specific situation - maybe / maybe not. As of now, no cgroups, but there is a strong possibility if we can prove our case (with different benchmarking in different scenarios).
Do you create different PVs on the same (physical) disk and assign them to different LVs?
In a few cases, yes. [I won't be able to give more details on the existing scenario, since it's confidential] |
Interesting discussion. Oracle prefers / suggests using ASM over raw devices for the database. In this way you accomplish multiple objectives:
|
Understandably, ASM might be a solution - but what about raw devices and SAN? Wouldn't the SAN controller still be merging read and write requests? Can ASM bypass the Fibre Channel controller and manage this itself? What about bottlenecks then - would it be easier to diagnose and isolate IO issues?
Going further [maybe unrelated]: what about TCQ/NCQ [Tagged Command Queuing / Native Command Queuing]? Would ASM go further and interact with SAS/SCSI disks?
I appreciate all posters [in the given thread] for their valuable time and feedback.
Regards,
a |