SAN, NAS and IO Scheduling
Hi,
I was looking for some insights into IO scheduling (Oracle Database) on Red Hat with SAN/NAS at the back end. |
Before that, you might need to know the difference here:
http://planet.admon.org/2009/09/a-co...io-schedulers/ |
I thought Oracle used O_DIRECT ... (not that I use Oracle)
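As a rough illustration of what O_DIRECT implies for anything issuing its own IO (this is a generic sketch, not Oracle's actual code; the file, sizes and fallback are made up): opening with O_DIRECT bypasses the OS page cache, so the caller must supply block-aligned buffers.

```python
import mmap
import os
import tempfile

# Sketch only: O_DIRECT bypasses the page cache, so reads must use buffers
# aligned to the device's logical block size (often 512 B or 4 KiB).
# An anonymous mmap returns page-aligned memory, which satisfies that.
ALIGN = 4096

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * ALIGN)
    path = f.name

flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
try:
    fd = os.open(path, flags)
except OSError:
    # Some filesystems (e.g. tmpfs) reject O_DIRECT; fall back to cached IO.
    fd = os.open(path, os.O_RDONLY)

buf = mmap.mmap(-1, ALIGN)       # page-aligned anonymous buffer
nread = os.readv(fd, [buf])      # read directly into the aligned buffer
os.close(fd)
os.unlink(path)
```

The alignment requirement is also why a DBMS doing O_DIRECT IO keeps its own buffer cache instead of relying on the OS one.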
|
Hi,
thank you for your responses. I was looking for some more detail, though: IO scheduling in the context of IO controllers, bearing in mind that we usually have multipathing. If one group fires a read request and it gets split across multiple channels (the same goes for writes), how do you really measure latency? Typically you have a parent process running multiple worker threads, and those worker threads may have been assigned to different queues/IO channels. I do not think there is anything reliable as of now that gives proper feedback on this kind of IO, and I was looking for something in a typical DBMS context where you have different read and write threads.
I was trying different schedulers and trying to work out WHY a given scheduler was good - a 360-degree view of everything that affects IO, such as:
- disk kind: SATA/SCSI
- disk block size: 4 KB etc.
- OS buffering
- DBMS buffering
- IO controller buffering
- journaling in use, and whether that really affects reads/writes
- queue depth
- queue interval / service time
- channel bandwidth
- anything else I am missing
I need some more insight into how to really benchmark any scheme: noop/as/cfq/deadline |
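On the "how do you really measure latency" point, one low-tech approach is to record per-request service times and look at percentiles rather than an average, since queueing across channels shows up in the tail. A hypothetical sketch (block size, file size and request count are arbitrary, and this measures a local file, not a multipath device):

```python
import os
import random
import tempfile
import time

BLOCK = 4096     # assumed request size
NBLOCKS = 256    # test file = 1 MiB

# Create a scratch file to read against.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NBLOCKS))
    path = f.name

# Time each random read individually instead of averaging over the run.
lat_ms = []
fd = os.open(path, os.O_RDONLY)
for _ in range(200):
    off = random.randrange(NBLOCKS) * BLOCK
    t0 = time.perf_counter()
    os.pread(fd, BLOCK, off)
    lat_ms.append((time.perf_counter() - t0) * 1000)
os.close(fd)
os.unlink(path)

# Report percentiles: p99 exposes tail latency that an average hides.
lat_ms.sort()
p50 = lat_ms[len(lat_ms) // 2]
p99 = lat_ms[int(len(lat_ms) * 0.99) - 1]
```

Running the same loop under each scheduler (noop/as/cfq/deadline) and comparing p99 rather than mean throughput is one way to make "WHY was this scheduler good" concrete.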
Have a read of this - reasonably old. I had seen a more expansive paper (this is almost an executive overview), but I can't find it now.
With outboard (hardware) caching controller(s) I would have expected NOOP to be right up there - especially if using O_DIRECT. Maybe try fio as a benchmarking tool - article here. |
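To follow up on the fio suggestion, a job file along these lines can pit a sequential reader against random writers, roughly in the spirit of the tests quoted later in this thread (the directory, sizes and depths here are placeholders to adjust):

```ini
; Hypothetical fio job: one sequential reader vs several random writers.
[global]
directory=/mnt/testvol   ; point at the filesystem under test
size=256m
runtime=30
time_based
direct=1                 ; O_DIRECT, as a DBMS would typically use
ioengine=libaio
iodepth=16

[seq-reader]
rw=read
bs=128k

[rand-writers]
rw=randwrite
bs=4k
numjobs=4
```

Run it with `fio jobfile.fio`, switch the scheduler between runs via /sys/block/<dev>/queue/scheduler, and compare the bandwidth and latency lines fio prints per job.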
I read it when I was googling around.
But it doesn't address a few more questions. I read some kernel.org emails regarding IO scheduling - Vivek Goyal had some good material on this, on multipath IO and IO controllers, and related things. I will paste an excerpt and the link itself shortly. |
[EXCERPT]
http://people.redhat.com/~vgoyal/io-...ller-v10.patch

Fairness at logical device level vs at physical device level
------------------------------------------------------------
IO scheduler based controller has the limitation that it works only with the bottom-most devices in the IO stack where the IO scheduler is attached. For example, assume a user has created a logical device lv0 using three underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2 in two groups doing IO on lv0, and that the weights of the groups are in the ratio of 2:1, so T1 should get double the BW of T2 on the lv0 device.

        T1    T2
          \   /
           lv0
         / | \
      sda sdb sdc

Now resource control will take place only on devices sda, sdb and sdc and not at the lv0 level. So if IO from the two tasks is relatively uniformly distributed across the disks, then T1 and T2 will see a throughput ratio in proportion to the weights specified. But if IO from T1 and T2 is going to different disks and there is no contention, then at the higher level they will both see the same BW.

Here a second level controller can produce better fairness numbers at the logical device, but most likely at reduced overall throughput of the system, because it will try to control IO even if there is no contention at the physical level, possibly leaving disks unused in the system. Hence the question of how important it is to control bandwidth at higher level logical devices also. The actual contention for resources is at the leaf block device, so it probably makes sense to do any kind of control there and not at the intermediate devices. Secondly, it probably also means better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets its fair share. A second level controller will find it tricky to anticipate. Either it will not have any anticipation logic, in which case it will not provide fairness to single readers in a group (as dm-ioband does), or, if it starts anticipating, we run into strange situations where the second level controller is anticipating on one queue/group while the underlying IO scheduler might be anticipating on something else.

Need of device mapper tools
---------------------------
A device mapper based solution will require creation of an ioband device on each physical/logical device one wants to control. So it requires usage of device mapper tools even for people who are not using device mapper. At the same time, creation of an ioband device on each partition in the system to control the IO can be cumbersome and overwhelming if the system has lots of disks and partitions within.

IMHO, an IO scheduler based IO controller is a reasonable approach to solve the problem of group bandwidth control, and can do hierarchical IO scheduling more tightly and efficiently. But I am all ears to alternative approaches and suggestions on how things can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
===========
A 7200 RPM SATA drive with a queue depth of 31. Ext3 filesystem. I am mostly running fio jobs which have been limited to 30 second runs, then monitoring throughput and latency.

Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random writers to see the effect on random reader BW and max latencies.

http://people.redhat.com/~vgoyal/io-...ller-v10.patch |
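The fairness point in that excerpt can be made concrete with a toy model (pure arithmetic for illustration; the bandwidth figure and group names are made up, and this is not the actual controller logic): with 2:1 weights, T1 and T2 split a disk 2:1 only when they contend on the same disk, while on separate disks each sees full bandwidth.

```python
# Toy model of the proportional-weight behaviour described in the excerpt.
DISK_BW = 100.0  # MB/s per physical disk (hypothetical)

def share(weights_on_disk):
    """Split one disk's bandwidth among groups in proportion to weight."""
    total = sum(weights_on_disk.values())
    return {g: DISK_BW * w / total for g, w in weights_on_disk.items()}

# Case 1: T1 (weight 2) and T2 (weight 1) contend on the same disk,
# so they see bandwidth in the 2:1 ratio the weights promise.
same_disk = share({"T1": 2, "T2": 1})

# Case 2: the stripe happens to send T1's IO to sda and T2's to sdb:
# no contention, so each group sees a full disk regardless of weight.
separate_disks = {**share({"T1": 2}), **share({"T2": 1})}
```

This is exactly why control at the leaf devices gives the expected ratios only under contention, which is the trade-off the patch description is weighing.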
That describes a very specific situation - do you even use cgroups? Do you create different PVs on the same (physical) disk and assign them to different LVs? ...
As I suggested, fio might be the way to go - you can set the test up as you desire. If it's good enough for the guy writing those patches, it should work for you. |
Sure.
Specific situation - maybe / maybe not. As of now, no cgroups, but there is a strong possibility if we can prove our case (with different benchmarking in different scenarios).
Do you create different PVs on the same (physical) disk and assign them to different LVs?
In a few cases, yes. [I won't be able to give more details on the existing scenario, since it's confidential] |
Interesting discussion. Oracle prefers / suggests using ASM over raw devices for the database. In this way you accomplish multiple objectives:
|
Understandably, ASM might be a solution - but what about raw devices and SAN? Wouldn't the SAN controller still be merging read and write requests? Can ASM bypass the Fibre Channel controller and manage this itself? What about bottlenecks then - would it be easier to diagnose and isolate IO issues?
Going further [maybe unrelated]: what about TCQ/NCQ [Tagged Command Queuing / Native Command Queuing]? Would ASM go further and interact with SAS/SCSI disks?
I appreciate all posters [in the given thread] for their valuable time and feedback.
Regards,
a |