Linux - Enterprise: This forum is for all items relating to using Linux in the Enterprise.
We run RH AS3 and AS4 in the enterprise.
The DB machines are fiber connected to a SAN.
We run Oracle DB (v9 and 10g)
On one application in particular (AS3) we have consistently seen high I/O wait numbers. The percentage goes as high as 99%, but more often it sits between 40 and 80%.
From my checking the OS is doing fine; it is waiting for a response from the SAN. But should that keep the CPUs (4 of them) all at 80 or 90% busy?
What is the CPU doing at that time? I know it's waiting for data, but what is it doing to make the percentage go so high?
I equated it to the CPU going to the door and opening it looking for data, but none is there, so it goes back to its easy chair. Then, since it's still waiting, it gets up again and goes to the door.
The problem with this is we have 4 CPUs all showing that high I/O wait, and management does not like my answer. Now I'm beginning to question it myself, and I'm looking for any info that might help explain what is happening.
One problem that makes this harder to debug is that our SAN dude only shows the SAN at 10% busy, with no indication of a bottleneck. For the past 10 months or so we've attributed the slowdown to the SAN, but because they don't have the best tools to look into the SAN, we've left it at that. Now a major outage and a SLOW restore has us all rethinking the issue and looking for anything we might have overlooked.
A CPU in i/o wait state isn't a busy CPU but an available (i.e. idle) one.
You shouldn't put too much weight on that metric; it is usually more confusing than helpful when investigating how a system performs.
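To make the "waiting, not working" point concrete, here is how the kernel accounts for iowait in /proc/stat. The counter deltas below are made-up sample numbers, not from the poster's system:

```shell
# The "cpu" line in /proc/stat holds cumulative jiffies in this order:
#   user nice system idle iowait irq softirq steal
# iowait, like idle, is time the CPU had nothing runnable; it is only
# flagged separately because some task is blocked on disk I/O meanwhile.
SAMPLE="cpu 4705 150 1120 16250 9800 20 55 0"   # hypothetical counter deltas
echo "$SAMPLE" | awk '{
    total = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
    printf "iowait: %.1f%% of CPU time (all of it idle)\n", $6 * 100 / total
}'
```

So a box showing 80% iowait is 80% idle; the number says "the disks are slow", not "the CPUs are busy".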
Thanks for your reply. If we ignore that metric we are still left with the symptom, which is extremely slow data transfer rates. We had to move a 33 GB file from the SAN to a local drive. After 1.5 hours it was about 15% done. I/O wait was through the roof, in the 70 to 80% range. No other symptoms.
Where would you look to help debug such performance issues?
Quote:
Originally Posted by DotHQ
If we ignore that metric we are still left with the symptom, which is extremely slow data transfer rates. We had to move a 33 GB file from the SAN to a local drive. After 1.5 hours it was about 15% done.
That's ~8 Mbps. Quite disappointing indeed.
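For reference, the arithmetic behind that estimate (assuming "gig" means 10^9 bytes and exactly 15% in exactly 90 minutes):

```shell
# 15% of a 33 GB file moved in 1.5 hours
BYTES=$((33 * 1000 * 1000 * 1000 * 15 / 100))   # bytes actually moved (~4.95 GB)
SECS=$((90 * 60))                               # 5400 seconds
echo "$BYTES $SECS" | awk '{
    printf "%.1f MB/s = %.1f Mbit/s\n", $1 / $2 / 1e6, $1 * 8 / $2 / 1e6
}'
```

This prints `0.9 MB/s = 7.3 Mbit/s`, i.e. roughly what a 10 Mbit/s link delivers after overhead, which is what makes the downgraded-network guess plausible.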
Quote:
I/O was through the roof. 70 and 80% range. No other symptoms.
Where would you look to help debug such performance issues?
You should investigate all the components that are involved in the transfer. The I/O wait figure mostly tells you the CPU isn't the bottleneck here. My first guess would be that your network is downgraded to 10 Mbps for some reason.
It would be interesting to know what your infrastructure looks like: how many servers are attached to your SAN storage, and how are the SAN switches integrated? Which drivers are you using for your HBAs (module options are also very important), on which RHEL releases, and what storage system lies underneath?
Another question: has performance always been this bad? If not, what changed in the meantime?
You might also want to start collecting statistics on all the servers attached to the storage array under investigation. sar (part of the sysstat package on RHEL), or just iostat, or at least vmstat can help a lot. Collect all I/Os on the SAN from all servers and sum them; then you know what I/O you actually get out of your storage system. For just copying data from the SAN to, say, local disks while doing nothing else, you should see much higher performance than stated.
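A minimal sketch of what "collect on every server" can look like. It assumes the sysstat package provides iostat/sar and falls back to vmstat; the flags shown are the common ones on RHEL-era systems:

```shell
#!/bin/sh
# Pick whichever stats tool is present on this host.
TOOL=""
for c in iostat sar vmstat; do
    if command -v "$c" >/dev/null 2>&1; then TOOL="$c"; break; fi
done
echo "collecting with: ${TOOL:-nothing found}"
# Typical invocations, run in parallel on every server zoned to the array:
#   iostat -x 5     # per-device: await (ms per I/O), avgqu-sz, %util
#   sar -d 5        # logged per-device activity, good for summing later
#   vmstat 5        # coarse: bi/bo block throughput plus the wa column
```

High `await` with low per-device throughput on the host, while the array reports 10% busy, points at the path in between (HBA, switch, zoning, multipath) rather than at either end.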
We have a 4 Gbit/s fibre cable connected to the Brocade switch, which connects to the CX700. I do not have permissions to see the SAN stats, but the SAN guys report no performance peaks; everything looks normal. They added, though, that their toolkit leaves a lot to be desired, and they are in the process of upgrading their tools.
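One host-side check worth doing before trusting anyone's numbers: on 2.6-kernel boxes (AS4), the negotiated HBA link speed is visible in sysfs, so a 4 Gbit port that silently negotiated down would show up here. Host numbers and paths vary per machine; this is a sketch:

```shell
#!/bin/sh
# Print negotiated speed and port state for every FC HBA the kernel sees.
# (On hosts without FC, or on 2.4 kernels like AS3, the glob matches nothing.)
for h in /sys/class/fc_host/host*; do
    [ -d "$h" ] || continue
    speed=$(cat "$h/speed" 2>/dev/null)
    state=$(cat "$h/port_state" 2>/dev/null)
    echo "$h: speed=${speed:-unknown} state=${state:-unknown}"
done
```

If the reported speed is lower than the cable/switch rating, the problem is negotiated at the fabric edge and no amount of array-side tuning will fix it.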
I listed as much of the environment as I can in the previous posts. We are zoned on the SAN, so contention should be kept to a minimum.
Performance showed this same symptom when we were benchmarking before going live. That is when the SAN's lame toolkit first reared its ugly head.
Another thought for digging deeper into performance matters is to check the type of failover you are doing. Are you using device-mapper multipath, the built-in QLogic failover, or PowerPath (the CX700 is, IMHO, EMC)? I would not use the QLogic driver for failover and multipathing (except perhaps on RHEL AS3; I don't know what the appropriate way was there. Is the QLogic driver even supported with EMC CX on RHEL 3? I recall problems). I would dig deeper into multipathing.
Since your system is waiting for I/Os, either the driver (plus multipathing) cannot handle data as expected and therefore does not deliver it, or the storage system itself (not the switches or anything else) cannot deliver the data as requested.
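If you end up on device-mapper multipath, the setup for a CLARiiON-class array is mostly a stanza in /etc/multipath.conf. Everything below is a hypothetical sketch; the exact values must come from EMC's and Red Hat's support matrices for your array firmware and RHEL release:

```
# /etc/multipath.conf (sketch; values are assumptions, verify against vendor docs)
defaults {
    user_friendly_names yes
}
devices {
    device {
        vendor               "DGC"           # CLARiiON CX arrays report vendor DGC
        product              "*"
        path_grouping_policy group_by_prio   # keep I/O on the owning storage processor
        failback             immediate
        no_path_retry        60
    }
}
```

`multipath -ll` then shows whether all expected paths are up; a copy crawling at ~1 MB/s over a single failed-over path is exactly the kind of thing this surfaces.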
And talk to your SAN guys. What exactly is the SAN toolkit you are talking about?