LinuxQuestions.org
Old 07-23-2007, 12:38 PM   #1
DotHQ
Member
 
Registered: Mar 2006
Location: Ohio, USA
Distribution: Red Hat, Fedora, Knoppix,
Posts: 542

Rep: Reputation: 33
I/O wait CPU percentage definition


We run RH AS3 and AS4 in the enterprise.
The DB machines are fiber connected to a SAN.
We run Oracle DB (v9 and 10g)

On one application in particular (AS3) we have consistently seen high i/o wait numbers. The percentage would go as high as 99%, but more often it would be between 40 and 80%.

From what I can tell the OS is doing fine; it is waiting for a response from the SAN. But should that keep the CPUs (4 of them) all at 80 or 90% busy?

What is the CPU doing at that time? I know it's waiting for data, but what is it doing to make the percentage go so high?
I equated it to the CPU going to the door and opening it to look for data, but none is there, so it goes back to its easy chair. Then it's still waiting, so it gets up again and goes to the door.
The problem with this is we have 4 CPUs all showing that high i/o wait busy, and management does not like my answer. Now I'm beginning to question it and I'm looking for any info that might help explain what is happening.

One problem that makes this harder to debug is that our SAN dude only shows the SAN at 10% busy and does not show any indication of a bottleneck. For the past 10 months or so we've attributed the slowdown to the SAN, but due to them not having the best tools to look into the SAN we've left it at that. Now a major outage and SLOW restore has us all rethinking the issue and looking for something / anything we might have overlooked.

TIA!!!!!!!!!!!!!!!
 
Old 07-23-2007, 03:13 PM   #2
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris10, Solaris 11, Mint, OL
Posts: 9,499

Rep: Reputation: 355
A CPU in i/o wait state isn't a busy CPU but an available (i.e. idle) one.

You shouldn't care that much about this metric which is usually more confusing than anything in investigating how a system performs.
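
To make the point above concrete, here is a minimal sketch of how the i/o wait percentage can be derived from two samples of the aggregate "cpu" line in /proc/stat. It assumes a kernel whose /proc/stat reports iowait as the fifth counter; the sample values are made up for illustration and are not from the system discussed in this thread.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def cpu_percentages(before: str, after: str) -> dict:
    """Return the share of elapsed CPU time spent in each state, in percent."""
    old = [int(v) for v in before.split()[1:]]
    new = [int(v) for v in after.split()[1:]]
    # zip() stops at the shortest input, so trailing guest fields are ignored.
    deltas = {name: n - o for name, o, n in zip(FIELDS, old, new)}
    total = sum(deltas.values())
    return {name: 100.0 * d / total for name, d in deltas.items()}


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate "cpu" line (Linux only); call twice, a few seconds apart."""
    with open(path) as f:
        return f.readline()


# Made-up samples for illustration, not data from the thread.
sample_before = "cpu  1000 0 500 8000 500 0 0 0"
sample_after = "cpu  1100 0 550 8100 1250 0 0 0"

pct = cpu_percentages(sample_before, sample_after)
# Time in which the CPU was really executing something.
busy = pct["user"] + pct["nice"] + pct["system"] + pct["irq"] + pct["softirq"] + pct["steal"]
print(f"iowait {pct['iowait']:.1f}%  idle {pct['idle']:.1f}%  actually busy {busy:.1f}%")
```

With these made-up samples the interval is 75% i/o wait but only 15% real work: during i/o wait the CPU has nothing runnable and is free to take other tasks, which is why a high figure does not mean the CPUs are busy.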
 
Old 07-23-2007, 04:20 PM   #3
DotHQ
Member
 
Registered: Mar 2006
Location: Ohio, USA
Distribution: Red Hat, Fedora, Knoppix,
Posts: 542

Original Poster
Rep: Reputation: 33
Quote:
Originally Posted by jlliagre
A CPU in i/o wait state isn't a busy CPU but an available (i.e. idle) one.

You shouldn't care that much about this metric which is usually more confusing than anything in investigating how a system performs.
Thanks for your reply. If we ignore that metric we are still left with the symptom, which is extremely slow data transfer rates. We had to move a 33 gig file from the SAN to a local drive. After 1 1/2 hours it was about 15% done. I/O wait was through the roof, in the 70 to 80% range. No other symptoms.

Where would you look to help debug such performance issues?
 
Old 07-23-2007, 07:57 PM   #4
twantrd
Senior Member
 
Registered: Nov 2002
Location: CA
Distribution: redhat 7.3
Posts: 1,438

Rep: Reputation: 52
Have you recompiled the latest module for your HBA?

-twantrd
 
Old 07-23-2007, 11:42 PM   #5
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris10, Solaris 11, Mint, OL
Posts: 9,499

Rep: Reputation: 355
Quote:
Originally Posted by DotHQ
If we ignore that metric we are still left with the symptom which is extremely slow data transfer rates. We had to move a 33 gig file from the SAN to a local drive. After 1 1/2 hours it was about 15% done.
That's ~8 Mbps. Quite disappointing indeed.
Quote:
I/O was through the roof. 70 and 80% range. No other symptoms.

Where would you look to help debug such performance issues?
You should investigate all the components that are involved in the transfer. The i/o wait figure mostly tells you the CPU isn't the bottleneck here. My first guess would be that your network has been downgraded to 10 Mbps for some reason.
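
The "~8 Mbps" estimate can be checked with a little arithmetic. The sketch below uses the figures from the thread (33 gig file, about 15% done after 1 1/2 hours) and assumes decimal gigabytes; the function name is only for illustration.

```python
def observed_rate(file_gb: float, fraction_done: float, hours: float) -> tuple:
    """Return (megabytes per second, megabits per second) for a partial copy."""
    bytes_moved = file_gb * 1e9 * fraction_done  # assumes decimal gigabytes
    bytes_per_second = bytes_moved / (hours * 3600)
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


mb_s, mbit_s = observed_rate(file_gb=33, fraction_done=0.15, hours=1.5)
print(f"{mb_s:.2f} MB/s, {mbit_s:.1f} Mbit/s")

# Hours the whole file would need at this rate.
hours_total = 1.5 / 0.15
print(f"full copy at this rate: about {hours_total:.0f} hours")
```

This comes to roughly 7.3 Mbit/s (about 7.9 if the 33 gig are binary gigabytes), which is in line with the estimate and suspiciously close to what a 10 Mbps link would deliver.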
 
Old 07-25-2007, 01:27 AM   #6
elcody02
Member
 
Registered: Jun 2007
Posts: 52

Rep: Reputation: 17
It would help to know what your infrastructure looks like (how many servers are attached to your SAN storage, and how? How are the SAN switches integrated?), what drivers you are using for your HBAs (module options are also very important), on which RHEL releases, and what storage system lies underneath.

Another question: was the performance always this bad? If not, what changed in the meantime?

You might also want to start collecting statistics on all the servers attached to the storage array under investigation. sar (part of the sysstat package on RHEL), or just iostat, or at least vmstat can help a lot. You should collect the I/O figures on the SAN from all servers and sum them. Then you know how much I/O you are getting out of your storage system. For just copying data from the SAN to, say, local disks and doing nothing else, you should get much higher performance than stated.

Hope that helps
have fun.
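
As a rough sketch of the "collect and sum" idea above: the snippet below computes per-device throughput from two snapshots of /proc/diskstats and adds up the figures from several servers. It assumes a kernel that provides /proc/diskstats, which may not apply to the older AS3 kernel (there the same numbers would have to come from iostat or sar). The device name, server names, and counter values are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sector_counters(diskstats_text: str) -> dict:
    """Map device name -> (sectors read, sectors written)."""
    counters = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip lines that lack the full set of counters
        counters[fields[2]] = (int(fields[5]), int(fields[9]))
    return counters


def throughput_mb_s(before: str, after: str, interval_s: float, device: str) -> tuple:
    """Return (read MB/s, write MB/s) for one device between two snapshots."""
    r0, w0 = sector_counters(before)[device]
    r1, w1 = sector_counters(after)[device]
    scale = SECTOR_BYTES / interval_s / 1e6
    return (r1 - r0) * scale, (w1 - w0) * scale


def san_total(per_server: dict) -> tuple:
    """Sum (read, write) throughput over all servers attached to the array."""
    reads = sum(r for r, _ in per_server.values())
    writes = sum(w for _, w in per_server.values())
    return reads, writes


# Hypothetical snapshots taken 10 seconds apart on one server.
before = "   8       0 sda 1000 0 200000 500 400 0 80000 300 0 600 800"
after = "   8       0 sda 1500 0 220000 700 400 0 80000 300 0 800 1000"

db1 = throughput_mb_s(before, after, interval_s=10, device="sda")
# "db1" and "db2" are placeholder server names; db2's figures are invented.
total = san_total({"db1": db1, "db2": (2.0, 0.5)})
print(f"db1 read {db1[0]:.3f} MB/s, SAN total read {total[0]:.3f} MB/s")
```

Comparing the summed figure with what the storage array is supposed to deliver shows whether the array is anywhere near its limit or whether the bottleneck sits between the host and the array.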
 
Old 07-26-2007, 08:32 PM   #7
DotHQ
Member
 
Registered: Mar 2006
Location: Ohio, USA
Distribution: Red Hat, Fedora, Knoppix,
Posts: 542

Original Poster
Rep: Reputation: 33
Quote:
Originally Posted by twantrd
Have you recompiled the latest module for your HBA?

-twantrd
Good thought, and yes, we have the latest module installed for our QLogic cards.
 
Old 07-26-2007, 08:35 PM   #8
DotHQ
Member
 
Registered: Mar 2006
Location: Ohio, USA
Distribution: Red Hat, Fedora, Knoppix,
Posts: 542

Original Poster
Rep: Reputation: 33
Quote:
Originally Posted by jlliagre
That's ~8 Mbps. Quite disappointing indeed.
You should investigate all the components that are involved in the transfer. The i/o wait figure mostly tells you the CPU isn't the bottleneck here. My first guess would be that your network has been downgraded to 10 Mbps for some reason.
We have a 4 Gb fiber cable connected to the Brocade switch, which connects to the CX700. I do not have permission to see the SAN stats, but the SAN guys are reporting no performance peaks; everything looks normal. They added that their tool kit leaves a lot to be desired, and they are in the process of upgrading their tools.
 
Old 07-26-2007, 08:39 PM   #9
DotHQ
Member
 
Registered: Mar 2006
Location: Ohio, USA
Distribution: Red Hat, Fedora, Knoppix,
Posts: 542

Original Poster
Rep: Reputation: 33
Quote:
Originally Posted by elcody02
It would help to know what your infrastructure looks like (how many servers are attached to your SAN storage, and how? How are the SAN switches integrated?), what drivers you are using for your HBAs (module options are also very important), on which RHEL releases, and what storage system lies underneath.

Another question: was the performance always this bad? If not, what changed in the meantime?

You might also want to start collecting statistics on all the servers attached to the storage array under investigation. sar (part of the sysstat package on RHEL), or just iostat, or at least vmstat can help a lot. You should collect the I/O figures on the SAN from all servers and sum them. Then you know how much I/O you are getting out of your storage system. For just copying data from the SAN to, say, local disks and doing nothing else, you should get much higher performance than stated.

Hope that helps
have fun.
I listed as much of the environment as I can in the previous posts. We are zoned on the SAN so contention should be kept to a minimum.

Performance had this same symptom when we were benchmarking before going live. That is when the SAN's lame tool kit first reared its ugly head.
 
Old 07-27-2007, 01:36 AM   #10
elcody02
Member
 
Registered: Jun 2007
Posts: 52

Rep: Reputation: 17
Another way to dig deeper into the performance issue is to check the type of failover you are doing. Are you using device-mapper, the built-in QLogic failover, or PowerPath (as the CX700 is, IMHO, EMC)? I think I would not use the QLogic driver for failover and multipathing (except for RHEL AS3; I don't know what the appropriate way was there. Is the QLogic driver supported with EMC CX on RHEL3? I recall problems.). I would dig deeper into multipathing.

Because your system is waiting for I/Os, either the driver (plus multipathing) cannot handle the data as expected and therefore does not deliver it, or the storage system (not the switches or anything else) cannot deliver the data as requested.

And talk to your SAN guys. What exactly is the SAN toolkit you are talking about?

Good luck!!
 
  


All times are GMT -5.