high iowait in RHES

sklam · 06-24-2005, 04:00 PM

Dear all

I see a high IOwait in top when I simply copy a large file from one folder to another on RH Enterprise Server v3. I have tried hardware from several different vendors and get the same result.
However, the problem does not occur on Fedora or RH8.

Any ideas?

Thanks

ddaas · 07-08-2005, 02:19 AM

I've also observed that on my RHEL ES 3 server.
Does anybody know why?

RedHatCat · 07-12-2005, 10:35 AM

Hmm, me too - just got a 1U back today that is showing signs of the same problems you describe (it has RHES 3 Update 3 installed).

It's a dual Xeon with 4GB RAM and hyper-threading enabled. When I watch "top" while copying a file from one partition, the IOwait is ~90-100% for each CPU, which is not good for a storage server.
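For anyone who wants the per-CPU IOwait figure without watching top, here is a minimal Python sketch that samples /proc/stat twice and reports the share of time spent in iowait. It assumes the kernel exposes iowait as the fifth counter on each "cpu" line (standard on 2.6 kernels; whether a given 2.4-based RHEL 3 kernel does so is an assumption to check). The function names are made up for illustration.

```python
import time


def parse_cpu_lines(stat_text):
    """Return {cpu_name: [counters]} for every 'cpu*' line of /proc/stat text."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(v) for v in fields[1:]]
    return counters


def iowait_percent(before, after):
    """Percent of elapsed time spent in iowait between two samples of one CPU.

    Assumes the counter order user, nice, system, idle, iowait, irq,
    softirq, steal. Only those first eight fields are summed, because the
    later guest counters are already included in user time.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if len(deltas) < 5 or total <= 0:
        return 0.0  # no iowait column, or no time elapsed
    return 100.0 * deltas[4] / total


def read_sample(path="/proc/stat"):
    """Read and parse the CPU counters from /proc/stat."""
    with open(path) as f:
        return parse_cpu_lines(f.read())


def print_iowait(interval=1.0):
    """Take two samples `interval` seconds apart and print iowait per CPU."""
    first = read_sample()
    time.sleep(interval)
    second = read_sample()
    for name in sorted(second):
        if name in first:
            pct = iowait_percent(first[name], second[name])
            print("%-6s iowait %5.1f%%" % (name, pct))
```

Calling print_iowait() in a second terminal while the copy runs should give roughly the same picture as top, with the aggregate "cpu" line alongside cpu0, cpu1 and so on.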

RedHatCat · 07-15-2005, 11:11 AM

You guys got Xeons in your machines? From the scraps of info I've read it seems to be a problem with RH es 3 and Xeons.

kbadeau · 12-17-2005, 12:57 AM

We've recently migrated one of our two production database servers from Solaris on a v880 to RHEL AS v3 U4 on an HP DL580 (dual Xeon).

We have a CLARiiON CX500 with Emulex LP9802-E cards running the RHEL-provided lpfc 7.1.14 driver for the Emulex card.

We find a copy test of 1 GB takes 20 seconds on another Solaris machine with respectable activity; at peak times IO waits there can run about 50%.

On the HP/RHEL machine these copies can take over a minute on first invocation. After that, caching kicks in and subsequent repeats of the same test take much less time (under 10 seconds).

We are running Oracle 9.2.0.4 and are finding performance to be very poor for certain heavy sequential read operations. We eliminated the Oracle database as the initial issue by doing a simple copy test at the OS level and found this discrepancy between the old Solaris DB server and the new RHEL server. During even light activity on the database, IO waits shoot into the 80-90% range.

I am wondering if anyone can offer experiences with a similar configuration and/or any solutions they may have encountered in addressing such issues.

We have been working with certain kernel params:
/proc/sys/vm/min-readahead
/proc/sys/vm/max-readahead
/proc/sys/vm/inactive_clean_percent
/proc/sys/vm/pagecache
to no significant avail.
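
When experimenting with these tunables it helps to record their values before and after each change. The following is a minimal Python sketch that reads the four files named above and reports None for any that the running kernel does not provide. The helper name read_tunables is hypothetical, and the sketch only reads values, it does not change them.

```python
import os

# The tunables listed in the post, relative to /proc/sys/vm.
VM_TUNABLES = [
    "min-readahead",
    "max-readahead",
    "inactive_clean_percent",
    "pagecache",
]


def read_tunables(names, base="/proc/sys/vm"):
    """Return {name: value string, or None if the file cannot be read}."""
    values = {}
    for name in names:
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                values[name] = f.read().strip()
        except (IOError, OSError):
            # Not every kernel exposes every tunable; record that as None.
            values[name] = None
    return values


def print_tunables():
    """Print the current value of each tunable, one per line."""
    for name, value in sorted(read_tunables(VM_TUNABLES).items()):
        print("%-24s %s" % (name, value if value is not None else "(not present)"))
```

Running print_tunables() before and after each timed copy makes it easier to tie a given result to the settings that were in effect.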

Looking forward to any feedback possible.

Thanks,
Kevin

Krietjur · 02-20-2006, 08:59 AM

We experience problems like this on our production server. The server is an HP Proliant ML570 with 4GB RAM, four 3GHz Xeon processors and a Compaq Smart Array 64xx RAID controller, with four RAID-1 arrays and one RAID 1+0 array. I created a 1GB test file and timed how long it took to copy it to another array. It took 58 seconds. IOwait peaked at 80-100% a lot of the time, and between those peaks it was around 40%.

We only have one machine like this, but I tried the same on a system with only one PII 350MHz processor and 512MB memory, copying from a software RAID-1 drive to a non-RAID SCSI drive, and there it took 52 seconds. IOwait was 0% all the time. I don't know if this can be correct: top doesn't show iowait there, and I had only just installed the sysstat package on that machine to be able to watch the iowait. This is a Gentoo machine, by the way.

I also did a third test, on a dual Xeon 3GHz system running Red Hat Enterprise. There it took 1 minute and 56 seconds to copy the 1GB test file. On this system we use an Adaptec AAC-RAID controller with SATA disks. IOwait is 100% here almost all of the time during this action, with CPU idle time at 0%.

Then a final test, on my home Linux box (Debian). This is no high-performance machine, just a Pentium II 350MHz with 192MB memory and two simple IDE disks, each on its own IDE controller. Same test again. The IOwait sometimes had a little peak at 80%, but most of the time it was below 50%. The copy test itself took 1 minute 49 seconds. At that moment the machine had some other stuff to do (X-Windows running, Azureus downloading), and it's still faster than the dual Xeon machine running Red Hat Enterprise from the third test, which was doing nothing else!

So I have tested four systems, two running Red Hat Enterprise and two running other distributions, and both Red Hat Enterprise systems have the problem...
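
To make copy tests like these comparable between machines, the same script can be run on each one. The following is a minimal Python sketch, not the procedure actually used in the posts: it writes a test file of a given size, then times a copy and forces the destination to disk before stopping the clock, so the page cache does not flatter the result. The function names and paths are placeholders. The source file may still be in cache right after it is written, so the read side can look faster than a cold copy would.

```python
import os
import shutil
import time


def make_test_file(path, size_bytes, chunk=1024 * 1024):
    """Write a zero-filled file of exactly size_bytes and flush it to disk."""
    block = b"\0" * chunk
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def timed_copy(src, dst):
    """Copy src to dst and return the elapsed seconds, including the fsync."""
    start = time.time()
    shutil.copyfile(src, dst)
    # Force the copied data out of the page cache and onto the device.
    fd = os.open(dst, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.time() - start


def run_copy_test(src, dst, size_bytes=1024 ** 3):
    """Create a test file (1 GB by default), copy it, and print the timing."""
    make_test_file(src, size_bytes)
    elapsed = timed_copy(src, dst)
    print("copied %d bytes in %.1f s" % (size_bytes, elapsed))
```

For an array-to-array test, src and dst would point at mount points on two different arrays, for example run_copy_test("/array1/testfile", "/array2/testfile") with whatever paths apply locally.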

RedHatCat · 02-21-2006, 04:49 AM

I'm still keeping an eye on this subject.

I spent a while playing with swap files last year, and it really had little or no impact on IOwait. I got a bit nervous when building a blade recently, because it seemed to show all the signs of an IOwait problem like the 1Us I had last year. I hadn't realised it was still building the RAID1, and by the morning it was shifting 2GB files in less than 50 seconds again.

Switching from NCQ (native command queuing, or SATA2-compliant) SATA disks to original non-NCQ disks seemed to cure our problems in the 1U servers (from Seagate Barracuda 400GB drives to Maxtor Maxline III 300GB drives). What drives does your HP Proliant ML570 have in it, Krietjur?

Take it easy,

Jim

Krietjur · 02-22-2006, 05:13 AM

There are SCSI disks in it, but I don't know which type. I don't know where I can find that info in Linux (normally I'm able to find it somewhere under /proc/scsi, but only my tape streamer is listed there). I've had some other hints that it might have to do with a cache-battery failure on the RAID controller. As soon as I have more information, I'll post it here, of course.
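
Where /proc/scsi/scsi does list the devices, the vendor and model strings can be pulled out with a short script. The following is a minimal Python sketch that assumes the usual "Host: ... / Vendor: ... Model: ... Rev: ... / Type: ..." layout of that file; the function name is made up for illustration. Disks that sit behind a hardware RAID controller are often hidden from this list, which would match the tape streamer being the only entry here.

```python
import re

_HOST = re.compile(r"Host:\s*(\S+)\s+Channel:\s*(\d+)\s+Id:\s*(\d+)\s+Lun:\s*(\d+)")
_MODEL = re.compile(r"Vendor:\s*(.*?)\s+Model:\s*(.*?)\s+Rev:\s*(\S*)")
_TYPE = re.compile(r"Type:\s*(.*?)\s+ANSI")


def parse_scsi_listing(text):
    """Parse /proc/scsi/scsi style text into a list of device dicts."""
    devices = []
    current = None
    for line in text.splitlines():
        m = _HOST.search(line)
        if m:
            # A "Host:" line starts a new device record.
            current = {
                "host": m.group(1),
                "channel": int(m.group(2)),
                "id": int(m.group(3)),
                "lun": int(m.group(4)),
            }
            devices.append(current)
            continue
        if current is None:
            continue  # skip the "Attached devices:" header
        m = _MODEL.search(line)
        if m:
            current["vendor"] = m.group(1)
            current["model"] = m.group(2)
            current["rev"] = m.group(3)
            continue
        m = _TYPE.search(line)
        if m:
            current["type"] = m.group(1)
    return devices


def list_scsi_devices(path="/proc/scsi/scsi"):
    """Read the given file and return the parsed device list."""
    with open(path) as f:
        return parse_scsi_listing(f.read())
```

Each returned dict carries the host, channel, id and lun numbers plus whatever vendor, model, revision and type strings the kernel reported for that device.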

RedHatCat · 02-24-2006, 03:05 AM

Hmm, the third machine you tested, the one with SATA disks in it: can you find out what make and model the drives are? That to me looks like a full-on IOwait lockup, whereas the SCSI-based system just looks a bit slow. Exactly why it's slow, I'm not sure. Have you tried copying the file between different RAIDs? I'd expect it to perform slightly better, for example, copying from or to the RAID 1+0 than between plain RAID 1s. Were all the RAIDs on that box fully built when it was tested?

If the machines can't be downed to have a look at the disks, or in the RAID BIOS, I would expect to find info on the drives in the /proc/ directory somewhere. I often find the info is inside a folder with the name of the driver for your RAID card. Good luck,

Jim

Krietjur · 02-24-2006, 07:09 AM

The copy actions I did on those RAID systems were all from one array to a different array. I've looked in /proc but can't find the model of the hard drives there. The SATA system was rebooted yesterday, and I saw then that it has 6 Maxtor disks, but I don't know what model. There's a RAID 1 and a RAID 1+0 configured there.

The RAIDs on the machines were fully built when tested; the machines have been running for about a year now, I think.

RedHatCat · 03-01-2006, 10:19 AM

Uh oh - just been told there's a system coming my way that's got lockup problems (it's another 1U SATA-RAID system). The symptoms sound like another IOwait lockup, but the snippets from 'top' that I've seen show 0% IOwait and ~100% on user/system processes.

The plan is to replace the 400GB NCQ drives with 300GB non-NCQ ones and see how we go from there. If anything useful comes to light I'll let you know.

Krietjur · 03-06-2006, 03:14 AM

I had a chance to reboot the Proliant ML570 last week and boot it from the SmartStart CD. I did a run of the ADU tool, and it reported no errors, so I guess the RAID controller is OK. I also have a list of the drives in the system now:

Code:

driveid  raid port 1  raid port 2  raid type
--------------------------------------------
0        BF03688284   BF03688284   Raid 1
1        BF07285A36   BF07285A36   Raid 1
2        BF03688284   BF03688284   Raid 1+0
3        BF03688284   BF03688284   Raid 1+0
4        BF03685A35   BF03688284   Raid 1    <-- this one is a bit strange, two different drives in the same array
5        BF03688284   BF03688284   Raid 1

RedHatCat · 03-10-2006, 01:26 PM

Early reports suggest the machine with the suspected IOwait is now back on top form. It was acting very odd whilst I watched "top" and seemed only to recognise 3 of the 4 cores (dual Xeons). I swapped the 400GB SATA disks that were in there for 300GB ones anyway, but disabling HT seemed to be the turning point. This was a slightly different platform to the one we usually have issues with, which makes me think some kind of issue with the cores being recognised was more likely.

Not sure about the mirror with the different model disks. They are obviously the same size but perhaps slightly different spec, and I'm not sure if this could have an impact. I'm not even sure how well disks from totally different manufacturers would play together in a SCSI/SATA array. My old IDE RAID in an Athlon box doesn't care what disks it uses, but it's hardly a performance machine/server. If I get an hour with my Xeon in the next few weeks, I'll build an array or two and see how mixed model or manufacturer disks perform.

simcolor · 03-26-2006, 10:06 AM

Hi, I have got similar problems. We have a RAID-5 array with 12 disks, and a computer farm with 7 dual-Xeon nodes. It gets very slow, even for the ls command and tab completion.

I want to try turning off the HT option. Can you tell me how? Thx!
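
HT itself is normally switched off in the machine's BIOS setup, so there is little to script, but it is easy to check from Linux whether it is currently active and whether all logical CPUs are being recognised. The following is a minimal Python sketch that parses /proc/cpuinfo. It assumes the "physical id" and "siblings" fields are present and treats a missing "cpu cores" field as meaning a single-core package (true for Xeons of this era, not in general). The function names are made up for illustration.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one {field: value} dict per logical CPU."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the record for one logical CPU.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def count_logical_and_packages(cpus):
    """Return (number of logical CPUs, number of distinct physical packages)."""
    packages = set(c["physical id"] for c in cpus if "physical id" in c)
    return len(cpus), len(packages)


def hyperthreading_active(cpus):
    """True or False if it can be told from 'siblings', otherwise None."""
    for cpu in cpus:
        if "siblings" in cpu:
            # Assumption: no "cpu cores" field means one core per package.
            cores = int(cpu.get("cpu cores", 1))
            return int(cpu["siblings"]) > cores
    return None


def report(path="/proc/cpuinfo"):
    """Print a one-line summary for the running machine."""
    with open(path) as f:
        cpus = parse_cpuinfo(f.read())
    logical, packages = count_logical_and_packages(cpus)
    print("logical CPUs: %d, packages: %d, HT active: %s"
          % (logical, packages, hyperthreading_active(cpus)))
```

On a dual Xeon with HT enabled, one would expect 4 logical CPUs across 2 packages; seeing only 3 would match the odd behaviour described earlier in the thread.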

simcolor · 03-26-2006, 10:12 AM

On www(dot)felipecruz(dot)com/blog_high-iowait-times-may21.php
it says upgrading the kernel may solve the problem. Has anyone tried this?