LinuxQuestions.org (/questions/)
-   Red Hat (https://www.linuxquestions.org/questions/red-hat-31/)
-   -   high iowait in RHES (https://www.linuxquestions.org/questions/red-hat-31/high-iowait-in-rhes-336881/)

sklam 06-24-2005 04:00 PM

high iowait in RHES
 
Dear all

I found that I get a high IOwait in top when just copying a large file from one folder to another on RH Enterprise Server v3. I have tried hardware from several different vendors and got the same result.
However, the problem does not occur on Fedora or RH8.

Any idea what causes it?

Thanks

ddaas 07-08-2005 02:19 AM

I've also observed this on my RHEL ES 3 server.
Does anybody know why?

RedHatCat 07-12-2005 10:35 AM

Hmm, me too - just got a 1U back today that is showing signs of the same problems you describe (it has RHES 3 Update 3 installed).

It's a dual Xeon with 4GB RAM and hyper-threading enabled; when I watch "top" while copying a file from one partition to another, the IOwait is ~90-100% for each CPU, which is not good for a storage server.
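
For anyone trying to reproduce this, a rough sketch of watching the iowait while a copy runs (assuming a sysstat version that reports %iowait; the interval values are only examples):
Code:

# CPU line includes %iowait; repeat every 2 seconds
iostat 2

# extended per-device stats (await, %util) to see if a disk is saturated
iostat -x 2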

RedHatCat 07-15-2005 11:11 AM

You guys got Xeons in your machines? From the scraps of info I've read it seems to be a problem with RH es 3 and Xeons.

kbadeau 12-17-2005 12:57 AM

We've recently migrated one of our two production database servers from Solaris on a V880 to RHEL AS v3 U4 on an HP DL580 (dual Xeon).

We have a CLARiiON CX500 with Emulex LP9802-E cards running the RHEL-provided lpfc 7.1.14 driver for the Emulex card.

We find a 1 GB copy test takes 20 seconds on another Solaris machine with respectable activity; at peak times IO waits can run at about 50%.

On the HP/RHEL machine we find these copies can take over a minute on the first invocation. After that, caching kicks in and makes subsequent repeats of the same test take much less time (under 10 seconds).

We are running Oracle 9.2.0.4 and are finding performance to be very poor for certain heavy sequential read operations. We ruled out the Oracle database as the initial issue by doing a simple copy test at the OS level and found this discrepancy between the old Solaris DB server and the new RHEL server. During even light activity on the database, IO waits shoot into the 80-90% range.

I am wondering if anyone can offer experiences with a similar configuration and/or any solutions they may have encountered in addressing such issues.

We have been working with certain kernel params:
/proc/sys/vm/min-readahead
/proc/sys/vm/max-readahead
/proc/sys/vm/inactive_clean_percent
/proc/sys/vm/pagecache
to no significant avail.
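
For reference, a rough sketch of how these 2.4-kernel tunables are usually read and changed (the value 256 is only an example, not a recommendation):
Code:

# current settings (values are in pages)
cat /proc/sys/vm/min-readahead
cat /proc/sys/vm/max-readahead

# raise max readahead for large sequential reads (256 is only an example)
echo 256 > /proc/sys/vm/max-readahead

# or persistently, add "vm.max-readahead = 256" to /etc/sysctl.conf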

Looking forward to any feedback possible.

Thanks,
Kevin

Krietjur 02-20-2006 08:59 AM

We experience problems like this on our production server. The server is an HP ProLiant ML570 with 4GB RAM, four 3GHz Xeon processors and a Compaq Smart Array 64xx RAID controller, with four RAID-1 arrays and one RAID 1+0 array. I created a 1GB test file and timed how long it took to copy it to another array: 58 seconds. IOwait had peaks at 80-100% a lot of the time; between those peaks the IOwait was around 40%.
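
A rough version of the same test, in case anyone wants to reproduce it (the paths are only examples; the sync is there so buffered writes don't flatter the timing):
Code:

# create a 1GB test file on the first array
dd if=/dev/zero of=/array1/testfile bs=1M count=1024

# copy it to a different array and time it, flushing writes at the end
time sh -c "cp /array1/testfile /array2/testfile && sync"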

We only have one machine like this, but I tried the same test on a system with just one PII 350MHz processor and 512MB memory, copying from a software RAID-1 drive to a non-RAID SCSI drive, and there it took 52 seconds. IOwait was 0% the whole time... I don't know if that can be correct. top doesn't show iowait on that machine, so I just installed the sysstat package there to be able to watch it. This is a Gentoo machine, by the way.

I also did a third test, on a dual Xeon 3Ghz system, running Redhat Enterprise. There it took 1 minute and 56 seconds to copy the 1GB testfile. On this system, we use an Adaptec AAC-RAID controller with SATA disks. IOwait is 100% here almost all of the time during this action, CPU idle time 0%.

Then a final test, on my home Linux box (Debian). This is no high-performance machine, just a Pentium II 350MHz with 192MB memory and two simple IDE disks, each on its own IDE controller. Same test again. The IOwait sometimes had a little peak at 80%, but most of the time it was below 50%. The copy test itself took 1 minute 49 seconds. At that moment the machine had some other stuff to do (X Windows running, Azureus downloading), and it's still faster than the dual Xeon machine running Red Hat Enterprise from the third test, which was doing nothing else!

So I have tested four systems, two running Red Hat Enterprise and two others, and both Red Hat Enterprise systems are having the problem... :(

RedHatCat 02-21-2006 04:49 AM

I'm still keeping an eye on this subject :)

I spent a while playing with swap files last year; it really had little or no impact on IOwait. I got a bit nervous when building a blade up recently, because it seemed to show all the signs of an IOwait problem like the 1Us I had last year, but I hadn't realised it was still building the RAID1; by the morning it was shifting 2GB files in less than 50 secs again.

Switching from NCQ (native command queuing, i.e. SATA2-compliant) SATA disks to original non-NCQ disks seemed to cure our problems in the 1U servers (from Seagate Barracuda 400GBs to Maxtor MaxLine III 300GBs). What drives does your HP ProLiant ML570 have in it, Krietjur?

Take it easy,

Jim

Krietjur 02-22-2006 05:13 AM

There are SCSI disks in it, but I don't know which type etc. I don't know where I can find that info in Linux (normally I'm able to find it somewhere under /proc/scsi, but only my tape streamer is listed there). I've had some other hints that it might have to do with a cache-battery failure on the RAID controller. As soon as I have more information, I'll post it here of course :)
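
One note that may help here: Smart Array disks sit behind the cciss driver rather than the normal SCSI layer, which would explain why only the tape streamer shows up under /proc/scsi. The driver keeps its own /proc entry, though it mostly shows the logical drives, so the array utilities may still be needed for the physical drive models (the controller number 0 below is just an example):
Code:

# devices visible through the regular SCSI layer (vendor/model strings)
cat /proc/scsi/scsi

# Smart Array (cciss) controllers report through their own /proc entry
cat /proc/driver/cciss/cciss0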

RedHatCat 02-24-2006 03:05 AM

Hmm, for the third machine you tested, the one with SATA disks in it, can you find out what make and model the drives are? That to me looks like a full-on IOwait lockup, whereas the SCSI-based system just looks a bit slow. Exactly why it's slow, I'm not sure; have you tried copying the file between different RAIDs? I'd expect it to perform slightly better copying from or to the RAID 1+0, for example, than between plain RAID 1s. Were all the RAIDs on that box fully built when it was tested?

If the machines can't be taken down to have a look at the disks or in the RAID BIOS, I would expect to find info on the drives somewhere in /proc/. I often find the info inside a folder with the name of the driver for your RAID card. Good luck,

Jim

Krietjur 02-24-2006 07:09 AM

The copy actions I did on those RAID systems were all from one array to a different array. I've looked in /proc but can't find the model of the hard drives there. The SATA system was rebooted yesterday, and I saw then that it has 6 Maxtor disks, but I don't know what model. There's a RAID 1 and a RAID 1+0 configured there.

The RAIDs on the machines were fully built when tested; the machines have been running for about a year now, I think.

RedHatCat 03-01-2006 10:19 AM

Uh oh - just been told there's a system coming my way that's got lockup problems (it's another 1U SATA-RAID system). The symptoms sound like another IOwait lockup, but the snippets from 'top' that I've seen show 0% IOwait and ~100% on user/system processes.

The plan is to replace the 400Gb NCQ drives with 300Gb non-NCQ ones and see how we go from there. If anything useful comes to light I'll let you know.

Krietjur 03-06-2006 03:14 AM

I had a chance to reboot the ProLiant ML570 last week and boot it from the SmartStart CD. I did a run of the ADU tool, and it reported no errors, so I guess the RAID controller is OK. I also have a list of the drives in the system now:
Code:

driveid raid port 1    raid port 2    raid type
--------------------------------------------------
0        BF03688284      BF03688284      Raid 1
1        BF07285A36      BF07285A36      Raid 1
2        BF03688284      BF03688284      Raid 1+0
3        BF03688284      BF03688284      Raid 1+0
4        BF03685A35      BF03688284      Raid 1 <-- this one is a bit strange, two different drives in the same array
5        BF03688284      BF03688284      Raid 1


RedHatCat 03-10-2006 01:26 PM

Early reports suggest the machine with the suspected IOwait problem is now back on top form. It was acting very odd whilst I watched "top" and seemed only to recognise 3 of the 4 cores (dual Xeons). I swapped the disks for 300GBs from the 400GB SATAs that were in there anyway, but disabling HT seemed to be the turning point. This was a slightly different platform from the one we usually have issues with, which makes me think some kind of issue with the cores being recognised was more likely.

Not sure about the mirror with the different model disks - they are obviously the same size, but perhaps a slightly different spec; I'm not sure if this could have an impact. I'm not even sure how well disks from totally different manufacturers would play together in a SCSI/SATA array; my old IDE RAID in an Athlon box doesn't care what disks it uses, but it's hardly a performance machine/server. If I get an hour with my Xeon in the next few weeks, I'll build an array or two and see how mixed model or manufacturer disks perform.

simcolor 03-26-2006 10:06 AM

Hi, I have a similar problem. We have a RAID-5 array with 12 disks, and a compute farm with 7 dual-Xeon nodes. It gets very slow, even for the ls command and tab-completion.

I want to try turning off the HT option; can you tell me how? Thx!

simcolor 03-26-2006 10:12 AM

On www(dot)felipecruz(dot)com/blog_high-iowait-times-may21.php it says upgrading the kernel may solve the problem. Has anyone tried this?

Krietjur 03-27-2006 12:44 AM

I recently upgraded my kernel from 2.4.21-4 to 2.4.21-37 because of the newer drivers for the RAID controller. This didn't help. Last week we upgraded the controller firmware from 2.32 to 2.48 and upgraded the cache memory from 128 to 512 MB. I'll have to watch the server for a while to see if this helped. I'll post when I know more.

RedHatCat 03-27-2006 05:58 AM

Hyper-threading is an option enabled by default in most BIOSes; it might take a little bit of searching around in the menus, but it'll be in there somewhere. Simply set it to 'Disabled' and the pseudo-CPU(s) should disappear.
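
A quick way to double-check the result from the running system (logical CPU count only, nothing BIOS-specific):
Code:

# logical CPUs the kernel sees: a dual Xeon shows 4 with HT enabled,
# and should drop to 2 once HT is disabled in the BIOS
grep -c ^processor /proc/cpuinfo

# where the kernel exposes them, "siblings" > 1 per physical id also means HT is on
grep -E "physical id|siblings" /proc/cpuinfo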

iolotusbobo 07-28-2006 02:21 PM

Problems with Proliant 64xx RAID on a 4Proc HP machine
 
We have some hard disk problems on a Proliant 64xx controller. The machine works OK when we put 300 GB + 150 GB disks in a RAID 0 config to get a total of 300 GB of space. But if we put in 300 GB + 300 GB disks, the disks randomly go read-only or the superblock becomes inaccessible.

We are using RHEL4 with kernel-smp-2.6.9-22.

If anyone has faced such a problem or has some clue about the source of the problem - do tell.

iolotusbobo 07-28-2006 03:59 PM

dmesg output
 
For the above problem: after the disk goes into read-only mode, it becomes inaccessible.

dmesg gives output lines saying -

Buffer I/O error on device cciss/c1d0p1, logical block xxx
EXT3-fs error (device cciss/c1d0p1) in start_transaction: Journal has aborted

And now this problem is coming with the 300+150 disks as well...
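
Once ext3 aborts its journal like that, the filesystem stays read-only until it is checked. A rough sketch of the generic recovery steps (the device name is taken from the messages above, the mount point is only an example; this gets the filesystem back but doesn't explain why the controller is failing):
Code:

umount /dev/cciss/c1d0p1

# replay/repair the journal before remounting
e2fsck -fy /dev/cciss/c1d0p1
mount /dev/cciss/c1d0p1 /mnt/data

# and capture the controller/driver messages around the failure
dmesg | grep -i cciss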

linulex 08-01-2006 06:04 AM

Quote:

Originally Posted by RedHatCat
You guys got Xeons in your machines? From the scraps of info I've read it seems to be a problem with RH es 3 and Xeons.

To be more correct: with RHEL 3 and 800MHz Xeons

We are a hosting company and have several dozen dual Xeons running on RHEL 3, and have been using it since it came out. In October last year our stock of 533MHz Xeons was depleted and we were thinking about making the switch to 800MHz machines.
The first server we replaced was one with about 300 domains on it, and the very same day we started getting calls that the machine was very slow. We immediately switched back to a 533MHz machine and everything was back to normal.
Then we started testing with different mobos with different chipsets (all Supermicro), got new BIOS updates from them, the works. On 533MHz Xeons: no problem; on 800MHz: big problem. We haven't tested on higher MHz boards yet.

As a professional hosting company we can't afford to just test it under pressure (with clients on it reading a lot of files), so we bought up all the 533MHz Xeons we could find and are now waiting for RHEL 5 to be released.

Regards
Jan

arazas 12-14-2006 12:35 PM

Hello guys,

I had the same problem with one of my RHEL4 servers; the problem was high swap and CPU usage due to the mysql process. I upgraded the kernel and it fixed itself after a reboot.

Again today, I upgraded the kernel on another server (a RHEL3 server) on which I was suffering from high IOWAIT. It seems to be working perfectly after the reboot.

It's very simple to upgrade on RHEL:

Code:

# see current version
$ uname -r

# upgrade
$ up2date -if kernel


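Worth adding, assuming the stock RHEL behaviour: up2date installs the new kernel alongside the old one, so it only takes effect after rebooting into it.

Code:

# after the reboot, confirm the running kernel actually changed
$ uname -r

# list the installed kernel packages (kernel-smp on multi-CPU boxes)
$ rpm -q kernel kernel-smp
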
---Cheers

