CPUs in high IOwait state despite of lack of load

kvsraju · 09-27-2005, 12:16 PM

Hi Friends,

We are using Red Hat Linux AS 3.0 on IBM blade server with Oracle 10g.
We have mutiple CPUs (8 CPUs) and 2 nodes RAC Cluster.
Some times CPU usage on two Database nodes showing high IO wait despite of lack of load. That time we are not running any db jobs.
We are check this IOwait using "top" command.

Can any please let us know what may be the reason.

basemodel · 10-07-2005, 11:25 PM

We need more information. Are both these nodes connected to a fibre-channel SAN or SCSI disk array, NFS, etc? Are you using ASM or OCFS? I suggest you install the 'sysstat' package if you don't have it already, this is very useful in determining I/O issues. For example, the command 'iostat -x 1' will show disk IO for each partition, updated once every second. Once you find the partition with the most read/write activity, or avgqu (Average I/O queue), you can do something like 'fuser -vm /dev/sdaX' (where /dev/sdaX is your partition). This will give a list of processes using that partition, so you can narrow it down much more easily. Please post:

1. Speed of the attached storage (1GB for SAN, 160MB/s SCSI, etc)
2. Any PCI Cards involved (SCSI, HBA)
3. Method of connectivity (Direct attached SCSI storage, Fibre attached SAN with or without Fibre switches)

basemodel · 10-07-2005, 11:26 PM

Furthermore, are there any I/O errors in /var/log/messages or dmesg?

joerg.weis · 11-03-2005, 03:34 PM

Hi

we do have a similar problem?
We are running redHat AS with a 2.4.x Kernel on HP Blade in SMP Mode with 2 Processors and 4GB Ram each. The blades are connected to a HP SAN via a QLogic adapter.

We are running a big search engine (FAST) on the machines. The nodes we have problems with are the crawler nodes. The iowait is very high there and the document crawl rate is very low, cpu usage is low and the memory consumption is very ok.

I have done the things mentioned in the post before:

iostat -x 1 showed me that

Code:

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
/dev/sda    24.05 827.85 56.96 391.14  648.10 9772.15   324.05  4886.08    23.25    58.35   12.94   1.10  49.37
/dev/sda1   24.05 827.85 56.96 391.14  648.10 9772.15   324.05  4886.08    23.25    58.35   12.94   1.10  49.37

which tells mee that the most frequently used devices are the mounted san devices which are /dev/sda and /dev/sda1

then i executed the mentioned "fuser -vm /dev/sda1" command
which returned the output below, what showed me that the processes
writing to the san are all from our search engine.

Code:

                     USER        PID ACCESS COMMAND
/dev/sda1            fastusr   16212 f....  cachemanager
                     fastusr   16218 f....  frtsobj
                     fastusr   16266 f....  statusserver
                     fastusr   16283 f....  fsearch
                     fastusr   16291 f....  fsearch
                     fastusr   16299 f....  fsearch
                     fastusr   16306 f....  anchorserver
                     fastusr   16355 f.c..  mysqld
                     fastusr   16356 f.c..  mysqld
                     fastusr   16357 f.c..  mysqld
                     fastusr   16358 f.c..  mysqld
                     fastusr   16359 f.c..  mysqld
                     fastusr   16360 f.c..  mysqld
                     fastusr   16363 f.c..  mysqld
                     fastusr   16364 f.c..  mysqld
                     fastusr   16365 f.c..  mysqld
                     fastusr   16366 f.c..  mysqld
                     fastusr   16647 f....  crawler
                     fastusr   16650 f....  crawlerfs
                     fastusr   16652 f....  uberslave
                     fastusr   16654 f....  uberslave
                     fastusr   16655 f....  uberslave
                     fastusr   17083 f....  uberslave
                     fastusr   17140 f....  uberslave
                     fastusr   17142 f....  uberslave
                     fastusr   17200 f....  postprocess
                     fastusr   17202 f....  uberslave
                     fastusr   21036 f.c..  mysqld
                     fastusr   21421 f.c..  mysqld
                     fastusr   21422 f.c..  mysqld
                     fastusr   21423 f.c..  mysqld
                     fastusr   21424 f.c..  mysqld

Than I had a look into dmesg for file system errors and got the following output:

Code:

EXT3-fs error (device sd(8,1)): ext3_readdir: bad entry in directory #2670799: rec_len % 4 != 0 - offset=0, inode=926102069, rec_len=13874, name_len=10
EXT3-fs warning (device sd(8,1)): empty_dir: bad directory (dir #2670799) - no `.' or `..'
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 5342380
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 6750760
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 4654459
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 4481402
EXT3-fs error (device sd(8,1)): ext3_readdir: bad entry in directory #2588774: rec_len % 4 != 0 - offset=0, inode=1702129263, rec_len=29806, name_len=45
EXT3-fs warning (device sd(8,1)): empty_dir: bad directory (dir #2588774) - no `.' or `..'
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 5178169
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 4358984
EXT3-fs error (device sd(8,1)): ext3_free_blocks: bit already cleared for block 2753246

(this is just a bit of the errors I got)

No errors in /var/log/messages file

Please give me some hints what to do to reduce the iowait.
Normally the crawling machines should be able to crawl and process a few thousand documents per minute at the moment we are runnin on 9 documents per second, far away from beeing speedy.

Regards for every answer