If there is "swap usage," and especially if you are seeing swap writes, then you can be fairly certain that some of the memory you are reading in is getting swapped back out, which is exactly the kind of load you don't want in this case.
It's also obvious to me that 75 threads is far, far too many. You should probably repeat this test with only about 8 or 10 threads. Basically, the "ruling constraint" on this system is going to be the number of simultaneous I/O requests you can handle -- which is really going to be determined by the number of drives you have, and the number of independent I/O-channels leading out to those drives. EIDE really can't cut the mustard like SCSI and FireWire can. Multiple controllers on different IRQs may be needed.
Scatter your "10 million files" into a nest of subdirectories, using increasingly larger prefixes of the ID as the subdirectory names, so that for example item #123456 ends up at "/mydata/1/12/1234/123456.txt." Then, using symlinks or whatnot, distribute those files among drives and make sure that your I/O subsystem can support truly simultaneous disk access to all of them.
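As a rough sketch, the ID-to-path mapping might look like this (the `shard_path` name and the 1/2/4 prefix lengths are just illustrative choices matching the example above):

```python
import os

def shard_path(item_id: int, root: str = "/mydata") -> str:
    """Map a numeric ID to a nested path using increasingly larger
    prefixes of the ID as subdirectory names."""
    s = str(item_id)
    # Prefixes of length 1, 2, and 4 become the directory levels.
    parts = [s[:n] for n in (1, 2, 4)]
    return os.path.join(root, *parts, s + ".txt")

print(shard_path(123456))  # → /mydata/1/12/1234/123456.txt
```

The point of the scheme is simply to keep any single directory from holding millions of entries, which many filesystems handle badly.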
Now, your application... "one thread per request" won't cut it. Instead, grab the requests from the socket, put them into a queue, and perhaps at that point start thinking about the layout of your drives.
For example, when the input thread puts another request into the queue, another thread wakes up to see if the request can be satisfied from an in-memory cache; if so, it sends the data straight to an outbound queue, where the output thread can pick it up and send it. Otherwise, the request gets dropped into one of several queues, each corresponding to a device.
A per-device thread takes each request, issues the file read (a synchronous read), gets the data, and puts it into the outbound queue.
The outbound cache-manager thread sees each completed request next and decides whether to keep a copy of the data in its cache. Either way, it forwards the request on to the output thread.
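The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not a drop-in implementation: the queue names, the two-device layout, and the `send` callback are all hypothetical, and a real version would need shutdown handling and a bounded cache.

```python
import queue
import threading

request_q = queue.Queue()                  # filled by the input thread
outbound_q = queue.Queue()                 # drained by the output thread
device_qs = {"sda": queue.Queue(),         # one queue per physical drive
             "sdb": queue.Queue()}
cache = {}                                 # in-memory cache: path -> bytes
cache_lock = threading.Lock()

def dispatcher():
    """Route each request to the cache or to its device's queue."""
    while True:
        path, device = request_q.get()
        with cache_lock:
            data = cache.get(path)
        if data is not None:
            outbound_q.put((path, data))   # cache hit: skip the disk entirely
        else:
            device_qs[device].put(path)    # cache miss: queue for that drive

def device_worker(q):
    """One thread per drive: synchronous reads, one at a time."""
    while True:
        path = q.get()
        with open(path, "rb") as f:
            data = f.read()
        outbound_q.put((path, data))

def output_worker(send):
    """Cache-manager + output: optionally keep a copy, then send."""
    while True:
        path, data = outbound_q.get()
        with cache_lock:
            cache.setdefault(path, data)
        send(path, data)
```

Note that the per-device queues are what serialize access to each drive: no matter how many requests arrive, each spindle only ever sees one outstanding read at a time.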
That is the kind of "separation of concerns" architecture that will deliver the performance you need, with only a handful of threads, orchestrated around the nature of the ruling constraint, which is the hardware.