Old 04-08-2009, 06:30 AM   #1
Jung
LQ Newbie
 
Registered: Jan 2009
Posts: 4

Rep: Reputation: 0
dd if=file1 of=file2 bs=1k PUSHES AVG QUEUE LENGTH UP TO 80 !!!!!


Hi all,

I assume the following command would read 1k, write it to the output, and then repeat that cycle to the end.

# dd if=/dev/sda of=/dev/null bs=1k

Because 1k is only 2 sectors on disk, I would guess there is only one request to the disk.
That is, dd would issue a request to read 1k, wait for the data, write it to /dev/null, and then issue the next read.
So there can be only one request in the disk's request queue at any time, or at least that is what I thought.

However, what happened actually was as below.

While dd is using bs of 1k to read from /dev/sda and write to /dev/null,
# watch -n .1 cat /proc/diskstats
shows that there are as many as 80 requests in progress (that is the 3rd number from the end in each line)

202 0 sda 4597064 37943747 66911714 17175328 1248147 3642689 38644060 175422532 80 3681860 192598428
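For reference, here is how that line maps onto the field list in the kernel's Documentation/iostats.txt:

reads completed        4597064
reads merged           37943747
sectors read           66911714
ms spent reading       17175328
writes completed       1248147
writes merged          3642689
sectors written        38644060
ms spent writing       175422532
I/Os in progress       80   <-- the number in question
ms spent doing I/O     3681860
weighted ms doing I/O  192598428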

I am struggling to understand how there can be 80 requests when there should be only 1 at most.

Interestingly, I have this problem with disks using the HP cciss driver but not with the plain SCSI driver; with the SCSI driver the queue length is usually 1.
SCSI disks are used in both tests.

Can anyone please shed some light on this issue?

Last edited by Jung; 04-08-2009 at 06:32 AM.
 
Old 04-08-2009, 06:39 AM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Quote:
Originally Posted by Jung View Post
Hi all,

I assume ...
Always a bad start.
Testing to validate (or not as the case may be) your assumptions is commendable.
Try changing the I/O scheduler to NOOP and run your tests again.
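Something like this will do it on the fly; sda here is just an example, and for cciss devices the sysfs path differs (e.g. cciss!c0d0):

# cat /sys/block/sda/queue/scheduler
# echo noop > /sys/block/sda/queue/scheduler

The cat shows the available schedulers with the active one in brackets; the echo switches the active one.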
 
Old 04-08-2009, 07:13 AM   #3
Jung
LQ Newbie
 
Registered: Jan 2009
Posts: 4

Original Poster
Rep: Reputation: 0
NOOP

I can't test it with NOOP at the moment.

However, I strongly believe that with bs=1k, a single read request from dd should be the only IO request to the disk until the disk returns the requested data.

I have confirmed the theory by testing as below.

# dd if=/dev/sda of=/dev/null bs=1G
and then trace it
# strace -p 'pid of dd'

Because it takes some time to read 1G, strace shows that dd sits in read() for a long time, then quickly writes to /dev/null, then spends a while reading again, and so on.
(This proves that dd processes only one request at a time. A 1G read from dd, however, would be broken into MANY smaller requests to the I/O scheduler.
That is why I used bs=1k, to make the problem easier to understand.)

I just don't know enough to see how there can be 80 requests in the request queue.

There are absolutely no other apps using the disk except my dd.
No disk swapping, and the CPU is 100% idle before testing.

Last edited by Jung; 04-08-2009 at 07:22 AM.
 
Old 04-08-2009, 07:27 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Quote:
Originally Posted by Jung View Post
I can't test it with NOOP at the moment.
I/O scheduler can be set per device - "on the fly"; doesn't require a (re-)boot. If you can find a quiet period for the environment you may be able to test.
Quote:
However, I strongly believe that with bs=1k, a single read request from dd should be the only IO request to the disk until the disk returns the requested data.

I have confirmed the theory by testing as below.

# dd if=/dev/sda of=/dev/null bs=1G
and then trace it
# strace -p 'pid of dd'
Using a 1 Gig blocksize to test ("confirm" ... ???) a theory about a 1k blocksize (or 1 sector) is not valid in any sense.

You might be interested in the data provided by blktrace - I found it quite instructive. But I wouldn't use it in anything but a test system.
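For example (requires the blktrace package, block-layer tracing enabled in the kernel, and debugfs mounted; run as root):

# blktrace -d /dev/sda -o - | blkparse -i -

That shows each request as it is queued (Q), dispatched to the driver (D) and completed (C), so you can see exactly what your 1k read() turns into at the block layer.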
 
Old 04-08-2009, 07:56 AM   #5
GazL
LQ Veteran
 
Registered: May 2008
Posts: 6,897

Rep: Reputation: 5018
If dd is a single-threaded process, then regardless of block size it's only going to issue one read() at a time. However, perhaps that request is being broken down in kernel space into smaller amounts: 4K pages sounds plausible, or maybe even individual disk blocks (sectors). Maybe the disk driver always reads a track or a cylinder at a time. Does that 80 have any relation to your disk geometry values at all? Sectors per track, or anything like that?

This is all conjecture on my part; I've not studied the Linux internals enough to do much more, just throwing it up there to give you some things to consider.
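A couple of commands that might help you check, untested guesswork on my part, and substitute your own device (e.g. /dev/cciss/c0d0):

# blockdev --getra /dev/sda
# hdparm -g /dev/sda

The first prints the current readahead setting in 512-byte sectors; the second prints the drive geometry (cylinders/heads/sectors), which you could compare against that 80.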

Last edited by GazL; 04-08-2009 at 07:57 AM.
 
Old 04-08-2009, 08:56 AM   #6
Jung
LQ Newbie
 
Registered: Jan 2009
Posts: 4

Original Poster
Rep: Reputation: 0
>> Using a 1 Gig blocksize to test ("confirm" ... ???) a theory about a 1k blocksize (or 1 sector) is not valid in any sense.

By issuing # dd bs=1G and stracing it, I was able to see that dd is single-threaded and waits for a 1G read request to complete before proceeding with a 1G write.

So I was led to believe that it would be the same even with bs=1k:
1k read, completed, 1k write, completed, and so on.

Now, for bs=1k, how many requests should I expect to see in the request queue? Just ONE.

But as far as the HP cciss driver is concerned, it goes beyond 1, as high as 80~81 on my system.

80 doesn't mean anything special; it just happens to be the highest number I have seen with this test.

Can anyone tell me how this can be explained?

(I have just tested the same thing on a Xen virtualised guest OS, and the queue length stays between 4 and 5 while # dd bs=1k is running.)

Last edited by Jung; 04-08-2009 at 09:29 AM.
 
Old 04-08-2009, 09:31 AM   #7
GazL
LQ Veteran
 
Registered: May 2008
Posts: 6,897

Rep: Reputation: 5018
Quote:
Field 9 -- # of I/Os currently in progress. The only field that should go to zero. Incremented as requests are given to the appropriate struct request_queue and decremented as they finish.
You're still assuming that one read() call == one request. I don't know whether that assumption is correct or not, but the simplest explanation for the results you're seeing is that it isn't.
 
Old 04-08-2009, 11:20 PM   #8
Jung
LQ Newbie
 
Registered: Jan 2009
Posts: 4

Original Poster
Rep: Reputation: 0
1k read

I didn't mean that any one request should be one call to the disk.
A 1G read request would of course be converted into lots of smaller calls to the disk.

But when it is only a 1k read request, I assume it would be only one call to the disk (assuming there is no other access to the same device).

I am still curious to know why the HP cciss driver shows up to 80 I/O requests in progress while the scsi driver shows only 1.
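For what it's worth, here is how I intend to compare the block-layer queue settings of the two drivers (device names are just the ones on my systems):

# cat /sys/block/sda/queue/nr_requests
# cat '/sys/block/cciss!c0d0/queue/nr_requests'

nr_requests is the size of the request queue the scheduler maintains for the device, so it at least puts an upper bound on how many requests can be outstanding.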
 
Old 04-09-2009, 04:36 AM   #9
GazL
LQ Veteran
 
Registered: May 2008
Posts: 6,897

Rep: Reputation: 5018
Quote:
Originally Posted by Jung View Post
But when it is only a request of 1k read, I assume it would be only one call to disk (assuming that there is no other access to the same device).
Yes, that does sound as if it ought to be the case, doesn't it?

Perhaps even though dd is only fetching 1K with the read(), the kernel is anticipating further reads and pre-fetching subsequent sectors for efficiency (readahead)? But again, I'm only guessing here.
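If readahead is what's happening, turning it off temporarily ought to change your numbers. Something like this, untested, and remember to restore the original value afterwards:

# blockdev --getra /dev/sda    (note the current value)
# blockdev --setra 0 /dev/sda
# dd if=/dev/sda of=/dev/null bs=1k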

I get the feeling this is something that needs a fairly in-depth understanding of the I/O-specific parts of the kernel to explain. I'm quite curious about this too now, but I don't think we're going to find an answer here on LQ.

Anyway, any further guesswork on my part isn't really going to add any value from this point on, so I'll bow out. I hope you didn't mind me sharing my thought processes on this.
 
  

