LinuxQuestions.org - one 16K random read I/O issues 2 scsi I/O (16K and 4K) requests

I noticed a weird issue when benchmarking random read I/O on files in
Linux
(Linux 2.6.18-274, files on an ext3 FS).
The benchmarking program is my own; it simply keeps reading
16KB of a file from a random offset.
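
For readers who want to reproduce the access pattern, here is a minimal sketch of such a benchmark. It is not the poster's program: the function name, the 4KB offset alignment, and the seed parameter are assumptions made for illustration. It uses os.pread, which is available on Unix in Python 3.3 and later.

```python
import os
import random

READ_SIZE = 16 * 1024   # 16KB per pread, as in the post
ALIGN = 4096            # assumption: offsets are aligned to 4KB


def random_reads(path, count, seed=None):
    """Issue `count` 16KB preads at random aligned offsets; return bytes read."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size < READ_SIZE:
            raise ValueError("file is smaller than one read")
        # Highest aligned slot that still leaves room for a full 16KB read.
        max_slot = (size - READ_SIZE) // ALIGN
        total = 0
        for _ in range(count):
            offset = rng.randint(0, max_slot) * ALIGN
            total += len(os.pread(fd, READ_SIZE, offset))
        return total
    finally:
        os.close(fd)
```

To see the device-level behavior described here, the file would need to be much larger than RAM (or the page cache would need to be cold); otherwise most reads are served from memory.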

I traced I/O behavior at the system call level and the SCSI level with SystemTap, and
I noticed that one 16KB pread issues 2 SCSI I/Os, as follows.

=============================================
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 128137183232
SCSI random(8472) 0 1 0 0 start-sector: 226321183 size: 4096 bufflen 4096 FROM_DEVICE 1354354008068009
SCSI random(8472) 0 1 0 0 start-sector: 226323431 size: 16384 bufflen 16384 FROM_DEVICE 1354354008075927
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 21807710208
SCSI random(8472) 0 1 0 0 start-sector: 1889888935 size: 4096 bufflen 4096 FROM_DEVICE 1354354008085128
SCSI random(8472) 0 1 0 0 start-sector: 1889891823 size: 16384 bufflen 16384 FROM_DEVICE 1354354008097161
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 139365318656
SCSI random(8472) 0 1 0 0 start-sector: 254092663 size: 4096 bufflen 4096 FROM_DEVICE 1354354008100633
SCSI random(8472) 0 1 0 0 start-sector: 254094879 size: 16384 bufflen 16384 FROM_DEVICE 1354354008111723
SYSPREAD random(8472) 3, 0x16fc5200, 16384, 60304424960
SCSI random(8472) 0 1 0 0 start-sector: 58119807 size: 4096 bufflen 4096 FROM_DEVICE 1354354008120469
SCSI random(8472) 0 1 0 0 start-sector: 58125415 size: 16384 bufflen 16384 FROM_DEVICE 1354354008126343
=============================================

As shown above, one 16KB pread issues 2 SCSI I/Os. (I traced SCSI I/O
dispatching with probe scsi.iodispatching.)

One SCSI I/O is the 16KB I/O requested by the application, and that's OK.
The problem is the other 4KB I/O; I don't know why Linux issues it.

Of course, I/O performance is degraded by the weird 4KB I/O, and I am
having trouble with it.
I also used fio (a well-known I/O benchmark tool) and noticed the same issue,
so it's not caused by the application.
Does anybody know what is going on?
Any comments or advice are appreciated.

Thanks

Do you have timing of those accesses? Just a wild guess, but might the other read be a read of the filesystem structures to find out where your random chunk is located?

Thank you for the comment.

>Do you have timing of those accesses? Just a wild guess, but might the other read be a read of the filesystem structures to find out where your random chunk is located?

At the application level, no.
This issue happens even with the "cat" program.
Running "cat" on a small file (less than 4K) produces one 4K I/O for the data and another 4K I/O whose purpose I don't know.

I figured out what is going on, but I don't know what it is for.

The ext3 filesystem has a 4KB block in front of each 4096KB (8192 sectors) of data.
Visually, the data is laid out as follows.

|4KB|4096KB|4KB|4096KB|4KB|4096KB| ...

Only the 4096KB areas are accessible to application programs.
When a 4096KB area is accessed for the first time,
the OS first reads the 4KB just before that area
and then reads the requested data in the 4096KB area.

When accessing a large file (compared to the DRAM size) randomly,
an I/O rarely hits the page cache,
so nearly every I/O request comes together with a 4KB I/O.

The question is: what is the 4KB data for?
Is it location metadata for the filesystem?
Is there any way I can remove it?
Or is there any way I can access only the 4096KB area?

Any comments and advice are appreciated.

(I tested on many machines with many kernel versions. This happens on
all of them.)

Thanks.

I figured it out. It comes from ext3 indirect block mapping. (With 4KB blocks, ext3 places a block of block pointers in front of every 1024 data blocks.)
Changing the filesystem to ext4 makes the issue disappear. (Ext4 has a more efficient scheme for block addressing.)
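
The indirect-block explanation can be checked against the trace posted earlier in the thread. The sketch below assumes 4KB blocks, the classic ext2/ext3 inode layout (12 direct pointers, then single, double, and triple indirect blocks holding 1024 four-byte pointers each), and data blocks laid out contiguously right after their indirect block, as in the layout described above. The function names are made up for this illustration.

```python
BLOCK_SIZE = 4096                        # assumed ext3 block size
PTRS = BLOCK_SIZE // 4                   # 1024 four-byte pointers per indirect block
DIRECT = 12                              # direct block pointers in the inode
SECTORS_PER_BLOCK = BLOCK_SIZE // 512    # 8 sectors of 512 bytes


def indirection(offset):
    """Return (level, slot) for a file offset.

    level: number of indirect blocks on the lookup path (0 to 3).
    slot:  index of the pointer inside the last-level indirect block,
           or None for directly mapped blocks.
    """
    lblk = offset // BLOCK_SIZE
    if lblk < DIRECT:
        return 0, None
    lblk -= DIRECT
    span = PTRS
    for level in (1, 2, 3):
        if lblk < span:
            return level, lblk % PTRS
        lblk -= span
        span *= PTRS
    raise ValueError("offset beyond the triple-indirect range")


def predicted_gap(offset):
    """Sectors from the indirect block to the data block, assuming the
    data blocks follow their indirect block contiguously."""
    _, slot = indirection(offset)
    return (1 + slot) * SECTORS_PER_BLOCK


# (pread offset, start sector of the 4KB read, start sector of the 16KB read)
TRACE = [
    (128137183232, 226321183, 226323431),
    (21807710208, 1889888935, 1889891823),
    (139365318656, 254092663, 254094879),
    (60304424960, 58119807, 58125415),
]

for offset, meta_sector, data_sector in TRACE:
    level, slot = indirection(offset)
    print(level, slot, predicted_gap(offset), data_sector - meta_sector)
```

For all four preads in the trace, the offset falls in the triple-indirect range and the predicted gap equals the observed one (2248, 2888, 2216, and 5608 sectors). That fits the 4KB read being the last-level indirect block; the higher-level indirect blocks are few enough that they presumably stay cached.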

Thank you all.

I would hardly call this a bona fide "issue."

Over time, the various caches will do more-or-less good. What you should strive to do is to arrange the file access request pattern to improve the chances of the next piece of data already being cached somewhere. For example, sort the locations in ascending order and issue the requests in that order.
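
A rough sketch of the sorting idea, for the case where a batch of offsets is known up front. It assumes 4KB blocks and the ext3 layout discussed in this thread (12 direct blocks, then one pointer block per 1024 data blocks); the helper names are hypothetical. It counts how often consecutive requests move to a different pointer block, which is a proxy for how many extra 4KB reads a cold cache would see.

```python
BLOCK_SIZE = 4096          # assumed ext3 block size
PTRS = BLOCK_SIZE // 4     # data blocks mapped by one pointer block
DIRECT = 12                # blocks mapped directly from the inode


def pointer_block_index(offset):
    """Identify which last-level pointer block maps this offset (-1 = direct)."""
    lblk = offset // BLOCK_SIZE
    if lblk < DIRECT:
        return -1
    return (lblk - DIRECT) // PTRS


def switches(offsets):
    """Count consecutive requests that land under different pointer blocks."""
    ids = [pointer_block_index(off) for off in offsets]
    return sum(1 for a, b in zip(ids, ids[1:]) if a != b)


base = DIRECT * BLOCK_SIZE           # first indirectly mapped byte
region = PTRS * BLOCK_SIZE           # 4096KB mapped by one pointer block
batch = [base, base + 5 * region, base + 4096, base + 5 * region + 4096]

print(switches(batch), switches(sorted(batch)))
```

Sorting only helps when requests can be collected into a batch before they are issued, and when the batch is dense enough that several requests share a pointer block.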

It is an "issue" for large files, for example TBs of data.
It is a really bad design for such large files; that is why ext4 extents were introduced.
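
To put a number on this, the sketch below estimates how much mapping metadata a large ext3 file carries, assuming 4KB blocks and 1024 pointers per indirect block. The 1 TiB file size is only an illustration; the thread does not state the actual file size or the amount of RAM.

```python
BLOCK_SIZE = 4096          # assumed ext3 block size
PTRS = BLOCK_SIZE // 4     # 1024 pointers per indirect block
DIRECT = 12                # blocks mapped directly from the inode


def leaf_indirect_blocks(file_size):
    """Number of last-level indirect blocks needed to map `file_size` bytes."""
    nblocks = -(-file_size // BLOCK_SIZE)       # ceiling division
    indirect_mapped = max(0, nblocks - DIRECT)
    return -(-indirect_mapped // PTRS)


size = 2 ** 40                                  # 1 TiB, illustrative only
leaves = leaf_indirect_blocks(size)
print(leaves, leaves * BLOCK_SIZE)              # count and bytes of leaf pointer blocks
```

Under these assumptions a 1 TiB file needs 262144 leaf indirect blocks, about 1 GiB of pointers, all of which would have to stay cached for random reads to avoid the extra 4KB I/O. The double and triple indirect blocks add roughly another 1/1024 of that.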

Of course, we should care about locality, but that is another matter and it has nothing to do with the filesystem's bad design.
Also, we can't always sort data for accesses.
For example, in a database we can't have a clustered index on every attribute; we have to use secondary indexes for some attributes,
and accesses through a secondary index are always random accesses.

An interesting idea, although here's more about why my thoughts are still, "no, it really doesn't." (And we can probably just let it rest at that.)

No matter how big the file is, you're going to hit one-or-more index blocks followed by access to the data itself. Admittedly, file systems are geared toward "millions of small files" and "enormous single files" are certainly uncommon. But you can still reach anywhere in the file by making one or two index-block reads (and this only to the extent that they're not cached) followed by the data that you want.

I suggest that, with "large files," the root problem might well be that you're making a lot of accesses to it. Indeed, if those accesses are pure-random, you might be hitting several index-blocks per access instead of one. But once again, what you'd really like is to find a way to access the data in a somewhat less-than-random fashion. Make those buffer caches work for you.

Yes, ext4 does make a nod to the "gigantic file" case, and certainly one reason why they did this was to accommodate humongous (e.g. MySQL) databases that live in the filesystem.