Linux - Kernel
This forum is for all discussion relating to the Linux kernel.
When I write a big run of blocks to a disk partition (a few to many GB), it ends up flooding the kernel's buffers, usually creating so many dirty pages that other memory gets swapped out, contributing to seek thrashing on the disk that holds swap. So I have been looking at ways to deal with that. I have tried O_DIRECT, and that seems to work, but performance is reduced, in some cases drastically. I suspect this is because after a write completes on the physical device, the next physical write will not start until control makes a complete trip back through the userland process and it issues another write, even if the process acquired the next block of data concurrently with the previous write and has it ready to go.
I am wondering how the kernel handles write completion. I assume that if it is writing out dirty buffers and one completes, it starts the next one about as fast as its interrupt handling allows. What I would like to know is: if one process issues a write on one block via an O_DIRECT fd, and another process issues a write on the very next block via its own O_DIRECT fd, will the kernel start the second write similarly fast (faster than waiting for another userland write call)?
My thought is for my program to start two or more threads, or fork two or more child processes, and have them interleave the write calls so the kernel always has one request queued and ready to go when the physical write completes. They would use pwrite() to interleave. The parent would read into a shared buffer and send something (to be decided) to the writers to indicate which chunk is ready.
The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred.
Which is a bit vague. I understand that O_DIRECT makes writes more or less synchronous, but we don't know for sure.
Instead of a complicated application structure, you may want to consider measures that limit the cache size or write dirty blocks to disk more aggressively, for example following the suggestions in https://unix.stackexchange.com/quest...cache-in-linux.
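The usual suggestions along those lines amount to tuning the kernel's writeback sysctls so dirty data is flushed early and the cache can never grow large enough to force swapping. A sketch, run as root; the byte values here are purely illustrative, not recommendations:

```shell
# Illustrative values only: cap the dirty page cache so background
# writeback starts early and writers are throttled well before the
# cache can pressure the rest of memory.

# start background writeback once 64 MB of pages are dirty
sysctl -w vm.dirty_background_bytes=$((64 * 1024 * 1024))

# throttle/block writers once 256 MB of pages are dirty
sysctl -w vm.dirty_bytes=$((256 * 1024 * 1024))

# make the settings persistent across reboots
cat >> /etc/sysctl.d/99-writeback.conf <<'EOF'
vm.dirty_background_bytes = 67108864
vm.dirty_bytes = 268435456
EOF
```

Note that vm.dirty_bytes and vm.dirty_ratio are mutually exclusive (setting one zeroes the other), and these limits apply system-wide, not per process.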
Sounds like an application design problem. If you are subject to swapping, you (that would be you, not the system) are allocating too many dirty anonymous pages. Allocating (and touching) gigabytes of storage before writing any of it out sounds naive.
Having swap on the same disk as the target of heavy I/O is likewise a problem. If it's unavoidable, redesign what you are doing. Another solution might be an SSD (NVMe is better) with an appropriate scheduler, which should be set up automatically. That will allow parallel, concurrent I/O streams on a single device. Much nicer.
I think that is how it works. The kernel tries to speed things up by using the page cache. That works for small files, but obviously cannot work if you want to write a huge amount of data (more than the available RAM).
Caching only appears to speed up the I/O: the actual writing happens later, and slowly, but you do not have to wait for it.
Writing directly to the disk is a solution (it avoids those dirty pages), and it is obviously much slower (because there is no cache and you have to wait for the real completion).
Quote:
Sounds like an application design problem. If you are subject to swapping, you (that would be you, not the system) are allocating too many dirty anonymous pages. Allocating (and touching) gigabytes of storage before writing any of it out sounds naive.
Having swap on the same disk as the target of heavy I/O is likewise a problem. If it's unavoidable, redesign what you are doing. Another solution might be an SSD (NVMe is better) with an appropriate scheduler, which should be set up automatically. That will allow parallel, concurrent I/O streams on a single device. Much nicer.
The system has one disk because disks are huge these days, and the hardware has only four interfaces for disks: two wired internal (one used) and two wired external (which I use). Everything is on that disk, and the system is used for pretty much everything. The app basically reads or generates data to be written to disks I plug in externally, to initialize them. Sometimes it writes a portion of the device and sometimes the whole disk. The app is quite simple: it uses at most about 2 MB of memory and writes power-of-two-sized buffers up to 512K. The amounts typically vary from ISO and system-image sizes (8 GB to 128 GB) up to 4 TB (generated, typically data patterns). The rest of the system, usually 1000+ processes, uses the bulk of memory when I'm not running the app to set up a disk.
If I use regular buffered writes, the system gets trashed: almost everything gets swapped out and takes a while to resume because it has to be swapped back in.
Writing with O_DIRECT solves that problem. So does O_SYNC, but I have had problems with O_SYNC in the past, so I use O_DIRECT now. However, it takes about 5% to 40% longer to write a disk. I suspect the delay before the next write starts lets the disk rotate past the point where it could begin writing the next physical sectors, so it has to wait another full rotation. This is what I want to speed up, but I am unsure what the kernel does here. I am hoping that if another process/thread has already issued a pwrite() for the next chunk of data, the kernel/driver will start it as soon as the active write completes. I just need the workers to get their data at least that fast (shared buffers instead of piping it from the parent).
Quote:
Writing directly to the disk is a solution (it avoids those dirty pages), and it is obviously much slower (because there is no cache and you have to wait for the real completion).
My worry is that a userland process cannot do its own caching as effectively if it cannot issue writes fast enough to keep each disk rotation as busy as the kernel can when its own cache is in play. My idea is to parallelize writes enough to achieve that. O_NONBLOCK is ineffective for disks, so my design is to use blocking writes from two or more child processes (or maybe threads), with pwrite() to interleave the chunks.
Quote:
Originally Posted by Skaperen
My worry is that a userland process cannot do its own caching as effectively if it cannot issue writes fast enough to keep each disk rotation as busy as the kernel can when its own cache is in play.
You would need to demonstrate it somehow, but I think that statement is not valid. The kernel will keep the disk busy, but writing gigabytes to a spinning disk will simply take a long time.
The average speed is about 100 MB/s (https://forums.tomsguide.com/threads...-speed.384289/), so writing 6 GB will take a full minute (assuming nothing else wants to access the device in the meantime).
Quote:
Originally Posted by Skaperen
My idea is to parallelize writes enough to achieve that. O_NONBLOCK is ineffective for disks, so my design is to use blocking writes from two or more child processes (or maybe threads), with pwrite() to interleave the chunks.
I don't think it will speed things up, but try it and let us know the result.
You forget that the disk has its own cache too, and even the kernel has no idea what is going on inside it. There is no way to get direct access to the real disk (I mean the magnetic platters, not the block device). Communication with the device is done by the kernel alone (no userland processes are involved), and there can be many processes doing I/O continuously, so a process-level cache is pointless.
The [firmware in the] HDD itself tries to optimize head movements, uses virtual head/track/sector mappings, and does a lot of other tricks.
Believe me, what you want has already been invented, implemented, and built into the kernel (not to mention tested by thousands of companies and millions of users). Optimizing disk I/O is one of the most important parts of the kernel.
If you really find a much better way, please implement it and contribute....
You can still try to fine-tune your system (see post #2), or redesign your app (post #3).