Old 11-02-2020, 08:14 PM   #1
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684
Blog Entries: 31

Rep: Reputation: 176
writing to disk without flooding kernel buffers


when i write a big bunch of blocks to a disk partition (a few to many GB), it ends up flooding the kernel buffers, usually causing lots of dirty pages and a lot of swapping, which contributes to seek thrashing on the disk that has swap. so i have been looking at ways to deal with that. i have tried using O_DIRECT and that seems to work, but performance is reduced, in some cases drastically. i suspect this is because, after a write completes on the physical device, the next physical write won't start until control makes a complete round trip through the userland process and it issues another write call, even if the process acquired the next block of data concurrently with the previous write and has it ready to go.

i am wondering how the kernel handles write completion. i assume that when it is writing out dirty buffers and one write completes, it starts the next one about as fast as its interrupt handling allows. what i would like to know is: if a process makes a write on one block via an O_DIRECT fd, and another process has a write pending on the very next block via its own O_DIRECT fd, will the kernel start the 2nd write similarly fast (faster than waiting for another userland write call)?

my thought is for my program to start 2 or more threads, or fork 2 or more child processes, and have them interleave the write calls so the kernel always has one in the queue ready to go when the physical write completes. they would use pwrite to interleave. the parent would be reading into a shared buffer and sending something (to be decided) to the writers to indicate what is ready.
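
for reference, the plain single-writer O_DIRECT loop i'm describing is roughly this (a bare sketch only; the device path, block size, block count, and data are placeholders, and O_DIRECT needs the buffer, length, and offset aligned to the device's logical block size):

Code:
/* minimal sketch of a single-writer O_DIRECT loop */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ (512 * 1024)          /* placeholder block size */

int main(void)
{
    void *buf;
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);    /* placeholder device */
    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, BLKSZ))              /* aligned buffer for O_DIRECT */
        return 1;
    memset(buf, 0, BLKSZ);                               /* stand-in for real data */
    for (off_t off = 0; off < (off_t)BLKSZ * 1000; off += BLKSZ) {
        /* each pwrite blocks until the device finishes; the next write
           cannot start until control comes back here and the loop issues
           another call -- the gap i suspect costs a rotation */
        if (pwrite(fd, buf, BLKSZ, off) != BLKSZ)
            break;
    }
    fsync(fd);
    close(fd);
    free(buf);
    return 0;
}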
 
Old 11-02-2020, 08:41 PM   #2
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
From the open(2) man page:
Quote:
The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred.
Which is a bit vague. I understand that O_DIRECT makes writes more or less synchronous, but we don't know for sure.

Instead of a complicated application structure, you may want to consider measures that limit the cache size or write dirty blocks back to disk more aggressively, for example following the suggestions in https://unix.stackexchange.com/quest...cache-in-linux.
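
For example, one common form of that tuning is to cap the amount of dirty page cache with the vm.dirty_bytes sysctls, so writeback starts early and a writer can never dirty gigabytes of cache. The values below are only illustrative and would have to be adjusted to your RAM and disk speed:

Code:
# /etc/sysctl.conf -- illustrative values only
# start background writeback once 64 MB of dirty pages accumulate
vm.dirty_background_bytes = 67108864
# force writers to do writeback once 256 MB of dirty pages accumulate
vm.dirty_bytes = 268435456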
 
Old 11-03-2020, 05:01 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,125

Rep: Reputation: 4120
Sounds like an application design problem. If you are suffering from swapping, you (that would be you, not the system) are allocating too many dirty anonymous pages. Allocating (and touching) gigabytes of storage before writing any of it out sounds naive.
Having swap on the same disk that is the target of heavy I/O is likewise a problem. If that can't be avoided, redesign what you are doing. Another option might be an SSD (NVMe is better) with an appropriate scheduler - which should be set up automatically. That allows concurrent I/O streams on a single device. Much nicer.
 
Old 11-03-2020, 06:03 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,830

Rep: Reputation: 7308
I think that is how it works. The kernel tries to speed things up by using the cache. That works for small files, but obviously will not work if you want to write a huge amount of data (more than the available RAM).
Caching only appears to speed up the I/O: the actual write happens later, and slowly, but you do not have to wait for it.
Writing directly to the disk is a solution (it avoids those dirty pages), but it is obviously much slower (there is no cache and you have to wait for the real completion).
 
Old 11-03-2020, 01:10 PM   #5
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by syg00 View Post
Sounds like an application design problem. If you are suffering from swapping, you (that would be you, not the system) are allocating too many dirty anonymous pages. Allocating (and touching) gigabytes of storage before writing any of it out sounds naive.
Having swap on the same disk that is the target of heavy I/O is likewise a problem. If that can't be avoided, redesign what you are doing. Another option might be an SSD (NVMe is better) with an appropriate scheduler - which should be set up automatically. That allows concurrent I/O streams on a single device. Much nicer.
the system has one disk because disks are huge these days, and the system hardware has only 4 interfaces for disks, 2 wired internal (1 used) and 2 wired external (which i use). everything is on that disk and this system is used for pretty much everything. this app basically reads or generates data to be written to disks i plug in externally to initialize them. sometimes it writes a portion of the device and sometimes it writes the whole disk. the app is quite simple, uses at most about 2MB of memory, and writes power-of-two sized buffers up to 512K. the amounts involved typically vary from ISO and system sizes (8GB to 128GB) up to 4TB (generated, typically data patterns). the rest of the system, usually 1000+ processes, uses the bulk of memory when i'm not running the app to set up a disk.

if i use regular writes, the system gets trashed. most everything gets swapped out and takes a while to resume because it needs to be swapped back in.

writing with O_DIRECT solves that problem. so does O_SYNC, but i have had problems with O_SYNC in the past, so i use O_DIRECT now. but it takes about 5% to 40% longer to write a disk. i suspect that the delay in starting the next write lets the disk rotate past the point where it can start writing the next physical sectors, so it has to wait for another rotation. this is what i want to speed up, but i am unsure what the kernel does here. i'm hoping that if another process/thread has already made a pwrite call for the next chunk of data, the kernel/driver will start that one as soon as the active one is done. i just need to have these workers get their data at least that fast (shared buffers instead of piping the data from the parent).
 
Old 11-03-2020, 01:22 PM   #6
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by pan64 View Post
Writing directly to the disk is a solution (it avoids those dirty pages), but it is obviously much slower (there is no cache and you have to wait for the real completion).
my worry is that a userland process can't do its own caching as effectively if it can't issue writes fast enough to keep the disk as busy on each rotation as the kernel can when its own caching is in play. my idea is to parallelize the writes enough to achieve that. O_NONBLOCK is ineffective for disks, so my design is to use blocking writes from 2 or more child processes (or maybe threads) with pwrite to interleave the chunks.
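
roughly what i have in mind is this (a bare sketch only; in the real program the parent would fill shared buffers and signal the workers, and the device path, chunk size, and chunk count here are placeholders):

Code:
/* sketch of the interleaving idea: two forked writers, each issuing
   blocking O_DIRECT pwrites for alternate chunks, so one request
   should already be queued when the other worker's write completes */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK    (512 * 1024)   /* placeholder chunk size */
#define NCHUNKS  1000           /* placeholder chunk count */
#define NWORKERS 2

static void writer(int worker)
{
    void *buf;
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);    /* placeholder device */
    if (fd < 0)
        _exit(1);
    if (posix_memalign(&buf, 4096, CHUNK))              /* aligned buffer for O_DIRECT */
        _exit(1);
    memset(buf, worker, CHUNK);   /* placeholder data; the real program would
                                     read from the shared buffer instead */
    /* worker 0 writes chunks 0,2,4,... and worker 1 writes 1,3,5,... */
    for (long i = worker; i < NCHUNKS; i += NWORKERS) {
        if (pwrite(fd, buf, CHUNK, (off_t)i * CHUNK) != CHUNK)
            _exit(1);
    }
    fsync(fd);
    close(fd);
    _exit(0);
}

int main(void)
{
    for (int w = 0; w < NWORKERS; w++) {
        if (fork() == 0)
            writer(w);
    }
    while (wait(NULL) > 0)       /* wait for both writers to finish */
        ;
    return 0;
}

whether this actually keeps the device busier than a single O_DIRECT writer is exactly what i would be measuring.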
 
Old 11-04-2020, 01:03 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,830

Rep: Reputation: 7308
Quote:
Originally Posted by Skaperen View Post
my worry is that a userland process can't do its own caching as effectively if it can't issue writes fast enough to keep the disk as busy on each rotation as the kernel can when its own caching is in play.
You would need to demonstrate it somehow, but I think this statement is not valid. The kernel will keep the disk busy, but writing gigabytes to a spinning disk will take a lot of time.
The average speed is about 100 MB/s (https://forums.tomsguide.com/threads...-speed.384289/), so writing 6 GB will take a full minute (if nothing else wants to access the device in the meantime).

Quote:
Originally Posted by Skaperen View Post
my idea is to parallelize the writes enough to achieve that. O_NONBLOCK is ineffective for disks, so my design is to use blocking writes from 2 or more child processes (or maybe threads) with pwrite to interleave the chunks.
I don't think it will speed things up, but try it and let us know the result.


You forgot that the disk has its own cache too, and even the kernel has no idea what is going on inside it. There is no way to get direct access to the real disk (I mean the magnetic platters, not the device). Communication with the device is handled by the kernel only (no userland processes are involved), and there can be many processes doing I/O continuously, so a process-level cache is fairly pointless.
The firmware in the HDD itself tries to optimize head movements, uses virtual head/track/sector mappings, and does a lot of other tricks.

Believe me, what you want has already been invented, implemented, and built into the kernel (not to mention tested by thousands of companies and millions of users). Optimizing disk I/O is one of the most important parts of the kernel.
If you have really found a much better way, please implement it and contribute it.

You can still try to fine-tune your system (see post #2), or redesign your app (post #3).
 
Old 11-04-2020, 02:16 PM   #8
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
are you talking about the use of O_DIRECT or my investigation into using two processes to keep O_DIRECT as active as possible?
 
  

