efficient access of huge files... or defrag ext4
I need to figure out how to arrange for the fastest-possible read-access of a large or huge memory-mapped file.
I'm writing high-speed real-time object-chasing software for a NASA telescope (on earth). This software must detect images of fast moving objects (across arbitrary fields of fixed stars), estimate what direction and speed the object image is traveling (based on the length and direction of a streak on the detection image), then chase after the object while capturing new 4Kx4K pixel images every 2~5 seconds, quickly matching its speed and trajectory, then continue to track and capture images until the object vanishes (below horizon, into earth shadow, etc).
I have created two star "catalogs". Both contain the same 1+ billion stars (and other objects), but one is a "master catalog" that contains all known information about each object (128 bytes per object == 143GB) while the other is a "nightly build" that only contains the information necessary to perform the real-time process (32 bytes per object == 36GB) with object positions precisely updated for precession and proper-motion each night. Almost always the information in the "nightly build" catalog will be sufficient for the high-speed (real-time) processes.
The following are some questions about the software I have developed so far:
#1: When I create these catalogs (the 143GB master catalog only rarely, but the 36GB catalog every night), my code has two paths (both are the same, except step B):
: A: create empty file.
: B: write 64MB block created by calloc() into file as many times as necessary to fill the file.
: C: memory map entire file to address 0x0000001000000000.
: D: assign memory map address to object structure array.
: E: process source catalog files to set values in [128-byte or] 32-byte object structure.
: F: write structure into object structure array at appropriate index/offset.
Step B takes a long, long time to execute... somewhere around half an hour (rough guess). To speed this process up, I replaced step B with the following steps:
: B1: move file-offset pointer to desired size of file in bytes minus 8.
: B2: write 8-byte integer variable (containing zero) to file.
This alternate scheme takes only a tiny fraction of a second to execute, and "df" in a console window before and after step B2 show the file has grown from 0-bytes to 36GB.
Given how fast steps B1 and B2 execute, obviously nothing has been written to disk, though obviously sectors have been reserved for the file (based upon the results printed by the "df" process).
My questions are the following:
___ does the first approach (writing 143GB or 36GB of zeros into the file) allocate disk sectors efficiently for pseudo-sequential access later (chasing objects across the sky)?
___ does the second approach (faking the 143GB or 36GB file into existence) allocate disk sectors efficiently for pseudo-sequential access later (chasing objects across the sky?
___ does some sort of defrag exist for ext4 (or ext3, or ext2) filesystems to force efficient ordering after these files are created and populated?
#2: When the real-time process is executing (chasing objects across the sky), the higher-level portions of the software expose an image at a fairly-accurately-known location in the sky, and call low-level "access routines" passing the approximately coordinates of the center of the field, plus the E/W "width" of the captured image/field and the N/S "height" of the captured image/field.
The entire [143GB or] 36GB catalog is mapped into memory at address 0x0000001000000000. Yes, this is obviously a 64-bit mode application! :-)
My low-level "access routines" return an array of (~63) indices and counts into the object structure array, the number of E/W "micro-fields" and the number of N/S "micro-fields". Given the nominal telescope and CCD, the array of indices will contain about 9 x 7 == 63 index (9 micro-fields in E/W and 7 micro-fields in N/S), each "micro-field" covering a 4 arc-minutes square on the sky.
The high-level routines can then efficiently access only those catalog stars that are near each star it detects on the captured image. It knows precisely the index into the catalog of the first object in each 4 arc-minute "micro-field", and how many objects are in that "micro-field" at sequential indices (ordered by increasing E/W coordinate).
By mapping the entire catalog into memory, both low-level and high-level routines need not worry about allocation and deallocation of object structures AND they need not worry about object structure address becoming obsolete (as would be the case if the data was loaded from disk after each exposure).
On average, each of the ~63 micro-fields will contain about 100 objects (at 32-bytes each), which works out to about 1/4 megabyte per captured image of sky. A given chase across the sky might involve the capture of a few images to about 100 maximum. In other words, only about 25MB of object information will be accessed per chased object. That is very modest, since this computer will contain 8GB ~ 16GB of DRAM.
My question is this:
___ does this approach risk involving any (OS or application) situation which might force or cause the OS to perform any lengthy process/delay before it can memory map an accessed region of the catalog?
Note that the "nightly build" catalog is ONLY read and never written during our real time processes. When the "nightly build" catalog is generated, it obviously is and must be created/opened and memory-mapped with write access because all the object data is written into the file at that time.
However, the "nightly build" catalog file is unmapped and closed after it is created. Sometime later (after dark) the "nightly build" catalog file is opened for read-only ("rbS") and mapped for read-only. So if the OS is smart enough (and typically linux is quite smart), there is never any reason to flush/write any of the "nightly build" catalog to disk. At least, that's how I see it.
I guess those are my two questions (for now). Sorry for giving so much detail... possibly too much detail... but I'm not sure what might matter in this situation.
PS: Writing object information into the huge 143GB file takes a long time (about 1 hour), and I print out the object information every 250,000 objects or so, so I can "watch it running". I noticed that a few times during this lengthy process the printout stopped for about 10 or 20 seconds, implying (to me anyway) that the OS was performing some kind of massive flush of catalog data to disk before it let my application continue. Admittedly this is while writing this huge catalog, which must be flushed out eventually, but the rather surprising length of the delay does prompt me to enquire about possible situations that might arise when reading memory-mapped data over long periods of time (all night).
PS: In case it matters, the motherboard will be a quad-core AMD CPU with 64-bit ubuntu 10.04 OS, and all code is written in C/C++.
Thanks in advance for taking time to provide comments, ideas and tips.
I should note that you can use posix_fallocate is a simpler way to achieve this preallocation.
Neither approach guarantees allocating disk sectors efficiently, but if you aren't allocating multiple files at the same time, and keep a reasonable amount of empty space on the disk, then you are very likely to get a contiguous file. Linux will normally reserve 10%, and the free space should be considerably more than your largest file.
Now you talked about 'pseudo-sequential' access of the catalog files; I'm not sure how you arrange this (since the view is two-dimensional). I imagine another arrangement of grid squares would follow some sort of Hilbert curve rather than a raster pattern. Still, at worst you are only bound by the disk seek times, which are of the order of milliseconds, probably bearable.
First, because the catalog is read only, the pages will never become dirty (as you point out). So you should never have a backlog of pages that need to be written back - perhaps worth specifying PROT_READ on the mmap to make certain of this, and don't mlock the pages!
Applying read-ahead (using the MAP_POPULATE flag) might increase your throughput (since there is some chance of sequential access), but I would avoid it, because it also makes the worst case take longer. Also, it would seem to me equally likely that you access the file in reverse sequential order...
Also, because you are using memory mapping, you need to be careful about the use of virtual memory by other parts of the application (or other applications). The risk is that if the real memory is exhausted by other data structures, then pages will have to be written to the virtual memory cache or flushed to files just to make room for new pages being read from the catalog file. How long these writes take will be somewhat unpredictable.
Given that you have a fair amount of RAM in the machine, you should not encounter this situation. But you will need make certain that (a) there are no memory leaks in any applications being run, and (b) that any data being written to other files is frequently flushed to disk (for example the images being collected).
So in general, if you want reliability, the tracking phase would be quite deterministic; you would know at all points in the tracking loop roughly how much memory it is going to use, even if this means flushing any collected data every time you go through the loop. This will reduce the throughput, but is necessary if you want control over response times.
You might also consider using real-time patches on the kernel, so that there are no other surprises (eg a lengthy network interrupt etc). I understand that the Ubuntu Studio version has these patches.
It sounds from your description that this is not a problem on the write, because this is a setup phase, and is not real-time. If it was, judicious use of msync at regular intervals would ensure that there wasn't a build up of dirty pages (ie, you could control the intervals between writes). However, having regular calls to msync would make the overall process take longer.
If overall throughput turned out to be a limiting factor (rather than real-time response), then there are other ways of improving this, such as RAID arrays, or running a separate disk for the catalog data (to reduce seek times).
Taking the defaults (including extents) should work fine with ext4 - here's a citation from the wiki
The pause probably is caused by write flushes, but this is periodic - it doesn't depend on page cache size or occupancy. Depending on kernel level you'll be using pdflush or bdi-flush. Hopefully the latter as it is more dynamic and adds threads as required by load - seems you might be on a pdflush kernel. Shouldn't really be an issue, but there are sysctls that force earlier (physical) write-outs. Not sure that would be a good idea - just let it write the data.
I can't see it being an issue on the smaller file.
I think that file access speed problem should be solved through RAID.
I have my doubts.
This delay is incorporated in the I/O schedulers and the (kernel) flush threads. Software RAID emulation is simply an additional device driver layer above, and in addition, to this. Hardware RAID is below.
Neither has (direct) visibility or influence.
No harm in trying, and I've been plenty wrong plenty often.
You're using the wrong filesystem, ext4 is not good with huge files. Use XFS or JFS. I recommend XFS because it DOES have a special defrag program. It's not like regular defrag, instead of putting everything at the beginning of the disk (like regular defrag, and it is a bad idea, because it will cause even more fragmentation), is simply rearranges files sparsely but so that they are not fragmented.
Summary: Use XFS NOT ext4.
Thanks for lots of good comments and suggestions.
My original post talked about two different situations with different requirements. My "master catalog" is only built rarely (when new source catalogs are added to the collection), so speed is not important. Well, it was quite painful to wait for 2 or 3 hours during development of the routines only to have the consistency check routines find bugs! Then I had to find and fix the bad code, and re-run the whole 2 ~ 3 hour process all over again. So speed in that part is helpful but not necessary.
The same is more-or-less true for the code that converts the "master catalog" into a "nightly build" catalog before the sky gets dark. This process consumes 1 ~ 2 hours, but that's okay unless the observer fails to start the system long enough before dark. But probably the system will turn itself on every afternoon, or be left on all the time, so that problem shouldn't occur.
Where performance really matters is during the night when chasing targets across the sky. Given the way the software is written, with the entire 36GB (or 143GB) catalog mapped into application memory space by mmap(), my initial tests imply the realtime performance is quite okay. Once my code has generated indices to the first 2 or 3 fields during a chase, it can infer where in the sky (and thus catalog) the subsequent requests will be --- and pre-access them seconds ahead of when they're required by the higher-level routines that call my library.
Thus the only serious problem that might arise is a massive flush of some kind after hours of chasing targets across the sky. Since the catalog data my routines will supply indices to for each captured image of sky is less than 1MB, and therefore less than 100MB for a worst-case chase across the sky, the system can chase dozens of targets before consuming available memory in the computer... assuming linux never recycled any of the memory pages accessed, even after one hour or more. After a few hours, however, the chases will likely exhaust available memory and linux will need to re-allocate some of the physical RAM memory accessed previously to newly accessed parts of the catalog. As long as linux doesn't try to flush all those memory pages, the delay should be a small fraction of a second. After all, the catalog is fopen("rbS") and mmap() as PROT_READ (read-only), so nothing needs to be written to disk. Linux simply needs to adjust its memory allocation structures to obsolete a bunch of memory pages to allow for new allocations.
I guess I'm almost at the point that I can run a simulation for 18 hours straight (to be safe) and see whether long delays occur in practice. The long delays I did notice were during writing of the entire 143GB or 36GB catalogs, not reading. Obviously writing huge memory-mapped files eventually does require a boatload of pages be written out (though I am a bit surprised Linux decides to spend 30 seconds continuously doing so before allowing a new write).
If my simulation shows up problems, I'll try some of the functions you guys mentioned to see whether I can make Linux perform its flushes sooner (while waiting to find a new moving object to chase). Hopefully we won't need RAID or a new filesystem. But if we do, then we do.
Thanks for all the ideas.
PS: When you say "ext4 is not good for huge files", I assume "ext3" and "ext2" are no better.
PS: The plan to put each catalog on a separate [primary] partition sounds wise to me.
PS: I will probably remove that "trick code" to allocate the entire 143GB and 36GB files by simply moving the file-offset pointer to the last byte and writing a byte. I am afraid that might also "trick" linux into allocating the sectors in some wacko order. With each catalog on its own [barely large enough] partition that starts out completely empty before the catalog is opened and written by fwrite() with all zeros, I'm betting Linux will allocate the sectors contiguously. To be sure, contiguous allocation is extremely important when reading parts of the catalog during the chase process.
PS: posix_fallocate() looks interesting. The documentation I've seen doesn't explicitly state the sectors are allocated immediately and contiguously... but that would certainly make sense.
PS: By "pseudo-sequential" access, I mean that the individual 4-arc-minute square fields are allocated sequentially in the catalogs. Once 5400 fields have been allocated, that completes one 4-arc-minute high strip across the sky from west to east (00h to 24h in RA (right-ascension)). The next 5400 fields are from 00h to 24h but 4-arc-minutes further north in the sky... and so forth.
So when providing indices that provide access to the objects in a real field (typically about 30-arc-minutes square on the current 4Kx4K CCD), all the fields in the 8 ~ 10 west-to-east strips are contiguous in the catalog, and hopefully on disk. However, each 4-arc-minute square field contains roughly 30 ~ 600 objects (77 objects in each field on average) at 128-bytes or 32-bytes for each object, which works out to about 13MB per west-to-east strip. Thus the separate west-to-east strips in each field will indeed by separated by a few cylinders. Since we'll have 8 ~ 10 strips in each captured image, we'll have about 8 ~ 10 short seeks to access that data. At 20ms each, that's 200ms == 0.200 seconds == tolerable (especially since I'm gonna try to "access-ahead" once I can estimate the direction and speed each target object is moving on the sky).
PS: When you say "ext4 is not good for huge files", I assume "ext3" and "ext2" are no better.
Yes, there is little difference, even performance-wise.
|All times are GMT -5. The time now is 09:26 PM.|