LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   open with delayed creation (https://www.linuxquestions.org/questions/programming-9/open-with-delayed-creation-4175675186/)

Skaperen 05-13-2020 09:30 PM

open with delayed creation
 
i would like to be able to open a file without actually creating it until the first write is performed. is there a way to do that without having a 2nd process doing that, yet get the file descriptor?

chrism01 05-14-2020 12:11 AM

Can you explain a bit why you are trying to do this?

If you are looking at 'hiding' it from other processes until it has some data in it, there are various solutions depending on the rules/circumstances that apply.

NevemTeve 05-14-2020 02:42 PM

> is there a way to do that without having a 2nd process doing that

How could a second process help in your problem?

dugan 05-15-2020 09:55 AM

Quote:

Originally Posted by Skaperen (Post 6122689)
i would like to be able to open a file without actually creating it until the first write is performed. is there a way to do that without having a 2nd process doing that, yet get the file descriptor?

No, you can't have a file descriptor without a corresponding file (using the broad *nix definition of "file").

And note that you typically don't keep open-for-writing file handles open anyway. Typically, you'd build the file contents in memory, open the file for writing, write immediately, close immediately.
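
For illustration, a minimal sketch of that pattern in C (untested; the buffer contents and file name are just examples):

Code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main (void) {
        /* build the complete contents in memory first */
        char buf[256];
        int len = snprintf (buf, sizeof buf, "the complete file contents\n");

        /* only now create the file, write once, close immediately */
        int fd = open ("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror ("open");
                return 1;
        }
        if (write (fd, buf, len) != len)
                perror ("write");
        close (fd);
        return 0;
}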

Skaperen 05-15-2020 06:50 PM

Quote:

Originally Posted by NevemTeve (Post 6123026)
> is there a way to do that without having a 2nd process doing that

How could a second process help in your problem?

a 2nd process could have a pipe connected to it and delay the open() until there is data to write. BTDT.
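
For reference, a rough sketch of what that two-process/pipe arrangement could look like (untested; the output name is just an example, and error handling is trimmed):

Code:

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main (void) {
        int pfd[2];
        if (pipe (pfd) != 0) {
                perror ("pipe");
                return 1;
        }

        if (fork () == 0) {
                /* child: open the real output file only when the first data arrives */
                close (pfd[1]);
                char buf[4096];
                int outfd = -1;
                ssize_t n;
                while ((n = read (pfd[0], buf, sizeof buf)) > 0) {
                        if (outfd < 0)
                                outfd = open ("outfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                        if (outfd >= 0)
                                write (outfd, buf, n);
                }
                if (outfd >= 0)
                        close (outfd);
                _exit (0);
        }

        /* parent: writes go to the pipe; nothing exists on disk until data flows */
        close (pfd[0]);
        write (pfd[1], "first data\n", 11);
        close (pfd[1]);
        wait (NULL);
        return 0;
}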

Skaperen 05-15-2020 06:52 PM

Quote:

Originally Posted by chrism01 (Post 6122710)
Can you explain a bit why you are trying to do this?

If you are looking at 'hiding' it from other processes until it has some data in it, there are various solutions depending on the rules/circumstances that apply.

so there is never a time when the file exists but is empty.

scasey 05-15-2020 06:59 PM

Quote:

Originally Posted by dugan (Post 6123521)
And note that you typically don't keep open-for-writing file handles open anyway. Typically, you'd build the file contents in memory, open the file for writing, write immediately, close immediately.

Yes. Simply don’t open the file until you’re ready to write to it...

Skaperen 05-15-2020 07:02 PM

Quote:

Originally Posted by dugan (Post 6123521)
No, you can't have a file descriptor without a corresponding file (using the broad *nix definition of "file").

And note that you typically don't keep open-for-writing file handles open anyway. Typically, you'd build the file contents in memory, open the file for writing, write immediately, close immediately.

i have never written any code in C that builds the whole file content in memory before writing it. but i have had to do that a few times in Python. my C programs always did an appropriate write when the data to be written was available. i often had cases where data was buffered and no more than one buffer was written (small file), so in those cases, sure, it did end up collecting the whole file. but it could have had large data and written many buffers.

GazL 05-16-2020 04:12 AM

One common approach is to first create the file with a temporary name, then rename it once it is "ready". You often see this with file downloaders, where they'll add a .partial suffix until the file is complete.

Failing that, you can look at the O_TMPFILE flag of open(2).
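
For anyone curious, a rough sketch of the O_TMPFILE approach (Linux-specific, untested; the directory and final name are just examples). The file has no directory entry until linkat() gives it one, so nothing is visible until after the data has been written, and linkat() fails if the target name already exists:

Code:

#define _GNU_SOURCE             /* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main (void) {
        /* unnamed file in the target directory's filesystem */
        int fd = open ("/var/tmp", O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0) {
                perror ("open O_TMPFILE");
                return 1;
        }

        /* write the data while the file is still invisible */
        const char msg[] = "first write\n";
        if (write (fd, msg, sizeof msg - 1) < 0)
                perror ("write");

        /* now give it a name; until this point no directory entry exists */
        char procpath[64];
        snprintf (procpath, sizeof procpath, "/proc/self/fd/%d", fd);
        if (linkat (AT_FDCWD, procpath, AT_FDCWD, "/var/tmp/outfile", AT_SYMLINK_FOLLOW) != 0)
                perror ("linkat");

        close (fd);
        return 0;
}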

dugan 05-16-2020 11:00 AM

Quote:

Originally Posted by Skaperen (Post 6123676)
i have never written any code in C that builds the whole file content in memory before writing it. but i have had to do that a few times in Python. my C programs always did an appropriate write when the data to be written was available. i often had cases where data was buffered and no more than one buffer was written (small file), so in those cases, sure, it did end up collecting the whole file. but it could have had large data and written many buffers.

Well I'm sorry to hear that you haven't written C since the 80s, but...

Is your current target platform one where memory constraints would dictate an approach like this?

Would your current target platform perform better with a few large buffers or many small ones? Hint: does the target platform have a CPU cache?

Also keep in mind that on some platforms, sporadically writing many small files would create more disk fragmentation than writing a single large file in one operation.

Finally, if you're keeping the file open and locked throughout the lifetime of the application, well, that's not the way you're supposed to do it on *nix. Locking is supposed to be done only when necessary.

EdGr 05-16-2020 11:27 AM

I have written code that creates a temporary file, writes to it, closes it, and renames the temporary file over an existing file. The key feature is that the update appears atomic. I am guessing that Skaperen really wants an atomic update.
Ed
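
A minimal sketch of that pattern (untested; names are just examples). The temporary file needs to be on the same filesystem as the final name for rename() to work atomically:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main (void) {
        /* temporary name in the same directory as the final file */
        char tmpname[] = "./outfile.XXXXXX";
        int fd = mkstemp (tmpname);
        if (fd < 0) {
                perror ("mkstemp");
                return 1;
        }

        const char data[] = "complete contents\n";
        if (write (fd, data, sizeof data - 1) < 0)
                perror ("write");
        fsync (fd);             /* make sure the data is on disk before the rename */
        close (fd);

        /* atomically replace (or create) the final name */
        if (rename (tmpname, "./outfile") != 0)
                perror ("rename");
        return 0;
}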

Skaperen 05-16-2020 02:48 PM

Quote:

Originally Posted by scasey (Post 6123675)
Yes. Simply don’t open the file until you’re ready to write to it...

which means i will need to modify the library i will be calling to do the processing and output. i don't have the source. it's not free.

NevemTeve 05-17-2020 03:49 AM

Could you please explain X in this XY-problem?

syg00 05-17-2020 04:48 AM

I thought that was my line ... :p

MadeInGermany 05-17-2020 06:11 AM

In the application you can have a framework that opens the file at the first write. For example, awk:
Code:

awk -F: '$7~/\/zsh/ { print $1 > "outfile" }' /etc/passwd
This writes all users with a zsh login shell to outfile.
If no user has a zsh login shell, then outfile is not created at all.

Skaperen 05-17-2020 03:40 PM

Quote:

Originally Posted by dugan (Post 6123920)
Well I'm sorry to hear that you haven't written C since the 80s, but...

i wrote one last month.

Quote:

Originally Posted by dugan (Post 6123920)
Is your current target platform one where memory constraints would dictate an approach like this?

operability constraints would. if you had to collect a day-long stream of messages from a network connection, would you collect it in memory and write it all at the end of the day? this may be an exception. what is to be gained by your approach?

Quote:

Originally Posted by dugan (Post 6123920)
Would your current target platform perform better with a few large buffers or many small ones? Hint: does the target platform have a CPU cache?

today's operating systems do equally well in both situations. caching takes care of this.

Quote:

Originally Posted by dugan (Post 6123920)
Also keep in mind that on some platforms, sporadically writing many small files would create more disk fragmentation than writing a single large file in one operation.

how did this become a many small files situation? i didn't bring up this topic. if you want to discuss it, start a new thread and PM me the URL.

Quote:

Originally Posted by dugan (Post 6123920)
Finally, if you're keeping the file open and locked throughout the lifetime of the application, well, that's not the way you're supposed to do it on *nix. Locking is supposed to be done only when necessary.

how did this become a file lock situation? there is no need to lock a file to write it sequentially. there are often many alternatives to what typically gets implemented with locking, depending on the application.

Skaperen 05-17-2020 03:52 PM

Quote:

Originally Posted by MadeInGermany (Post 6124209)
In the application you can have a framework that opens the file at the first write. For example, awk:
Code:

awk -F: '$7~/\/zsh/ { print $1 > "outfile" }' /etc/passwd
This writes all users with a zsh login shell to outfile.
If no user has a zsh login shell, then outfile is not created at all.

awk is designed to do it this way. C isn't. Even Python isn't. when the tool lets you output to a not-yet-opened reference, such as a string with the name of the target file, the issue is solved. i have done many projects in awk, but many others need far more than its capability. but, at least awk is not buffering the whole contents in memory before actually asking the system to write some of it.

rnturn 05-17-2020 03:56 PM

Quote:

Originally Posted by Skaperen (Post 6123676)
i have never written any code in C that builds the whole file content in memory before writing it. but i have had to do that a few times in Python. my C programs always did an appropriate write when the data to be written was available.

I remember the days when I had to demonstrate to users that it was actually not faster to process data by reading an entire file into memory and then writing it out all at once after processing the data. Especially on a multi-user system where, because of quotas, you typically will not have access to all available memory. Displaying wall clock time and the post-execution resource utilization statistics accumulated during program execution was the eye-opener for many (this was, I think, easier to do then than it is today on Linux). Sucking an entire dataset into memory cost big time in paging activity and increased run time immensely. And with many users taking the advice to not allocate swap space, a large dataset will likely have you reaching for the Big Red Switch. (There are times when I think that anyone writing software should be forced to write code on a small-memory system for a while---it forces you to think about the problem at hand a bit and work within a finite set of resources.)

Just my $0.02.

Later...

dogpatch 06-01-2020 10:47 AM

You seem to be saying you never want an empty file. If you want to make sure the file has at least some data in it, then follow the advice above: create the file by opening it in 'w'rite mode only when you have some data. Immediately after opening/creating the file, write the data and close the file. For subsequent writes, open in 'a'ppend mode, write the data, close the file.

If you want to make sure the file is complete before you create it, then do the above, but with a temporary file name. When it is complete, rename the file to its permanent name.
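
A rough sketch of that open-on-first-data pattern with stdio (untested; the file name and records are only illustrative):

Code:

#include <stdio.h>

/* write one record; the file only comes into existence on the first call,
   and at that moment it immediately receives data */
static void write_record (const char *path, const char *record, int first)
{
        FILE *f = fopen (path, first ? "w" : "a");
        if (f == NULL) {
                perror ("fopen");
                return;
        }
        fputs (record, f);
        fclose (f);
}

int main (void)
{
        write_record ("stats.log", "first record\n", 1);
        write_record ("stats.log", "second record\n", 0);
        return 0;
}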

Skaperen 06-01-2020 02:48 PM

i want to replicate a behavior i have seen in a different OS. it amounts to this: the file won't exist if the system is shut down between the open and the first write, AND if another process attempts to open the same file name for exclusive creation after that first open by the first process, but before its first write, that attempt will fail. that kernel's logic seems to be that it locks the name when the first successful open is done and actually creates the file in the file system when the first write is done.

dogpatch 06-01-2020 04:03 PM

In that case, create and lock the file. Then create a temporary filename, and right after writing data to it, unlock the permanent file and rename the temp file to its permanent name.

SoftSprocket 06-01-2020 04:03 PM

If you call open, immediately followed by unlink, the file is deleted when it is closed, i.e. when close is called or the program exits.

You might be able to use that behavior to achieve what you're after.

Skaperen 06-01-2020 06:19 PM

@dogpatch you mean have 2 files, one to lock the name (created first), and the other to write to and eventually replace the first? hmmm, that might work.

Skaperen 06-01-2020 06:26 PM

@SoftSprocket the unlink makes that name go away. while it's gone, some other process might open a file at that name. and something needs to be done to bring the file name back, and the only ways to do that either create an all-new empty file or reference an existing name.

SoftSprocket 06-02-2020 08:40 AM

Ah, quite so - I should have tested first. As my penance I scribbled something that does work:

Code:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>

char* reserve_fn;

/* atexit handler: remove the reserved file unless it has been claimed */
void delete_on_close (void) {
        if (reserve_fn != NULL) {
                printf ("atexit: removing %s\n", reserve_fn);
                if (unlink (reserve_fn) != 0) {
                        perror ("unlink");
                }
        }
}

/* signal handler: exit so that the atexit handler runs */
void on_interrupt (int sig) {
        (void) sig;
        exit (EXIT_FAILURE);
}

int main (void) {
        reserve_fn = "reserved.txt";

        /* reserve the name; the second argument of creat() is the mode */
        int fd = creat (reserve_fn, S_IRUSR | S_IWUSR);

        if (fd < 0) {
                perror ("creat");
                exit (EXIT_FAILURE);
        }

        signal (SIGINT, on_interrupt);
        atexit (delete_on_close);

        printf ("sleeping\n");
        sleep (10);

        printf ("Using fd\n");

        reserve_fn = NULL;      /* data is about to be written, so keep the file */

        write (fd, &fd, sizeof fd);

        close (fd);

        return 0;
}

Exit on interrupt or when exit is called and the file is removed - unless reserve_fn is NULL. Not exactly elegant but it does work.

Skaperen 06-02-2020 02:54 PM

what's the point of delete_on_close() when the final desire is to have the file? the whole point is to have the file never be empty. you need a way to block the name from being used while it doesn't exist.

SoftSprocket 06-02-2020 02:59 PM

Your original ask was to prevent the name from being used but delete the file if it isn't used. That is what the example will do. While the program is in the sleep you can interrupt the program and the file will be deleted. However, if you don't, the file will be preserved. Presuming real work is going on before the write, an exit (say, after a failed system call) will also remove the file. Did you try the program? If you don't interrupt during the sleep the file will be there.

Skaperen 06-04-2020 01:49 AM

what the kernel could do for this kind of open is keep an internal set of which object names are open with delayed creation. this can be included in the test for "already exists" so another process cannot open the name. but nothing is saved on disk yet. then, when the first write happens, the creation is completed. things like permissions need to be tested at open time. the write could fail due to running out of space. the file might be created empty if its parent directory does not need another block for it; that might need to be suppressed or reverted if the intent really is to avoid an empty file.

decades ago i worked on IBM mainframes. empty files were "impossible". i never was concerned about why or how enough to investigate (i did have the source code).

SoftSprocket 06-04-2020 07:51 AM

You might be able to use FUSE to do something like that. https://www.kernel.org/doc/html/late...tems/fuse.html. In addition to being used to write a filesystem in user space I think it can also be used as a filter of some sort.

I got my start on IBM mainframes. My main memory was the frustration of waiting while what I typed traveled down tie lines and back. Everything went through a switch that managed priorities, and I was the low rung. Mercifully they put me on PCs where, even with 9 inch floppies and no hard drives, the experience was far more satisfying.

Geist 06-05-2020 05:04 PM

Even if you don't have access to this library, you have access to your own code.
So if the library wants to write a file immediately, then simply don't call the library functions until you want to.
That should work, right?

Keep your own copy of whatever you want to ultimately write in memory and only call the library function when you want to.

Skaperen 06-05-2020 06:51 PM

Quote:

Originally Posted by Geist (Post 6131278)
Even if you don't have access to this library, you have access to your own code.
So if the library wants to write a file immediately, then simply don't call the library functions until you want to.
That should work, right?

Keep your own copy of whatever you want to ultimately write in memory and only call the library function when you want to.

when my code does, eventually, call that library code, there will still be a period of time between when it calls open() and when it calls the first write().

i'm thinking to create an empty file distinct to the target file, though not the same exact name as the target file, and lock on that file to avoid concurrently doing this part in my code. that will at least be a start. locks will need to be added to other code, too.
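
For what it's worth, a rough sketch of that lock-file idea using flock() (untested; the lock file name is just an example). flock() is advisory, so every cooperating process has to take the same lock around the open-to-first-write window:

Code:

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main (void) {
        /* a separate, always-existing lock file guards the target name */
        int lockfd = open ("target.lock", O_RDWR | O_CREAT, 0644);
        if (lockfd < 0) {
                perror ("open lock file");
                return 1;
        }
        if (flock (lockfd, LOCK_EX) != 0) {     /* blocks until we own the lock */
                perror ("flock");
                return 1;
        }

        /* ... call the library that opens and writes the target file here ... */

        flock (lockfd, LOCK_UN);
        close (lockfd);
        return 0;
}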

Geist 06-06-2020 12:23 AM

Quote:

Originally Posted by Skaperen (Post 6131301)
when my code does, eventually, call that library code, there will still be a period of time between when it calls open() and when it calls the first write().

i'm thinking to create an empty file distinct to the target file, though not the same exact name as the target file, and lock on that file to avoid concurrently doing this part in my code. that will at least be a start. locks will need to be added to other code, too.

I don't see how you will be able to do anything then.
This makes it seem like you do one call to the library, which then doesn't return until it has written the file; that entails opening the file first and then spending some amount of time on whatever operation it performs before writing.

I can't think of a reliable, non-crazy way to use a second thread to somehow bridge the time between when the library call starts (and creates the file) and when it finishes its prep work and writes to it.

There's so many variables.
Why does it even matter?

Security?
Well then encrypt the data first before you send it to the library and then decrypt it in a post processing step.

Skaperen 06-06-2020 12:59 PM

yeah, there's no hard, definitive solution under the I/O model Posix/Unix/Linux uses. there's only designing around the problem to be sure no issues can happen in the first place. and that might mean having some way to lock against double uses of the file before some event like the first write or file completion.

there is a related problem. it's related because these programs will often take all day to complete the file due to the nature of what they are doing. consider opening a file for appending. now consider 2 processes opening the very same file for appending.

Skaperen 06-12-2020 09:43 PM

if the kernel (of any POSIX OS) were changed to work as i described, what would break? any applications? any libraries? is it something that would be harder to implement had the kernel done it this way since its first creation?

EdGr 06-12-2020 11:13 PM

Zero-length files are valid and useful. They indicate the absence of something. The "test -s" command checks for non-zero-length files. I have used it many times.

I am still not sure why you want to avoid zero-length files.
Ed

Skaperen 06-14-2020 01:55 PM

zero length files have value where intended. i am trying to avoid them where they are not intended.

EdGr 06-14-2020 08:33 PM

We posted the solution: create a temporary file, write to it, and then rename the temporary file to the desired name.
Ed

chrism01 06-15-2020 01:20 AM

... and create it under /tmp or similar, i.e. somewhere that gets cleaned out by the OS if it gets rebooted.

GazL 06-15-2020 05:59 AM

...or, as I said back in post #9, open() it with O_TMPFILE: though that is a Linux specific solution and is not portable.

Skaperen 06-15-2020 06:17 PM

Quote:

Originally Posted by EdGr (Post 6134415)
We posted the solution: create a temporary file, write to it, and then rename the temporary file to the desired name.
Ed

the contrary case is something that runs for hours and hours collecting some statistical data that needs to be reviewed periodically all day. and this data needs to not be lost if the system reboots, eliminating the use of /tmp.

Skaperen 06-15-2020 06:17 PM

Quote:

Originally Posted by GazL (Post 6134518)
...or, as I said back in post #9, open() it with O_TMPFILE: though that is a Linux specific solution and is not portable.

you sure did. and i overlooked it thinking it was something different. that does look like a doable approach. i'll have to make sure there is a lock to prevent duplicate attempts. but this is certainly a new, good, perspective on the problem.

GazL 06-16-2020 06:35 AM

Well, it's an option, but after reading your later posts (#20 specifically), IMO you're approaching the problem from the wrong end. Catch SIGTERM/SIGINT and give it a handler that will remove an empty file before exiting; then you can simply rely on open(filename, O_CREAT|O_EXCL) semantics to provide serialisation.
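
A rough sketch of that suggestion (untested; names are examples, and the signal handling is kept deliberately simple). The exclusive create serialises the writers, and the cleanup removes the file only if it never received data:

Code:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static const char *path = "stats.out";
static off_t bytes_written = 0;

static void cleanup (void) {
        /* remove the file only if it never received any data */
        if (bytes_written == 0)
                unlink (path);
}

static void on_signal (int sig) {
        (void) sig;
        exit (EXIT_FAILURE);    /* runs cleanup() via atexit */
}

int main (void) {
        /* O_EXCL: fail if another process already owns the name */
        int fd = open (path, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0) {
                perror ("open");
                return 1;
        }
        atexit (cleanup);
        signal (SIGINT, on_signal);
        signal (SIGTERM, on_signal);

        /* ... long-running collection; on each write: */
        ssize_t n = write (fd, "data\n", 5);
        if (n > 0)
                bytes_written += n;

        close (fd);
        return 0;
}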

Skaperen 06-19-2020 07:54 PM

i'm trying to work out a way where an empty file doesn't matter. i can't reveal the code at that level.

