Redirecting stdin, stdout and stderr, and finding file names and other stats
I would like to have a general discussion about redirected I/O, file names, link handling and related topics in C programming.
If you have the time and interest to read this, and share the curiosity or have good answers to give, then please comment.
I'm working on an application used for backup and archiving. It can archive the contents of block devices and tapes as well as regular files. The application stores data in tightly packed, low-redundancy heaps, with multiple indexes pointing out uniquely stored (shared) fragments in the heap. It supports taking and reverting to snapshots of the total storage on several computers running different OSes, as well as simply archiving single files. It uses Hamming code diversity to defeat disk rot instead of RAID arrays, which have proven to become pretty much useless once the arrays climb past a few terabytes in size. It is intended to be a distributed CMS (content management system) for a diversity of platforms, with a focus on secure storage and archiving.
In doing this heap of work, I stumbled onto the topic of how to manage multiple files/devices/pipes/fifos with a common method.
Let's say I have a Unix shell tool that acts like gzip, cat, dd etc. in being able to pipe data between applications.
dd if=/dev/sda bs=1b | gzip -cq > my.sda.raw.gz
The tool can handle different files in a struct array, like:
I can then call the following function to get the important details of all the different files:
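A minimal sketch of what such a struct array and stat pass might look like. The names `file_entry` and `stat_entries` are hypothetical and not taken from the tool itself; the only real API used is POSIX `stat()`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical per-file record: the name given on the command line
 * plus whatever stat() reports for it. */
struct file_entry {
    const char *name;   /* path as given by the user */
    struct stat st;     /* filled in by stat_entries() */
    int valid;          /* 1 if st holds data, 0 if stat() failed */
};

/* Run stat() on every entry; returns how many entries succeeded. */
size_t stat_entries(struct file_entry *files, size_t count)
{
    size_t ok = 0;

    assert(files != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        files[i].valid = (stat(files[i].name, &files[i].st) == 0);
        if (files[i].valid)
            ok++;
    }
    return ok;
}
```

This works for anything that has a path (regular files, block devices, named fifos). It has nothing to hold on to for redirected stdio, which is where the trouble starts.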
The problem is handling redirected stdio with the same method. How do we get the name of a file that was "redirected into" standard input (stdin)?
There does not seem to be a quick, universal method for this.
One way to be able to specify redirection is to use the single dash argument.
Like: dd if=/dev/sda1 bs=1b | toolname - - > some.file
The first dash argument to the program "toolname" will indicate the input file to be piped in from stdin, and the second dash indicates the output file to be piped to via stdout.
The following code snippet will handle the dashes to represent stdin, stdout and also stderr, depending on the order they appear on the command line.
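A minimal sketch of that dash handling, assuming the simplest possible rule: the first "-" means stdin, the second stdout, the third stderr. The helper name `map_dash_args` is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: map each "-" argument to the next standard
 * descriptor in order of appearance (stdin, stdout, stderr).
 * fds[i] receives the descriptor for argv[i], or -1 if argv[i] is not
 * a dash or more than three dashes were given.
 * Returns the number of dashes that were mapped. */
int map_dash_args(int argc, char *const argv[], int fds[])
{
    static const int std_fds[] = { STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO };
    int used = 0;

    assert(argv != NULL && fds != NULL);
    for (int i = 0; i < argc; i++) {
        fds[i] = -1;
        if (strcmp(argv[i], "-") == 0 && used < 3)
            fds[i] = std_fds[used++];
    }
    return used;
}
```

For `toolname - - > some.file` this gives descriptor 0 for the first dash and 1 for the second. Every non-dash argument keeps -1 and would be opened by name as usual.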
It depends upon the assumption that, for example, the file /proc/self/fd/0 is exactly the same as the process stdin.
When performing stat() on /proc/self/fd/0 for the following command line: toolname - < /dev/sde1
Also available in the file space is /dev/stdin, which will lstat() like:
Using stat() instead to follow the symlink will give us:
So this method seems to work pretty well. But it doesn't feel like a robust method, since it has lots of dependencies that need to be carefully checked for.
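One way to make the dependencies explicit is to check them at run time. The sketch below is Linux-specific and assumes /proc is mounted. It reads the /proc/self/fd/<n> symlink and then confirms with stat() that the name still leads to the same device and inode as the open descriptor. The function name `fd_to_verified_path` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to recover a usable path for an open descriptor.
 * Returns 0 if a path was found AND it still refers to the same file as fd,
 * -1 otherwise (no /proc, pipe, socket, deleted or replaced file, ...). */
int fd_to_verified_path(int fd, char *buf, size_t buflen)
{
    char link[64];
    struct stat by_fd, by_path;
    ssize_t n;

    assert(buf != NULL && buflen > 1);

    if (fstat(fd, &by_fd) != 0)
        return -1;

    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    n = readlink(link, buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';              /* readlink() does not terminate the string */

    /* Pipes and sockets yield names like "pipe:[12345]", not paths. */
    if (buf[0] != '/')
        return -1;

    /* Do not trust the name blindly: check it leads to the same object. */
    if (stat(buf, &by_path) != 0)
        return -1;
    if (by_fd.st_dev != by_path.st_dev || by_fd.st_ino != by_path.st_ino)
        return -1;
    return 0;
}
```

When this returns -1, the tool can still work on the descriptor through fstat(). It just has no trustworthy name to record.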
The questions I am left with:
If you have any input on this, or some questions, then please don't hesitate to post in this thread.
To add something off-topic to the thread, here is a performance tip: when shuffling data between streams, avoid using some arbitrary record length (like 512 bytes). Use stat() to get the preferred I/O block size from st_blksize, and use copy buffers of that size to get good throughput in your programs.
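A minimal sketch of that tip, using fstat() on the input descriptor so it also works for redirected stdin. The function name `copy_with_blksize` and the 4096-byte fallback are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd using a buffer sized after the
 * input's st_blksize. Returns the number of bytes copied, or -1 on error. */
long long copy_with_blksize(int in_fd, int out_fd)
{
    struct stat st;
    size_t bufsize = 4096;      /* assumed fallback if st_blksize is unusable */
    long long total = 0;
    char *buf;
    ssize_t n;

    assert(in_fd >= 0 && out_fd >= 0);

    if (fstat(in_fd, &st) == 0 && st.st_blksize > 0)
        bufsize = (size_t)st.st_blksize;

    buf = malloc(bufsize);
    if (buf == NULL)
        return -1;

    while ((n = read(in_fd, buf, bufsize)) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry the read */
            free(buf);
            return -1;
        }
        /* write() may accept fewer bytes than asked for, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;   /* retry the write */
                free(buf);
                return -1;
            }
            off += w;
        }
        total += n;
    }
    free(buf);
    return total;
}
```

The sketch only looks at the input side. A tool that writes to a device with a larger preferred block size might want to take the larger of the two values.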
Very dangermouse, I would say.
Redirection has nothing to do with the OS.
It's something that the shell does.
If I were you, I would bar redirection.
Use stat(2) with S_ISREG() or !S_ISFIFO(), or some such.
Inodes cannot be trusted.
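The stat-based check suggested in this reply could be sketched as follows, using fstat() and the POSIX file type macros on the descriptor itself. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical classification of what a descriptor is attached to. */
enum fd_kind {
    FD_KIND_ERROR,      /* fstat() failed, e.g. descriptor not open */
    FD_KIND_REGULAR,    /* regular file */
    FD_KIND_PIPE,       /* pipe or fifo */
    FD_KIND_CHAR,       /* character device (terminal, tape, /dev/null) */
    FD_KIND_BLOCK,      /* block device */
    FD_KIND_OTHER       /* socket, directory, ... */
};

/* Classify fd and leave the full stat data in *st for the caller. */
enum fd_kind classify_fd(int fd, struct stat *st)
{
    assert(st != NULL);
    if (fstat(fd, st) != 0)
        return FD_KIND_ERROR;
    if (S_ISREG(st->st_mode))
        return FD_KIND_REGULAR;
    if (S_ISFIFO(st->st_mode))
        return FD_KIND_PIPE;
    if (S_ISCHR(st->st_mode))
        return FD_KIND_CHAR;
    if (S_ISBLK(st->st_mode))
        return FD_KIND_BLOCK;
    return FD_KIND_OTHER;
}

/* Example policy: accept stdin only when it is a regular file. */
int stdin_is_regular_file(void)
{
    struct stat st;
    return classify_fd(STDIN_FILENO, &st) == FD_KIND_REGULAR;
}
```

This tells the tool what kind of object sits behind the descriptor, which is enough to refuse pipes. It says nothing about the object's name.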
I was thinking maybe you should create a library similar to Boost that can run on many systems and does the complex task you need. As for how to do it on other systems, I honestly don't have any idea yet. Hey, maybe you can find something in the source code of Boost? Take note of the license first, by the way. Maybe the idea is free to copy and just not the source.
Not everybody may know that if one happens to push random binary data to stdout, eventually something random comes back to stdin from the terminal and may do horrendous things, like recapitulating interesting commands such as rm -rf /* or mkdosfs :) from the history file.
But I would rather have redirection easily available with < > | and such, than have people doing more or less creative things to achieve redirection anyhow, perhaps in a less reliable manner.
In VMS, piping commands is not supported by default. Still, I've used redirection and even command piping a lot when working with VMS. Where there is a will, there is a way to do it.
Say one adds a check to the program so that binary dumps never go to a terminal. Still, someone can do mkfifo f1 ; dd if=/dev/sdf2 | gzip -cq > f1 & sleep 20 ; cat f1 in a remote shell, and then try to upload a disk dump through a remote login terminal session log file, not knowing what satanic rituals the random data may trigger in the tty. :)
But seriously - maybe this behaviour is just what we want?
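The terminal check mentioned above could be sketched with POSIX isatty(). The function name `binary_output_allowed` is hypothetical. As the fifo example shows, this only catches the direct case where the tool's own output descriptor is a terminal.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if it is reasonable to write binary data to fd,
 * 0 if fd is attached to a terminal. */
int binary_output_allowed(int fd)
{
    assert(fd >= 0);
    if (isatty(fd)) {
        fprintf(stderr, "refusing to write binary data to a terminal\n");
        return 0;
    }
    return 1;
}
```

A tool like this would typically call `binary_output_allowed(STDOUT_FILENO)` once before it starts writing, and exit with an error if the answer is 0.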
But a backup system must be faithful to the directory hierarchy / file names, and cannot expect the inode numbers to remain the same over time.
In fact, the most essential thing about backup is saving the actual content of the files. Never mind the directories, file names or ACLs.
With thrashed directories, but fully recovered file contents, one can often recover the directory structure enough from memory and looking at older backups and investigating contents of files.
But when you lose a bunch of blocks at the beginning of a zip file, it does not matter that you still have the file name, size and directory hierarchy. One will need a CUDA cluster worthy of SETI or the NSA to brute-force recover the zip file from the remaining contents and CRC. :)
Trusting the enumeration:
Another example of things not to trust, besides inode mapping, is how to keep track of your disk partitions over boots, and when disks are added/removed/repartitioned or when boot order changes in BIOS.
Say you have a tool that does incremental image backups of your disks. It then needs to keep track of the enumeration of disks/partitions in your system. This may change if you change settings in the BIOS, upgrade the BIOS, upgrade the kernel or make some other change to the system.
Since your SATA controller has two SATA3 ports, which happen to enumerate as /dev/sdc and /dev/sdd, you decide to have your system root in the partition /dev/sdc1, the swap partition in /dev/sdd1, the home partition mirrored over /dev/sd[ab]1 and a RAID 5 array on partition 2 of all four drives, and it works perfectly like that for some months.
Then one of the disks in the RAID array starts to go bad and you buy another drive to recover. You hot-add your new disk, since you dare not reboot with a failed RAID. The new disk enumerates as /dev/sde, and you happily add it to your RAID array and let the array resync.
Finally, when you got the raid up and running perfectly, you reboot...
...Grub gives a sneaky comment that it can not find your root partition...
XYZZY - Nothing happens :eek:
It turns out your new drive now enumerates as /dev/sdc since it is on one of the SATA2 ports, that BIOS decided to enumerate before the two SATA3 ports. And your root partition has automagically moved to /dev/sdd.
Even worse if your raid is not set up with superblocks. Then the files on the raid disk may, if you are the slightest bit unlucky, soon become as organised as a bowl of cornflakes.
Of course, using superblocks in RAID arrays, mounting everything by UUID instead of device name, putting an MBR on all disks, and telling GRUB to boot from the partition with the matching UUID of your root partition will counteract this problem.
A caveat of using UUIDs and volume identifiers is that you might end up with two drives/partitions with identical volume IDs or UUIDs if you clone disks/partitions, or hot-add, for example, a USB disk.
So a backup tool should look at the actual contents of the data, and build indexes around the data to reflect the enumerations of drives/partitions, instead of relying on the enumerations and ending up making huge differential backups when the order of drives/partitions changes.
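As a rough illustration of indexing by content rather than by enumeration, a block can be keyed by a fingerprint of its bytes. The sketch below uses 64-bit FNV-1a purely as a simple stand-in. It is not the scheme used in the application, and a real deduplicating store would likely want a cryptographic hash to keep collisions out of the picture. The function name `block_fingerprint` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over a block of data. The same bytes always give the same
 * key, no matter which device name the block was read from. */
uint64_t block_fingerprint(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a 64-bit offset basis */

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV 64-bit prime */
    }
    return h;
}
```

With keys like this, a disk that moves from /dev/sdc to /dev/sdd still produces the same fingerprints for unchanged blocks, so the next incremental backup only has to record the new mapping.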
We can make the assumption that on most newer Linux installations we will find descriptors for stdin, stdout and stderr in the proc file system (/proc/self/fd/) or in /dev/stdin etc. Still, it is there just because somebody said it should be so, and it might well not always be the case.
Some people work on obsoleting the proc file system... Some installations of Linux use udev, while others don't... And people tend to make their own device type mappings, despite the effort there is to keep it at least half consistent. One recent example is that someone decided to start presenting ATA drives as sd devices instead of continuing to use the hd driver.
One simply can't trust the configuration! This is, in my humble opinion, one of the bigger threats to Linux. There should be an even harder effort to maintain standards.
Making an application for Linux that needs to dig the slightest bit deeper, and that will work on all installations, is becoming a challenge. I guess, for example, that the developers at VMware will concur with this. ;)
And I also want to provide commercial tools for recovery/backup and archiving aimed at the MS platforms.
I have decided to avoid C++ as much as possible for speed and reliability, and stick to C. In the good old spirit of both Linux as well as "embedded software development".
Still, it is quite feasible to produce C++ that doesn't "inherit" bad behaviour, fragment the memory, or use 90% of the resources just to keep the GUI up. :)
So it might be that when the lower-level routines have stabilized, I will make a C++ library of it, or add it to some open software library. Could be Boost.
Looking at other libraries to try sticking to the convention, and possibly making one's routines portable, is a good idea.
But if you take a look at the source code of, for example, gzip, you will find that because it must be as portable as possible, it is using a very old standard of C. So it can compile on DOS, RT11, VMS... as well as for Linux and Windows.
(Having in mind what i posted before on maintaining standard). ;)