I would like to have a general discussion about redirected I/O, file names, link handling and related topics in C programming.
If you have the time and interest to read this, and share the curiosity or have good answers to give, then please comment.
Intro:
I'm working on an application used for backup/archiving. It can archive the contents of block devices and tapes as well as regular files. The application stores data in tightly packed, low-redundancy heaps, with multiple indexes pointing to the uniquely stored (shared) fragments in the heap. The application supports taking and reverting to snapshots of the total storage on several computers running different operating systems, as well as simply archiving single files. It uses Hamming code diversity to defeat disk rot, instead of RAID arrays, which have proven to become pretty much useless once the arrays grow beyond a few terabytes in size. It is intended to be a distributed CMS (content management system) for a diversity of platforms, with a focus on secure storage/archiving.
In doing this heap of work, I stumbled into the topic of how to manage multiple files/devices/pipes/fifos with a common method.
Let's say I have a Unix shell tool that acts like gzip, cat, dd etc. in being able to pipe data between applications.
Example:
dd if=/dev/sda bs=1b | gzip -cq > my.sda.raw.gz
The tool can handle different files in a struct array, like:
Code:
#include <stdio.h>        // FILE, FILENAME_MAX
#include <sys/types.h>    // off_t
#include <sys/stat.h>     // struct stat
#include <linux/hdreg.h>  // struct hd_geometry

enum FilesOpenStatusValue {
    FileIsClosed = 0,
    FileIsOpen,
    FileIsFopen,
    FileIsPopen
};
// Struct declarations
//
// Definition of FileRecord
typedef struct FileRecord_t {
    char name[FILENAME_MAX];       // File name
    int open;                      // Does the file exist? Is the file opened? And how is the file opened?
    union FileHandle_u {
        FILE * stream;             // Stream pointer
        int handle;                // File handle
    } f;
    int flags;                     // File open flags
    int statStatus;                // Does the file exist? Or what is the errno from stat?
    struct stat stat;              // File statistics struct
    int encoding;                  // File encoding bitmap
    struct hd_geometry BiosGeom;   // Geometry as reported from BIOS if file is a block device
    off_t sectorSize;              // Sector size on device if file is a block device
    off_t sectors;                 // Number of sectors on device if file is a block device
    int readahead;                 // Readahead value if file is a block device. (BLKRAGET)
} FileRecord;
The above struct can describe every kind of "file" the tool operates on: regular files, fifos and pipes such as stdin, stdout and stderr (redirected or not), as well as threaded I/O like, for example, popen() threads.
I can then call the following function to get the important details of all the different files:
Code:
// Needs <errno.h>, <sys/ioctl.h>, <linux/fs.h> (BLK* ioctls) and <fcntl.h>
// with _GNU_SOURCE defined before the includes (O_NOATIME).
int FilesStatFromName(struct FileRecord_t * fr, int num, int nodereference)
{
    int tmpflags;
    int c;
    int n;
    int s;

    n=0;
    for (c=0; c<num; c++) {
        fr[c].encoding = FileNameSuffixType(fr[c].name);
        if (nodereference) {
            // Do not follow symlinks
            s = lstat(fr[c].name, &fr[c].stat);
        } else {
            // Follow symlinks
            s = stat(fr[c].name, &fr[c].stat);
        }
        if (s == -1) {
            fr[c].statStatus=errno;
            continue;
        }
        fr[c].statStatus=0; // Stat was good
        // If it is a block device...
        if (S_ISBLK(fr[c].stat.st_mode)) {
            int ssz = 0;            // BLKSSZGET stores an int, not an off_t
            unsigned long nblk = 0; // BLKGETSIZE stores an unsigned long
            long ra = 0;            // BLKRAGET: a long leaves room if the kernel stores a long rather than an int
            int err = 0;

            // Need to temporarily open the block device file to get extended stats
            tmpflags = fr[c].flags;
            fr[c].flags = O_RDONLY | O_CLOEXEC | O_NOATIME;
            fr[c].f.handle = OpenBlock(fr[c].name,fr[c].flags);
            fr[c].flags = tmpflags;
            if (fr[c].f.handle == -1) {
                fr[c].statStatus=errno;
                continue;
            }
            if (ioctl(fr[c].f.handle,BLKSSZGET,&ssz) == -1) {
                err = errno; // could not get disk sector size
            } else if (ioctl(fr[c].f.handle,BLKGETSIZE,&nblk) == -1) {
                err = errno; // could not get disk size in number of 512 byte blocks
            } else if (ssz <= 0 || (ssz > 512 ? ssz % 512 : 512 % ssz) != 0) {
                // Sector size is neither an integer multiple nor an integer divisor of 512. This is unsupported.
                err = EINVAL;
            } else {
                fr[c].sectorSize = ssz;
                // Correct the count of 512 byte blocks for the actual sector size
                if (ssz >= 512) {
                    fr[c].sectors = (off_t)(nblk / (unsigned long)(ssz/512));
                } else {
                    fr[c].sectors = (off_t)nblk * (512/ssz);
                }
                // Since stat reports st_size of block devices to be zero, we
                // need to update st_size with correct value for the block device.
                fr[c].stat.st_size = fr[c].sectorSize * fr[c].sectors;
                /* Get readahead */
                if (ioctl(fr[c].f.handle,BLKRAGET,&ra) == -1) {
                    ra = 0;
                }
                fr[c].readahead = (int)ra;
            }
            // Always close the temporary descriptor, also on the error paths
            if (CloseBlock(fr[c].f.handle) == -1 && err == 0) {
                err = errno;
            }
            if (err != 0) {
                fr[c].statStatus=err;
                continue;
            }
        }
        ++n;
    }
    return n;
}
Simply calling FilesStatFromName(TheFiles, NumberOfFiles, 0); will give all the necessary stats for all the files in the array.
The problem comes when handling redirected stdio with the same method. How do we get the name of a file that was "redirected into" standard input (stdin)?
There simply does not seem to be a quick and universal method for this.
One way to specify redirection is to use the single dash argument.
Like: dd if=/dev/sda1 bs=1b | toolname - - > some.file
The first dash argument to the program "toolname" indicates that the input file is piped in from stdin, and the second dash indicates that the output file is piped out via stdout.
The following code snippet handles the dashes so that they represent stdin, stdout and stderr, in the order they appear on the command line.
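The name may be hard to get, but the stats themselves do not need a name: fstat() works directly on an open descriptor such as STDIN_FILENO. As a minimal sketch (FdKind is a hypothetical helper, not something from my tool), this classifies whatever was redirected into a descriptor:

```c
#include <sys/stat.h>

/* Hypothetical helper: classify an open descriptor without needing a name. */
const char *FdKind(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return "unknown";
    if (S_ISREG(st.st_mode))
        return "regular file";
    if (S_ISBLK(st.st_mode))
        return "block device";
    if (S_ISCHR(st.st_mode))
        return "character device";
    if (S_ISFIFO(st.st_mode))
        return "fifo or pipe";
    if (S_ISDIR(st.st_mode))
        return "directory";
    return "other";
}
```

For example, FdKind(STDIN_FILENO) should say "block device" for toolname < /dev/sde1 and "fifo or pipe" when the input comes from dd through a pipe.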
Code:
int stdiocnt = 0;
TotalFiles = 0;
while (optind < argc) {
    strcat(cmdstr,argv[optind]);
    strcat(cmdstr,"\n");
    if (strcmp(argv[optind],"-") == 0) {
        // Special handling of redirect for stdin, stdout and stderr with a dash.
        switch (stdiocnt) {
            case 0 : // stdin
                strncpy(FileRec[TotalFiles].name,"/proc/self/fd/0",FILENAME_MAX);
                ++stdiocnt;
                break;
            case 1 : // stdout
                strncpy(FileRec[TotalFiles].name,"/proc/self/fd/1",FILENAME_MAX);
                ++stdiocnt;
                break;
            case 2 : // stderr
                strncpy(FileRec[TotalFiles].name,"/proc/self/fd/2",FILENAME_MAX);
                ++stdiocnt; // so that a fourth dash reaches the default case
                break;
            default:
                usage("More than three - specified. Have only stdin, stdout and stderr to redirect");
                break;
        }
    } else {
        // Default, if no redirect with a dash.
        strncpy(FileRec[TotalFiles].name,argv[optind],FILENAME_MAX);
    }
    FileRec[TotalFiles].name[FILENAME_MAX-1] = '\0'; // strncpy does not terminate an over-long name
    FilesStatFromName(&FileRec[TotalFiles], 1, CmdPar.nodereference);
    ++TotalFiles;
    ++optind;
}
The above code is very much dependent on the platform being Linux, and it will not work on every Linux system either, since it needs /proc to be mounted.
It depends upon the assumption that, for example, the file /proc/self/fd/0 is exactly the same as the process stdin.
When performing stat() on /proc/self/fd/0 for the following command line: toolname - < /dev/sde1
I get:
Code:
File stat info for /proc/self/fd/0:
Device Id: 15
Inode: 1601
Mode: 24992, block device
Permissions: 640
Hard links: 1
UID/GID: 0/6
Rdev: 2113
Size: 640132383744
Block size: 4096
Blocks: 0
Sector size: 512
Sectors: 1250258562
Readahead: 256
Atime: Fri Aug 6 23:13:46 2010
Mtime: Tue Jul 13 21:24:10 2010
Ctime: Tue Jul 13 21:24:18 2010
Stat status: 0, (Success)
If I specify not to follow symlinks, and thus use lstat(), like: toolname --no-dereference - < /dev/sde1
I get:
Code:
File stat info for /proc/self/fd/0:
Device Id: 3
Inode: 7870887
Mode: 41280, Symbolic link
Permissions: 500
Hard links: 1
UID/GID: 0/0
Rdev: 0
Size: 64
Block size: 1024
Blocks: 0
Sector size: 0
Sectors: 0
Readahead: 0
Atime: Sun Aug 8 23:16:32 2010
Mtime: Sun Aug 8 23:16:32 2010
Ctime: Sun Aug 8 23:16:32 2010
Stat status: 0, (Success)
Symbolic link points to: /dev/sde1
Inode 7870887 is verified to be that process's stdin stream. So /proc/self/fd/0 is a link to what got redirected to stdin.
Also available in the file space is /dev/stdin, which will lstat() like:
Code:
File stat info for /dev/stdin:
Device Id: 15
Inode: 2481
Mode: 41471, Symbolic link
Permissions: 777
Hard links: 1
UID/GID: 0/0
Rdev: 0
Size: 15
Block size: 4096
Blocks: 0
Sector size: 0
Sectors: 0
Readahead: 0
Atime: Sun Aug 8 21:35:56 2010
Mtime: Tue Jul 13 21:24:17 2010
Ctime: Tue Jul 13 21:24:17 2010
Stat status: 0, (Success)
Symbolic link points to: /proc/self/fd/0
Please note that using:
Code:
if ((fr[c].stat.st_mode & S_IFMT) == S_IFLNK) {
    char linkname[FILENAME_MAX];
    ssize_t linklen;
    // readlink() does not NUL-terminate, so leave room for the terminator
    if ((linklen = readlink(fr[c].name, linkname, sizeof(linkname) - 1)) != -1) {
        linkname[linklen] = '\0';
        ConsoleLogPrintf(" Symbolic link points to: %s\n", linkname);
    }
}
...shows us that /dev/stdin is a link that points to /proc/self/fd/0
Using stat() instead to follow the symlink will give us:
Code:
File stat info for /dev/stdin:
Device Id: 15
Inode: 1601
Mode: 24992, block device
Permissions: 640
Hard links: 1
UID/GID: 0/6
Rdev: 2113
Size: 640132383744
Block size: 4096
Blocks: 0
Sector size: 512
Sectors: 1250258562
Readahead: 256
Atime: Fri Aug 6 23:13:46 2010
Mtime: Tue Jul 13 21:24:10 2010
Ctime: Tue Jul 13 21:24:18 2010
Stat status: 0, (Success)
Aha! Inode is 1601, which we recognize as the disk partition we redirected to stdin above.
So this method seems to work pretty well. But it doesn't feel like a robust method, since it has lots of dependencies that need to be carefully checked for.
The questions I am left with:
- Is there a better way to do this?
- What about portability?
- Is there a better way of getting the file name of the redirected file (respecting the fact that a redirection pipe may not have a file name at all)?
- Should I work with inodes instead, and then take a completely different approach when porting to non-Unix platforms?
- Why isn't there a system call like get_filename(stdin); ?

If you have any input on this, or some questions, then please don't hesitate to post in this thread.
To add some offtopic to the thread - Here is a performance tip: When doing data shuffling on streams one should avoid just using some arbitrary record length, (like 512 bytes). Use stat() to get the recommended block size in stat.st_blksize and use copy buffers of that size to get optimal throughput in your programs.
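A minimal sketch of that tip, assuming the block size of the input descriptor is the one to follow (one could just as well look at the output side). CopyFd is a hypothetical helper name and the 4096 fallback is an arbitrary choice of mine.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

/* Hypothetical helper: copy everything from in to out using a buffer of
 * st_blksize bytes. Returns the number of bytes copied, or -1 on error. */
long long CopyFd(int in, int out)
{
    struct stat st;
    size_t bufsize = 4096;      /* fallback if fstat gives nothing useful */
    long long total = 0;
    ssize_t got;
    char *buf;

    if (fstat(in, &st) == 0 && st.st_blksize > 0)
        bufsize = (size_t)st.st_blksize;
    if ((buf = malloc(bufsize)) == NULL)
        return -1;
    while ((got = read(in, buf, bufsize)) != 0) {
        ssize_t done = 0;

        if (got == -1) {
            if (errno == EINTR)
                continue;       /* interrupted, try the read again */
            total = -1;
            break;
        }
        while (done < got) {    /* write() may write less than asked for */
            ssize_t put = write(out, buf + done, (size_t)(got - done));
            if (put == -1) {
                if (errno == EINTR)
                    continue;
                total = -1;
                break;
            }
            done += put;
        }
        if (total == -1)
            break;
        total += got;
    }
    free(buf);
    return total;
}
```

A cat-like filter then becomes a single call: CopyFd(STDIN_FILENO, STDOUT_FILENO).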