Comparing Directories

hawkfan50 · 05-14-2012, 04:58 PM

So this program reads in 2 directories and then outputs which files are NEW and which have been MODIFIED. I can't figure out how to tell the difference between a subdirectory and a file in the modified directory. As it stands right now, if there's a subdirectory in moddir, it'll treat it like a file and just call it NEW. But it's not supposed to do that. If the -R flag is on, then the program should recurse through the subdirectories and display the corresponding information. Any suggestions would be great. I'm stumped.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/stat.h>
#include <string.h>


int main(int argc, char ** argv)
{

	int recurse = 0;	
	int diff = 0;		
	char * basedir = NULL;
	char * moddir = NULL;	

	int i;
	for (i = 1; i < argc; i++)
	{
		if (argv[i][0] == '-')
		{
			if (strcmp(argv[i], "-R") == 0)
			{
				recurse = 1;
			}
			else if (strcmp(argv[i], "-D") == 0)
			{
				diff = 1;
			}
			else
			{
				printf("error: invalid option %s\n", argv[i]);
				exit(-1);
			}
		}
		else
		{
			if (basedir == NULL)
			{
				basedir = argv[i];
			}
			else if (moddir == NULL)
			{
				moddir = argv[i];
			}
			else
			{
				printf("error: invalid argument %s\n", argv[i]);
				exit(-2);
			}
		}
	}

	if ((basedir == NULL) || (moddir == NULL))
	{
		printf("error: must pass both basedir and moddir\n");
		exit(-3);
	}

	DIR * moddir_contents = opendir(moddir);
	if (moddir_contents != NULL)
	{
		struct dirent * moddir_entry = readdir(moddir_contents);
		while (moddir_entry != NULL)
		{

			DIR * basedir_contents = opendir(basedir);
			if (basedir_contents != NULL)
			{
				int matched = 0;	
			
				struct dirent * basedir_entry = readdir(basedir_contents);
				while (basedir_entry != NULL)
				{

					if (strcmp((*moddir_entry).d_name, (*basedir_entry).d_name) == 0)
					{
						matched = 1;
						
						struct stat buf;

						time_t mod_stamp;
						char modfilename[1024];
						strcpy(modfilename, moddir);
						strcat(modfilename, "/");
						strcat(modfilename, (*moddir_entry).d_name);
						if (stat(modfilename, &buf) == 0)
						{
							mod_stamp = buf.st_mtime;
						}
						else
						{
							printf("error: failed to get modified time for file %s\n", (*moddir_entry).d_name);
			
							exit(-5);
						}

						time_t base_stamp;
						char basefilename[1024];
						strcpy(basefilename, basedir);
						strcat(basefilename, "/");
						strcat(basefilename, (*basedir_entry).d_name);
						if (stat(basefilename, &buf) == 0)
						{
							base_stamp = buf.st_mtime;
						}
						else
						{
							printf("error: failed to get modified time for file %s\n", (*basedir_entry).d_name);
							exit(-5);
						}

						if (mod_stamp > base_stamp)
						{
							printf("%s MODIFIED\n", (*moddir_entry).d_name);
						}
					}

					basedir_entry = readdir(basedir_contents);
				}

				if (matched == 0)
				{
					printf("%s NEW\n", (*moddir_entry).d_name);
				}

				closedir(basedir_contents);
			}
			else
			{
				printf("error: failed to open basedir %s\n", basedir);
	
				exit(-5);
			}

			moddir_entry = readdir(moddir_contents);
		}

		closedir(moddir_contents);
	}
	else
	{
		printf("error: failed to open moddir %s\n", moddir);
		exit(-4);
	}

	return 0;
}

Nominal Animal · 05-14-2012, 07:38 PM

Your general approach is faulty. You cannot recurse into subdirectories using a loop alone. Also, rescanning one directory for every entry in the other is not only slow, but incomplete: what about files that exist only in the directory you keep rescanning?

Instead, I would recommend you write a function that compares the contents of two directories. For example, something along the lines of

Code:

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <errno.h>

struct list *compare(const char *const directory1,
                     const char *const directory2,
                     const int         flags)
{
    /* ... */
}

You will need to scan both directory1 and directory2 once. Whichever you scan first, you'll find entries that exist in that one alone, and in both. (Just try to get statistics for an entry by that name, under both directories.) Whichever you scan last, you are concerned only with the entries that do not exist in the first; remember, you already found all those.

When recursing into directories, you'll need to construct the new directory names, since the directory entries only contain the final segment (name), not the full path. Here is a helper function that returns the full path needed, as a dynamically allocated string:

Code:

#include <stdlib.h>
#include <string.h>
#include <errno.h>

char *pathto(const char *const dir, const char *const name)
{
    const size_t dirlen = (dir) ? strlen(dir) : 0;
    const size_t namelen = (name) ? strlen(name) : 0;
    const size_t size = dirlen + namelen + 2;
    size_t       len;
    char        *path;

    if (size < 3) {
        errno = EINVAL;
        return NULL;
    }

    path = malloc(size);
    if (!path) {
        errno = ENOMEM;
        return NULL;
    }

    len = dirlen;
    if (dirlen > 0) {
        memcpy(path, dir, dirlen);
        if (path[len-1] != '/')
            path[len++] = '/';
    }
    if (namelen > 0) {
        memcpy(path + len, name, namelen);
        len += namelen;
    }
    path[len] = '\0';

    return path;
}

After the recursive call has been done, using the dynamically allocated directory names, you need to free() them explicitly.
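
As a rough illustration of that build, recurse, free pattern, here is a small self-contained sketch. To keep it standalone it uses its own simplified join_path() instead of pathto(), and the walk() function with its file-counting return value is a hypothetical example, not part of the comparison program. It uses lstat() so that a symbolic link to a directory is not followed, which is my choice to avoid loops; use stat() if you want links followed.

Code:

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Simplified stand-in for a path helper: returns malloc()ed "dir/name". */
static char *join_path(const char *dir, const char *name)
{
    const size_t dirlen = strlen(dir);
    const size_t namelen = strlen(name);
    char        *path = malloc(dirlen + namelen + 2);

    if (!path)
        return NULL;
    memcpy(path, dir, dirlen);
    path[dirlen] = '/';
    memcpy(path + dirlen + 1, name, namelen + 1); /* copies the '\0' too */
    return path;
}

/* Print every non-directory entry below dir, recursing into
 * subdirectories. Returns the number of entries printed,
 * or -1 if dir itself cannot be opened or memory runs out. */
long walk(const char *dir)
{
    DIR           *d = opendir(dir);
    struct dirent *e;
    long           count = 0;

    if (!d)
        return -1;

    while ((e = readdir(d)) != NULL) {
        struct stat st;
        char       *path;

        /* Never recurse into the directory itself or its parent. */
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;

        path = join_path(dir, e->d_name);
        if (!path) {
            closedir(d);
            return -1;
        }

        if (lstat(path, &st) == 0) {
            if (S_ISDIR(st.st_mode)) {
                long sub = walk(path);   /* recursive call uses the new path */
                if (sub > 0)
                    count += sub;
            } else {
                printf("%s\n", path);
                count++;
            }
        }

        free(path);                      /* done with it: release explicitly */
    }

    closedir(d);
    return count;
}
```

Each level of recursion holds one open DIR handle, so very deep trees can run into the per-process descriptor limit.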
_ _ _ _ _ _ _ _ _ _

If this is part of a library or real application code, there is one deeply technical issue I'd like to bring up.

When handling directory structures, directories and files may be renamed at any point. Because of that, it is recommended that instead of relying solely on paths, applications should retain a descriptor to the directory, and use the fstatat(dirfd, name...) function (POSIX.1-2008, i.e. #define _POSIX_C_SOURCE 200809L). The descriptor will stay valid, even if the name of the underlying directory happened to change.
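
A minimal sketch of what that looks like, assuming you already hold a descriptor to the directory (for example from open() with O_DIRECTORY, or from dirfd() on an open DIR handle). The function name entry_info_at() and its return convention are made up for this example.

Code:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/* Look up `name` relative to the open directory descriptor dirfd.
 * Returns 1 if it is a directory, 0 if it is something else,
 * and -1 (with errno set by fstatat) if it cannot be examined.
 * If mtime is not NULL, the modification time is stored there. */
int entry_info_at(int dirfd, const char *name, time_t *mtime)
{
    struct stat st;

    /* AT_SYMLINK_NOFOLLOW: describe a symlink itself, not its target. */
    if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) == -1)
        return -1;

    if (mtime)
        *mtime = st.st_mtime;

    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

Because the lookup starts from the descriptor rather than from a path string, it keeps working even if somebody renames the directory between your opening it and calling this function.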

I understand that this is something that is totally new and alien to programmers with only Windows experience. Let me assure you: directory names are surprisingly volatile (and for useful reasons) in other OSes. Do not let Windows-isms drag down the quality of your code. (In Linux and BSDs and derivatives, you can rename or delete open files, even executables. I've found that many with only Windows experience expect having the file open be some kind of lock that should forbid such actions. I've never understood the reasoning for that.)

To open a directory for the purpose of scanning its contents, you can use

Code:

#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

DIR *opendirat(int dirfd, const char *pathname)
{
    DIR *dir;
    int  fd, result, saved_errno;

    saved_errno = errno;

    do {
#ifdef O_NOCTTY
#ifdef O_DIRECTORY
        fd = openat(dirfd, pathname, O_RDONLY | O_DIRECTORY | O_NOCTTY);
#else
        fd = openat(dirfd, pathname, O_RDONLY | O_NOCTTY);
#endif
#else
#ifdef O_DIRECTORY
        fd = openat(dirfd, pathname, O_RDONLY | O_DIRECTORY);
#else
        fd = openat(dirfd, pathname, O_RDONLY);
#endif
#endif
    } while (fd == -1 && errno == EINTR);
    if (fd == -1)
        return NULL;

    do {
        dir = fdopendir(fd);
    } while (!dir && errno == EINTR);
    if (!dir) {
        saved_errno = errno;
        do {
            result = close(fd);
        } while (result == -1 && errno == EINTR);
        errno = saved_errno;
        return NULL;
    }

    /* fd is now incorporated into the dir handle;
     * it will be closed when the dir is closed.
    */

    errno = saved_errno;
    return dir;
}

In the general case, a process can acquire a descriptor to a directory it cannot read, as long as it can enter it. In this case, that is not necessary, because you cannot get the listing from such a directory anyway. Therefore the above simplified version is perfectly adequate, but only for when you need to scan the contents of that directory. It is not sufficient if you don't necessarily need to get a listing, but only need to enter said directory; although that sounds even simpler, for that you do need the general, rather complex version of the function.

In the most general case, an opendirat() implementation requires a child process entering the desired directory, passing the descriptor back via an ancillary socket message, because otherwise all other threads and signal handlers (and library code) would see the current working directory flipping back and forth while the code is running. Because the current working directory is process-wide, you really need to use a separate process to enter the new directory, then pass back a reference to it. Fortunately, this mess can almost always be avoided. For example, you can use a mutex or an rwlock to protect any access that has to do with the current working directory.

The difference between the general case and the above implementation is that the above implementation will only work if the current user has read rights to the directory (i.e. can see the directory listing).