[SOLVED] is there even a way to speed up file loading into a sorted linked list?
I guess this is what one could call a sorted linked list.

I've got 596,221 files, it has already been over an hour and a half, and it is still loading them into a linked list that sorts them alphabetically using strcmp.

I've googled this question but got nothing. I think it is limited by hardware speed and having to check every file first. I was just wondering if there is a quicker way of doing this.
I did search, as I posted.
What key words did you use?
The search terms I used are visible in the links I provided, for those who will follow them.
But what is actually relevant is: what searches have you done, and what have you found and tried before posting? Once more you did not include that relevant information in your post.
As I have asked repeatedly, please end this pattern of asking others to perform your basic work for you, or please consider finding help elsewhere, per the LQ posting guidelines to which I again refer you.
Do you want to speed up file loading, or sorting, or both? Do you want to sort by filename, by size, by content, or what?
If you sort by names, you do not need to load those files into memory at all. How big are those 596,221 files altogether? Will that fit into RAM at all?
You gave almost no information in your original post...
I would suggest that you take a bit of time to consider the feedback of both astrogeek and pan64. My thinking is that you have some experience with LQ, given your high post count. And given that you've started a number of programming threads as of late, and been given feedback that your questions are written either confusingly or incompletely, it would seem to make sense to put the effort forward to improve the quality of how you ask.
Think of it in the other direction. You yourself have offered advice to other LQ members. We're happy to offer advice on your programming questions. We do realize that you are progressing very rapidly, being tenacious in attacking your problems, and working very diligently toward solutions. Perhaps a better tack might be, when you find something you don't understand, to work more at it, as you obviously are doing, but not to post an LQ thread just yet, until you have considered all your options first.
What you are doing is potentially frustrating those who read your threads. First, they see a fast-changing thread, and second, they see very unclear or incomplete information.
Ask yourself how you would react to another member doing this same thing. What Linux talent are you an accomplished expert at? Say it were regex, and you saw a regex question. But then you went to answer it and found that the OP had edited their post, updating it a few times even before you were able to answer their first question. Or say they just put in some regex and said "It doesn't work", while you're asking "What is your input? What did you originally want it to do?" Sorry to say, but this is a similar position you are placing others in with threads like these.
If you want a better answer, then post the following code:
Post the code that "loads" the "files" (no, your node insert code obviously isn't it)
And also post the code that's sorting them. That wasn't in the snippet you posted either.
I really didn't want to post my entire code only to have to worry about someone that seems to like to .... about everything I do in there now.

I switched it back to the node I was using beforehand, which loads first come first served, and I loaded 1,135,597 files in a matter of seconds. So now I'm changing my plan of attack to looking into a sort algorithm that can quickly sort the list alphabetically, just to make it a feature/option after everything is loaded.
Because it seems to me that the information is already in memory, so it should be faster afterwards?
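For what it's worth, sorting the list after loading is an O(n log n) job with merge sort, which is the usual choice for linked lists since it needs no random access. Here is a minimal sketch; the node layout is an assumption (a `filename` field is implied by the insert snippet quoted elsewhere in this thread, but the real struct isn't shown):

```c
#include <string.h>

struct node {
    char *filename;
    struct node *next;
};

/* Merge two already-sorted lists into one sorted list. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node head = {0}, *tail = &head;

    while (a && b) {
        if (strcmp(a->filename, b->filename) <= 0) {
            tail->next = a;
            a = a->next;
        } else {
            tail->next = b;
            b = b->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Top-down merge sort: split at the midpoint (slow/fast pointers),
 * sort each half, merge. O(n log n) comparisons total. */
struct node *merge_sort(struct node *list)
{
    if (!list || !list->next)
        return list;

    struct node *slow = list, *fast = list->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    struct node *right = slow->next;
    slow->next = NULL;          /* cut the list in half */

    return merge(merge_sort(list), merge_sort(right));
}
```

Call `filelist = merge_sort(filelist);` once after loading; for a million nodes that is around 20 million comparisons rather than the ~10^12 an insertion-sorted list costs.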
Keep in mind this is my very first time dealing with linked lists, and I am playing quick study with the mindset of only wanting the necessary information to get this job done. I am not going for a job at this, nor am I going for some Nobel prize over this. Truth be told,
the knowledge I gain on this will only be used for this one thing I am updating; that is as far as I can see it going. After it is finished, it goes into that bin in the back of one's mind.
If you still want to look at the code that does the work to get the files, here you go.
Code:
// Note: the function was posted with one parameter but is called
// recursively with two; the `level` parameter is restored here to
// match those calls and the `level ==` tests in the body.
void add_file_to_filelist_recursively(char *origpath, int level)
{
	struct stat st;
	char *path;
	int recursive = 1;

	if (!origpath)
		return;
	path = strdup(origpath);
	if (level == FILELIST_FIRST) {
		// First time through, sort out pathname
		int len = strlen(path);
		if (len > 0 && path[len - 1] == '/')
			path[len - 1] = '\0';
		char *newpath = get_absolute_path(path);
		free(path);
		path = newpath;
	}
	errno = 0;
	if (stat(path, &st)) {
		free(path);
		return;
	}
	if (S_ISDIR(st.st_mode) && (level != FILELIST_LAST)) {
		struct dirent **de;
		DIR *dir;
		int cnt, n;

		if ((dir = opendir(path)) == NULL) {
			printf("couldn't open directory %s:", path);
			free(path);
			return;
		}
		n = scandir(path, &de, NULL, NULL);
		if (n < 0) {
			switch (errno) {
			case ENOMEM:
				printf("Insufficient memory to scan directory %s:", path);
				break;
			default:
				printf("Failed to scan directory %s:", path);
			}
		} else {
			for (cnt = 0; cnt < n; cnt++) {
				if (strcmp(de[cnt]->d_name, ".")
				    && strcmp(de[cnt]->d_name, "..")) {
					char *newfile;
					// function that adds absolute path to filename.
					newfile = strjoin("", path, "/", de[cnt]->d_name, NULL);
					// gets recursive always
					if (recursive)
						add_file_to_filelist_recursively(newfile, FILELIST_CONTINUE);
					else
						add_file_to_filelist_recursively(newfile, FILELIST_LAST);
					free(newfile);
				}
				free(de[cnt]);
			}
			free(de);
		}
		closedir(dir);
	} else if (S_ISREG(st.st_mode)) {
		fcount++;
		if (opts.printlistintake == 1)
			printf("File: %s\n", path);
		// sorted list
		// insertNode(&filelist, path, fcount);
		// first come first serve
		add_2_front(&filelist, path, fcount);
	}
	free(path);
}
What that does is recursively go through the filesystem, grab the size and path for every file, and then group them by size. My program is very fast, despite being written in Python, and it does a lot of the same stuff that I see in your new code.
I'd have to say it is when it is comparing the file names in that node function -- that is all I can see why -- having to take every one of them, look at them, and set them in order.
I do not have any experience with strcmp beyond comparing 2 strings, and this is over 1 million files. I am sure that is going to take some time regardless?
Aah, okay. I think you're right that that's what's slowing it down. I can also see why now.
Code:
while (*head && strcmp((*head)->filename, filename) < 0)
	head = &(*head)->next;
You're calling this for every file, right?
When you insert the first file, you do no strcmps.
When you insert the second file, you do one strcmp.
When you insert the third file, you might be doing two strcmps.
So for n files, you might be doing as many as 1 + 2 + 3 + ... + n - 1 strcmp operations.
Do you see how that might be modeled mathematically? It's been a long time since I've actually done this, so I'll just link to this explanation:
For n files, you're doing on the order of n-squared strcmps (the sum 1 + 2 + ... + (n - 1) works out to n(n - 1)/2).
That means that for a million files, you're doing on the order of half a trillion strcmps.
And that's why you should switch from a linked list to a binary tree. It would reduce the number of strcmp operations from n-squared to n times log n. Which is MUCH fewer strcmps.
EDIT: to be clear, using a binary tree will make your program dramatically faster: for a million files, n-squared versus n log n works out to tens of thousands of times fewer comparisons. This is especially true if you go all out and implement something like a red-black tree, which stays balanced even when the input arrives already sorted.
Some comments about that code, strcmp(), and strjoin().
Yes, these functions carry call overhead.
There is no need to call a library function when you know the exact, very simple placement and pattern you are matching.
Comparing a highly specific index offset against '.' (0x2E) is far faster than calling a library function.
Thus something like (example):
Code:
#define PERIOD 0x2e
if ((str[cnt] == PERIOD) && (str[cnt+1] == PERIOD) && (str[cnt+2] == 0x00))
This will serve, and be far more rapid than strcmp() as it appears to be used in that code.
Further, I do not like strcmp() when the strings can be "whatever", such as filenames with varied patterns. Why? You can encounter an alias -- not a bash alias, but a false detection. And you will, because as experienced programmers know, if there is a remote possibility that your code could hit that one improbable condition ... IT WILL! (Strictly speaking, strcmp() against "." only matches the exact string ".", since it compares through the terminating NUL; the prefix false-match risk applies to strncmp()-style length-limited comparisons.)
Alternatives to strjoin() might be any of memmove(), memcpy(), strcat(), or byte by byte copy directly inline using a local index variable.
I honestly try as hard as possible to do zero copy code for when data is being pushed around a system.
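As a concrete illustration of those alternatives, the `path + "/" + name` join in the posted code can be done with one allocation and two memcpy() calls. The exact behavior of the poster's strjoin() isn't shown, so this `join_path` helper is an assumption, not a drop-in replacement:

```c
#include <stdlib.h>
#include <string.h>

/* Join a directory path and a filename with a single '/',
 * using one malloc and no repeated scanning of the result
 * (strcat would re-walk the destination on every append). */
char *join_path(const char *dir, const char *name)
{
    size_t dlen = strlen(dir), nlen = strlen(name);
    char *out = malloc(dlen + 1 + nlen + 1);

    if (!out)
        return NULL;
    memcpy(out, dir, dlen);
    out[dlen] = '/';
    memcpy(out + dlen + 1, name, nlen + 1);  /* +1 copies the '\0' */
    return out;
}
```

The caller frees the result, exactly like the `free(newfile)` in the posted loop.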