LinuxQuestions.org
Visit Jeremy's Blog.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 12-06-2012, 10:47 AM   #1
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
Optimize simple C code


I want to count the number of each of the 256 possible bytes in a file.

This is the fastest code I have produced so far:
Code:
// output character statistics
#include <stdio.h>
#include <math.h>

#define NUMCHARMAX 256

int main (void)
{
	// ! init
	unsigned char cluster[BUFSIZ];
	unsigned int i;
	
	// init
	unsigned long count[NUMCHARMAX];
	
	for (i=0; i < NUMCHARMAX; i++)
	{
		count[i] = 0;
	}
	
	// input loop
	while (fread(cluster, BUFSIZ, 1, stdin))
	{
		for (i=0; i < BUFSIZ; i++)
		{
			count[cluster[i]]++;
		}
	}
	
	// total calc
	for (i=0; i < NUMCHARMAX; i++)
	{
		printf("%u %lu\n", i, count[i]);
	}

	return 0;
}
I would like to keep the code portable, but using MMX or SSE* is ok if that would help.

Thank for any suggestions on how to optimize this further.
 
Old 12-06-2012, 11:12 AM   #2
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244Reputation: 244Reputation: 244
What test cases are you using? For instance does your total counted always match the length of the input file?

I wouldn't use fread() for this - just read() or mmap().
 
Old 12-06-2012, 11:50 AM   #3
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
I don't think you can accomplish anything optimizing the CPU speed of the actual operation. The operation, even including the full loop overhead if the compiler doesn't optimize that for you, is trivial compared to the memory accesses.

If BUFSIZ is not incredibly large then you shouldn't care whether you optimized.

If BUFSIZ is incredibly large then allocating it on the stack is questionable and the run time will be dominated by the cache misses (mainly cache misses that occur inside your call to fread).

The small potential for improvement here, assuming BUFSIZ is incredibly large, is to break into an outer loop over a decently chosen chunk size (amortize the overhead of fread with big chunks, but make them small enough that the cache misses that need to occur in fread don't occur a second time when you read from the buffer).

I don't know about mmap of stdin. For an ordinary file, mmap might reduce the cache misses and some other processing time compared to fread. Plus it causes the kernel to effectively manage that chunk size for you, which is simpler and maybe better than doing so yourself.

Last edited by johnsfine; 12-06-2012 at 11:54 AM.
 
Old 12-06-2012, 12:24 PM   #4
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928

Original Poster
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
Quote:
Originally Posted by linosaurusroot View Post
What test cases are you using? For instance does your total counted always match the length of the input file?

I wouldn't use fread() for this - just read() or mmap().
The size of the input can be anything. The total should match the length of the file, plus or minus one. The statistics I am doing aren't altered by small amounts of error.

I tried using read, and it has not sped anything up, it's just as fast. I have never used mmap, and from what I've read there usually is no significant improvement because the kernel handles it.

Quote:
Originally Posted by johnsfine View Post
I don't think you can accomplish anything optimizing the CPU speed of the actual operation. The operation, even including the full loop overhead if the compiler doesn't optimize that for you, is trivial compared to the memory accesses.

If BUFSIZ is not incredibly large then you shouldn't care whether you optimized.

If BUFSIZ is incredibly large then allocating it on the stack is questionable and the run time will be dominated by the cache misses (mainly cache misses that occur inside your call to fread).
I have tried using '-O3', but it doesn't improve anything. You might be right that this may be the best that can be done.

BUFSIZ is around 4096.

I have also tried using pthreads, but the code ran slower than this one for some reason. Maybe I just messed up. I don't know how to use SSE or MMX, nor if they would help in this case.
 
Old 12-06-2012, 12:47 PM   #5
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
Quote:
Originally Posted by H_TeXMeX_H View Post
BUFSIZ is around 4096.
I didn't see an outer loop. If the whole task is that size, why optimize?

If the whole task is much larger, I think a chunk size of 4096 is rather low and means the time is dominated by fread overhead.

Quote:
I have also tried using pthreads, but the code ran slower than this one for some reason.
If the cores (or worse yet, hyperthreads) are sharing L2 cache, then multi threading this kind of problem should be expected to make it overall slower.

Quote:
I don't know how to use SSE or MMX, nor if they would help in this case.
I'm still pretty sure cache misses and read overhead are so dominant that the low level cpu instructions of the intended operation are basically irrelevant.
 
1 members found this post helpful.
Old 12-06-2012, 01:15 PM   #6
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928

Original Poster
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
Quote:
Originally Posted by johnsfine View Post
I didn't see an outer loop. If the whole task is that size, why optimize?

If the whole task is much larger, I think a chunk size of 4096 is rather low and means the time is dominated by fread overhead.



If the cores (or worse yet, hyperthreads) are sharing L2 cache, then multi threading this kind of problem should be expected to make it overall slower.



I'm still pretty sure cache misses and read overhead are so dominant that the low level cpu instructions of the intended operation are basically irrelevant.
I just tried 2 and 4 times the standard BUFSIZ and it doesn't help. Anything lower than BUFSIZ will be slower, so I think it is the right size.

There are 4 cores with each pair sharing 3 MB for a total of 6 MB for all 4 cores.

I'll leave the thread up a bit, and if nobody has any magical optimizations, I think this is as optimized as it is going to get.

Thanks for the input.
 
Old 06-05-2013, 01:09 PM   #7
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928

Original Poster
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
I have one more question on this code. It works well, but if the input is not divisible by BUFSIZ, then some of the input is discarded. It isn't too much of an issue, but I'd like to know if can get the rest of the input without too much performance hit. Performance is more important than getting a few more bytes of input. This is read from stdin, so I can't calculate sizes beforehand and do it that way.
 
Old 06-05-2013, 02:06 PM   #8
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197Reputation: 1197
Quote:
Originally Posted by H_TeXMeX_H View Post
if the input is not divisible by BUFSIZ, then some of the input is discarded.
Have you tried fread(cluster, 1, BUFSIZ, stdin) instead of fread(cluster, BUFSIZ, 1, stdin) ?

I would not expect that to affect performance much, but you won't know until you try.

With that change, it is trivial to know how much you read and process the correct amount.

Code:
	unsigned r;
	while ( 0!= (r=fread(cluster, 1, BUFSIZ, stdin)) )
	{
		for (i=0; i < r; i++)
		{
			count[cluster[i]]++;
		}
	}

Last edited by johnsfine; 06-05-2013 at 02:10 PM.
 
1 members found this post helpful.
Old 06-06-2013, 02:17 AM   #9
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928

Original Poster
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
Thanks, it works !

I tried to do something with the return value of fread, but I didn't think to put it in the for loop.

Alright, it is solved, and performance is the same.
 
Old 06-06-2013, 07:32 AM   #10
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229Reputation: 229Reputation: 229
By the way, on unrelated note, instead of defining “NUMCHARMAX” use “UCHAR_MAX” from “limits.h” header file.

Back on topic, if you are really interested, you could play with “posix_fadvise”, for instance:
Code:
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE 4096

int main(void) {
	unsigned long count[UCHAR_MAX + 1];
	unsigned char *buf, *ch;
	ssize_t i, ret;
	off_t pos = 0;

	for (i = 0; i <= UCHAR_MAX; ++i) {
		count[i] = 0;
	}

	buf = malloc(BUFSIZE);
	for (;;) {
		ret = read(0, buf, BUFSIZE);
		if (ret > 0) {
			pos += ret;
			posix_fadvise(0, pos, BUFSIZE,
			              POSIX_FADV_SEQUENTIAL |
			              POSIX_FADV_NOREUSE |
			              POSIX_FADV_WILLNEED);
			for (ch = buf; ret; --ret, ++ch) {
				++count[*ch];
			}
		} else if (!ret) {
			break;
		} else if (errno != EINTR) {
			fprintf(stderr, "%s\n", strerror(errno));
			return 1;
		}
	}
	free(buf);

	for (i = 0; i <= UCHAR_MAX; ++i) {
		printf("%u %lu\n", (unsigned)i, count[i]);
	}

	return 0;
}
But with such a simple code it's unlikely that you'll notice any improvement.
 
1 members found this post helpful.
Old 06-06-2013, 07:35 AM   #11
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928

Original Poster
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
@ mina86

I tried that code, but it is slower. You might be right that I should use UCHAR_MAX, but it is more convenient to use 256 instead of 255 + 1 in some calculations. I suppose I could define a constant based on UCHAR_MAX.

Last edited by H_TeXMeX_H; 06-06-2013 at 07:42 AM.
 
Old 06-06-2013, 11:41 AM   #12
bloody
Member
 
Registered: Feb 2013
Location: Berlin
Distribution: Gentoo, Debian
Posts: 172

Rep: Reputation: 25
Use at least 64K buffer size, the bigger the better. Up to 4MB can sometimes help.
 
Old 06-06-2013, 03:56 PM   #13
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 447Reputation: 447Reputation: 447Reputation: 447Reputation: 447
Hi

I think It should be slightly faster to use fadvise, mmap and no buffer at all.

Code:
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
  int i, fd;
  unsigned long count[UCHAR_MAX + 1];
  memset(count,0,sizeof(long)*(UCHAR_MAX+1));
  if (argc != 2) {
    printf("No filename.\n");
    return 1;
  }
  const char *filename = argv[1];
  fd = open(filename,O_RDONLY);
  struct stat s;
  fstat (fd, &s);
  size_t size = s.st_size;
  posix_fadvise(fd, 0, size, POSIX_FADV_SEQUENTIAL | POSIX_FADV_NOREUSE | POSIX_FADV_WILLNEED);
  const unsigned char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (data == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  for ( i=0; i<size; i++ ) {
    unsigned char c = data[i];
    count[c]++;
  }
  for (i = 0; i <= UCHAR_MAX; ++i) {
    printf("%u %lu\n", (unsigned)i, count[i]);
  }
  return 0;
}
 
1 members found this post helpful.
Old 06-07-2013, 02:35 AM   #14
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928

Original Poster
Blog Entries: 2

Rep: Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301Reputation: 1301
It's still slower, 0.159s versus 0.131s for 100 MB file.

I have already tried large buffers, but the BUFSIZ defined in stdin.h seems to be optimal. Larger doesn't help, smaller is slower.
 
Old 06-07-2013, 09:01 AM   #15
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229Reputation: 229Reputation: 229
Quote:
Originally Posted by H_TeXMeX_H View Post
I tried that code, but it is slower.
Actually, I haven't said it would be faster. The test application is too simple to optimise. Linux most likely does a good job at optimising it anyway. But if you have a bigger problem, using “posix_fadvise” might be useful.

Quote:
Originally Posted by H_TeXMeX_H View Post
You might be right that I should use UCHAR_MAX, but it is more convenient to use 256 instead of 255 + 1 in some calculations.
You should use UCHAR_MAX because you avoid using literals in your code which are not defined by the standard by the way. Not using literal values is almost always a good thing.

Quote:
Originally Posted by Guttorm View Post
Code:
  const unsigned char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
This will fail if the file is very big, say 1T.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Need help with a simple If/Else code in C Exsiss Programming 4 10-02-2007 02:37 AM
optimize socket networking code powah Linux - Networking 0 09-26-2006 04:47 PM
Interpret this simple C code Chase_G Programming 4 04-29-2005 09:07 AM
to become giddy in simple code Hamid Moradmand Programming 2 05-18-2004 06:23 PM
How to optimize source code installation? Waldi Slackware 4 08-05-2003 12:47 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 06:14 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration