Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I don't think you can accomplish anything optimizing the CPU speed of the actual operation. The operation, even including the full loop overhead if the compiler doesn't optimize that for you, is trivial compared to the memory accesses.
If BUFSIZ is not incredibly large then you shouldn't care whether you optimized.
If BUFSIZ is incredibly large then allocating it on the stack is questionable and the run time will be dominated by the cache misses (mainly cache misses that occur inside your call to fread).
The small potential for improvement here, assuming BUFSIZ is incredibly large, is to break the work into an outer loop over a well-chosen chunk size: chunks big enough to amortize the overhead of fread, but small enough that the cache misses that must occur inside fread don't occur a second time when you read from the buffer.
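A rough sketch of that outer-loop idea (the chunk size here is an assumption; tune it to your own cache sizes):

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)   /* assumed chunk size; tune to your L2 cache */

/* Tally byte frequencies from stream f into count[256];
 * returns the total number of bytes seen. */
unsigned long tally_chunked(FILE *f, unsigned long count[256])
{
    unsigned char *buf = malloc(CHUNK);   /* heap, not stack */
    unsigned long total = 0;
    size_t n;

    if (!buf)
        return 0;
    /* outer loop: one fread per chunk amortizes the call overhead */
    while ((n = fread(buf, 1, CHUNK, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            count[buf[i]]++;
        total += n;
    }
    free(buf);
    return total;
}
```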
I don't know about mmap of stdin. For an ordinary file, mmap might reduce the cache misses and some other processing time compared to fread. Plus it causes the kernel to effectively manage that chunk size for you, which is simpler and maybe better than doing so yourself.
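For an ordinary file, the mmap variant might look roughly like this (a sketch, not a drop-in replacement — the error handling is minimal, and as noted it doesn't apply when stdin comes from a pipe):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Tally byte frequencies of an ordinary file via mmap; returns the
 * total bytes counted, or 0 on error or an empty file. Pipes can't
 * be mmap'ed, so this doesn't help when input arrives on a pipeline. */
unsigned long tally_mmap(const char *path, unsigned long count[256])
{
    unsigned long total = 0;
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return 0;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            /* the kernel pages the file in as we walk it */
            for (off_t i = 0; i < st.st_size; i++)
                count[p[i]]++;
            total = (unsigned long)st.st_size;
            munmap(p, (size_t)st.st_size);
        }
    }
    close(fd);
    return total;
}
```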
What test cases are you using? For instance does your total counted always match the length of the input file?
I wouldn't use fread() for this - just read() or mmap().
The size of the input can be anything. The total should match the length of the file, plus or minus one. The statistics I am doing aren't altered by small amounts of error.
I tried using read(), but it didn't speed anything up; it's just as fast. I have never used mmap(), and from what I've read there is usually no significant improvement, because the kernel handles the caching anyway.
Quote:
Originally Posted by johnsfine
I don't think you can accomplish anything optimizing the CPU speed of the actual operation. The operation, even including the full loop overhead if the compiler doesn't optimize that for you, is trivial compared to the memory accesses.
If BUFSIZ is not incredibly large then you shouldn't care whether you optimized.
If BUFSIZ is incredibly large then allocating it on the stack is questionable and the run time will be dominated by the cache misses (mainly cache misses that occur inside your call to fread).
I have tried using '-O3', but it doesn't improve anything. You might be right that this is the best that can be done.
BUFSIZ is around 4096.
I have also tried using pthreads, but the code ran slower than this one for some reason. Maybe I just messed up. I don't know how to use SSE or MMX, nor if they would help in this case.
I didn't see an outer loop. If the whole task is that size, why optimize?
If the whole task is much larger, I think a chunk size of 4096 is rather low and means the time is dominated by fread overhead.
Quote:
I have also tried using pthreads, but the code ran slower than this one for some reason.
If the cores (or worse yet, hyperthreads) are sharing L2 cache, then multi threading this kind of problem should be expected to make it overall slower.
Quote:
I don't know how to use SSE or MMX, nor if they would help in this case.
I'm still pretty sure cache misses and read overhead are so dominant that the low level cpu instructions of the intended operation are basically irrelevant.
Quote:
Originally Posted by johnsfine
I didn't see an outer loop. If the whole task is that size, why optimize?
If the whole task is much larger, I think a chunk size of 4096 is rather low and means the time is dominated by fread overhead.
If the cores (or worse yet, hyperthreads) are sharing L2 cache, then multi threading this kind of problem should be expected to make it overall slower.
I'm still pretty sure cache misses and read overhead are so dominant that the low level cpu instructions of the intended operation are basically irrelevant.
I just tried 2 and 4 times the standard BUFSIZ, and it doesn't help. Anything lower than BUFSIZ is slower, so I think it is the right size.
There are 4 cores, with each pair sharing 3 MB of cache, for a total of 6 MB across all 4 cores.
I'll leave the thread up a bit, and if nobody has any magical optimizations, I think this is as optimized as it is going to get.
I have one more question on this code. It works well, but if the input size is not divisible by BUFSIZ, some of the input is discarded. It isn't too much of an issue, but I'd like to know if I can get the rest of the input without too much of a performance hit. Performance is more important than getting a few more bytes of input. The input is read from stdin, so I can't calculate sizes beforehand and handle it that way.
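The exact loop isn't shown here, but if it only processes full BUFSIZ chunks, the usual fix is essentially free: fread returns how many items it actually read, so the short final chunk can go through the same inner loop with no extra cost. A minimal sketch:

```c
#include <stdio.h>

/* Process whatever fread actually returned, so the final partial
 * chunk isn't discarded; returns the total number of bytes counted. */
unsigned long tally_stream(FILE *f, unsigned long count[256])
{
    unsigned char buf[BUFSIZ];
    unsigned long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {  /* n may be < BUFSIZ */
        for (size_t i = 0; i < n; i++)
            count[buf[i]]++;
        total += n;
    }
    return total;
}
```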
I tried that code, but it is slower. You might be right that I should use UCHAR_MAX, but it is more convenient to use 256 than 255 + 1 in some calculations. I suppose I could define a constant based on UCHAR_MAX.
Last edited by H_TeXMeX_H; 06-06-2013 at 07:42 AM.
Actually, I didn't say it would be faster. The test application is too simple to optimise, and Linux most likely does a good job of it anyway. But if you have a bigger problem, posix_fadvise might be useful.
Quote:
Originally Posted by H_TeXMeX_H
You might be right that I should use UCHAR_MAX, but it is more convenient to use 256 instead of 255 + 1 in some calculations.
You should use UCHAR_MAX because it avoids hard-coding literal values that the standard does not define for you. Avoiding literal values is almost always a good thing.
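One way to keep the convenient single constant while still deriving it from the standard header (the name is arbitrary):

```c
#include <limits.h>

/* 256 wherever CHAR_BIT is 8, but derived rather than hard-coded */
enum { BYTE_VALUES = UCHAR_MAX + 1 };

unsigned long count[BYTE_VALUES];
```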