how can i speed up memcpy?
i have a simple test environment
i have a server
i have a client
/* both are C programs */
the client does write(); in a loop
and the server does read(); in a loop
the client does some time measurement so i know how fast i can be
what i want:
the maximum throughput which is possible and which fits my application's needs
i'm not sure if my measurements are correct
but what i know is
if i do ONLY write(); AND read(); in a loop on both sides
i get a value of 2002
on the server side
if i do a memcpy(); during the loop
i get a value of 600
this means if i copy a buffer (the incoming one to a temp buffer) the speed drops!
i know i never get a value of 2000 if my whole app is between write(); and read();
BUT i want at minimum 1000!
i thought about writing my own memcpy in assembler
because gcc does not always use the best implementation of assembler memcpy!!!
on some cpu's
a loop with mov commands is faster than
using the rep command!!!
so my questions:
1. because i have very very very basic knowledge of assembler itself i have no idea how the gcc assembler syntax works!
the best would be that somebody can write those few lines in assembler for me
maybe i can get some easy to understand info on assembler programming in gcc
so i can try the two possible memcpy assembler implementations (rep or mov)
2. maybe some weird GCC command line optimization options would help!!
but i have no idea which options do what
so i don't know which to use
and i don't know which options are available
btw. if i use memcpy and my whole algorithm is between write(); and read(); i have a throughput of 400
this is fast
so if anybody posts
DON'T try it!
i only have to say
but i want to try it
that's the only way i learn many things about linux, C and maybe assembler
so please help and don't say
let it be :D
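For question 1, here is a minimal sketch of what the "rep" variant could look like in GCC's extended inline assembler syntax. The name copy_rep is made up for illustration, it only applies to x86 (it falls back to memcpy elsewhere), and nothing here says it will beat the library memcpy on any given CPU; it is only meant as a starting point for experiments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from this thread): byte copy using the x86
 * "rep movsb" string instruction through GCC extended inline asm. */
static void *copy_rep(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__i386__) || defined(__x86_64__)
    void *d = dst;
    /* Syntax: asm("instructions" : outputs : inputs : clobbers).
     * "+D", "+S" and "+c" pin the operands to the di, si and cx
     * registers, which rep movsb both reads and updates.
     * "memory" tells gcc that memory changes behind its back. */
    __asm__ __volatile__(
        "rep movsb"
        : "+D"(d), "+S"(src), "+c"(n)
        :
        : "memory");
#else
    /* Assumption: on non-x86 targets just use the library routine. */
    memcpy(dst, src, n);
#endif
    return dst;
}
```

It is called exactly like memcpy: copy_rep(dst, src, len). The "mov" variant does not need assembler at all to try out: an ordinary C copy loop is compiled to mov instructions.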
I can help more with that if you want, probably
Are you positive that you need as many memcpy()s as you're using? Perhaps the problem could be solved by a more complicated buffering scheme. What is your server doing?
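As a rough illustration of such a buffering scheme (an assumption about what might fit, since the server code was not posted): read() into one of two buffers and swap pointers instead of copying the incoming data to a temp buffer. The names double_buf, read_and_swap and BUF_SIZE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096 /* assumed buffer size, not taken from the thread */

struct double_buf {
    char a[BUF_SIZE];
    char b[BUF_SIZE];
    char *incoming; /* the next read() fills this buffer */
    char *work;     /* the application processes this buffer */
};

static void double_buf_init(struct double_buf *db)
{
    assert(db != NULL);
    db->incoming = db->a;
    db->work = db->b;
}

/* Read into the incoming buffer, then swap the two pointers so the
 * fresh data becomes the work buffer. No bytes are copied in user space. */
static ssize_t read_and_swap(int fd, struct double_buf *db)
{
    ssize_t n = read(fd, db->incoming, BUF_SIZE);
    if (n > 0) {
        char *tmp = db->incoming;
        db->incoming = db->work;
        db->work = tmp;
    }
    return n;
}
```

After a successful read_and_swap(), db->work points at the data just received and the other buffer is free for the next read(). Whether this fits depends on how long the server needs to hold on to the old data.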
Also, it seems intuitive that if you always memcpy in multiples of the word size, you could beat gcc's implementation. Is that the case? You might even try unrolling the loop around your mov instructions.
However, if gcc is using a builtin memcpy and you call it with constant lengths, maybe it already does these things (?)
I believe glibc's implementation of memcpy() already copies in word-sized chunks by means of casting.
EDIT: Yeah, it actually does whole pages at a time if it can, then words, then bytes. this is in the file glibc-2.3.5/sysdeps/generic/memcpy.c
Sorry, I pasted glibc's implementation after your post. I guess if you already knew how many bytes you were copying then you could avoid memcpy()'s branch logic, but even if you previously knew how many bytes you were copying it adds a maintenance headache. What if that number of bytes changes in the future? You have to remember to consider your copying algorithm too, all for the sake of saving the tiniest bit of time.
The whole post sounds like a somewhat silly micro-optimization to me too, but it could be academically interesting : )
To save on the maintenance headache, you could call assert() inside your memcpy replacement (this can be compiled away with NDEBUG) or make your version take a number of words as the size argument instead of a number of bytes. Call it wordcpy or something.
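A minimal sketch of that wordcpy idea, assuming both buffers are word-aligned and really hold whole words (the function is hypothetical, and uintptr_t stands in for "machine word"):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wordcpy(): the size argument is a number of words,
 * not a number of bytes, so a partial word can never be requested. */
static void *wordcpy(void *dst, const void *src, size_t nwords)
{
    uintptr_t *d = dst;
    const uintptr_t *s = src;

    /* Alignment checks; they disappear when compiled with -DNDEBUG. */
    assert((uintptr_t)dst % sizeof(uintptr_t) == 0);
    assert((uintptr_t)src % sizeof(uintptr_t) == 0);

    while (nwords--)
        *d++ = *s++; /* one word per iteration */
    return dst;
}
```

Call it with a word count, e.g. wordcpy(dst, src, nbytes / sizeof(uintptr_t)). Whether it is any faster than memcpy has to be measured; the library version already copies in word-sized chunks.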
Did you try:
gcc -g -pg myprog.c -o myprog
1. run your code (this writes gmon.out)
2. run gprof myprog
and look at the results of profiling? Just because elapsed time is longer does not mean that your memcpy() call is necessarily the problem.
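Besides gprof, another way to check this is to time the copy on its own, outside the read()/write() loop. A small sketch using gettimeofday (the helper name time_memcpy and its parameters are made up for illustration, and the empty asm barrier is a GCC-specific trick):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical helper: wall-clock seconds spent on `rounds` calls of
 * memcpy() with n bytes each. */
static double time_memcpy(void *dst, const void *src, size_t n, long rounds)
{
    struct timeval start, end;

    assert(rounds > 0);
    gettimeofday(&start, NULL);
    for (long i = 0; i < rounds; i++) {
        memcpy(dst, src, n);
        /* Compiler barrier so gcc cannot drop the repeated copies. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    gettimeofday(&end, NULL);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}
```

Dividing n * rounds by the returned time gives bytes per second for the copy alone, which can then be compared with the throughput of the whole loop.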
anybody knows the bcopy function??
what the hell is the difference between bcopy and memcpy????
reading the man page of gcc while searching some compiler options i noticed that there is a function bcopy
i didn't know this function exists!
the man page of bcopy says it's deprecated and i should use memcpy
i just gave it a try and voila
the speed increases to about 900!! THIS IS EXTREMELY GOOD
well i also used some gcc flags
but i used those flags with memcpy and bcopy
and it didn't help with memcpy
but bcopy is really fast!
so, what's the difference?
btw: i tried the whole stuff with and without gcc flags
the flags i used are for i686 architecture (i just wanted to try)
using memcpy i got around 450
using bcopy i got around 900
WITHOUT the flags
memcpy: around 450
bcopy: around 1100!!!
how did i measure:
i used the gettimeofday function to know how fast i can send data between 2 progs
using only a simple benchmark tool
i got a value around 1300
there is nothing between the 2 progs
just a server and a client
the server writes and the client reads
then i tried the same measurement using my own architecture
and i got 1100 (well there is not my whole architecture between the 2 endpoints, so the whole architecture will reduce the stuff a bit)
so i think i'm at the end of testing ;-)
but i want to know why bcopy is so much faster?
bcopy comes from BSD code, while memcpy comes from System V and is the more standardized of the two.
There may be platforms where they behave similarly, or where memcpy runs faster ...
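As for the functional difference: bcopy takes the source first and the destination second, and it is specified to handle overlapping buffers, which makes it behave like memmove rather than memcpy. A small sketch (the two helper functions are hypothetical, and _DEFAULT_SOURCE is an assumption about glibc's feature macros). It does not explain the timing difference reported above.

```c
#define _DEFAULT_SOURCE /* assumption: glibc; keeps legacy bcopy() declared */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h> /* bcopy() lives here */

/* Shift the first n bytes of buf one position to the right.
 * Source and destination overlap, so memcpy() would not be allowed. */
static void shift_right_memmove(char *buf, size_t n)
{
    assert(buf != NULL);
    memmove(buf + 1, buf, n); /* destination first, source second */
}

static void shift_right_bcopy(char *buf, size_t n)
{
    assert(buf != NULL);
    bcopy(buf, buf + 1, n); /* source first, destination second */
}
```

Both helpers leave the buffer in the same state, so new code can use memmove (or memcpy when the buffers cannot overlap) wherever old code used bcopy.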