how can i speed up memcpy?

Thinking · 10-13-2005, 10:00 AM

hiho@ll

i have a simple test environment

i have a server
i have a client

/* both are C programs */

the client does write(); in a loop
and the server does read(); in a loop

the client does some time measurement so i know how fast i can be

what i want:
the maximum possible throughput that still fits my application's needs

i'm not sure if my measurements are correct
but what i know is
if i do ONLY write(); AND read(); in a loop on both sides
i get a value of 2002

on the server side
if i do a memcpy(); during the loop
i get a value of 600

this means if i copy a buffer (the incoming one to a temp buffer) the speed drops!

i know i'll never get a value of 2000 once my whole app sits between write(); and read();

BUT i want at least 1000!

i thought about writing
memcpy();
in assembler

WHY?
because gcc does not always use the best assembler implementation of memcpy!!!
on some cpus
a loop of mov instructions is faster than
using the rep prefix!!!

so my questions:

1. because i have only very, very basic knowledge of assembler, i have no idea how the gcc assembler syntax works!
the best would be if somebody could write those few lines in assembler for me
OR
maybe i can get some easy-to-understand info on assembler programming in gcc
so i can try the two possible memcpy assembler implementations (rep or mov)

2. maybe some weird GCC command line optimization options would help!!
but i have no idea which options do what
so i don't know which to use
and i don't know which options are available

thx@ll

btw. if i use memcpy and my whole algorithm is between write(); and read(); i have a throughput of 400
this is fast
really fast

so if anybody posts
DON'T try it!
i can only say
but i want to try it
that's the only way i learn things about linux, C and maybe assembler
so please help and don't say
let it be

aluser · 10-13-2005, 12:21 PM

Quote:

maybe some weird GCC command line optimization options would help!!

I think -O3 will turn on everything you'd want except maybe -funroll-loops. There's a possibility that loop unrolling actually slows down your code, so it's imperative to test it : )

Quote:

maybe i get some little to understand info on assembler programming in gcc

A google found this: http://www.ibiblio.org/gferg/ldp/GCC...bly-HOWTO.html . There's some (more cryptic) information in the gcc docs: http://gcc.gnu.org/onlinedocs/gcc-3....ended-Asm.html and http://gcc.gnu.org/onlinedocs/gcc-3....nstraints.html

I can help more with that if you want, probably
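As a starting point for question 1, here is a minimal sketch of the two variants in GCC extended asm. It assumes GCC (or a compatible compiler) on x86 and falls back to a plain C loop elsewhere. The names copy_rep and copy_mov are made up for this example, and neither is claimed to be faster than the library memcpy(); that has to be measured.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_X86_GNU_ASM 1
#else
#define HAVE_X86_GNU_ASM 0
#endif

/* Hypothetical helper: copy n bytes with the x86 "rep movsb" instruction. */
void *copy_rep(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if HAVE_X86_GNU_ASM
    /* "+D", "+S" and "+c" pin the operands to the di, si and cx registers,
       which rep movsb both reads and updates. */
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
#else
    /* Portable fallback: plain byte loop. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}

/* Hypothetical helper: copy n bytes with an explicit loop of mov instructions. */
void *copy_mov(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if HAVE_X86_GNU_ASM
    unsigned char tmp;   /* scratch byte register chosen by the compiler */
    if (n != 0) {
        __asm__ volatile (
            "1:\n\t"
            "movb (%1), %3\n\t"   /* load one byte from the source   */
            "movb %3, (%0)\n\t"   /* store it to the destination     */
            "inc %0\n\t"
            "inc %1\n\t"
            "dec %2\n\t"
            "jnz 1b"              /* repeat until the count hits 0   */
            : "+r" (d), "+r" (s), "+r" (n), "=&q" (tmp)
            :
            : "cc", "memory");
    }
#else
    /* Portable fallback: plain byte loop. */
    while (n--)
        *d++ = *s++;
#endif
    return dst;
}
```

Both functions take the same arguments as memcpy(), so they can be swapped into the server loop one at a time and timed against it. A byte-wise mov loop like this is only the simplest form; copying a word per iteration or unrolling the loop are the obvious next things to try.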

Are you positive that you need as many memcpy()s as you're using? Perhaps the problem could be solved by a more complicated buffering scheme. What is your server doing?
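One possible shape for such a buffering scheme, purely as a sketch: keep two buffers and swap which one read() fills, so the data that just arrived can be processed in place instead of being copied to a temp buffer. The struct, the function names and BUF_SIZE are all invented for this example, and whether it fits depends on what the server actually does with the data.

```c
#include <stddef.h>

#define BUF_SIZE 4096   /* hypothetical buffer size */

/* Two buffers: one is being filled by read(), the other is being processed. */
struct double_buffer {
    char storage[2][BUF_SIZE];
    int  fill_index;    /* index of the buffer read() should fill next */
};

/* Buffer to pass to read(). */
char *db_fill_buffer(struct double_buffer *db)
{
    return db->storage[db->fill_index];
}

/* Call after read() returns: hands back the buffer that was just filled
   and switches filling to the other one. No bytes are copied. */
char *db_swap(struct double_buffer *db)
{
    char *ready = db->storage[db->fill_index];
    db->fill_index ^= 1;
    return ready;
}
```

The server loop would then call read(fd, db_fill_buffer(&db), BUF_SIZE), followed by db_swap(&db) to get the pointer to work on, where the memcpy() into a temp buffer used to be.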

aluser · 10-13-2005, 12:27 PM

also, it seems intuitive that, if you always memcpy in multiples of the word size, you could beat gcc's implementation. Is that the case? You might even try unrolling the loop around your mov instructions.

However, if gcc is using a builtin memcpy and you call it with constant lengths, maybe it already does these things (?)

itsme86 · 10-13-2005, 12:32 PM

I believe glibc's implementation of memcpy() already copies in word-sized chunks by means of casting.

EDIT: Yeah, it first copies a few bytes to align the destination, then whole pages if it can, then words, then the leftover bytes. This is in the file glibc-2.3.5/sysdeps/generic/memcpy.c

Code:

void *
memcpy (dstpp, srcpp, len)
     void *dstpp;
     const void *srcpp;
     size_t len;
{
  unsigned long int dstp = (long int) dstpp;
  unsigned long int srcp = (long int) srcpp;

  /* Copy from the beginning to the end.  */

  /* If there not too few bytes to copy, use word copy.  */
  if (len >= OP_T_THRES)
    {
      /* Copy just a few bytes to make DSTP aligned.  */
      len -= (-dstp) % OPSIZ;
      BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

      /* Copy whole pages from SRCP to DSTP by virtual address manipulation,
         as much as possible.  */

      PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

      /* Copy from SRCP to DSTP taking advantage of the known alignment of
         DSTP.  Number of bytes remaining is put in the third argument,
         i.e. in LEN.  This number may vary from machine to machine.  */

      WORD_COPY_FWD (dstp, srcp, len, len);

      /* Fall out and copy the tail.  */
    }

  /* There are just a few bytes to copy.  Use byte memory operations.  */
  BYTE_COPY_FWD (dstp, srcp, len);

  return dstpp;
}

aluser · 10-13-2005, 12:38 PM

Quote:

I believe glibc's implementation of memcpy() already copies in word-sized chunks by means of casting.

Sure, but somehow this has to work:

Code:

char a[4];
a[3] = '\0';
memcpy(a, "abcQ", 3);            /* must copy exactly 3 bytes, not a whole word */
assert(strcmp(a, "abc") == 0);   /* so a[3] is still '\0' */

So memcpy() is doing something special for the case where the size isn't a multiple of the word size. If the size argument isn't constant, then this has to be done at run time, somehow.

itsme86 · 10-13-2005, 12:47 PM

Sorry, I pasted glibc's implementation after your post. I guess if you knew in advance how many bytes you were copying you could avoid memcpy()'s branch logic, but that adds a maintenance headache. What if that number of bytes changes in the future? You'd have to remember to revisit your copying algorithm too, all for the sake of saving the tiniest bit of time.

aluser · 10-13-2005, 12:59 PM

The whole post sounds like a somewhat silly micro-optimization to me too, but it could be academically interesting : )

To save on the maintenance headache, you could call assert() inside your memcpy replacement (this can be compiled away with NDEBUG) or make your version take a number of words as the size argument instead of a number of bytes. Call it wordcpy or something.
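A rough sketch of what such a wordcpy could look like. The name, the choice of unsigned long as the "word" type and the alignment checks are assumptions for illustration, not anything taken from glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wordcpy: copies nwords machine words (not bytes).
   Both pointers must be word-aligned; the asserts check that in debug
   builds and disappear when compiled with -DNDEBUG. */
void *wordcpy(void *dst, const void *src, size_t nwords)
{
    assert((uintptr_t)dst % sizeof(unsigned long) == 0);
    assert((uintptr_t)src % sizeof(unsigned long) == 0);

    unsigned long *d = dst;
    const unsigned long *s = src;

    /* No alignment fix-up and no byte tail: the caller guarantees both. */
    while (nwords--)
        *d++ = *s++;

    return dst;
}
```

Because the size is a word count, there is no tail case left to handle. To stay clear of C's aliasing rules the buffers should really be declared as arrays of unsigned long, for example unsigned long buf[1024], and passed with a count such as sizeof buf / sizeof buf[0].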

jim mcnamara · 10-13-2005, 01:01 PM

Did you try:
gcc -g -pg myprog.c -o myprog

1. run your code
2. run gprof myprog

and look at the results of profiling? (Note that gprof needs -pg; plain -p produces output for prof.) Just because elapsed time is longer does not mean that your memcpy() call is necessarily the problem.

Thinking · 10-14-2005, 02:54 AM

hiho@ll

does anybody know the bcopy function??
what the hell is the difference between bcopy and memcpy????

reading the man page of gcc while searching some compiler options i noticed that there is a function bcopy
i didn't know this function exists!

the man page of bcopy says it's deprecated and i should use memcpy

i just gave it a try and voila

the speed increases to about 900!! THIS IS EXTREMELY GOOD

well i also used some gcc flags
but i used those flags with memcpy and bcopy
and it didn't help with memcpy
but bcopy is really fast!

so, what's the difference?

btw: i tried the whole stuff with and without gcc flags
the flags i used are for i686 architecture (i just wanted to try)
using memcpy i got around 450
using bcopy i got around 900
WITHOUT the flags
memcpy: around 450
bcopy: around 1100!!!

how did i measure:
i used the gettimeofday function to know how fast i can send data between 2 progs
using only a simple benchmark tool
i got a value around 1300
there is nothing between the 2 progs
just a server and a client
the server writes and the client reads
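For reference, a measurement along these lines can be sketched as below. The helper names are invented here, and the unit (MiB per second) is an assumption, since the numbers in this thread carry no unit.

```c
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Seconds between two gettimeofday() samples. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}

/* Hypothetical helper: time `iterations` memcpy() calls of `len` bytes and
   return the rate in MiB per second (0.0 if the run was too short to time). */
double memcpy_mib_per_second(void *dst, const void *src,
                             size_t len, long iterations)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iterations; i++)
        memcpy(dst, src, len);
    gettimeofday(&end, NULL);

    double secs = elapsed_seconds(&start, &end);
    if (secs <= 0.0)
        return 0.0;
    return ((double)len * (double)iterations) / (1024.0 * 1024.0) / secs;
}
```

Two caveats: gettimeofday() measures wall-clock time, so other processes on the machine affect the result, and an optimizing compiler may drop repeated copies into a buffer nobody reads afterwards, so it is worth checking that the loop really runs.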

then i tried the same measurement with my own architecture in between
and i got 1100 (well, not my whole architecture is between the 2 endpoints yet, so the complete architecture will reduce that number a bit)

so i think i'm at the end of testing ;-)

but i want to know why bcopy is so much faster?

thx@ll

jlliagre · 10-14-2005, 07:53 AM

Quote:

anybody knows the bcopy function??

yes

Quote:

what the hell is the difference between bcopy and memcpy????

Their name, the order of their arguments, and possibly their implementation ...
bcopy comes from BSD code, while memcpy comes from System V and is the one standardized by ISO C.
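To make the argument order concrete, here is a small sketch with two wrappers that do the same copy. The wrapper names are made up for this example. One behavioural difference that is documented: bcopy() handles overlapping buffers correctly, like memmove(), while memcpy() requires that source and destination do not overlap.

```c
#define _DEFAULT_SOURCE   /* glibc feature-test macro, so bcopy() is declared */
#include <stddef.h>
#include <string.h>       /* memcpy */
#include <strings.h>      /* bcopy (legacy BSD interface) */

/* memcpy: destination first, then source. */
void copy_with_memcpy(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
}

/* bcopy: source first, then destination. */
void copy_with_bcopy(char *dst, const char *src, size_t n)
{
    bcopy(src, dst, n);
}
```

Mixing up the two orders compiles whenever both pointers are non-const and silently copies in the wrong direction, which is worth ruling out before comparing timings.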

Quote:

but i want to know why bcopy is so much faster?

I would say it's more optimized, at least on your system.
There may be platforms where they behave similarly, or where memcpy runs faster ...