Socket: read/write. Data have their bytes randomly shuffled?

Coralie Saint Ourens · 02-17-2012, 02:43 AM

Hi everyone

I have a simple client/server app where the client sends data (TCP/IP) to the server in the form of squares of pixel colors (3 floats per pixel). I can change the size of the square, but by default I started with something like 32x32 pixels. So overall that's a packet of 32*32*3*8 bytes (8 bytes per float) sent to the server. These packets are sent to the server for an entire image, which can be, say, 1024x1024 in resolution. For my testing I set the pixel color to always be 1 0 0.1.

On the client side:
* 1 write -> 4 integers to specify pos and size of the square in the frame
* 1 write -> square dim^2 * 3 (RGB) * 8 bytes pixel color data

On the server side I do 2 reads:
* do a first read to find the dimension and position of the square in the frame with 1 read
* read all the pixels for the current square (square size^2 * 3 (RGB) * 8 bytes) with 1 read

It runs okay, but occasionally the squares have wrong values. For a block of 32x32 pixels, say, the first 200 pixels have RGB values of 1 0 0.1, then the next 200 pixels have their RGB inverted, say 0 0.1 1, then the rest of the pixels will be 0.1 0 1. This problem only happens when the square reaches a certain size. When the squares are 8x8 or 16x16, this problem doesn't seem to occur.

So I was wondering whether, when the packet of data sent is too big, the order of the bytes can change when it's read by the server. Or is this not supposed to happen at all, which would suggest a bug in the code? (However, I really checked that I was writing and reading the correct number of bytes, so I find that strange.)

If the write/read process doesn't guarantee that the bytes are not re-ordered, does that only happen once the data written and read goes over a certain limit? Can I find this limit? In other words, is there a limit (number of bytes) under which it is certain that bytes won't be re-ordered? It seems that I can send the data in much smaller packets (by sending, say, 8 pixels at a time from a square of 64x64), but isn't that rather inefficient?

Thanks a lot for your help

-coralie

Nominal Animal · 02-17-2012, 03:19 AM

Quote:

Originally Posted by Coralie Saint Ourens

For a block of 32x32 pixels, say, the first 200 pixels have RGB values of 1 0 0.1, then the next 200 pixels have their RGB inverted, say 0 0.1 1, then the rest of the pixels will be 0.1 0 1.

TCP delivers the bytes of a stream in order, so they are never shuffled in transit. However, both writes (write(), send(), sendto() etc.) and reads (read(), recv(), recvfrom() etc.) may return a short count. I have seen your exact symptoms when programmers ignore the return values from I/O functions -- and only then.

Short counts occur quite often when using sockets and transmitting significant amounts of data. The kernel only allocates a limited amount of internal buffering for the sockets, and will happily return a short count whenever the buffer is too full to write to, or if there is less data available than the programmer requested.

Here are two example C functions which read and write exactly the number of bytes requested.

Code:

#include <unistd.h>
#include <sys/types.h>
#include <errno.h>

int read_exact(const int descriptor, void *const buffer, const size_t bytes)
{
    char       *head = (char *)buffer;
    char *const tail = (char *)buffer + bytes;
    ssize_t     n;

    while (head < tail) {

        n = read(descriptor, head, (size_t)(tail - head));

        if (n > (ssize_t)0)
            head += n;

        else
        if (n == (ssize_t)0)
            return errno = ENOENT; /* End of file/stream, actually */

        else
        if (n != (ssize_t)-1)
            return errno = EIO;    /* Library bug, not seen in the wild */

        else
        if (errno != EINTR)
            return errno;          /* Actual error. EINTR is not an error. */
    }

    return 0;
}

int write_exact(const int descriptor, const void *const data, const size_t bytes)
{
    const char       *head = (const char *const)data;
    const char *const tail = (const char *const)data + bytes;
    ssize_t           n;

    while (head < tail) {
        n = write(descriptor, head, (size_t)(tail - head));
        if (n > (ssize_t)0) {
            head += n;

        } else
        if (n != (ssize_t)-1 || errno != EINTR) {
            if (n != (ssize_t)-1)
                errno = EIO;
            return errno;
        }
    }

    return 0;
}

Both functions return 0 if successful, nonzero error code otherwise. (See man errno for details.)
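As a usage sketch, the two-read protocol from the first post could be received along these lines. The struct square_header layout, the MAX_SIDE limit, and the names read_all() and recv_square() are hypothetical; read_all() is only a compact stand-in for read_exact() above so the snippet compiles on its own. It assumes both ends use the same byte order and that components are sent as float (use double if they really are 8 bytes each).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_SIDE 4096   /* hypothetical sanity limit on the square size */

/* Hypothetical header: position and size of one square in the frame. */
struct square_header {
    int32_t x, y, width, height;
};

/* Compact stand-in for read_exact(): loop until all bytes have arrived. */
int read_all(int fd, void *buf, size_t bytes)
{
    char *p = buf;

    while (bytes > 0) {
        ssize_t n = read(fd, p, bytes);
        if (n > 0) {
            p += n;
            bytes -= (size_t)n;
        } else if (n == 0) {
            return ENOENT;              /* peer closed the stream */
        } else if (errno != EINTR) {
            return errno;               /* actual error */
        }
    }
    return 0;
}

/* Receive one square: the header first, then width*height*3 floats.
   On success *pixels points to malloc()ed data the caller must free().
   Returns 0 on success, otherwise an errno value. */
int recv_square(int fd, struct square_header *hdr, float **pixels)
{
    int err = read_all(fd, hdr, sizeof *hdr);
    if (err)
        return err;

    /* Never trust sizes from the wire without checking them. */
    if (hdr->width <= 0 || hdr->height <= 0 ||
        hdr->width > MAX_SIDE || hdr->height > MAX_SIDE)
        return EINVAL;

    size_t count = (size_t)hdr->width * (size_t)hdr->height * 3;
    float *data = malloc(count * sizeof *data);
    if (!data)
        return ENOMEM;

    err = read_all(fd, data, count * sizeof *data);
    if (err) {
        free(data);
        return err;
    }
    *pixels = data;
    return 0;
}
```

The point is only that each logical read goes through the retry loop, so a short count can no longer shift the pixel data by a partial pixel.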

Usually it is better to do communications the other way around: instead of reading each packet separately, read as much as you can, picking out the completed packets. (It is a lot more efficient, for one.)

Use a large buffer, at least twice the size of your largest data packet. Whenever you have enough data in the buffer, call your data handler with a pointer to the data. Whenever there is not enough data in the buffer, move the leftover data to the start of the buffer, and read some more data. (Issue a read to fill the buffer, but do not retry. Most often you get enough data to handle a new packet. If not, then retry the read.) Repeat until all data is processed.
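A rough sketch of that buffered approach follows. The framing (a 4-byte length in host byte order before each payload), the MAX_FRAME limit, and the names pump_frames(), frame_handler, frame_stats and count_frame() are all assumptions made for illustration, not something from the original poster's protocol.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FRAME 4096                   /* hypothetical largest payload */
#define BUF_SIZE  (2 * (MAX_FRAME + 4))  /* twice the largest packet */

/* Hypothetical callback: receives one complete payload. The pointer is
   only valid for the duration of the call. */
typedef void (*frame_handler)(const unsigned char *payload,
                              uint32_t len, void *ctx);

/* Read from fd until end of stream, calling handler for each complete
   frame. Returns 0 on a clean end of stream, otherwise an errno value. */
int pump_frames(int fd, frame_handler handler, void *ctx)
{
    unsigned char buf[BUF_SIZE];
    size_t have = 0;                     /* bytes currently buffered */

    for (;;) {
        size_t pos = 0;

        /* Hand out every complete frame already in the buffer. */
        while (have - pos >= 4) {
            uint32_t len;
            memcpy(&len, buf + pos, 4);
            if (len > MAX_FRAME)
                return EPROTO;           /* corrupt or hostile length */
            if (have - pos - 4 < len)
                break;                   /* frame still incomplete */
            handler(buf + pos + 4, len, ctx);
            pos += 4 + (size_t)len;
        }

        /* Move the leftover partial frame to the start of the buffer. */
        memmove(buf, buf + pos, have - pos);
        have -= pos;

        /* One read to fill the rest of the buffer; a short count is fine. */
        ssize_t n = read(fd, buf + have, BUF_SIZE - have);
        if (n > 0)
            have += (size_t)n;
        else if (n == 0)
            return have ? EPROTO : 0;    /* end of stream mid-frame is an error */
        else if (errno != EINTR)
            return errno;
    }
}

/* Example handler: counts frames and payload bytes. */
struct frame_stats { unsigned frames; size_t bytes; };

void count_frame(const unsigned char *payload, uint32_t len, void *ctx)
{
    struct frame_stats *s = ctx;
    (void)payload;
    s->frames++;
    s->bytes += len;
}
```

Because the leftover is always shorter than one complete frame, there is always room in the buffer for the next read.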

Finally, using floats for color components is more than a bit overkill, unless you are doing HDR research. Even then, uint16_t's (16-bit unsigned shorts) are usually more than enough. Usually people just use unsigned chars (0..255) instead.

Coralie Saint Ourens · 02-17-2012, 06:50 AM

Thank you very much for your help. Funnily (sadly) enough, I had done that already quite a few years ago ;-( but completely forgot about it. I think it's good, though, that there's a post around that explains this clearly.

Yes, passing floats is overkill, but I do indeed deal with HDR images. It's not optimised, but I am just trying to get something working for now; later I will try to pass less data (for instance half floats).

Thanks a lot for your clear and complete answer (with examples). Thanks for your time.

-coralie

Nominal Animal · 02-17-2012, 10:41 AM

Quote:

Originally Posted by Coralie Saint Ourens

Yes, passing floats is overkill, but I do indeed deal with HDR images.

Ah, very interesting! (I've been wondering if a hardware-based 8x8, 16x16, and 32x32 iDCT driver (maybe on an FPGA? or just a DSP?) might be needed to give Linux video recording and playback a kick in the pants.)

I don't know whether binary16 floats (half precision) will bring any performance benefits -- other than cutting the transferred data in half -- due to the conversions needed. If you do the computations on an x86 CPU, you may be better off just using floats. You might wish to check, though!

See Fast Half Float Conversions (PDF) for efficient methods for the half-to-single precision and single precision-to-half precision conversions. Basically, you use uint32_t and uint16_t and bit operations and small lookup tables to do the conversion using binary arithmetic. No floating point ops, so the conversion should be very efficient. It is better to do the conversions in bulk, i.e. entire block at a time, so the CPU can work off the cache.
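To give an idea of what such a conversion looks like, here is a minimal sketch that uses only integer bit operations. It is not the table-based method from the paper, just the plain shift-and-mask version, and the function names half_to_float() and float_to_half() are made up for this example. It assumes float is IEEE 754 binary32, and it rounds halfway cases up rather than to even.

```c
#include <stdint.h>
#include <string.h>

/* binary16 bit pattern -> float. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    int32_t  exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float    f;

    if (exp == 31) {                        /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp == 0 && man == 0) {      /* signed zero */
        bits = sign;
    } else {
        if (exp == 0) {                     /* subnormal: renormalize */
            exp = 1;
            while (!(man & 0x400u)) {
                man <<= 1;
                exp--;
            }
            man &= 0x3FFu;
        }
        /* Rebias the exponent from 15 to 127 (difference 112). */
        bits = sign | ((uint32_t)(exp + 112) << 23) | (man << 13);
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* float -> binary16 bit pattern. */
uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);

    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t fexp = (x >> 23) & 0xFFu;
    uint32_t man  = x & 0x7FFFFFu;
    int32_t  exp  = (int32_t)fexp - 127 + 15;   /* rebias for binary16 */
    uint32_t half;

    if (fexp == 0xFFu)                      /* infinity or NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? 0x200u : 0u));
    if (exp >= 31)                          /* too large: infinity */
        return (uint16_t)(sign | 0x7C00u);
    if (exp <= 0) {                         /* subnormal half, or zero */
        if (exp < -10)
            return (uint16_t)sign;          /* too small: signed zero */
        man |= 0x800000u;                   /* restore the implicit 1 */
        uint32_t shift = (uint32_t)(14 - exp);
        half = man >> shift;
        half += (man >> (shift - 1)) & 1u;  /* round to nearest */
        return (uint16_t)(sign | half);
    }
    half = ((uint32_t)exp << 10) | (man >> 13);
    half += (man >> 12) & 1u;               /* a carry into the exponent is correct */
    return (uint16_t)(sign | half);
}
```

Running either function over a whole block of pixels in one loop keeps the work in cache, as suggested above.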

When sending image data, you may find writev useful, since it lets you gather the metadata from one pointer, and the data from another, but still send it as a single packet. It's not that useful with TCP sockets, but for datagram sockets, it avoids the need for a separate copy if you want to send complete datagrams only. The short count handling is very tricky (I can write an example if needed) but for datagram sockets like UDP you normally just ignore the badly formed datagrams and retry instead, so it's much easier there.
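For a stream socket, the short count handling with writev() might look roughly like the sketch below. The function name write_gathered() and its parameters are hypothetical; the idea is simply to advance through the iovec array by however many bytes the kernel accepted and call writev() again with what remains.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write a header and a payload with writev(), retrying on short counts.
   Returns 0 on success, otherwise an errno value. */
int write_gathered(int fd, const void *hdr, size_t hdr_len,
                   const void *data, size_t data_len)
{
    struct iovec iov[2];
    int first = 0;      /* index of the first iovec with bytes left */

    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = (void *)data;
    iov[1].iov_len  = data_len;

    while (first < 2) {
        if (iov[first].iov_len == 0) {      /* nothing left in this part */
            first++;
            continue;
        }

        ssize_t n = writev(fd, iov + first, 2 - first);
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted, just retry */
            return errno;
        }
        if (n == 0)
            return EIO;                     /* should not happen */

        /* Skip over the bytes that were actually written. */
        size_t left = (size_t)n;
        while (left > 0 && first < 2) {
            if (left >= iov[first].iov_len) {
                left -= iov[first].iov_len;
                iov[first].iov_len = 0;
                first++;
            } else {
                iov[first].iov_base = (char *)iov[first].iov_base + left;
                iov[first].iov_len -= left;
                left = 0;
            }
        }
    }
    return 0;
}
```

With a datagram socket you would not loop like this, since each writev() call produces one datagram.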

Of course, if you use single precision internally, and half precision only when communicating, you can combine the packet construction/parsing and half/float conversions into a single operation; no need to fiddle with writev().

Finally, if you need high throughput, usually the bottleneck is not the bandwidth, but the number of packets transferred per second. The solution boils down to sending/writing and receiving/reading more data per function call, and making sure you use large transfer sizes (if using TCP/IP or UDP/IP).

Glad to help,

Coralie Saint Ourens · 02-17-2012, 11:16 AM

Maybe this should go in another post, but I believe it is "slightly" a continuation of the question I asked before. At least I now have the transfer part working, which is great. Thx again.
However, I noticed something else which I can't explain (but I am sure there's a good reason for it). When I write the packets without adding a slight delay between each send, the whole process of sending all the data takes a looooong time (5 minutes). When I put a 2 millisecond delay between each send, going through all the packets takes 10 seconds!? This is the code I use:

Code:

	for (unsigned i = 0; i < nbuckets * nbuckets; ++i) {
		// compute coordinates
		int by = i / nbuckets;
		int bx = i % nbuckets;
		int x0 = bx * bucketSize;
		int y0 = by * bucketSize;
		int x1 = x0 + bucketSize;
		int y1 = y0 + bucketSize;
		Box2<int> dim(Vec2<int>(x0, y0), Vec2<int>(x1 - 1, y1 - 1));
		Vec3<float> *data = new Vec3<float>[bucketSize * bucketSize];
		Vec3<float> c(1, 0.1, 0.0);//drand48(), drand48(), drand48());
		for (unsigned j = 0; j < bucketSize * bucketSize; ++j) {
			data[j] = c;
		}
		clock_t mseconds = 0; // <<< SLOW
		clock_t goal = mseconds + clock();
		while (goal > clock());

		renderView->sendImage(dim, data);
		
		
		delete [] data;
	}

Any idea why? That seems like a hack to me, and it indicates that there's still something I am not doing properly. Any help greatly appreciated.

Thank you -c

Nominal Animal · 02-17-2012, 12:34 PM

Quote:

Originally Posted by Coralie Saint Ourens

Code:

		clock_t mseconds = 0; // <<< SLOW
		clock_t goal = mseconds + clock();
		while (goal > clock());

The man 3 clock manpage states that clock() ... returns an approximation of processor time used ...

You are not waiting mseconds clock ticks ((double)mseconds/(double)CLOCKS_PER_SEC seconds) of wall clock time, you're busy-waiting in the loop until enough CPU time has been wasted.

So, there are two issues: clock() counts in ticks of 1/CLOCKS_PER_SEC seconds rather than milliseconds, and it measures the CPU time used by your program, not wall clock time. The loop is running for a long time because you told it to.

Coralie Saint Ourens · 02-20-2012, 06:11 AM

Thank you again for your help, and I agree with what you wrote. However, I don't think I explained the problem well. When there is NO delay (when I comment this code out), the data takes MUCH longer to get to the other end than when I put the delay in. So I am confused by this. Is it possible that when I send the data so close together (in time), the packets actually "bump" into each other and are "harder" for the server to read than when they are sent one after the other with a small delay between each send?

How can this be ?

With this code the server gets the data in much longer time:

Code:

	for (unsigned i = 0; i < nbuckets * nbuckets; ++i) {
		
		Box2<int> dim(Vec2<int>(x0, y0), Vec2<int>(x1 - 1, y1 - 1));
		Vec3<float> *data = new Vec3<float>[bucketSize * bucketSize];
		
		renderView->sendImage(dim, data);
		
		
		delete [] data;
	}

than if I use this:

Code:

	for (unsigned i = 0; i < nbuckets * nbuckets; ++i) {
		
		Box2<int> dim(Vec2<int>(x0, y0), Vec2<int>(x1 - 1, y1 - 1));
		Vec3<float> *data = new Vec3<float>[bucketSize * bucketSize];
		
		clock_t mseconds = 2;
		clock_t goal = mseconds + clock();
		while (goal > clock());

		renderView->sendImage(dim, data);
		
		
		delete [] data;
	}

Thanks again for your input -

Nominal Animal · 02-20-2012, 07:17 AM

Ah, I had understood it the other way around, with the delayed case being the slow one.

In general, the network stacks work best when write buffers are full, and read buffers empty. If you are using TCP, and the recipient keeps its read buffer full (i.e. reads as little as possible, taking as long as possible to read the data), it is possible the delay you have is due to TCP retransmissions. Essentially, the sender keeps sending the same TCP packet at increasing intervals, because the recipient is unable to receive it while its TCP/IP receive buffer is already full.

In other words, I think the delays are caused by your recipient code, which reads the data slowly enough to cause TCP retransmissions. When you add delays between writes on the sending side, you ease the congestion on your read side, and thus avoid or at least decrease the number of TCP retransmissions.

You could do a tcpdump of the connection in both cases, and compare the number of packets used to see if my theory is right.

Not seeing any of the actual sources -- send and receive -- means the above is only a guess, though. Care to post the actual send and receive code?

Also, use usleep() or nanosleep() for the delay instead. In Linux, usleep() is implemented in the GNU C library using a nanosleep() syscall, and nanosleep() does not interfere with alarm() or signals in any way.
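A minimal sketch of such a delay, using nanosleep() and restarting it if a signal interrupts the sleep. The helper name sleep_ms() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep for the given number of milliseconds of wall clock time.
   Returns 0 on success, otherwise an errno value. */
int sleep_ms(long ms)
{
    struct timespec req;

    if (ms < 0)
        return EINVAL;

    req.tv_sec  = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    /* When a signal interrupts nanosleep(), it stores the time that was
       not slept in its second argument, so keep sleeping the remainder. */
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return errno;
    }
    return 0;
}
```

Unlike the clock() loop, this does not burn CPU time while waiting; in the send loop it would be called as sleep_ms(2).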

I haven't seen anything like this behaviour in my own code. Then again, I always use a large (application-side) read buffer for each connection, and issue each read/recv to fill the entire buffer (effectively emptying the kernel receive queue), then process each data frame from the buffer. Remember: each retransmission costs milliseconds or more in transmission time (extra full round-trip-time). If that occurs for a large number of packets, you easily get very long delays and slow transmission. It is always better to stream the data as fast as possible, even if you had to use a temporary file to store the data, to avoid retransmissions. (Usually that is not necessary, because the processors are powerful enough to do their work while streaming the data.)

On the other hand, if you want to cap the transmit bandwidth (like most BitTorrent clients do), you need to do it on the write side, basically issuing the writes rarely enough (with sleeps etc. in between) to not exceed the bandwidth. This is pretty much what your code does with nonzero mseconds value. (Exceeding some specific bandwidth causes the recipient rx buffer to become full, which causes retransmissions, which causes a serious slowdown.)
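As a rough illustration of capping on the write side, the pause needed after each send can be computed from the packet size and the target rate. The helper pace_delay_ns() and its parameters are hypothetical, and the arithmetic assumes packet sizes small enough that bytes times 1e9 fits in 64 bits.

```c
#include <stdint.h>

/* Nanoseconds to pause after sending `bytes` so that the average rate
   stays at or below `bytes_per_sec`, given that the send itself already
   took `spent_ns`. A rate of 0 is treated as "no cap". */
uint64_t pace_delay_ns(uint64_t bytes, uint64_t bytes_per_sec,
                       uint64_t spent_ns)
{
    if (bytes_per_sec == 0)
        return 0;

    /* Time this many bytes is allowed to occupy at the target rate. */
    uint64_t budget_ns = bytes * 1000000000ull / bytes_per_sec;

    return budget_ns > spent_ns ? budget_ns - spent_ns : 0;
}
```

For example, a 12288-byte square at a cap of 6144000 bytes per second has a budget of 2 ms, so if the write took 1 ms the sender would sleep for another 1 ms.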

sundialsvcs · 02-20-2012, 08:53 AM

Two suggestions:

Arrange for each command that you send to have a sequence-number attached to it. The receiving code makes certain that each number received is physically sequential (modulo n, of course). If it's not, then you know that there is a bug somewhere.

Generally, I write systems like this where there is one process that "multiplexes and demultiplexes" the inbound and outbound traffic. Inbound packets are "bursted" and placed one by one on an internal queue. Then, an outbound packet is assembled from another queue, filling the buffer with as many packets as the selected buffer-size will hold. (If it is obliged by protocol to send an empty response packet, it does so.) But, I make no attempt to deal with mux/demux anywhere else except this one process or thread. Apart from this one "knowledgeable" thread, both senders and receivers consider only a packet stream coming through a buffered queue.
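The sequence-number check from the first suggestion can be very small. In this sketch the names seq_checker and seq_check() are invented, and a 16-bit counter is assumed, so the numbers wrap modulo 65536.

```c
#include <stdint.h>

/* Hypothetical receiver-side state: the next sequence number expected. */
struct seq_checker {
    uint16_t expected;
};

/* Returns 1 if seq is the expected number (and advances the state),
   0 if a packet was lost, duplicated, or misparsed. */
int seq_check(struct seq_checker *c, uint16_t seq)
{
    if (seq != c->expected)
        return 0;

    c->expected = (uint16_t)(c->expected + 1);   /* wraps modulo 65536 */
    return 1;
}
```

On a TCP stream a failed check almost always means the receiver lost track of the packet boundaries, which is exactly the kind of bug discussed at the start of this thread.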

Coralie Saint Ourens · 02-24-2012, 12:33 PM

Thank you to you both. These are very helpful and complete answers.

I much appreciate that you shared your knowledge and time to help me (and, I am sure, others) with this.