host to network byte order for strings

aspiring_stellar · 01-14-2012, 03:51 AM

Hi guys,
I want to convert a "string" from host byte order to network byte order. For integers and shorts, I can use htons() and htonl(), but what should I do for strings?

Thanks.

Nominal Animal · 01-14-2012, 04:43 AM

In general, strings have no byte order. For variable-length character encodings like UTF-8, the order is specified in the encoding, and does not depend on endianness (also known as byte order). It is always the same.

Some encodings do use multiple bytes for every character, though. For example, in UTF-16 the "string" is actually a sequence of 16-bit unsigned integers (unsigned shorts), and UCS-4 "strings" are actually sequences of 31-bit unsigned (32-bit signed) integers. Since these "strings" are actually sequences of shorts or ints, you can apply htons() or htonl() to each element to convert them to network byte order. However, Unicode users using those encodings should instead use a byte order mark (BOM, U+FEFF) as the first character to let the reader handle the byte order (if there is any risk of confusion); correspondingly, all readers should be prepared to understand any byte order for multibyte Unicode strings based on the initial byte order mark. (There is no reason to use or retain byte order marks when using UTF-8.)
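If you do go the htons() route for UTF-16, a minimal sketch could look like the following. It assumes a POSIX system (for <arpa/inet.h>) and that the string is held as an array of uint16_t code units; the helper names utf16_hton() and utf16_ntoh() are made up for this example and are not library functions.

```c
#include <arpa/inet.h>  /* htons(), ntohs() (POSIX) */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n UTF-16 code units, in place,
 * from host byte order to network (big-endian) byte order. */
static void utf16_hton(uint16_t *units, size_t n)
{
    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        units[i] = htons(units[i]);
}

/* Hypothetical helper: the reverse conversion, for the receiving side. */
static void utf16_ntoh(uint16_t *units, size_t n)
{
    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        units[i] = ntohs(units[i]);
}
```

The sender would call utf16_hton(buf, count) just before writing the buffer to the socket, and the receiver would call utf16_ntoh(buf, count) right after reading it. On a big-endian host both calls leave the data unchanged.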

For a large number of reasons, I recommend using UTF-8 for your strings. Each character (Unicode U+0000 to U+10FFFF) is encoded in one to four bytes (and therefore you should be aware of the difference between string length in bytes and in characters!), but the order of the bytes within each character is always the same and does not depend on the host's byte order. No byte order conversion is ever done for such strings. Furthermore, you can use standard C library functions to handle the strings. (UTF-16 and UCS-4 require wide character support, using special wide character types.)
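To illustrate the bytes-versus-characters difference, here is a small sketch. It assumes the input is valid, NUL-terminated UTF-8; utf8_codepoints() is a name made up for this example, not a standard function. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which gives the number of code points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a valid,
 * NUL-terminated UTF-8 string. Continuation bytes have the
 * form 10xxxxxx, so every other byte starts a new code point. */
static size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

For the string "\xE2\x82\xAC" "5" (the euro sign U+20AC followed by the digit 5), strlen() reports 4 bytes while utf8_codepoints() reports 2 characters. The bytes go on the wire exactly as they sit in memory, with no htons() or htonl() involved.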