Handling field width in multibyte strings with printf formats?

johan162 · 03-13-2012, 10:23 AM

I use a solution to a problem (one that more people than me must have had) that I feel is less than elegant, so I'm curious how others have handled similar situations.

Core problem:
The standard printf() function will print multibyte strings with no problem but it will not correctly handle width formatting since it will calculate the length of the string in bytes and not displayed characters.

For utf-8 encoded strings this is a problem, since one displayed character might use two or more bytes (up to four). This means that formats such as

Code:

"%-20s : %-10s\n"

(for example) will not work as expected for true multibyte strings, even after a setlocale() call that selects a utf-8 locale (this is the behavior with glibc).
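
To make the problem concrete, here is a minimal sketch that measures how many spaces printf() actually appends for a given field width. The helper name padding_added is made up for this illustration, and the UTF-8 bytes are written as hex escapes so the example does not depend on the source file encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the number of padding spaces that
   "%-*s" appends to s for the requested field width. */
int padding_added(const char *s, int width) {
    char buf[128];
    assert(s != NULL && width >= 0);
    /* snprintf() returns the full formatted length in bytes */
    int total = snprintf(buf, sizeof buf, "%-*s", width, s);
    return total - (int)strlen(s);
}
```

With the three-letter string "abc" and a width of 6 this gives 3 spaces, as expected. With the three displayed characters "\xc3\xa5\xc3\xa4\xc3\xb6" (å, ä, ö; 6 bytes in utf-8) it gives 0, so the column ends up three positions too narrow.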

Possible solutions (and drawbacks)

1. Use wide character as internal format.
This solves the problem but requires a massive, error-prone rewrite of existing code. In addition, this internal format is not portable (in general) unless strings are printed with the wprintf* family and the proper locale is set so that the output gets a proper encoding. Furthermore, this only works for stream input/output. There is no wide-character equivalent of the unbuffered read()/write() calls.

2. Semi-manual format
By manually calculating the displayed width of strings known to be mb, it is possible to preformat strings that can later be printed with the normal standard printf() family. However, such a manual conversion must go via wide characters in some way, since this seems to be the only way to guarantee a correct count of displayed characters regardless of encoding.

The example below illustrates one possible way of doing this

Code:

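#include <string.h>
#include <wchar.h>

/* Calculate displayed number of chars in a mb string (current locale).
   Returns (size_t)-1 if s contains an invalid multibyte sequence. */
size_t _mblen(const char *s) {
  mbstate_t t;
  const char *scopy = s;
  memset(&t, 0, sizeof (t));
  /* With a NULL destination the length argument is ignored */
  return mbsrtowcs(NULL, &scopy, 0, &t);
}

/* Pad a mb string in place to 'pad' displayed size.
   'maxlen' is the size of the buffer s points to.
   Returns 0 on success and -1 on error. */
int _mbpad(char *s, size_t pad, size_t maxlen) {
  size_t mbn = _mblen(s);
  size_t n = strlen(s);
  /* _mblen() reports an invalid sequence as (size_t)-1 */
  if( (size_t)-1 == mbn || mbn > pad || n+(pad-mbn) >= maxlen ) return -1;
  for(size_t i=0; i < pad-mbn; ++i ) {
    s[n+i] = ' ';
  }
  s[n+pad-mbn]='\0';
  return 0;
}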

/** possible usage. Assume the strings mystring1 and mystring2 exists **/
const size_t bsize=255;
char tmpbuf1[bsize],tmpbuf2[bsize];

strncpy(tmpbuf1,mystring1,bsize-1); tmpbuf1[bsize-1]='\0';
strncpy(tmpbuf2,mystring2,bsize-1); tmpbuf2[bsize-1]='\0';
_mbpad(tmpbuf1, 30, bsize); // Ignore possible error condition for clarity
_mbpad(tmpbuf2, 30, bsize); // Ignore possible error condition for clarity
printf("%s : %s\n",tmpbuf1,tmpbuf2);
// Equivalent to printf("%-30s : %-30s\n",mystring1,mystring2); for non-mb strings.
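
A possible variation that avoids the temporary buffers is to keep the strings untouched and instead widen the field by the number of "extra" bytes, passing the width through "%-*s". The following is only a sketch under the assumption that the strings are valid utf-8; the names utf8_chars, mb_width and format_row are made up for this illustration. It counts code points, not display columns, so combining characters or double-width CJK characters would still need something like the POSIX wcswidth() after a conversion to wide characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count utf-8 code points: every byte that is not a continuation
   byte (10xxxxxx) starts a new character. Assumes valid utf-8. */
size_t utf8_chars(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) ++n;
    }
    return n;
}

/* Field width in bytes that makes printf() pad s to 'cols' characters:
   the requested columns plus the bytes that do not show up as characters. */
int mb_width(const char *s, int cols) {
    return cols + (int)(strlen(s) - utf8_chars(s));
}

/* Formats like "%-<cols>s : %-<cols>s\n", but padded by characters.
   Returns what snprintf() returns (the formatted length in bytes). */
int format_row(char *out, size_t outlen,
               const char *a, const char *b, int cols) {
    return snprintf(out, outlen, "%-*s : %-*s\n",
                    mb_width(a, cols), a, mb_width(b, cols), b);
}
```

A call such as printf("%-*s : %-*s\n", mb_width(mystring1, 30), mystring1, mb_width(mystring2, 30), mystring2); would then replace the padded copies. It still means touching every format string, so it does not solve the va_list case either.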

Other people must have solved the same problem. How did you handle it? Should it be considered a bug in glibc that the printf family doesn't take the locale (and mb strings) into account when computing field widths?

(I should note that in my actual application I make frequent use of va_list versions of the printf* family which makes it impossible to implement this "silently" under the hood since this would require pre-parsing of the formatting string and adjusting only the string arguments)

Thoughts?

dwhitney67 · 03-14-2012, 08:39 PM

Have you looked into using wprintf()?

Edit: I guess you have, and it seems that you have your reasons against pursuing its usage.

johan162 · 03-15-2012, 01:27 AM

Yes, this is basically solution 1 as listed in my post. Using wide chars for all internal processing would be a possibility, but that would require a complete refactoring of the program. All char and char * types would have to be changed, and with this come some subtle but error-prone issues. Since I don't really need the full wide-char functionality (the utf-8 half-way house is fine), this is not really a road I want to travel.