wstring <-> utf8 conversion in pure C++

ErV · 01-30-2009, 02:15 PM

Hello.

I'm looking for a utf8<->wstring conversion routine implemented in pure C++, without external libraries. The utf8 data is stored in std::string.

Basically, I want to do this:

Code:

#include <iostream>
#include <string>
#include <locale>
int main(int argc, char** argv){
	std::wstring ws(L"фыва");
	std::string s("asdf");
	std::cout << ws << s << std::endl;
	return 0;
}

or this

Code:

#include <iostream>
#include <string>
#include <locale>
int main(int argc, char** argv){
	std::wstring ws(L"фыва");
	std::string s("asdf");
	std::wcout << ws << s << std::endl;
	return 0;
}

for any istream/ostream/wistream/wostream-based class (notice that both string and wstring are used; normally this gives a compiler error). Obviously, to do this I'll need to overload operator<< for ostream/wostream, and for that I'll need a conversion routine.

I spent some time looking for a solution and found the following:

1) utf8_codecvt_facet. Looks good, but it was meant to be used with wifstreams/wofstreams, so it might be difficult to use for converting between utf8 and wstring.
2) libiconv. Too heavy for a lightweight project, and it requires opening the source code, so I can't just take one routine from there. I also need only utf8<->wstring conversion, so it looks like overkill.
3) This code:

Code:

#include <locale>
#include <iostream>
#include <string>
#include <sstream>
using namespace std ;
       
wstring widen( const string& str )
{
      wostringstream wstm ;
      wstm.imbue(std::locale("en_US.UTF-8"));
      const ctype<wchar_t>& ctfacet =
      use_facet< ctype<wchar_t> >( wstm.getloc() ) ;
      for( size_t i=0 ; i<str.size() ; ++i )
      wstm << ctfacet.widen( str[i] ) ;
      return wstm.str() ;
}
       
string narrow( const wstring& str )
{
      ostringstream stm ;
      stm.imbue(std::locale("en_US"));
      const ctype<char>& ctfacet =
      use_facet< ctype<char> >( stm.getloc() ) ;
      for( size_t i=0 ; i<str.size() ; ++i )
      stm << ctfacet.narrow( str[i], 0 ) ;
      return stm.str() ;
}
       
int main()
{
      {
      const char* cstr = "abcdefghijkl" ;
      const wchar_t* wcstr = widen(cstr).c_str() ;
      wcout << wcstr << L'\n' ;
      }
      {
      const wchar_t* wcstr = L"mnopqrstuvwx" ;
      const char* cstr = narrow(wcstr).c_str() ;
      cout << cstr << '\n' ;
      }
 }

It is the closest to what I've been looking for, but apparently it doesn't work. For example, narrow(widen(std::string("фыва"))) on a machine with a utf8 locale doesn't return "фыва", and I don't see the second string ("mnopqrstuvwx") in the terminal on my machine.
4) mbstowcs. It will work, but I'd like to avoid C functions and changing the C locale in this project.

But none of these is a "pure lightweight C++ solution without external libraries".
Right now I'm writing my own conversion routine using this specification, but I'd like to know whether there is a standard way to convert between utf8 and wstring in pure C++, because I suspect I missed something. Any ideas?
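
A likely reason the widen()/narrow() pair in item 3 cannot round-trip "фыва": ctype::widen and ctype::narrow map exactly one char to one wchar_t and back, while UTF-8 spends two bytes on each Cyrillic letter. Separately, widen(cstr).c_str() and narrow(wcstr).c_str() return pointers into temporaries that are destroyed at the end of the statement, so printing them afterwards is undefined behaviour, which may be why the second string never shows up. The sketch below only makes the byte/character mismatch visible; countCodePoints is a made-up helper, not a standard function, and it assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80)   // not a continuation byte, so a new code point starts here
            ++n;
    }
    return n;
}
```

With "фыва" spelled out as UTF-8 bytes ("\xD1\x84\xD1\x8B\xD0\xB2\xD0\xB0"), size() is 8 while countCodePoints() gives 4, so any per-byte conversion has twice as many input units as there are characters.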

Thanks for your time.

ErV · 01-31-2009, 01:41 AM

Here is the utf8 <-> wstring conversion routine I wrote.
Str.h:

Code:

#ifndef STR_H
#define STR_H
#include <string>
#include <iostream>

typedef std::string Str;
typedef std::wstring WStr;

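std::ostream& operator<<(std::ostream& f, const WStr& s);
std::istream& operator>>(std::istream& f, WStr& s);
void utf8toWStr(WStr& dest, const Str& src);
void wstrToUtf8(Str& dest, const WStr& src);
//convenience overloads that return the result by value (defined in Str.cpp)
WStr utf8toWStr(const Str& src);
Str wstrToUtf8(const WStr& src);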

#endif

Str.cpp:

Code:

#include "Str.h"
#ifdef UTF8TEST
#include <stdio.h>
#endif

void utf8toWStr(WStr& dest, const Str& src){
	dest.clear();
	wchar_t w = 0;
	int bytes = 0;
	wchar_t err = L'�';
	for (size_t i = 0; i < src.size(); i++){
		unsigned char c = (unsigned char)src[i];
		if (c <= 0x7f){//first byte
			if (bytes){
				dest.push_back(err);
				bytes = 0;
			}
			dest.push_back((wchar_t)c);
		}
		else if (c <= 0xbf){//second/third/etc byte
			if (bytes){
				w = ((w << 6)|(c & 0x3f));
				bytes--;
				if (bytes == 0)
					dest.push_back(w);
			}
			else
				dest.push_back(err);
		}
		else if (c <= 0xdf){//2byte sequence start
			bytes = 1;
			w = c & 0x1f;
		}
		else if (c <= 0xef){//3byte sequence start
			bytes = 2;
			w = c & 0x0f;
		}
		else if (c <= 0xf7){//4byte sequence start
			bytes = 3;
			w = c & 0x07;
		}
		else{
			dest.push_back(err);
			bytes = 0;
		}
	}
	if (bytes)
		dest.push_back(err);
}

void wstrToUtf8(Str& dest, const WStr& src){
	dest.clear();
	for (size_t i = 0; i < src.size(); i++){
		wchar_t w = src[i];
		if (w <= 0x7f)
			dest.push_back((char)w);
		else if (w <= 0x7ff){
			dest.push_back(0xc0 | ((w >> 6)& 0x1f));
			dest.push_back(0x80| (w & 0x3f));
		}
		else if (w <= 0xffff){
			dest.push_back(0xe0 | ((w >> 12)& 0x0f));
			dest.push_back(0x80| ((w >> 6) & 0x3f));
			dest.push_back(0x80| (w & 0x3f));
		}
		else if (w <= 0x10ffff){
			dest.push_back(0xf0 | ((w >> 18)& 0x07));
			dest.push_back(0x80| ((w >> 12) & 0x3f));
			dest.push_back(0x80| ((w >> 6) & 0x3f));
			dest.push_back(0x80| (w & 0x3f));
		}
		else
			dest.push_back('?');
	}
}

Str wstrToUtf8(const WStr& str){
	Str result;
	wstrToUtf8(result, str);
	return result;
}

WStr utf8toWStr(const Str& str){
	WStr result;
	utf8toWStr(result, str);
	return result;
}

std::ostream& operator<<(std::ostream& f, const WStr& s){
	Str s1;
	wstrToUtf8(s1, s);
	f << s1;
	return f;
}

std::istream& operator>>(std::istream& f, WStr& s){
	Str s1;
	f >> s1;
	utf8toWStr(s, s1);
	return f;
}

#ifdef UTF8TEST
bool utf8test(){
	WStr w1;
	//for (wchar_t c = 1; c <= 0x10ffff; c++){
	for (wchar_t c = 0x100000; c <= 0x100002; c++){
		w1 += c;	
	}
	Str s = wstrToUtf8(w1);
	WStr w2 = utf8toWStr(s);
	bool result = true;
	if (w1.length() != w2.length()){
		printf("length differs\n");
		//std::cout << "length differs" << std::endl;
		result = false;
	}
	
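	//%ls is the standard conversion for wchar_t strings
	printf("w1: %ls\ns: %s\nw2: %ls\n", w1.c_str(), s.c_str(), w2.c_str());
	
	for (size_t i = 0; i < w1.size(); i++)
		if (w1[i] != w2[i]){
			result = false;
			//cast explicitly so the arguments match the format specifiers
			printf("character at pos %lx differs (expected %.8lx got %.8lx)\n", (unsigned long)i, (unsigned long)w1[i], (unsigned long)w2[i]);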
			//std::cout << "character at pos " << i  << " differs" << std::endl;
			break;
		}
		
	if (!result){
		printf("utf8 dump: \n");
		for (size_t i = 0; i < s.size(); i++)
			printf("%2x ", (unsigned char)s[i]);
	}
	
	return result;
}

int main(int argc, char** argv){
	std::wstring ws(L"фыва");
	std::string s("фыва");
	std::cout << ws << s << std::endl;
	std::cout << wstrToUtf8(utf8toWStr("фыва")) << std::endl;
	if (utf8test())
		std::cout << "utf8Test successful" << std::endl;
	else 
		std::cout << "utf8Test failed" << std::endl;
	return 0;
}
#endif

So, is there any way to do this in standard C++ without writing the conversion routine yourself (as I did)?
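
A note on later standards: C++11 added std::wstring_convert and std::codecvt_utf8 (header <codecvt>), which do this conversion without external libraries. Both have been deprecated since C++17, so treat the following as a sketch rather than a recommendation. utf8ToWide and wideToUtf8 are made-up names, and the sketch assumes a platform where wchar_t is 32 bits wide (where it is 16 bits, std::codecvt_utf8<wchar_t> only covers UCS-2).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrappers around the C++11 conversion facility.
// Both throw std::range_error if the input cannot be converted.
std::wstring utf8ToWide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.from_bytes(utf8);
}

std::string wideToUtf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.to_bytes(wide);
}
```

Compilers may warn about the deprecation when building this in C++17 mode or later.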

AceofSpades19 · 01-31-2009, 02:41 AM

Is there any particular reason you have the code at size 1?

ErV · 01-31-2009, 03:53 AM

Quote:

Originally Posted by AceofSpades19

Is there any particular reason you have the code at size 1?

I'm not sure what exactly you are talking about. Please, explain/elaborate. If you are asking why I'm not using wide-character streams instead of narrow-character streams, it is because I have code that uses a mixture of wstring and string classes. Moving everything to wide streams is not possible, because certain calls require const char*, and some configuration files need to be 8-bit compatible. So I'll run into utf8<->wchar_t conversion anyway; it simply can't be avoided. Also, storing data as utf8 in external files is more compact and less platform-dependent (for example, wchar_t might be 2 bytes long on Windows and 4 bytes long on Linux), even when you use wchar_t-based strings internally.
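
The wchar_t size difference has a practical consequence for a converter: where wchar_t is 16 bits, code points above U+FFFF arrive as UTF-16 surrogate pairs, and the two halves have to be combined before encoding to UTF-8. A minimal sketch of that step, with made-up function names and fixed-width integer types standing in for the platform's wchar_t:

```cpp
#include <cstdint>

// True if 'u' is a UTF-16 high (leading) surrogate.
inline bool isHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if 'u' is a UTF-16 low (trailing) surrogate.
inline bool isLowSurrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a high/low pair into a code point in U+10000..U+10FFFF.
// The caller is assumed to have checked both halves with the helpers above.
inline std::uint32_t combineSurrogates(std::uint16_t hi, std::uint16_t lo) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

For example, the pair 0xD83D, 0xDE00 combines to U+1F600. With a 32-bit wchar_t no pairing is needed, but a lone surrogate value in the input is still not a valid character.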

Anyway, I have run into this conversion problem a few times in the past and avoided it. So now I'd like to know how to do the conversion the "right" way, in pure C++.

AceofSpades19 · 01-31-2009, 12:57 PM

Quote:

Originally Posted by ErV

I'm not sure what exactly you are talking about. Please, explain/elaborate.

As in the code that you posted is in a really small font
As you can see in this screenshot

ErV · 01-31-2009, 03:09 PM

Quote:

Originally Posted by AceofSpades19

As in the code that you posted is in a really small font
As you can see in this screenshot

Because it is optional information, the forum doesn't have "CUT" tags, and it is easy to understand what I'm trying to do from the first sentence.

So, any ideas?

tuxdev · 01-31-2009, 04:26 PM

utf8_codecvt_facet should work for any stream. The standard library does provide a wide string stream: std::wstringstream is a typedef of the std::basic_stringstream template (and likewise for its friends).

ErV · 02-01-2009, 12:19 AM

Quote:

Originally Posted by tuxdev

utf8_codecvt_facet should work for any stream. The standard library does provide a wide string stream: std::wstringstream is a typedef of the std::basic_stringstream template (and likewise for its friends).

As I said, I'd like to do the conversion in pure C++, without external libraries (I also had some trouble ripping utf8_codecvt_facet out of boost). C has a mechanism for that (setlocale + wcstombs), so it would be strange if standard C++ didn't allow it.
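
For comparison, the C route (setlocale + wcstombs) looks roughly like this. narrowWithCLocale is a made-up name. The function converts using whatever C locale is currently active, so it only produces UTF-8 if the caller has already selected a UTF-8 locale with setlocale; which locale names exist is system-specific.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts 'src' to a multibyte string in the
// current C locale. Returns false if some character has no representation
// in that locale. Conversion stops at the first embedded L'\0' in 'src'.
bool narrowWithCLocale(const std::wstring& src, std::string& dest) {
    // Worst case: every wide character expands to MB_CUR_MAX bytes,
    // plus one byte for the terminating '\0'.
    std::vector<char> buf(src.size() * MB_CUR_MAX + 1);
    const std::size_t written = std::wcstombs(&buf[0], src.c_str(), buf.size());
    if (written == static_cast<std::size_t>(-1))
        return false;
    dest.assign(&buf[0], written);
    return true;
}
```

Calling std::setlocale(LC_ALL, "") beforehand picks up the locale from the user's environment. That call changes global state for the whole program, which is the side effect being avoided in this thread.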

ErV · 02-14-2009, 04:49 PM

I suppose that this is either not possible in "pure C++" or no one here knows how to do it.
Therefore, the question is closed.

phorgan1 · 02-25-2010, 02:10 PM

Quote:

Originally Posted by ErV

I spent some time looking for solution, found following:
1) utf8_codecvt_facet. Looks good, but was meant to be used in wifstreams/wofstreams, so it might be difficult to use this thing for conversion between utf8<->wstring.

You could typedef a wide string stream, make a locale with this code conversion facet, and imbue it into the stream; then just reading from the stream would do the conversion for you. N.B. The boost utf-8 conversion facet doesn't follow the Unicode spec and leaves you open to security problems with alternate overlong encodings (as does your implementation). From the Unicode Standard, Version 5.2:

Code:

            Table 3-7. Well-Formed UTF-8 Byte Sequences
    Code Points        First Byte Second Byte Third Byte Fourth Byte
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    U+0000..U+007F     00..7F
    U+0080..U+07FF     C2..DF     80..BF
    U+0800..U+0FFF     E0         A0..BF      80..BF
    U+1000..U+CFFF     E1..EC     80..BF      80..BF
    U+D000..U+D7FF     ED         80..9F      80..BF
    U+E000..U+FFFF     EE..EF     80..BF      80..BF
    U+10000..U+3FFFF   F0         90..BF      80..BF     80..BF
    U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
    U+100000..U+10FFFF F4         80..8F      80..BF     80..BF

If you follow these, you don't have problems with non-shortest forms or UTF-16 surrogates. If you don't, you are insecure.
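
The table can be turned into a validity check fairly directly. The following is only a sketch, with isWellFormedUtf8 as a made-up name; it reports whether a byte string is well-formed and leaves decoding to other code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if 's' contains only byte sequences that
// appear in Table 3-7 (no overlong forms, no surrogates, max U+10FFFF).
bool isWellFormedUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra;        // continuation bytes expected after b0
        unsigned char lo = 0x80;  // allowed range for the second byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F)                    { extra = 0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { extra = 1; }
        else if (b0 == 0xE0)               { extra = 2; lo = 0xA0; } // rejects overlong 3-byte forms
        else if (b0 == 0xED)               { extra = 2; hi = 0x9F; } // rejects UTF-16 surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) { extra = 2; }
        else if (b0 == 0xF0)               { extra = 3; lo = 0x90; } // rejects overlong 4-byte forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { extra = 3; }
        else if (b0 == 0xF4)               { extra = 3; hi = 0x8F; } // rejects values above U+10FFFF
        else return false;  // 80..C1 and F5..FF never start a sequence

        if (n - i - 1 < extra)
            return false;   // sequence is cut off by the end of the string

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi)
                return false;
            lo = 0x80;      // only the second byte has a narrowed range
            hi = 0xBF;
        }
        i += extra + 1;
    }
    return true;
}
```

The overlong encoding "\xC0\x80" and the surrogate encoding "\xED\xA0\x80" are both rejected by this check, while "\xD1\x84" (ф) passes.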