LinuxQuestions.org
Old 01-30-2009, 02:15 PM   #1
ErV
Senior Member
 
Registered: Mar 2007
Location: Russia
Distribution: Slackware 12.2
Posts: 1,202
Blog Entries: 3

Rep: Reputation: 62
wstring <-> utf8 conversion in pure C++


Hello.

I'm looking for a utf8<->wstring conversion routine implemented in pure C++ without external libraries. The utf8 data is stored in std::string.


Basically, I want to do this:
Code:
#include <iostream>
#include <string>
#include <locale>
int main(int argc, char** argv){
	std::wstring ws(L"фыва");
	std::string s("asdf");
	std::cout << ws << s << std::endl;
	return 0;
}
or this
Code:
#include <iostream>
#include <string>
#include <locale>
int main(int argc, char** argv){
	std::wstring ws(L"фыва");
	std::string s("asdf");
	std::wcout << ws << s << std::endl;
	return 0;
}
for any istream/ostream/wistream/wostream-based class (notice that both string and wstring are used; normally this gives a compiler error). Obviously, to do this I'll need to overload operator<< for ostream/wostream, and for that I'll need a conversion routine.


I spent some time looking for a solution and found the following:

1) utf8_codecvt_facet. Looks good, but it was meant to be used with wifstreams/wofstreams, so it might be difficult to use for conversion between utf8<->wstring.
2) libiconv. Too heavy for a lightweight project, and it requires opening the source code, so I can't just take one routine from it. I also need only utf8<->wstring conversion, so it looks like overkill.
3) This code:
Code:
#include <locale>
#include <iostream>
#include <string>
#include <sstream>
using namespace std ;
       
wstring widen( const string& str )
{
      wostringstream wstm ;
      wstm.imbue(std::locale("en_US.UTF-8"));
      const ctype<wchar_t>& ctfacet =
      use_facet< ctype<wchar_t> >( wstm.getloc() ) ;
      for( size_t i=0 ; i<str.size() ; ++i )
      wstm << ctfacet.widen( str[i] ) ;
      return wstm.str() ;
}
       
string narrow( const wstring& str )
{
      ostringstream stm ;
      stm.imbue(std::locale("en_US"));
      const ctype<char>& ctfacet =
      use_facet< ctype<char> >( stm.getloc() ) ;
      for( size_t i=0 ; i<str.size() ; ++i )
      stm << ctfacet.narrow( str[i], 0 ) ;
      return stm.str() ;
}
       
int main()
{
      {
      const char* cstr = "abcdefghijkl" ;
      const wchar_t* wcstr = widen(cstr).c_str() ;
      wcout << wcstr << L'\n' ;
      }
      {
      const wchar_t* wcstr = L"mnopqrstuvwx" ;
      const char* cstr = narrow(wcstr).c_str() ;
      cout << cstr << '\n' ;
      }
 }
It is the closest to what I've been looking for, but apparently it doesn't work: narrow(widen(std::string("фыва"))) on a machine with a utf8 locale doesn't return "фыва", and I don't see the second string ("mnopqrstuvwx") in the terminal on my machine.
4) mbstowcs. It would work, but I'd like to avoid C functions and changing the C locale in this project.

But I found no "pure lightweight C++ solution without external libraries".
Right now I'm writing my own conversion routine using this specification, but I'd like to know whether there is a standard way to convert between utf8 and wstring in pure C++, and I suspect I missed something. Any ideas?
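
To illustrate why the byte-by-byte widen() in option 3 cannot round-trip "фыва": in UTF-8 each of those four letters takes two bytes, so widening every char separately produces eight wide characters instead of four. A minimal sketch follows; the helper name countCodePoints is made up for this example, and it assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes well-formed input; it performs no validation.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With s holding the UTF-8 bytes of "фыва" (d1 84 d1 8b d0 b2 d0 b0), s.size() is 8 while countCodePoints(s) is 4, which is the mismatch a per-byte widen() cannot bridge.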

Thanks for your time.

Last edited by ErV; 01-30-2009 at 02:17 PM.
 
Old 01-31-2009, 01:41 AM   #2
ErV
Senior Member
 
Registered: Mar 2007
Location: Russia
Distribution: Slackware 12.2
Posts: 1,202
Blog Entries: 3

Original Poster
Rep: Reputation: 62
Here is the utf8 <-> wstring conversion routine I wrote.
Str.h:
Code:
#ifndef STR_H
#define STR_H
#include <string>
#include <iostream>

typedef std::string Str;
typedef std::wstring WStr;

std::ostream& operator<<(std::ostream& f, const WStr& s);
std::istream& operator>>(std::istream& f, WStr& s);
void utf8toWStr(WStr& dest, const Str& src);
void wstrToUtf8(Str& dest, const WStr& src);

#endif
Str.cpp:
Code:
#include "Str.h"
#ifdef UTF8TEST
#include <stdio.h>
#endif

void utf8toWStr(WStr& dest, const Str& src){
	dest.clear();
	wchar_t w = 0;
	int bytes = 0;
	wchar_t err = L'�';
	for (size_t i = 0; i < src.size(); i++){
		unsigned char c = (unsigned char)src[i];
		if (c <= 0x7f){//first byte
			if (bytes){
				dest.push_back(err);
				bytes = 0;
			}
			dest.push_back((wchar_t)c);
		}
		else if (c <= 0xbf){//second/third/etc byte
			if (bytes){
				w = ((w << 6)|(c & 0x3f));
				bytes--;
				if (bytes == 0)
					dest.push_back(w);
			}
			else
				dest.push_back(err);
		}
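		else if (c <= 0xdf){//2byte sequence start
			if (bytes)//previous sequence was cut short
				dest.push_back(err);
			bytes = 1;
			w = c & 0x1f;
		}
		else if (c <= 0xef){//3byte sequence start
			if (bytes)//previous sequence was cut short
				dest.push_back(err);
			bytes = 2;
			w = c & 0x0f;
		}
		else if (c <= 0xf7){//4byte sequence start
			if (bytes)//previous sequence was cut short
				dest.push_back(err);
			bytes = 3;
			w = c & 0x07;
		}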
		else{
			dest.push_back(err);
			bytes = 0;
		}
	}
	if (bytes)
		dest.push_back(err);
}

void wstrToUtf8(Str& dest, const WStr& src){
	dest.clear();
	for (size_t i = 0; i < src.size(); i++){
		wchar_t w = src[i];
		if (w <= 0x7f)
			dest.push_back((char)w);
		else if (w <= 0x7ff){
			dest.push_back(0xc0 | ((w >> 6)& 0x1f));
			dest.push_back(0x80| (w & 0x3f));
		}
		else if (w <= 0xffff){
			dest.push_back(0xe0 | ((w >> 12)& 0x0f));
			dest.push_back(0x80| ((w >> 6) & 0x3f));
			dest.push_back(0x80| (w & 0x3f));
		}
		else if (w <= 0x10ffff){
			dest.push_back(0xf0 | ((w >> 18)& 0x07));
			dest.push_back(0x80| ((w >> 12) & 0x3f));
			dest.push_back(0x80| ((w >> 6) & 0x3f));
			dest.push_back(0x80| (w & 0x3f));
		}
		else
			dest.push_back('?');
	}
}

Str wstrToUtf8(const WStr& str){
	Str result;
	wstrToUtf8(result, str);
	return result;
}

WStr utf8toWStr(const Str& str){
	WStr result;
	utf8toWStr(result, str);
	return result;
}

std::ostream& operator<<(std::ostream& f, const WStr& s){
	Str s1;
	wstrToUtf8(s1, s);
	f << s1;
	return f;
}

std::istream& operator>>(std::istream& f, WStr& s){
	Str s1;
	f >> s1;
	utf8toWStr(s, s1);
	return f;
}

#ifdef UTF8TEST
bool utf8test(){
	WStr w1;
	//for (wchar_t c = 1; c <= 0x10ffff; c++){
	for (wchar_t c = 0x100000; c <= 0x100002; c++){
		w1 += c;	
	}
	Str s = wstrToUtf8(w1);
	WStr w2 = utf8toWStr(s);
	bool result = true;
	if (w1.length() != w2.length()){
		printf("length differs\n");
		//std::cout << "length differs" << std::endl;
		result = false;
	}
	
	printf("w1: %S\ns: %s\nw2: %S\n", w1.c_str(), s.c_str(), w2.c_str());
	
	for (size_t i = 0; i < w1.size(); i++)
		if (w1[i] != w2[i]){
			result = false;
			printf("character at pos %lu differs (expected %.8lx got %.8lx)\n", (unsigned long)i, (unsigned long)w1[i], (unsigned long)w2[i]);
			//std::cout << "character at pos " << i  << " differs" << std::endl;
			break;
		}
		
	if (!result){
		printf("utf8 dump: \n");
		for (size_t i = 0; i < s.size(); i++)
			printf("%2x ", (unsigned char)s[i]);
	}
	
	return result;
}

int main(int argc, char** argv){
	std::wstring ws(L"фыва");
	std::string s("фыва");
	std::cout << ws << s << std::endl;
	std::cout << wstrToUtf8(utf8toWStr("фыва")) << std::endl;
	if (utf8test())
		std::cout << "utf8Test successful" << std::endl;
	else 
		std::cout << "utf8Test failed" << std::endl;
	return 0;
}
#endif
So, is there any way to do this in standard C++ without writing the conversion routine yourself (as I did)?

Last edited by ErV; 01-31-2009 at 03:56 AM.
 
Old 01-31-2009, 02:41 AM   #3
AceofSpades19
Senior Member
 
Registered: Feb 2007
Location: Chilliwack,BC.Canada
Distribution: Slackware64 -current
Posts: 2,079

Rep: Reputation: 58
Is there any particular reason you have the code at size 1?
 
Old 01-31-2009, 03:53 AM   #4
ErV
Senior Member
 
Registered: Mar 2007
Location: Russia
Distribution: Slackware 12.2
Posts: 1,202
Blog Entries: 3

Original Poster
Rep: Reputation: 62
Quote:
Originally Posted by AceofSpades19
Is there any particular reason you have the code at size 1?
I'm not sure what exactly you are talking about. Please explain/elaborate. If you were asking why I'm not using wide character streams instead of narrow character streams, it is because I have code that uses a mixture of wstring and string classes. Moving everything to wide streams is not possible, because certain calls require const char*, and some configuration files need to be 8-bit compatible, so I'll run into utf8<->wchar_t conversion anyway; it simply can't be avoided. Also, storing data as utf8 in external files is more compact and less platform-dependent (for example, on Windows wchar_t might be 2 bytes long, while on Linux it might be 4 bytes long), even when you use wchar_t-based strings internally.
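
The wchar_t size difference matters for code points above U+FFFF: with a 4-byte wchar_t one element is enough, while a 2-byte wchar_t needs a UTF-16 surrogate pair. A minimal sketch of handling both cases; the helper name appendCodePoint is hypothetical, and it assumes the caller passes a valid code point (at most 0x10FFFF, not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: appends one Unicode code point to a wide string.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
void appendCodePoint(std::wstring& dest, unsigned long cp) {
    if (sizeof(wchar_t) >= 4 || cp < 0x10000UL) {
        // The value fits into a single wchar_t.
        dest.push_back(static_cast<wchar_t>(cp));
    } else {
        // 16-bit wchar_t: split the value into a UTF-16 surrogate pair.
        cp -= 0x10000UL;
        dest.push_back(static_cast<wchar_t>(0xD800UL + (cp >> 10)));
        dest.push_back(static_cast<wchar_t>(0xDC00UL + (cp & 0x3FFUL)));
    }
}
```

For example, U+1F600 becomes one element where wchar_t is 4 bytes, and the pair D83D DE00 where it is 2 bytes.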

Anyway, I've run into this conversion problem a few times in the past and avoided it. So now I'd like to know how to do the conversion the "right" way, in pure C++.

Last edited by ErV; 01-31-2009 at 03:57 AM.
 
Old 01-31-2009, 12:57 PM   #5
AceofSpades19
Senior Member
 
Registered: Feb 2007
Location: Chilliwack,BC.Canada
Distribution: Slackware64 -current
Posts: 2,079

Rep: Reputation: 58
Quote:
Originally Posted by ErV
I'm not sure what exactly you are talking about. Please explain/elaborate.
As in, the code that you posted is in a really small font,
as you can see in this screenshot.
 
Old 01-31-2009, 03:09 PM   #6
ErV
Senior Member
 
Registered: Mar 2007
Location: Russia
Distribution: Slackware 12.2
Posts: 1,202
Blog Entries: 3

Original Poster
Rep: Reputation: 62
Quote:
Originally Posted by AceofSpades19
As in, the code that you posted is in a really small font,
as you can see in this screenshot.
Because it is optional information, the forum doesn't have "CUT" tags, and it is easy to understand what I'm trying to do from the first sentence.

So, any ideas?

Last edited by ErV; 01-31-2009 at 03:13 PM.
 
Old 01-31-2009, 04:26 PM   #7
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,014

Rep: Reputation: 115
utf8_codecvt_facet should work for any stream. STL doesn't provide a wide stringstream directly, but you can typedef it from the std::basic_stringstream (and friends) template.
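
A code conversion facet can also be called directly through its in() member, with no stream involved (in the standard library it is the file stream buffers that invoke the facet automatically). The following is only a sketch: decodeWithFacet is a made-up name, and it uses the std::codecvt<wchar_t, char, std::mbstate_t> facet that every std::locale carries. Which encoding that facet decodes depends on the locale passed in; getting UTF-8 would need something like std::locale("en_US.UTF-8"), whose name and availability are platform-dependent assumptions.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes `src` using the codecvt facet of `loc`.
// The encoding of `src` is whatever `loc` defines, not necessarily UTF-8.
std::wstring decodeWithFacet(const std::string& src, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(loc);

    // Assumption: the output never needs more wide characters than input bytes.
    std::wstring dest(src.size(), L'\0');
    if (src.empty())
        return dest;

    std::mbstate_t state = std::mbstate_t();
    const char* fromNext = 0;
    wchar_t* toNext = 0;
    Facet::result r = cvt.in(state,
                             src.data(), src.data() + src.size(), fromNext,
                             &dest[0], &dest[0] + dest.size(), toNext);
    if (r != Facet::ok)
        throw std::runtime_error("conversion failed or input was truncated");

    // Trim the buffer to the number of wide characters actually written.
    dest.resize(static_cast<std::size_t>(toNext - &dest[0]));
    return dest;
}
```

The out() member works the same way in the opposite direction.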
 
Old 02-01-2009, 12:19 AM   #8
ErV
Senior Member
 
Registered: Mar 2007
Location: Russia
Distribution: Slackware 12.2
Posts: 1,202
Blog Entries: 3

Original Poster
Rep: Reputation: 62
Quote:
Originally Posted by tuxdev
utf8_codecvt_facet should work for any stream. STL doesn't provide a wide stringstream directly, but you can typedef it from the std::basic_stringstream (and friends) template.
As I said, I'd like to do the conversion in pure C++, without external libraries (I also had some trouble ripping utf8_codecvt_facet out of Boost). C has a mechanism for that (setlocale + wcstombs), so it would be strange if standard C++ didn't allow it.
 
Old 02-14-2009, 04:49 PM   #9
ErV
Senior Member
 
Registered: Mar 2007
Location: Russia
Distribution: Slackware 12.2
Posts: 1,202
Blog Entries: 3

Original Poster
Rep: Reputation: 62
I suppose that this is either not possible in "pure C++" or no one here knows how to do it.
Therefore, the question is closed.
 
Old 02-25-2010, 02:10 PM   #10
phorgan1
LQ Newbie
 
Registered: May 2008
Posts: 9

Rep: Reputation: 0
code conversion facet works, but make sure it meets the spec

Quote:
Originally Posted by ErV
I spent some time looking for solution, found following:
1) utf8_codecvt_facet. Looks good, but was meant to be used in wifstreams/wofstreams, so it might be difficult to use this thing for conversion between utf8<->wstring.
You could typedef a wide string stream, make a locale with this code conversion facet, and imbue it into the stream; then just reading the stream would do the conversion for you. N.B. The Boost UTF-8 conversion facet doesn't follow the Unicode spec and leaves you open to security problems with alternate overly long encodings (as does your implementation in post #2). From the Unicode Standard, Version 5.2:

Code:
            Table 3-7. Well-Formed UTF-8 Byte Sequences
    Code Points        First Byte Second Byte Third Byte Fourth Byte
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    U+0000..U+007F     00..7F
    U+0080..U+07FF     C2..DF     80..BF
    U+0800..U+0FFF     E0         A0..BF      80..BF
    U+1000..U+CFFF     E1..EC     80..BF      80..BF
    U+D000..U+D7FF     ED         80..9F      80..BF
    U+E000..U+FFFF     EE..EF     80..BF      80..BF
    U+10000..U+3FFFF   F0         90..BF      80..BF     80..BF
    U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
    U+100000..U+10FFFF F4         80..8F      80..BF     80..BF

If you follow these, you don't have problems with non-shortest forms or UTF-16 surrogates. If you don't, you are insecure.
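
The table translates almost line by line into a checker. Here is a minimal sketch of that idea; the function name isWellFormedUtf8 is made up for this example, and it only validates, it does not decode.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8
// according to Table 3-7 above.
bool isWellFormedUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t trail = 0;     // number of continuation bytes expected
        unsigned char lo = 0x80;   // allowed range of the second byte
        unsigned char hi = 0xBF;
        if (c <= 0x7F) { ++i; continue; }
        else if (c >= 0xC2 && c <= 0xDF) trail = 1;
        else if (c == 0xE0) { trail = 2; lo = 0xA0; }  // no overlong 3-byte forms
        else if (c == 0xED) { trail = 2; hi = 0x9F; }  // no UTF-16 surrogates
        else if (c >= 0xE1 && c <= 0xEF) trail = 2;
        else if (c == 0xF0) { trail = 3; lo = 0x90; }  // no overlong 4-byte forms
        else if (c >= 0xF1 && c <= 0xF3) trail = 3;
        else if (c == 0xF4) { trail = 3; hi = 0x8F; }  // nothing above U+10FFFF
        else return false;         // 80..C1 and F5..FF never start a sequence

        if (n - i <= trail) return false;              // sequence is truncated
        unsigned char c2 = static_cast<unsigned char>(s[i + 1]);
        if (c2 < lo || c2 > hi) return false;
        for (std::size_t k = 2; k <= trail; ++k) {
            unsigned char ck = static_cast<unsigned char>(s[i + k]);
            if (ck < 0x80 || ck > 0xBF) return false;
        }
        i += trail + 1;
    }
    return true;
}
```

With this, the overlong form c0 af, the surrogate ed a0 80, and the out-of-range f4 90 80 80 are all rejected, while f4 8f bf bf (U+10FFFF) is accepted.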
 
  

