LinuxQuestions.org - Letters and chars in Linux

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Letters and chars in Linux (https://www.linuxquestions.org/questions/programming-9/letters-and-chars-in-linux-858856/)

nime	01-26-2011 02:20 PM

Letters and chars in Linux

Hello,
I am pretty new in C and in Linux so I need advice, suggestion or example regarding encoding/decoding strings in generally on Linux.

I have much of old binary data files mostly written in DOS/QB45. If I want to read that data on windows I should recode strings (replace certain characters) and then data looks good on windows. There this characters are between ASCII 127 and 255. It is word about non-english letters like "čćšđžČĆŠĐŽ" which is common in my country.
Every operation with such strings can count for 1 byte per letter.
So, conversion looks like:

Code:

        char* ps = myString;

        while(*ps != '\0'){

          switch ((unsigned char)*ps)

            {

        case 230: *ps=138;

        break;

        case 209: *ps=208;

        break;

        .

        .

        .

        etc...

But when I come to linux one different story waits for me :)
Here, different than english letters, seems that "my" characters have more than one byte.
Of course this is not good for my program which have searching and many other stuff organized like letter/byte principle.
And of course I am in trouble now :)

But this is behaviour present in many other languages and I ask for help experienced linux users and programmers - what to do here to be able to use my nice and beloved old data with my C programs in linux.

Data is organized in packed binary structures and written to file with (say) VB.

Thanx, nime.

Snark1994

01-26-2011 04:51 PM

Hm, sounds like you would benefit from reading this... My guess is that linux is using unicode to store your characters. I'm not sure what the best course of action would be - if it were me, I would probably try to change my programmes to deal with the multi-character encoding, or at least find a bit more about the issue (90% of my knowledge on the subject came from reading that article when trying to fix a character encoding issue on a website). However, this may not be the most time-effective way of doing things. What exactly are you "converting" it to? You say you need to re-encode characters between 129 and 255, but in your programme you change 209 to 208...

nime	01-26-2011 05:18 PM

Hello Snark,
I have old data which I cannot change cause it is at different places and because it has "a much" of those data. Originally, this data was written in DOS with my code page which is 852.
Part of converting program I showed recode characters readable in DOS so they can be readable in Windows and this works OK. When I work in windows I always recode data back so it can be readable from DOS too. At this way I have same data for two different OS-es.
My idea was to involve Linux in this story, that's mainly why I now learn "C" language.
In Dos and windows those specific characters lies in range between 127 and 255 but not in same place. This applyes only for my country specific letters "čćšđžČĆŠĐŽ". Latin characters are always at same place.

Now, I have another interesting questions:
How Linux prints to matrix devices like printers and so if they uses unicode which is suitable only for graphical systems?
How linux then interact with microcontrollers?

I didn't find yet interesting stuff at Oracle.
But I have some idea and because I am C beginner I don't know if this is this possible.
For example I can compose string with single byte and two byte characters to read/change them in Linux (like GTK or similar textbox) but can I after recognize (by looping) which letter in string is 2-byte so I can recode back to DOS single byte matrix before I save to disk?
Recoding is necessary because data assumes strings with explicit size (fixed length strings) and other technical requirements in such system stands on this fact.

What do you think?

eSelix

01-26-2011 06:12 PM

Well, if you read data from file and save single bytes then in file are stored single bytes, meaningless of a Linux or Windows, data in files are the same. I think that you have problem with representation of this new file on the screen. Can you say how do you known that:

Quote:

Here, different than english letters, seems that "my" characters have more than one byte.

You read this file on the console with cat or editor, or with graphical editor?

nime	01-26-2011 06:49 PM

Hi eSelix,
I try like this:

Code:

char* str = "Marco";

printf("%d",strlen(str));



returns 5

and this

Code:

char* str = "Mačka";

printf("%d",strlen(str));



returns 6

so "č" letter can be two byte letter, I suppose.
I prints data to console with printf and all those letters are represented with black questionmark.

I do additional tests.
I looping through all ascii from 32 to 255 and prints them to console and letters over 127 are also only black "?".
And more. I save text file in windows (notepad/plain text) with those data and in linux this is displayed without my characters.
I can see those file properly after converting in geany or ooWriter.
When I open this file in geany without conversion instead of those characters there are boxes with four (hex) letters.
But please note, I want to convert my strings through C program, not whole text file.
After all, my data is in binary file where text and numbers are mixed and "packed".

eSelix

01-26-2011 07:00 PM

Editor in which you wrote this code has set unicode encoding. So the "Mačka" string is encoded in your program that way. Change to iso/ascii in editor options. If you don't now how, say which editor are you using.

As varies countries use varies charsets then we have mess with strings now. It is difficult to guess by editor which charset is used in opened file, so the user must do a conversion by hand. But UTF-8 was created as remedy for this situation.

nime	01-27-2011 12:29 AM

eSelix,
No need to guessing, just read what I wrote.

1) I don't want to convert text (or other) file but strings in my C program.
2) I don't use editor but C program in windows.

I can read those data from windows in linux but those letters are not displayed.
Is here any way that my program can use this data that I do conversion directly in my program just for strings which I need, not to whole file?
Or like this, can I force linux that my program uses codepage and character sets same like windows?

Same problem should be with german, french, spanish and many other languages so I belive somebody faces this problem long before me.

Snark1994

01-27-2011 09:55 AM

Quote:

Originally Posted by nime (Post 4239136)

2) I don't use editor but C program in windows.

He means the editor you used to write your C program. That has to store the string as well, so he is suggesting changing the setting in your editor :)

nime	01-27-2011 12:39 PM

Ah so, thanks Snarky,
I do this already (set to UTF-8) my C editor (Code:Blocks) and they works well and proper on both win and linux. But this is other kind of problem.

I read today more sources for this problem and it seems that may of people has it.
Higher level programming tools have included those conversions but C have got gnulib, locale and internationalization concept which I have to study.

Anyway, If somebody have advice or experience with this please share!

Nominal Animal

01-27-2011 04:23 PM

n Linux, there is no need to modify your old programs. You can adjust your environment to use the same character encoding, IBM 852, as your old data files use.

If your old program is a command-line program (and does not use e.g. ncurses), use iconv to convert the output to readable form:

Code:

./my-program ... | iconv -f IBM852

If your program is interactive, and have a recent version of luit (for IBM 852 code page support), then

Code:

luit -encoding ibm-cp852 ./my-program ...

If you write your own program for Linux, you can use the iconv_open, iconv, and iconv_close functions to set up an easy conversion for your strings, between UTF-8 and IBM852. It is a very nice and easy API to use. It is also part of glibc and standard locales, and should be available in all Linux distributions.

Example iconv C helper code follows. ibm852_to_utf8() and utf8_to_ibm852() work just like strdup(), returning a duplicate of the given string, except they do the conversion too. Remember to free() any such strings after you're done with them. Also, call iconv_done() when you don't need the conversion (at least for a while), to free any internal tables used by iconv.
It would be better if the code resized the newly created strings when necessary, instead of just allocating them heuristically. Or even better, if it used string pools (like Apache memory pools), so that you could discard all related strings at once, and not worry about remembering to free each and every one.

Code:

#include <stdlib.h>

#include <string.h>

#include <iconv.h>

#include <errno.h>

#include <stdio.h>



static iconv_t  iconv_ibm852_to_utf8 = (iconv_t)-1;

static iconv_t  iconv_utf8_to_ibm852 = (iconv_t)-1;



void iconv_done(void)

{

    if (iconv_ibm852_to_utf8 != (iconv_t)-1)

        iconv_close(iconv_ibm852_to_utf8);



    if (iconv_utf8_to_ibm852 != (iconv_t)-1)

        iconv_close(iconv_utf8_to_ibm852);



    iconv_ibm852_to_utf8 = (iconv_t)-1;

    iconv_utf8_to_ibm852 = (iconv_t)-1;

}



/*

 * These functions will return a dynamically allocated copy, just like strdup().

 * Remember to free() them after use.

*/



char *ibm852_to_utf8(char *ibm852)

{

    char  *in = ibm852;

    char  *out, *end;

    size_t  in_left = (in) ? strlen(in) : 0;

    size_t  out_size = 3 * in_left;

    size_t  out_left = out_size;

    size_t  converted;



    if (!ibm852) {

        errno = EINVAL;

        return NULL;

    }



    if (iconv_ibm852_to_utf8 == (iconv_t)-1) {

        iconv_ibm852_to_utf8 = iconv_open("UTF-8", "IBM852");

        if (iconv_ibm852_to_utf8 == (iconv_t)-1)

            return NULL;

    }



    out = malloc(out_size + 1);

    if (!out) {

        errno = ENOMEM;

        return NULL;

    }



    end = out;

    converted = iconv(iconv_ibm852_to_utf8, &in, &in_left, &end, &out_left);

    if (converted == (size_t)-1) {

        const int saved_errno = errno;

        free(out);

        errno = saved_errno;

        return NULL;

    }

    *end = 0;



    return out;

}



char *utf8_to_ibm852(char *utf8)

{

    char  *in = utf8;

    char  *out, *end;

    size_t  in_left = (in) ? strlen(in) : 0;

    size_t  out_size = 2 * in_left;

    size_t  out_left = out_size;

    size_t  converted;



    if (!utf8) {

        errno = EINVAL;

        return NULL;

    }



    if (iconv_utf8_to_ibm852 == (iconv_t)-1) {

        iconv_utf8_to_ibm852 = iconv_open("IBM852//TRANSLIT", "UTF-8");

        if (iconv_utf8_to_ibm852 == (iconv_t)-1)

            return NULL;

    }



    out = malloc(out_size + 1);

    if (!out) {

        errno = ENOMEM;

        return NULL;

    }



    end = out;

    converted = iconv(iconv_utf8_to_ibm852, &in, &in_left, &end, &out_left);

    if (converted == (size_t)-1) {

        const int saved_errno = errno;

        free(out);

        errno = saved_errno;

        return NULL;

    }

    *end = 0;



    return out;

}



int main(int argc, char *argv[])

{

        int  arg;

        char *s1, *s2, *s3, *s4;



        for (arg = 1; arg < argc; arg++) {

                s1 = utf8_to_ibm852(argv[arg]);

                s2 = ibm852_to_utf8(argv[arg]);

                s3 = ibm852_to_utf8(s1);

                s4 = utf8_to_ibm852(s2);

                printf("s = \"%s\":\n", argv[arg]);

                printf("utf8_to_ibm852(s) = \"%s\"\n", s1 ? s1 : "<error>");

                printf("ibm852_to_utf8(s) = \"%s\"\n", s2 ? s2 : "<error>");

                printf("ibm852_to_utf8(utf8_to_ibm852(s)) = \"%s\"\n", s3 ? s3 : "<error>");

                printf("utf8_to_ibm852(ibm852_to_utf8(s)) = \"%s\"\n", s4 ? s4 : "<error>");

                if (s4) free(s4);

                if (s3) free(s3);

                if (s2) free(s2);

                if (s1) free(s1);

        }



        iconv_done();



        return 0;

}

I don't use VB or Windows, so I cannot really help you there.

Hope this helps,

Nominal Animal

nime	01-28-2011 12:34 AM

Thank you Nominal,

this sounds like way to go.
Also, thanks for all remarks, I will understand what do you talk about with a little practice then I will post back results.
So I see now this can go like this:

- load string from 852 file,
- duplicate them and convert to utf-8,
- do some job (change) with them,
- convert them to 852 back,
- write them to disk,
- free them from memory,

This is ideal scenario because same data without static conversion will be good for program in DOS and windows too which can all share same files.
However, this makes theoretically possible to have 3 different OS-es in LAN network which work with same data.

Thanks again Nominal, for windows and VB and DOS don't need help. I solved there those scenario long time ago.

Nominal Animal

01-30-2011 11:57 AM

Right, nime.

The only thing to watch for is if the user supplies UTF-8 characters (say, Kanji or something), which cannot be converted to IBM-852. The //TRANSLIT flag tells iconv to do the closest possible conversion, which is very good, but you may have to use better memory handling in case of the translitterations are much longer than the original characters. The example functions I gave are not very good, since they really should check if the conversion needs a larger buffer..

Also, you might consider using a configurable character set name (read it either from a a file, command-line parameter or an environment variable, defaulting to IBM852), and open iconv handles (disk_to_utf and utf_to_disk) early in your program. That way you can handle already converted files, too.

(Iconv uses the standard character set mappings in your locales' charmaps/ directory. If you have already done some partial static character conversions, you can copy charmaps/IBM852 to say charmaps/IBM852dosfix, and edit it (rename to IDBM852dosfix) to account for those static conversions -- the file has a simple textual format. Then just tell your program to use the IBM852dosfix charset instead, and you can work with those files painlessly too.

Happy to help,

Nominal Animal

nime	03-05-2011 04:44 AM

Hello again

I see we have here background conversation too what confirm additionally how this subject is interesting for developers. I am probably winner in trying to fix my own mistakes with partial success :) One I'm sure, If I wouldn't be a human I will made much less mistakes. This is not good because then I will not have to fix anything.
And most important is that all my mistakes becomes as result of hard work and positive thinking.

So, last month I try often to get conversion of codepages to work and I get it finally - yesterday, but only on windows, using iconv.dll.
But new situations appears (which I don't know to solve).
IBM852 to UTF8 is fine for filling GTK textboxes, printing to console (which is at 852) and so, but it is not good for writing to text file for show with notepad (without BOM). So, by example of Nominal I made additional conversions from IBM852 to CP1250 and back and now this is also fixed. Conversions works excellent! Properly and fast. What will be in Linux I will see later.
Calling conversions look's like this:

Code:

  char* entry_text;

  entry_text =(char*) gtk_entry_get_text(GTK_ENTRY(entry));

  printf("button_clicked search for %s\n", utf8_to_ibm852(entry_text));



OR



    char* et = utf8_to_ibm852((char*)entry_text);

    printf("Entry contents: %s\n", et);



OR



    char* tp;

    tp = utf8_to_ibm852(findstr);



    char* wp;

    wp = ibm852_to_cp1250(tp);

... and now I have 4 functions and need more (cp1250>UTF8 and back, and even more like 8859-2, etc...).
So, I think about to replace those functions with just one or two, and I try, but my C is not so good as my english ;)

I would like function like this:

Code:

    

    char* convCP(char* fromCP, char* toCP, char* instring);



//which could be called:

    

convstring = convCP("IBM852", "UTF-8, char* instring);

//OR

convstring = convCP("UTF-8", "CP1250, char* instring);

If it is not too much for ask, please someone who know C to make my function based on Nominals code.

(2)
And additional problem, for which I don't think before but is needed now is:
Main purpose of storing strings to file in old DOS 852 format is direct addressing (one letter-one byte) and ability to use simple C functions for searching and manipulating data like strcmp, strstr, strcpy and so.
But...
... problem is, after I convert GTK text to IBM852 I have to convert my new 852 text to uppercase! But my system (CP1250) dont know what is uppercase in CP852 when we talk about letters like "čćžšđČĆŽŠĐ".

What to do here.
Did anybody faces with this problem before?

Nominal Animal

03-05-2011 09:26 AM

Quote:

Originally Posted by nime (Post 4279697)

And most important is that all my mistakes becomes as result of hard work and positive thinking.

Exactly! You have a very good attitude, nime.

Quote:

Originally Posted by nime (Post 4279697)

now I have 4 functions and need more (cp1250>UTF8 and back, and even more like 8859-2, etc...).
So, I think about to replace those functions with just one or two, and I try, but my C is not so good as my english ;)

The interface you showed,

Quote:

Originally Posted by nime (Post 4279697)

Code:

char* convCP(char* fromCP, char* toCP, char* instring);

is a bit difficult: you need to choose between efficiency and code complexity. You see, if you create a new conversion handle for each string conversion, you incur quite a high overhead in iconv_open, slowing things down. You can avoid that by caching a number of conversion handles, but that makes the code a lot more complex.

I'd recommend you use a bit different interface, where you create the conversion first like this:

Code:

iconv_t  utf8_to_ibm852 = conversion("UTF-8",      "IBM852//TRANSLIT");

iconv_t  utf8_to_iso2  = conversion("UTF-8",      "ISO-8859-2//TRANSLIT");

iconv_t  ibm852_to_utf8 = conversion("IBM852",    "UTF-8");

iconv_t  ibm852_to_iso2 = conversion("IBM852",    "ISO-8859-2//TRANSLIT");

iconv_t  iso2_to_utf8  = conversion("ISO-8859-2", "UTF-8");

iconv_t  iso2_to_ibm852 = conversion("ISO-8859-2", "IBM852//TRANSLIT");

where the //TRANSLIT means transliterate unsupported characters to nearest equivalents. You use the conversion handles with a convert function like this:

Code:

char *new_ibm852_string = convert(utf8_to_ibm852, old_utf8_string);

which keeps the old string intact, returning a new dynamically allocated string.

Would this work for you? I'd be happy to show you the source for conversion() and convert().

Quote:

Originally Posted by nime (Post 4279697)

... problem is, after I convert GTK text to IBM852 I have to convert my new 852 text to uppercase! But my system (CP1250) dont know what is uppercase in CP852 when we talk about letters like "čćžšđČĆŽŠĐ".

Normally I'd recommend using setlocale() and toupper()/tolower(), but since every glyph in IBM852 is just one byte, you can do the conversion trivially with a table. For example,

Code:

#include <stdlib.h>



const unsigned char uppercase_ibm852[256] = {

        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,

        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,

        0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,

        0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,

        0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,

        0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,

        0x60, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,

        0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f,

        0x80, 0x9a, 0x90, 0xb6, 0x8e, 0xde, 0x8f, 0x80, 0x9d, 0xd3, 0x8a, 0x8a, 0xd7, 0x8d, 0x8e, 0x8f,

        0x90, 0x91, 0x91, 0xe2, 0x99, 0x95, 0x95, 0x97, 0x97, 0x99, 0x9a, 0x9b, 0x9b, 0x9d, 0x9e, 0xac,

        0xb5, 0xd6, 0xe0, 0xe9, 0xa4, 0xa4, 0xa6, 0xa6, 0xa8, 0xa8, 0xaa, 0x8d, 0xac, 0xb8, 0xae, 0xaf,

        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbd, 0xbf,

        0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc6, 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf,

        0xd1, 0xd1, 0xd2, 0xd3, 0xd2, 0xd5, 0xd6, 0xd7, 0xb7, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf,

        0xe0, 0xe1, 0xe2, 0xe3, 0xe3, 0xd5, 0xe6, 0xe6, 0xe8, 0xe9, 0xe8, 0xeb, 0xed, 0xed, 0xdd, 0xef,

        0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0xfa, 0xeb, 0xfc, 0xfc, 0xfe, 0xff

};



const unsigned char lowercase_ibm852[256] = {

        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,

        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,

        0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,

        0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,

        0x40, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,

        0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,

        0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,

        0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f,

        0x87, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8b, 0x8b, 0x8c, 0xab, 0x84, 0x86,

        0x82, 0x92, 0x92, 0x93, 0x94, 0x96, 0x96, 0x98, 0x98, 0x94, 0x81, 0x9c, 0x9c, 0x88, 0x9e, 0x9f,

        0xa0, 0xa1, 0xa2, 0xa3, 0xa5, 0xa5, 0xa7, 0xa7, 0xa9, 0xa9, 0xaa, 0xab, 0x9f, 0xad, 0xae, 0xaf,

        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xa0, 0x83, 0xd8, 0xad, 0xb9, 0xba, 0xbb, 0xbc, 0xbe, 0xbe, 0xbf,

        0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc7, 0xc7, 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf,

        0xd0, 0xd0, 0xd4, 0x89, 0xd4, 0xe5, 0xa1, 0x8c, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xee, 0x85, 0xdf,

        0xa2, 0xe1, 0x93, 0xe4, 0xe4, 0xe5, 0xe7, 0xe7, 0xea, 0xa3, 0xea, 0xfb, 0xec, 0xec, 0xee, 0xef,

        0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0xfa, 0xfb, 0xfd, 0xfd, 0xfe, 0xff

};



static inline char *toupper_ibm852(char *const string)

{

    size_t  i;



    if (string)

        for (i = 0; string[i]; i++)

            string[i] = uppercase_ibm852[(unsigned char)string[i]];



    return string;

}



static inline char *tolower_ibm852(char *const string)

{

    size_t  i;



    if (string)

        for (i = 0; string[i]; i++)

            string[i] = lowercase_ibm852[(unsigned char)string[i]];



    return string;

}

Note that the functions toupper_ibm852() and tolower_ibm852() modify the string in place.
If you want to keep the original string intact, use

Code:

    new_upper = toupper_ibm852(strdup(old_ibm852_string));

    new_lower = tolower_ibm852(strdup(old_ibm852_string));

nime	03-05-2011 11:47 AM

Hello Nominal :)
I'm glad to "see" you again with your's impressive knowledge of clean C and good will to help others!!

You helped me to fix huge problem with encodings and thank you very much for that. Now I have 3 different encodings in my program but everyone is needed. I expect ISO-8859-2 will be needed when I would write (or read) HTML reports and who know what when I take PDF's. On principle you show me I have no doubt that I can respond to any further de/coding needs even with my poor (beginners) C knowledge.

Quote:

where the //TRANSLIT means transliterate unsupported characters to nearest equivalents.

I didn't see any "unsupported" characters until now, everything is showed properly in all 3 my encodings. But I work only with letters, maybe this is applied for special signs like ® © ½ ± ° ¥ £ ¢ € ² ? If not maybe I can //TRANSLIT simply "forget"?

Code:

The interface you showed, is a bit difficult: you need to choose between efficiency and code complexity. You see, if you create a new conversion handle for each string conversion, you incur quite a high overhead in iconv_open, slowing things down. You can avoid that by caching a number of conversion handles, but that makes the code a lot more complex.

I am sorry for that. This is because I still don't understand things at right way. I thought at VB way like this:

Code:

char* fromCP = "IBM852";

char* toCP = "UTF-8";

iconv_t  onlyonehandle = conversion(fromCP, toCP);

Because after every converting handle is destroyed and iconv closed. So, I think, function can begin again with new parameters. But if this is not OK I will try your suggestions.

Quote:

... which keeps the old string intact, returning a new dynamically allocated string.
Would this work for you? I'd be happy to show you the source for conversion() and convert().

Yes, copied string is fine.
Well, I am a bit slow in C so I need enough time to see what will happen and try to understand this. After all, for now I don't free any memory because I do only snippets for testing and I know, I should do that for concrete programs.

Quote:

Normally I'd recommend using setlocale() and toupper()/tolower(), but since every glyph in IBM852 is just one byte, you can do the conversion trivially with a table. For example...

Hmm, I think manipulation with locales would be a better (easier) solution here if I can get result with them. Maybe sometimes I would need to do some "sort by letters" in "strange" locale and order in ASCII table is not proportional with place in alphabet order. For example letters in my alphabet is like this: "ABCČĆD(Dž)ĐEFGHIJ..."
So for this reasons temporary change locale can give better results (I think).
Additionally, difference between lowercase 'č' and uppercase 'Č' in ASCII is also 32 (200,232 in cp1250) but not in IBM852 (179, 152). So, better is run away for doing this "by hand" like I use to do in DOS.

Huh, now you give me enough work for next month :)

What can I say more than thank you again. You helped much more than I expected!

Nominal Animal

03-05-2011 01:31 PM

Quote:

Originally Posted by nime (Post 4279937)

Don't worry, there is no downsides in your case for always using //TRANSLIT for the target encoding. If you don't, and there happens to be a character that cannot be converted, the conversion will fail. You can also use //IGNORE to skip those characters instead.

Here is a function that uses your original interface. Note that it may be slow, because it opens and closes the iconv handle for each string separately. Like I said above, it is always good to append either //TRANSLIT or //IGNORE to the target character set name, otherwise the function will return NULL if there are inconvertible characters. Finally, this will grow and optimize the result string dynamically to exact length. It will always allocate enough additional space for the rest of the input string, plus CONV_EXTRA bytes. If you want, you can set CONV_EXTRA to a larger value, so it will initially allocate more memory. It will still optimize the size via a realloc() call, so there is very little harm in having CONV_EXTRA a bit larger, maybe 256 or 1024.

Code:

#include <stdlib.h>

#include <string.h>

#include <iconv.h>

#include <errno.h>



#define  CONV_EXTRA  16



char *convert(char const *const from, char const *const to, char const *const in)

{

    iconv_t    handle;

    size_t      insize, inleft;

    char      *inends;

    size_t      outsize, outleft;

    char      *out, *outends;



    /* Invalid character set names? */

    if (!from || !to || !*from || !*to) {

        errno = EINVAL;

        return NULL;

    }



    /* No string to convert? */

    if (!in)

        return NULL;



    /* Get a handle for the conversion. */

    handle = iconv_open(to, from);

    if (handle == (iconv_t)-1) {

        /* This conversion is not supported. */

        errno = ENOTSUP;

        return NULL;

    }



    /* Do the conversion. Grow the buffer if not large enough. */

    insize  = strlen(in);

    out    = NULL;

    outsize = insize + (size_t)CONV_EXTRA;

    while(1) {



        /* Prepare for the conversion. */

        inends = (char *)in;

        inleft = insize;



        /* Allocate a new output buffer. */

        out = malloc(outsize + 1);

        if (!out) {

            iconv_close(handle);

            errno = ENOMEM;

            return NULL;

        }

        outends = out;

        outleft = outsize;



        /* Do the conversion. */

        if (iconv(handle, &inends, &inleft, &outends, &outleft) == (size_t)0)

            break;



        /* Error? */

        if (errno != E2BIG) {

            int const  error = errno;

            free(out);

            iconv_close(handle);

            errno = error;

            return NULL;

        }



        /* Grow the output buffer size. */

        free(out);

        outsize = outsize + inleft + (size_t)CONV_EXTRA;

    }



    /* Reallocate the string to optimal size. */

    if (outleft && outsize >= outleft) {

        char *tmp;

        outsize -= outleft;

        tmp = realloc(out, outsize + (size_t)1);

        if (tmp)

            out = tmp;

    }



    /* Append EOS. */

    out[outsize] = 0;



    /* Close the conversion handle. */

    iconv_close(handle);



    return out;

}

Quote:

Originally Posted by nime (Post 4279937)

First, make sure you have the locale files defined. In Linux, locale -a lists them, and localedef can be used to create or install new ones. The C side is fortunately quite easy; the only limitation is that the locale setting is process wide, so if you use threads, it will change the locale for all threads.

The locale setting is divided into multiple categories, so you can only set e.g. LC_COLLATE category for string collation. Here is an example which uses the locale hr_HR.IBM852 to compare two strings:

Code:

#include <locale.h>



char *oldlocale, *string1, *string2;

int  result;



oldlocale = setlocale(LC_COLLATE, "hr_HR.IBM852");



result = strcoll(string1, string2);



setlocale(LC_COLLATE, oldlocale);



if (result < 0)

    printf("%s < %s\n", string1, string2);

else if (result > 0)

    printf("%s > %s\n", string1, string2);

else

    printf("%s == %s\n", string1, string2);

In the above code, I use the oldlocale variable, so that the locale is only temporarily changed.
In your code, you don't need to do that; you can just set the locale to whatever you happen to need. It is also local to the program, and will not change any system settings or anything, so you can use it in your program pretty freely.

The strcoll function works just like strcmp, except it uses the LC_COLLATE locale category.

Note that the tolower_ibm852() and toupper_ibm852() functions I listed earlier are both thread-safe, and not dependent on the locale settings. If you need case sensitive and insensitive IBM852 comparison functions (strcmp_ibm852() and strcasecmp_ibm852()), I can show them for you; the code is very much like tolower_ibm852() and toupper_ibm852(), except with two different tables.

The difference between these hardcoded functions and locale functions is that these are self-contained, and do not depend on any other things. In fact, if you have issues getting iconv support working for all your target platforms, I could quite easily write hardcoded conversion, sorting, and case changing functions for ISO-8859-2 and IBM852, with conversion to and from UTF-8 and ISO-8859-1, if you like.

Hope this helps!

nime	03-05-2011 03:54 PM

Nominal,
I am truly shocked with your deep knowing of this (very complicated) theme. Actually, I shamed to ask any additional help because I understand that I get more than "reasonable" help for free from you.

And also, I hope you see that level of your examples and wide help go far beyond my ability to understand what I am doing. For now I wouldn't like to know about iconv conversions more than is need for my programs because I already have mess in head from this. From more informations now I can only have more damage.
Go to developiong CONV_EXTRA for my strings is also too much. Especially what now I have proper letters on console, file and GTK entry. I can easily now add functions for 8859-2 by myself. What more can I want?
I would simply add //TRANSLIT if this is "has to be".

Thank you for pointing me to potentional problems with changing of locales and I undestand advantages of hardcodins when CP IBM852 is constant. So I try to apply your example but won't work.

I added uppercase and lowercase arrays in header file and declarations for functions, like this:

Code:

const unsigned char uppercase_ibm852[256] = {

      0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,...



static inline char *toupper_ibm852(char *const string);

static inline char *tolower_ibm852(char *const string);

// when I don't even know what 'is static inline char *'

// and what is 'char *const'. I already used only 'const char*'

Then I copy functions to my functions code file and I call functions like this:

Code:

// fstr is string from GTK textbox.

// can be in 'mixedcase' so I try both lower and uppercase..



    char* tp;

    tp = utf8_to_ibm852(fstr);

    static inline char* new_upper = toupper_ibm852(strdup(tp));

    static inline char* new_lower = tolower_ibm852(strdup(tp));

And I get bunch of errors...

Quote:

C:\RTM\gtktext\cHEADER.h|50|warning: 'toupper_ibm852' declared 'static' but never defined|
C:\RTM\gtktext\cHEADER.h|51|warning: 'tolower_ibm852' declared 'static' but never defined|
C:\RTM\gtktext\cCODE.c||In function 'invsearch':|
C:\RTM\gtktext\cCODE.c|20|warning: variable 'new_upper' declared 'inline'|
C:\RTM\gtktext\cCODE.c|20|error: initializer element is not constant|
C:\RTM\gtktext\cCODE.c|21|warning: variable 'new_lower' declared 'inline'|
C:\RTM\gtktext\cCODE.c|21|error: initializer element is not constant|
C:\RTM\gtktext\cCODE.c|21|warning: unused variable 'new_lower'|
C:\RTM\gtktext\cCODE.c|20|warning: unused variable 'new_upper'|
||=== Build finished: 2 errors, 6 warnings ===|

Of course, header is included in code file.
Please help to get this working.

I was also try to reduce code for not have to same function six or more times with your recommendation:

Code:

iconv_t  utf8_to_ibm852 = conversion("UTF-8",      "IBM852//TRANSLIT");

iconv_t  utf8_to_iso2  = conversion("UTF-8",      "ISO-8859-2//TRANSLIT");

iconv_t  ibm852_to_utf8 = conversion("IBM852",    "UTF-8");

iconv_t  ibm852_to_iso2 = conversion("IBM852",    "ISO-8859-2//TRANSLIT");

iconv_t  iso2_to_utf8  = conversion("ISO-8859-2", "UTF-8");

iconv_t  iso2_to_ibm852 = conversion("ISO-8859-2", "IBM852//TRANSLIT");

//-----------

C:\RTM\gtktext\cFUNC.c|195|warning: implicit declaration of function 'conversion'|

C:\RTM\gtktext\cFUNC.c|195|warning: initialization makes pointer from integer without a cast|

C:\RTM\gtktext\cFUNC.c|195|error: initializer element is not constant|

... and so 6 times...

||=== Build finished: 6 errors, 7 warnings ===|



//'implicit declaration' appears mostly when something isn't declared,

//but in same file I already have: static iconv_t iconv_cp1250_to_ibm852 = (iconv_t)-1;

//which works. What do I missed?

Quote:

If you need case sensitive and insensitive IBM852 comparison functions (strcmp_ibm852() and strcasecmp_ibm852()), I can show them for you; the code is very much like tolower_ibm852() and toupper_ibm852(), except with two different tables.

The difference between these hardcoded functions and locale functions is that these are self-contained, and do not depend on any other things. In fact, if you have issues getting iconv support working for all your target platforms, I could quite easily write hardcoded conversion, sorting, and case changing functions for ISO-8859-2 and IBM852, with conversion to and from UTF-8 and ISO-8859-1, if you like.

I would like to see case sensitive and insensitive IBM852 comparison functions (strcmp_ibm852() and strcasecmp_ibm852()).
For now I use following function which I pick up from the net. Works fast and nice, but not on 852 so I must recode first to work only in windows..

Code:

const char *instrnocase(const char *haystack, const char *needle)

{

  if (!*needle){

      return haystack;}

  for (; *haystack; ++haystack){

      if (toupper(*haystack) == toupper(*needle)){

        /* matched starting char -- loop through remaining chars. */

        const char *h, *n;

        for (h = haystack, n = needle; *h && *n; ++h, ++n){

            if (toupper(*h) != toupper(*n)){

              break;}}

        if (!*n) /* matched all of 'needle' to null termination */

          {return haystack;} /* return the start of the match */

      }}

  return 0;

}

//-------------

//Call:

    const char *found = instrnocase(Ip, tp);

    if (found)

        {

          finded ++;

          gtk_entry_append_text (GTK_ENTRY (entry), cp);

          ...

So, for sorting I don't know. Seems complicated. In DOS I made sorts with "lookup" string for properly ordering.
And now I think confused about changing locales...
This is something what should be well tested before any decision.
Thank you for example how to change just a essential part of locale.

Nominal Animal

03-05-2011 06:59 PM

Quote:

Originally Posted by nime (Post 4280089)

And also, I hope you see that level of your examples and wide help go far beyond my ability to understand what I am doing.

Fair enough. I'll try to limit the scope of my answers to fit your needs better.

Quote:

Originally Posted by nime (Post 4280089)

I would simply add //TRANSLIT if this is "has to be".

Yes. It just makes sure that if there is something unexpected, the results are as close to correct as possible.

Quote:

Originally Posted by nime (Post 4280089)

I added uppercase and lowercase arrays in header file and declarations for functions, like this:

Code:

const unsigned char uppercase_ibm852[256] = {

      0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,...



static inline char *toupper_ibm852(char *const string);

static inline char *tolower_ibm852(char *const string);

// when I don't even know what 'is static inline char *'

// and what is 'char *const'. I already used only 'const char*'

Sorry! :doh: It's my fault. static inline designates the function to be a local helper function, only used in the current source file. Of course it is wrong here. If you remove the static inline from everywhere, it should work.

As to the types:

char * means "pointer to char": a pointer to a string.
const char * and char const * both mean "pointer to constant char": a pointer to a fixed string. (You can change the pointer, but you cannot change the characters in the string.)
char *const means "a constant pointer to char": a fixed pointer to a string. (You cannot change the pointer, but you can change the individual characters in the string.)
const char *const and char const *const both mean "a constant pointer to constant char": a fixed pointer to a fixed string.

The reason I've marked the pointers fixed is just to help the compiler do better code. The string value is changed, but the pointer is not. This designation should help the compiler.

Current C compilers almost always detect the above situation even without the help, so it is just a detail. However, I've found I understand and remember the intent of the functions better, when the parameters are meticulously marked const. I've even caught a number of bugs that way.

Quote:

Originally Posted by nime (Post 4280089)

I was also try to reduce code for not have to same function six or more times with your recommendation:

Code:

iconv_t  utf8_to_ibm852 = conversion("UTF-8",      "IBM852//TRANSLIT");

iconv_t  utf8_to_iso2  = conversion("UTF-8",      "ISO-8859-2//TRANSLIT");

iconv_t  ibm852_to_utf8 = conversion("IBM852",    "UTF-8");

iconv_t  ibm852_to_iso2 = conversion("IBM852",    "ISO-8859-2//TRANSLIT");

iconv_t  iso2_to_utf8  = conversion("ISO-8859-2", "UTF-8");

iconv_t  iso2_to_ibm852 = conversion("ISO-8859-2", "IBM852//TRANSLIT");

//-----------

C:\RTM\gtktext\cFUNC.c|195|warning: implicit declaration of function 'conversion'|

C:\RTM\gtktext\cFUNC.c|195|warning: initialization makes pointer from integer without a cast|

C:\RTM\gtktext\cFUNC.c|195|error: initializer element is not constant|

... and so 6 times...

||=== Build finished: 6 errors, 7 warnings ===|

Sorry again!

toupper_ibm852 and tolower_ibm852 are self-standing functions; you can call them anytime you want, without any setup. They do not depend on any other code; everything they needed was in that one code box. (If you remove the static inline from it.)

Similarly, the convert function is self-standing; you only need the code I showed in that code box, and link in iconv when compiling. If you use it, you can replace all the conversion calls you have right now with it. The only iconv_t you should see in all your source code would be the iconv_t handle; line in the convert function.

Quote:

Originally Posted by nime (Post 4280089)

I'll put all the conversion functions and the IBM852 functions into one header and one code file, so it'll be easier for you to use. It will replace all the code I've shown you before, so you'll need to modify your existing code a bit, if you decide to use it. It will be very similar, so your changes should be very small and easy to do. I'll do that in a new message in a bit.

nime	03-06-2011 03:50 AM

I get yours extra-excellent upper/lower 852 conversion to work with removing "static inline" and putting array in code file. Then I reorganize my program with new header and code file just for conversions because I will need them often.

Now I have this:

Code:

// fstr is UTF8 from Gtk textbox



    char* tp;

    tp = utf8_to_ibm852(fstr);  // convert fstr to 852

    char* new_upper = toupper_ibm852(strdup(tp)); //uppercase 852 string

    ...



    char* n_upper = toupper_ibm852(strdup(Ip));  // uppercase string from data file

    const char *found = strstr(n_upper, new_upper); // pure C strstr

    if (found)  // IT FINDS, IT FINDS!!

        {

          finded ++;



      // fill GTK entry with previously converted UTF8 string 'cp'

      // from 852 data file

          gtk_entry_append_text (GTK_ENTRY (entry), cp);

So, this is it!. No need for additional functions!

nime	03-06-2011 07:19 AM

And now, here is my performances!

I have data file over 3MB with 100.000 records which contains 43.761 filled rows with various data, written with QB45 and VB data structures (types) which don't know for null termination so I do this during reading of my records. All string data is in cp852!

I have GTK textbox for input a search string. After, I convert this string from utf8 to 852 and then to uppercase852 all with Nominals functions.
Then I read all 100.000 records and where data exist I isolate and terminate all strings, convert name field (28 chars) to uppercase852 and then search for first occurrence of search string in it with 'strstr' C function.

If search string is finded I write them to console with cp852, write them to windows txt file with cp1250 and write them to GTK textbox with utf8 with following results:

Quote:

found 52 from 43761 'čić' for 0,843 seconds

Assuming many conversions on-the-fly I think that results are very good and fast enough.

Now I have added additional conversions of CP and have 12 of them in my program (everything to anything) and total size of my program (exe) is a little less than 20 kb!

Of course, all of this would not be possible without extremely assistive Nominal Animal selfless help for which I am grateful for a lifetime!

Nominal Animal

03-06-2011 09:06 AM

Nime, that sounds excellent! Less than a second to search is not long at all. I'm sure the users are happy.

I know you said you don't need any more code .. but these functions allow you to work directly on the unterminated strings in your data structures.

First, these two functions use the same uppercase_ibm852 array as previously. They allow you to check if the data contains the given substring. The first is case sensitive, the second is case insensitive. If you supply a NULL pointer, or a zero-length area, both functions will just return -1 (no match) without any problems. These should make your code even simpler, I think.

Code:

/* Case sensitive substring search.

 * The haystack is the size bytes of (unterminated) data at data.

 * Both data and string use the IBM852 character set.

 * If string is in data, the function will return 0..(size-1), the offset to the start of the match.

 * If there is no match, the function will return -1.

*/

int find_ibm852(void const *const data, int const size,

                char const *const string)

{

    unsigned char const *const haystack = (unsigned char const *)data;

    unsigned char const *const needle = (unsigned char const *)string;



    int        i, o;



    /* No data? */

    if (size < 1 || !data)

        return -1;



    /* No string to search for? */

    if (!string || !*string)

        return -1;



    /* Case sensitive search loop. */

    for (i = 0; i < size; i++)

        if (haystack[i] == needle[0]) {

            o = 1;

            while (i+o < size && needle[o] && haystack[i+o] == needle[o])

                o++;

            if (!needle[o])

                return i;

        }



    /* No match. */

    return -1;

}



/* Case insensitive substring match.

 * The haystack is the size bytes of (unterminated) data at data.

 * Both data and string use the IBM852 character set,

 * and may be in any case. Both are compared case insensitively.

 * If string is in data, the function will return 0..(size-1), the offset to the start of the match.

 * If there is no match, the function will return -1.

*/

int findcase_ibm852(void const *const data, int const size,

                    char const *const string)

{

    unsigned char const *const haystack = (unsigned char const *)data;

    unsigned char const *const needle = (unsigned char const *)string;



    int        i, o;



    /* No data? */

    if (size < 1 || !data)

        return -1;



    /* No string to search for? */

    if (!string || !*string)

        return -1;



    /* Case insensitive search loop. */

    for (i = 0; i < size; i++)

        if (uppercase_ibm852[haystack[i]] == uppercase_ibm852[needle[0]]) {

            o = 1;

            while (i+o < size && needle[o] && uppercase_ibm852[haystack[i+o]] == uppercase_ibm852[needle[o]])

                o++;

            if (!needle[o])

                return i;

        }



    /* No match. */

    return -1;

}

Finally, convertdata() is a variant of the convert() function. It is used the same way, except that convertdata() takes a pointer and size (number of bytes) to the data, so the data does not need to be terminated. You could use this to create strings out of the data, and do the character set conversion at the same time.

Code:

char *convertdata(char const *const from, char const *const to,

                  void const *const data, size_t const size)

{

    iconv_t    handle;

    size_t      insize, inleft;

    char      *inends;

    size_t      outsize, outleft;

    char      *out, *outends;



    /* Invalid character set names? */

    if (!from || !to || !*from || !*to) {

        errno = EINVAL;

        return NULL;

    }



    /* No string to convert? */

    if (!data || size < (size_t)1)

        return NULL;



    /* Note: If you want this function to return an empty string

    *      instead of NULL, replace the above return NULL; with

    *      return strdup("");

    */



    /* Get a handle for the conversion. */

    handle = iconv_open(to, from);

    if (handle == (iconv_t)-1) {

        /* This conversion is not supported. */

        errno = ENOTSUP;

        return NULL;

    }



    /* Do the conversion. Grow the buffer if not large enough. */

    insize  = size;

    out    = NULL;

    outsize = insize + (size_t)16;

    while(1) {



        /* Prepare for the conversion. */

        inends = (char *)data;

        inleft = insize;



        /* Allocate a new output buffer. */

        out = malloc(outsize + 1);

        if (!out) {

            iconv_close(handle);

            errno = ENOMEM;

            return NULL;

        }

        outends = out;

        outleft = outsize;



        /* Do the conversion. */

        if (iconv(handle, &inends, &inleft, &outends, &outleft) == (size_t)0)

            break;



        /* Error? */

        if (errno != E2BIG) {

            int const  error = errno;

            free(out);

            iconv_close(handle);

            errno = error;

            return NULL;

        }



        /* Grow the output buffer size. */

        free(out);

        outsize = outsize + inleft + (size_t)16;

    }



    /* Reallocate the string to optimal size. */

    if (outleft && outsize >= outleft) {

        char *tmp;

        outsize -= outleft;

        tmp = realloc(out, outsize + (size_t)1);

        if (tmp)

            out = tmp;

    }



    /* Append EOS. */

    out[outsize] = 0;



    /* Close the conversion handle. */

    iconv_close(handle);



    return out;

}

I'm glad to have been of help. If you need any further help, or explanations on exactly why or how any of this stuff works, please let me know!

nime	03-06-2011 05:23 PM

Thank you Nominal for more functions but I can't get them to work and probably I don't try enough. Why to, when we already does an excellent job with good results.

But, I am sure, more than half of the world programmers will be happy to find this material here! I searched for this long time. So mighty conversions (relatively simple to make if someone like you help enough) which are independent from M$"cultures" and very slow classes in huge frameworks.

I tested my program more tonight and I see that freeing variables are very necessary. If I don't do this program becomes unstable and significantly slower. So I do best I can. Program seems reliable now but need more testing.

I also try to make new project in Linux to read my data files but I don't do something good. Compiler returns error (-ld 2 or so). But I will do this soon as possible and now I am sure with same good results.
I belive that nothing regarding cp Conversions we did not leave unfinished.
So, thank you for all once more,

nime.

All times are GMT -5. The time now is 10:23 AM.