LinuxQuestions.org


nime 01-26-2011 02:20 PM

Letters and chars in Linux
 
Hello,
I am pretty new to C and to Linux, so I need advice, suggestions or an example regarding encoding/decoding strings in general on Linux.

I have a lot of old binary data files, mostly written in DOS/QB45. If I want to read that data on Windows I have to recode the strings (replace certain characters), and then the data looks good on Windows. The characters in question lie between codes 127 and 255. They are non-English letters like "čćšđžČĆŠĐŽ", which are common in my country.
Every operation on such strings can count on 1 byte per letter.
So, the conversion looks like:
Code:

        char *ps = myString;
        while (*ps != '\0') {
            switch ((unsigned char)*ps) {
            case 230: *ps = (char)138; break;
            case 209: *ps = (char)208; break;
            /* ... and so on for the other letters ... */
            }
            ps++;
        }
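
The same recoding can also be driven by a 256-entry lookup table instead of a switch. A minimal sketch, assuming only the two mappings shown above (230 to 138 and 209 to 208); the names recode_table, recode_table_init and recode_in_place are made up for illustration, and the rest of the mapping is not shown in the post, so it stays as identity here:

```c
#include <stddef.h>

/* Lookup table: index = byte read from the DOS file, value = byte to store.
   Only the two mappings quoted in the post are filled in (assumption:
   every other byte is left unchanged). */
static unsigned char recode_table[256];

/* Start from the identity mapping, then override the known entries. */
static void recode_table_init(void)
{
    for (int i = 0; i < 256; i++)
        recode_table[i] = (unsigned char)i;
    recode_table[230] = 138;
    recode_table[209] = 208;
}

/* Recode a NUL-terminated single-byte string in place and return it. */
static char *recode_in_place(char *s)
{
    if (s)
        for (size_t i = 0; s[i] != '\0'; i++)
            s[i] = (char)recode_table[(unsigned char)s[i]];
    return s;
}
```

Call recode_table_init() once at startup; a second table with the pairs swapped would recode back to the DOS values.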

But when I came to Linux, a different story was waiting for me :)
Here, unlike English letters, "my" characters seem to take more than one byte.
Of course this is not good for my program, which has searching and many other things organized on the one-letter-one-byte principle.
And of course I am in trouble now :)

The same behaviour must show up in many other languages, so I ask experienced Linux users and programmers for help: what should I do to be able to use my nice and beloved old data with my C programs on Linux?

Data is organized in packed binary structures and written to file with (say) VB.

Thanx, nime.

Snark1994 01-26-2011 04:51 PM

Hm, sounds like you would benefit from reading this... My guess is that linux is using unicode to store your characters. I'm not sure what the best course of action would be - if it were me, I would probably try to change my programmes to deal with the multi-character encoding, or at least find a bit more about the issue (90% of my knowledge on the subject came from reading that article when trying to fix a character encoding issue on a website). However, this may not be the most time-effective way of doing things. What exactly are you "converting" it to? You say you need to re-encode characters between 129 and 255, but in your programme you change 209 to 208...

nime 01-26-2011 05:18 PM

Hello Snark,
I have old data which I cannot change, because it lives in many different places and because there is a lot of it. Originally, this data was written in DOS with my code page, which is 852.
The part of the converting program I showed recodes characters readable in DOS so that they are readable in Windows, and this works OK. When I work in Windows I always recode the data back so it stays readable from DOS too. This way I have the same data for two different OSes.
My idea was to bring Linux into this story; that's mainly why I am now learning C.
In DOS and Windows those specific characters lie in the range between 127 and 255, but not in the same places. This applies only to my country-specific letters "čćšđžČĆŠĐŽ". Latin characters are always in the same place.

Now, I have some other interesting questions:
How does Linux print to matrix devices like printers if it uses Unicode, which is suitable only for graphical systems?
How does Linux then interact with microcontrollers?

I haven't found anything interesting at Oracle yet.
But I have an idea, and because I am a C beginner I don't know if it is possible.
For example, I can compose a string of single-byte and two-byte characters to read/change it in Linux (in a GTK or similar text box), but can I afterwards recognize (by looping) which letters in the string are 2-byte, so that I can recode back to the DOS single-byte form before I save to disk?
Recoding is necessary because the data assumes strings with explicit size (fixed-length strings), and other technical requirements in such a system rest on this fact.
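
Recognizing the multi-byte letters by looping is possible if the text is UTF-8, because the first byte of every character announces how many bytes follow. A minimal sketch, assuming UTF-8 input; utf8_seq_len and count_multibyte are hypothetical helper names, and the check is deliberately simple (it does not reject overlong encodings):

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte c,
   or 0 if c is a continuation byte or not a valid lead byte. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx: plain ASCII   */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte letter */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx                */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx                */
    return 0;
}

/* Count the characters in s that occupy more than one byte.
   Returns (size_t)-1 if s cannot be walked as UTF-8. */
static size_t count_multibyte(const char *s)
{
    size_t n = 0;

    while (*s) {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            return (size_t)-1;
        /* Every following byte must look like 10xxxxxx; a premature
           NUL fails this test too, so we never read past the end. */
        for (size_t k = 1; k < len; k++)
            if (((unsigned char)s[k] & 0xC0) != 0x80)
                return (size_t)-1;
        if (len > 1)
            n++;
        s += len;
    }
    return n;
}
```

In practice the recoding itself is easier to leave to a conversion library; a loop like this is mostly useful for checking how many letters will shrink to one byte.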

What do you think?

eSelix 01-26-2011 06:12 PM

Well, if you read data from a file and save single bytes, then single bytes are stored in the file; regardless of Linux or Windows, the data in the files is the same. I think you have a problem with how this file is displayed on the screen. Can you say how you know that:
Quote:

Here, unlike English letters, "my" characters seem to take more than one byte.
Did you read this file on the console with cat or an editor, or with a graphical editor?

nime 01-26-2011 06:49 PM

Hi eSelix,
I tried it like this:
Code:

const char *str = "Marco";
printf("%zu", strlen(str));   /* strlen() returns size_t, so use %zu */

returns 5

and this
Code:

const char *str = "Mačka";
printf("%zu", strlen(str));   /* strlen() returns size_t, so use %zu */

returns 6

so the letter "č" must be a two-byte letter, I suppose.
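
The difference between bytes and letters can be made visible by counting only the bytes that start a character. A minimal sketch, assuming the string is UTF-8 (where "č" is the two bytes 0xC4 0x8D); utf8_strlen is a hypothetical helper, not a standard function:

```c
#include <stddef.h>
#include <string.h>

/* Count characters (code points) in a UTF-8 string by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx. */
static size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

With the bytes written out explicitly, strlen("Ma\xC4\x8Dka") gives 6 while utf8_strlen("Ma\xC4\x8Dka") gives 5. Writing the escapes avoids depending on which encoding the editor used to save the source file.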
I print the data to the console with printf and all those letters are shown as black question marks.

I did additional tests.
I looped through all codes from 32 to 255 and printed them to the console, and the letters over 127 are also only black "?".
And more: I saved a text file in Windows (Notepad, plain text) with this data, and in Linux it is displayed without my characters.
I can see the file properly after converting it in Geany or ooWriter.
When I open the file in Geany without conversion, there are boxes with four (hex) letters instead of those characters.
But please note, I want to convert my strings in a C program, not a whole text file.
After all, my data is in a binary file where text and numbers are mixed and "packed".

eSelix 01-26-2011 07:00 PM

The editor in which you wrote this code is set to a Unicode encoding, so the "Mačka" string is encoded that way in your program. Change it to ISO/ASCII in the editor options. If you don't know how, say which editor you are using.

Because various countries use various charsets, we have a mess with strings now. It is difficult for an editor to guess which charset an opened file uses, so the user must do the conversion by hand. UTF-8 was created as a remedy for this situation.

nime 01-27-2011 12:29 AM

eSelix,
No need to guess, just read what I wrote.

1) I don't want to convert text (or other) file but strings in my C program.
2) I don't use editor but C program in windows.

I can read that data from Windows in Linux, but those letters are not displayed.
Is there any way for my program to use this data by doing the conversion directly in the program, just for the strings I need and not for the whole file?
Or put differently, can I force Linux to let my program use the same code page and character sets as Windows?
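
Rather than forcing a code page, a program can ask which encoding its environment uses and then convert its strings to match. A minimal sketch, assuming a POSIX system; current_codeset is a hypothetical wrapper around the standard setlocale() and nl_langinfo() calls:

```c
#include <langinfo.h>
#include <locale.h>

/* Return the name of the character encoding of the user's locale,
   for example "UTF-8" on most current Linux desktops. */
static const char *current_codeset(void)
{
    /* Adopt the locale from the environment (LANG, LC_ALL, ...).
       If this fails, the program stays in the default "C" locale. */
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

The returned name can be compared against the encoding of the data files to decide whether a conversion is needed before printing.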

The same problem should exist with German, French, Spanish and many other languages, so I believe somebody faced this problem long before me.

Snark1994 01-27-2011 09:55 AM

Quote:

Originally Posted by nime (Post 4239136)
2) I don't use editor but C program in windows.

He means the editor you used to write your C program. That has to store the string as well, so he is suggesting changing the setting in your editor :)

nime 01-27-2011 12:39 PM

Ah so, thanks Snarky,
I already did this (set my C editor, Code::Blocks, to UTF-8) and it works well and properly on both Windows and Linux. But this is another kind of problem.

I read more sources on this problem today and it seems that many people have it.
Higher-level programming tools have those conversions included, but C has gnulib, locale and the internationalization concept, which I have to study.

Anyway, if somebody has advice or experience with this, please share!

Nominal Animal 01-27-2011 04:23 PM

In Linux, there is no need to modify your old programs. You can adjust your environment to use the same character encoding, IBM 852, as your old data files use.

If your old program is a command-line program (and does not use e.g. ncurses), use iconv to convert the output to readable form:
Code:

./my-program ... | iconv -f IBM852
If your program is interactive, and you have a recent version of luit (for IBM 852 code page support), then
Code:

luit -encoding ibm-cp852 ./my-program ...
If you write your own program for Linux, you can use the iconv_open, iconv, and iconv_close functions to set up an easy conversion for your strings, between UTF-8 and IBM852. It is a very nice and easy API to use. It is also part of glibc and standard locales, and should be available in all Linux distributions.

Example iconv C helper code follows. ibm852_to_utf8() and utf8_to_ibm852() work just like strdup(), returning a duplicate of the given string, except they do the conversion too. Remember to free() any such strings after you're done with them. Also, call iconv_done() when you don't need the conversion (at least for a while), to free any internal tables used by iconv.
It would be better if the code resized the newly created strings when necessary, instead of just allocating them heuristically. Or even better, if it used string pools (like Apache memory pools), so that you could discard all related strings at once, and not worry about remembering to free each and every one.
Code:

#include <stdlib.h>
#include <string.h>
#include <iconv.h>
#include <errno.h>
#include <stdio.h>

static iconv_t  iconv_ibm852_to_utf8 = (iconv_t)-1;
static iconv_t  iconv_utf8_to_ibm852 = (iconv_t)-1;

void iconv_done(void)
{
    if (iconv_ibm852_to_utf8 != (iconv_t)-1)
        iconv_close(iconv_ibm852_to_utf8);

    if (iconv_utf8_to_ibm852 != (iconv_t)-1)
        iconv_close(iconv_utf8_to_ibm852);

    iconv_ibm852_to_utf8 = (iconv_t)-1;
    iconv_utf8_to_ibm852 = (iconv_t)-1;
}

/*
 * These functions will return a dynamically allocated copy, just like strdup().
 * Remember to free() them after use.
*/

char *ibm852_to_utf8(char *ibm852)
{
    char  *in = ibm852;
    char  *out, *end;
    size_t  in_left = (in) ? strlen(in) : 0;
    size_t  out_size = 3 * in_left;
    size_t  out_left = out_size;
    size_t  converted;

    if (!ibm852) {
        errno = EINVAL;
        return NULL;
    }

    if (iconv_ibm852_to_utf8 == (iconv_t)-1) {
        iconv_ibm852_to_utf8 = iconv_open("UTF-8", "IBM852");
        if (iconv_ibm852_to_utf8 == (iconv_t)-1)
            return NULL;
    }

    out = malloc(out_size + 1);
    if (!out) {
        errno = ENOMEM;
        return NULL;
    }

    end = out;
    converted = iconv(iconv_ibm852_to_utf8, &in, &in_left, &end, &out_left);
    if (converted == (size_t)-1) {
        const int saved_errno = errno;
        free(out);
        errno = saved_errno;
        return NULL;
    }
    *end = 0;

    return out;
}

char *utf8_to_ibm852(char *utf8)
{
    char  *in = utf8;
    char  *out, *end;
    size_t  in_left = (in) ? strlen(in) : 0;
    size_t  out_size = 2 * in_left;
    size_t  out_left = out_size;
    size_t  converted;

    if (!utf8) {
        errno = EINVAL;
        return NULL;
    }

    if (iconv_utf8_to_ibm852 == (iconv_t)-1) {
        iconv_utf8_to_ibm852 = iconv_open("IBM852//TRANSLIT", "UTF-8");
        if (iconv_utf8_to_ibm852 == (iconv_t)-1)
            return NULL;
    }

    out = malloc(out_size + 1);
    if (!out) {
        errno = ENOMEM;
        return NULL;
    }

    end = out;
    converted = iconv(iconv_utf8_to_ibm852, &in, &in_left, &end, &out_left);
    if (converted == (size_t)-1) {
        const int saved_errno = errno;
        free(out);
        errno = saved_errno;
        return NULL;
    }
    *end = 0;

    return out;
}

int main(int argc, char *argv[])
{
        int  arg;
        char *s1, *s2, *s3, *s4;

        for (arg = 1; arg < argc; arg++) {
                s1 = utf8_to_ibm852(argv[arg]);
                s2 = ibm852_to_utf8(argv[arg]);
                s3 = ibm852_to_utf8(s1);
                s4 = utf8_to_ibm852(s2);
                printf("s = \"%s\":\n", argv[arg]);
                printf("utf8_to_ibm852(s) = \"%s\"\n", s1 ? s1 : "<error>");
                printf("ibm852_to_utf8(s) = \"%s\"\n", s2 ? s2 : "<error>");
                printf("ibm852_to_utf8(utf8_to_ibm852(s)) = \"%s\"\n", s3 ? s3 : "<error>");
                printf("utf8_to_ibm852(ibm852_to_utf8(s)) = \"%s\"\n", s4 ? s4 : "<error>");
                if (s4) free(s4);
                if (s3) free(s3);
                if (s2) free(s2);
                if (s1) free(s1);
        }

        iconv_done();

        return 0;
}

I don't use VB or Windows, so I cannot really help you there.

Hope this helps,
Nominal Animal

nime 01-28-2011 12:34 AM

Thank you Nominal,

this sounds like the way to go.
Also, thanks for all the remarks; I will understand what you are talking about with a little practice, and then I will post back the results.
So I see now this can go like this:

- load string from 852 file,
- duplicate them and convert to utf-8,
- do some job (change) with them,
- convert them to 852 back,
- write them to disk,
- free them from memory,
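
Since the records use fixed-length strings, the "write back" step also has to fit the converted text into its field again. A minimal sketch of that step, assuming the fields are padded with spaces and not NUL-terminated (an assumption about the file format, which the thread does not spell out); store_fixed_field and load_fixed_field are hypothetical names:

```c
#include <stddef.h>
#include <string.h>

/* Copy a single-byte (for example IBM852) string into a fixed-width
   record field. Short strings are padded with spaces, long ones are
   cut off. Returns the number of bytes of s that did not fit. */
static size_t store_fixed_field(char *field, size_t width, const char *s)
{
    size_t len = strlen(s);
    size_t n = len < width ? len : width;

    memcpy(field, s, n);
    memset(field + n, ' ', width - n);
    return len - n;
}

/* Copy a fixed-width field into buf (which must hold width + 1 bytes),
   dropping trailing spaces and adding the terminating NUL. */
static char *load_fixed_field(char *buf, const char *field, size_t width)
{
    while (width > 0 && field[width - 1] == ' ')
        width--;
    memcpy(buf, field, width);
    buf[width] = '\0';
    return buf;
}
```

Because one letter is one byte in the 852 form, the cut-off in store_fixed_field() can never split a letter; doing the same on the UTF-8 form could.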

This is the ideal scenario, because the same data, without static conversion, will also be good for the programs in DOS and Windows, which can all share the same files.
This makes it theoretically possible to have 3 different OSes on a LAN working with the same data.

Thanks again, Nominal. For Windows, VB and DOS I don't need help; I solved that scenario there a long time ago.

Nominal Animal 01-30-2011 11:57 AM

Right, nime.

The only thing to watch for is if the user supplies UTF-8 characters (say, Kanji or something) which cannot be converted to IBM-852. The //TRANSLIT flag tells iconv to do the closest possible conversion, which is very good, but you may need better memory handling in case the transliterations are much longer than the original characters. The example functions I gave are not very good, since they really should check whether the conversion needs a larger buffer.
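
To find out beforehand whether a user-supplied string fits the target character set at all, one option is a dry run through a handle opened without //TRANSLIT. A minimal sketch, assuming glibc's behaviour of failing with EILSEQ on characters it cannot convert (other iconv implementations may substitute a placeholder instead); is_convertible is a hypothetical helper:

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Returns 1 if all of s converts through cd, 0 if some character has
   no equivalent (EILSEQ) or the input ends mid-character (EINVAL),
   and -1 on any other error. Open cd WITHOUT //TRANSLIT, otherwise
   substitutions hide the problem. */
static int is_convertible(iconv_t cd, const char *s)
{
    char    scratch[256];
    char   *src = (char *)s;        /* iconv() does not modify the input */
    size_t  src_left = strlen(s);

    iconv(cd, NULL, NULL, NULL, NULL);      /* reset the handle's state */

    while (src_left > 0) {
        char   *dst = scratch;              /* the output is thrown away */
        size_t  dst_left = sizeof scratch;

        if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
            if (errno == E2BIG)
                continue;                   /* scratch full: just reuse it */
            if (errno == EILSEQ || errno == EINVAL)
                return 0;
            return -1;
        }
    }
    return 1;
}
```

A program could run this check on text box input and warn the user before anything is written to the data file.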

Also, you might consider using a configurable character set name (read it from a file, a command-line parameter or an environment variable, defaulting to IBM852), and open the iconv handles (disk_to_utf and utf_to_disk) early in your program. That way you can handle already converted files, too.
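
A minimal sketch of that idea, assuming an environment variable called DATA_CHARSET (a made-up name) and the handle names disk_to_utf and utf_to_disk from above; disk_charset and open_disk_handles are hypothetical helpers:

```c
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>

static iconv_t disk_to_utf = (iconv_t)-1;
static iconv_t utf_to_disk = (iconv_t)-1;

/* Character set of the data files: $DATA_CHARSET if set, else IBM852.
   DATA_CHARSET is a hypothetical variable name. */
static const char *disk_charset(void)
{
    const char *name = getenv("DATA_CHARSET");
    return (name && *name) ? name : "IBM852";
}

/* Open both handles once, early in the program. Returns 0 on success. */
static int open_disk_handles(void)
{
    char        to_disk[128];
    const char *cs = disk_charset();
    int         n = snprintf(to_disk, sizeof to_disk, "%s//TRANSLIT", cs);

    if (n < 0 || (size_t)n >= sizeof to_disk)
        return -1;                          /* charset name too long */

    disk_to_utf = iconv_open("UTF-8", cs);  /* iconv_open(to, from) */
    if (disk_to_utf == (iconv_t)-1)
        return -1;

    utf_to_disk = iconv_open(to_disk, "UTF-8");
    if (utf_to_disk == (iconv_t)-1) {
        iconv_close(disk_to_utf);
        disk_to_utf = (iconv_t)-1;
        return -1;
    }
    return 0;
}
```

Running the program as DATA_CHARSET=CP1250 ./my-program would then read files that were already converted for Windows.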

(Iconv uses the standard character set mappings in your locales' charmaps/ directory. If you have already done some partial static character conversions, you can copy charmaps/IBM852 to, say, charmaps/IBM852dosfix, and edit it (rename to IBM852dosfix) to account for those static conversions; the file has a simple textual format. Then just tell your program to use the IBM852dosfix charset instead, and you can work with those files painlessly too.)

Happy to help,
Nominal Animal

nime 03-05-2011 04:44 AM

Hello again
 
I see we have here background conversation too what confirm additionally how this subject is interesting for developers. I am probably winner in trying to fix my own mistakes with partial success :) One I'm sure, If I wouldn't be a human I will made much less mistakes. This is not good because then I will not have to fix anything.
And most important is that all my mistakes becomes as result of hard work and positive thinking.

So, over the last month I often tried to get the code page conversion to work, and I finally got it yesterday, but only on Windows, using iconv.dll.
But new situations appeared (which I don't know how to solve).
IBM852 to UTF-8 is fine for filling GTK text boxes, printing to the console (which is at 852) and so on, but it is not good for writing a text file to view in Notepad (without BOM). So, following Nominal's example, I made additional conversions from IBM852 to CP1250 and back, and now this is fixed too. The conversions work excellently: properly and fast. What will happen in Linux I will see later.
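
An alternative to the extra CP1250 conversion would be to keep UTF-8 and write the byte order mark at the start of the file, since a UTF-8 file that starts with the bytes EF BB BF is generally recognized as UTF-8 by Notepad. A minimal sketch; write_utf8_bom is a hypothetical helper:

```c
#include <stdio.h>

/* Write the three-byte UTF-8 signature (BOM) to fp. Call it once,
   before any text is written. Returns 0 on success, -1 on error. */
static int write_utf8_bom(FILE *fp)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    return fwrite(bom, 1, sizeof bom, fp) == sizeof bom ? 0 : -1;
}
```

Many Linux tools do not expect the mark, so it is probably best written only to files meant for Windows.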
Calling the conversions looks like this:

Code:

  char* entry_text;
  entry_text =(char*) gtk_entry_get_text(GTK_ENTRY(entry));
  printf("button_clicked search for %s\n", utf8_to_ibm852(entry_text));

OR

    char* et = utf8_to_ibm852((char*)entry_text);
    printf("Entry contents: %s\n", et);

OR

    char* tp;
    tp = utf8_to_ibm852(findstr);

    char* wp;
    wp = ibm852_to_cp1250(tp);

... and now I have 4 functions and need more (cp1250>UTF8 and back, and even more like 8859-2, etc...).
So, I think about to replace those functions with just one or two, and I try, but my C is not so good as my english ;)

I would like function like this:
Code:

   
    char* convCP(char* fromCP, char* toCP, char* instring);

//which could be called:
   
convstring = convCP("IBM852", "UTF-8", instring);
//OR
convstring = convCP("UTF-8", "CP1250", instring);

If it is not too much to ask, could someone who knows C please write this function based on Nominal's code?

(2)
An additional problem, which I didn't think about before but need to solve now:
The main purpose of storing strings in the file in the old DOS 852 format is direct addressing (one letter, one byte) and the ability to use simple C functions for searching and manipulating data, like strcmp, strstr, strcpy and so on.
But...
... problem is, after I convert GTK text to IBM852 I have to convert my new 852 text to uppercase! But my system (CP1250) dont know what is uppercase in CP852 when we talk about letters like "čćđČĆĐ".

What should I do here?
Has anybody faced this problem before?

Nominal Animal 03-05-2011 09:26 AM

Quote:

Originally Posted by nime (Post 4279697)
And most important is that all my mistakes becomes as result of hard work and positive thinking.

Exactly! You have a very good attitude, nime.

Quote:

Originally Posted by nime (Post 4279697)
now I have 4 functions and need more (cp1250>UTF8 and back, and even more like 8859-2, etc...).
So, I think about to replace those functions with just one or two, and I try, but my C is not so good as my english ;)

The interface you showed,
Quote:

Originally Posted by nime (Post 4279697)
Code:

char* convCP(char* fromCP, char* toCP, char* instring);

is a bit difficult: you need to choose between efficiency and code complexity. You see, if you create a new conversion handle for each string conversion, you incur quite a high overhead in iconv_open, slowing things down. You can avoid that by caching a number of conversion handles, but that makes the code a lot more complex.
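
To give an idea of what caching the handles involves, here is a minimal sketch of a small fixed-size cache keyed by the two charset names. cached_conversion and MAX_CACHED are hypothetical names, and the sketch is not thread-safe and never closes its handles:

```c
#include <iconv.h>
#include <string.h>

#define MAX_CACHED 8

static struct {
    char    from[32];
    char    to[32];
    iconv_t cd;
} cache[MAX_CACHED];

static size_t cached = 0;

/* Return an open handle for from -> to, calling iconv_open() only the
   first time a pair is requested. Returns (iconv_t)-1 if a name is too
   long, the cache is full, or iconv_open() fails. */
static iconv_t cached_conversion(const char *from, const char *to)
{
    for (size_t i = 0; i < cached; i++)
        if (!strcmp(cache[i].from, from) && !strcmp(cache[i].to, to))
            return cache[i].cd;

    if (cached == MAX_CACHED ||
        strlen(from) >= sizeof cache[0].from ||
        strlen(to) >= sizeof cache[0].to)
        return (iconv_t)-1;

    iconv_t cd = iconv_open(to, from);      /* note: target comes first */
    if (cd == (iconv_t)-1)
        return (iconv_t)-1;

    strcpy(cache[cached].from, from);
    strcpy(cache[cached].to, to);
    cache[cached].cd = cd;
    cached++;
    return cd;
}
```

A convCP(fromCP, toCP, instring) function could look its handle up here on every call, paying for a few string comparisons instead of a new iconv_open().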

I'd recommend you use a bit different interface, where you create the conversion first like this:
Code:

iconv_t  utf8_to_ibm852 = conversion("UTF-8",      "IBM852//TRANSLIT");
iconv_t  utf8_to_iso2  = conversion("UTF-8",      "ISO-8859-2//TRANSLIT");
iconv_t  ibm852_to_utf8 = conversion("IBM852",    "UTF-8");
iconv_t  ibm852_to_iso2 = conversion("IBM852",    "ISO-8859-2//TRANSLIT");
iconv_t  iso2_to_utf8  = conversion("ISO-8859-2", "UTF-8");
iconv_t  iso2_to_ibm852 = conversion("ISO-8859-2", "IBM852//TRANSLIT");

where the //TRANSLIT means transliterate unsupported characters to nearest equivalents. You use the conversion handles with a convert function like this:
Code:

char *new_ibm852_string = convert(utf8_to_ibm852, old_utf8_string);
which keeps the old string intact, returning a new dynamically allocated string.

Would this work for you? I'd be happy to show you the source for conversion() and convert().
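
A minimal sketch of what conversion() and convert() might look like. This is an assumption built on the interface shown above, not the actual source offered in the post; the output buffer is doubled whenever iconv() reports E2BIG, so long transliterations still fit:

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Open a handle converting from `from` to `to`.
   Note that iconv_open() itself takes the target charset first. */
static iconv_t conversion(const char *from, const char *to)
{
    return iconv_open(to, from);
}

/* Return a newly malloc()ed, converted copy of string, or NULL with
   errno set. The caller must free() the result. */
static char *convert(iconv_t cd, const char *string)
{
    if (cd == (iconv_t)-1 || !string) {
        errno = EINVAL;
        return NULL;
    }

    char   *src = (char *)string;       /* iconv() does not modify the input */
    size_t  src_left = strlen(string);
    size_t  size = 2 * src_left + 16;   /* first guess, grown on demand */
    size_t  used = 0;
    char   *out = malloc(size + 1);

    if (!out)
        return NULL;

    iconv(cd, NULL, NULL, NULL, NULL);  /* reset the handle's state */

    while (src_left > 0) {
        char   *dst = out + used;
        size_t  dst_left = size - used;
        size_t  r = iconv(cd, &src, &src_left, &dst, &dst_left);

        used = (size_t)(dst - out);
        if (r != (size_t)-1)
            break;                      /* all input consumed */
        if (errno != E2BIG) {           /* EILSEQ, EINVAL, ... */
            const int saved_errno = errno;
            free(out);
            errno = saved_errno;
            return NULL;
        }
        size *= 2;                      /* output full: double and go on */
        char *bigger = realloc(out, size + 1);
        if (!bigger) {
            free(out);
            errno = ENOMEM;
            return NULL;
        }
        out = bigger;
    }

    out[used] = '\0';
    return out;
}
```

The sketch only suits stateless single-byte and UTF-8 charsets like the ones in this thread; stateful encodings would need a final flush call as well.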

Quote:

Originally Posted by nime (Post 4279697)
... problem is, after I convert GTK text to IBM852 I have to convert my new 852 text to uppercase! But my system (CP1250) dont know what is uppercase in CP852 when we talk about letters like "čćđČĆĐ".

Normally I'd recommend using setlocale() and toupper()/tolower(), but since every glyph in IBM852 is just one byte, you can do the conversion trivially with a table. For example,
Code:

#include <stdlib.h>

const unsigned char uppercase_ibm852[256] = {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,
        0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,
        0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,
        0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,
        0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,
        0x60, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,
        0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f,
        0x80, 0x9a, 0x90, 0xb6, 0x8e, 0xde, 0x8f, 0x80, 0x9d, 0xd3, 0x8a, 0x8a, 0xd7, 0x8d, 0x8e, 0x8f,
        0x90, 0x91, 0x91, 0xe2, 0x99, 0x95, 0x95, 0x97, 0x97, 0x99, 0x9a, 0x9b, 0x9b, 0x9d, 0x9e, 0xac,
        0xb5, 0xd6, 0xe0, 0xe9, 0xa4, 0xa4, 0xa6, 0xa6, 0xa8, 0xa8, 0xaa, 0x8d, 0xac, 0xb8, 0xae, 0xaf,
        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbd, 0xbf,
        0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc6, 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf,
        0xd1, 0xd1, 0xd2, 0xd3, 0xd2, 0xd5, 0xd6, 0xd7, 0xb7, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf,
        0xe0, 0xe1, 0xe2, 0xe3, 0xe3, 0xd5, 0xe6, 0xe6, 0xe8, 0xe9, 0xe8, 0xeb, 0xed, 0xed, 0xdd, 0xef,
        0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0xfa, 0xeb, 0xfc, 0xfc, 0xfe, 0xff
};

const unsigned char lowercase_ibm852[256] = {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,
        0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,
        0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,
        0x40, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,
        0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,
        0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,
        0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f,
        0x87, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8b, 0x8b, 0x8c, 0xab, 0x84, 0x86,
        0x82, 0x92, 0x92, 0x93, 0x94, 0x96, 0x96, 0x98, 0x98, 0x94, 0x81, 0x9c, 0x9c, 0x88, 0x9e, 0x9f,
        0xa0, 0xa1, 0xa2, 0xa3, 0xa5, 0xa5, 0xa7, 0xa7, 0xa9, 0xa9, 0xaa, 0xab, 0x9f, 0xad, 0xae, 0xaf,
        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xa0, 0x83, 0xd8, 0xad, 0xb9, 0xba, 0xbb, 0xbc, 0xbe, 0xbe, 0xbf,
        0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc7, 0xc7, 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf,
        0xd0, 0xd0, 0xd4, 0x89, 0xd4, 0xe5, 0xa1, 0x8c, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xee, 0x85, 0xdf,
        0xa2, 0xe1, 0x93, 0xe4, 0xe4, 0xe5, 0xe7, 0xe7, 0xea, 0xa3, 0xea, 0xfb, 0xec, 0xec, 0xee, 0xef,
        0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0xfa, 0xfb, 0xfd, 0xfd, 0xfe, 0xff
};

static inline char *toupper_ibm852(char *const string)
{
    size_t  i;

    if (string)
        for (i = 0; string[i]; i++)
            string[i] = uppercase_ibm852[(unsigned char)string[i]];

    return string;
}

static inline char *tolower_ibm852(char *const string)
{
    size_t  i;

    if (string)
        for (i = 0; string[i]; i++)
            string[i] = lowercase_ibm852[(unsigned char)string[i]];

    return string;
}

Note that the functions toupper_ibm852() and tolower_ibm852() modify the string in place.
If you want to keep the original string intact, use
Code:

    new_upper = toupper_ibm852(strdup(old_ibm852_string));
    new_lower = tolower_ibm852(strdup(old_ibm852_string));
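
The same tables can also drive a case-insensitive comparison for searching, without copying the strings first. A minimal sketch; compare_folded is a hypothetical helper that takes any 256-entry folding table, such as uppercase_ibm852 above, and assumes the table maps only byte 0 to 0:

```c
#include <stddef.h>

/* Compare two single-byte strings after folding every byte through
   `fold`. Returns <0, 0 or >0 like strcmp(). The ordering follows the
   folded byte values, which is NOT alphabetical order for letters
   above 127. */
static int compare_folded(const unsigned char fold[256],
                          const char *a, const char *b)
{
    for (;; a++, b++) {
        unsigned char ca = fold[(unsigned char)*a];
        unsigned char cb = fold[(unsigned char)*b];

        /* Stop at the first difference or at the end of either string. */
        if (ca != cb || *a == '\0' || *b == '\0')
            return (ca > cb) - (ca < cb);
    }
}
```

For example, compare_folded(uppercase_ibm852, field, key) == 0 would match a record regardless of letter case.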


nime 03-05-2011 11:47 AM

Hello Nominal :)
I'm glad to "see" you again, with your impressive knowledge of clean C and your good will to help others!!

You helped me fix a huge problem with encodings, and thank you very much for that. Now I have 3 different encodings in my program, but every one of them is needed. I expect ISO-8859-2 will be needed when I write (or read) HTML reports, and who knows what else when I get to PDFs. With the principle you showed me, I have no doubt that I can respond to any further de/coding needs, even with my poor (beginner's) C knowledge.

Quote:

where the //TRANSLIT means transliterate unsupported characters to nearest equivalents.
I haven't seen any "unsupported" characters so far; everything is shown properly in all 3 of my encodings. But I work only with letters; maybe this applies to special signs like ? If not, maybe I can simply "forget" //TRANSLIT?

Quote:

The interface you showed, is a bit difficult: you need to choose between efficiency and code complexity. You see, if you create a new conversion handle for each string conversion, you incur quite a high overhead in iconv_open, slowing things down. You can avoid that by caching a number of conversion handles, but that makes the code a lot more complex.
I am sorry for that. This is because I still don't understand things the right way. I was thinking the VB way, like this:
Code:

char* fromCP = "IBM852";
char* toCP = "UTF-8";
iconv_t  onlyonehandle = conversion(fromCP, toCP);

Because after every conversion the handle is destroyed and iconv is closed. So, I think, the function can begin again with new parameters. But if this is not OK I will try your suggestions.

Quote:

... which keeps the old string intact, returning a new dynamically allocated string.
Would this work for you? I'd be happy to show you the source for conversion() and convert().
Yes, copied string is fine.
Well, I am a bit slow in C, so I need enough time to see what will happen and to try to understand this. For now I don't free any memory, because I only write snippets for testing; I know I should do that in real programs.

Quote:

Normally I'd recommend using setlocale() and toupper()/tolower(), but since every glyph in IBM852 is just one byte, you can do the conversion trivially with a table. For example...
Hmm, I think working with locales would be a better (easier) solution here, if I can get results with them. Maybe sometimes I will need to "sort by letters" in a "strange" locale, and the order in the code table does not match the order in the alphabet. For example, the letters in my alphabet go like this: "ABCČĆD(D)ĐEFGHIJ..."
So for these reasons a temporary change of locale can give better results (I think).
Additionally, the difference between uppercase 'Č' and lowercase 'č' is 32 in CP1250 (200, 232), just as for ASCII letters, but not in IBM852 (172, 159). So it is better to get away from doing this "by hand" as I used to do in DOS.
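
Sorting by the rules of a locale can be sketched with the standard strcoll() function. A minimal sketch; sort_for_locale and by_collation are hypothetical names, and the locale name passed in (for example "hr_HR.UTF-8") is an assumption that only works if that locale is installed and the strings are in its encoding:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings by the collation rules of the
   current LC_COLLATE locale. */
static int by_collation(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of strings alphabetically for the given locale.
   Returns 0 on success, or -1 if the locale is not available, in which
   case the array is left untouched. */
static int sort_for_locale(const char **items, size_t count,
                           const char *locale_name)
{
    if (!setlocale(LC_COLLATE, locale_name))
        return -1;
    qsort(items, count, sizeof items[0], by_collation);
    return 0;
}
```

IBM852 strings would have to be converted to the locale's encoding before sorting, and a program that changes the locale only temporarily should restore the previous one with another setlocale() call afterwards.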

Huh, now you give me enough work for next month :)

What can I say more than thank you again. You helped much more than I expected!


All times are GMT -5.