Letters and chars in Linux
Hello,
I am pretty new to C and to Linux, so I need advice, suggestions, or examples regarding encoding/decoding strings in general on Linux. I have many old binary data files, mostly written under DOS/QB45. If I want to read that data on Windows, I have to recode the strings (replace certain characters), and then the data looks right on Windows. The characters in question lie between ASCII 127 and 255. They are non-English letters like "čćšđžČĆŠĐŽ", which are common in my country. Every operation on such strings can count on one byte per letter. So the conversion looks like: Code:
char* ps = myString; Here, unlike the English letters, my characters seem to take more than one byte. Of course this is not good for my program, which has searching and many other things organized on the one-letter/one-byte principle. And of course I am in trouble now :) But this behaviour exists in many other languages too, so I am asking experienced Linux users and programmers for help: what can I do here so that I can use my nice and beloved old data with my C programs on Linux? The data is organized in packed binary structures and written to file with (say) VB. Thanx, nime. |
Hm, sounds like you would benefit from reading this... My guess is that Linux is using Unicode to store your characters. I'm not sure what the best course of action would be; if it were me, I would probably try to change my programs to deal with the multi-byte encoding, or at least find out a bit more about the issue (90% of my knowledge on the subject came from reading that article while trying to fix a character-encoding issue on a website). However, this may not be the most time-effective way of doing things. What exactly are you "converting" it to? You say you need to re-encode characters between 129 and 255, but in your program you change 209 to 208...
|
Hello Snark,
I have old data which I cannot change, because it lives in different places and because there is a lot of it. Originally this data was written in DOS with my code page, which is 852. The part of the converting program I showed recodes characters readable in DOS so they become readable in Windows, and this works OK. When I work in Windows I always recode the data back, so it stays readable from DOS too. This way I have the same data for two different OSes. My idea was to bring Linux into this story; that is mainly why I am now learning the C language. In DOS and Windows those specific characters lie in the range between 127 and 255, but not in the same places. This applies only to my country-specific letters "čćšđžČĆŠĐŽ"; the Latin characters are always in the same places. Now I have more interesting questions: how does Linux print to matrix devices like dot-matrix printers if it uses Unicode, which seems suitable only for graphical systems? How does Linux then interact with microcontrollers? I haven't found anything interesting at Oracle yet. But I have an idea, and because I am a C beginner I don't know if it is possible. For example, I can compose a string of single-byte and two-byte characters and read/change it in Linux (in a GTK textbox or similar), but can I afterwards recognize (by looping) which letters in the string are 2-byte, so that I can recode them back to the DOS single-byte form before I save to disk? The recoding is necessary because the data assumes strings with an explicit size (fixed-length strings), and other technical requirements of the system rest on that fact. What do you think? |
Well, if you read data from a file and save single bytes, then single bytes are stored in the file, regardless of whether it is Linux or Windows; the data in the files is the same. I think your problem is with the representation of this new file on the screen. Can you say how you know that:
Quote:
|
Hi eSelix,
I try like this: Code:
char* str = "Marco"; Code:
char* str = "Mačka"; I print the data to the console with printf, and all those letters are shown as a black question mark. I did additional tests: I looped through all codes from 32 to 255 and printed them to the console, and the codes above 127 also show only as a black "?". And more: I saved a text file on Windows (Notepad, plain text) with that data, and on Linux it is displayed without my characters. I can see the file properly after converting it in Geany or ooWriter. When I open the file in Geany without conversion, instead of those characters there are boxes with four (hex) digits. But please note, I want to convert my strings from a C program, not a whole text file. After all, my data is in a binary file where text and numbers are mixed and "packed". |
The editor in which you wrote this code is set to Unicode encoding, so the "Mačka" string is encoded in your program that way. Change to ISO/ASCII in the editor options. If you don't know how, say which editor you are using.
As various countries use various charsets, we now have a mess with strings. It is difficult for an editor to guess which charset is used in an opened file, so the user must do the conversion by hand. UTF-8 was created as a remedy for this situation. |
eSelix,
No need to guess, just read what I wrote. 1) I don't want to convert a text (or other) file, but strings in my C program. 2) I don't use an editor but a C program on Windows. I can read that Windows data on Linux, but those letters are not displayed. Is there any way my program can use this data, so that I do the conversion directly in my program, just for the strings I need and not for a whole file? Or, alternatively, can I force Linux so that my program uses the same codepage and character sets as Windows? The same problem must exist for German, French, Spanish, and many other languages, so I believe somebody faced this problem long before me. |
Quote:
|
Ah so, thanks Snarky,
I already did this (set to UTF-8) in my C editor (Code::Blocks), and it works well on both Windows and Linux. But this is a different kind of problem. I read more sources about this today, and it seems many people have it. Higher-level programming tools include those conversions, but C has gnulib, locale, and the whole internationalization concept, which I have to study. Anyway, if somebody has advice or experience with this, please share! |
On Linux, there is no need to modify your old programs. You can adjust your environment to use the same character encoding, IBM 852, that your old data files use.
If your old program is a command-line program (and does not use e.g. ncurses), use iconv to convert the output to readable form: Code:
./my-program ... | iconv -f IBM852 Code:
luit -encoding ibm-cp852 ./my-program ... Example iconv C helper code follows. ibm852_to_utf8() and utf8_to_ibm852() work just like strdup(), returning a duplicate of the given string, except they also do the conversion. Remember to free() any such strings when you're done with them. Also, call iconv_done() when you don't need the conversion (at least for a while), to free any internal tables used by iconv. It would be better if the code resized the newly created strings when necessary, instead of just allocating them heuristically. Or even better, if it used string pools (like Apache memory pools), so that you could discard all related strings at once and not worry about remembering to free each and every one. Code:
#include <stdlib.h> Hope this helps, Nominal Animal |
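The code box above survives only as its first include line. As a rough sketch of what such strdup-like iconv helpers might look like (this is a reconstruction of the idea, not the original listing; the buffer sizing is deliberately simple, as the post itself warns):

```c
#include <stdlib.h>
#include <string.h>
#include <iconv.h>

/* Convert 'src' from character set 'from' to 'to'.
 * Returns a newly malloc()ed string (free() it when done), or NULL
 * on error. The output buffer is sized heuristically: 4 bytes per
 * input byte is enough for any UTF-8 result of a single-byte set. */
static char *convert_dup(const char *from, const char *to, const char *src)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(src);
    size_t outsize = 4 * inleft + 4;
    char *out = malloc(outsize);
    if (!out) {
        iconv_close(cd);
        return NULL;
    }

    char *inptr = (char *)src;   /* POSIX iconv wants char**, not const */
    char *outptr = out;
    size_t outleft = outsize - 1;

    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }

    *outptr = '\0';
    iconv_close(cd);
    return out;
}

/* The two helpers from the post, expressed via convert_dup(). */
char *ibm852_to_utf8(const char *s) { return convert_dup("IBM852", "UTF-8", s); }
char *utf8_to_ibm852(const char *s) { return convert_dup("UTF-8", "IBM852//TRANSLIT", s); }
```

On glibc no extra library is needed; on some other systems you must link with -liconv.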
Thank you Nominal,
this sounds like the way to go. Also, thanks for all the remarks; I will understand what you are talking about with a little practice, and then I will post back results. So I see it can now go like this: - load a string from an 852 file, - duplicate it and convert it to UTF-8, - do some job (changes) with it, - convert it back to 852, - write it to disk, - free it from memory. This is the ideal scenario, because the same data, without a static conversion step, will also be good for the programs in DOS and Windows, which can all share the same files. That theoretically makes it possible to have 3 different OSes on a LAN working with the same data. Thanks again, Nominal; Windows, VB, and DOS don't need help, I solved those scenarios there a long time ago. |
Right, nime.
The only thing to watch for is the user supplying UTF-8 characters (say, Kanji or something) which cannot be converted to IBM-852. The //TRANSLIT flag tells iconv to do the closest possible conversion, which is very good, but you may need better memory handling in case the transliterations are much longer than the original characters. The example functions I gave are not very good, since they really should check whether the conversion needs a larger buffer. Also, you might consider using a configurable character set name (read it from a file, a command-line parameter, or an environment variable, defaulting to IBM852), and open the iconv handles (disk_to_utf and utf_to_disk) early in your program. That way you can handle already-converted files, too. (Iconv uses the standard character set mappings in your locale's charmaps/ directory. If you have already done some partial static character conversions, you can copy charmaps/IBM852 to, say, charmaps/IBM852dosfix and edit it, renaming it to IBM852dosfix inside the file, to account for those static conversions; the file has a simple textual format. Then just tell your program to use the IBM852dosfix charset instead, and you can work with those files painlessly too.) Happy to help, Nominal Animal |
Hello again
I see we have a background conversation here too, which further confirms how interesting this subject is for developers. I am probably the winner at trying to fix my own mistakes, with partial success :) One thing I'm sure of: if I weren't human, I would make far fewer mistakes. Which would not be good, because then I would have nothing to fix.
And most important, all my mistakes come as the result of hard work and positive thinking. So, last month I often tried to get the codepage conversion to work, and I finally got it yesterday, but only on Windows, using iconv.dll. But new situations appeared which I don't know how to solve. IBM852 to UTF-8 is fine for filling GTK textboxes, printing to the console (which is at 852), and so on, but it is not good for writing to a text file meant for viewing in Notepad (without a BOM). So, following Nominal's example, I made additional conversions from IBM852 to CP1250 and back, and now this is also fixed. The conversions work excellently, properly and fast. What it will be like on Linux I will see later. Calling the conversions looks like this: Code:
char* entry_text; So, I am thinking about replacing those functions with just one or two, and I tried, but my C is not as good as my English ;) I would like a function like this: Code:
(2) An additional problem, which I didn't think about before but need now: the main purpose of storing strings to file in the old DOS 852 format is direct addressing (one letter, one byte) and the ability to use simple C functions for searching and manipulating data, like strcmp, strstr, strcpy, and so on. But... the problem is, after I convert the GTK text to IBM852, I have to convert my new 852 text to uppercase! And my system (CP1250) doesn't know what uppercase is in CP852 when we talk about letters like "čćžšđČĆŽŠĐ". What to do here? Has anybody faced this problem before? |
Quote:
I'd recommend a slightly different interface, where you create the conversion first, like this: Code:
iconv_t utf8_to_ibm852 = conversion("UTF-8", "IBM852//TRANSLIT"); Code:
char *new_ibm852_string = convert(utf8_to_ibm852, old_utf8_string); Would this work for you? I'd be happy to show you the source for conversion() and convert(). Quote:
Code:
#include <stdlib.h> If you want to keep the original string intact, use Code:
new_upper = toupper_ibm852(strdup(old_ibm852_string)); |
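The toupper_ibm852() listing above is truncated, but the table-driven idea it describes can be sketched like this. Note this is only a skeleton: the table is filled with identity mappings plus the ASCII a-z pairs, and the CP852 positions of the accented letters must be filled in from the codepage table (I have not hardcoded them here, to avoid guessing wrong values):

```c
#include <stddef.h>

/* In-place uppercasing for single-byte IBM852 strings, table-driven.
 * Sketch only: besides ASCII a-z, the entries for the accented CP852
 * letters (č, ć, š, đ, ž, ...) still need to be filled in from the
 * codepage table. Unmapped bytes pass through unchanged. */
static unsigned char upper_tbl[256];

static void init_upper_tbl(void)
{
    for (int i = 0; i < 256; i++)
        upper_tbl[i] = (unsigned char)i;          /* identity by default */
    for (int i = 'a'; i <= 'z'; i++)
        upper_tbl[i] = (unsigned char)(i - 32);   /* ASCII a-z -> A-Z */
    /* TODO: upper_tbl[<CP852 code of č>] = <CP852 code of Č>; etc. */
}

/* Uppercase the string in place and return it, strupr-style. */
char *toupper_ibm852(char *s)
{
    for (unsigned char *p = (unsigned char *)s; *p; p++)
        *p = upper_tbl[*p];
    return s;
}
```

A full 256-entry const array, as in the original post, avoids even the init call and is thread-safe with no setup at all.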
Hello Nominal :)
I'm glad to "see" you again, with your impressive knowledge of clean C and your goodwill in helping others!! You helped me fix a huge problem with encodings, and thank you very much for that. Now I have 3 different encodings in my program, but every one of them is needed. I expect ISO-8859-2 will be needed when I write (or read) HTML reports, and who knows what when I take on PDFs. With the approach you showed me, I have no doubt that I can respond to any further de/coding needs, even with my poor (beginner's) C knowledge. Quote:
Code:
The interface you showed is a bit difficult: you need to choose between efficiency and code complexity. You see, if you create a new conversion handle for each string conversion, you incur quite a high overhead in iconv_open, slowing things down. You can avoid that by caching a number of conversion handles, but that makes the code a lot more complex. Code:
char* fromCP = "IBM852"; Quote:
Well, I am a bit slow in C, so I need enough time to see what happens and to try to understand it. Also, for now I don't free any memory, because I am only writing snippets for testing; I know I should do that in real programs. Quote:
So for these reasons a temporary locale change can give better results (I think). Additionally, the difference between lowercase 'č' and uppercase 'Č' is also 32 in CP1250 (200, 232), just like in ASCII, but not in IBM852 (179, 152). So it is better to run away from doing this "by hand" as I used to do in DOS. Huh, now you have given me enough work for the next month :) What more can I say than thank you again. You helped much more than I expected! |
Quote:
Here is a function that uses your original interface. Note that it may be slow, because it opens and closes the iconv handle for each string separately. As I said above, it is always good to append either //TRANSLIT or //IGNORE to the target character set name; otherwise the function will return NULL if there are inconvertible characters. Finally, this version grows and trims the result string dynamically to the exact length. It always allocates enough additional space for the rest of the input string, plus CONV_EXTRA bytes. If you want, you can set CONV_EXTRA to a larger value, so it initially allocates more memory. It will still optimize the size via a realloc() call, so there is very little harm in making CONV_EXTRA a bit larger, maybe 256 or 1024. Code:
#include <stdlib.h> Quote:
The locale setting is divided into multiple categories, so you can set, e.g., only the LC_COLLATE category for string collation. Here is an example which uses the locale hr_HR.IBM852 to compare two strings: Code:
#include <locale.h> In your code, you don't need to do that; you can just set the locale to whatever you happen to need. It is also local to the program, and will not change any system settings or anything, so you can use it in your program pretty freely. The strcoll function works just like strcmp, except it uses the LC_COLLATE locale category. Note that the tolower_ibm852() and toupper_ibm852() functions I listed earlier are both thread-safe and not dependent on the locale settings. If you need case-sensitive and case-insensitive IBM852 comparison functions (strcmp_ibm852() and strcasecmp_ibm852()), I can show them to you; the code is very much like tolower_ibm852() and toupper_ibm852(), except with two different tables. The difference between these hardcoded functions and the locale functions is that these are self-contained and do not depend on anything else. In fact, if you have issues getting iconv support working for all your target platforms, I could quite easily write hardcoded conversion, sorting, and case-changing functions for ISO-8859-2 and IBM852, with conversion to and from UTF-8 and ISO-8859-1, if you like. Hope this helps! |
Nominal,
I am truly amazed by your deep knowledge of this (very complicated) subject. Actually, I am ashamed to ask for any additional help, because I realize I am getting more than "reasonable" help from you for free. And I hope you see that the level of your examples and the breadth of your help go far beyond my ability to understand what I am doing. For now I don't want to know more about iconv conversions than my programs need, because my head is already a mess from all this; more information now can only do more damage. Developing CONV_EXTRA handling for my strings is also too much, especially now that I have proper letters in the console, in files, and in GTK entries. I can easily add functions for 8859-2 myself now. What more could I want? I will simply add //TRANSLIT if it "has to be" there. Thank you for pointing out the potential problems with changing locales; I understand the advantages of hardcoding when the codepage IBM852 is constant. So I tried to apply your example, but it won't work. I added the uppercase and lowercase arrays in a header file, with declarations for the functions, like this: Code:
const unsigned char uppercase_ibm852[256] = { Code:
// fstr is string from GTK textbox. Quote:
Please help me get this working. I also tried to reduce the code, so that I don't have the same function six or more times, using your recommendation: Code:
iconv_t utf8_to_ibm852 = conversion("UTF-8", "IBM852//TRANSLIT"); Quote:
For now I use the following function, which I picked up from the net. It works fast and nicely, but not on 852, so I must recode first; it works only on Windows. Code:
const char *instrnocase(const char *haystack, const char *needle) And now I am somewhat confused about changing locales... This is something that should be well tested before any decision. Thank you for the example of how to change just the essential part of the locale. |
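The instrnocase() code box above is truncated; here is a sketch of what such a case-insensitive substring search commonly looks like (my reconstruction, not the exact code from the net). It folds case with ASCII tolower(); to make it work on CP852 data, as the post wants, the tolower() calls would be replaced by a 256-entry lowercase-table lookup so the accented letters fold correctly too:

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive substring search: returns a pointer to the first
 * occurrence of 'needle' in 'haystack' ignoring ASCII case, or NULL
 * if not found. An empty needle matches at the start. */
const char *instrnocase(const char *haystack, const char *needle)
{
    if (!*needle)
        return haystack;

    for (; *haystack; haystack++) {
        const char *h = haystack;
        const char *n = needle;
        while (*h && *n &&
               tolower((unsigned char)*h) == tolower((unsigned char)*n)) {
            h++;
            n++;
        }
        if (!*n)               /* the whole needle matched */
            return haystack;
    }
    return NULL;               /* no match */
}
```

Swapping in a lowercase_ibm852[] table instead of tolower() gives the same function for CP852 strings, independent of the current locale.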
Quote:
As to the types:
Current C compilers almost always detect the above situation even without the help, so it is just a detail. However, I've found that I understand and remember the intent of functions better when the parameters are meticulously marked const. I've even caught a number of bugs that way. Quote:
toupper_ibm852 and tolower_ibm852 are self-standing functions; you can call them anytime you want, without any setup. They do not depend on any other code; everything they need is in that one code box (if you remove the static inline from it). Similarly, the convert function is self-standing; you only need the code I showed in that code box, and to link in iconv when compiling. If you use it, you can replace all the conversion calls you have right now with it. The only iconv_t you should see in all your source code would be the iconv_t handle line in the convert function. Quote:
|
I got your extra-excellent upper/lower 852 conversion to work by removing "static inline" and putting the array in the code file. Then I reorganized my program with a new header and code file just for the conversions, because I will need them often.
Now I have this: Code:
// fstr is UTF8 from Gtk textbox |
And now, here are my performance results!
I have a data file over 3 MB with 100,000 records, containing 43,761 filled rows with various data, written with QB45 and VB data structures (types) which know nothing about null termination, so I add it while reading the records. All the string data is in CP852! I have a GTK textbox for entering a search string. I convert this string from UTF-8 to 852 and then to uppercase 852, all with Nominal's functions. Then I read all 100,000 records, and where data exists I isolate and terminate all the strings, convert the name field (28 chars) to uppercase 852, and search for the first occurrence of the search string in it with the C strstr function. If the search string is found, I write the record to the console in CP852, to a Windows txt file in CP1250, and to a GTK textbox in UTF-8, with the following results: Quote:
Now I have added further codepage conversions, 12 of them in my program (everything to anything), and the total size of my program (exe) is a little less than 20 kB! Of course, none of this would have been possible without Nominal Animal's extremely helpful and selfless assistance, for which I am grateful for a lifetime! |
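The search loop described above (fixed-length records, a 28-character space-padded name field, uppercase-then-strstr) might be sketched like this. The record layout and field names here are made up for illustration, and the uppercasing is ASCII-only; in the real program a CP852 table-based toupper_ibm852() would be used instead:

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Illustrative fixed-length record, QB45/VB style: fields are a fixed
 * number of bytes, space-padded, with no null terminator on disk. */
#define NAME_LEN 28

struct record {
    char name[NAME_LEN];    /* fixed-length, space-padded name field */
    int  amount;            /* some other packed data */
};

/* Copy the fixed-length field into 'out' (at least NAME_LEN+1 bytes):
 * trim trailing spaces, null-terminate, and uppercase. ASCII-only
 * here; a CP852 uppercase table would replace toupper() so the
 * accented letters are handled too. */
static void field_to_upper(const char *field, char *out)
{
    int len = NAME_LEN;
    while (len > 0 && field[len - 1] == ' ')
        len--;
    for (int i = 0; i < len; i++)
        out[i] = (char)toupper((unsigned char)field[i]);
    out[len] = '\0';
}

/* Return how many records contain 'search' (already uppercased). */
int count_matches(const struct record *recs, int nrecs, const char *search)
{
    char name[NAME_LEN + 1];
    int hits = 0;
    for (int i = 0; i < nrecs; i++) {
        field_to_upper(recs[i].name, name);
        if (strstr(name, search) != NULL)
            hits++;
    }
    return hits;
}
```

In the real program the records would be fread() from the data file one by one (or in blocks) instead of sitting in an array, but the per-record terminate-uppercase-strstr step is the same.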
Nime, that sounds excellent! Less than a second to search is not long at all. I'm sure the users are happy.
I know you said you don't need any more code... but these functions let you work directly on the unterminated strings in your data structures. First, these two functions use the same uppercase_ibm852 array as before. They let you check whether the data contains a given substring; the first is case-sensitive, the second case-insensitive. If you supply a NULL pointer or a zero-length area, both functions simply return -1 (no match) without any problems. These should make your code even simpler, I think. Code:
/* Case sensitive substring search. Code:
char *convertdata(char const *const from, char const *const to, |
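Both code boxes above are truncated. As a sketch of the first idea, a substring search over an unterminated, counted-length buffer (my reconstruction of the described behaviour, not the original listing), returning -1 for NULL pointers or zero-length areas exactly as the post specifies:

```c
#include <stddef.h>

/* Case-sensitive substring search in an unterminated buffer.
 * Returns the offset of the first occurrence of needle[0..nlen)
 * inside data[0..dlen), or -1 if there is no match, if either
 * pointer is NULL, or if either length is zero. The case-insensitive
 * variant would be identical except that both bytes are passed
 * through an uppercase_ibm852[] table before comparison. */
long find_in_data(const char *data, size_t dlen,
                  const char *needle, size_t nlen)
{
    if (!data || !needle || dlen == 0 || nlen == 0 || nlen > dlen)
        return -1;

    for (size_t i = 0; i + nlen <= dlen; i++) {
        size_t j = 0;
        while (j < nlen && data[i + j] == needle[j])
            j++;
        if (j == nlen)
            return (long)i;    /* full needle matched at offset i */
    }
    return -1;                 /* no match */
}
```

Because the lengths are explicit, this works on the raw fixed-length record fields without first copying and null-terminating them.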
Thank you, Nominal, for the additional functions, but I can't get them to work, and probably I haven't tried hard enough. And why should I, when what we already have does an excellent job with good results.
But I am sure more than half of the world's programmers will be happy to find this material here! I searched for it for a long time: such mighty conversions (relatively simple to make, if someone like you helps enough), independent of M$ "cultures" and of the very slow classes in huge frameworks. I tested my program more tonight, and I see that freeing variables is very necessary; if I don't do it, the program becomes unstable and significantly slower. So I do the best I can. The program seems reliable now, but it needs more testing. I also tried to make a new project on Linux to read my data files, but something went wrong; the compiler returns an error (-ld 2 or so). But I will get to it as soon as possible, and I am sure it will give the same good results. I believe we have left nothing unfinished regarding the codepage conversions. So, thank you for everything once more, nime. |