How to make iconv skip invalid characters, or is there an iconv alternative?

x-stream · 07-24-2011, 06:10 AM

I'm going to convert a lot of text files from Unicode to the MS Windows encoding cp1251, but iconv fails because it stops converting when it reaches a character that does not exist in the Windows code page:

Code:

iconv: illegal input sequence at position ...

Is there a way to force iconv to continue converting and skip the invalid character, or is there any other CLI program for code page conversion? I remember there was a 'konvert' command many years ago in one (and as far as I remember it did not stop in this case), but I can't find any package providing this command...

David the H. · 07-24-2011, 06:37 AM

Assuming the GNU C library version of iconv, there's a -c option for omitting invalid characters. See the man page.

There are also two options you can add to the "to" encoding.

Code:

iconv -f CP1251 -t UTF-8//IGNORE file
# discards any unsupported characters.

iconv -f CP1251 -t UTF-8//TRANSLIT file
# attempts to substitute similar characters from the target encoding.

x-stream · 07-24-2011, 06:51 AM

Thanks a lot, I think //TRANSLIT is the best I could hope for.

Nominal Animal · 07-24-2011, 01:55 PM

Unicode (and therefore UTF-8 too) does have all cp1251 glyphs; the problem is that cp1251 does not define a glyph for code point 0x98 (152 in decimal). Using different fonts, you will see a different glyph for that byte, since there is no standard glyph defined for it.

You can see the same with almost all 8-bit character sets, since very few define all 256 possible glyphs. Even cp1252 (Windows "Western European" character set) has five undefined glyphs (0x81, 0x8D, 0x8F, 0x90, and 0x9D).

If you are using the iconv command, -t UTF-8//TRANSLIT will not help, since this is not a transliteration problem: the source glyph is undefined, so there is nothing to transliterate. Even -t UTF-8//IGNORE will often cause the command to return with an error (even if it does convert all of the input). In any case, both are nonstandard GNU extensions. Use

Code:

iconv -sc -t UTF-8 -f cp1251

instead, as the -s option silences warnings, and the -c option omits invalid characters from the output.
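For the direction asked about in the original question (UTF-8 to cp1251, many files), the same two options could be wrapped in a loop. This is only a minimal sketch: convert_all is a hypothetical helper name, and the *.txt pattern and the separate output directory are assumptions, not something from the posts above. Since the exit status with -c differs between iconv implementations, a nonzero status is only reported and the loop carries on.

```shell
#!/bin/sh
# Sketch only: convert_all, the *.txt pattern and the output directory
# layout are hypothetical choices for illustration.
# Usage: convert_all SOURCE_DIR TARGET_DIR
convert_all() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    failed=0
    for f in "$src"/*.txt; do
        [ -f "$f" ] || continue          # the glob matched nothing
        out=$dst/$(basename "$f")
        # -c omits characters cp1251 cannot represent, -s silences warnings
        if ! iconv -sc -f UTF-8 -t cp1251 < "$f" > "$out"; then
            echo "iconv reported a problem for $f" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if iconv never returned a nonzero status
    [ "$failed" -eq 0 ]
}
```

Calling it as convert_all ./utf8 ./cp1251 leaves the originals untouched, so the results can be inspected before anything is overwritten.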

Note that the POSIX specification for the iconv utility states that if the input contains invalid (or unmappable) characters, it will always be reflected in the exit status. That is insane, making it nearly useless for "bad" input. Fortunately, most iconv implementations do not do that; when -c is used, any transcoding problems are totally ignored.

In other words, the above works in practice, with exit value being nonzero only if a real error occurs. The POSIX standard differs a bit, stating that the command may return a nonzero exit value even if the conversion was successful, if there were any invalid or unmappable characters in the input.

If you need to be fully standards-compliant, you should first filter out the invalid bytes using e.g. tr, and then you can rely on the exit status:

Code:

# \230 is octal for byte 0x98, which cp1251 leaves undefined
if tr -d '\230' < file | iconv -t UTF-8 -f cp1251 > temporary-file ; then
    mv -f temporary-file file
else
    echo "Error reading file or writing to temporary-file" >&2
fi

In all cases above, I recommend you use an automatically deleted temporary directory for your temporary files. It is a very easy technique that makes sure you won't leave temporary files lying around. See the latter part of this post, for example. Please remember to properly quote your file and directory name variables to avoid problems.
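One way the temporary-directory idea might look is sketched below. It assumes mktemp -d is available (it is not part of POSIX, but common on Linux), and convert_cp1251_to_utf8 and the scratch file name "out" are hypothetical names used only for illustration. The trap removes the directory whenever the script exits, and all file name variables are quoted.

```shell
#!/bin/sh
# Sketch only: convert_cp1251_to_utf8 and the scratch file name "out" are
# hypothetical names chosen for illustration.

# Private scratch directory, removed automatically when the script exits.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT

convert_cp1251_to_utf8() {
    file=$1
    # The pipeline's status is iconv's alone, so check readability up front.
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Drop the undefined byte 0x98 (octal 230), then convert.
    if tr -d '\230' < "$file" | iconv -f cp1251 -t UTF-8 > "$workdir/out"; then
        mv -f "$workdir/out" "$file"
    else
        echo "conversion of $file failed" >&2
        return 1
    fi
}
```

It could then be called once per file, for example convert_cp1251_to_utf8 "notes.txt", where notes.txt stands for any cp1251 file.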

In case anybody is wondering, for Windows Western European (AKA cp1252), a standards-compliant way to do the conversion is

Code:

# octal for the five bytes cp1252 leaves undefined: 0x81 0x8D 0x8F 0x90 0x9D
if tr -d '\201\215\217\220\235' < file | iconv -t UTF-8 -f cp1252 > temporary-file ; then
    mv -f temporary-file file
else
    echo "Error reading file or writing to temporary-file" >&2
fi

If you were to use the iconv() function in your own program, it will return (size_t)-1 with errno==EILSEQ and the input pointer pointing to the first byte of the invalid sequence. In that case, just increase the input pointer by one (also decreasing the number of input bytes left) and retry, until it succeeds or there are no more bytes in the input buffer. That way you do not need to know the undefined glyphs beforehand or rely on GNU extensions, and you can even count the number of invalid bytes skipped in the input.

catch93 · 09-26-2011, 09:32 PM

We are migrating from UNIX to Linux, and we are using iconv to convert some international characters.
The UNIX version of the iconv command was
/usr/bin/iconv -f utf8 -t iso815
We converted it to
/usr/bin/iconv -f utf8 -t iso8895_15

We found that the man page for the UNIX version of iconv has this warning:
WARNINGS
If an input character does not have a valid equivalent in the code set
selected by the -t option (the "to" code set), it is mapped to the
"galley character", if it has been defined for that conversion. (see
genxlt(1) and iconv(3C) ).

The Linux version does not mention this, but we found the following options to suppress warnings and still continue the conversion:

/usr/bin/iconv -sc -f utf8 -t iso8895_15

Is that sufficient, or do we need to use another code page in our -t option?

I am moving to RHEL 5 on Linux.