Unicode (and therefore UTF-8 too) does have every cp1251 character; it is just that cp1251 does not define any character for byte 0x98 (152 in decimal, 230 in octal). With different fonts you will see a different glyph for that byte, since there is no standard glyph defined for it.
You can see the same with almost all 8-bit character sets, since very few define all 256 possible byte values. Even cp1252 (the Windows "Western European" character set) leaves five bytes undefined (0x81, 0x8D, 0x8F, 0x90, and 0x9D).
If you are using the iconv command, -t UTF-8//TRANSLIT will not help, since this is not a transliteration problem: the source character is undefined, so there is nothing to transliterate. Even the -t UTF-8//IGNORE option will often cause the command to return with an error (even if it does convert all of the input). In any case, both are nonstandard GNU extensions. Use
Code:
iconv -sc -t UTF-8 -f cp1251
instead, as the -s option silences warnings, and the -c option omits invalid characters from the output.
Note that the POSIX specification for the iconv utility states that the presence or absence of -c must not affect the exit status, so invalid (or unmappable) characters in the input are always reflected in it. That is insane, making it nearly useless for "bad" input. Fortunately, most iconv implementations do not do that; when -c is used, any transcoding problems are totally ignored.
In other words, the above works in practice, with the exit status being nonzero only if a real error occurs. A strictly POSIX-conforming iconv, however, may return a nonzero exit status even though the conversion completed, if there were any invalid or unmappable characters in the input.
If you need to be fully standards-compliant, you should first filter out the invalid bytes using e.g. tr, and then you can rely on the exit status:
Code:
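# A pipeline's exit status is that of its last command (iconv) only,
# so check separately that the input file is readable.
if [ -r file ] && tr -d '\230' < file | iconv -t UTF-8 -f cp1251 > temporary-file ; then
    mv -f temporary-file file
else
    echo "Error reading file or writing to temporary-file" >&2
fi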
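The '\230' given to tr above is simply byte 0x98 written in octal, which is the notation tr expects. As a rough sketch (octal_escape and delete_set are helper names made up for this post, not standard commands), the escapes can be generated from the hex values instead of being converted by hand:

```shell
#!/bin/sh
# Hypothetical helper: print the tr-style octal escape for one hex byte value.
# Example: octal_escape 98 prints \230
octal_escape() {
    printf '\\%03o\n' "0x$1"
}

# Hypothetical helper: build a tr delete set from several hex byte values.
delete_set() {
    set_str=
    for hex in "$@"; do
        set_str=$set_str$(octal_escape "$hex")
    done
    printf '%s\n' "$set_str"
}

# Demonstration on three bytes: 'A', 0x98, 'B'.
# LC_ALL=C keeps tr strictly byte-oriented; od shows the two
# remaining bytes, 41 and 42, in hex.
printf 'A\230B' | LC_ALL=C tr -d "$(delete_set 98)" | od -An -tx1
```

For cp1252, delete_set 81 8D 8F 90 9D yields the set for the five undefined bytes listed earlier.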
In all cases above, I recommend you use an automatically deleted temporary directory for your temporary files. It is a very easy technique that makes sure you won't leave temporary files lying around. See the latter part of this post, for example. Please remember to properly quote your file and directory name variables to avoid problems.
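A minimal sketch of the temporary directory technique, assuming mktemp is available (it is very common, but not part of POSIX). The helper name replace_via_filter is made up for illustration; it replaces a file with the output of a filter command only if that command succeeds.

```shell
#!/bin/sh
# Create a private working directory (mktemp is widespread, but not in POSIX).
workdir=$(mktemp -d) || exit 1

# Remove the directory whenever the shell exits; turn common signals
# into a normal exit so the EXIT trap also runs in those cases.
trap 'rm -rf "$workdir"' EXIT
trap 'exit 1' HUP INT TERM

# Hypothetical helper: run the command given after the first argument
# with "$1" as its standard input, and replace "$1" with the output
# only if the command reports success.
replace_via_filter() {
    target=$1
    shift
    if "$@" < "$target" > "$workdir/output"; then
        mv -f "$workdir/output" "$target"
    else
        echo "Error converting $target; file left untouched" >&2
        return 1
    fi
}
```

With that in place, the conversion becomes something like replace_via_filter "$file" iconv -sc -f cp1251 -t UTF-8, with the variable properly quoted.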
In case somebody is wondering, for Windows Western European (AKA cp1252), a standards-compliant way to do the conversion is
Code:
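# A pipeline's exit status is that of its last command (iconv) only,
# so check separately that the input file is readable.
if [ -r file ] && tr -d '\201\215\217\220\235' < file | iconv -t UTF-8 -f cp1252 > temporary-file ; then
    mv -f temporary-file file
else
    echo "Error reading file or writing to temporary-file" >&2
fi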
If you were to use the iconv() function in your own program, it will return (size_t)-1 with errno == EILSEQ when it hits an invalid input sequence, with the input pointer pointing to the first byte of that sequence. In that case, just increase the input pointer by one (also decreasing the number of input bytes left) and retry, until it succeeds or there are no more bytes in the input buffer. That way you do not need to know the undefined bytes beforehand or rely on GNU extensions, and you can even count the number of invalid bytes skipped in the input.