LinuxQuestions.org
Old 07-24-2011, 06:10 AM   #1
x-stream
LQ Newbie
 
Registered: Jul 2011
Posts: 8

Rep: Reputation: Disabled
How to make iconv skip incorrect symbols, or is there an iconv alternative?


I'm going to convert a lot of text files from Unicode to the MS Windows encoding cp-1251, but iconv fails: it stops converting when it reaches a symbol that does not exist in the Windows code page:

Code:
iconv: illegal input sequence at position ...
Is there a way to force iconv to continue converting and skip the incorrect symbol, or is there any other CLI program for code page conversion? I remember there was a 'konvert' command many years ago in one (and as far as I remember it did not stop in this case), but I can't find any package providing this command...
 
Old 07-24-2011, 06:37 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Assuming the GNU C library version of iconv, there's a -c option for omitting invalid characters. See the man page.

There are also two options you can add to the "to" encoding.
Code:
iconv -f CP-1251 -t UTF-8//IGNORE file
#discards any unsupported characters.

iconv -f CP-1251 -t UTF-8//TRANSLIT file
#attempts to substitute similar characters from the target encoding.
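
The original question goes in the other direction (Unicode to cp1251), and the //TRANSLIT and //IGNORE suffixes act on the target encoding, so for that case they would go on the cp1251 side. A minimal sketch, assuming UTF-8 input and GNU iconv for the //TRANSLIT variant; the function names are made up for illustration:

```shell
#!/bin/sh
# Both helpers read UTF-8 on stdin and write cp1251 on stdout.
# The function names are hypothetical, chosen only for this example.

# GNU-specific: try to approximate characters that cp1251 lacks.
utf8_to_cp1251_translit() {
    iconv -f UTF-8 -t CP1251//TRANSLIT
}

# Omit characters that cp1251 cannot represent (the -c option).
utf8_to_cp1251_drop() {
    iconv -c -f UTF-8 -t CP1251
}
```

Usage would look like `utf8_to_cp1251_translit < input.txt > output.txt`, where both file names are placeholders.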
 
Old 07-24-2011, 06:51 AM   #3
x-stream
LQ Newbie
 
Registered: Jul 2011
Posts: 8

Original Poster
Rep: Reputation: Disabled
Thanks a lot, I think //TRANSLIT is the best I could hope for.
 
Old 07-24-2011, 01:55 PM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Unicode (and therefore UTF-8 too) does have all cp1251 glyphs; it is just that cp1251 does not define a glyph for byte value 0x98 (152 in decimal, 230 in octal). With different fonts you will see different glyphs for that byte, since there is no standard glyph defined for it.

You can see the same with almost all 8-bit character sets, since very few define all 256 possible glyphs. Even cp1252 (Windows "Western European" character set) has five undefined glyphs (0x81, 0x8D, 0x8F, 0x90, and 0x9D).

If you are using the iconv command, -t UTF-8//TRANSLIT will not help, since this is not a transliteration problem: the source glyph is undefined, so there is nothing to transliterate. Even -t UTF-8//IGNORE will often cause the command to return with an error (even if it does convert all of the input). In any case, both are nonstandard GNU extensions. Use
Code:
iconv -sc -t UTF-8 -f cp1251
instead: the -s option silences warnings, and the -c option omits invalid characters from the output.

Note that the POSIX specification for the iconv utility states that if the input contains invalid (or unmappable) characters, it will always be reflected in the exit status. That is insane, making it nearly useless for "bad" input. Fortunately, most iconv implementations do not do that; when -c is used, any transcoding problems are totally ignored.

In other words, the above works in practice, with exit value being nonzero only if a real error occurs. The POSIX standard differs a bit, stating that the command may return a nonzero exit value even if the conversion was successful, if there were any invalid or unmappable characters in the input.
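
Since implementations differ on this point, it may be easiest to check what the local iconv actually does. A small sketch, assuming an iconv that knows the cp1251 name; the function name is hypothetical. It feeds one undefined byte (0x98) through iconv -sc and reports the output and the exit status:

```shell
#!/bin/sh
# Hypothetical probe: push one undefined cp1251 byte (0x98, octal 230)
# through "iconv -sc" and show what comes out and what status it returns.
probe_iconv_c() {
    status=0
    # The "||" keeps the probe running even under "set -e".
    out=$(printf 'ok\230ok\n' | iconv -sc -f cp1251 -t UTF-8) || status=$?
    printf 'output: %s\n' "$out"
    printf 'exit status: %s\n' "$status"
}
```

Run `probe_iconv_c` on the target system; if the reported status is nonzero even though the text was converted, the exit status cannot be used to detect real errors when -c is in effect.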

If you need to be fully standards-compliant, you should first filter out the invalid bytes using e.g. tr, and then you can rely on the exit status:
Code:
if tr -d '\230' < file | iconv -t UTF-8 -f cp1251 > temporary-file ; then
    mv -f temporary-file file
else
    echo "Error reading input-file or writing to temporary-file" >&2
fi
In all cases above, I recommend you use an automatically deleted temporary directory for your temporary files. It is a very easy technique that makes sure you won't leave temporary files lying around. See the latter part of this post, for example. Please remember to properly quote your file and directory name variables to avoid problems.
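
One way the temporary-directory idea could look is sketched below. This is only an illustration, not the code from the linked post: the function name and its arguments are made up, and it assumes mktemp -d is available (it is common but not required by POSIX). The function body runs in a subshell, so the EXIT trap removes the directory as soon as the function finishes, whether or not the conversion succeeded.

```shell
#!/bin/sh
# Hypothetical helper: convert FILE in place, staging the output in a
# private temporary directory that is removed when the subshell exits.
# Usage: convert_in_place FILE FROM TO OCTAL_BYTES_TO_DROP
convert_in_place() (
    file=$1 from=$2 to=$3 drop=$4

    # Refuse unreadable input up front: the pipeline status below is
    # iconv's alone, so a failed read by tr would otherwise go unnoticed.
    [ -r "$file" ] || { echo "cannot read $file" >&2; exit 1; }

    tmpdir=$(mktemp -d) || exit 1
    # The function body is a subshell, so this trap fires when it exits.
    trap 'rm -rf "$tmpdir"' EXIT

    # LC_ALL=C makes tr treat the input as plain bytes.
    if LC_ALL=C tr -d "$drop" < "$file" | iconv -f "$from" -t "$to" > "$tmpdir/out"; then
        mv -f "$tmpdir/out" "$file"
    else
        echo "conversion of $file failed; original left untouched" >&2
        exit 1
    fi
)
```

For the cp1251 case above, a call would be `convert_in_place "$name" cp1251 UTF-8 '\230'`, with the file name variable quoted as recommended.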

In case somebody is wondering, for Windows Western European (AKA cp1252), a standards-compliant way to do the conversion is
Code:
if tr -d '\201\215\217\220\235' < file | iconv -t UTF-8 -f cp1252 > temporary-file ; then
    mv -f temporary-file file
else
    echo "Error reading input-file or writing to temporary-file" >&2
fi
If you use the iconv() function in your own program, it will return (size_t)-1 with errno == EILSEQ and the input pointer pointing to the first byte of the invalid sequence. In that case, just advance the input pointer by one (also decreasing the number of input bytes left by one) and retry, until it succeeds or there are no more bytes in the input buffer. That way you neither need to know the undefined glyphs beforehand nor rely on GNU extensions, and you can even count the number of invalid bytes skipped in the input.
 
Old 09-26-2011, 09:32 PM   #5
catch93
LQ Newbie
 
Registered: Sep 2011
Posts: 5

Rep: Reputation: Disabled
/usr/bin/iconv: illegal input sequence at position

We are migrating from UNIX to Linux, and we use iconv to convert some international characters.
The UNIX version of the iconv command was
/usr/bin/iconv -f utf8 -t iso815
We converted it to
/usr/bin/iconv -f utf8 -t iso8895_15

We found that the man page for the UNIX version of iconv has this warning:
WARNINGS
If an input character does not have a valid equivalent in the code set
selected by the -t option (the "to" code set), it is mapped to the
"galley character", if it has been defined for that conversion. (see
genxlt(1) and iconv(3C) ).

The Linux version does not mention this, but we found the following options to suppress warnings and still continue the conversion:

/usr/bin/iconv -sc -f utf8 -t iso8895_15

Is that sufficient, or do we need to use another code page in our -t option?

I am moving to RHEL5 on Linux.
 
  

