iconv usage query
I have a UTF-8 file that I wish to process as ISO-8859-1.
Discounting for a moment the fact that 'characters' may be single- or multi-byte, each record in this sample file has a fixed, predetermined format defined by character offsets, e.g.

Field 1 Alphanumeric position 1, length 30
Field 2 Alphanumeric position 31, length 20
etc.

So when reading in the data, the first 30 characters relate to Field 1, and Field 2 starts at the 31st character. Assume for the sake of argument that all characters are single-byte in this example.

When I use iconv to convert this UTF-8 file to ISO-8859-1, it drops whatever it is unable to convert. The net result is that Field 1 may now contain 29 characters and Field 2 may start at the 30th character, not the 31st. For example, EXAMPLE may become EXMPLE (if, for argument's sake, the 'A' were a character iconv was unable to convert).

Can I switch this behaviour off so that EXAMPLE becomes EX MPLE, with invalid or non-convertible characters set to a placeholder character, say a space? Or maybe there is a combination of commands that can be used to similar effect, preserving a file's fixed position-based structure while still converting its character encoding?

Thanks.
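One way to get the EX MPLE behaviour without relying on a particular iconv build is to do the conversion in a short script. The following is a minimal Python sketch, not an iconv feature: the function name `to_latin1_padded` and its `filler` parameter are made up for illustration, and it assumes one Unicode code point corresponds to one character position in the record layout.

```python
import unicodedata


def to_latin1_padded(text: str, filler: str = " ") -> bytes:
    """Encode text as ISO-8859-1, writing `filler` in place of every
    code point that has no ISO-8859-1 equivalent."""
    # Compose sequences such as 'e' + combining acute accent into a single
    # code point first, so they occupy one position instead of two.
    text = unicodedata.normalize("NFC", text)
    # ISO-8859-1 covers exactly the code points U+0000 to U+00FF.
    safe = "".join(ch if ord(ch) <= 0xFF else filler for ch in text)
    return safe.encode("iso-8859-1")


# U+0100 (A with macron) stands in for the non-convertible 'A'.
record = "EX\u0100MPLE"
print(to_latin1_padded(record))  # b'EX MPLE'
```

Because every input code point produces exactly one output byte, the field offsets in the converted record stay where the layout expects them.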
"man iconv" has a section that says this:
Quote:
|
Hi, welcome to LQ!
Quote:
resolution of that issue, there's a "piconv" command (iconv re-invented in Perl) that you may be able to modify to do your bidding...
Cheers, Tink
Quote:
'man iconv' in Slackware 12.1 and RH AS5 doesn't have those sections ... which distro are you looking it up in?
Cheers, Tink
Quote:
Code:
~/$ iconv --version
Quote:
|
Slack:
Code:
$ iconv --version
Code:
~> iconv --version
Heh ... never mind, if Debian's man page describes the proper behaviour it's all for the better ;}
Cheers, Tink
Thanks for the tips. I'll give each a go, see what I come up with, and report back.
Thanks again.
Thanks for your guidance. I found that on Solaris, with its default behaviour (i.e. no switches), iconv replaces each non-convertible character with a question mark (?, 0x3F). This preserves the layout from a fixed file definition perspective, and is therefore manageable for what I need to do. I can cope with a question mark; in this particular case it is much better than the character being silently dropped from the output.
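For anyone whose iconv drops characters instead, the same question-mark substitution can be reproduced in a script. This is a small Python sketch rather than anything iconv-specific. It assumes one code point per record position, uses the field layout from the original question, and the field values are invented for illustration.

```python
# Python's built-in "replace" error handler writes '?' (0x3F) for every
# code point the target encoding cannot represent.
field1 = "EX\u0100MPLE".ljust(30)  # positions 1-30; U+0100 is not in ISO-8859-1
field2 = "SECOND".ljust(20)        # positions 31-50
record = field1 + field2

converted = record.encode("iso-8859-1", errors="replace")

print(len(converted))     # same number of bytes as input characters
print(converted[0:30])    # Field 1, with '?' where U+0100 was
print(converted[30:50])   # Field 2 still starts at position 31 (offset 30)
```

The same `errors="replace"` argument is accepted by the built-in `open()` when writing a file with `encoding="iso-8859-1"`, so a whole file can be converted line by line in the same way.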