[SOLVED] How can I remove these weird characters from my text file?

sopier · 12-21-2011, 03:32 AM

I have a text file that contains some weird characters such as this one (I have to use an image since I can't print the character here):

http://www.mp3210.com/image.png

or this:

http://www.mp3210.com/question.png

I tried this command using sed:

Code:

sed "s/[^a-zA-Z0-9]/ /g"

But it has had no effect so far.

Doc CPU · 12-21-2011, 04:31 AM

Hi there,

Quote:

Originally Posted by sopier

I have a text file that contains some weird characters such as this one (I have to use an image since I can't print the character here):

http://www.mp3210.com/image.png
http://www.mp3210.com/question.png

You seem to have a character encoding issue.

In your first sample, do the small digits read 0096? It's hard to see in the image. But yes, I guess it's 0096.
Very obviously, your editor assumes that the text is in UTF-8, while it is actually something like Windows-1252. The first sample shows that the text stream contains the code 0096h. That code seems to be validly encoded as UTF-8, but 0096h is a C1 control code with no printable character assigned, so the editor shows its hex digits instead. (In Windows-1252, the byte 96h is an en dash.)
The second sample shows a code that is invalid in UTF-8 and is therefore displayed as a replacement character (question mark).
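To make the two cases concrete, here is a minimal Python sketch of how the single byte 96h comes out under different assumed encodings. The sample bytes are made up for illustration; they are not taken from the file in question.

```python
import unicodedata

raw = b"A \x96 B"  # made-up sample containing the single byte 0x96

# Windows-1252 assigns 0x96 to the en dash (U+2013).
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps 0x96 straight to U+0096, a C1 control code with no glyph.
as_latin1 = raw.decode("latin-1")
control_category = unicodedata.category(as_latin1[2])  # "Cc" means control

# A lone 0x96 is not valid UTF-8; a lenient decoder substitutes U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# U+0096 itself can be stored as valid UTF-8 (bytes C2 96), which is the
# kind of code an editor may render as a small box with hex digits.
u0096_bytes = "\u0096".encode("utf-8")

print(repr(as_cp1252), repr(as_latin1), repr(as_utf8), u0096_bytes)
```

Only the Windows-1252 reading yields a printable character here, which fits the guess about the source encoding.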

Depending on how the text is created and how it is processed:

Make sure all processing stages use the same character encoding
Where possible, specify the encoding explicitly (e.g. in an HTTP header, or by supplying a BOM in a text file, though the BOM can cause other problems)
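As a rough illustration of both points (stating the encoding explicitly, and the BOM causing trouble), here is a small Python sketch. The file name and the sample text are hypothetical; the file is written to a temporary directory.

```python
import os
import tempfile

text = "price: 5 \u2013 10"  # made-up sample containing an en dash

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The writer states its encoding explicitly; "utf-8-sig" also prepends a BOM.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# A reader that expects plain UTF-8 keeps the BOM as a stray U+FEFF ...
with open(path, encoding="utf-8") as f:
    plain_read = f.read()

# ... while a BOM-aware reader strips it.
with open(path, encoding="utf-8-sig") as f:
    bom_aware_read = f.read()

print(has_bom, repr(plain_read), repr(bom_aware_read))
```

The stray U+FEFF in the plain read is one example of the "other problems" a BOM can cause when a later stage does not expect it.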

[X] Doc CPU

malekmustaq · 12-21-2011, 05:13 AM

Set char encoding first before issuing the command.

David the H. · 12-21-2011, 06:01 AM

Try this:

Install the program uchardet.

Run "uchardet filename" to determine (hopefully) what encoding was used.

Then use iconv to convert that encoding to utf-8:

Code:

iconv -f <old-encoding> -t UTF-8 filename > newfile

Hopefully the file will now be fully readable.
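If iconv is not at hand, the same conversion can be sketched in Python. This is only an approximation of what iconv does; the function name convert_to_utf8 and the sample bytes are made up for illustration.

```python
def convert_to_utf8(data: bytes, old_encoding: str) -> bytes:
    """Re-encode bytes from old_encoding to UTF-8."""
    # Strict decoding raises UnicodeDecodeError when a byte is not valid
    # in old_encoding, so a wrong guess tends to fail loudly.
    return data.decode(old_encoding).encode("utf-8")


# Made-up Windows-1252 sample: an en dash (0x96) and curly quotes (0x93, 0x94).
sample = b"2010\x962011 \x93quoted\x94"
converted = convert_to_utf8(sample, "cp1252")
print(converted.decode("utf-8"))
```

For a real file, read it in binary mode ("rb"), pass the bytes through the function, and write the result to a new file in binary mode ("wb"), so the original stays untouched.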

Use "iconv -l" to list out all the supported encodings, so as to put the <old-encoding> string in the proper form. In my experience, most text files I've found on the web have been in ISO-8859-1, CP-1252, UTF-16/UCS-2, or UTF-32. You may encounter others if you deal with many different languages, particularly one of the other ISO-8859 variations.

Note also that plain ASCII is fully compatible with UTF-8, so there's no need to convert those files.

BTW: There's also a Python script called chardet (python-chardet), which does pretty much the same job. But in my experience it doesn't make very reliable guesses. If its output says anything less than "confidence: 1.00", don't trust it. Open up the file in an editor that's capable of changing the display encoding on the fly, such as kwrite, and check it manually. In particular, it often seems to misdetect CP-1252 as ISO-8859-2.
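The manual check can also be approximated without an editor. The following Python sketch decodes the same bytes under a few candidate encodings so the results can be compared by eye. The helper names, the candidate list, and the sample bytes are all made up, and this is no substitute for a real detector.

```python
def decode_candidates(data, encodings=("utf-8", "cp1252", "iso-8859-1", "iso-8859-2")):
    """Return {encoding: decoded text, or None if the bytes are invalid in it}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


def has_c1_bytes(data):
    """True if any byte lies in 0x80-0x9F, which are control codes in ISO-8859-*."""
    return any(0x80 <= b <= 0x9F for b in data)


sample = b"na\xefve \x96 test"  # made-up bytes
for enc, text in decode_candidates(sample).items():
    print(f"{enc:12} {text!r}")
print("bytes in 0x80-0x9F:", has_c1_bytes(sample))
```

As a rule of thumb, bytes in the range 0x80-0x9F that are clearly meant as printable characters point towards CP-1252, because the ISO-8859 family only has control codes there.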

I've only just discovered uchardet, so I don't really know how reliable it is yet, but a few quick tests seem to indicate that it does a better job.

For that matter, even the venerable file command makes some attempt at detecting the encoding, but it's even less reliable.

Finally, also be aware that files created on Microsoft platforms generally have DOS-style line endings. There are several different solutions available for converting them to Unix-style, which I'll leave up to you to discover.
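One possible solution among several, as a minimal Python sketch. The function name dos_to_unix is made up, and the sketch assumes an ASCII-compatible encoding such as UTF-8 or CP-1252, so run it after converting away from UTF-16 or UTF-32.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace CRLF line endings with LF, leaving lone CR bytes alone."""
    return data.replace(b"\r\n", b"\n")


sample = b"first line\r\nsecond line\r\n"  # made-up DOS-style text
print(dos_to_unix(sample))
```

Working on raw bytes avoids any newline translation by the file layer; open the input with "rb" and the output with "wb".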

sopier · 12-21-2011, 06:15 AM

uchardet solved the problem. When I ran the command, it reported "windows-1252", and I converted the file to UTF-8 using iconv. Solved, thanks!

Download uchardet for Ubuntu:
http://mirror01.th.ifl.net/ubuntu/po...se/u/uchardet/