LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 12-21-2011, 03:32 AM   #1
sopier
Member
 
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33

Rep: Reputation: Disabled
How can I remove these weird chars from my text file?


I have a text file which contains some weird characters such as these (I have to use images since I can't print the characters here):

http://www.mp3210.com/image.png

or this:

http://www.mp3210.com/question.png

I tried this command using sed:

Code:
sed "s/[^a-zA-Z0-9]/ /g"
But no results so far....
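(For reference: if the goal were just to strip every non-ASCII byte rather than fix the encoding, a minimal sketch using tr under the C locale; input.txt is a made-up sample file, not from the thread:)

```shell
# Create a sample file containing a stray 96h byte (hypothetical example)
printf 'hello\226world\n' > input.txt
# Force the C locale and delete every byte outside printable ASCII
# (keeping tab and newline)
LC_ALL=C tr -cd '\11\12\40-\176' < input.txt > cleaned.txt
cat cleaned.txt    # prints "helloworld"
```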
 
Old 12-21-2011, 04:31 AM   #2
Doc CPU
Senior Member
 
Registered: Jun 2011
Location: Stuttgart, Germany
Distribution: Mint, Debian, Gentoo, Win 2k/XP
Posts: 1,099

Rep: Reputation: 344
Hi there,

Quote:
Originally Posted by sopier View Post
I have a text which contained some weird character such as (I have to use image since I can't print the character here):

http://www.mp3210.com/image.png
http://www.mp3210.com/question.png
You seem to have a character-encoding issue.

In your first sample, do the small digits read 0096? It's hard to tell from the image, but I'd guess they do.
Very likely your editor assumes the text is UTF-8, while it is actually something like Windows-1252. The first sample shows a character code 0096h in the text stream; no printable character is assigned to that code point, so the editor shows the hex digits instead.
The second sample shows a byte sequence that is invalid in UTF-8 and is therefore displayed as a replacement character (the question mark).

Depending on how the text is created and how it is processed:
  • Make sure all processing stages use the same character encoding
  • Where possible, specify the encoding explicitly (e.g. in a HTTP header, or by supplying a BOM in a text file, though the BOM can cause other problems)
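The mismatch can be demonstrated at the shell, assuming iconv is available; sample.txt is a made-up one-byte example:

```shell
# Byte 96h is an en dash in Windows-1252, but is not valid on its own in UTF-8
printf 'a\226b\n' > sample.txt
# Decoding as Windows-1252 succeeds and yields an en dash
iconv -f CP1252 -t UTF-8 sample.txt
# Decoding as UTF-8 fails on the stray byte
iconv -f UTF-8 -t UTF-8 sample.txt || echo 'not valid UTF-8'
```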

[X] Doc CPU
 
Old 12-21-2011, 05:13 AM   #3
malekmustaq
Senior Member
 
Registered: Dec 2008
Location: root
Distribution: Slackware & BSD
Posts: 1,667

Rep: Reputation: 494
Set char encoding first before issuing the command.
 
Old 12-21-2011, 06:01 AM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian + kde 4 / 5
Posts: 6,846

Rep: Reputation: 2008
Try this:

Install the program uchardet.

Run "uchardet filename" to determine (hopefully) what encoding was used.

Then use iconv to convert that encoding to utf-8:

Code:
iconv -f <old-encoding> -t UTF-8 filename > newfile
Hopefully the file will now be fully readable.

Use "iconv -l" to list out all the supported encodings, so as to put the <old-encoding> string in the proper form. In my experience, most text files I've found on the web have been in ISO-8859-1, CP-1252, UTF-16/UCS-2, or UTF-32. You may encounter others if you deal with many different languages, particularly one of the other ISO-8859 variations.
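For example, to find the exact spelling this particular iconv expects for a Windows-1252-style encoding (name lists vary between iconv implementations, so treat this as a sketch):

```shell
# List supported encodings, one per line, and filter for 1252 variants
iconv -l | tr ', ' '\n' | grep -i 1252
```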

Note also that plain ASCII is fully compatible with UTF-8, so there's no need to convert those files.

BTW: There's also a python script called chardet (python-chardet), which does pretty much the same job. But in my experience it doesn't make very reliable guesses. If its output says anything less than "confidence: 1.00", don't trust it. Open up the file in an editor that's capable of changing the display encoding on the fly, such as kwrite, and check it manually. In particular it appears to often mis-detect CP-1252 as ISO-8859-2.

I've only just discovered uchardet, so I don't really know how reliable it is yet, but a few quick tests seem to indicate that it does a better job.

For that matter, even the venerable file command makes some attempt at detecting the encoding, but it's even less reliable.

Finally, also be aware that files created on Microsoft platforms generally have dos-style line-endings. There are several different solutions available for converting them to unix-style, which I'll leave up to you to discover.
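(One minimal sketch for the line-ending part, for reference; dosfile.txt is a hypothetical example:)

```shell
# Create a sample file with DOS (CRLF) line endings
printf 'line1\r\nline2\r\n' > dosfile.txt
# Convert to Unix (LF) endings by deleting the carriage returns
tr -d '\r' < dosfile.txt > unixfile.txt
```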
 
Old 12-21-2011, 06:15 AM   #5
sopier
Member
 
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33

Original Poster
Rep: Reputation: Disabled
uchardet solved the problem. When I ran that command it said "windows-1252", and I converted the file to UTF-8 using iconv.. solved... thanks...

Download uchardet for ubuntu:
http://mirror01.th.ifl.net/ubuntu/po...se/u/uchardet/
 
  


