LinuxQuestions.org
10-11-2009, 07:07 PM   #1
PeteD
How to locate non-ASCII characters in large files (and then remove them)


Hi all

I have about a dozen text files that I am using to prepare a book in Linux; my co-author is working in Windows, and we are using svn to collaborate. (In case it matters: these are LaTeX files, which are then used to generate a PDF.) *He* reports numerous LaTeX warnings about non-ASCII characters; I have never seen any such alerts.

I would like to remove these characters, but first I would like to locate them. I don't know how to do either. A Web search turned up a few tips that might get me going on replacing, but I would really like to locate them first, to see what they are and to ensure they are replaced properly. (They may only be carriage returns and co, but I'd like to know.)

I'm a Linux user who uses Linux because I find it more efficient; I'm hardly a guru. So if anyone can help with these, I would be very grateful (I assume that with all the Linux tools about, these aren't too hard):

1. How to locate non-ASCII characters in a series of text files.

2. How to remove these non-ASCII characters efficiently.

Thanks all.

P.
 
10-11-2009, 07:46 PM   #2
lutusp
We first must define our terms. If we define ASCII as 7-bit characters, then removing non-ASCII characters is child's play. But any "ASCII" text that contains foreign accented characters, or anything other than upper- and lower-case English characters and a small set of punctuation marks, is called "extended ASCII." Extended ASCII uses all the bits in an 8-bit byte, and this encoding cannot be unambiguously distinguished from other 8-bit encodings (i.e. you have to know what encoding it is; you cannot get a program to tell you reliably).
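
One partial check is possible even so. UTF-8 has a strict byte structure, so a program can at least tell you whether a file is valid UTF-8 or not (pure 7-bit ASCII also counts as valid UTF-8). The following is only a sketch: the sample files and the helper name is_utf8 are made up for the illustration, and a "not valid" answer still does not say which 8-bit encoding was used.

```shell
#!/bin/sh
# Sketch: check whether files decode cleanly as UTF-8.
# The two sample files are hypothetical, created only for this demo.
workdir=$(mktemp -d)
printf 'na\303\257ve\n' > "$workdir/utf8.tex"     # i-diaeresis as UTF-8 (two bytes)
printf 'na\357ve\n'     > "$workdir/latin1.tex"   # i-diaeresis as ISO-8859-1 (one byte)

is_utf8() {
    # iconv exits non-zero when the input is not valid in the -f encoding
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

for f in "$workdir"/*.tex; do
    if is_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8 (or plain ASCII)"
    else
        echo "$(basename "$f"): not valid UTF-8 (some 8-bit encoding, or damaged)"
    fi
done
```

If every file passes, UTF-8 is a reasonable guess for the input encoding; if one fails, you are back to knowing (or asking your co-author) what his editor saves.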

If the file has any legitimate extended characters, then give up on automatic filtering now. If the file is expected to have only 7-bit characters, then you can filter the others relatively easily. Like this:

Code:
$ iconv -f INPUT-ENCODING -t ASCII < input-file > output-file
The problem will be in deciding what the input encoding is. To list the available encodings, do this:

Code:
$ iconv --list
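Used like this, iconv won't just throw away the non-ASCII characters: it stops with an error at the first character it cannot represent in ASCII. Adding the -c option makes it drop such characters instead, and with GNU iconv a target of ASCII//TRANSLIT substitutes an approximation where it can. Either way you need to examine and spell-check the result, and you would have to do so even if the filter dropped all the invalid characters, because there are variant spellings of foreign words that are used when extended ASCII is not available: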

naïve -> naive

coöperation -> cooperation

And so forth. This means whatever method you adopt, the outcome will not be automatic.
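
Since the result has to be reviewed by hand anyway, it helps to list the offending lines first. Here is a minimal sketch using grep; the sample file and the function name find_non_ascii are invented for the example, and it assumes a grep that treats every byte as a character in the C locale (current GNU grep does).

```shell
#!/bin/sh
# Sketch: list every line that contains a byte outside printable ASCII.
# sample.tex is a hypothetical file; substitute your own *.tex files.
workdir=$(mktemp -d)
printf 'plain line\nna\303\257ve caf\303\251\nlast line\n' > "$workdir/sample.tex"

find_non_ascii() {
    # In the C locale each byte is one character. [:print:] covers
    # 0x20-0x7E and [:space:] covers tab, newline, CR and similar, so the
    # negated class matches bytes >= 0x80 and stray control characters.
    # -n prints line numbers; given several files, grep also prints the
    # file name in front of each match.
    LC_ALL=C grep -n '[^[:print:][:space:]]' "$@"
}

# Show the matching lines with their line numbers
find_non_ascii "$workdir/sample.tex"

# Dump the raw bytes of the first offending line to see what they are
find_non_ascii "$workdir/sample.tex" | head -n 1 | od -c
```

On your own files that would be find_non_ascii *.tex. Carriage returns are deliberately not flagged here, because [:space:] includes them; they are easier to test for separately.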

Quote:
Originally Posted by PeteD View Post
(They may only be carriage returns and co, but I'd like to know.)
I just noticed this. To test this idea, just make a copy of the file and do this to the copy:

Code:
$ dos2unix filename
Then compare the copy to the original. If they are identical, then carriage returns aren't the issue.
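
The copy-and-compare test can also be done without changing anything on disk. This sketch swaps dos2unix for tr -d '\r' (so the dos2unix package is not needed) and compares with cmp; the two sample files and the function name has_carriage_returns are hypothetical.

```shell
#!/bin/sh
# Sketch: report which files contain carriage returns, without modifying them.
# The sample files are made up for the demo.
workdir=$(mktemp -d)
printf 'first line\r\nsecond line\r\n' > "$workdir/dos.tex"
printf 'first line\nsecond line\n'     > "$workdir/unix.tex"

has_carriage_returns() {
    # tr -d '\r' writes the file minus its CRs to stdout; cmp -s compares
    # that stream ("-") with the original and is silent. If the two
    # differ, the file contained at least one carriage return.
    ! tr -d '\r' < "$1" | cmp -s - "$1"
}

for f in "$workdir"/*.tex; do
    if has_carriage_returns "$f"; then
        echo "$(basename "$f"): contains carriage returns"
    else
        echo "$(basename "$f"): no carriage returns"
    fi
done
```

Unlike dos2unix, tr -d '\r' removes every carriage return, not only those at line ends, so for this yes/no test it is, if anything, slightly stricter.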
 
  

