Quote:
Originally Posted by PeteD
Hi all
I have about a dozen text files that I am using to prepare a book in linux; the co-author is working in Windows (and we are using svn to collaborate). (If necessary: these are LaTeX files, which are then used to generate pdf.) *He* reports numerous latex warnings about non-ascii characters; I have never seen any such alerts.
I would like to remove these characters, but first, I would like to locate them. I don't know how to do either, but a Web search reveals a few tips that might get me going on replacing. But I would really like to locate them first: so see what they are, and to ensure they are replaced properly. (They may only be carriage returns and co, but I'd like to know.)
I'm a linux user who uses linux as I find it more efficient; I'm hardly a guru. So if anyone can help with these, I would be very grateful (I assume with all the linux tools about, these aren't too hard):
1. How to locate non-ascii characters in a series of text files.
2. How to remove these non-ascii characters efficiently.
Thanks all.
P.
We first must define our terms. If we define ASCII as 7-bit characters, then removing non-ASCII characters is child's play. But any "ASCII" text that contains accented characters, or anything beyond upper- and lower-case English letters, digits, and a small set of punctuation marks, is loosely called "extended ASCII." Extended ASCII uses all the bits in an 8-bit byte, and such an encoding cannot be unambiguously distinguished from other 8-bit encodings: you have to know what the encoding is; you cannot get a program to tell you reliably.
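To answer question 1 directly (locating the bytes before deciding what to do with them), GNU grep can show exactly which lines contain anything outside the 7-bit range. A sketch, assuming GNU grep built with PCRE support (`-P`); `sample.tex` is a stand-in filename:

```shell
# Demo file containing one accented (non-ASCII) character.
printf 'plain line\nna\303\257ve line\n' > sample.tex

# Print every line containing a byte outside the 7-bit ASCII range,
# with its line number; -P enables the Perl-style \xNN escapes.
grep -n -P '[^\x00-\x7F]' sample.tex

# Against the real book sources this would be:
#   grep -n -P '[^\x00-\x7F]' *.tex
```

This finds the characters without touching them, which is exactly the "look before you replace" step the poster asked for.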
If the file has any legitimate extended characters that must be preserved, then give up on automatic stripping now. If the file is expected to contain only 7-bit characters, then you can filter out the others relatively easily. Like this:
Code:
$ iconv -f INPUT-ENCODING -t ASCII//TRANSLIT < input-file > output-file
(Plain -t ASCII makes iconv abort at the first character it cannot convert; the //TRANSLIT suffix tells it to substitute a close ASCII approximation instead, and -c would silently drop such characters.)
The problem will be in deciding what the input encoding is. To list the encodings iconv knows about, do this:
Code:
$ iconv -l
This method won't just throw away the non-ASCII characters, so you will need to examine and spell-check the result. For a reason that should be obvious, you would have to do that even if the filter dropped every invalid character: many foreign words have variant spellings that are used when extended ASCII is not available:
naïve -> naive
coöperation -> cooperation
And so forth. This means that, whatever method you adopt, the outcome will not be fully automatic.
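The difference between dropping and transliterating can be seen directly. A sketch, assuming UTF-8 input and glibc's iconv (//TRANSLIT output varies between iconv implementations, so only the -c result is shown):

```shell
# "naïve" written as explicit UTF-8 bytes (\303\257 is ï), so this
# works regardless of the encoding this script itself is saved in.
word='na\303\257ve'

# -c silently drops whatever cannot be represented in ASCII:
printf "$word\n" | iconv -f UTF-8 -t ASCII -c          # prints "nave"

# //TRANSLIT substitutes a close ASCII approximation instead:
printf "$word\n" | iconv -f UTF-8 -t ASCII//TRANSLIT
```

Either way the output is pure 7-bit, which is why the spell-check pass afterwards is unavoidable: "nave" is a valid English word, just not the one the author meant.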
Quote:
Originally Posted by PeteD
(They may only be carriage returns and co, but I'd like to know.)
I just noticed this. To test this idea, just make a copy of the file and do this to the copy:
Code:
$ dos2unix filename
Then compare the copy to the original. If they are identical, then carriage returns aren't the issue.
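If dos2unix isn't installed, the same test can be scripted with tr and cmp (a sketch; `chapter1.tex` is a stand-in filename, created here with CR/LF endings for demonstration):

```shell
# Stand-in for one of the book's files, with DOS-style CR/LF endings.
printf 'line one\r\nline two\r\n' > chapter1.tex

# Strip carriage returns into a second file (the same effect as
# dos2unix, which converts in place instead).
tr -d '\r' < chapter1.tex > chapter1.stripped

# cmp -s is silent and just sets the exit status.
if cmp -s chapter1.tex chapter1.stripped; then
    echo "identical -- carriage returns are not the culprit"
else
    echo "files differ -- the original had CR/LF line endings"
fi
```

If the files differ, the diff consists only of line endings and the non-ASCII warnings have another cause worth hunting with the grep approach above.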