Quote:
Originally Posted by PeteD
Hi all
I have about a dozen text files that I am using to prepare a book in linux; the co-author is working in Windows (and we are using svn to collaborate). (If necessary: these are LaTeX files, which are then used to generate pdf.) *He* reports numerous latex warnings about non-ascii characters; I have never seen any such alerts.
I would like to remove these characters, but first, I would like to locate them. I don't know how to do either, but a Web search reveals a few tips that might get me going on replacing. But I would really like to locate them first: so see what they are, and to ensure they are replaced properly. (They may only be carriage returns and co, but I'd like to know.)
I'm a linux user who uses linux as I find it more efficient; I'm hardly a guru. So if anyone can help with these, I would be very grateful (I assume with all the linux tools about, these aren't too hard):
1. How to locate non-ascii characters in a series of text files.
2. How to remove these non-ascii characters efficiently.
Thanks all.
P.
We first must define our terms. If we define ASCII as 7-bit characters, then removing non-ASCII characters is child's play. But any "ASCII" text that contains accented characters, or anything beyond upper- and lower-case English letters, digits, and a small set of punctuation marks, is loosely called "extended ASCII." Extended ASCII uses all the bits in an 8-bit byte, and such an encoding cannot be unambiguously distinguished from other 8-bit encodings: you have to know what the encoding is; you cannot get a program to tell you reliably.
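To answer question 1 directly (locating the bytes before deciding what to do with them), GNU grep can show exactly which lines contain anything outside the 7-bit range. A sketch, assuming GNU grep built with PCRE support (`-P`); `sample.tex` is a stand-in filename:

```shell
# Demo file containing one accented (non-ASCII) character.
printf 'plain line\nna\303\257ve line\n' > sample.tex

# Print every line containing a byte outside the 7-bit ASCII range,
# with its line number; -P enables the Perl-style \xNN escapes.
grep -n -P '[^\x00-\x7F]' sample.tex

# Against the real book sources this would be:
#   grep -n -P '[^\x00-\x7F]' *.tex
```

This finds the characters without touching them, which is exactly the "look before you replace" step the poster asked for.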
If the file has any legitimate extended characters that must be preserved, then give up on automatic stripping now. If the file is expected to contain only 7-bit characters, then you can filter out the others relatively easily. Like this:
Code:
$ iconv -f INPUT-ENCODING -t ASCII//TRANSLIT < input-file > output-file
(Plain -t ASCII makes iconv abort at the first character it cannot convert; the //TRANSLIT suffix tells it to substitute a close ASCII approximation instead, and -c would silently drop such characters.)
The problem will be in deciding what the input encoding is. To list the encodings iconv knows about, do this:
Code:
$ iconv -l
This method won't just throw away the non-ASCII characters, so you will need to examine and spell-check the result. For a reason that should be obvious, you would have to do that even if the filter dropped every invalid character: many foreign words have variant spellings that are used when extended ASCII is not available:
naïve -> naive
coöperation -> cooperation
And so forth. This means that, whatever method you adopt, the outcome will not be fully automatic.
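The difference between dropping and transliterating can be seen directly. A sketch, assuming UTF-8 input and glibc's iconv (//TRANSLIT output varies between iconv implementations, so only the -c result is shown):

```shell
# "naïve" written as explicit UTF-8 bytes (\303\257 is ï), so this
# works regardless of the encoding this script itself is saved in.
word='na\303\257ve'

# -c silently drops whatever cannot be represented in ASCII:
printf "$word\n" | iconv -f UTF-8 -t ASCII -c          # prints "nave"

# //TRANSLIT substitutes a close ASCII approximation instead:
printf "$word\n" | iconv -f UTF-8 -t ASCII//TRANSLIT
```

Either way the output is pure 7-bit, which is why the spell-check pass afterwards is unavoidable: "nave" is a valid English word, just not the one the author meant.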
Quote:
Originally Posted by PeteD
(They may only be carriage returns and co, but I'd like to know.)
I just noticed this. To test this idea, just make a copy of the file and do this to the copy:
Code:
$ dos2unix filename
Then compare the copy to the original. If they are identical, then carriage returns aren't the issue.
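If dos2unix isn't installed, the same test can be scripted with tr and cmp (a sketch; `chapter1.tex` is a stand-in filename, created here with CR/LF endings for demonstration):

```shell
# Stand-in for one of the book's files, with DOS-style CR/LF endings.
printf 'line one\r\nline two\r\n' > chapter1.tex

# Strip carriage returns into a second file (the same effect as
# dos2unix, which converts in place instead).
tr -d '\r' < chapter1.tex > chapter1.stripped

# cmp -s is silent and just sets the exit status.
if cmp -s chapter1.tex chapter1.stripped; then
    echo "identical -- carriage returns are not the culprit"
else
    echo "files differ -- the original had CR/LF line endings"
fi
```

If the files differ, the diff consists only of line endings and the non-ASCII warnings have another cause worth hunting with the grep approach above.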